fix(lora): resume from checkpoint fails due to strict state_dict loading #1296

Open
duchengyao wants to merge 1 commit into fishaudio:main from duchengyao:fix-lora-checkpoint-resume
Is this PR adding a new feature or fixing a bug?

Bug fix.

Is this pull request related to any issue? If yes, please link the issue.

#1295

Problem

TextToSemantic.on_save_checkpoint intentionally saves only LoRA parameters to reduce checkpoint size (~100MB vs ~9GB). However, this causes Lightning's restore_model() to fail during resume because load_state_dict is called with strict=True, and the frozen base model weights are missing from the saved state_dict.
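The failure can be reproduced outside the project with plain PyTorch. The following is a minimal sketch, not the project's code: `TinyModel`, its attribute names, and the `"lora" in key` filter are hypothetical stand-ins for `TextToSemantic` and its checkpoint filtering.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in: one frozen base weight plus one LoRA weight."""

    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(8, 4)           # plays the frozen base model
        self.lora_A = nn.Parameter(torch.zeros(2, 4))  # plays the LoRA adapter


# Mimic a LoRA-only checkpoint by dropping every non-LoRA key
# (the filter rule here is an assumption for illustration).
full_state = TinyModel().state_dict()
lora_only_state = {k: v for k, v in full_state.items() if "lora" in k}

# Strict loading rejects the checkpoint because the base weights are absent.
try:
    TinyModel().load_state_dict(lora_only_state, strict=True)
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

The printed message names `embeddings.weight` as a missing key, which mirrors the error Lightning surfaces during resume.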

Fix

Override load_state_dict in TextToSemantic to always use strict=False.

  • Only LoRA weights are loaded from the checkpoint (base weights stay as loaded by from_pretrained)
  • Optimizer states and LR schedulers are correctly restored
  • Full fine-tuning is unaffected (non-LoRA checkpoints have all keys)
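The override described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the actual diff: `TinyLoRAModule` is a hypothetical plain `nn.Module` standing in for `TextToSemantic`, and the `"lora" in key` filter is assumed for the demo.

```python
import torch
from torch import nn


class TinyLoRAModule(nn.Module):
    """Hypothetical stand-in for TextToSemantic: frozen base layer plus LoRA factors."""

    def __init__(self):
        super().__init__()
        self.base = nn.Linear(4, 4)
        self.lora_A = nn.Parameter(torch.zeros(2, 4))
        self.lora_B = nn.Parameter(torch.zeros(4, 2))
        for param in self.base.parameters():
            param.requires_grad = False  # base weights are frozen

    def load_state_dict(self, state_dict, strict=True, **kwargs):
        # Ignore the caller's `strict` and always load non-strictly, so a
        # checkpoint holding only LoRA keys is accepted.
        return super().load_state_dict(state_dict, strict=False, **kwargs)


# Build a LoRA-only checkpoint from a model whose adapter has been changed.
trained = TinyLoRAModule()
with torch.no_grad():
    trained.lora_A.fill_(1.0)
checkpoint = {k: v for k, v in trained.state_dict().items() if "lora" in k}

# Even when the caller requests strict=True, loading succeeds.
resumed = TinyLoRAModule()
result = resumed.load_state_dict(checkpoint, strict=True)
print(sorted(result.missing_keys), result.unexpected_keys)
```

Because `strict=False` also silences keys that are missing by mistake, inspecting the returned `missing_keys` is one way to confirm that only base weights were skipped.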

Before

RuntimeError: Error(s) in loading state_dict for TextToSemantic:
	Missing key(s): model.embeddings.weight, model.codebook_embeddings.weight, ...

After

LoRA training resumes from checkpoint with no errors.

Files changed

  • fish_speech/models/text2semantic/lit_module.py — add load_state_dict override

Testing

Tested on a single RTX 4090 (48GB) with LoRA r=8 on s2-pro:

  1. Train for N steps → interrupt
  2. Re-run training script → resume from latest checkpoint successfully
  3. Loss curve is continuous, optimizer/scheduler states restored