Questions about fine-tuning strategy and hyperparameters for Korean ASR/AST tasks

Hello!

First of all, thank you very much for sharing your excellent work on Phi-4-multimodal-instruct and providing helpful fine-tuning examples.

I'm currently working on fine-tuning your model using Korean speech data.
(with Korean ASR datasets, plus an additional custom-translated CoVoST2 dataset for AST)
[Model] [New Custom Dataset]

While preparing this experiment,
I have a few questions regarding the provided fine-tuning script and recommended training strategies:

1. Initial Fine-tuning Hyperparameters:

In your example script, you set TRAIN_SIZE = 50000, learning_rate = 4.0e-5, and batch_size = 128.
Given that CoVoST2 is significantly larger than 50K samples,
could you explain the reasoning behind training on only a 50K subset for a single epoch? Was this decision based on empirical observations, a convergence criterion, or other specific considerations?
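
To make sure I'm reading the script correctly, here is a minimal sketch of how I understand those settings, written with a standard Hugging Face `TrainingArguments` setup (the per-device batch size, accumulation steps, warmup, and output path are my own assumptions, not values taken from your script):

```python
# Minimal sketch of how I read the example's hyperparameters.
from transformers import TrainingArguments

TRAIN_SIZE = 50_000       # size of the CoVoST2 subset used for training
LEARNING_RATE = 4.0e-5
GLOBAL_BATCH_SIZE = 128   # = per_device_train_batch_size * gradient_accumulation_steps (* num GPUs)

training_args = TrainingArguments(
    output_dir="./phi4mm-korean-ft",     # placeholder path
    num_train_epochs=1,                  # the single epoch I am asking about
    per_device_train_batch_size=8,       # assumed split that reaches the
    gradient_accumulation_steps=16,      # global batch size of 128 on one GPU
    learning_rate=LEARNING_RATE,
    warmup_ratio=0.03,                   # illustrative, not from the script
    bf16=True,
    logging_steps=10,
    save_strategy="no",
)
```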

2. Clarification on Fine-tuning Scope (LoRA vs Entire LLM):

Could you clarify whether the provided example (sample_finetune_speech.py) trains only the Speech LoRA adapters or fine-tunes the entire language model?
My understanding is that the provided example fine-tunes all LLM parameters, but in a previous discussion (#1) it was suggested to freeze the original model and set requires_grad = True only on model.embed_tokens_extend.audio_embed. Could you confirm which approach is recommended by default?
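
For clarity, this is how I would implement the second option (freeze everything, then unfreeze only the audio embedding) based on that discussion; the attribute path is copied from there and may need adjusting for the actual module hierarchy:

```python
# Freeze-everything-except-the-audio-embed variant, as I understand the
# suggestion in discussion #1.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Freeze every parameter first ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze only the audio embedding / speech adapter.
# Attribute path as quoted in discussion #1; it may need an extra level
# (e.g. model.model...) depending on the wrapper class.
for param in model.embed_tokens_extend.audio_embed.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```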

3. Sequential Training Strategy for Korean ASR & AST:

Furthermore, I'm planning a sequential training approach: Korean ASR datasets first, followed by the custom translated CoVoST2 data for AST.

According to your technical report, you follow a two-stage paradigm: ASR pre-training followed by post-training for AST.
My current plan is therefore to first train the model on Korean ASR data for 5 epochs, then continue with AST data for 2-3 epochs (a rough code sketch of this plan follows the questions below).

In this scenario:

  • Would you recommend fine-tuning the entire model, or would freezing the LLM and training only the audio adapter components yield better results for Korean language adaptation?
  • Could you suggest optimal hyperparameter adjustments (e.g., epochs, learning rate, batch size) suitable for this sequential ASR-to-AST training scenario with Korean data?
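
To make this concrete, here is the rough sketch of the sequential plan I have in mind; train_one_stage is a hypothetical wrapper around your example training loop, and all dataset names and hyperparameters are placeholders I'd like a sanity check on:

```python
# Rough sketch of the two-stage ASR -> AST plan, for discussion only.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)

stages = [
    # (task,  dataset name (placeholder),  epochs, learning rate)
    ("asr", "my_korean_asr_dataset",       5,      4.0e-5),
    ("ast", "my_korean_covost2_subset",    3,      2.0e-5),  # lower LR assumed for stage 2
]

for task, dataset_name, epochs, lr in stages:
    model = train_one_stage(        # hypothetical helper, not a real API
        model,
        dataset_name=dataset_name,
        num_train_epochs=epochs,
        learning_rate=lr,
        global_batch_size=128,
    )
    model.save_pretrained(f"./checkpoints/phi4mm-korean-{task}")
```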

Thank you once again for your valuable contributions. Your insights would greatly help me successfully apply Phi-4-multimodal-instruct to my research.

4. Model Saving and Sharing:

This last one is probably a simple question, but I'd appreciate your help.
After fine-tuning with the provided example, how should I save and upload the model to Hugging Face while maintaining the original repository structure, with separate base-model and speech-lora folders? Is there a recommended approach or a specific function to handle this separation when saving the fine-tuned model?
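
For context, this is roughly what I have tried so far; it dumps everything into a single folder rather than reproducing the separate speech-lora layout (the repo id and paths are placeholders, and `model` / `processor` are the objects from my fine-tuning run):

```python
# Current attempt: save everything into one folder and upload it with
# huggingface_hub. This loses the base-model / speech-lora split of the
# original repo, which is exactly what I am asking how to reproduce.
from huggingface_hub import HfApi

OUTPUT_DIR = "./phi4mm-korean-finetuned"
REPO_ID = "junnei/phi-4-multimodal-korean"   # placeholder repo id

model.save_pretrained(OUTPUT_DIR)        # writes all weights into a single folder
processor.save_pretrained(OUTPUT_DIR)

api = HfApi()
api.create_repo(REPO_ID, exist_ok=True)
api.upload_folder(folder_path=OUTPUT_DIR, repo_id=REPO_ID)
```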
