---
datasets:
- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0
- PolyAI/minds14
metrics:
- bleu
- cer
base_model:
- microsoft/Phi-4-multimodal-instruct
language:
- ko
license: mit
tags:
- korean
- stt
- custom_code
- phi
- phi-4-multimodal
model-index:
- name: Phi-4-mm-inst-zeroth-kor
  results:
  - task:
      type: speech-to-text-translation
    dataset:
      name: fleurs (ko-en test intersection)
      type: seastar105/fleurs_ko_en_test
    metrics:
    - type: bleu
      value: 7.03
      name: ko2en
    - type: bleu
      value: 7.04
      name: ko2en-cot
    - type: bleu
      value: 12.5
      name: en2ko (ko-mecab)
    - type: bleu
      value: 9.54
      name: en2ko-cot (ko-mecab)
  - task:
      type: automatic-speech-recognition
    dataset:
      name: zeroth_korean test
      type: kresnik/zeroth_korean
    metrics:
    - type: cer
      value: 7.02
      name: test CER
---

# Phi-4-multimodal-finetune-ko-speech

This is a fine-tuned model for Korean speech-to-text translation, based on [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) and trained on the following datasets:

- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0 (Korean speech only)
- PolyAI/minds14 (Korean speech only)
- A custom dataset of my own. The speech is a mix of fast and slow speech (technical blog content and presentations I have posted), with some modulation applied using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py).

The training data totals 35K samples; each sample is a pair of Korean speech and its transcription, sampled at 16 kHz. The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16 using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.

Phi-4-multimodal is strong at multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.

## Evaluation

Evaluation was done on the following datasets:

- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-test set (457 samples).
- AST (Automatic Speech Translation): evaluated with the BLEU score on fleurs ko <-> en speech translation results (270 samples). The evaluation script is taken from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR improves significantly thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed into training to mitigate catastrophic forgetting.

The table below reports CER on zeroth-test (lower is better) and BLEU on the fleurs splits (higher is better).

| Model | zeroth-test (CER) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|--------------------------|--------|------|------|-------|-------|
| original                 | 198.32 | 5.63 | 2.42 | 6.86  | 4.17  |
| finetune (4 epochs)      | 2.72   | 7.11 | 9.95 | 13.22 | 10.45 |
| finetune (1 epoch)       | 3.80   | 7.03 | 7.04 | 12.50 | 9.54  |
| Phi-4-mm-inst-zeroth-kor | 7.02   | 7.07 | 9.19 | 13.08 | 9.35  |
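
For reference, the zeroth-test CER above can in principle be reproduced with a loop like the one below. This is only a rough sketch, not the actual evaluation script (linked above): it reuses the `model`, `processor`, `generation_config`, `asr_prompt`, and `max_new_tokens` objects from the sample code in the Usage section, treats `jiwer` as an assumed extra dependency for the CER computation, and assumes the transcription column of zeroth_korean is named `text`.

```python
# Rough sketch of the zeroth-test CER evaluation (illustration only).
# Assumes `model`, `processor`, `generation_config`, `asr_prompt`, and
# `max_new_tokens` are defined as in the sample code below, and that
# `jiwer` is installed (an assumed extra dependency).
import jiwer
from datasets import load_dataset

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

references, hypotheses = [], []
for item in asr_ds:
    audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
    inputs = processor(text=asr_prompt, audios=[audio], return_tensors="pt").to(model.device)
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
    )
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    hypotheses.append(
        processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
    )
    references.append(item["text"])  # assumes the transcription column is "text"

# Character Error Rate over the full test split, as a percentage
print(f"CER: {jiwer.cer(references, hypotheses) * 100:.2f}")
```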
## Usage

### Requirements

The model works with the following packages. Please make sure to install them before using the model.

```
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
librosa==0.10.2.post1
pandas==2.2.3
```

### Sample code

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"

generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# task prompts are from the technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
# "몬토 킬은 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"
```

### Demos

Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). They are not production-quality, since the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.

## References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor