---
datasets:
- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0
- PolyAI/minds14
metrics:
- bleu
- cer
base_model:
- microsoft/Phi-4-multimodal-instruct
language:
- ko
license: mit
tags:
- korean
- stt
- custom_code
- phi
- phi-4-multimodal
model-index:
- name: Phi-4-mm-inst-zeroth-kor
  results:
  - task:
      type: speech-to-text-translation
    dataset:
      name: fleurs (ko-en test intersection)
      type: seastar105/fleurs_ko_en_test
    metrics:
    - type: bleu
      value: 7.03
      name: ko2en
    - type: bleu
      value: 7.04
      name: ko2en-cot
    - type: bleu
      value: 12.5
      name: en2ko (ko-mecab)
    - type: bleu
      value: 9.54
      name: en2ko-cot (ko-mecab)
  - task:
      type: automatic-speech-recognition
    dataset:
      name: zeroth_korean test
      type: kresnik/zeroth_korean
    metrics:
    - type: cer
      value: 7.02
      name: test CER
---

# Phi-4-multimodal-finetune-ko-speech

This is a fine-tuned model for Korean speech-to-text translation, based on [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) and trained on the following datasets:

- kresnik/zeroth_korean
- mozilla-foundation/common_voice_17_0 (Korean speech only)
- PolyAI/minds14 (Korean speech only)
- A custom dataset of my own. The speech is a mix of fast and slow speech (technical blog content and presentations I have posted), with some modulation applied using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py).

The training data totals 35K samples; each sample is a pair of Korean speech and its transcription, sampled at 16 kHz. The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16 using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.

Phi-4-multimodal is strong at multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.

## Evaluation

Evaluation was done on the following datasets:

- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-test set (457 samples).
- AST (Automatic Speech Translation): evaluated with the BLEU score on fleurs ko <-> en speech translation results (270 samples). The evaluation script is taken from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR improves significantly thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed into training to mitigate catastrophic forgetting.

The table below reports CER on zeroth-test (lower is better) and BLEU on the fleurs splits (higher is better).

| Model | zeroth-test (CER) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|--------------------------|--------|------|------|-------|-------|
| original                 | 198.32 | 5.63 | 2.42 | 6.86  | 4.17  |
| finetune (4 epochs)      | 2.72   | 7.11 | 9.95 | 13.22 | 10.45 |
| finetune (1 epoch)       | 3.80   | 7.03 | 7.04 | 12.50 | 9.54  |
| Phi-4-mm-inst-zeroth-kor | 7.02   | 7.07 | 9.19 | 13.08 | 9.35  |
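
For reference, the zeroth-test CER above can in principle be reproduced with a loop like the one below. This is only a rough sketch, not the actual evaluation script (linked above): it reuses the `model`, `processor`, `generation_config`, `asr_prompt`, and `max_new_tokens` objects from the sample code in the Usage section, treats `jiwer` as an assumed extra dependency for the CER computation, and assumes the transcription column of zeroth_korean is named `text`.

```python
# Rough sketch of the zeroth-test CER evaluation (illustration only).
# Assumes `model`, `processor`, `generation_config`, `asr_prompt`, and
# `max_new_tokens` are defined as in the sample code below, and that
# `jiwer` is installed (an assumed extra dependency).
import jiwer
from datasets import load_dataset

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

references, hypotheses = [], []
for item in asr_ds:
    audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
    inputs = processor(text=asr_prompt, audios=[audio], return_tensors="pt").to(model.device)
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
    )
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    hypotheses.append(
        processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
    )
    references.append(item["text"])  # assumes the transcription column is "text"

# Character Error Rate over the full test split, as a percentage
print(f"CER: {jiwer.cer(references, hypotheses) * 100:.2f}")
```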
## Usage

### Requirements

The model works with the following packages. Please make sure to install them before using the model.

```
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
librosa==0.10.2.post1
pandas==2.2.3
```

### Sample code

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"

generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# task prompts are from the technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
# "몬토 킬은 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"
```

### Demos

Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). They are not production-quality, since the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.

## References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct
- https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor