---
library_name: transformers
datasets:
- Bingsu/zeroth-korean
- google/fleurs
language:
- ko
metrics:
- cer
- wer
- bleu
base_model:
- microsoft/Phi-4-multimodal-instruct
model-index:
- name: Phi-4-multimodal-instruct-ko-asr
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      type: Bingsu/zeroth_korean
      name: zeroth-korean-test
    metrics:
    - type: bleu
      name: zeroth-test-BLEU
      value: 94.837
    - type: cer
      name: zeroth-test-CER
      value: 1.429
    - type: wer
      name: zeroth-test-WER
      value: 2.951
  - task:
      type: automatic-speech-recognition
    dataset:
      type: google/fleurs
      name: fleurs-ko-test
    metrics:
    - type: bleu
      name: fleurs-test-BLEU
      value: 67.659
    - type: cer
      name: fleurs-test-CER
      value: 7.951
    - type: wer
      name: fleurs-test-WER
      value: 18.313
pipeline_tag: automatic-speech-recognition
---

This model is fine-tuned from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on [Bingsu/zeroth-korean](https://huggingface.co/datasets/Bingsu/zeroth-korean) and [google/fleurs](https://huggingface.co/datasets/google/fleurs) for 5 epochs. It was trained for 960 steps on these datasets for Korean automatic speech recognition (ASR) on an H100 GPU. As a next step, we will check whether it scales through additional training on synthetic Korean data from the CoVoST2 dataset.

## Evaluation

Evaluation used the following normalizer and metrics:

```python
from whisper_normalizer.basic import BasicTextNormalizer
from evaluate import load

normalizer = BasicTextNormalizer()
cer_metric = load("cer")
wer_metric = load("wer")
```

| Model | zeroth-test-BLEU | zeroth-test-CER | zeroth-test-WER | fleurs-test-BLEU | fleurs-test-CER | fleurs-test-WER |
|------------------------|------------------|-----------------|-----------------|------------------|-----------------|-----------------|
| original | 0.071 | 126.4 | 121.5 | 0.010 | 115.7 | 112.8 |
| finetune (this model) | 94.837 | 1.429 | 2.951 | 67.659 | 7.951 | 18.313 |

Evaluation was done on the following tasks:

- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-korean test set (457 samples).
- AST (Automatic Speech Translation): evaluated with BLEU score on fleurs ko <-> en speech translation (270 samples).

The evaluation script was retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py); a scoring sketch using the normalizer and metrics above is shown after the tables.

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) and [Phi-4-multimodal-finetune-ko-speech](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech), ASR is significantly improved.

| Model | zeroth-test (CER) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|-------|-------------------|---------------------|-------------------------|---------------------|-------------------------|
| original | 198.32 | 5.63 | 2.42 | 6.86 | 4.17 |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech | 3.80 | 7.03 | 7.04 | 12.50 | 9.54 |
| seastar105/Phi-4-mm-inst-zeroth-kor | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |
| ASR finetune (this model) | 1.31 | 7.46 | 6.24 | 12.15 | 8.91 |
| + AST finetune with [junnei/covost2](https://huggingface.co/datasets/junnei/covost2) | 3.88 | 8.07 | 10.09 | 18.82 | 15.41 |
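For illustration, below is a minimal sketch of how a single prediction/reference pair could be scored with the normalizer and metric objects from the evaluation setup above. The Korean strings are made-up examples, not taken from the test sets; the real evaluation loops over the full zeroth-korean and fleurs test splits.

```python
from whisper_normalizer.basic import BasicTextNormalizer
from evaluate import load

normalizer = BasicTextNormalizer()
cer_metric = load("cer")
wer_metric = load("wer")

# Hypothetical model output and reference transcript.
prediction = normalizer("안녕하세요 만나서 반갑습니다")
reference = normalizer("안녕하세요, 만나서 반갑습니다.")

# CER and WER are computed on the normalized strings.
cer = cer_metric.compute(predictions=[prediction], references=[reference])
wer = wer_metric.compute(predictions=[prediction], references=[reference])
print(f"CER: {cer:.4f}, WER: {wer:.4f}")
```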
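## Usage

A minimal inference sketch for Korean ASR, following the audio chat format documented for the base microsoft/Phi-4-multimodal-instruct model. The model id, audio file path, prompt wording, and generation settings are illustrative assumptions, not values taken from this card.

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repository id; replace with the actual id of this fine-tuned checkpoint.
model_id = "Phi-4-multimodal-instruct-ko-asr"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
).cuda()

# Chat format used by the base Phi-4-multimodal-instruct model for speech input.
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

# "sample.wav" is a placeholder for a Korean speech recording.
audio, sample_rate = sf.read("sample.wav")
inputs = processor(text=prompt, audios=[(audio, sample_rate)], return_tensors="pt").to("cuda")

# Generate and decode only the newly produced tokens.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(transcript)
```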