---
library_name: transformers
datasets:
- Bingsu/zeroth-korean
- google/fleurs
language:
- ko
metrics:
- cer
- wer
- bleu
base_model:
- microsoft/Phi-4-multimodal-instruct
model-index:
- name: Phi-4-multimodal-instruct-ko-asr
results:
- task:
type: automatic-speech-recognition
dataset:
        type: Bingsu/zeroth-korean
name: zeroth-korean-test
metrics:
- type: bleu
name: zeroth-test-BLEU
value: 94.837
- type: cer
name: zeroth-test-CER
value: 1.429
- type: wer
name: zeroth-test-WER
value: 2.951
- task:
type: automatic-speech-recognition
dataset:
        type: google/fleurs
        name: fleurs-ko-test
metrics:
- type: bleu
name: fleurs-test-BLEU
value: 67.659
- type: cer
name: fleurs-test-CER
value: 7.951
- type: wer
name: fleurs-test-WER
value: 18.313
pipeline_tag: automatic-speech-recognition
---
This model is fine-tuned from microsoft/Phi-4-multimodal-instruct on Bingsu/zeroth-korean and google/fleurs for 5 epochs.
It was trained for 960 steps on these datasets for Korean automatic speech recognition on an H100 GPU.
Next, we will check whether it scales through additional training with synthetic Korean data derived from the CoVoST2 dataset.
## Evaluation

Evaluation used the following normalizer and metrics:

```python
from whisper_normalizer.basic import BasicTextNormalizer
from evaluate import load

normalizer = BasicTextNormalizer()
cer_metric = load("cer")
wer_metric = load("wer")
```
| Model | zeroth-test-BLEU | zeroth-test-CER | zeroth-test-WER | fleurs-test-BLEU | fleurs-test-CER | fleurs-test-WER |
|---|---|---|---|---|---|---|
| original | 0.071 | 126.4 | 121.5 | 0.010 | 115.7 | 112.8 |
| finetune (this model) | 94.837 | 1.429 | 2.951 | 67.659 | 7.951 | 18.313 |
Evaluation was done on the following datasets:
- ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) on zeroth-test set (457 samples).
- AST (Automatic Speech Translation): Evaluated with BLEU score on fleurs ko <-> en speech translation result (270 samples).
The evaluation script is retrieved from here.
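For reference, CER and WER are both edit-distance rates, computed over characters and words respectively. The scores above come from the `evaluate` library; the following is only an illustrative pure-Python sketch of how these rates are defined:

```python
def edit_distance(ref, hyp):
    # Classic Levenshtein distance via dynamic programming (one rolling row).
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, or substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(ref, hyp):
    # Character Error Rate: edits divided by reference length in characters.
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    # Word Error Rate: edits divided by reference length in words.
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

In the actual evaluation, both the reference and the hypothesis are passed through `BasicTextNormalizer` before scoring, so punctuation and casing differences do not count as errors.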
Compared to Phi-4-mm-inst-zeroth-kor and Phi-4-multimodal-finetune-ko-speech, ASR performance is significantly improved.
| Model | zeroth-test | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
|---|---|---|---|---|---|
| original | 198.32 | 5.63 | 2.42 | 6.86 | 4.17 |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech | 3.80 | 7.03 | 7.04 | 12.50 | 9.54 |
| seastar105/Phi-4-mm-inst-zeroth-kor | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |
| ASR finetune (this model) | 1.31 | 7.46 | 6.24 | 12.15 | 8.91 |
| + AST finetune with [AST](https://huggingface.co/datasets/junnei/covost2) | 3.88 | 8.07 | 10.09 | 18.82 | 15.41 |