---
library_name: transformers
datasets:
- Bingsu/zeroth-korean
- google/fleurs
language:
- ko
metrics:
- cer
- wer
- bleu
base_model:
- microsoft/Phi-4-multimodal-instruct
model-index:
- name: Phi-4-multimodal-instruct-ko-asr
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      type: Bingsu/zeroth-korean
      name: zeroth-korean-test
    metrics:
    - type: bleu
      name: zeroth-test-BLEU
      value: 94.837
    - type: cer
      name: zeroth-test-CER
      value: 1.429
    - type: wer
      name: zeroth-test-WER
      value: 2.951
  - task:
      type: automatic-speech-recognition
    dataset:
      type: google/fleurs
      name: fleurs-ko-test
    metrics:
    - type: bleu
      name: fleurs-test-BLEU
      value: 67.659
    - type: cer
      name: fleurs-test-CER
      value: 7.951
    - type: wer
      name: fleurs-test-WER
      value: 18.313
pipeline_tag: automatic-speech-recognition
---
This model is fine-tuned from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on [Bingsu/zeroth-korean](https://huggingface.co/datasets/Bingsu/zeroth-korean) and [google/fleurs](https://huggingface.co/datasets/google/fleurs) for 5 epochs.
The model was trained for 960 steps on Korean automatic speech recognition (ASR) data on an H100 GPU.
As a next step, we will check whether it scales through additional training on synthetic Korean data derived from the CoVoST2 dataset.
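A minimal inference sketch for Korean transcription (assumptions: the repo id placeholder, audio file name, and generation settings are illustrative; the chat format follows the upstream Phi-4-multimodal-instruct card):
```
import torch
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Phi-4-multimodal-instruct-ko-asr"  # replace with this model's Hub repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Upstream Phi-4-multimodal chat format: audio placeholder plus instruction.
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

audio, sr = sf.read("sample_ko.wav")  # 16 kHz mono Korean speech
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=256)
out = out[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```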
## Evaluation
Evaluation used the following normalizer and metrics:
```
from whisper_normalizer.basic import BasicTextNormalizer
from evaluate import load

# Apply the same normalization to references and predictions before scoring.
normalizer = BasicTextNormalizer()
cer_metric = load("cer")
wer_metric = load("wer")
```
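A sketch of how the scores are computed (assuming `preds` and `refs` are parallel lists of predicted and reference texts; BLEU via `sacrebleu` is an assumption, as the card does not name the exact scorer):
```
# Normalize both sides, then score.
preds = [normalizer(p) for p in preds]
refs = [normalizer(r) for r in refs]

cer = 100 * cer_metric.compute(predictions=preds, references=refs)
wer = 100 * wer_metric.compute(predictions=preds, references=refs)

# BLEU (assumption: sacrebleu, which expects a list of reference lists)
bleu_metric = load("sacrebleu")
bleu = bleu_metric.compute(predictions=preds, references=[[r] for r in refs])["score"]
```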
| Model | zeroth-test-BLEU | zeroth-test-CER | zeroth-test-WER | fleurs-test-BLEU | fleurs-test-CER | fleurs-test-WER |
|--------------------|------------------|-----------------|-----------------|------------------|-----------------|-----------------|
| original | 0.071 | 126.4 | 121.5 | 0.010 | 115.7 | 112.8 |
| finetune (this model) | 94.837 | 1.429 | 2.951 | 67.659 | 7.951 | 18.313 |
Evaluation covered the following tasks:
- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-korean test set (457 samples).
- AST (Automatic Speech Translation): evaluated with BLEU on fleurs ko <-> en speech translation results (270 samples).

The evaluation script is adapted from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
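For reference, a sketch of the per-task instruction prompts in the upstream Phi-4-multimodal style (the exact wording, the `-cot` variant, and the `<sep>` separator are assumptions to verify against the linked script):
```
# ASR: Korean transcription (assumed prompt wording)
asr_prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

# AST: direct speech translation, e.g. Korean -> English
ast_prompt = "<|user|><|audio_1|>Translate the audio to English.<|end|><|assistant|>"

# AST "-cot" rows: transcribe first, then translate, with a separator
# (assumption based on the linked evaluation script's chain-of-thought setup)
ast_cot_prompt = (
    "<|user|><|audio_1|>Transcribe the audio to text, "
    "and then translate the audio to English. "
    "Use <sep> as a separator between the original transcript and the translation."
    "<|end|><|assistant|>"
)
```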
Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) and [Phi-4-multimodal-finetune-ko-speech](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech), ASR is significantly improved (zeroth-test column: CER, lower is better; fleurs columns: BLEU, higher is better):
| Model | zeroth-test | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
|----------------------|-------------|--------------|------------------|--------------|------------------|
| original | 198.32 | 5.63 | 2.42 | 6.86 | 4.17 |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech| 3.80 | 7.03 | 7.04 | 12.50 | 9.54 |
| seastar105/Phi-4-mm-inst-zeroth-kor | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |
| ASR finetune (this model)| 1.31 | 7.46 | 6.24 | 12.15 | 8.91 |
| + AST finetune with [junnei/covost2](https://huggingface.co/datasets/junnei/covost2) | 3.88 | 8.07 | 10.09 | 18.82 | 15.41 |