---
library_name: transformers
datasets:
  - Bingsu/zeroth-korean
  - google/fleurs
language:
  - ko
metrics:
  - cer
  - wer
  - bleu
base_model:
  - microsoft/Phi-4-multimodal-instruct
model-index:
  - name: Phi-4-multimodal-instruct-ko-asr
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          type: Bingsu/zeroth-korean
          name: zeroth-korean-test
        metrics:
          - type: bleu
            name: zeroth-test-BLEU
            value: 94.837
          - type: cer
            name: zeroth-test-CER
            value: 1.429
          - type: wer
            name: zeroth-test-WER
            value: 2.951
      - task:
          type: automatic-speech-recognition
        dataset:
          type: google/fleurs
          name: fleurs-ko-test
        metrics:
          - type: bleu
            name: fleurs-test-BLEU
            value: 67.659
          - type: cer
            name: fleurs-test-CER
            value: 7.951
          - type: wer
            name: fleurs-test-WER
            value: 18.313
pipeline_tag: automatic-speech-recognition
---

This model is fine-tuned from microsoft/Phi-4-multimodal-instruct on Bingsu/zeroth-korean and google/fleurs for 5 epochs.

The model was trained for 960 steps on these Korean automatic speech recognition datasets on an H100 GPU.

Next, we will check whether the model scales through additional training on synthetic data from the CoVoST2 dataset translated into Korean.
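
For quick testing, a minimal transcription sketch following the base model's documented usage pattern is shown below. The repository id `junnei/Phi-4-multimodal-instruct-ko-asr`, the audio file `sample.wav`, and the prompt wording are assumptions; adjust them to your setup.

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Assumed repository id for this fine-tuned checkpoint
model_id = "junnei/Phi-4-multimodal-instruct-ko-asr"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
generation_config = GenerationConfig.from_pretrained(model_id)

# Phi-4-multimodal chat format: one user turn containing a single audio placeholder
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

audio, sr = sf.read("sample.wav")  # placeholder: any Korean speech clip
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs, max_new_tokens=256, generation_config=generation_config
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```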

## Evaluation

Evaluation was performed with the following setup:

```python
from whisper_normalizer.basic import BasicTextNormalizer
from evaluate import load

# Text normalizer and metrics used for scoring transcriptions
normalizer = BasicTextNormalizer()
cer_metric = load("cer")
wer_metric = load("wer")
```
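
As an illustration, the metrics are then computed over normalized hypothesis/reference pairs; the strings below are placeholders, not samples from the test sets.

```python
# Placeholder hypothesis/reference pair; the real evaluation runs over the full test sets
predictions = [normalizer("안녕하세요 반갑습니다")]
references = [normalizer("안녕하세요, 반갑습니다.")]

print("CER:", cer_metric.compute(predictions=predictions, references=references))
print("WER:", wer_metric.compute(predictions=predictions, references=references))
```
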
| Model | zeroth-test-BLEU | zeroth-test-CER | zeroth-test-WER | fleurs-test-BLEU | fleurs-test-CER | fleurs-test-WER |
|-------|------------------|-----------------|-----------------|------------------|-----------------|-----------------|
| original | 0.071 | 126.4 | 121.5 | 0.010 | 115.7 | 112.8 |
| finetune (this model) | 94.837 | 1.429 | 2.951 | 67.659 | 7.951 | 18.313 |

Evaluation was done on the following datasets:

- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-test set (457 samples).
- AST (Automatic Speech Translation): evaluated with BLEU score on fleurs ko <-> en speech translation results (270 samples); see the BLEU sketch below.
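
A minimal sketch of the AST scoring, under the assumption that BLEU is computed with the `sacrebleu` backend of `evaluate`; the sentence pair below is a placeholder.

```python
from evaluate import load

bleu_metric = load("sacrebleu")  # assumption: sacrebleu backend for BLEU

# Placeholder ko -> en hypothesis/reference pair; real scoring covers all 270 fleurs samples
hypotheses = ["the weather is nice today"]
references = [["the weather is nice today"]]

print("BLEU:", bleu_metric.compute(predictions=hypotheses, references=references)["score"])
```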

The evaluation script is retrieved from here.

Compared to Phi-4-mm-inst-zeroth-kor and Phi-4-multimodal-finetune-ko-speech, ASR performance is significantly improved.

| Model | zeroth-test | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
|-------|-------------|--------------|------------------|--------------|------------------|
| original | 198.32 | 5.63 | 2.42 | 6.86 | 4.17 |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech | 3.80 | 7.03 | 7.04 | 12.50 | 9.54 |
| seastar105/Phi-4-mm-inst-zeroth-kor | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |
| ASR finetune (this model) | 1.31 | 7.46 | 6.24 | 12.15 | 8.91 |
| + AST finetune with [AST](https://huggingface.co/datasets/junnei/covost2) | 3.88 | 8.07 | 10.09 | 18.82 | 15.41 |