---
library_name: transformers
datasets:
- Bingsu/zeroth-korean
- google/fleurs
language:
- ko
metrics:
- cer
- wer
- bleu
base_model:
- microsoft/Phi-4-multimodal-instruct
model-index:
- name: Phi-4-multimodal-instruct-ko-asr
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      type: Bingsu/zeroth_korean
      name: zeroth-korean-test
    metrics:
    - type: bleu
      name: zeroth-test-BLEU
      value: 94.837
    - type: cer
      name: zeroth-test-CER
      value: 1.429
    - type: wer
      name: zeroth-test-WER
      value: 2.951
  - task:
      type: automatic-speech-recognition
    dataset:
      type: google/fleurs
      name: fleurs-ko-test
    metrics:
    - type: bleu
      name: fleurs-test-BLEU
      value: 67.659
    - type: cer
      name: fleurs-test-CER
      value: 7.951
    - type: wer
      name: fleurs-test-WER
      value: 18.313
pipeline_tag: automatic-speech-recognition
---

This model is fine-tuned from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on [Bingsu/zeroth-korean](https://huggingface.co/datasets/Bingsu/zeroth-korean) and [google/fleurs](https://huggingface.co/datasets/google/fleurs) for 5 epochs. It was trained for 960 steps on these datasets for Korean automatic speech recognition (ASR) on an H100 GPU. As a next step, we will check whether it scales through additional training on synthetic Korean data from the CoVoST2 dataset.

## Evaluation

Evaluation used the following normalizer and metrics:

```python
from whisper_normalizer.basic import BasicTextNormalizer
from evaluate import load

normalizer = BasicTextNormalizer()
cer_metric = load("cer")
wer_metric = load("wer")
```

| Model | zeroth-test-BLEU | zeroth-test-CER | zeroth-test-WER | fleurs-test-BLEU | fleurs-test-CER | fleurs-test-WER |
|------------------------|------------------|-----------------|-----------------|------------------|-----------------|-----------------|
| original | 0.071 | 126.4 | 121.5 | 0.010 | 115.7 | 112.8 |
| finetune (this model) | 94.837 | 1.429 | 2.951 | 67.659 | 7.951 | 18.313 |

Evaluation was done on the following tasks:

- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-korean test set (457 samples).
- AST (Automatic Speech Translation): evaluated with BLEU score on fleurs ko <-> en speech translation (270 samples).

The evaluation script was retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py); a scoring sketch using the normalizer and metrics above is shown after the tables.

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) and [Phi-4-multimodal-finetune-ko-speech](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech), ASR is significantly improved.

| Model | zeroth-test (CER) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|-------|-------------------|---------------------|-------------------------|---------------------|-------------------------|
| original | 198.32 | 5.63 | 2.42 | 6.86 | 4.17 |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech | 3.80 | 7.03 | 7.04 | 12.50 | 9.54 |
| seastar105/Phi-4-mm-inst-zeroth-kor | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |
| ASR finetune (this model) | 1.31 | 7.46 | 6.24 | 12.15 | 8.91 |
| + AST finetune with [junnei/covost2](https://huggingface.co/datasets/junnei/covost2) | 3.88 | 8.07 | 10.09 | 18.82 | 15.41 |
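For illustration, below is a minimal sketch of how a single prediction/reference pair could be scored with the normalizer and metric objects from the evaluation setup above. The Korean strings are made-up examples, not taken from the test sets; the real evaluation loops over the full zeroth-korean and fleurs test splits.

```python
from whisper_normalizer.basic import BasicTextNormalizer
from evaluate import load

normalizer = BasicTextNormalizer()
cer_metric = load("cer")
wer_metric = load("wer")

# Hypothetical model output and reference transcript.
prediction = normalizer("안녕하세요 만나서 반갑습니다")
reference = normalizer("안녕하세요, 만나서 반갑습니다.")

# CER and WER are computed on the normalized strings.
cer = cer_metric.compute(predictions=[prediction], references=[reference])
wer = wer_metric.compute(predictions=[prediction], references=[reference])
print(f"CER: {cer:.4f}, WER: {wer:.4f}")
```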
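## Usage

A minimal inference sketch for Korean ASR, following the audio chat format documented for the base microsoft/Phi-4-multimodal-instruct model. The model id, audio file path, prompt wording, and generation settings are illustrative assumptions, not values taken from this card.

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repository id; replace with the actual id of this fine-tuned checkpoint.
model_id = "Phi-4-multimodal-instruct-ko-asr"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
).cuda()

# Chat format used by the base Phi-4-multimodal-instruct model for speech input.
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

# "sample.wav" is a placeholder for a Korean speech recording.
audio, sample_rate = sf.read("sample.wav")
inputs = processor(text=prompt, audios=[(audio, sample_rate)], return_tensors="pt").to("cuda")

# Generate and decode only the newly produced tokens.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(transcript)
```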