junnei
/

Phi-4-multimodal-instruct-ko-asr

Automatic Speech Recognition

text-generation

Model card Files Files and versions Community

Phi-4-multimodal-instruct-ko-asr / README.md

junnei's picture

Update README.md

c65944b verified about 3 hours ago

|

history blame contribute delete

3.71 kB

	---
	library_name: transformers
	datasets:
	- Bingsu/zeroth-korean
	- google/fleurs
	language:
	- ko
	metrics:
	- cer
	- wer
	- bleu
	base_model:
	- microsoft/Phi-4-multimodal-instruct
	model-index:
	- name: Phi-4-multimodal-instruct-ko-asr
	results:
	- task:
	type: automatic-speech-recognition
	dataset:
	type: Bingsu/zeroth_korean
	name: zeroth-korean-test
	metrics:
	- type: bleu
	name: zeroth-test-BLEU
	value: 94.837
	- type: cer
	name: zeroth-test-CER
	value: 1.429
	- type: wer
	name: zeroth-test-WER
	value: 2.951
	- task:
	type: automatic-speech-recognition
	dataset:
	type: google/flerus
	name: flerus-ko-test
	metrics:
	- type: bleu
	name: fleurs-test-BLEU
	value: 67.659
	- type: cer
	name: fleurs-test-CER
	value: 7.951
	- type: wer
	name: fleurs-test-WER
	value: 18.313
	pipeline_tag: automatic-speech-recognition
	---



	This model is fine-tuned from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on [Bingsu/zeroth-korean](https://huggingface.co/datasets/Bingsu/zeroth-korean), [google/flerus](https://huggingface.co/datasets/Bingsu/google/flerus) in 5 epochs.

	This model is trained 960 steps on datasets for Korean Audio Speech Recognition on H100.

	After that, we will check if it can perform scalable work through additional training with synthetic data from CoVoST2 Dataset into Korean.

	## Evaluation

	Evaluation by
	```
	from whisper_normalizer.basic import BasicTextNormalizer
	from evaluate import load

	normalizer = BasicTextNormalizer()
	cer_metric = load("cer")
	wer_metric = load("wer")
	```

	\| Model \| zeroth-test-BLEU \| zeroth-test-CER \| zeroth-test-WER \| fleurs-test-BLEU \| fleurs-test-CER \| fleurs-test-WER \|
	\|--------------------\|------------------\|-----------------\|-----------------\|------------------\|-----------------\|-----------------\|
	\| original \| 0.071 \| 126.4 \| 121.5 \| 0.010 \| 115.7 \| 112.8 \|
	\| finetune (this model) \| 94.837 \| 1.429 \| 2.951 \| 67.659 \| 7.951 \| 18.313 \|


	Evaluation was done on the following datasets:
	- ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) on zeroth-test set (457 samples).
	- AST (Automatic Speech Translation): Evaluated with BLEU score on fleurs ko <-> en speech translation result (270 samples).

	Script is retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).

	Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) and [Phi-4-multimodal-finetune-ko-speech](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech), ASR is significantly improved.

	\| Model \| zeroth-test \| fleurs-ko2en \| fleurs-ko2en-cot \| fleurs-en2ko \| fleurs-en2ko-cot \|
	\|----------------------\|-------------\|--------------\|------------------\|--------------\|------------------\|
	\| original \| 198.32 \| 5.63 \| 2.42 \| 6.86 \| 4.17 \|
	\| daekeun-ml/Phi-4-multimodal-finetune-ko-speech\| 3.80 \| 7.03 \| 7.04 \| 12.50 \| 9.54 \|
	\| seastar105/Phi-4-mm-inst-zeroth-kor \| 7.02 \| 7.07 \| 9.19 \| 13.08 \| 9.35 \|
	\| ASR finetune (this model)\| 1.31 \| 7.46 \| 6.24 \| 12.15 \| 8.91 \|
	\| + AST finetune with (AST)[https://huggingface.co/datasets/junnei/covost2]\| 3.88 \| 8.07 \| 10.09 \| 18.82 \| 15.41 \|