daekeun-ml committed
Commit f44c7a9 · verified · 1 parent: 59813d5

Update README.md

Files changed (1): README.md (+65 −2)
  - phi-4-multimodal
---

# Phi-4-multimodal-finetune-ko-speech

This model is fine-tuned for Korean speech-to-text tasks (ASR and speech translation) from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the following datasets:
The evaluation script is retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR is significantly improved thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed in during training to mitigate catastrophic forgetting.
| Model | zeroth-test | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
|----------------------|-------------|--------------|------------------|--------------|------------------|
| finetune (this model)| 3.80 | 7.03 | 7.04 | 12.50 | 9.54 |
| Phi-4-mm-inst-zeroth-kor | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |
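For reference, ASR benchmarks like zeroth-test are commonly scored with character error rate (CER), i.e. character-level edit distance divided by reference length. Below is a minimal pure-Python sketch of that metric (illustrative only; the linked gist is the actual evaluation script):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted character out of five
print(f"{cer('안녕하세요', '안넝하세요'):.2f}")  # → 0.20
```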
 
## Usage

### Requirements

The model works with the following package versions; please install them before use:

```
flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
```

### Sample code

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"

generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Task prompts are from the technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # "몬토 킬은 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"
```

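The CoT prompts above ask the model to put `<sep>` between the transcript and the translation, so the decoded response can be split on that token. A minimal sketch, using a hypothetical response string in place of real model output:

```python
# Hypothetical decoded response; real output would come from processor.batch_decode
response = "원문 전사 결과 <sep> The translated English text."

# The CoT prompts request "<sep>" between the original transcript and the translation
transcript, translation = (part.strip() for part in response.split("<sep>", 1))
print(transcript)    # 원문 전사 결과
print(translation)   # The translated English text.
```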
## References

- https://huggingface.co/microsoft/Phi-4-multimodal-instruct