---
library_name: transformers
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- generated_from_trainer
model-index:
- name: Phi-4-mm-inst-asr-turkish-unf
  results: []
datasets:
- ysdede/khanacademy-turkish
- ysdede/khanacademy-turkish-math
- ysdede/commonvoice_17_tr_fixed
language:
- tr
---

# Phi-4-mm-inst-asr-turkish-unf

This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

**Model Background**: This benchmark evaluates a fine-tuned version of Microsoft's **Phi-4-multimodal-instruct**, a multimodal model not originally designed for Turkish ASR. Key points:

1. **Initial Limitations**:
   - No Turkish ASR support in the base model
   - Initial WER above 100%
2. **Fine-Tuning Process**:
   - Unfroze encoder layers for Turkish adaptation
   - Trained for 1 epoch on Turkish audio-text pairs
3. **Current Status**:
   - Achieved a significant WER reduction (100+% → 9.7% on CommonVoice)*
   - Still under active development for better generalization
   - Results shared as incremental progress documentation

**Why This Matters**:

- Demonstrates the adaptability of multimodal architectures
- Provides a baseline for Turkish ASR in resource-constrained scenarios
- Encourages exploration of under-supported languages

\* **Note on CommonVoice Results**: CommonVoice's relatively low WER (9.7%) may benefit from:

- Potential speaker leakage between splits (same speakers in train/test)
- Clean audio conditions despite non-professional recordings
- Short utterance structure (4–5 seconds on average)

See the "Dataset Notes" section below for full context on CommonVoice characteristics.
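The WER figures quoted in this card follow the standard definition: word-level edit distance divided by reference length. A minimal self-contained sketch is below (an illustration of the metric only; the card does not specify which scoring library was actually used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

# One substituted word out of three -> WER ≈ 0.333
print(wer("merhaba dünya nasılsın", "merhaba dünya nasilsin"))
```

Note that a WER above 100% is possible when the hypothesis contains many insertions relative to the reference, which is why the base model's initial score could exceed 100%.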
### Benchmark Results

**Testing Environment**: Google Colab with L4 GPU (24 GB VRAM)

| Model | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |
| :--------------------------------- | -------:| -------:| --------------------: | ----------:| --------------------:| -----------------:|
| ysdede/commonvoice_17_tr_fixed | 9.7 | 2.72 | x26 | 32 | 7.1 | 8,576 |
| erenfazlioglu/turkishvoicedataset | 11.52 | 3.93 | x20 | 16 | 7.8 | 2,496 |
| ysdede/khanacademy-turkish | 12.04 | 7.78 | x16 | 16 | 3.8 | 1,344 |
| ysdede/yeni-split-0 | 20.58 | 13.2 | x16 | 16 | 18 | 5,936 |
| ymoslem/MediaSpeech | 25.48 | 15.16 | x35 | 32 | 10 | 2,496 |
| dssnt1 | 27.23 | 9.6 | x12 | 16 | 2.5 | 1,200 |
| ysdede/yeni-split-lq-noisy | 39.4 | 27 | x19 | 16 | 12 | 3,440 |

**Dataset Notes**:

- **Fine-tuning datasets**:
  - `commonvoice_17_tr_fixed`: crowd-sourced clean speech (not professional studio recordings) with shuffled splits; potential **speaker leakage** (same speakers in train/test with different utterances)
  - `khanacademy-turkish`: educational lectures with STEM vocabulary
  - `yeni-split-0`: noisy real-world recordings
- **Benchmark-only datasets**:
  - `turkishvoicedataset`: synthetic TTS news (clean audio but artificial prosody)
  - `yeni-split-lq-noisy`: challenging noisy samples with alignment errors

**Text Normalization Challenges**: ⚠️ Current WER/CER scores may be inflated due to:

1. The lack of a standardized Turkish ASR text normalization pipeline
2. Case/punctuation inconsistencies in references
3. Agglutinative morphology affecting word boundaries

**Evaluation Note**: For Turkish ASR benchmarking, I developed a [text normalizer](https://github.com/ysdede/trnorm) to address language-specific scoring challenges.
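One pitfall the normalization challenges above point at: naive lowercasing is wrong for Turkish, where `I` maps to dotless `ı` and `İ` maps to `i`, while Python's default `str.lower()` applies English casing rules. A minimal illustrative sketch of Turkish-aware pre-scoring cleanup is below (this is not the `trnorm` implementation, just a toy example of the problem it addresses):

```python
import re

def normalize_tr(text: str) -> str:
    """Toy Turkish-aware normalization for fairer WER scoring."""
    # Handle Turkish I/İ casing before the generic lower(): Python's
    # str.lower() would turn "I" into "i" instead of the correct "ı".
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # Drop punctuation but keep apostrophes (used in suffixed proper
    # nouns such as İstanbul'da), then collapse whitespace.
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_tr("Işık, İstanbul!"))  # -> "ışık istanbul"
```

Without this kind of casing-aware step, reference/hypothesis pairs that differ only in capitalization or punctuation are counted as word errors, inflating WER.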
While imperfect, it helps to:

- Convert numbers/dates to words
- Standardize compound-word formatting
- Reduce punctuation-related mismatches

This preprocessing makes WER/CER calculations slightly fairer than raw scoring, though manual verification is still recommended. The tool is being actively refined based on validation-set findings.

**Performance Factors**:

- CommonVoice's relatively low WER (9.7%) likely benefits from:
  - High audio quality despite non-professional speakers
  - Potential speaker familiarity patterns (same speakers in both splits)
  - Short utterance structure (4–5 seconds on average)

## Training procedure

[Finetuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing)

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.99), epsilon=1e-07, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0
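For reference, the training hyperparameters listed above map onto the standard Hugging Face `TrainingArguments` keyword names roughly as follows (a sketch using the conventional `transformers` field names; the actual training script is in the linked Colab notebook and is not reproduced here):

```python
# Hypothetical mapping of the card's hyperparameters onto
# transformers.TrainingArguments keyword arguments.
training_kwargs = dict(
    learning_rate=1e-4,               # learning_rate: 0.0001
    per_device_train_batch_size=8,    # train_batch_size: 8
    per_device_eval_batch_size=8,     # eval_batch_size: 8
    seed=42,
    optim="adamw_torch",              # OptimizerNames.ADAMW_TORCH
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                 # lr_scheduler_warmup_ratio: 0.1
    num_train_epochs=1,
)
```

These could be passed as `TrainingArguments(output_dir=..., **training_kwargs)` to reproduce a comparable setup.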