Phi-4-mm-inst-asr-turkish-unf
This model is a fine-tuned version of microsoft/Phi-4-multimodal-instruct.
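For quick reference, below is a minimal transcription sketch. The prompt template and the `audios=[(audio, sr)]` processor call follow the base microsoft/Phi-4-multimodal-instruct conventions as I understand them; treat them as assumptions and check the base model card if generation fails.

```python
# Minimal Turkish transcription sketch (assumed API; verify against the
# microsoft/Phi-4-multimodal-instruct card).
import soundfile as sf
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "ysdede/Phi-4-mm-inst-asr-turkish-unf"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # or "flash_attention_2" if installed
    device_map="cuda",
)

audio, sr = sf.read("sample_tr.wav")  # hypothetical 16 kHz mono Turkish clip
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=256)
text = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(text)
```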
Model Background:
This benchmark evaluates a fine-tuned version of Microsoft's Phi-4-mm-instruct, a multimodal model not originally designed for Turkish ASR. Key points:
Initial Limitations:
- No Turkish ASR support in base model
- Initial WER above 100%
Fine-Tuning Process:
- Unfroze encoder layers for Turkish adaptation (see the sketch after this list)
- Trained for 1 epoch on Turkish audio-text pairs
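A minimal sketch of the unfreezing step referenced above. The name-based filter (keyword "audio") is an assumption about how the base checkpoint labels its audio encoder modules, not the exact recipe used for this run.

```python
# Sketch of selective unfreezing: keep the bulk of the network frozen and make
# only audio-related parameters trainable. The "audio" keyword is an assumption
# about how the base checkpoint names its audio encoder modules.
def unfreeze_audio_encoder(model, keyword: str = "audio"):
    trainable, frozen = 0, 0
    for name, param in model.named_parameters():
        if keyword in name.lower():
            param.requires_grad = True
            trainable += param.numel()
        else:
            param.requires_grad = False
            frozen += param.numel()
    print(f"trainable: {trainable:,} params | frozen: {frozen:,} params")

# unfreeze_audio_encoder(model)  # `model` loaded as in the sketch near the top
```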
Current Status:
- Achieved a significant WER reduction (from above 100% to 9.7% on CommonVoice)*
- Still under active development for better generalization
- Results shared as incremental progress documentation
Why This Matters:
- Demonstrates adaptability of multimodal architectures
- Provides baseline for Turkish ASR in resource-constrained scenarios
- Encourages exploration of under-supported languages
Note on CommonVoice Results:
- CommonVoice's relatively low WER (9.7%) may benefit from:
  - Potential speaker leakage between splits (same speakers in train/test)
  - Clean audio conditions despite non-professional recordings
  - Short utterance structure (average 4-5 seconds)
- See the "Dataset Notes" section below for full context on CommonVoice characteristics.
Benchmark Results
Testing Environment: Google Colab with L4 GPU (24 GB VRAM)
| Dataset | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |
|---|---|---|---|---|---|---|
| ysdede/commonvoice_17_tr_fixed | 9.7 | 2.72 | x26 | 32 | 7.1 | 8,576 |
| erenfazlioglu/turkishvoicedataset | 11.52 | 3.93 | x20 | 16 | 7.8 | 2,496 |
| ysdede/khanacademy-turkish | 12.04 | 7.78 | x16 | 16 | 3.8 | 1,344 |
| ysdede/yeni-split-0 | 20.58 | 13.2 | x16 | 16 | 18 | 5,936 |
| ymoslem/MediaSpeech | 25.48 | 15.16 | x35 | 32 | 10 | 2,496 |
| dssnt1 | 27.23 | 9.6 | x12 | 16 | 2.5 | 1,200 |
| ysdede/yeni-split-lq-noisy | 39.4 | 27 | x19 | 16 | 12 | 3,440 |
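For context on how the columns are computed: WER/CER can be reproduced with jiwer-style scoring, and xRT is read here as audio duration divided by wall-clock inference time (x26 ≈ 26 hours of audio per hour of compute). A hedged sketch, with `transcribe`, `refs`, and the timing variables as hypothetical stand-ins for the actual evaluation loop:

```python
# Sketch of the metric computation: WER/CER via jiwer, xRT as
# (seconds of audio) / (seconds of wall-clock inference time).
import time
import jiwer

def score(references, hypotheses, audio_seconds, wall_seconds):
    wer = jiwer.wer(references, hypotheses) * 100
    cer = jiwer.cer(references, hypotheses) * 100
    xrt = audio_seconds / wall_seconds  # e.g. x26 = 26x faster than real time
    return wer, cer, xrt

# Hypothetical usage around an inference loop:
# start = time.time()
# hypotheses = [transcribe(clip) for clip in clips]
# wer, cer, xrt = score(refs, hypotheses, total_audio_seconds, time.time() - start)
```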
Dataset Notes:
Finetuning Datasets:
- commonvoice_17_tr_fixed: Crowd-sourced clean speech (not professional studio recordings) with shuffled splits - potential speaker leakage (same speakers in train/test with different utterances)
- khanacademy-turkish: Educational lectures with STEM vocabulary
- yeni-split-0: Noisy real-world recordings

Benchmark-only Datasets:
- turkishvoicedataset: Synthetic TTS news (clean but artificial prosody)
- yeni-split-lq-noisy: Challenging noisy samples with alignment errors
Text Normalization Challenges:
⚠️ Current WER/CER scores may be inflated due to:
- Lack of standardized Turkish ASR text normalization pipeline
- Case/punctuation inconsistencies in references
- Agglutinative language morphology affecting word boundaries
Evaluation Note:
For Turkish ASR benchmarking, I developed a text normalizer to address language-specific scoring challenges. While imperfect, it helps:
- Convert numbers/dates to words
- Standardize compound word formatting
- Reduce punctuation-related mismatches
This preprocessing makes WER/CER calculations slightly fairer compared to raw scoring, though manual verification remains recommended. The tool is actively being refined based on validation set findings.
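The normalizer itself is not reproduced here; the sketch below only illustrates the kind of preprocessing described above (numbers to words via num2words' Turkish backend, Turkish-aware lowercasing, punctuation stripping). The function name and rules are illustrative assumptions, not the actual tool.

```python
# Illustrative Turkish normalization sketch (not the author's actual tool):
# numbers -> words, Turkish-aware lowercasing, punctuation removal.
import re
from num2words import num2words  # num2words ships a Turkish ("tr") backend

def normalize_tr(text: str) -> str:
    # Turkish-aware lowercasing: dotted/dotless I behave differently from English.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # Spell out standalone integers, e.g. "1923" -> "bin dokuz yüz yirmi üç".
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="tr"), text)
    # Drop punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_tr("İstanbul'da 3 kişi vardı."))  # -> "istanbul da üç kişi vardı"
```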
Performance Factors:
- CommonVoice's relatively low WER (9.7%) likely benefits from:
  - High audio quality despite non-professional speakers
  - Potential speaker familiarity patterns (same speakers in both splits)
  - Short utterance structure (average 4-5 seconds)
Training procedure
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.99) and epsilon=1e-07; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
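As a rough mapping onto transformers.TrainingArguments (values taken from the list above; the output directory and any argument not listed are assumptions):

```python
# Sketch mapping the listed hyperparameters onto transformers.TrainingArguments.
# Only the values shown above come from the card; everything else is an assumption.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi4-mm-asr-turkish",  # hypothetical
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```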
Framework versions
- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0