---
library_name: transformers
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
  - generated_from_trainer
model-index:
  - name: Phi-4-mm-inst-asr-turkish-unf
    results: []
datasets:
  - ysdede/khanacademy-turkish
  - ysdede/khanacademy-turkish-math
  - ysdede/commonvoice_17_tr_fixed
language:
  - tr
---

# Phi-4-mm-inst-asr-turkish-unf

This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

## Model Background

This benchmark evaluates a fine-tuned version of Microsoft's Phi-4-multimodal-instruct, a multimodal model not originally designed for Turkish ASR. Key points:

1. **Initial limitations:**
   - No Turkish ASR support in the base model
   - Initial WER above 100%
2. **Fine-tuning process** (a minimal sketch follows this list):
   - Unfroze encoder layers for Turkish adaptation
   - Trained for 1 epoch on Turkish audio-text pairs
3. **Current status:**
   - Achieved a significant WER reduction (100+% → 9.7% on CommonVoice)\*
   - Still under active development for better generalization
   - Results shared as incremental progress documentation
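
A minimal sketch of the selective-unfreezing step, assuming the audio-encoder weights can be selected by a parameter-name substring. The `"audio"` filter is an illustrative guess, not the exact training script; inspect `model.named_parameters()` for the real module names in the checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM

# Phi-4-multimodal-instruct ships custom modeling code, hence trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Freeze everything, then unfreeze only the audio-encoder weights so the
# speech side adapts to Turkish while the language model stays fixed.
# NOTE: the "audio" name filter is an assumption for illustration.
for name, param in model.named_parameters():
    param.requires_grad = "audio" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```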

## Why This Matters

- Demonstrates the adaptability of multimodal architectures
- Provides a baseline for Turkish ASR in resource-constrained scenarios
- Encourages exploration of under-supported languages

\*Note on CommonVoice results: the relatively low WER (9.7%) may benefit from:

- Potential speaker leakage between splits (same speakers in train/test)
- Clean audio conditions despite non-professional recordings
- Short utterance structure (average 4-5 seconds)

See the "Dataset Notes" section below for full context on CommonVoice characteristics.

## Benchmark Results

Testing environment: Google Colab with L4 GPU (24 GB VRAM)

| Dataset | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |
|---|---|---|---|---|---|---|
| ysdede/commonvoice_17_tr_fixed | 9.7 | 2.72 | x26 | 32 | 7.1 | 8,576 |
| erenfazlioglu/turkishvoicedataset | 11.52 | 3.93 | x20 | 16 | 7.8 | 2,496 |
| ysdede/khanacademy-turkish | 12.04 | 7.78 | x16 | 16 | 3.8 | 1,344 |
| ysdede/yeni-split-0 | 20.58 | 13.2 | x16 | 16 | 18 | 5,936 |
| ymoslem/MediaSpeech | 25.48 | 15.16 | x35 | 32 | 10 | 2,496 |
| dssnt1 | 27.23 | 9.6 | x12 | 16 | 2.5 | 1,200 |
| ysdede/yeni-split-lq-noisy | 39.4 | 27 | x19 | 16 | 12 | 3,440 |
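
The WER/CER columns can be reproduced with standard tooling such as `jiwer` (an assumption; the exact scoring script is not published here), and the speed column reads as a real-time multiple, i.e. hours of audio transcribed per hour of wall-clock time:

```python
import jiwer

# Toy reference/hypothesis pairs; the real evaluation runs over the full
# benchmark splits listed in the table above.
references = ["bugün hava çok güzel", "on beş dakika sonra başlıyoruz"]
hypotheses = ["bugün hava çok güzel", "onbeş dakika sonra başlıyoruz"]

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
print(f"CER: {jiwer.cer(references, hypotheses):.2%}")

# Real-time factor, assuming xRT = audio duration / decoding time:
# 7.1 h of audio at x26 implies roughly 7.1 / 26 ≈ 0.27 h of decoding.
audio_hours = 7.1
decode_hours = 7.1 / 26
print(f"xRT: x{audio_hours / decode_hours:.0f}")
```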

## Dataset Notes

Fine-tuning datasets:

- **commonvoice_17_tr_fixed**: crowd-sourced clean speech (not professional studio recordings) with shuffled splits; potential speaker leakage (same speakers in train/test with different utterances)
- **khanacademy-turkish**: educational lectures with STEM vocabulary
- **yeni-split-0**: noisy real-world recordings

Benchmark-only datasets:

- **turkishvoicedataset**: synthetic TTS news (clean but artificial prosody)
- **yeni-split-lq-noisy**: challenging noisy samples with alignment errors

## Text Normalization Challenges

⚠️ Current WER/CER scores may be inflated due to:

1. Lack of a standardized Turkish ASR text-normalization pipeline
2. Case and punctuation inconsistencies in references
3. Agglutinative morphology affecting word boundaries

## Evaluation Note

For Turkish ASR benchmarking, I developed a text normalizer to address language-specific scoring challenges. While imperfect, it helps:

- Convert numbers and dates to words
- Standardize compound-word formatting
- Reduce punctuation-related mismatches

This preprocessing makes WER/CER calculations somewhat fairer than raw scoring, though manual verification is still recommended. The tool is actively being refined based on validation-set findings; a simplified sketch follows below.
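
A simplified sketch of the kind of normalization described above. This is not the actual tool; the `num2words` dependency and the regex rules are illustrative assumptions covering only the basic cases.

```python
import re

from num2words import num2words  # pip install num2words; Turkish via lang="tr"


def normalize_tr(text: str) -> str:
    """Rough Turkish normalization: Turkish-aware lowercasing, digits to
    words, punctuation stripping. The production tool covers dates,
    compounds, and more edge cases."""
    # Dotted/dotless i must be mapped before generic lowercasing.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # Spell out standalone integers, e.g. "15" -> "on beş".
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="tr"), text)
    # Remove punctuation so it cannot inflate WER, then squeeze whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


print(normalize_tr("Toplantı 15 Mart'ta başlıyor!"))
# -> "toplantı on beş mart ta başlıyor"
```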

## Training procedure

See the finetuning Colab notebook.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

### Training hyperparameters

The following hyperparameters were used during training (a hedged `TrainingArguments` reconstruction follows the list):

- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.99) and epsilon=1e-07; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
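
These settings map directly onto `transformers.TrainingArguments`; the sketch below is a reconstruction from the list above, not the original training script, and `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Reconstruction of the reported hyperparameters; dataloader and
# model-specific settings from the actual run are not shown here.
training_args = TrainingArguments(
    output_dir="phi-4-mm-inst-asr-turkish-unf",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```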

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0