---
library_name: transformers
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
  - generated_from_trainer
model-index:
  - name: Phi-4-mm-inst-asr-turkish-unf
    results: []
datasets:
  - ysdede/khanacademy-turkish
  - ysdede/khanacademy-turkish-math
  - ysdede/commonvoice_17_tr_fixed
language:
  - tr
---

# Phi-4-mm-inst-asr-turkish-unf

This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).

## Model Background

This benchmark evaluates a fine-tuned version of Microsoft's Phi-4-multimodal-instruct, a multimodal model not originally designed for Turkish ASR. Key points:

1. **Initial limitations:**
   - No Turkish ASR support in the base model
   - Initial WER above 100%
2. **Fine-tuning process** (a minimal sketch follows this list):
   - Unfroze encoder layers for Turkish adaptation
   - Trained for 1 epoch on Turkish audio-text pairs
3. **Current status:**
   - Achieved a significant WER reduction (100+% → 9.7% on CommonVoice)\*
   - Still under active development for better generalization
   - Results shared as incremental progress documentation
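
A minimal sketch of the selective-unfreezing step, assuming the audio-encoder weights can be selected by a parameter-name substring. The `"audio"` filter is an illustrative guess, not the exact training script; inspect `model.named_parameters()` for the real module names in the checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM

# Phi-4-multimodal-instruct ships custom modeling code, hence trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Freeze everything, then unfreeze only the audio-encoder weights so the
# speech side adapts to Turkish while the language model stays fixed.
# NOTE: the "audio" name filter is an assumption for illustration.
for name, param in model.named_parameters():
    param.requires_grad = "audio" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```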

## Why This Matters

- Demonstrates the adaptability of multimodal architectures
- Provides a baseline for Turkish ASR in resource-constrained scenarios
- Encourages exploration of under-supported languages

\*Note on CommonVoice results: the relatively low WER (9.7%) may benefit from:

- Potential speaker leakage between splits (same speakers in train/test)
- Clean audio conditions despite non-professional recordings
- Short utterance structure (average 4-5 seconds)

See the "Dataset Notes" section below for full context on CommonVoice characteristics.

## Benchmark Results

Testing environment: Google Colab with L4 GPU (24 GB VRAM)

| Dataset | WER (%) | CER (%) | Inference Speed (xRT) | Batch Size | Audio Duration (hrs) | Samples Processed |
|---|---|---|---|---|---|---|
| ysdede/commonvoice_17_tr_fixed | 9.7 | 2.72 | x26 | 32 | 7.1 | 8,576 |
| erenfazlioglu/turkishvoicedataset | 11.52 | 3.93 | x20 | 16 | 7.8 | 2,496 |
| ysdede/khanacademy-turkish | 12.04 | 7.78 | x16 | 16 | 3.8 | 1,344 |
| ysdede/yeni-split-0 | 20.58 | 13.2 | x16 | 16 | 18 | 5,936 |
| ymoslem/MediaSpeech | 25.48 | 15.16 | x35 | 32 | 10 | 2,496 |
| dssnt1 | 27.23 | 9.6 | x12 | 16 | 2.5 | 1,200 |
| ysdede/yeni-split-lq-noisy | 39.4 | 27 | x19 | 16 | 12 | 3,440 |
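
The WER/CER columns can be reproduced with standard tooling such as `jiwer` (an assumption; the exact scoring script is not published here), and the speed column reads as a real-time multiple, i.e. hours of audio transcribed per hour of wall-clock time:

```python
import jiwer

# Toy reference/hypothesis pairs; the real evaluation runs over the full
# benchmark splits listed in the table above.
references = ["bugün hava çok güzel", "on beş dakika sonra başlıyoruz"]
hypotheses = ["bugün hava çok güzel", "onbeş dakika sonra başlıyoruz"]

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
print(f"CER: {jiwer.cer(references, hypotheses):.2%}")

# Real-time factor, assuming xRT = audio duration / decoding time:
# 7.1 h of audio at x26 implies roughly 7.1 / 26 ≈ 0.27 h of decoding.
audio_hours = 7.1
decode_hours = 7.1 / 26
print(f"xRT: x{audio_hours / decode_hours:.0f}")
```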

## Dataset Notes

Fine-tuning datasets:

- **commonvoice_17_tr_fixed**: crowd-sourced clean speech (not professional studio recordings) with shuffled splits; potential speaker leakage (same speakers in train/test with different utterances)
- **khanacademy-turkish**: educational lectures with STEM vocabulary
- **yeni-split-0**: noisy real-world recordings

Benchmark-only datasets:

- **turkishvoicedataset**: synthetic TTS news (clean but artificial prosody)
- **yeni-split-lq-noisy**: challenging noisy samples with alignment errors

## Text Normalization Challenges

⚠️ Current WER/CER scores may be inflated due to:

1. Lack of a standardized Turkish ASR text-normalization pipeline
2. Case and punctuation inconsistencies in references
3. Agglutinative morphology affecting word boundaries

## Evaluation Note

For Turkish ASR benchmarking, I developed a text normalizer to address language-specific scoring challenges. While imperfect, it helps:

- Convert numbers and dates to words
- Standardize compound-word formatting
- Reduce punctuation-related mismatches

This preprocessing makes WER/CER calculations somewhat fairer than raw scoring, though manual verification is still recommended. The tool is actively being refined based on validation-set findings; a simplified sketch follows below.
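
A simplified sketch of the kind of normalization described above. This is not the actual tool; the `num2words` dependency and the regex rules are illustrative assumptions covering only the basic cases.

```python
import re

from num2words import num2words  # pip install num2words; Turkish via lang="tr"


def normalize_tr(text: str) -> str:
    """Rough Turkish normalization: Turkish-aware lowercasing, digits to
    words, punctuation stripping. The production tool covers dates,
    compounds, and more edge cases."""
    # Dotted/dotless i must be mapped before generic lowercasing.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # Spell out standalone integers, e.g. "15" -> "on beş".
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="tr"), text)
    # Remove punctuation so it cannot inflate WER, then squeeze whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


print(normalize_tr("Toplantı 15 Mart'ta başlıyor!"))
# -> "toplantı on beş mart ta başlıyor"
```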

## Training procedure

See the finetuning Colab notebook.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

### Training hyperparameters

The following hyperparameters were used during training (a hedged `TrainingArguments` reconstruction follows the list):

- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.99) and epsilon=1e-07; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
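
These settings map directly onto `transformers.TrainingArguments`; the sketch below is a reconstruction from the list above, not the original training script, and `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Reconstruction of the reported hyperparameters; dataloader and
# model-specific settings from the actual run are not shown here.
training_args = TrainingArguments(
    output_dir="phi-4-mm-inst-asr-turkish-unf",  # placeholder path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```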

### Framework versions

- Transformers 4.48.3
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0