How to use - transcriptions seem hit-and-miss too often?

by chrisoutwright

Using:

import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="nizarmichaud/whisper-large-v3-turbo-swissgerman",
    device=device,
)

# Force the decoder to output in German for transcription
asr_pipeline.model.config.forced_decoder_ids = asr_pipeline.tokenizer.get_decoder_prompt_ids(
    language="de",
    task="transcribe",
)
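As a side note, recent transformers versions also let you pass the language and task per call via generate_kwargs instead of mutating the config; a minimal sketch ("sample.wav" is a placeholder path, not from this thread):

# Alternative: set language/task at call time instead of forced_decoder_ids.
result = asr_pipeline(
    "sample.wav",
    generate_kwargs={"language": "de", "task": "transcribe"},
)
print(result["text"])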

Either way, I don't get really acceptable results.

The audio gets transcribed to:

"
für Fabian, für die schöne Zeit, die er hier hatte.
Und dann noch ein Abschiedsgeschenk, das es auch in Österreich hat.
"

Anything wrong? It should be "Easter Card Gift" ... There are quite a lot of issues with the model not picking up subtle things that are quite common in Swiss German. Then again, it is technically also quite hard to handle that with fine-tuning alone.

This comment has been hidden

Hey @chrisoutwright,

Thanks for pointing this out! Indeed, Swiss German is tricky because it's not really a written language (https://en.wikipedia.org/wiki/Swiss_German: "There are no official rules of Swiss German orthography.") and has lots of smaller dialects across cantons. The datasets we used for fine-tuning reflect that inconsistency: some were direct translations into standard German, while others tried to transcribe Swiss German more literally. Expressions like "uu-gärn" either didn't make it into the datasets or weren't transcribed properly, which is why the model sometimes struggles with them.

The thing is, during fine-tuning the model tries to fit the data it's trained on as best it can. That means it inherits the quirks of the datasets, some of which are more like translations than true transcriptions. My collaborators and I landed on using the model more for semantic analysis than for a purely phonetic approach. Basically, it's not always about capturing the exact words but more about understanding the general meaning behind them.

The base Whisper model also has its limitations. Even in ideal conditions it can skip words or sentences; that's just how the architecture works. That's why we talk so much about accuracy, Word Error Rate, BLEU scores, etc. They're not perfect and probably never will be. Slower playback can sometimes help, as you noticed, but it's not a cure-all, or it would already be a common trick in ASR.
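For anyone who wants to experiment with the slower-playback idea, here is a minimal sketch using librosa time stretching; the file name and the 0.85 rate are illustrative assumptions, not values from this thread:

import librosa
import soundfile as sf

# Load the clip at Whisper's expected 16 kHz and slow it down by ~15%.
# "clip.wav" and rate=0.85 are placeholders, purely illustrative.
audio, sr = librosa.load("clip.wav", sr=16000)
slowed = librosa.effects.time_stretch(audio, rate=0.85)  # rate < 1 slows the audio
sf.write("clip_slow.wav", slowed, sr)

# Feed the slowed array straight into the pipeline.
result = asr_pipeline({"raw": slowed, "sampling_rate": sr})
print(result["text"])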

Cheers,
Nizar

Thanks for the quick response. I accidentally hid my comment on my smartphone and cannot undo it, but I'm happy to see how the model can still be improved. It would probably help to use more everyday speech, but the sheer number of variations across the dialects really makes it a challenge while keeping it as one language.

Indeed, one thing that could be done is to fine-tune the model again on specific dialects. Some databases provide information about the canton of origin of the speakers.

The issue is that the more you narrow the language or population to improve dialect-specific results, the less data you have available to fit your model. Meta and UZH are doing great research on these issues (on low-resource languages and Swiss German, respectively), which can be interesting to look into if you need to improve strict phonetic performance.
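To make that concrete, here is a rough sketch of the filtering step with the datasets library; the repository id and the "canton" column are hypothetical, purely to illustrate narrowing the training population:

from datasets import load_dataset

# Hypothetical dataset and column names, only to illustrate the idea:
# keep a single canton's speakers before fine-tuning on that dialect.
ds = load_dataset("some-org/swiss-german-asr", split="train")  # placeholder repo id
zurich_only = ds.filter(lambda ex: ex["canton"] == "ZH")  # "canton" column is an assumption
print(len(ds), "->", len(zurich_only))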
