reazonspeech-k2-v2-ja-en

reazonspeech-k2-v2-ja-en is an automatic speech recognition (ASR) model trained on the ReazonSpeech v2.0 corpus and LibriSpeech.

This model provides end-to-end Japanese and English speech recognition based on Next-gen Kaldi.

Model Architecture

  • Character-based RNN-T model.

  • This model utilizes an enhanced Transformer architecture called Zipformer.

Usage

We recommend using this model through the reazonspeech library.

from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

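# Load the audio file to transcribe.
audio = audio_from_path("speech.wav")
# Load the bilingual (Japanese/English) model on the CPU in 32-bit precision.
model = load_model(device="cpu", precision="fp32", language="ja-en")
# Run recognition and print the transcribed text.
ret = transcribe(model, audio)
print(ret.text)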

This model uses byte-level BPE (BBPE), so Japanese tokens are represented by character sequences such as ▁ƊģŊ.
Each transcribed token carries a timestamp, but because the Japanese tokens are encoded at the byte level, they are not directly human-readable.
English tokens, by contrast, are subwords printed in ordinary alphabetic text and can be read directly.
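
To illustrate why byte-level tokens look unreadable, the following is a minimal sketch of the general idea: text is converted to UTF-8 bytes, and each byte is rendered as one printable character. The mapping used here (a fixed offset of 0x100) and the function names are hypothetical and chosen only for illustration; this is not the table used by this model's tokenizer, so it will not decode the model's actual tokens.

```python
# Illustrative only: a HYPOTHETICAL byte-to-character mapping, not the
# table actually used by this model's BBPE tokenizer.
OFFSET = 0x100  # assumption: shift every byte into a printable Unicode range


def to_byte_symbols(text: str) -> str:
    """Render each UTF-8 byte of `text` as one printable character."""
    return "".join(chr(OFFSET + b) for b in text.encode("utf-8"))


def from_byte_symbols(symbols: str) -> str:
    """Invert to_byte_symbols and decode the recovered bytes as UTF-8."""
    return bytes(ord(c) - OFFSET for c in symbols).decode("utf-8")


encoded = to_byte_symbols("音声")
print(len(encoded))                # 6: each of the two kanji takes three UTF-8 bytes
print(from_byte_symbols(encoded))  # round-trips back to the original text
```

Under such a scheme, a single Japanese character spans several byte symbols, which is why a token boundary (and its timestamp) may fall in the middle of a character.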

Performance

After training, the model was evaluated with the following results.

Word Error Rates (WERs, %) for the Zipformer model:

Decoding method        ReazonSpeech dev   ReazonSpeech test   LibriSpeech test-clean   LibriSpeech test-other
greedy_search          5.9                4.07                3.46                     8.35
modified_beam_search   4.87               3.61                3.28                     8.07

Character Error Rates (CERs, %) for Japanese:

Decoding method        In-Distribution   JSUT   CommonVoice   TEDx
greedy search          12.56             6.93   9.75          9.67
modified beam search   11.59             6.97   9.55          9.51
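
For reference, WER and CER are conventionally computed as the edit distance between the reference and the hypothesis, divided by the reference length. The sketch below follows that standard definition; the text normalization used to produce the numbers above is not stated in this document, so the whitespace handling and the function names here are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent; assumes a non-empty reference."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, ignoring spaces (an assumption)."""
    ref_chars = list(ref.replace(" ", ""))
    hyp_chars = list(hyp.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(cer("こんにちは", "こんにちわ"))  # 20.0: one substitution over five characters
```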

Additional tests were performed on manually procured audio files (see test_wavs/transcripts.txt).
The model performs reasonably well as long as the input audio contains a single language.
However, when multiple languages appear in the same input, the model struggles to produce an accurate transcription (see test_multi).
This failure can be avoided by segmenting the audio into chunks at pauses in speech.

  • test_ja_1: 57% (CER)
  • test_ja_2: 26% (CER)
  • test_multi: 99% (CER)
  • test_en_1: 12% (WER)
  • test_en_2: 27% (WER)
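
The pause-based segmentation suggested above could be sketched as follows. This is a simple energy-based splitter written for illustration, not a method provided by the reazonspeech library; the function name, frame size, threshold, and minimum pause length are all assumptions that would need tuning on real recordings.

```python
def split_on_pauses(samples, sample_rate, frame_ms=20,
                    threshold=0.01, min_pause_ms=300):
    """Return (start, end) sample indices of chunks separated by pauses.

    `samples` is a sequence of floats in [-1, 1]. A frame counts as silent
    when its RMS energy is below `threshold` (an assumed value).
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_pause_frames = max(1, min_pause_ms // frame_ms)
    chunks = []
    start = None          # first sample of the chunk being built, if any
    last_voiced_end = 0   # end of the most recent non-silent frame
    silent_run = 0        # consecutive silent frames inside a chunk
    for pos in range(0, len(samples), frame_len):
        frame = samples[pos:pos + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= threshold:
            if start is None:
                start = pos
            last_voiced_end = pos + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The pause is long enough: close the current chunk.
                chunks.append((start, last_voiced_end))
                start = None
    if start is not None:
        chunks.append((start, last_voiced_end))
    return chunks


# Two bursts of sound separated by a 400 ms pause, at a toy 1 kHz sample rate.
signal = [0.5] * 200 + [0.0] * 400 + [0.5] * 200
print(split_on_pauses(signal, sample_rate=1000))  # [(0, 200), (600, 800)]
```

Each resulting chunk can then be transcribed separately, so that a switch of language at a pause does not fall inside a single input.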

License

Apache License 2.0
