wav2vec2-jv-large-openslr

This model is a fine-tuned version of facebook/wav2vec2-large on the OpenSLR 41 dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2727
  • Wer: 0.1523
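
For readers unfamiliar with the metric, the sketch below shows how a word error rate is commonly computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. This is an illustrative implementation, not the evaluation code used for this model, and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, for illustration only: one dropped word out of five
print(word_error_rate("aku arep lunga menyang pasar", "aku arep lunga pasar"))
```

On this scale, the reported Wer of 0.1523 corresponds to about 15 word-level edits per 100 reference words.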

Model description

The model is wav2vec2 fine-tuned on the OpenSLR 41 dataset, which covers the Javanese language domain. Fine-tuning adapts the pretrained wav2vec2 architecture to recognize and transcribe spoken Javanese.

Intended uses & limitations

This model is intended for transcribing spoken Javanese from audio recordings. It achieves a word error rate (WER) of 15.2% on the evaluation set, so it performs reasonably well but still produces a noticeable number of transcription errors. Accuracy may vary, particularly with challenging audio conditions or less common dialects. The model requires input audio at a sample rate of 16 kHz, so recordings at other sample rates must be resampled first, and low-quality audio may reduce accuracy.
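
As a quick pre-check for the 16 kHz requirement, the sketch below reads a WAV header with Python's standard `wave` module and reports whether resampling is needed. It assumes an uncompressed PCM WAV file; the function name `needs_resampling` is hypothetical and not part of this model's tooling.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sample rate stated in this model card


def needs_resampling(wav_path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True when the WAV file's sample rate differs from target_rate."""
    # wave.open only understands uncompressed PCM WAV files
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate


# Usage (the path is a placeholder):
# if needs_resampling("path_to_audio_file.wav"):
#     print("Resample to 16 kHz before transcription")
```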

Training and evaluation data

The model was trained and evaluated on the OpenSLR 41 dataset, split into two parts (training and testing). Training ran on a single A100 GPU and took 4 to 5 hours.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 75
  • mixed_precision_training: Native AMP
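
To make the schedule concrete, the sketch below computes the learning rate that a linear scheduler with warmup would produce at a given step, using the values listed above. The total step count of 52,950 is an estimate derived from the training log in this card (2000 steps per 2.8329 epochs is about 706 steps per epoch, times 75 epochs); it is not a figure reported by the author, and the function name is hypothetical.

```python
def linear_schedule_lr(step: int, base_lr: float = 1e-4,
                       warmup_steps: int = 1000, total_steps: int = 52_950) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at total_steps (assumed value)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 500, 1000, 26_975, 52_950):
    print(step, linear_schedule_lr(step))
```

The rate peaks at 0.0001 when warmup ends at step 1000 and falls back to half of that around the midpoint of the decay phase.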

Log Data | Training results

Training Loss | Epoch   | Step  | Validation Loss | WER
0.6739        | 2.8329  | 2000  | 0.5742          | 0.4900
0.4767        | 5.6657  | 4000  | 0.4601          | 0.4101
0.3889        | 8.4986  | 6000  | 0.3921          | 0.3329
0.3381        | 11.3314 | 8000  | 0.3323          | 0.3081
0.2842        | 14.1643 | 10000 | 0.3467          | 0.3081
0.2505        | 16.9972 | 12000 | 0.3186          | 0.2833
0.2158        | 19.8300 | 14000 | 0.3003          | 0.2522
0.1885        | 22.6629 | 16000 | 0.2877          | 0.2405
0.1695        | 25.4958 | 18000 | 0.3089          | 0.2405
0.1494        | 28.3286 | 20000 | 0.2924          | 0.2254
0.1331        | 31.1615 | 22000 | 0.2796          | 0.2068
0.1293        | 33.9943 | 24000 | 0.2734          | 0.1895
0.1083        | 36.8272 | 26000 | 0.2844          | 0.1826
0.0955        | 39.6601 | 28000 | 0.2665          | 0.1744
0.085         | 42.4929 | 30000 | 0.2772          | 0.1695
0.0799        | 45.3258 | 32000 | 0.2747          | 0.1654
0.072         | 48.1586 | 34000 | 0.2746          | 0.1558
0.0934        | 50.9915 | 36000 | 0.2979          | 0.1764
0.0912        | 53.8244 | 38000 | 0.2914          | 0.1778
0.0812        | 56.6572 | 40000 | 0.2762          | 0.1785
0.0779        | 59.4901 | 42000 | 0.2752          | 0.1688
0.0718        | 62.3229 | 44000 | 0.2623          | 0.1633
0.0656        | 65.1558 | 46000 | 0.2704          | 0.1647
0.0606        | 67.9887 | 48000 | 0.2632          | 0.1571
0.0564        | 70.8215 | 50000 | 0.2711          | 0.1551
0.0562        | 73.6544 | 52000 | 0.2727          | 0.1523
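
Validation loss and WER do not reach their minimum at the same step in this log. The sketch below, with the step, validation loss, and WER columns copied from the training log above, shows one way to locate each minimum; the variable names are illustrative.

```python
# (step, validation_loss, wer) rows copied from the training log
LOG = [
    (2000, 0.5742, 0.4900), (4000, 0.4601, 0.4101), (6000, 0.3921, 0.3329),
    (8000, 0.3323, 0.3081), (10000, 0.3467, 0.3081), (12000, 0.3186, 0.2833),
    (14000, 0.3003, 0.2522), (16000, 0.2877, 0.2405), (18000, 0.3089, 0.2405),
    (20000, 0.2924, 0.2254), (22000, 0.2796, 0.2068), (24000, 0.2734, 0.1895),
    (26000, 0.2844, 0.1826), (28000, 0.2665, 0.1744), (30000, 0.2772, 0.1695),
    (32000, 0.2747, 0.1654), (34000, 0.2746, 0.1558), (36000, 0.2979, 0.1764),
    (38000, 0.2914, 0.1778), (40000, 0.2762, 0.1785), (42000, 0.2752, 0.1688),
    (44000, 0.2623, 0.1633), (46000, 0.2704, 0.1647), (48000, 0.2632, 0.1571),
    (50000, 0.2711, 0.1551), (52000, 0.2727, 0.1523),
]

best_by_wer = min(LOG, key=lambda row: row[2])
best_by_loss = min(LOG, key=lambda row: row[1])

print("Lowest WER:", best_by_wer)
print("Lowest validation loss:", best_by_loss)
```

The lowest WER falls on the final logged step (52000), which matches the results reported at the top of this card, while the lowest validation loss occurs earlier, at step 44000.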

How to run (Gradio Web)

import torch
import gradio as gr
import numpy as np
from transformers import pipeline, AutoProcessor, AutoModelForCTC

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and processor
MODEL_NAME = "<your model ID>"  # e.g. "johaness14/wav2vec2-jv-large-openslr"
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForCTC.from_pretrained(MODEL_NAME)

# Move the model to the selected device (GPU if available, otherwise CPU)
model.to(device)

# Create the pipeline with the model and processor
transcriber = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor, device=device)

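def transcribe(audio):
    sr, y = audio
    y = y.astype(np.float32)
    # Downmix multi-channel input, shaped (samples, channels), to mono
    if y.ndim > 1:
        y = y.mean(axis=1)
    # Peak-normalize; skip silent clips to avoid dividing by zero
    peak = np.max(np.abs(y))
    if peak > 0:
        y /= peak

    return transcriber({"sampling_rate": sr, "raw": y})["text"]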

demo = gr.Interface(
    transcribe,
    gr.Audio(sources=["upload"]),
    "text",
)

demo.launch(share=True)

How to run (Python script)

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and processor
MODEL_NAME = "<your model ID>"  # e.g. "johaness14/wav2vec2-jv-large-openslr"
processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForCTC.from_pretrained(MODEL_NAME)

# Move the model to the selected device (GPU if available, otherwise CPU)
model.to(device)

# Load audio file
AUDIO_PATH = "path_to_audio_file.wav"  # replace with the actual path to your audio file
audio_input, sample_rate = torchaudio.load(AUDIO_PATH)

# Ensure the audio is mono (1 channel)
if audio_input.shape[0] > 1:
    audio_input = torch.mean(audio_input, dim=0, keepdim=True)

# Resample audio if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    audio_input = resampler(audio_input)

# Process the audio input
input_values = processor(audio_input.squeeze(), sampling_rate=16000, return_tensors="pt").input_values

# Move input values to the same device as the model
input_values = input_values.to(device)

# Perform inference
with torch.no_grad():
    logits = model(input_values).logits

# Decode the logits to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]

print("Transcription:", transcription)
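
The last two steps (argmax, then `batch_decode`) amount to greedy CTC decoding. The sketch below illustrates the idea in plain Python: collapse consecutive repeated ids, drop the blank token, and turn the word delimiter into spaces. The toy vocabulary, the blank id of 0, and the "|" delimiter are assumptions for illustration; the real values come from the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join the remaining tokens into text."""
    # Consecutive duplicates represent the same character held across frames
    collapsed = [token_id for token_id, _ in groupby(ids)]
    # Blanks separate genuine repeats (e.g. a double letter) and are then discarded
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank
vocab = {0: "<pad>", 1: "|", 2: "a", 3: "k", 4: "u"}
frame_ids = [2, 2, 0, 3, 3, 3, 0, 0, 4, 1, 1, 2, 0, 2]
print(ctc_greedy_decode(frame_ids, vocab))
```

Note how the blank between the final two `2` ids keeps them as two separate characters instead of merging them into one.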

Framework versions

  • Transformers 4.44.0
  • PyTorch 2.2.1+cu118
  • Datasets 2.20.0
  • Tokenizers 0.19.1