INT4 Whisper medium ONNX Model

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. This repository contains the INT4 weight-only quantized Whisper medium model in ONNX format, powered by Intel® Neural Compressor and Intel® Extension for Transformers.

This INT4 ONNX model is generated by Intel® Neural Compressor's weight-only quantization method.

| Model Detail | Description |
| --- | --- |
| Model Authors - Company | Intel |
| Date | October 8, 2023 |
| Version | 1 |
| Type | Speech Recognition |
| Paper or Other Resources | - |
| License | Apache 2.0 |
| Questions or Comments | Community Tab |

| Intended Use | Description |
| --- | --- |
| Primary intended uses | You can use the raw model for automatic speech recognition inference |
| Primary intended users | Anyone doing automatic speech recognition inference |
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |

Export to ONNX Model

The FP32 model is exported with openai/whisper-medium:

optimum-cli export onnx --model openai/whisper-medium whisper-medium-with-past/ --task automatic-speech-recognition-with-past --opset 13
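
As a quick sanity check after the export, the sketch below verifies that the output directory contains the three ONNX files that the quantization step iterates over. The helper name `missing_onnx_files` is hypothetical, and the directory name is assumed to match the one passed to the export command.

```python
from pathlib import Path

# File names taken from the model_list used in the quantization step.
EXPECTED_FILES = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)


def missing_onnx_files(export_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumed output directory of the export command.
    missing = missing_onnx_files("whisper-medium-with-past")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected ONNX files are present.")
```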

Install ONNX Runtime

Install onnxruntime>=1.16.0, which is required for the MatMulFpQ4 operator.
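
The version requirement can be checked before loading the model. The following is a minimal sketch that assumes the package is installed under the distribution name `onnxruntime` (GPU builds may use a different name); `parse_version` and `meets_minimum` are hypothetical helpers, not part of ONNX Runtime.

```python
from importlib import metadata

MIN_VERSION = (1, 16, 0)  # minimum version stated above


def parse_version(text):
    """Turn '1.16.0' or '1.16.0rc1' into a 3-tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '1.17' so tuple comparison behaves as expected.
    parts = (parts + [0, 0, 0])[:3]
    return tuple(parts)


def meets_minimum(version_text, minimum=MIN_VERSION):
    """Return True if version_text is at least the minimum version."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("onnxruntime")
    except metadata.PackageNotFoundError:
        print("onnxruntime is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old for MatMulFpQ4"
        print(f"onnxruntime {installed}: {status}")
```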

Run Quantization

Build Intel® Neural Compressor from the master branch and run INT4 weight-only quantization.

The weight-only quantization configuration is as follows:

| dtype | group_size | scheme | algorithm |
| --- | --- | --- | --- |
| INT4 | 32 | sym | RTN |
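
To illustrate what this configuration means, here is a minimal pure-Python sketch of symmetric round-to-nearest (RTN) quantization with one scale per group of 32 weights. It is a generic illustration and not Intel® Neural Compressor's implementation; in particular, the scale formula (largest magnitude mapped onto 7) and the function names are assumptions.

```python
import random


def rtn_quantize_sym(weights, bits=4, group_size=32):
    """Quantize a flat list of floats group by group with round-to-nearest.

    Returns (codes, scales): integer codes in [-2**(bits-1), 2**(bits-1) - 1]
    and one floating-point scale per group.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumption: map the largest magnitude onto qmax; an all-zero group gets scale 1.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)                   # round to nearest integer code
            codes.append(max(qmin, min(qmax, q)))  # clamp to the INT4 range
    return codes, scales


def dequantize(codes, scales, group_size=32):
    """Reconstruct approximate weights from codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    codes, scales = rtn_quantize_sym(w)
    w_hat = dequantize(codes, scales)
    max_err = max(abs(a - b) for a, b in zip(w, w_hat))
    print(f"groups: {len(scales)}, max abs error: {max_err:.4f}")
```

Because the scheme is symmetric, no zero point is stored; each group only carries its scale, so a smaller group_size trades extra storage for a tighter fit to the local weight range.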

The key code is shown below. For the complete script, refer to the Whisper example.

import os

from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

# `dataloader` is the calibration dataloader defined in the complete script (not shown here).
model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    # INT4, symmetric, RTN, group size 32, applied to every matching op type.
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4,
                                        "algorithm": ["RTN"],
                                        "scheme": ["sym"],
                                        "group_size": 32}}},)
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-medium-with-past", model), # FP32 model path
        config,
        calib_dataloader=dataloader)
    q_model.save(os.path.join("/path/to/whisper-medium-onnx-int4", model)) # INT4 model path

Evaluation

Operator Statistics

The table below shows the operator statistics of the INT4 ONNX model:

| Model | Op Type | Total | INT4 weight | FP32 weight |
| --- | --- | --- | --- | --- |
| encoder_model | MatMul | 192 | 144 | 48 |
| decoder_model | MatMul | 337 | 241 | 96 |
| decoder_with_past_model | MatMul | 289 | 193 | 96 |
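
The share of MatMul operators that carry INT4 weights follows directly from these counts. The short sketch below copies the numbers from the table, checks that the INT4 and FP32 counts add up to each total, and prints the percentage; the variable and function names are illustrative.

```python
# (total MatMul, INT4-weight MatMul, FP32-weight MatMul), copied from the table above.
MATMUL_STATS = {
    "encoder_model": (192, 144, 48),
    "decoder_model": (337, 241, 96),
    "decoder_with_past_model": (289, 193, 96),
}


def int4_share(total, int4):
    """Fraction of MatMul operators whose weights are stored as INT4."""
    return int4 / total


if __name__ == "__main__":
    for name, (total, int4, fp32) in MATMUL_STATS.items():
        # The two weight categories should account for every MatMul.
        assert int4 + fp32 == total
        print(f"{name}: {100 * int4_share(total, int4):.1f}% of MatMul ops use INT4 weights")
    # encoder_model: 75.0% of MatMul ops use INT4 weights
    # decoder_model: 71.5% of MatMul ops use INT4 weights
    # decoder_with_past_model: 66.8% of MatMul ops use INT4 weights
```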

Evaluation of WER

Evaluate the word error rate (WER) of the model on the librispeech_asr dataset with the code below:

import os

from datasets import load_dataset
from evaluate import load
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig, WhisperProcessor

model_name = 'openai/whisper-medium'
model_path = 'whisper-medium-onnx-int4'

# The processor and model config come from the original FP32 checkpoint.
processor = WhisperProcessor.from_pretrained(model_name)
model_config = PretrainedConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Create ONNX Runtime sessions for the three INT4 models.
sessions = ORTModelForSpeechSeq2Seq.load_model(
            os.path.join(model_path, 'encoder_model.onnx'),
            os.path.join(model_path, 'decoder_model.onnx'),
            os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

predictions = []
references = []
for batch in librispeech_test_clean:
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalize both reference and prediction so that casing and punctuation do not count as errors.
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)
wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
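
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the implementation behind `load("wer")`; pooling errors over all utterances before dividing is an assumption about how the scores are aggregated, and the function names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (row-by-row dynamic programming)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word errors divided by total reference words over all pairs."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        total += len(ref_words)
    return errors / total


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    hyps = ["the cat sit on mat"]
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(f"WER: {100 * word_error_rate(refs, hyps):.2f}%")
```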

Metrics (Model Performance):

| Model | Model Size (GB) | WER (%) |
| --- | --- | --- |
| FP32 | 4.9 | 2.88 |
| INT4 | 1.1 | 2.98 |
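
The trade-off in these metrics can be made explicit with a little arithmetic. The sketch below uses only the four numbers reported above and assumes the WER values are percentages, as printed by the evaluation script; the variable names are illustrative.

```python
# Values copied from the metrics above.
FP32_SIZE_GB, INT4_SIZE_GB = 4.9, 1.1
FP32_WER, INT4_WER = 2.88, 2.98  # assumed to be percentages

compression_ratio = FP32_SIZE_GB / INT4_SIZE_GB
size_reduction = 1 - INT4_SIZE_GB / FP32_SIZE_GB
wer_increase_abs = INT4_WER - FP32_WER
wer_increase_rel = wer_increase_abs / FP32_WER

print(f"compression ratio:     {compression_ratio:.2f}x")            # 4.45x
print(f"size reduction:        {100 * size_reduction:.1f}%")         # 77.6%
print(f"absolute WER increase: {wer_increase_abs:.2f} points")       # 0.10 points
print(f"relative WER increase: {100 * wer_increase_rel:.1f}%")       # 3.5%
```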