No description provided.

This is ready @Shitao and @ldwang

michaelfeil changed pull request title from Upload 2 files to Onnx support

@Shitao and @ldwang friendly reminder

Beijing Academy of Artificial Intelligence org

@michaelfeil, thank you very much for your PR! Before merging, we need to verify the accuracy of the ONNX files, meaning that their outputs match those of the original model. We would greatly appreciate it if you could provide some information about these two files and how to use them for inference.

@Shitao thanks for the response.

Please consider the following testing script that I wrote for this PR. For reproducibility, my advice is to use file_name="onnx/model.onnx". The main benefit of ONNX will be fast execution on CPU with the quantized model.

Requires optimum and onnxruntime: pip install optimum[onnxruntime]

import numpy as np
import torch
from optimum.onnxruntime import ORTModelForFeatureExtraction  # type: ignore
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-en-v1.5', revision="refs/pr/13")
model_ort = ORTModelForFeatureExtraction.from_pretrained('BAAI/bge-large-en-v1.5', revision="refs/pr/13", file_name="onnx/model.onnx")
model.eval()

# Sentences we want sentence embeddings for ("sample data-1", "sample data-2")
sentences = ["样例数据-1", "样例数据-2"]

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval, add an instruction to the queries (not to the passages)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings with the ONNX model
model_output_ort = model_ort(**encoded_input)
# Compute token embeddings with the PyTorch model
with torch.no_grad():
    model_output = model(**encoded_input)

# Testing: both outputs should agree within the given tolerances
np.testing.assert_allclose(
    model_output.last_hidden_state.cpu().numpy(),
    model_output_ort.last_hidden_state.cpu().numpy(),
    rtol=1e-4,
    atol=1e-4,
)
# For bge-large-en-v1.5:
# (model_output.last_hidden_state.cpu().numpy() - model_output_ort.last_hidden_state.cpu().numpy()).max() == 2.6538968e-05

Should I update the docs/readme as well?

Beijing Academy of Artificial Intelligence org

Sorry for the late reply, our holiday has just ended.
Thanks again for your contribution! I ran the test script and model.onnx passes without error, although it emits a warning: "The ONNX file onnx/model.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected."
However, the output of model_quantized.onnx differs significantly from that of the original model. I suggest not adding the quantized version at this time.
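
For anyone who wants to put a number on "differs significantly", the following is a minimal sketch of one way to summarize the gap between two output arrays. It is not the check that was run in this thread: the helper name compare_outputs and the toy arrays are hypothetical, and the arrays only stand in for last_hidden_state outputs of shape (batch, seq, hidden).

```python
import numpy as np


def compare_outputs(reference: np.ndarray, candidate: np.ndarray) -> dict:
    """Summarize how far `candidate` is from `reference` (hypothetical helper)."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")

    # Largest element-wise absolute deviation.
    max_abs_diff = float(np.max(np.abs(reference - candidate)))

    # Cosine similarity along the last (hidden) axis; report the worst case.
    dot = np.sum(reference * candidate, axis=-1)
    norms = np.linalg.norm(reference, axis=-1) * np.linalg.norm(candidate, axis=-1)
    cosine = dot / np.maximum(norms, 1e-12)
    return {"max_abs_diff": max_abs_diff, "min_cosine": float(np.min(cosine))}


# Toy data standing in for real model outputs (assumption: not real embeddings).
rng = np.random.default_rng(0)
reference = rng.normal(size=(2, 4, 8))
close = reference + 1e-5  # small, uniform deviation
far = reference + rng.normal(scale=0.5, size=reference.shape)  # large deviation

report_close = compare_outputs(reference, close)
report_far = compare_outputs(reference, far)
print(report_close)
print(report_far)
```

With real models, reference and candidate would be the .cpu().numpy() arrays of the PyTorch and ONNX last_hidden_state outputs from the script above.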

Beijing Academy of Artificial Intelligence org

Should I update the docs/readme as well?

If you could provide some instructions on how to use ONNX for inference in the README, that would be great.
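
As a starting point for such instructions, here is a rough sketch of the post-processing step that turns a last_hidden_state array into sentence embeddings. It assumes the CLS pooling and L2 normalization described for BGE models in the model card, and that the ONNX model returns last_hidden_state in the same layout as the PyTorch model. The function name cls_pool_and_normalize and the toy input are hypothetical.

```python
import numpy as np


def cls_pool_and_normalize(last_hidden_state: np.ndarray) -> np.ndarray:
    """Take the first ([CLS]) token of each sequence and L2-normalize it."""
    hidden = np.asarray(last_hidden_state, dtype=np.float32)
    cls = hidden[:, 0]  # shape: (batch, hidden)
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / np.maximum(norms, 1e-12)  # guard against division by zero


# Toy array standing in for model output of shape (batch=2, seq=3, hidden=4).
toy_hidden = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4) + 1.0
embeddings = cls_pool_and_normalize(toy_hidden)

# With unit-length embeddings, the dot product is the cosine similarity.
scores = embeddings @ embeddings.T
print(embeddings.shape)

# With the script from this thread, the input would be (assumption):
# cls_pool_and_normalize(model_ort(**encoded_input).last_hidden_state.cpu().numpy())
```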

Can this be done in a separate PR? I would not mind contributing it (though I am a bit short on time), but there is no easy way to push to a PR branch in the huggingface.co UI.

Beijing Academy of Artificial Intelligence org

Can you remove the model_quantized.onnx?

Beijing Academy of Artificial Intelligence org

Done so!

Thanks

Shitao changed pull request status to merged
