Guide for creating OpenVINO and qint8 versions of a HF model
Can anyone provide a quick way to create OpenVINO and quantized versions of a model and push them to an existing repo?
- given a model: https://huggingface.co/intfloat/multilingual-e5-small
- convert the model into openvino -> quantize
- push to a different HF repo
I'm running into an issue.
This is how I do it:
- download the model locally via
from sentence_transformers import SentenceTransformer, export_static_quantized_openvino_model
fast_model = SentenceTransformer('intfloat/multilingual-e5-small', backend="openvino")
- it does not find the xml file, so it exports the model to OpenVINO
- then quantizing the model and uploading it to my repo fails:
export_static_quantized_openvino_model(
    fast_model,
    quantization_config=None,
    model_name_or_path="my-repo/multilingual-e5-small-openvino",
    push_to_hub=True,
    create_pr=True,
)
I have an Intel CPU with enough memory.
The error:
[CPU] Add node with name '__module.embeddings/aten::add/Add' Exception from src\plugins\intel_cpu\src\shape_inference\custom\eltwise.cpp:45:
Eltwise shape infer input shapes dim index: 1 mismatch
Hello!
Your workflow looks correct, and I'm able to reproduce the error with my new backend exporter as well: https://huggingface.co/spaces/tomaarsen/backend-export
I suspect that something about the model and/or its configuration is incompatible with OpenVINO and/or optimum-intel. Other models should still work as expected.
- Tom Aarsen
I will try ONNX with int8 quantization to see if it works. If it does, I will stick with ONNX + qint8.
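For the ONNX route, here is a minimal sketch, assuming sentence-transformers is installed with its ONNX extras and that you are logged in to the Hub. `export_dynamic_quantized_onnx_model` is the sentence-transformers export helper; the two function names below, the target repo you pass in, and the default `avx512_vnni` preset are illustrative assumptions, not a tested recipe.

```python
# Sketch only: helper names and the default CPU preset are assumptions.
VALID_PRESETS = ("arm64", "avx2", "avx512", "avx512_vnni")


def quantized_onnx_file_name(preset: str) -> str:
    """Relative path where the quantized file is expected in the exported repo.

    Mirrors the naming seen in this thread (onnx/model_qint8_avx512_vnni.onnx);
    treat it as an assumption for the other presets.
    """
    if preset not in VALID_PRESETS:
        raise ValueError(f"unknown preset {preset!r}, expected one of {VALID_PRESETS}")
    return f"onnx/model_qint8_{preset}.onnx"


def export_qint8_onnx(source_model: str, target_repo: str, preset: str = "avx512_vnni") -> str:
    """Export source_model to ONNX, dynamically quantize it to int8, and push it."""
    # Imported lazily so the naming helper works without the heavy dependencies.
    from sentence_transformers import (
        SentenceTransformer,
        export_dynamic_quantized_onnx_model,
    )

    model = SentenceTransformer(source_model, backend="onnx")
    export_dynamic_quantized_onnx_model(
        model,
        quantization_config=preset,
        model_name_or_path=target_repo,
        push_to_hub=True,
        create_pr=True,
    )
    return quantized_onnx_file_name(preset)
```

Calling `export_qint8_onnx("intfloat/multilingual-e5-small", "<your-repo>")` returns the path to pass as `file_name` in `model_kwargs` when loading. Because `create_pr=True` opens a pull request, the quantized file only appears on the main branch once that PR is merged.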
@tomaarsen
How do I load just the onnx-qint8 model into memory?
Currently it is loading the bin too, which is useless.
This is how I do it:
fast_model = SentenceTransformer('deepfile/multilingual-e5-small-onnx-qint8', backend="onnx", model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"})
"Currently, it is loading the bin too, it is useless."
Do you mean the pytorch_model.bin? That one isn't downloaded for me:
modules.json: 100%|██████████| 349/349 [00:00<?, ?B/s]
config_sentence_transformers.json: 100%|██████████| 199/199 [00:00<?, ?B/s]
README.md: 100%|██████████| 149k/149k [00:00<00:00, 1.78MB/s]
sentence_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<?, ?B/s]
config.json: 100%|██████████| 731/731 [00:00<?, ?B/s]
The ONNX file model_qint8_avx512_vnni.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
model_qint8_avx512_vnni.onnx: 100%|██████████| 118M/118M [00:07<00:00, 15.3MB/s]
2024-11-13 15:36:17.1217059 [E:onnxruntime:Default, provider_bridge_ort.cc:1978 onnxruntime::TryGetProviderInfo_TensorRT] D:\a\_work\1\s\onnxruntime\core\session\provider_bridge_ort.cc:1637 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Users\tom\.conda\envs\sentence-transformers\Lib\site-packages\onnxruntime\capi\onnxruntime_providers_tensorrt.dll"
*************** EP Error ***************
EP Error D:\a\_work\1\s\onnxruntime\python\onnxruntime_pybind_state.cc:490 onnxruntime::python::RegisterTensorRTPluginsAsCustomOps Please install TensorRT libraries as mentioned in the GPU requirements page, make sure they're in the PATH or LD_LIBRARY_PATH, and that your GPU is supported.
when using ['TensorrtExecutionProvider', 'CUDAExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
****************************************
2024-11-13 15:36:17.3815522 [W:onnxruntime:, transformer_memcpy.cc:74 onnxruntime::MemcpyTransformer::ApplyImpl] 288 Memcpy nodes are added to the graph torch_jit for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
tokenizer_config.json: 100%|██████████| 1.17k/1.17k [00:00<?, ?B/s]
sentencepiece.bpe.model: 100%|██████████| 5.07M/5.07M [00:00<00:00, 15.3MB/s]
tokenizer.json: 100%|██████████| 17.1M/17.1M [00:01<00:00, 15.2MB/s]
special_tokens_map.json: 100%|██████████| 965/965 [00:00<?, ?B/s]
1_Pooling/config.json: 100%|██████████| 296/296 [00:00<?, ?B/s]
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: ORTModelForFeatureExtraction
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
The TensorRT error appears because I have a partial TensorRT install; it's unrelated.
- Tom Aarsen
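One way to double-check which weight files a local copy of the model actually contains is to list them with their sizes. This is a rough sketch using only the standard library; `list_weight_files` and the suffix set are hypothetical names chosen for illustration, and the folder you pass in is whatever directory you saved or cached the model to.

```python
from pathlib import Path

# File extensions treated as "weights" here; an assumption, extend as needed.
WEIGHT_SUFFIXES = {".bin", ".safetensors", ".onnx", ".xml"}


def list_weight_files(model_dir):
    """Return (relative_path, size_in_MB) for every weight-like file under model_dir."""
    root = Path(model_dir)
    found = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in WEIGHT_SUFFIXES:
            rel = path.relative_to(root).as_posix()
            found.append((rel, path.stat().st_size / 1e6))
    return found
```

If `pytorch_model.bin` does not show up in the result for the folder you load from, it is not there to be loaded from that folder.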
Oh, you are right, it is not loading the bin file. I debugged it after saving the model into a different folder. It was confusing because I used the same cache folder to quantize and push to HF.
But now, when I monitor the memory consumption, it is around 1GB. Python itself is just 280MB and my onnx-qint8 model is 115MB, so how come the total is above 1GB?
I don't know if it is because of the dynamic quantization I did on the model.
import time
from sentence_transformers import SentenceTransformer

# Load only the quantized ONNX file from the local copy of the repo.
fast_model = SentenceTransformer(
    'C:\\<user-dir>\\deepfile\\multilingual-e5-small-onnx-qint8',
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
)
sentences = ['This framework generates embeddings for each input sentence. Sentences are passed as a list of string. The quick brown fox jumps over the lazy dog.']
sentences5000 = [sentences[0]] * 5000  # the same sentence repeated 5000 times

start = time.time()
print("Encoding sentence...")
embeddings = fast_model.encode(sentences5000)
print("Time sentence: ", time.time() - start)
I believe the extra memory is due to 1) overhead from torch and 2) memory for e.g. the sentences and their embeddings.
I think if you try the original multilingual-e5-small, your memory usage should be a bit higher. It should also be a bit slower.
- Tom Aarsen
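For a sense of scale on the embeddings part, here is some rough arithmetic, assuming `encode` returns float32 values (4 bytes each) and using the 384-dimensional output shown in the model printout above. `embedding_bytes` is a hypothetical helper. It covers only the final embedding array; it does not estimate tokenized batches, intermediate activations, or whatever the runtime allocates internally.

```python
def embedding_bytes(num_sentences, dim, bytes_per_value=4):
    """Size of a dense embedding matrix, assuming a fixed-width numeric dtype."""
    return num_sentences * dim * bytes_per_value


size = embedding_bytes(5000, 384)
print(f"{size / 1e6:.2f} MB")  # prints "7.68 MB"
```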
Okay, that makes sense. Instead of keeping the model in memory, I should release it or yield results to avoid that.
Thank you so much :)
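The "yield" idea could look roughly like the sketch below: a generator that encodes one chunk at a time, so only one chunk of inputs and outputs is produced per step. `encode_in_chunks` and the chunk size are hypothetical; `encode` is any callable that takes a list of sentences, such as `fast_model.encode` from the snippet above.

```python
def encode_in_chunks(encode, sentences, chunk_size=256):
    """Yield (start_index, embeddings) one chunk at a time.

    `encode` is any callable mapping a list of sentences to their embeddings.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(sentences), chunk_size):
        yield start, encode(sentences[start:start + chunk_size])


# Hypothetical usage with the model from this thread:
#   for start, chunk_embeddings in encode_in_chunks(fast_model.encode, sentences5000):
#       ...  # write chunk_embeddings to disk or a database, then let it go
#
# To release the model afterwards:
#   import gc
#   del fast_model
#   gc.collect()
```

Note that `del` plus `gc.collect()` only drops the Python references. Whether the process hands the memory back to the operating system depends on the runtime and allocator, so it is worth measuring rather than assuming.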