|
--- |
|
license: mit |
|
tags: |
|
- sentence-embeddings |
|
- endpoints-template |
|
- optimum |
|
library_name: generic |
|
--- |
|
|
|
# Optimized and Quantized [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a custom pipeline.py |
|
|
|
|
|
This repository implements a `custom` task for `sentence-embeddings` for 🤗 Inference Endpoints for accelerated inference using [🤗 Optimum](https://huggingface.co/docs/optimum/index). The code for the customized pipeline is in the [pipeline.py](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/pipeline.py). |
|
|
|
The section [how to create your own optimized and quantized model](#how-to-create-your-own-optimized-and-quantized-model) explains how the model was converted and optimized; it is based on the [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers) blog post. It also covers how to create the custom pipeline and test it. A [notebook](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/convert.ipynb) is included as well.
|
|
|
To deploy this model as an Inference Endpoint, you have to select `Custom` as the task so that the `pipeline.py` file is used. -> _double check that it is selected_
|
|
|
### Expected request payload
|
|
|
```json |
|
{ |
|
"inputs": "The sky is a blue today and not gray", |
|
} |
|
``` |
|
|
|
Below is an example of how to run a request using Python and `requests`.
|
|
|
## Run Request |
|
|
|
```python |
|
import requests as r

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(document_string: str = None):
    payload = {"inputs": document_string}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    document_string="The sky is a blue today and not gray"
)
|
``` |
|
|
|
Expected output:
|
|
|
```python |
|
{'embeddings': [[-0.021580450236797333,
   0.021715054288506508,
   0.00979710929095745,
   -0.0005379787762649357,
   0.04682469740509987,
   -0.013600599952042103,
   ...
]]}
|
``` |
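
Since the handler forwards `inputs` directly to the tokenizer with `padding=True`, a list of sentences can also be sent in a single request. The sketch below is illustrative only: it reuses `ENDPOINT_URL` and `HF_TOKEN` from the example above, assumes `numpy` is installed, embeds two sentences in one call and computes their cosine similarity. Because the pipeline L2-normalizes the embeddings, the dot product already is the cosine similarity.

```python
import numpy as np
import requests as r

# sketch: batch request with two sentences in one call
payload = {"inputs": ["The sky is blue today", "The sky is gray today"]}
response = r.post(
    ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
)
emb_a, emb_b = (np.array(e) for e in response.json()["embeddings"])

# embeddings are L2-normalized by the pipeline, so the dot product equals the cosine similarity
print(float(np.dot(emb_a, emb_b)))
```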
|
|
|
|
|
|
|
## How to create your own optimized and quantized model |
|
|
|
Steps:

1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
|
|
|
Helpful links: |
|
* [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers) |
|
* [Create Custom Handler Endpoints](https://link-to-docs) |
|
|
|
## Setup & Installation |
|
|
|
```python |
|
%%writefile requirements.txt |
|
optimum[onnxruntime]==1.3.0 |
|
mkl-include |
|
mkl |
|
``` |
|
|
|
Install the requirements:
|
|
|
```python |
|
!pip install -r requirements.txt |
|
``` |
|
|
|
## 1. Convert model to ONNX |
|
|
|
|
|
```python |
|
from optimum.onnxruntime import ORTModelForFeatureExtraction |
|
from transformers import AutoTokenizer |
|
from pathlib import Path |
|
|
|
|
|
model_id="sentence-transformers/all-MiniLM-L6-v2" |
|
onnx_path = Path(".") |
|
|
|
# load vanilla transformers and convert to onnx |
|
model = ORTModelForFeatureExtraction.from_pretrained(model_id, from_transformers=True) |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
|
# save onnx checkpoint and tokenizer |
|
model.save_pretrained(onnx_path) |
|
tokenizer.save_pretrained(onnx_path) |
|
``` |
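
As an optional sanity check (a sketch, not part of the original conversion notebook), you can compare the ONNX export against the vanilla PyTorch model on a sample sentence; the outputs should match up to small numerical differences.

```python
import torch
from transformers import AutoModel

# load the original PyTorch model for comparison
vanilla_model = AutoModel.from_pretrained(model_id)

encoded = tokenizer("The sky is a blue today and not gray", return_tensors="pt")
with torch.no_grad():
    vanilla_hidden = vanilla_model(**encoded).last_hidden_state
onnx_hidden = model(**encoded).last_hidden_state

# maximum absolute difference between the two forward passes
print(torch.max(torch.abs(vanilla_hidden - onnx_hidden)))
```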
|
|
|
|
|
## 2. Optimize & quantize model with Optimum |
|
|
|
|
|
```python |
|
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer |
|
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig |
|
|
|
# create ORTOptimizer and define optimization configuration |
|
optimizer = ORTOptimizer.from_pretrained(model_id, feature=model.pipeline_task) |
|
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations |
|
|
|
# apply the optimization configuration to the model |
|
optimizer.export( |
|
onnx_model_path=onnx_path / "model.onnx", |
|
onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx", |
|
optimization_config=optimization_config, |
|
) |
|
|
|
|
|
# create ORTQuantizer and define quantization configuration |
|
dynamic_quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task) |
|
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False) |
|
|
|
# apply the quantization configuration to the model |
|
model_quantized_path = dynamic_quantizer.export( |
|
onnx_model_path=onnx_path / "model-optimized.onnx", |
|
onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx", |
|
quantization_config=dqconfig, |
|
) |
|
|
|
|
|
``` |
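
To see what optimization and quantization buy in terms of model size, a small illustrative snippet (not from the original notebook) can compare the exported checkpoints on disk:

```python
import os

# compare the on-disk size of the exported ONNX checkpoints
for file_name in ["model.onnx", "model-optimized.onnx", "model-quantized.onnx"]:
    size_mb = os.path.getsize(onnx_path / file_name) / (1024 ** 2)
    print(f"{file_name}: {size_mb:.2f} MB")
```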
|
|
|
## 3. Create Custom Handler for Inference Endpoints |
|
|
|
|
|
```python |
|
%%writefile pipeline.py |
|
from typing import Dict, List, Any |
|
from optimum.onnxruntime import ORTModelForFeatureExtraction |
|
from transformers import AutoTokenizer |
|
import torch.nn.functional as F |
|
import torch |
|
|
|
# copied from the model card |
|
def mean_pooling(model_output, attention_mask): |
|
token_embeddings = model_output[0] #First element of model_output contains all token embeddings |
|
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float() |
|
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9) |
|
|
|
|
|
class PreTrainedPipeline(): |
|
def __init__(self, path=""): |
|
# load the optimized model |
|
self.model = ORTModelForFeatureExtraction.from_pretrained(path, file_name="model-quantized.onnx") |
|
self.tokenizer = AutoTokenizer.from_pretrained(path) |
|
|
|
def __call__(self, data: Any) -> List[List[Dict[str, float]]]: |
|
""" |
|
Args: |
|
data (:obj:): |
|
includes the input data and the parameters for the inference. |
|
Return: |
|
A :obj:`list`:. The list contains the embeddings of the inference inputs |
|
""" |
|
inputs = data.get("inputs", data) |
|
|
|
# tokenize the input |
|
encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt') |
|
# run the model |
|
outputs = self.model(**encoded_inputs) |
|
# Perform pooling |
|
sentence_embeddings = mean_pooling(outputs, encoded_inputs['attention_mask']) |
|
# Normalize embeddings |
|
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1) |
|
# postprocess the prediction |
|
return {"embeddings": sentence_embeddings.tolist()} |
|
``` |
|
|
|
Test the custom pipeline:
|
|
|
|
|
```python |
|
from pipeline import PreTrainedPipeline |
|
|
|
# init handler |
|
my_handler = PreTrainedPipeline(path=".") |
|
|
|
# prepare sample payload |
|
request = {"inputs": "I am quite excited how this will turn out"} |
|
|
|
# test the handler |
|
%timeit my_handler(request) |
|
|
|
``` |
|
|
|
Results:
|
|
|
``` |
|
1.55 ms ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) |
|
``` |
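
For a rough latency baseline (a sketch; assumes the `sentence-transformers` package is installed, which is not part of the `requirements.txt` above), you can time the vanilla model on the same sentence and compare it against the quantized handler:

```python
from sentence_transformers import SentenceTransformer

# vanilla PyTorch baseline for comparison with the quantized ONNX handler
vanilla = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

%timeit vanilla.encode("I am quite excited how this will turn out")
```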
|
|
|
|