---
license: mit
tags:
- sentence-embeddings
- endpoints-template
- optimum
library_name: generic
---
# Optimized and Quantized [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a custom pipeline.py
This repository implements a `custom` task for `sentence-embeddings` for 🤗 Inference Endpoints to accelerate inference with [🤗 Optimum](https://huggingface.co/docs/optimum/index). The code for the customized pipeline is in [pipeline.py](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/pipeline.py).
The section [how to create your own optimized and quantized model](#how-to-create-your-own-optimized-and-quantized-model) explains how the model was converted and optimized; it is based on the [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers) blog post. It also covers how to create and test the custom pipeline. A [notebook](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/convert.ipynb) with all steps is included as well.
To deploy this model as an Inference Endpoint, you have to select `Custom` as the task so that the `pipeline.py` file is used. -> _double check that it is selected_
### Expected Request payload
```json
{
  "inputs": "The sky is a blue today and not gray"
}
```
Below is an example of how to run a request using Python and `requests`.
## Run Request
```python
import requests as r

ENDPOINT_URL = ""  # url of your Inference Endpoint
HF_TOKEN = ""      # your Hugging Face token


def predict(document_string: str = None):
    payload = {"inputs": document_string}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    document_string="The sky is a blue today and not gray"
)
```
expected output
```python
{'embeddings': [[-0.021580450236797333,
   0.021715054288506508,
   0.00979710929095745,
   -0.0005379787762649357,
   0.04682469740509987,
   -0.013600599952042103,
   ...
]]}
```
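Since the custom handler tokenizes with `padding=True` (see the `pipeline.py` below), sending a list of sentences as `inputs` should work as well. The sketch below reuses the endpoint URL and token from above; the batched payload is an assumption, not part of the documented request format.
```python
import requests as r

ENDPOINT_URL = ""  # your Inference Endpoint URL
HF_TOKEN = ""      # a token with access to the endpoint

# hypothetical batched payload; the handler's tokenizer pads the batch internally
payload = {"inputs": ["The sky is blue today", "The sky is gray today"]}

response = r.post(
    ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
)
print(response.json())  # {"embeddings": [[...], [...]]}
```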
## How to create your own optimized and quantized model
Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
Helpful links:
* [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers)
* [Create Custom Handler Endpoints](https://link-to-docs)
## Setup & Installation
```python
%%writefile requirements.txt
optimum[onnxruntime]==1.3.0
mkl-include
mkl
```
install requirements
```python
!pip install -r requirements.txt
```
## 1. Convert model to ONNX
```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
from pathlib import Path
model_id="sentence-transformers/all-MiniLM-L6-v2"
onnx_path = Path(".")
# load vanilla transformers and convert to onnx
model = ORTModelForFeatureExtraction.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
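Before optimizing, it can be useful to sanity-check the exported ONNX model. `ORTModelForFeatureExtraction` is designed to plug into the regular `transformers` pipeline API, so a quick check could look like the sketch below (not part of the original notebook).
```python
from transformers import pipeline

# run the exported ONNX model through a vanilla feature-extraction pipeline
onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

features = onnx_extractor("The sky is blue today")
print(len(features[0]))     # number of tokens
print(len(features[0][0]))  # hidden size, 384 for all-MiniLM-L6-v2
```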
## 2. Optimize & quantize model with Optimum
```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig
# create ORTOptimizer and define optimization configuration
optimizer = ORTOptimizer.from_pretrained(model_id, feature=model.pipeline_task)
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations
# apply the optimization configuration to the model
optimizer.export(
onnx_model_path=onnx_path / "model.onnx",
onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
optimization_config=optimization_config,
)
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.export(
onnx_model_path=onnx_path / "model-optimized.onnx",
onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
quantization_config=dqconfig,
)
```
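To get a feel for what optimization and quantization buy on disk, you can compare the file sizes of the three ONNX graphs created above (a small helper sketch, not from the original notebook).
```python
import os

# compare the on-disk size of the exported, optimized, and quantized models
for file_name in ["model.onnx", "model-optimized.onnx", "model-quantized.onnx"]:
    size_mb = os.path.getsize(onnx_path / file_name) / (1024 * 1024)
    print(f"{file_name}: {size_mb:.2f} MB")
```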
## 3. Create Custom Handler for Inference Endpoints
```python
%%writefile pipeline.py
from typing import Dict, List, Any

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
import torch.nn.functional as F


# copied from the model card of sentence-transformers/all-MiniLM-L6-v2
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


class PreTrainedPipeline():
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForFeatureExtraction.from_pretrained(path, file_name="model-quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)

    def __call__(self, data: Any) -> Dict[str, List[List[float]]]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data and the parameters for the inference.
        Return:
            A :obj:`dict` with the key `embeddings` containing the embeddings of the inference inputs.
        """
        inputs = data.get("inputs", data)
        # tokenize the input
        encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
        # run the model
        outputs = self.model(**encoded_inputs)
        # perform mean pooling
        sentence_embeddings = mean_pooling(outputs, encoded_inputs['attention_mask'])
        # normalize embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        # postprocess the prediction
        return {"embeddings": sentence_embeddings.tolist()}
```
test custom pipeline
```python
from pipeline import PreTrainedPipeline
# init handler
my_handler = PreTrainedPipeline(path=".")
# prepare sample payload
request = {"inputs": "I am quite excited how this will turn out"}
# test the handler
%timeit my_handler(request)
```
results
```
1.55 ms ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
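Latency is only half the story: it is also worth checking that the quantized model still produces embeddings close to the original `sentence-transformers/all-MiniLM-L6-v2`. Below is a small sketch using cosine similarity; it assumes `sentence-transformers` is installed, which is not part of the `requirements.txt` above.
```python
import numpy as np
from sentence_transformers import SentenceTransformer

sentence = "I am quite excited how this will turn out"

# embedding from the quantized custom handler (already L2-normalized in pipeline.py)
quantized_emb = np.array(my_handler({"inputs": sentence})["embeddings"][0])

# embedding from the original model, normalized so the dot product is the cosine similarity
original_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
original_emb = original_model.encode(sentence, normalize_embeddings=True)

cosine_sim = float(np.dot(quantized_emb, original_emb))
print(f"cosine similarity to the original model: {cosine_sim:.4f}")
```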