---
license: mit
tags:
- sentence-embeddings
- endpoints-template
- optimum
library_name: generic
---
# Optimized and Quantized [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a custom pipeline.py
This repository implements a `custom` task for `sentence-embeddings` for 🤗 Inference Endpoints to accelerate inference with [🤗 Optimum](https://huggingface.co/docs/optimum/index). The code for the customized pipeline is in [pipeline.py](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/pipeline.py).
The section [how to create your own optimized and quantized model](#how-to-create-your-own-optimized-and-quantized-model) explains how the model was converted and optimized; it is based on the [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers) blog post. It also covers how to create and test the custom pipeline. A [notebook](https://huggingface.co/philschmid/all-MiniLM-L6-v2-optimum-embeddings/blob/main/convert.ipynb) with all steps is included as well.
To deploy this model as an Inference Endpoint, you have to select `Custom` as the task so that the `pipeline.py` file is used. -> _double check that it is selected_
### Expected Request payload
```json
{
  "inputs": "The sky is a blue today and not gray"
}
```
Below is an example of how to run a request using Python and `requests`.
## Run Request
```python
import requests as r

ENDPOINT_URL = ""  # url of your Inference Endpoint
HF_TOKEN = ""      # your Hugging Face token


def predict(document_string: str = None):
    payload = {"inputs": document_string}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    document_string="The sky is a blue today and not gray"
)
```
expected output
```python
{'embeddings': [[-0.021580450236797333,
   0.021715054288506508,
   0.00979710929095745,
   -0.0005379787762649357,
   0.04682469740509987,
   -0.013600599952042103,
   ...
]]}
```
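Since the custom handler tokenizes with `padding=True` (see the `pipeline.py` below), sending a list of sentences as `inputs` should work as well. The sketch below reuses the endpoint URL and token from above; the batched payload is an assumption, not part of the documented request format.
```python
import requests as r

ENDPOINT_URL = ""  # your Inference Endpoint URL
HF_TOKEN = ""      # a token with access to the endpoint

# hypothetical batched payload; the handler's tokenizer pads the batch internally
payload = {"inputs": ["The sky is blue today", "The sky is gray today"]}

response = r.post(
    ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
)
print(response.json())  # {"embeddings": [[...], [...]]}
```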
## How to create your own optimized and quantized model
Steps:
1. [Convert model to ONNX](#1-convert-model-to-onnx)
2. [Optimize & quantize model with Optimum](#2-optimize--quantize-model-with-optimum)
3. [Create Custom Handler for Inference Endpoints](#3-create-custom-handler-for-inference-endpoints)
Helpful links:
* [Accelerate Sentence Transformers with Hugging Face Optimum](https://www.philschmid.de/optimize-sentence-transformers)
* [Create Custom Handler Endpoints](https://link-to-docs)
## Setup & Installation
```python
%%writefile requirements.txt
optimum[onnxruntime]==1.3.0
mkl-include
mkl
```
install requirements
```python
!pip install -r requirements.txt
```
## 1. Convert model to ONNX
```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
from pathlib import Path
model_id="sentence-transformers/all-MiniLM-L6-v2"
onnx_path = Path(".")
# load vanilla transformers and convert to onnx
model = ORTModelForFeatureExtraction.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
```
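Before optimizing, it can be useful to sanity-check the exported ONNX model. `ORTModelForFeatureExtraction` is designed to plug into the regular `transformers` pipeline API, so a quick check could look like the sketch below (not part of the original notebook).
```python
from transformers import pipeline

# run the exported ONNX model through a vanilla feature-extraction pipeline
onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

features = onnx_extractor("The sky is blue today")
print(len(features[0]))     # number of tokens
print(len(features[0][0]))  # hidden size, 384 for all-MiniLM-L6-v2
```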
## 2. Optimize & quantize model with Optimum
```python
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig
# create ORTOptimizer and define optimization configuration
optimizer = ORTOptimizer.from_pretrained(model_id, feature=model.pipeline_task)
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations
# apply the optimization configuration to the model
optimizer.export(
onnx_model_path=onnx_path / "model.onnx",
onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
optimization_config=optimization_config,
)
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.export(
onnx_model_path=onnx_path / "model-optimized.onnx",
onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
quantization_config=dqconfig,
)
```
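To get a feel for what optimization and quantization buy on disk, you can compare the file sizes of the three ONNX graphs created above (a small helper sketch, not from the original notebook).
```python
import os

# compare the on-disk size of the exported, optimized, and quantized models
for file_name in ["model.onnx", "model-optimized.onnx", "model-quantized.onnx"]:
    size_mb = os.path.getsize(onnx_path / file_name) / (1024 * 1024)
    print(f"{file_name}: {size_mb:.2f} MB")
```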
## 3. Create Custom Handler for Inference Endpoints
```python
%%writefile pipeline.py
from typing import Dict, List, Any

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
import torch.nn.functional as F


# copied from the model card of sentence-transformers/all-MiniLM-L6-v2
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


class PreTrainedPipeline():
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForFeatureExtraction.from_pretrained(path, file_name="model-quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)

    def __call__(self, data: Any) -> Dict[str, List[List[float]]]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data and the parameters for the inference.
        Return:
            A :obj:`dict` with the key `embeddings` containing the embeddings of the inference inputs.
        """
        inputs = data.get("inputs", data)
        # tokenize the input
        encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
        # run the model
        outputs = self.model(**encoded_inputs)
        # perform mean pooling
        sentence_embeddings = mean_pooling(outputs, encoded_inputs['attention_mask'])
        # normalize embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        # postprocess the prediction
        return {"embeddings": sentence_embeddings.tolist()}
```
test custom pipeline
```python
from pipeline import PreTrainedPipeline
# init handler
my_handler = PreTrainedPipeline(path=".")
# prepare sample payload
request = {"inputs": "I am quite excited how this will turn out"}
# test the handler
%timeit my_handler(request)
```
results
```
1.55 ms ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
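Latency is only half the story: it is also worth checking that the quantized model still produces embeddings close to the original `sentence-transformers/all-MiniLM-L6-v2`. Below is a small sketch using cosine similarity; it assumes `sentence-transformers` is installed, which is not part of the `requirements.txt` above.
```python
import numpy as np
from sentence_transformers import SentenceTransformer

sentence = "I am quite excited how this will turn out"

# embedding from the quantized custom handler (already L2-normalized in pipeline.py)
quantized_emb = np.array(my_handler({"inputs": sentence})["embeddings"][0])

# embedding from the original model, normalized so the dot product is the cosine similarity
original_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
original_emb = original_model.encode(sentence, normalize_embeddings=True)

cosine_sim = float(np.dot(quantized_emb, original_emb))
print(f"cosine similarity to the original model: {cosine_sim:.4f}")
```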