michaeldinzinger committed · Commit f23f5c5 · 1 parent: a7fc931
Update README.md with text
Files changed: README.md (+149 -0), save_safetensors.py (+2 -4)
README.md
CHANGED
@@ -10786,3 +10786,152 @@ tags:
- mteb
license: mit
---
<h1 align="center">Combination of Embedding Models: [Arctic M (v1.5)](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5) & [BGE Small (en; v1.5)](https://huggingface.co/BAAI/bge-small-en-v1.5)</h1>
<h4 align="center">
    <p>
        <a href="#acknowledgement">Acknowledgement</a> |
        <a href="#combination-of-embedding-models">This Model</a> |
        <a href="#usage">Usage</a>
    </p>
</h4>

## Acknowledgement

First of all, we want to acknowledge the original creators of the [Snowflake/snowflake-arctic-embed-m-v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5) and [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) models, which are used to create this model. Our model is simply a combination of these two models; we have not made any changes to the originals.

Furthermore, we want to acknowledge the Marqo team, who worked on the idea of combining two models through concatenation in parallel to us. Their initial effort allowed us to re-use existing code, in particular the [modeling script](https://huggingface.co/PaDaS-Lab/arctic-m-bge-small/blob/main/modeling_arctic_m_bge_small.py) used to bring the combined model to Hugging Face.

## Combination of Embedding Models

- Embedding models are becoming more powerful and applicable to many use cases, but the next big challenge is to make them more efficient in terms of resource consumption.
- Our ambition is to experiment with combining two models to see whether we can achieve better performance with fewer resources. Early results have shown that models that differ from each other can complement one another and lead to better results. For a good combination, the selection of models is crucial, and diversity (in terms of MTEB performance, architecture, training data, etc.) is an important part of it.
- What kind of combination do we use? We combine the embeddings of the two models by concatenating them, the most straightforward combination technique. Before concatenation, it is important to normalize the embeddings so that they are on the same scale (see the sketch after this list).

- We have combined the [Snowflake/snowflake-arctic-embed-m-v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5) and [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) models to create this model. The combined model produces an embedding with 1152 dimensions (768+384) and has a total of 142M parameters (109M+33M).
- This model combination performs well on the MTEB Leaderboard and is a good starting point for further experiments. However, we are aware that combining models is a complex topic, and chasing leaderboard positions alone is not the goal. Still, it is remarkable that the mere concatenation of two models raises the average nDCG@10 on the MTEB English Retrieval benchmark from 55.14 to 56.5, a climb of a few spots on the leaderboard that is otherwise achieved only with extensive engineering effort. It is also interesting that the combination presented by the [Chimera model](https://huggingface.co/Marqo/marqo-chimera-arctic-bge-m) performs significantly worse on the leaderboard, even though its constituent models are, on their own, more potent than the pair combined in this repository. The reasons might be manifold: the different number of model parameters, differences in the training process, or simply how well the two models complement each other on the specific tasks of the respective benchmark. In any case, we are looking forward to further experiments and discussions on this topic.

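To make the combination step concrete, here is a minimal sketch of the normalize-then-concatenate idea using the two component models directly. It is illustrative only and is not the repository's modeling code (the combined model ships its own modeling script, see Usage below):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative sketch: embed with both component models, L2-normalize each
# embedding, then concatenate (768 + 384 = 1152 dimensions).
arctic_name = 'Snowflake/snowflake-arctic-embed-m-v1.5'
bge_name = 'BAAI/bge-small-en-v1.5'
arctic, bge = AutoModel.from_pretrained(arctic_name), AutoModel.from_pretrained(bge_name)
tok_arctic, tok_bge = AutoTokenizer.from_pretrained(arctic_name), AutoTokenizer.from_pretrained(bge_name)

def embed(model, tokenizer, texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    cls = hidden[:, 0]                     # CLS pooling, as used by both models
    return F.normalize(cls, p=2, dim=-1)   # normalize so both parts share the same scale

texts = ['Paris is the capital of France.']
combined = torch.cat([embed(arctic, tok_arctic, texts), embed(bge, tok_bge, texts)], dim=-1)
print(combined.shape)  # torch.Size([1, 1152])
```
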
## Usage

```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from transformers import AutoModel, AutoTokenizer, PreTrainedTokenizerFast, BatchEncoding, DataCollatorWithPadding
from functools import partial
from datasets import Dataset
from tqdm import tqdm
from typing import Dict, List, Mapping


NUM_WORKERS = 4
BATCH_SIZE = 32


def transform_func(tokenizer: PreTrainedTokenizerFast,
                   max_length: int,
                   examples: Dict[str, List]) -> BatchEncoding:
    # Tokenize the raw texts on the fly when the dataset is accessed
    return tokenizer(examples['contents'],
                     max_length=max_length,
                     padding=True,
                     return_token_type_ids=False,
                     truncation=True)


def move_to_cuda(sample):
    if len(sample) == 0:
        return {}

    def _move_to_cuda(maybe_tensor):
        if torch.is_tensor(maybe_tensor):
            return maybe_tensor.cuda(non_blocking=True)
        elif isinstance(maybe_tensor, dict):
            return {key: _move_to_cuda(value) for key, value in maybe_tensor.items()}
        elif isinstance(maybe_tensor, list):
            return [_move_to_cuda(x) for x in maybe_tensor]
        elif isinstance(maybe_tensor, tuple):
            return tuple([_move_to_cuda(x) for x in maybe_tensor])
        elif isinstance(maybe_tensor, Mapping):
            return type(maybe_tensor)({k: _move_to_cuda(v) for k, v in maybe_tensor.items()})
        else:
            return maybe_tensor

    return _move_to_cuda(sample)


class RetrievalModel():
    def __init__(self, pretrained_model_name: str, **kwargs):
        self.pretrained_model_name = pretrained_model_name
        self.encoder = AutoModel.from_pretrained(pretrained_model_name, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, trust_remote_code=True)
        self.gpu_count = torch.cuda.device_count()
        self.batch_size = BATCH_SIZE

        self.query_instruction = 'Represent this sentence for searching relevant passages: {}'
        self.document_instruction = '{}'
        self.pool_type = 'cls'
        self.max_length = 512

        self.encoder.cuda()
        self.encoder.eval()

    def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
        input_texts = [self.query_instruction.format(q) for q in queries]
        return self._do_encode(input_texts)

    def encode_corpus(self, corpus: List[Dict[str, str]], **kwargs) -> np.ndarray:
        input_texts = [self.document_instruction.format('{} {}'.format(d.get('title', ''), d['text']).strip()) for d in corpus]
        return self._do_encode(input_texts)

    @torch.no_grad()
    def _do_encode(self, input_texts: List[str]) -> np.ndarray:
        dataset: Dataset = Dataset.from_dict({'contents': input_texts})
        dataset.set_transform(partial(transform_func, self.tokenizer, self.max_length))

        data_collator = DataCollatorWithPadding(self.tokenizer, pad_to_multiple_of=8)
        data_loader = DataLoader(
            dataset,
            batch_size=self.batch_size * self.gpu_count,
            shuffle=False,
            drop_last=False,
            num_workers=NUM_WORKERS,
            collate_fn=data_collator,
            pin_memory=True)

        encoded_embeds = []
        for batch_dict in tqdm(data_loader, desc='encoding', mininterval=10):
            batch_dict = move_to_cuda(batch_dict)

            with torch.amp.autocast('cuda'):
                # The ConcatModel's forward pass returns the concatenated embeddings
                outputs = self.encoder(**batch_dict)
                encoded_embeds.append(outputs.cpu().numpy())

        return np.concatenate(encoded_embeds, axis=0)


model = RetrievalModel('PaDaS-Lab/arctic-m-bge-small')
embeds_q = model.encode_queries(['What is the capital of France?'])
# [[-0.01099197 -0.08366653 0.0060241 ... 0.03182805 -0.00674182 0.058571 ]]
embeds_d = model.encode_corpus([{'title': 'Paris', 'text': 'Paris is the capital of France.'}])
# [[ 0.0391828 -0.02951912 0.10862264 ... -0.05373885 -0.00368348 0.02323797]]
```

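The `RetrievalModel` wrapper above exposes the `encode_queries`/`encode_corpus` interface expected by MTEB's retrieval tasks, so it can be plugged into an MTEB evaluation run. The snippet below is an illustrative sketch based on the pinned `mteb==1.12.94` listed under Libraries (the API may differ in other versions); the task name `NFCorpus` and the output folder are only examples:

```python
import mteb

# Hypothetical evaluation sketch: score the wrapper above on one English
# retrieval task and write the results to disk.
evaluation = mteb.MTEB(tasks=['NFCorpus'])
results = evaluation.run(model, output_folder='results/arctic-m-bge-small')
```
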
### Libraries

```
torch==2.5.0
transformers==4.42.3
mteb==1.12.94
```

## Citation

```bibtex
@misc{https://doi.org/10.48550/arxiv.2407.08275,
  doi = {10.48550/ARXIV.2407.08275},
  url = {https://arxiv.org/abs/2407.08275},
  author = {Caspari, Laura and Dastidar, Kanishka Ghosh and Zerhoudi, Saber and Mitrovic, Jelena and Granitzer, Michael},
  title = {Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems},
  year = {2024},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
save_safetensors.py
CHANGED
@@ -1,4 +1,3 @@
-from safetensors.torch import save_file
 from modeling_arctic_m_bge_small import ConcatModel, ConcatModelConfig
 
 config = ConcatModelConfig()
@@ -8,8 +7,7 @@ model.load_weights_from_automodels(
     has_pooling_layer=[True, True]
 )
 
-
-
-save_file(state_dict, output_path)
+output_path = 'model'
+model.save_pretrained(output_path)
 
 print(f'Model saved as {output_path}')
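For completeness, a directory written by `save_pretrained` can typically be reloaded through `AutoModel`; this is a hypothetical sketch assuming the custom `ConcatModel` code is resolvable via `trust_remote_code`:

```python
from transformers import AutoModel

# Reload the locally saved combined model (the path mirrors output_path above)
model = AutoModel.from_pretrained('model', trust_remote_code=True)
```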