---
license: apache-2.0
language:
- en
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: sentence-similarity
library_name: transformers
---

# gte-modernbert-base

We are excited to introduce the `gte-modernbert` series of models, built on the latest ModernBERT pre-trained encoder-only foundation models. The series includes both text embedding models and reranking models.

The `gte-modernbert` models achieve competitive performance on several text embedding and text retrieval benchmarks when compared to similarly sized models from the open-source community, including **MTEB**, **LoCo**, and **CoIR**.

## Model Overview

- Developed by: Tongyi Lab, Alibaba Group
- Model Type: Text Embedding
- Primary Language: English
- Model Size: 149M
- Max Input Length: 8192 tokens
- Output Dimension: 768
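The maximum input length and output dimension above can be checked directly against the published configuration. A minimal sanity-check sketch; the `hidden_size` and `max_position_embeddings` field names are assumed to follow the standard ModernBERT config layout:

```python
# Sanity-check the advertised dimensions from the model config.
# Assumption: the config exposes the usual ModernBERT fields.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Alibaba-NLP/gte-modernbert-base")
print(config.hidden_size)              # expected: 768 (embedding dimension)
print(config.max_position_embeddings)  # expected: 8192 (max input length)
```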
### Model list

| Models | Language | Model Type | Model Size | Max Seq. Length | Dimension | MTEB-en | BEIR | LoCo | CoIR |
|:------:|:--------:|:----------:|:----------:|:---------------:|:---------:|:-------:|:----:|:----:|:----:|
| [`gte-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | English | text embedding | 149M | 8192 | 768 | 64.29 | 55.33 | 87.57 | 77.69 |
| [`gte-reranker-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | English | text reranker | 149M | 8192 | - | - | 56.19 | 90.68 | 79.31 |

## Usage

Use with `Transformers`:

```python
# Requires transformers>=4.48.0

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = 'Alibaba-NLP/gte-modernbert-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)

# Use the [CLS] token embedding as the sentence representation
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
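For encoding larger document collections it is usually worth disabling gradient tracking and batching the inputs. A minimal sketch reusing the `tokenizer`, `model`, and `input_texts` defined above; the helper name and batch size are illustrative:

```python
import torch

def encode(texts, batch_size=32):
    """Illustrative helper: CLS-pooled, L2-normalized embeddings computed in batches."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], max_length=8192,
                          padding=True, truncation=True, return_tensors='pt')
        with torch.no_grad():
            outputs = model(**batch)
        emb = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
        all_embeddings.append(emb)
    return torch.cat(all_embeddings, dim=0)

query_embedding = encode(input_texts[:1])
doc_embeddings = encode(input_texts[1:])
print((query_embedding @ doc_embeddings.T) * 100)
```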
Use with `sentence-transformers`:

```python
# Requires sentence_transformers>=2.7.0

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Alibaba-NLP/gte-modernbert-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```
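For retrieval-style usage with `sentence-transformers`, the library's `semantic_search` utility ranks a corpus against a query. A minimal sketch reusing the model loaded above; the corpus contents are illustrative:

```python
from sentence_transformers.util import semantic_search

corpus = ["Beijing", "sorting algorithms", "quick sort implementation in Python"]
queries = ["what is the capital of China?"]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(queries, convert_to_tensor=True)

# Returns, per query, the top_k corpus entries sorted by cosine similarity
hits = semantic_search(query_embeddings, corpus_embeddings, top_k=2)
print(hits[0])  # e.g. [{'corpus_id': 0, 'score': ...}, ...]
```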
Use with `transformers.js`:

```js
// npm i @xenova/transformers
import { pipeline, dot } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-modernbert-base', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const sentences = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
];
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities);
```
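The model list above also includes `gte-reranker-modernbert-base`. Below is a minimal sketch of scoring query-document pairs with it, assuming the reranker exposes the standard sequence-classification (cross-encoder) interface; see its model card for the exact usage:

```python
# Sketch only: assumes the reranker follows the usual cross-encoder
# (AutoModelForSequenceClassification) interface.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reranker_path = 'Alibaba-NLP/gte-reranker-modernbert-base'
reranker_tokenizer = AutoTokenizer.from_pretrained(reranker_path)
reranker = AutoModelForSequenceClassification.from_pretrained(reranker_path, trust_remote_code=True)

pairs = [
    ["what is the capital of China?", "Beijing"],
    ["what is the capital of China?", "sorting algorithms"],
]
batch = reranker_tokenizer(pairs, padding=True, truncation=True, max_length=8192, return_tensors='pt')
with torch.no_grad():
    scores = reranker(**batch).logits.view(-1).float()
print(scores.tolist())  # higher score = more relevant
```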
## Training Details

The `gte-modernbert` series follows the training scheme of the previous [GTE models](https://huggingface.co/collections/Alibaba-NLP/gte-models-6680f0b13f885cb431e6d469), the only difference being that the pre-trained language-model backbone is switched from [GTE-MLM](https://huggingface.co/Alibaba-NLP/gte-en-mlm-base) to [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base). For more training details, please refer to our paper: [mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval](https://aclanthology.org/2024.emnlp-industry.103/).
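The GTE recipe described in the paper is based on multi-stage contrastive learning. As a rough illustration of the core objective, here is a schematic InfoNCE loss with in-batch negatives; this is not the authors' training code, and the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """Schematic InfoNCE with in-batch negatives (illustrative temperature).

    query_emb, doc_emb: (batch, dim) tensors where doc_emb[i] is the positive
    passage for query_emb[i]; all other rows in the batch act as negatives.
    """
    query_emb = F.normalize(query_emb, p=2, dim=1)
    doc_emb = F.normalize(doc_emb, p=2, dim=1)
    logits = query_emb @ doc_emb.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```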
## Evaluation

### MTEB

The results of other models are retrieved from the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). Since all models in the `gte-modernbert` series have fewer than 1B parameters, we focus exclusively on leaderboard models under 1B parameters.
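Evaluations like those below can be run with the open-source `mteb` package. A minimal sketch; the task selection and output folder are illustrative, and the exact API may differ across `mteb` versions:

```python
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base", trust_remote_code=True)

# Illustrative subset; the reported Average (56) covers the full English MTEB suite
evaluation = MTEB(tasks=["STSBenchmark", "NFCorpus"])
results = evaluation.run(model, output_folder="results/gte-modernbert-base")
```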
| Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
| [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
| [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) | 137 | 768 | 8192 | 64.11 | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 |
| [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
| [gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | 409 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
| [modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) | 149 | 768 | 8192 | 62.62 | 74.31 | 44.98 | 83.96 | 56.42 | 52.89 | 81.78 | 31.39 |
| [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | | 768 | 8192 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.4 |
| [gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) | 305 | 768 | 8192 | 61.4 | 70.89 | 44.31 | 84.24 | 57.47 | 51.08 | 82.11 | 30.58 |
| [jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 572 | 1024 | 8192 | 65.51 | 82.58 | 45.21 | 84.01 | 58.13 | 53.88 | 85.81 | 29.71 |
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 149 | 768 | 8192 | 64.29 | 76.32 | 45.31 | 86.49 | 58.33 | 55.33 | 83.41 | 29.17 |
### LoCo (Long Document Retrieval)

| Model Name | Dimension | Sequence Length | Average (5) | QsmsumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [gte-qwen1.5-7b](https://huggingface.co/Alibaba-NLP/gte-qwen1.5-7b) | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 |
| [gte-large-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-v1.5) | 1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 |
| [gte-base-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-v1.5) | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 |
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 88.88 | 54.45 | 93.00 | 99.82 | 98.03 | 98.70 |
| [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 90.68 | 70.86 | 94.06 | 99.73 | 99.11 | 89.67 |
### CoIR (Code Retrieval Task)

| Model Name | Dimension | Sequence Length | Average (20) | CodeSearchNet-ccr-go | CodeSearchNet-ccr-java | CodeSearchNet-ccr-javascript | CodeSearchNet-ccr-php | CodeSearchNet-ccr-python | CodeSearchNet-ccr-ruby | CodeSearchNet-go | CodeSearchNet-java | CodeSearchNet-javascript | CodeSearchNet-php | CodeSearchNet-python | CodeSearchNet-ruby | apps | codefeedback-mt | codefeedback-st | codetrans-contest | codetrans-dl | cosqa | stackoverflow-qa | synthetic-text2sql |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 77.26 | 95.15 | 94.75 | 96.55 | 91.64 | 95.31 | 90.71 | 86.41 | 79.09 | 97.66 | 80.22 | 42.05 | 55.2 | 84.77 | 52.53 |
| [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 79.31 | 94.15 | 93.57 | 94.27 | 91.51 | 93.93 | 90.63 | 88.32 | 83.27 | 76.05 | 85.12 | 88.16 | 77.59 | 57.54 | 82.34 | 85.95 | 71.89 |
### BEIR

| Model Name | Dimension | Sequence Length | Average (15) | ArguAna | ClimateFEVER | CQADupstackAndroidRetrieval | DBPedia | FEVER | FiQA2018 | HotpotQA | MSMARCO | NFCorpus | NQ | QuoraRetrieval | SCIDOCS | SciFact | Touche2020 | TRECCOVID |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 55.33 | 72.68 | 37.74 | 42.63 | 41.79 | 91.03 | 48.81 | 69.47 | 40.9 | 36.44 | 57.62 | 88.55 | 21.29 | 77.4 | 21.68 | 81.95 |
| [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 56.19 | 69.03 | 37.79 | 44.68 | 47.23 | 94.54 | 49.81 | 78.16 | 45.38 | 30.69 | 64.57 | 87.77 | 20.60 | 73.57 | 27.36 | 79.89 |
## Citation

If you find our paper or models helpful, please consider citing them:

```
@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
```