---
license: apache-2.0
language:
- en
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: sentence-similarity
library_name: transformers
---

# gte-modernbert-base

We are excited to introduce the `gte-modernbert` series of models, built on the latest ModernBERT pre-trained encoder-only foundation model. The series includes both text embedding models and text reranking models.

The `gte-modernbert` models demonstrate competitive performance on several text embedding and text retrieval benchmarks, including **MTEB**, **LoCo**, and **CoIR**, when compared to similar-scale models from the open-source community.

## Model Overview

- Developed by: Tongyi Lab, Alibaba Group
- Model Type: Text Embedding
- Primary Language: English
- Model Size: 149M
- Max Input Length: 8192 tokens
- Output Dimension: 768

### Model list

| Models | Language | Model Type | Model Size | Max Seq. Length | Dimension | MTEB-en | BEIR | LoCo | CoIR |
|:------:|:--------:|:----------:|:----------:|:---------------:|:---------:|:-------:|:----:|:----:|:----:|
| [`gte-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | English | text embedding | 149M | 8192 | 768 | 64.29 | 55.33 | 87.57 | 77.69 |
| [`gte-reranker-modernbert-base`](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | English | text reranker | 149M | 8192 | - | - | 56.19 | 90.68 | 79.31 |

## Usage

Use with `Transformers`:

```python
# Requires transformers>=4.48.0

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = 'Alibaba-NLP/gte-modernbert-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
# Use the [CLS] token representation as the sentence embedding
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```

Use with `sentence-transformers`:

```python
# Requires sentence_transformers>=2.7.0

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Alibaba-NLP/gte-modernbert-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```

Use with `transformers.js`:

```js
// npm i @xenova/transformers
import { pipeline, dot } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-modernbert-base', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const sentences = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
];
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities);
```
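
The companion `gte-reranker-modernbert-base` listed above is a cross-encoder rather than an embedding model. The snippet below is only a hedged sketch of the usual sequence-classification reranker pattern in `transformers`; please check the reranker's own model card for its exact interface.

```python
# Sketch only: assumes gte-reranker-modernbert-base exposes the standard
# sequence-classification (cross-encoder) head; see its model card for details.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = 'Alibaba-NLP/gte-reranker-modernbert-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

# Each (query, document) pair is scored jointly by the cross-encoder
pairs = [
    ["what is the capital of China?", "Beijing"],
    ["what is the capital of China?", "sorting algorithms"],
]
with torch.no_grad():
    batch = tokenizer(pairs, padding=True, truncation=True, max_length=8192, return_tensors='pt')
    scores = model(**batch).logits.view(-1).float()
print(scores.tolist())  # higher score = more relevant pair
```

In a retrieval pipeline, the embedding model is typically used to fetch candidates cheaply, and the reranker re-scores only the top-k candidates.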

## Training Details

The `gte-modernbert` series follows the training recipe of the previous [GTE models](https://huggingface.co/collections/Alibaba-NLP/gte-models-6680f0b13f885cb431e6d469), with the only difference being that the pre-trained language model backbone is switched from [GTE-MLM](https://huggingface.co/Alibaba-NLP/gte-en-mlm-base) to [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base). For more training details, please refer to our paper: [mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval](https://aclanthology.org/2024.emnlp-industry.103/).

## Evaluation

### MTEB

The results of the other models are retrieved from the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). Since all models in the `gte-modernbert` series have fewer than 1B parameters, we compare only against models under 1B parameters from the leaderboard.
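
To re-run individual numbers rather than read them off the leaderboard, the open-source `mteb` package can evaluate the model directly. A minimal sketch (the task name is only an illustration, and the exact API may differ slightly between `mteb` versions):

```python
# Illustrative sketch with the open-source `mteb` package (API may vary by version).
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

# "STS12" is only an example; substitute any English MTEB task name.
tasks = mteb.get_tasks(tasks=["STS12"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```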

| Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
| [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
| [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) | 137 | 768 | 8192 | **64.11** | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 |
| [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |
| [gte-large-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) | 409 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
| [modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) | 149 | 768 | 8192 | 62.62 | 74.31 | 44.98 | 83.96 | 56.42 | 52.89 | 81.78 | 31.39 |
| [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | | 768 | 8192 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.4 |
| [gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) | 305 | 768 | 8192 | 61.4 | 70.89 | 44.31 | 84.24 | 57.47 | 51.08 | 82.11 | 30.58 |
| [jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) | 572 | 1024 | 8192 | 65.51 | 82.58 | 45.21 | 84.01 | 58.13 | 53.88 | 85.81 | 29.71 |
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 149 | 768 | 8192 | 64.29 | 76.32 | 45.31 | 86.49 | 58.33 | 55.33 | 83.41 | 29.17 |

### LoCo (Long Document Retrieval)

| Model Name | Dimension | Sequence Length | Average (5) | QMSumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
|:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [gte-qwen1.5-7b](https://huggingface.co/Alibaba-NLP/gte-qwen1.5-7b) | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 |
| [gte-large-v1.5](https://huggingface.co/Alibaba-NLP/gte-large-v1.5) | 1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 |
| [gte-base-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-v1.5) | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 |
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 88.88 | 54.45 | 93.00 | 99.82 | 98.03 | 98.70 |
| [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 90.68 | 70.86 | 94.06 | 99.73 | 99.11 | 89.67 |
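
Since the embedding model accepts up to 8192 tokens, long documents such as those in LoCo can usually be encoded in a single pass instead of being chunked. A minimal sketch with `sentence-transformers` (the query and document strings are placeholders; `max_seq_length` is set explicitly only in case the loaded configuration does not already use the full window):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('Alibaba-NLP/gte-modernbert-base')
model.max_seq_length = 8192  # use the full context window; longer inputs are truncated

# Placeholder texts standing in for a LoCo-style query and long document
query = "What does the report conclude about budget execution?"
long_document = "Government report. Section 1: Introduction. ..."

query_emb = model.encode(query)
doc_emb = model.encode(long_document)
print(cos_sim(query_emb, doc_emb))
```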

### CoIR (Code Retrieval Task)

| Model Name | Dimension | Sequence Length | Average (20) | CodeSearchNet-ccr-go | CodeSearchNet-ccr-java | CodeSearchNet-ccr-javascript | CodeSearchNet-ccr-php | CodeSearchNet-ccr-python | CodeSearchNet-ccr-ruby | CodeSearchNet-go | CodeSearchNet-java | CodeSearchNet-javascript | CodeSearchNet-php | CodeSearchNet-python | CodeSearchNet-ruby | apps | codefeedback-mt | codefeedback-st | codetrans-contest | codetrans-dl | cosqa | stackoverflow-qa | synthetic-text2sql |
|:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 77.26 | 95.15 | 94.75 | 96.55 | 91.64 | 95.31 | 90.71 | 86.41 | 79.09 | 97.66 | 80.22 | 42.05 | 55.2 | 84.77 | 52.53 |
| [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 79.31 | 94.15 | 93.57 | 94.27 | 91.51 | 93.93 | 90.63 | 88.32 | 83.27 | 76.05 | 85.12 | 88.16 | 77.59 | 57.54 | 82.34 | 85.95 | 71.89 |

### BEIR

| Model Name | Dimension | Sequence Length | Average (15) | ArguAna | ClimateFEVER | CQADupstackAndroidRetrieval | DBPedia | FEVER | FiQA2018 | HotpotQA | MSMARCO | NFCorpus | NQ | QuoraRetrieval | SCIDOCS | SciFact | Touche2020 | TRECCOVID |
|:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) | 768 | 8192 | 55.33 | 72.68 | 37.74 | 42.63 | 41.79 | 91.03 | 48.81 | 69.47 | 40.9 | 36.44 | 57.62 | 88.55 | 21.29 | 77.4 | 21.68 | 81.95 |
| [gte-reranker-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base) | - | 8192 | 69.03 | 37.79 | 44.68 | 47.23 | 94.54 | 49.81 | 78.16 | 45.38 | 30.69 | 64.57 | 87.77 | 20.60 | 73.57 | 27.36 | 79.89 |

## Citation

If you find our paper or models helpful, please consider citing:

```bibtex
@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
```