mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
Abstract
We present systematic efforts in building a long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base size) enhanced with RoPE and unpadding, pre-trained in a native 8192-token context (longer than the 512 tokens of previous multilingual encoders). Then we construct a hybrid TRM and a cross-encoder reranker by contrastive learning. Evaluations show that our text encoder outperforms the same-sized previous state-of-the-art XLM-R. Meanwhile, our TRM and reranker match the performance of the large-sized state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks. Further analysis demonstrates that our proposed models exhibit higher efficiency during both training and inference. We believe their efficiency and effectiveness could benefit a wide range of research and industrial applications.
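The abstract mentions unpadding as one of the encoder's efficiency features. As a rough illustration of the general idea only (not the authors' implementation), the sketch below flattens a right-padded batch into a single token stream and records cumulative sequence lengths, so that no computation is spent on padding positions. The function names `unpad` and `repad`, the pad id of 0, and the toy token ids are all hypothetical choices made for this example.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding id, chosen for this sketch only


def unpad(batch: List[List[int]], mask: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten a padded batch into one token stream plus cumulative lengths.

    cu_seqlens[i] and cu_seqlens[i + 1] delimit sequence i in the stream.
    """
    tokens: List[int] = []
    cu_seqlens: List[int] = [0]
    for row, row_mask in zip(batch, mask):
        # Keep only the positions the attention mask marks as real tokens.
        kept = [tok for tok, keep in zip(row, row_mask) if keep]
        tokens.extend(kept)
        cu_seqlens.append(cu_seqlens[-1] + len(kept))
    return tokens, cu_seqlens


def repad(tokens: List[int], cu_seqlens: List[int], max_len: int) -> List[List[int]]:
    """Inverse of unpad for right-padded batches: restore a rectangular batch."""
    rows: List[List[int]] = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        seq = tokens[start:end]
        rows.append(seq + [PAD_ID] * (max_len - len(seq)))
    return rows


# Toy batch: three sequences of lengths 3, 2 and 5, right-padded to length 5.
batch = [
    [11, 12, 13, PAD_ID, PAD_ID],
    [21, 22, PAD_ID, PAD_ID, PAD_ID],
    [31, 32, 33, 34, 35],
]
mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
]

tokens, cu_seqlens = unpad(batch, mask)
restored = repad(tokens, cu_seqlens, max_len=5)
```

In this toy batch, 10 of the 15 positions hold real tokens, so the flattened stream is a third shorter than the padded one. With sequence lengths ranging up to 8192 tokens, the share of padding in a batch can be much larger, which is where the potential savings come from.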
Community
New multilingual embedding and reranking models!
We also released our pre-trained English (https://huggingface.co/Alibaba-NLP/gte-en-mlm-base and https://huggingface.co/Alibaba-NLP/gte-en-mlm-large) and multilingual (https://huggingface.co/Alibaba-NLP/gte-multilingual-mlm-base) MLM models.
Did you guys essentially describe ModernBERT several months before it was released? The modifications you made to BERT in your paper are almost identical to the ModernBERT improvements. That's impressive!
Yes, thank you for finding this.
To be honest, several works had already pre-trained so-called modern BERT models before ModernBERT, such as MosaicBERT, JinaBERT, NomicBERT, and our GTE.
We simply consider pre-training 8k-context encoders a necessary prerequisite for building state-of-the-art long-context embedding and reranking models.