This model is the query encoder of the Wiki BM25 Lexical Model (螞) from the SPAR paper:
Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?
Xilun Chen, Kushal Lakhotia, Barlas O臒uz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta and Wen-tau Yih
Meta AI
The associated github repo is available here: https://github.com/facebookresearch/dpr-scale/tree/main/spar
This model is a BERT-base sized dense retriever trained on Wikipedia articles to imitate the behavior of BM25. The following models are also available:
Pretrained Model | Corpus | Teacher | Architecture | Query Encoder Path | Context Encoder Path |
---|---|---|---|---|---|
Wiki BM25 螞 | Wikipedia | BM25 | BERT-base | facebook/spar-wiki-bm25-lexmodel-query-encoder | facebook/spar-wiki-bm25-lexmodel-context-encoder |
PAQ BM25 螞 | PAQ | BM25 | BERT-base | facebook/spar-paq-bm25-lexmodel-query-encoder | facebook/spar-paq-bm25-lexmodel-context-encoder |
MARCO BM25 螞 | MS MARCO | BM25 | BERT-base | facebook/spar-marco-bm25-lexmodel-query-encoder | facebook/spar-marco-bm25-lexmodel-context-encoder |
MARCO UniCOIL 螞 | MS MARCO | UniCOIL | BERT-base | facebook/spar-marco-unicoil-lexmodel-query-encoder | facebook/spar-marco-unicoil-lexmodel-context-encoder |
Using the Lexical Model (螞) Alone
This model should be used together with the associated context encoder, similar to the DPR model.
import torch
from transformers import AutoTokenizer, AutoModel
# The tokenizer is the same for the query and context encoder
tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')
query = "Where was Marie Curie born?"
contexts = [
"Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
"Born in Paris on 15 May 1859, Pierre Curie was the son of Eug猫ne Curie, a doctor of French Catholic origin from Alsace."
]
# Apply tokenizer
query_input = tokenizer(query, return_tensors='pt')
ctx_input = tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
# Compute embeddings: take the last-layer hidden state of the [CLS] token
query_emb = query_encoder(**query_input).last_hidden_state[:, 0, :]
ctx_emb = context_encoder(**ctx_input).last_hidden_state[:, 0, :]
# Compute similarity scores using dot product
score1 = query_emb @ ctx_emb[0] # 341.3268
score2 = query_emb @ ctx_emb[1] # 340.1626
Using the Lexical Model (螞) with a Base Dense Retriever as in SPAR
As 螞 learns lexical matching from a sparse teacher retriever, it can be used in combination with a standard dense retriever (e.g. DPR, Contriever) to build a dense retriever that excels at both lexical and semantic matching.
In the following example, we show how to build the SPAR-Wiki model for Open-Domain Question Answering by concatenating the embeddings of DPR and the Wiki BM25 螞.
import torch
from transformers import AutoTokenizer, AutoModel
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
# DPR model
dpr_ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
dpr_query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")
# Wiki BM25 螞 model
lexmodel_tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')
query = "Where was Marie Curie born?"
contexts = [
"Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
"Born in Paris on 15 May 1859, Pierre Curie was the son of Eug猫ne Curie, a doctor of French Catholic origin from Alsace."
]
# Compute DPR embeddings
dpr_query_input = dpr_query_tokenizer(query, return_tensors='pt')['input_ids']
dpr_query_emb = dpr_query_encoder(dpr_query_input).pooler_output
dpr_ctx_input = dpr_ctx_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
dpr_ctx_emb = dpr_ctx_encoder(**dpr_ctx_input).pooler_output
# Compute 螞 embeddings
lexmodel_query_input = lexmodel_tokenizer(query, return_tensors='pt')
lexmodel_query_emb = lexmodel_query_encoder(**query_input).last_hidden_state[:, 0, :]
lexmodel_ctx_input = lexmodel_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
lexmodel_ctx_emb = lexmodel_context_encoder(**ctx_input).last_hidden_state[:, 0, :]
# Form SPAR embeddings via concatenation
# The concatenation weight is only applied to query embeddings
# Refer to the SPAR paper for details
concat_weight = 0.7
spar_query_emb = torch.cat(
[dpr_query_emb, concat_weight * lexmodel_query_emb],
dim=-1,
)
spar_ctx_emb = torch.cat(
[dpr_ctx_emb, lexmodel_ctx_emb],
dim=-1,
)
# Compute similarity scores
score1 = spar_query_emb @ spar_ctx_emb[0] # 317.6931
score2 = spar_query_emb @ spar_ctx_emb[1] # 314.6144
- Downloads last month
- 14