sebastian-hofstaetter committed
Commit 0feb43c · Parent(s): 24c6247

Add model, tokenizer, & initial model card

Browse files:
- README.md +154 -0
- config.json +10 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,154 @@
---
language: "en"
tags:
- dpr
- dense-passage-retrieval
- knowledge-distillation
datasets:
- ms_marco
---

# Margin-MSE Trained ColBERT

We provide a retrieval-trained, DistilBERT-based ColBERT model (https://arxiv.org/pdf/2004.12832.pdf). Our model is trained with Margin-MSE, using an ensemble of 3 BERT_Cat (concatenated BERT scoring) teachers, on MSMARCO-Passage.

This instance can be used to **re-rank a candidate set** or **directly for vector-index-based dense retrieval**. The architecture is a 6-layer DistilBERT with an additional single linear layer at the end.

If you want to know more about our simple yet effective knowledge distillation method for efficient information retrieval models (it works for a variety of student architectures, including the one used for this model instance), check out our paper: https://arxiv.org/abs/2010.02666 🎉

For more information, training data, source code, and a minimal usage example, please visit: https://github.com/sebastian-hofstaetter/neural-ranking-kd

## Configuration

- Trained with fp16, so fp16 inference shouldn't be a problem
- We use no compression: 768-dimensional output vectors (better suited for re-ranking, or for indexing smaller collections; the full MSMARCO index grows to ~1 TB of vector storage even with fp16 ... oops)
- Query [MASK] augmentation = 8x regardless of batch size (this needs to be added before the model is called; see the usage example in the GitHub repo, and the hedged sketch below)

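To make the last point more concrete, here is a minimal sketch of the 8x query [MASK] augmentation. The `augment_query` helper is illustrative and not taken from the GitHub repo (consult its usage example for the exact procedure): eight [MASK] tokens are appended to every query and marked as attended, so they participate in the term interaction.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def augment_query(query_encoding, n_masks: int = 8):
    # illustrative helper: append n_masks [MASK] tokens to every query in the batch
    # and extend the attention mask with 1s so the added tokens contribute to scoring
    input_ids = query_encoding["input_ids"]
    attention_mask = query_encoding["attention_mask"]
    batch_size = input_ids.shape[0]

    mask_ids = torch.full((batch_size, n_masks), tokenizer.mask_token_id, dtype=input_ids.dtype)
    ones = torch.ones((batch_size, n_masks), dtype=attention_mask.dtype)

    return {"input_ids": torch.cat([input_ids, mask_ids], dim=-1),
            "attention_mask": torch.cat([attention_mask, ones], dim=-1)}

augmented_query = augment_query(tokenizer("how do solar panels work", return_tensors="pt"))
```
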
## Model Code

````python
from transformers import AutoTokenizer, AutoModel, PreTrainedModel, PretrainedConfig
from typing import Dict
import torch

class ColBERTConfig(PretrainedConfig):
    model_type = "ColBERT"
    bert_model: str
    compression_dim: int = 768
    dropout: float = 0.0
    return_vecs: bool = False
    trainable: bool = True

class ColBERT(PreTrainedModel):
    """
    ColBERT model from: https://arxiv.org/pdf/2004.12832.pdf
    We use a dot-product instead of cosine per term (slightly better)
    """
    config_class = ColBERTConfig
    base_model_prefix = "bert_model"

    def __init__(self, cfg) -> None:
        super().__init__(cfg)

        self.bert_model = AutoModel.from_pretrained(cfg.bert_model)

        for p in self.bert_model.parameters():
            p.requires_grad = cfg.trainable

        self.compressor = torch.nn.Linear(self.bert_model.config.hidden_size, cfg.compression_dim)

    def forward(self,
                query: Dict[str, torch.LongTensor],
                document: Dict[str, torch.LongTensor]):

        query_vecs = self.forward_representation(query)
        document_vecs = self.forward_representation(document)

        score = self.forward_aggregation(query_vecs, document_vecs, query["attention_mask"], document["attention_mask"])
        return score

    def forward_representation(self, tokens, sequence_type=None) -> torch.Tensor:

        vecs = self.bert_model(**tokens)[0]  # assuming a DistilBERT model here
        vecs = self.compressor(vecs)

        # if encoding only, zero out the padded positions so we can compress storage
        if sequence_type == "doc_encode" or sequence_type == "query_encode":
            vecs = vecs * tokens["attention_mask"].unsqueeze(-1)

        return vecs

    def forward_aggregation(self, query_vecs, document_vecs, query_mask, document_mask):

        # create initial term-x-term scores (dot-product)
        score = torch.bmm(query_vecs, document_vecs.transpose(2, 1))

        # mask out padding on the doc dimension (set to -10000, so max pooling never
        # selects a padded position; setting it to 0 might still select one)
        exp_mask = document_mask.bool().unsqueeze(1).expand(-1, score.shape[1], -1)
        score[~exp_mask] = -10000

        # max pooling over the document dimension
        score = score.max(-1).values

        # mask out padded query positions
        score[~(query_mask.bool())] = 0

        # sum over query values
        score = score.sum(-1)

        return score

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # honestly not sure if that is the best way to go, but it works :)
model = ColBERT.from_pretrained("sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco")
````

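Continuing from the snippet above, a hedged usage sketch for re-ranking a few candidate passages (the query and passage texts are made up for illustration, and the query [MASK] augmentation from the Configuration section is omitted for brevity; add it as shown in the GitHub repo's example):

```python
query = tokenizer("what is the capital of france", return_tensors="pt")
passages = tokenizer(["Paris is the capital of France.",
                      "Berlin is the capital of Germany."],
                     padding=True, truncation=True, return_tensors="pt")

# the forward pass expects one query per document, so repeat the query tensors
# to match the number of candidate passages
query_batch = {k: v.repeat(len(passages["input_ids"]), 1) for k, v in query.items()}

with torch.no_grad():
    scores = model(query_batch, passages)  # one relevance score per passage, higher = more relevant
print(scores)
```

For dense retrieval, `forward_representation` can instead be called separately on queries and documents and the resulting per-term vectors stored in an index.
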
## Effectiveness on MSMARCO Passage & TREC Deep Learning '19

We trained our model on the MSMARCO standard training triples (the "small" 400K-query set) with knowledge distillation, using a batch size of 32 on a single consumer-grade GPU (11 GB memory).

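For reference, a minimal sketch of the Margin-MSE objective used for this distillation (not the authors' training code; the teacher margin is assumed to be precomputed from the BERT_Cat teacher ensemble over the same training triples):

```python
import torch

def margin_mse_loss(student_pos: torch.Tensor,      # student scores for the relevant passages, shape (batch,)
                    student_neg: torch.Tensor,      # student scores for the non-relevant passages, shape (batch,)
                    teacher_margin: torch.Tensor):  # teacher score margin (pos - neg), shape (batch,)
    # the student is trained to reproduce the teachers' score margin between
    # the relevant and the non-relevant passage of each training triple
    return torch.nn.functional.mse_loss(student_pos - student_neg, teacher_margin)
```
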
For re-ranking we used the top-1000 BM25 results.

### MSMARCO-DEV

Here, we use the larger 49K-query DEV set (results are in the same range as on the smaller 7K DEV set; only minimal differences are to be expected).

|                                     | MRR@10 | NDCG@10 |
|-------------------------------------|--------|---------|
| BM25                                | .194   | .241    |
| **Margin-MSE ColBERT** (Re-ranking) | .375   | .436    |

### TREC-DL'19

For MRR we binarize the graded relevance judgments at the recommended cutoff of 2; this may skew comparisons against evaluations that use a different binarization point (a small sketch of the metric follows the table).

|                                     | MRR@10 | NDCG@10 |
|-------------------------------------|--------|---------|
| BM25                                | .689   | .501    |
| **Margin-MSE ColBERT** (Re-ranking) | .878   | .744    |

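For context on the binarization above, a small illustration of MRR@10 with graded relevance binarized at >= 2 (a sketch of the metric definition, not the official evaluation tooling):

```python
def mrr_at_10(ranked_doc_ids, graded_relevance, binarization_point=2):
    # ranked_doc_ids: document ids in ranked order for a single query
    # graded_relevance: dict mapping document id -> graded relevance label
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if graded_relevance.get(doc_id, 0) >= binarization_point:
            return 1.0 / rank  # reciprocal rank of the first relevant document
    return 0.0

# the reported MRR@10 is the mean of this value over all queries
```
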
For more metrics, baselines, further information, and analysis, please see the paper: https://arxiv.org/abs/2010.02666

## Limitations & Bias

- The model inherits social biases from both DistilBERT and MSMARCO.

- The model is only trained on relatively short passages of MSMARCO (average length of about 60 words), so it might struggle with longer text.

## Citation

If you use our model checkpoint, please cite our work as:

```
@misc{hofstaetter2020_crossarchitecture_kd,
      title={Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation},
      author={Sebastian Hofst{\"a}tter and Sophia Althammer and Michael Schr{\"o}der and Mete Sertkan and Allan Hanbury},
      year={2020},
      eprint={2010.02666},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}
```
config.json
ADDED
@@ -0,0 +1,10 @@
{
  "architectures": [
    "ColBERT"
  ],
  "bert_model": "distilbert-base-uncased",
  "compression_dim": 768,
  "model_type": "ColBERT",
  "return_vecs": true,
  "trainable": true
}
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eb2a93cee563cc0ee7b8b5709835f57781338bf47fd2819fcf6265f29f598b26
size 267837019
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "name_or_path": "distilbert-base-uncased"}
vocab.txt
ADDED
The diff for this file is too large to render.
See raw diff
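The tokenizer artifacts added in this commit (special_tokens_map.json, tokenizer_config.json, vocab.txt) mirror the distilbert-base-uncased tokenizer, so as an alternative to the tokenizer line in the model card it should also be possible to load the tokenizer from this repository directly (a hedged sketch, assuming the hub resolves these files for the repo id):

```python
from transformers import AutoTokenizer

# assumption: the hub serves the tokenizer files committed above under this repo id
tokenizer = AutoTokenizer.from_pretrained("sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco")
```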