Update README.md

README.md CHANGED

@@ -40,7 +40,21 @@ model_output = model(**encoded_input)
 Sequence embeddings can be produced as follows:
-TBA (just mean pool not including special tokens)
+```python
+def sequence_embeddings(encoded_input, model_output):
+    mask = encoded_input['attention_mask'].float()
+    # Map each row to its last attended position (the SEP token); later (row, col) pairs overwrite earlier ones
+    d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}
+    for i in d:
+        mask[i, d[i]] = 0  # hide the SEP token
+    mask[:, 0] = 0.0  # hide the CLS token
+    mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
+    sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
+    sum_mask = torch.clamp(mask.sum(1), min=1e-9)
+    return sum_embeddings / sum_mask
+seq_embeds = sequence_embeddings(encoded_input, model_output)
+```
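
For comparison, the same pooling can probably be written without the Python loop or the round trip through NumPy. The sketch below is not part of this change. The name `masked_mean_pool` is hypothetical, and it assumes right-padded batches in which the first token is CLS and the last attended token of each row is SEP.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Mean-pool token embeddings, skipping padding, CLS and SEP.

    Assumes right padding: CLS sits at index 0 and SEP is the last
    attended token of each row (hypothetical helper, not from the README).
    """
    # Work on a float copy so the caller's attention mask is left untouched.
    mask = attention_mask.clone().float()

    # Number of attended tokens per row; the SEP index is length - 1.
    lengths = attention_mask.sum(dim=1).long()
    rows = torch.arange(mask.size(0), device=mask.device)
    has_tokens = lengths > 0  # guard against all-padding rows
    mask[rows[has_tokens], lengths[has_tokens] - 1] = 0.0  # hide SEP
    mask[:, 0] = 0.0  # hide CLS

    # (batch, seq, 1) broadcasts across the hidden dimension.
    mask = mask.unsqueeze(-1)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts
```

With the README's variables, this would be called as `masked_mean_pool(model_output.last_hidden_state, encoded_input['attention_mask'])`. Indexing by sequence length instead of building a dictionary keeps everything on the tensor's device, at the cost of assuming right padding explicitly.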
 ### Fine-tune