Model Introduction
360Zhinao-search uses a self-developed BERT model as its base and applies multi-task fine-tuning. It achieves an average score of 75.05 on the C-MTEB-Retrieval benchmark, ranking first at the time of writing.
The C-MTEB-Retrieval leaderboard comprises 8 [query, passage] similarity retrieval subtasks across different domains and uses NDCG@10 (Normalized Discounted Cumulative Gain at rank 10) as the evaluation metric.
Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
---|---|---|---|---|---|---|---|---|---|
360Zhinao-search | 87.12 | 83.32 | 87.57 | 85.02 | 46.73 | 68.9 | 63.69 | 78.09 | 75.05 |
AGE_Hybrid | 86.88 | 80.65 | 89.28 | 83.66 | 47.26 | 69.28 | 65.94 | 76.79 | 74.97 |
OpenSearch-text-hybrid | 86.76 | 79.93 | 87.85 | 84.03 | 46.56 | 68.79 | 65.92 | 75.43 | 74.41 |
piccolo-large-zh-v2 | 86.14 | 79.54 | 89.14 | 86.78 | 47.58 | 67.75 | 64.88 | 73.1 | 74.36 |
stella-large-zh-v3-1792d | 85.56 | 79.14 | 87.13 | 82.44 | 46.87 | 68.62 | 65.18 | 73.89 | 73.6 |
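
For readers unfamiliar with the metric behind these scores, the following is a minimal sketch of how NDCG@10 can be computed for a single query. It assumes linear gain, unique passage IDs in the ranking, and a `qrels` dictionary that maps passage IDs to graded relevance. The names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the benchmark's evaluation code, which may differ in details such as tie handling.

```python
import math
from typing import Dict, List


def dcg_at_k(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k relevance values."""
    # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query (illustrative helper, linear gain assumed)."""
    # Relevance of each retrieved passage, 0.0 if it is not judged relevant.
    gains = [qrels.get(pid, 0.0) for pid in ranked_ids]
    # The ideal ranking lists the judged passages in decreasing relevance.
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical judgments and system output for a single query.
    qrels = {"p1": 1.0, "p7": 1.0}
    ranking = ["p3", "p1", "p9", "p7"]
    print("NDCG@10:", ndcg_at_k(ranking, qrels))
```

A leaderboard score for a subtask would then be the mean of this value over all queries in that subtask, scaled to 0-100.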
Optimization points
- Data filtering: Strictly prevent leakage of the C-MTEB-Retrieval test data by removing every query and passage that appears in the test set from the training data;
- Data source enhancement: Use open-source data and LLM-synthesized data to improve data diversity;
- Negative example mining: Use multiple methods to mine hard negatives (negative examples that are difficult to distinguish from positives) to improve information gain;
- Training efficiency: Use multi-machine, multi-GPU training with DeepSpeed to optimize GPU memory utilization.
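
The mining methods themselves are not described here. As an illustration only, one common approach is to score candidate passages with an existing embedding model and keep the highest-scoring passages that are not labeled positive. The sketch below assumes precomputed embeddings as plain Python lists; the function `mine_hard_negatives` and the `skip_top` parameter (a heuristic for skipping the very top candidates, which may be unlabeled positives) are hypothetical and not part of this repository.

```python
from typing import Dict, List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(
    query_emb: Sequence[float],
    passage_embs: Dict[str, Sequence[float]],
    positive_ids: Set[str],
    num_negatives: int = 2,
    skip_top: int = 0,
) -> List[str]:
    """Return IDs of the most query-similar passages that are not positives."""
    # Score every passage that is not a labeled positive for this query.
    scored = [
        (dot(query_emb, emb), pid)
        for pid, emb in passage_embs.items()
        if pid not in positive_ids
    ]
    # Highest similarity first: these are the hardest negatives.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[skip_top:skip_top + num_negatives]]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, for illustration only.
    passages = {
        "pos": [1.0, 0.0],
        "hard": [0.9, 0.4],
        "mid": [0.5, 0.8],
        "easy": [0.0, 1.0],
    }
    print(mine_hard_negatives([1.0, 0.0], passages, {"pos"}))
```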
Usage
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np

# Load the tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('qihoo360/360Zhinao-search')
model = AutoModel.from_pretrained('qihoo360/360Zhinao-search')

# A query and a passage: "What color is the sky?" / "The sky is blue."
sentences = ['天空是什么颜色的', '天空是蓝色的']
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=512)

if __name__ == "__main__":
    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        last_hidden_state = model(**inputs, return_dict=True).last_hidden_state
    # Use the first token's ([CLS]) hidden state as the sentence embedding.
    embeddings = last_hidden_state[:, 0]
    # L2-normalize so that the dot product equals cosine similarity.
    embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
    embeddings = embeddings.cpu().numpy()
    print("embeddings:")
    print(embeddings)
    cos_sim = np.dot(embeddings[0], embeddings[1])
    print("cos_sim:", cos_sim)
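
The same embeddings can be used to rank several passages against one query. The following is a small sketch under the assumption that all embeddings are already L2-normalized NumPy arrays, as produced above; the helper `top_k_passages` is hypothetical and not part of the model's API.

```python
import numpy as np


def top_k_passages(query_emb: np.ndarray, passage_embs: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k passages most similar to the query.

    query_emb has shape (dim,), passage_embs has shape (num_passages, dim).
    Both are assumed to be L2-normalized, so the dot product is cosine similarity.
    """
    scores = passage_embs @ query_emb
    k = min(k, len(scores))
    # argsort on the negated scores gives indices in descending score order.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


if __name__ == "__main__":
    # Toy unit-length vectors standing in for real model embeddings.
    passages = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
    query = np.array([1.0, 0.0])
    for idx, score in top_k_passages(query, passages, k=2):
        print(idx, score)
```

For large passage collections, an approximate nearest-neighbor index would typically replace the exhaustive matrix product used here.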
License
The source code of this repository follows the open-source license Apache 2.0.
360Zhinao open-source models support commercial use. To use these models or continue training them for commercial purposes, please contact us by email ([email protected]) to apply. For the specific license terms, see the "360 Zhinao Open-Source Model License".
Evaluation results
- map on MTEB CMedQAv1 (test set, self-reported): 87.005
- mrr on MTEB CMedQAv1 (test set, self-reported): 89.347
- map on MTEB CMedQAv2 (test set, self-reported): 88.483
- mrr on MTEB CMedQAv2 (test set, self-reported): 90.578
- map on MTEB MMarcoReranking (self-reported): 32.409
- mrr on MTEB MMarcoReranking (self-reported): 31.487
- map on MTEB T2Reranking (self-reported): 67.803
- mrr on MTEB T2Reranking (self-reported): 78.145
- map_at_1 on MTEB CmedqaRetrieval (self-reported): 27.171
- map_at_10 on MTEB CmedqaRetrieval (self-reported): 40.109