FIRST is a language model trained specifically for listwise reranking tasks.

- **License:** MIT
- **Finetuned from model:** [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)

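As a quick illustration of the listwise setup this model is trained for, here is a minimal inference sketch with 🤗 Transformers. The repository id, prompt template, and identifier parsing below are assumptions made for illustration (modeled on RankZephyr-style listwise prompting), not the interface documented for FIRST.

```python
# Hedged sketch of listwise reranking with a causal LM. Assumptions: the
# checkpoint id is a placeholder, and the prompt/identifier format
# ("[1] ... [n]", answer like "[2] > [1] > [3]") mirrors RankZephyr-style
# listwise prompting rather than FIRST's documented interface.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rryisthebest/FIRST"  # placeholder: substitute the repo id of this model page

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

query = "how do listwise rerankers differ from pointwise rerankers?"
passages = [
    "Pointwise rerankers score each query-passage pair independently.",
    "The Eiffel Tower is located in Paris, France.",
    "Listwise rerankers read all candidates at once and emit an ordering.",
]

# Number the candidates and ask the model for a ranked list of identifiers.
numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
prompt = (
    "Rank the following passages by relevance to the query.\n"
    f"Query: {query}\n{numbered}\n"
    "Answer with identifiers only, e.g. [2] > [3] > [1].\nRanking:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
decoded = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Parse the identifiers back into a deduplicated passage ordering.
seen, order = set(), []
for m in re.findall(r"\[(\d+)\]", decoded):
    i = int(m) - 1
    if 0 <= i < len(passages) and i not in seen:
        seen.add(i)
        order.append(i)
print([passages[i] for i in order])
```

Note that FIRST itself is built to avoid decoding the full identifier sequence: per the paper, the ranking can be read off the output logits of the first generated identifier. The generate-then-parse loop above is only the generic fallback.
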
### Model Sources

<!-- Provide the basic links for the model. -->

### Evaluations

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

At the time of release, FIRST demonstrates superior performance across a variety of reranking datasets. The table below provides a detailed performance comparison against other LLM rerankers on the BEIR benchmark (nDCG@10).

| Reranker | Training Data | Avg. | Climate-FEVER | DBPedia | FEVER | FiQA | HotpotQA | MS MARCO | NFCorpus | NQ | SciDocs | SciFact | TREC-COVID |
|---------------|----------------|-------|---------------|---------|-------|-------|-----------|----------|----------|-------|----------|----------|------------|
| RankVicuna | GPT-3.5 | 50.7 | **28.2** | 50.0 | 81.0 | 35.9 | 73.5 | 36.7 | 33.1 | 58.6 | 18.4 | 70.5 | 71.3 |
| RankZephyr | GPT-3.5 + 4 | 53.7 | 25.6 | 50.0 | 80.1 | **42.2** | 71.6 | 42.7 | **37.7** | 65.6 | **20.5** | **76.7** | 78.4 |
| **FIRST** | GPT-4 | **54.3** | 26.7 | **50.9** | **81.7** | 42.2 | **74.2** | **44.4** | 37.4 | **66.4** | 20.4 | 74.6 | **78.8** |

More details can be found in the paper.

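BEIR results for rerankers are conventionally reported as nDCG@10, and assuming that is the metric behind the table above, the computation is easy to state exactly. The snippet below is a self-contained illustration, not the paper's evaluation harness.

```python
import math

def dcg_at_k(gains, k=10):
    """DCG over the top-k gains with the standard log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k = DCG of the system ranking / DCG of the ideal ranking.

    ranked_doc_ids: doc ids in the order the reranker returned them.
    qrels: dict mapping doc id -> graded relevance (0 = not relevant).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy check: ranking the judged documents in ideal order gives nDCG@10 = 1.0.
print(ndcg_at_k(["d3", "d7", "d1"], {"d3": 2, "d7": 1}))  # 1.0
```

Real evaluations use the official qrels and a tool such as `pytrec_eval`; this sketch is only to make the metric concrete.
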
## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

We reproduce here an excerpt from the [Zephyr-7B-β model card](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta/blob/main/README.md#bias-risks--limitations):

> Zephyr-7B-β has not been aligned to human preferences for safety within the RLHF phase or deployed with in-the-loop filtering of responses like ChatGPT, so the model can produce problematic outputs (especially when prompted to do so). It is also unknown what the size and composition of the corpus was used to train the base model ([mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)), however it is likely to have included a mix of Web data and technical sources like books and code. See the [Falcon 180B model card](https://huggingface.co/tiiuae/falcon-180B#training-data) for an example of this.

FIRST is trained specifically on monolingual English data; its effectiveness on multilingual sets is not guaranteed.

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
If you find FIRST useful for your work, please consider citing our paper: