Committed by Joschka Strueber
Commit · 1b549fb
1 Parent(s): b90e0d3
[Ref] change table size
app.py CHANGED
@@ -169,15 +169,15 @@ for model similarity which adjusts for chance agreement due to accuracy. Using C
  biased towards more similar models controlling for the model's capability. (2) Gain from training strong models on annotations \
  of weak supervisors (weak-to-strong generalization) is higher when the two models are more different. (3) Concerningly, model \
  errors are getting more correlated as capabilities increase.""")
-
-
+ with gr.Row():
+ gr.Image(value="data/table_capa.png", label="Comparison of different similarity metrics for multiple-choice questions", interactive=False, scale=1)
  gr.Markdown("""
  - **Datasets**: [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) benchmark datasets \n
  - Some datasets are not multiple-choice - for these, the metrics are not applicable. \n
  - **Models**: Open LLM Leaderboard models \n
  - Every model evaluation is gated on Hugging Face and access has to be requested. \n
  - We requested access for the most popular models, but some may be missing. \n
- - Notably, loading data is not possible for
+ - Notably, loading data is not possible for some meta-llama and gemma models.
  - **Metrics**: CAPA (probabilistic), CAPA (deterministic), Error Consistency""")

  if __name__ == "__main__":
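The metrics list in the Markdown above names Error Consistency, and the hunk header mentions adjusting for chance agreement due to accuracy. As a rough illustration of that idea, here is a minimal sketch of Error Consistency as it is commonly defined (Cohen's kappa computed on per-question correctness). The function name `error_consistency`, its inputs, and the NaN convention for the degenerate case are assumptions made for this sketch and are not taken from app.py.

```python
import math
from typing import Sequence


def error_consistency(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Cohen's kappa on per-question correctness of two models (hypothetical helper)."""
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("inputs must be non-empty and of equal length")

    # Accuracy of each model on the shared set of questions.
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n

    # Observed agreement: both right or both wrong on the same question.
    c_obs = sum(a == b for a, b in zip(correct_a, correct_b)) / n

    # Agreement expected by chance if errors were independent given the accuracies.
    c_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if c_exp == 1.0:
        # Kappa is undefined here; returning NaN is a convention chosen for this sketch.
        return math.nan
    return (c_obs - c_exp) / (1 - c_exp)
```

A value near 1 means the two models are right and wrong on the same questions more often than their accuracies alone would predict, a value near 0 means agreement is at chance level, and negative values mean they disagree more often than chance.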