Update README.md
README.md
CHANGED
@@ -89,45 +89,29 @@ print(response)

### 3.1 Arena-Hard-Auto

-All results below, except those for `Xwen-
+All results below, except those for `Xwen-7B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).

#### 3.1.1 No Style Control

-
-
-| **Xwen-
-| Qwen2.5-
-
-| Llama-3.1-
-| Llama-3
-
-| O1-Preview-2024-09-12 π | **92.0** (Top-1 among π) | (-1.2, 1.0) |
-| O1-Mini-2024-09-12 π | 90.4 | (-1.1, 1.3) |
-| GPT-4-Turbo-2024-04-09 π | 82.6 | (-1.8, 1.5) |
-| GPT-4-0125-Preview π | 78.0 | (-2.1, 2.4) |
-| GPT-4o-2024-08-06 π | 77.9 | (-2.0, 2.1) |
-| Yi-Lightning π | 81.5 | (-1.6, 1.6) |
-| Yi-Largeπ | 63.7 | (-2.6, 2.4) |
-| GLM-4-0520 π | 63.8 | (-2.9, 2.8) |
+| | Score | 95% CIs |
+| ----------------------- | -------- | ----------- |
+| **Xwen-7B-Chat** π | **59.4** | (-2.4, 2.1) |
+| Qwen2.5-7B-Instruct π | 50.4 | (-2.9, 2.5) |
+| Gemma-2-27B-IT π | 57.5 | (-2.1, 2.4) |
+| Llama-3.1-8B-Instruct π | 21.3 | (-1.9, 2.2) |
+| Llama-3-8B-Instruct π | 20.6 | (-2.0, 1.9) |
+| Starling-LM-7B-beta π | 23.0 | (-1.8, 1.8) |

#### 3.1.2 Style Control

-
-
-| **Xwen-
-| Qwen2.5-
-
-| Llama-3.1-
-| Llama-3
-
-| O1-Preview-2024-09-12 π | 81.7 | (-2.2, 2.1) |
-| O1-Mini-2024-09-12 π | 79.3 | (-2.8, 2.3) |
-| GPT-4-Turbo-2024-04-09 π | 74.3 | (-2.4, 2.4) |
-| GPT-4-0125-Preview π | 73.6 | (-2.0, 2.0) |
-| GPT-4o-2024-08-06 π | 71.1 | (-2.5, 2.0) |
-| Yi-Lightning π | 66.9 | (-3.3, 2.7) |
-| Yi-Large-Preview π | 65.1 | (-2.5, 2.5) |
-| GLM-4-0520 π | 61.4 | (-2.6, 2.4) |
+| | Score | 95% CIs |
+| ----------------------- | -------- | ----------- |
+| **Xwen-7B-Chat** π | **50.3** | (-3.8, 2.8) |
+| Qwen2.5-7B-Instruct π | 46.9 | (-3.1, 2.7) |
+| Gemma-2-27B-IT π | 47.5 | (-2.5, 2.7) |
+| Llama-3.1-8B-Instruct π | 18.3 | (-1.6, 1.6) |
+| Llama-3-8B-Instruct π | 19.8 | (-1.6, 1.9) |
+| Starling-LM-7B-beta π | 26.1 | (-2.6, 2.0) |
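
Note on reading the `95% CIs` column: the pairs such as `(-2.4, 2.1)` appear to be lower and upper offsets relative to the score rather than absolute bounds. That reading is an assumption based on how Arena-Hard-Auto usually prints its leaderboard; the README does not state it. Under that assumption, a minimal sketch for turning the offsets into absolute intervals and checking whether two models' intervals overlap could look like this (the helper names `ci_bounds` and `intervals_overlap` are hypothetical, and the rows are copied from the added "No Style Control" table):

```python
# Rows copied from the added "No Style Control" table:
# (model, score, (lower offset, upper offset)).
rows = [
    ("Xwen-7B-Chat", 59.4, (-2.4, 2.1)),
    ("Qwen2.5-7B-Instruct", 50.4, (-2.9, 2.5)),
    ("Gemma-2-27B-IT", 57.5, (-2.1, 2.4)),
]


def ci_bounds(score, offsets):
    """Convert (lower, upper) offsets into absolute bounds.

    Assumption: the offsets are relative to the score.
    """
    lower_offset, upper_offset = offsets
    return round(score + lower_offset, 1), round(score + upper_offset, 1)


def intervals_overlap(a, b):
    """Return True if two closed intervals (low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


bounds = {name: ci_bounds(score, offsets) for name, score, offsets in rows}

for name, (low, high) in bounds.items():
    print(f"{name}: [{low}, {high}]")
```

With these rows, the `Xwen-7B-Chat` interval comes out as `[57.0, 61.5]`, which overlaps the `Gemma-2-27B-IT` interval `[55.4, 59.9]` but not the `Qwen2.5-7B-Instruct` interval `[47.5, 52.9]`. Overlap is only a rough indication that a score gap falls within the reported uncertainty; it is not a formal significance test.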