Update README.md
README.md
π: Proprietary

### 3.1 Arena-Hard-Auto-v0.1

All results below, except those for the `Xwen` models, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
#### 3.1.1 No Style Control

**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

|                                   | Score                    | 95% CIs     |
| --------------------------------- | ------------------------ | ----------- |
| **Xwen-72B-Chat** π               | **86.1** (Top-1 among π) | (-1.5, 1.7) |
| Qwen2.5-72B-Instruct π            | 78.0                     | (-1.8, 1.8) |
| Athene-v2-Chat π                  | 85.0                     | (-1.4, 1.7) |
| Llama-3.1-Nemotron-70B-Instruct π | 84.9                     | (-1.7, 1.8) |
| Llama-3.1-405B-Instruct-FP8 π     | 69.3                     | (-2.4, 2.2) |
| Claude-3-5-Sonnet-20241022 π      | 85.2                     | (-1.4, 1.6) |
| O1-Preview-2024-09-12 π           | **92.0** (Top-1 among π) | (-1.2, 1.0) |
| O1-Mini-2024-09-12 π              | 90.4                     | (-1.1, 1.3) |
| GPT-4-Turbo-2024-04-09 π          | 82.6                     | (-1.8, 1.5) |
| GPT-4-0125-Preview π              | 78.0                     | (-2.1, 2.4) |
| GPT-4o-2024-08-06 π               | 77.9                     | (-2.0, 2.1) |
| Yi-Lightning π                    | 81.5                     | (-1.6, 1.6) |
| Yi-Large π                        | 63.7                     | (-2.6, 2.4) |
| GLM-4-0520 π                      | 63.8                     | (-2.9, 2.8) |
**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

|                         | Score    | 95% CIs     |
| ----------------------- | -------- | ----------- |
| **Xwen-7B-Chat** π      | **59.4** | (-2.4, 2.1) |
| …                       | …        | …           |
| Llama-3-8B-Instruct π   | 20.6     | (-2.0, 1.9) |
| Starling-LM-7B-beta π   | 23.0     | (-1.8, 1.8) |
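The `95% CIs` columns report bootstrapped 95% confidence intervals, written as offsets from each score. The snippet below is only a minimal illustration of how such an interval can be derived from per-battle outcomes; it uses made-up data and plain win-rate aggregation, not the actual Arena-Hard-Auto scoring pipeline, which is more involved.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical per-battle outcomes for one model: 1 = win, 0.5 = tie, 0 = loss.
battles = rng.choice([0.0, 0.5, 1.0], size=500, p=[0.25, 0.15, 0.60])
score = battles.mean() * 100  # point estimate, as a percentage

# Bootstrap: resample the battles with replacement and recompute the score.
resampled = np.array([
    rng.choice(battles, size=battles.size, replace=True).mean() * 100
    for _ in range(1_000)
])
low, high = np.percentile(resampled, [2.5, 97.5])

# Report the interval as offsets from the point estimate, matching the tables above.
print(f"Score: {score:.1f}, 95% CI: ({low - score:+.1f}, {high - score:+.1f})")
```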
#### 3.1.2 Style Control

**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

|                                   | Score                    | 95% CIs     |
| --------------------------------- | ------------------------ | ----------- |
| **Xwen-72B-Chat** π               | **72.4** (Top-1 among π) | (-4.3, 4.1) |
| Qwen2.5-72B-Instruct π            | 63.3                     | (-2.5, 2.3) |
| Athene-v2-Chat π                  | 72.1                     | (-2.5, 2.5) |
| Llama-3.1-Nemotron-70B-Instruct π | 71.0                     | (-2.8, 3.1) |
| Llama-3.1-405B-Instruct-FP8 π     | 67.1                     | (-2.2, 2.8) |
| Claude-3-5-Sonnet-20241022 π      | **86.4** (Top-1 among π) | (-1.3, 1.3) |
| O1-Preview-2024-09-12 π           | 81.7                     | (-2.2, 2.1) |
| O1-Mini-2024-09-12 π              | 79.3                     | (-2.8, 2.3) |
| GPT-4-Turbo-2024-04-09 π          | 74.3                     | (-2.4, 2.4) |
| GPT-4-0125-Preview π              | 73.6                     | (-2.0, 2.0) |
| GPT-4o-2024-08-06 π               | 71.1                     | (-2.5, 2.0) |
| Yi-Lightning π                    | 66.9                     | (-3.3, 2.7) |
| Yi-Large-Preview π                | 65.1                     | (-2.5, 2.5) |
| GLM-4-0520 π                      | 61.4                     | (-2.6, 2.4) |
**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

|                         | Score    | 95% CIs     |
| ----------------------- | -------- | ----------- |
| **Xwen-7B-Chat** π      | **50.3** | (-3.8, 2.8) |
| …                       | …        | …           |
> [!IMPORTANT]
> We replaced the original AlignBench judge model, `GPT-4-0613`, with the more powerful `GPT-4o-0513`. For fairness, all results below are judged by `GPT-4o-0513`, so they may differ from the AlignBench-v1.1 scores reported elsewhere.
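In practice, swapping the judge simply means directing the judging requests to `GPT-4o-0513` (API name `gpt-4o-2024-05-13`); the same substitution is used for MT-Bench below. The sketch that follows shows a generic LLM-as-judge call with that model via the OpenAI SDK; the prompt is a simplified placeholder, not the actual AlignBench judging template.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_MODEL = "gpt-4o-2024-05-13"  # API name corresponding to GPT-4o-0513

def judge(question: str, answer: str) -> str:
    """Ask the judge model to grade a single answer (simplified placeholder prompt)."""
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "You are an impartial judge. Rate the answer from 1 to 10 "
                           "and briefly justify the rating.",
            },
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(judge("Explain the difference between a list and a tuple in Python.", "..."))
```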
**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

|                               | Score                    |
| ----------------------------- | ------------------------ |
| **Xwen-72B-Chat** π           | **7.57** (Top-1 among π) |
| Qwen2.5-72B-Instruct π        | 7.51                     |
| Deepseek V2.5 π               | 7.38                     |
| Mistral-Large-Instruct-2407 π | 7.10                     |
| Llama3.1-70B-Instruct π       | 5.81                     |
| Llama-3.1-405B-Instruct-FP8 π | 5.56                     |
| GPT-4o-0513 π                 | **7.59** (Top-1 among π) |
| Claude-3.5-Sonnet-20240620 π  | 7.17                     |
| Yi-Lightning π                | 7.54                     |
| Yi-Large-Preview π            | 7.20                     |
**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

|                    | Score    |
| ------------------ | -------- |
| **Xwen-7B-Chat** π | **6.88** |
| …                  | …        |
> [!IMPORTANT]
> We replaced the original MT-Bench judge model, `GPT-4`, with the more powerful `GPT-4o-0513`. For fairness, all results below are judged by `GPT-4o-0513`, so they may differ from the MT-Bench scores reported elsewhere.
**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

|                               | Score                    |
| ----------------------------- | ------------------------ |
| **Xwen-72B-Chat** π           | **8.64** (Top-1 among π) |
| Qwen2.5-72B-Instruct π        | 8.62                     |
| Deepseek V2.5 π               | 8.43                     |
| Mistral-Large-Instruct-2407 π | 8.53                     |
| Llama3.1-70B-Instruct π       | 8.23                     |
| Llama-3.1-405B-Instruct-FP8 π | 8.36                     |
| GPT-4o-0513 π                 | 8.59                     |
| Claude-3.5-Sonnet-20240620 π  | 6.96                     |
| Yi-Lightning π                | **8.75** (Top-1 among π) |
| Yi-Large-Preview π            | 8.32                     |
**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

|                    | Score    |
| ------------------ | -------- |
| **Xwen-7B-Chat** π | **7.98** |
| Qwen2.5-7B-Chat π  | 7.71     |
## References

[1] Yang, An, et al. "Qwen2.5 technical report." arXiv preprint arXiv:2412.15115 (2024).