shenzhi-wang committed · verified
Commit a8851bc · 1 parent: 73a0ca0

Update README.md

Files changed (1): README.md (+82, -4)
README.md CHANGED
 
@@ -90,12 +90,34 @@ print(response)

πŸ”’: Proprietary

- ### 3.1 Arena-Hard-Auto
+ ### 3.1 Arena-Hard-Auto-v0.1

- All results below, except those for `Xwen-7B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
+ All results below, except those for the Xwen models, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).

#### 3.1.1 No Style Control

+ **Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
+ | | Score | 95% CIs |
+ | --------------------------------- | ------------------------ | ----------- |
+ | **Xwen-72B-Chat** πŸ”‘ | **86.1** (Top-1 among πŸ”‘) | (-1.5, 1.7) |
+ | Qwen2.5-72B-Instruct πŸ”‘ | 78.0 | (-1.8, 1.8) |
+ | Athene-v2-Chat πŸ”‘ | 85.0 | (-1.4, 1.7) |
+ | Llama-3.1-Nemotron-70B-Instruct πŸ”‘ | 84.9 | (-1.7, 1.8) |
+ | Llama-3.1-405B-Instruct-FP8 πŸ”‘ | 69.3 | (-2.4, 2.2) |
+ | Claude-3-5-Sonnet-20241022 πŸ”’ | 85.2 | (-1.4, 1.6) |
+ | O1-Preview-2024-09-12 πŸ”’ | **92.0** (Top-1 among πŸ”’) | (-1.2, 1.0) |
+ | O1-Mini-2024-09-12 πŸ”’ | 90.4 | (-1.1, 1.3) |
+ | GPT-4-Turbo-2024-04-09 πŸ”’ | 82.6 | (-1.8, 1.5) |
+ | GPT-4-0125-Preview πŸ”’ | 78.0 | (-2.1, 2.4) |
+ | GPT-4o-2024-08-06 πŸ”’ | 77.9 | (-2.0, 2.1) |
+ | Yi-Lightning πŸ”’ | 81.5 | (-1.6, 1.6) |
+ | Yi-Large πŸ”’ | 63.7 | (-2.6, 2.4) |
+ | GLM-4-0520 πŸ”’ | 63.8 | (-2.9, 2.8) |
+
+ **Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
| | Score | 95% CIs |
| ----------------------- | -------- | ----------- |
| **Xwen-7B-Chat** πŸ”‘ | **59.4** | (-2.4, 2.1) |
 
@@ -105,8 +127,31 @@
| Llama-3-8B-Instruct πŸ”‘ | 20.6 | (-2.0, 1.9) |
| Starling-LM-7B-beta πŸ”‘ | 23.0 | (-1.8, 1.8) |

#### 3.1.2 Style Control

+ **Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
+ | | Score | 95% CIs |
+ | --------------------------------- | ------------------------ | ----------- |
+ | **Xwen-72B-Chat** πŸ”‘ | **72.4** (Top-1 among πŸ”‘) | (-4.3, 4.1) |
+ | Qwen2.5-72B-Instruct πŸ”‘ | 63.3 | (-2.5, 2.3) |
+ | Athene-v2-Chat πŸ”‘ | 72.1 | (-2.5, 2.5) |
+ | Llama-3.1-Nemotron-70B-Instruct πŸ”‘ | 71.0 | (-2.8, 3.1) |
+ | Llama-3.1-405B-Instruct-FP8 πŸ”‘ | 67.1 | (-2.2, 2.8) |
+ | Claude-3-5-Sonnet-20241022 πŸ”’ | **86.4** (Top-1 among πŸ”’) | (-1.3, 1.3) |
+ | O1-Preview-2024-09-12 πŸ”’ | 81.7 | (-2.2, 2.1) |
+ | O1-Mini-2024-09-12 πŸ”’ | 79.3 | (-2.8, 2.3) |
+ | GPT-4-Turbo-2024-04-09 πŸ”’ | 74.3 | (-2.4, 2.4) |
+ | GPT-4-0125-Preview πŸ”’ | 73.6 | (-2.0, 2.0) |
+ | GPT-4o-2024-08-06 πŸ”’ | 71.1 | (-2.5, 2.0) |
+ | Yi-Lightning πŸ”’ | 66.9 | (-3.3, 2.7) |
+ | Yi-Large-Preview πŸ”’ | 65.1 | (-2.5, 2.5) |
+ | GLM-4-0520 πŸ”’ | 61.4 | (-2.6, 2.4) |
+
+ **Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
| | Score | 95% CIs |
| ----------------------- | -------- | ----------- |
| **Xwen-7B-Chat** πŸ”‘ | **50.3** | (-3.8, 2.8) |
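A note on reading the intervals above: Arena-Hard-Auto reports a bootstrapped 95% confidence interval around each model's score against a fixed baseline. The sketch below shows a plain percentile bootstrap over per-battle outcomes; it is illustrative only (the repository itself bootstraps Bradley-Terry coefficients), and `outcomes` is a hypothetical array of win/tie/loss scores.

```python
import numpy as np

def bootstrap_ci(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for a mean win rate.

    outcomes: 1-D array of per-battle scores against the baseline,
    with 1.0 = win, 0.5 = tie, 0.0 = loss.
    """
    rng = np.random.default_rng(seed)
    n = len(outcomes)
    # Resample battles with replacement and recompute the mean each time.
    means = np.array([
        rng.choice(outcomes, size=n, replace=True).mean()
        for _ in range(n_boot)
    ])
    point = outcomes.mean()
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    # Report the interval as offsets from the point estimate, as in the tables.
    return point, (lo - point, hi - point)

# Example: 500 simulated battles with a 60% win rate.
sim = np.random.default_rng(1).binomial(1, 0.6, size=500).astype(float)
score, (ci_lo, ci_hi) = bootstrap_ci(sim)
print(f"{100 * score:.1f} ({100 * ci_lo:+.1f}, {100 * ci_hi:+.1f})")
```

The style-control results in 3.1.2 come from the same battles after adjusting for stylistic confounders such as answer length and markdown density, which is why those scores and intervals differ from the raw ones in 3.1.1.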
 
@@ -122,6 +167,24 @@
> [!IMPORTANT]
> We replaced AlignBench's original judge model, `GPT-4-0613`, with the more powerful `GPT-4o-0513`. For a fair comparison, all results below are scored by `GPT-4o-0513`, so they may differ from AlignBench-v1.1 scores reported elsewhere.

+ **Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
+ | | Score |
+ | ----------------------------- | ------------------------ |
+ | **Xwen-72B-Chat** πŸ”‘ | **7.57** (Top-1 among πŸ”‘) |
+ | Qwen2.5-72B-Instruct πŸ”‘ | 7.51 |
+ | Deepseek V2.5 πŸ”‘ | 7.38 |
+ | Mistral-Large-Instruct-2407 πŸ”‘ | 7.10 |
+ | Llama3.1-70B-Instruct πŸ”‘ | 5.81 |
+ | Llama-3.1-405B-Instruct-FP8 πŸ”‘ | 5.56 |
+ | GPT-4o-0513 πŸ”’ | **7.59** (Top-1 among πŸ”’) |
+ | Claude-3.5-Sonnet-20240620 πŸ”’ | 7.17 |
+ | Yi-Lightning πŸ”’ | 7.54 |
+ | Yi-Large-Preview πŸ”’ | 7.20 |
+
+ **Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
| | Score |
| ------------------ | -------- |
| **Xwen-7B-Chat** πŸ”‘ | **6.88** |
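The judge swap described in this callout (and in the MT-Bench callout below) amounts to pointing the evaluation harness at a different grader model. Here is a minimal sketch of a single judging call via the OpenAI Python SDK; the grading prompt and the `Rating: [[x]]` parsing are illustrative rather than AlignBench's actual template, and `gpt-4o-2024-05-13` is assumed to be the dated alias of GPT-4o-0513.

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer: str) -> float:
    """Ask the judge model to grade one answer on a 1-10 scale."""
    completion = client.chat.completions.create(
        model="gpt-4o-2024-05-13",  # assumed dated alias of GPT-4o-0513
        temperature=0,              # deterministic grading
        messages=[
            {"role": "system",
             "content": ("You are a strict grader. Rate the assistant's answer "
                         "to the question on a 1-10 scale and end your reply "
                         "with 'Rating: [[x]]'.")},
            {"role": "user",
             "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    text = completion.choices[0].message.content
    match = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else float("nan")
```

Because judges differ systematically in harshness, scores graded by `GPT-4o-0513` are only comparable with other scores graded by the same judge, which is exactly the caveat these callouts raise.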
 
@@ -132,14 +195,29 @@
> [!IMPORTANT]
> We replaced MT-Bench's original judge model, `GPT-4`, with the more powerful `GPT-4o-0513`. For a fair comparison, all results below are scored by `GPT-4o-0513`, so they may differ from MT-Bench scores reported elsewhere.

+ **Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**
+
+ | | Score |
+ | ----------------------------- | ------------------------ |
+ | **Xwen-72B-Chat** πŸ”‘ | **8.64** (Top-1 among πŸ”‘) |
+ | Qwen2.5-72B-Instruct πŸ”‘ | 8.62 |
+ | Deepseek V2.5 πŸ”‘ | 8.43 |
+ | Mistral-Large-Instruct-2407 πŸ”‘ | 8.53 |
+ | Llama3.1-70B-Instruct πŸ”‘ | 8.23 |
+ | Llama-3.1-405B-Instruct-FP8 πŸ”‘ | 8.36 |
+ | GPT-4o-0513 πŸ”’ | 8.59 |
+ | Claude-3.5-Sonnet-20240620 πŸ”’ | 6.96 |
+ | Yi-Lightning πŸ”’ | **8.75** (Top-1 among πŸ”’) |
+ | Yi-Large-Preview πŸ”’ | 8.32 |
+
+ **Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**
+
| | Score |
| ------------------ | -------- |
| **Xwen-7B-Chat** πŸ”‘ | **7.98** |
| Qwen2.5-7B-Chat πŸ”‘ | 7.71 |


-
-
## References

[1] Yang, An, et al. "Qwen2.5 technical report." arXiv preprint arXiv:2412.15115 (2024).