Commit fafa8b7 by shenzhi-wang · verified · 1 Parent(s): 51f420f

Update README.md

Files changed (1)
  1. README.md +17 -33
README.md CHANGED
@@ -89,45 +89,29 @@ print(response)

  ### 3.1 Arena-Hard-Auto

- All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).

  #### 3.1.1 No Style Control

- | | Score | 95% CIs |
- | --------------------------------- | ------------------------ | ----------- |
- | **Xwen-72B-Chat** 🔑 | **86.1** (Top-1 among 🔑) | (-1.5, 1.7) |
- | Qwen2.5-72B-Instruct 🔑 | 78.0 | (-1.8, 1.8) |
- | Athene-v2-Chat 🔑 | 85.0 | (-1.4, 1.7) |
- | Llama-3.1-Nemotron-70B-Instruct 🔑 | 84.9 | (-1.7, 1.8) |
- | Llama-3.1-405B-Instruct-FP8 🔑 | 69.3 | (-2.4, 2.2) |
- | Claude-3-5-Sonnet-20241022 🔒 | 85.2 | (-1.4, 1.6) |
- | O1-Preview-2024-09-12 🔒 | **92.0** (Top-1 among 🔒) | (-1.2, 1.0) |
- | O1-Mini-2024-09-12 🔒 | 90.4 | (-1.1, 1.3) |
- | GPT-4-Turbo-2024-04-09 🔒 | 82.6 | (-1.8, 1.5) |
- | GPT-4-0125-Preview 🔒 | 78.0 | (-2.1, 2.4) |
- | GPT-4o-2024-08-06 🔒 | 77.9 | (-2.0, 2.1) |
- | Yi-Lightning 🔒 | 81.5 | (-1.6, 1.6) |
- | Yi-Large 🔒 | 63.7 | (-2.6, 2.4) |
- | GLM-4-0520 🔒 | 63.8 | (-2.9, 2.8) |

  #### 3.1.2 Style Control

- | | Score | 95% CIs |
- | --------------------------------- | ------------------------ | ----------- |
- | **Xwen-72B-Chat** 🔑 | **72.4** (Top-1 Among 🔑) | (-4.3, 4.1) |
- | Qwen2.5-72B-Instruct 🔑 | 63.3 | (-2.5, 2.3) |
- | Athene-v2-Chat 🔑 | 72.1 | (-2.5, 2.5) |
- | Llama-3.1-Nemotron-70B-Instruct 🔑 | 71.0 | (-2.8, 3.1) |
- | Llama-3.1-405B-Instruct-FP8 🔑 | 67.1 | (-2.2, 2.8) |
- | Claude-3-5-Sonnet-20241022 🔒 | **86.4** (Top-1 Among 🔒) | (-1.3, 1.3) |
- | O1-Preview-2024-09-12 🔒 | 81.7 | (-2.2, 2.1) |
- | O1-Mini-2024-09-12 🔒 | 79.3 | (-2.8, 2.3) |
- | GPT-4-Turbo-2024-04-09 🔒 | 74.3 | (-2.4, 2.4) |
- | GPT-4-0125-Preview 🔒 | 73.6 | (-2.0, 2.0) |
- | GPT-4o-2024-08-06 🔒 | 71.1 | (-2.5, 2.0) |
- | Yi-Lightning 🔒 | 66.9 | (-3.3, 2.7) |
- | Yi-Large-Preview 🔒 | 65.1 | (-2.5, 2.5) |
- | GLM-4-0520 🔒 | 61.4 | (-2.6, 2.4) |


  ### 3.1 Arena-Hard-Auto

+ All results below, except those for `Xwen-7B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).

  #### 3.1.1 No Style Control

+ | | Score | 95% CIs |
+ | ----------------------- | -------- | ----------- |
+ | **Xwen-7B-Chat** 🔑 | **59.4** | (-2.4, 2.1) |
+ | Qwen2.5-7B-Instruct 🔑 | 50.4 | (-2.9, 2.5) |
+ | Gemma-2-27B-IT 🔑 | 57.5 | (-2.1, 2.4) |
+ | Llama-3.1-8B-Instruct 🔑 | 21.3 | (-1.9, 2.2) |
+ | Llama-3-8B-Instruct 🔑 | 20.6 | (-2.0, 1.9) |
+ | Starling-LM-7B-beta 🔑 | 23.0 | (-1.8, 1.8) |

  #### 3.1.2 Style Control

+ | | Score | 95% CIs |
+ | ----------------------- | -------- | ----------- |
+ | **Xwen-7B-Chat** 🔑 | **50.3** | (-3.8, 2.8) |
+ | Qwen2.5-7B-Instruct 🔑 | 46.9 | (-3.1, 2.7) |
+ | Gemma-2-27B-IT 🔑 | 47.5 | (-2.5, 2.7) |
+ | Llama-3.1-8B-Instruct 🔑 | 18.3 | (-1.6, 1.6) |
+ | Llama-3-8B-Instruct 🔑 | 19.8 | (-1.6, 1.9) |
+ | Starling-LM-7B-beta 🔑 | 26.1 | (-2.6, 2.0) |
