shenzhi-wang committed · Commit d812772 · verified · 1 Parent(s): ecf1e15

Update README.md

Files changed (1): README.md (+39 -16)
@@ -88,23 +88,46 @@ print(response)
  🔒: Proprietary

  ### 3.1 Arena-Hard-Auto
- |                                   | Score    | 95% CIs     |
- | --------------------------------- | -------- | ----------- |
- | **Xwen-72B-Chat** 🔑               | **86.1** | (-1.5, 1.7) |
- | Qwen2.5-72B-Chat 🔑                | 63.3     | (-2.5, 2.3) |
- | Athene-v2-Chat 🔑                  | 72.1     | (-2.5, 2.5) |
- | Llama-3.1-Nemotron-70B-Instruct 🔑 | 71.0     | (-2.8, 3.1) |
- | Llama-3.1-405B-Instruct-FP8 🔑     | 67.1     | (-2.2, 2.8) |
- | Claude-3-5-Sonnet-20241022 🔒      | **86.4** | (-1.3, 1.3) |
- | O1-Preview-2024-09-12 🔒           | 81.7     | (-2.2, 2.1) |
- | O1-Mini-2024-09-12 🔒              | 79.3     | (-2.8, 2.3) |
- | GPT-4-Turbo-2024-04-09 🔒          | 74.3     | (-2.4, 2.4) |
- | GPT-4-0125-Preview 🔒              | 73.6     | (-2.0, 2.0) |
- | GPT-4o-2024-08-06 🔒               | 71.1     | (-2.5, 2.0) |
- | Yi-Lightning 🔒                    | 66.9     | (-3.3, 2.7) |
- | Yi-Large-Preview 🔒                | 65.1     | (-2.5, 2.5) |
- | GLM-4-0520 🔒                      | 61.4     | (-2.6, 2.4) |
+ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
+
+ #### 3.1.1 No Style Control
+
+ |                                   | Score                    | 95% CIs     |
+ | --------------------------------- | ------------------------ | ----------- |
+ | **Xwen-72B-Chat** 🔑               | **86.1** (Top-1 among 🔑) | (-1.5, 1.7) |
+ | Qwen2.5-72B-Chat 🔑                | 78.0                     | (-1.8, 1.8) |
+ | Athene-v2-Chat 🔑                  | 85.0                     | (-1.4, 1.7) |
+ | Llama-3.1-Nemotron-70B-Instruct 🔑 | 84.9                     | (-1.7, 1.8) |
+ | Llama-3.1-405B-Instruct-FP8 🔑     | 69.3                     | (-2.4, 2.2) |
+ | Claude-3-5-Sonnet-20241022 🔒      | 85.2                     | (-1.4, 1.6) |
+ | O1-Preview-2024-09-12 🔒           | **92.0** (Top-1 among 🔒) | (-1.2, 1.0) |
+ | O1-Mini-2024-09-12 🔒              | 90.4                     | (-1.1, 1.3) |
+ | GPT-4-Turbo-2024-04-09 🔒          | 82.6                     | (-1.8, 1.5) |
+ | GPT-4-0125-Preview 🔒              | 78.0                     | (-2.1, 2.4) |
+ | GPT-4o-2024-08-06 🔒               | 77.9                     | (-2.0, 2.1) |
+ | Yi-Lightning 🔒                    | 81.5                     | (-1.6, 1.6) |
+ | Yi-Large🔒                         | 63.7                     | (-2.6, 2.4) |
+ | GLM-4-0520 🔒                      | 63.8                     | (-2.9, 2.8) |
+
+ #### 3.1.2 Style Control
+
+ |                                   | Score                    | 95% CIs     |
+ | --------------------------------- | ------------------------ | ----------- |
+ | **Xwen-72B-Chat** 🔑               | **72.4** (Top-1 Among 🔑) | (-4.3, 4.1) |
+ | Qwen2.5-72B-Chat 🔑                | 63.3                     | (-2.5, 2.3) |
+ | Athene-v2-Chat 🔑                  | 72.1                     | (-2.5, 2.5) |
+ | Llama-3.1-Nemotron-70B-Instruct 🔑 | 71.0                     | (-2.8, 3.1) |
+ | Llama-3.1-405B-Instruct-FP8 🔑     | 67.1                     | (-2.2, 2.8) |
+ | Claude-3-5-Sonnet-20241022 🔒      | **86.4** (Top-1 Among 🔒) | (-1.3, 1.3) |
+ | O1-Preview-2024-09-12 🔒           | 81.7                     | (-2.2, 2.1) |
+ | O1-Mini-2024-09-12 🔒              | 79.3                     | (-2.8, 2.3) |
+ | GPT-4-Turbo-2024-04-09 🔒          | 74.3                     | (-2.4, 2.4) |
+ | GPT-4-0125-Preview 🔒              | 73.6                     | (-2.0, 2.0) |
+ | GPT-4o-2024-08-06 🔒               | 71.1                     | (-2.5, 2.0) |
+ | Yi-Lightning 🔒                    | 66.9                     | (-3.3, 2.7) |
+ | Yi-Large-Preview 🔒                | 65.1                     | (-2.5, 2.5) |
+ | GLM-4-0520 🔒                      | 61.4                     | (-2.6, 2.4) |
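As a reading aid for the "95% CIs" column: the sketch below assumes each pair is a (lower, upper) offset around the score, so a row's interval is `[score + lower, score + upper]`. That interpretation is an assumption (the README does not define the column), and the `Row`, `interval`, and `overlaps` names are hypothetical helpers, not part of Arena-Hard-Auto. The values are copied from the No Style Control table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    score: float
    ci_low: float   # assumed: negative offset below the score
    ci_high: float  # assumed: positive offset above the score


def interval(row: Row) -> tuple[float, float]:
    """Return the absolute (low, high) bounds implied by the offsets."""
    return (row.score + row.ci_low, row.score + row.ci_high)


def overlaps(a: Row, b: Row) -> bool:
    """True if the two rows' intervals share at least one point."""
    a_lo, a_hi = interval(a)
    b_lo, b_hi = interval(b)
    return a_lo <= b_hi and b_lo <= a_hi


# Rows taken from the "No Style Control" table.
xwen = Row("Xwen-72B-Chat", 86.1, -1.5, 1.7)
athene = Row("Athene-v2-Chat", 85.0, -1.4, 1.7)
qwen = Row("Qwen2.5-72B-Chat", 78.0, -1.8, 1.8)

print(overlaps(xwen, athene))  # True: the two intervals intersect
print(overlaps(xwen, qwen))    # False: the intervals are disjoint
```

Under this assumption, overlapping intervals (as for `Xwen-72B-Chat` and `Athene-v2-Chat`) suggest the gap between two scores may be within noise, while disjoint intervals point to a clearer separation.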