Update README.md
README.md
CHANGED
@@ -88,23 +88,46 @@ print(response)
 π: Proprietary

 ### 3.1 Arena-Hard-Auto
-|                                   | Score    | 95% CIs     |
-| --------------------------------- | -------- | ----------- |
-| **Xwen-72B-Chat** π               | **86.1** | (-1.5, 1.7) |
-| Qwen2.5-72B-Chat π                | 63.3     | (-2.5, 2.3) |
-| Athene-v2-Chat π                  | 72.1     | (-2.5, 2.5) |
-| Llama-3.1-Nemotron-70B-Instruct π | 71.0     | (-2.8, 3.1) |
-| Llama-3.1-405B-Instruct-FP8 π     | 67.1     | (-2.2, 2.8) |
-| Claude-3-5-Sonnet-20241022 π      | **86.4** | (-1.3, 1.3) |
-| O1-Preview-2024-09-12 π           | 81.7     | (-2.2, 2.1) |
-| O1-Mini-2024-09-12 π              | 79.3     | (-2.8, 2.3) |
-| GPT-4-Turbo-2024-04-09 π          | 74.3     | (-2.4, 2.4) |
-| GPT-4-0125-Preview π              | 73.6     | (-2.0, 2.0) |
-| GPT-4o-2024-08-06 π               | 71.1     | (-2.5, 2.0) |
-| Yi-Lightning π                    | 66.9     | (-3.3, 2.7) |
-| Yi-Large-Preview π                | 65.1     | (-2.5, 2.5) |
-| GLM-4-0520 π                      | 61.4     | (-2.6, 2.4) |

+All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
+
+#### 3.1.1 No Style Control
+
+|                                   | Score                    | 95% CIs     |
+| --------------------------------- | ------------------------ | ----------- |
+| **Xwen-72B-Chat** π               | **86.1** (Top-1 among π) | (-1.5, 1.7) |
+| Qwen2.5-72B-Chat π                | 78.0                     | (-1.8, 1.8) |
+| Athene-v2-Chat π                  | 85.0                     | (-1.4, 1.7) |
+| Llama-3.1-Nemotron-70B-Instruct π | 84.9                     | (-1.7, 1.8) |
+| Llama-3.1-405B-Instruct-FP8 π     | 69.3                     | (-2.4, 2.2) |
+| Claude-3-5-Sonnet-20241022 π      | 85.2                     | (-1.4, 1.6) |
+| O1-Preview-2024-09-12 π           | **92.0** (Top-1 among π) | (-1.2, 1.0) |
+| O1-Mini-2024-09-12 π              | 90.4                     | (-1.1, 1.3) |
+| GPT-4-Turbo-2024-04-09 π          | 82.6                     | (-1.8, 1.5) |
+| GPT-4-0125-Preview π              | 78.0                     | (-2.1, 2.4) |
+| GPT-4o-2024-08-06 π               | 77.9                     | (-2.0, 2.1) |
+| Yi-Lightning π                    | 81.5                     | (-1.6, 1.6) |
+| Yi-Large π                        | 63.7                     | (-2.6, 2.4) |
+| GLM-4-0520 π                      | 63.8                     | (-2.9, 2.8) |
+
+#### 3.1.2 Style Control
+
+|                                   | Score                    | 95% CIs     |
+| --------------------------------- | ------------------------ | ----------- |
+| **Xwen-72B-Chat** π               | **72.4** (Top-1 Among π) | (-4.3, 4.1) |
+| Qwen2.5-72B-Chat π                | 63.3                     | (-2.5, 2.3) |
+| Athene-v2-Chat π                  | 72.1                     | (-2.5, 2.5) |
+| Llama-3.1-Nemotron-70B-Instruct π | 71.0                     | (-2.8, 3.1) |
+| Llama-3.1-405B-Instruct-FP8 π     | 67.1                     | (-2.2, 2.8) |
+| Claude-3-5-Sonnet-20241022 π      | **86.4** (Top-1 Among π) | (-1.3, 1.3) |
+| O1-Preview-2024-09-12 π           | 81.7                     | (-2.2, 2.1) |
+| O1-Mini-2024-09-12 π              | 79.3                     | (-2.8, 2.3) |
+| GPT-4-Turbo-2024-04-09 π          | 74.3                     | (-2.4, 2.4) |
+| GPT-4-0125-Preview π              | 73.6                     | (-2.0, 2.0) |
+| GPT-4o-2024-08-06 π               | 71.1                     | (-2.5, 2.0) |
+| Yi-Lightning π                    | 66.9                     | (-3.3, 2.7) |
+| Yi-Large-Preview π                | 65.1                     | (-2.5, 2.5) |
+| GLM-4-0520 π                      | 61.4                     | (-2.6, 2.4) |