Update README.md
README.md
π: Proprietary

### 3.1 Arena-Hard-Auto-v0.1

All results below, except those for the `Xwen` models, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
#### 3.1.1 No Style Control

**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

|                                   | Score                    | 95% CIs     |
| --------------------------------- | ------------------------ | ----------- |
| **Xwen-72B-Chat** π               | **86.1** (Top-1 among π) | (-1.5, 1.7) |
| Qwen2.5-72B-Instruct π            | 78.0                     | (-1.8, 1.8) |
| Athene-v2-Chat π                  | 85.0                     | (-1.4, 1.7) |
| Llama-3.1-Nemotron-70B-Instruct π | 84.9                     | (-1.7, 1.8) |
| Llama-3.1-405B-Instruct-FP8 π     | 69.3                     | (-2.4, 2.2) |
| Claude-3-5-Sonnet-20241022 π      | 85.2                     | (-1.4, 1.6) |
| O1-Preview-2024-09-12 π           | **92.0** (Top-1 among π) | (-1.2, 1.0) |
| O1-Mini-2024-09-12 π              | 90.4                     | (-1.1, 1.3) |
| GPT-4-Turbo-2024-04-09 π          | 82.6                     | (-1.8, 1.5) |
| GPT-4-0125-Preview π              | 78.0                     | (-2.1, 2.4) |
| GPT-4o-2024-08-06 π               | 77.9                     | (-2.0, 2.1) |
| Yi-Lightning π                    | 81.5                     | (-1.6, 1.6) |
| Yi-Large π                        | 63.7                     | (-2.6, 2.4) |
| GLM-4-0520 π                      | 63.8                     | (-2.9, 2.8) |
**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

|                         | Score    | 95% CIs     |
| ----------------------- | -------- | ----------- |
| **Xwen-7B-Chat** π      | **59.4** | (-2.4, 2.1) |
| …                       | …        | …           |
| Llama-3-8B-Instruct π   | 20.6     | (-2.0, 1.9) |
| Starling-LM-7B-beta π   | 23.0     | (-1.8, 1.8) |
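The `95% CIs` columns report bootstrapped 95% confidence intervals, written as offsets from each score. The snippet below is only a minimal illustration of how such an interval can be derived from per-battle outcomes; it uses made-up data and plain win-rate aggregation, not the actual Arena-Hard-Auto scoring pipeline, which is more involved.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical per-battle outcomes for one model: 1 = win, 0.5 = tie, 0 = loss.
battles = rng.choice([0.0, 0.5, 1.0], size=500, p=[0.25, 0.15, 0.60])
score = battles.mean() * 100  # point estimate, as a percentage

# Bootstrap: resample the battles with replacement and recompute the score.
resampled = np.array([
    rng.choice(battles, size=battles.size, replace=True).mean() * 100
    for _ in range(1_000)
])
low, high = np.percentile(resampled, [2.5, 97.5])

# Report the interval as offsets from the point estimate, matching the tables above.
print(f"Score: {score:.1f}, 95% CI: ({low - score:+.1f}, {high - score:+.1f})")
```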
#### 3.1.2 Style Control

**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

|                                   | Score                    | 95% CIs     |
| --------------------------------- | ------------------------ | ----------- |
| **Xwen-72B-Chat** π               | **72.4** (Top-1 among π) | (-4.3, 4.1) |
| Qwen2.5-72B-Instruct π            | 63.3                     | (-2.5, 2.3) |
| Athene-v2-Chat π                  | 72.1                     | (-2.5, 2.5) |
| Llama-3.1-Nemotron-70B-Instruct π | 71.0                     | (-2.8, 3.1) |
| Llama-3.1-405B-Instruct-FP8 π     | 67.1                     | (-2.2, 2.8) |
| Claude-3-5-Sonnet-20241022 π      | **86.4** (Top-1 among π) | (-1.3, 1.3) |
| O1-Preview-2024-09-12 π           | 81.7                     | (-2.2, 2.1) |
| O1-Mini-2024-09-12 π              | 79.3                     | (-2.8, 2.3) |
| GPT-4-Turbo-2024-04-09 π          | 74.3                     | (-2.4, 2.4) |
| GPT-4-0125-Preview π              | 73.6                     | (-2.0, 2.0) |
| GPT-4o-2024-08-06 π               | 71.1                     | (-2.5, 2.0) |
| Yi-Lightning π                    | 66.9                     | (-3.3, 2.7) |
| Yi-Large-Preview π                | 65.1                     | (-2.5, 2.5) |
| GLM-4-0520 π                      | 61.4                     | (-2.6, 2.4) |
**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

|                         | Score    | 95% CIs     |
| ----------------------- | -------- | ----------- |
| **Xwen-7B-Chat** π      | **50.3** | (-3.8, 2.8) |
| …                       | …        | …           |
> [!IMPORTANT]
> We replaced the original AlignBench judge model, `GPT-4-0613`, with the more powerful `GPT-4o-0513`. For fairness, all results below are judged by `GPT-4o-0513`, so they may differ from the AlignBench-v1.1 scores reported elsewhere.
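In practice, swapping the judge simply means directing the judging requests to `GPT-4o-0513` (API name `gpt-4o-2024-05-13`); the same substitution is used for MT-Bench below. The sketch that follows shows a generic LLM-as-judge call with that model via the OpenAI SDK; the prompt is a simplified placeholder, not the actual AlignBench judging template.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_MODEL = "gpt-4o-2024-05-13"  # API name corresponding to GPT-4o-0513

def judge(question: str, answer: str) -> str:
    """Ask the judge model to grade a single answer (simplified placeholder prompt)."""
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "You are an impartial judge. Rate the answer from 1 to 10 "
                           "and briefly justify the rating.",
            },
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(judge("Explain the difference between a list and a tuple in Python.", "..."))
```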
**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

|                               | Score                    |
| ----------------------------- | ------------------------ |
| **Xwen-72B-Chat** π           | **7.57** (Top-1 among π) |
| Qwen2.5-72B-Instruct π        | 7.51                     |
| Deepseek V2.5 π               | 7.38                     |
| Mistral-Large-Instruct-2407 π | 7.10                     |
| Llama3.1-70B-Instruct π       | 5.81                     |
| Llama-3.1-405B-Instruct-FP8 π | 5.56                     |
| GPT-4o-0513 π                 | **7.59** (Top-1 among π) |
| Claude-3.5-Sonnet-20240620 π  | 7.17                     |
| Yi-Lightning π                | 7.54                     |
| Yi-Large-Preview π            | 7.20                     |
**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

|                    | Score    |
| ------------------ | -------- |
| **Xwen-7B-Chat** π | **6.88** |
| …                  | …        |
> [!IMPORTANT]
> We replaced the original MT-Bench judge model, `GPT-4`, with the more powerful `GPT-4o-0513`. For fairness, all results below are judged by `GPT-4o-0513`, so they may differ from the MT-Bench scores reported elsewhere.
**Comparison of Xwen-72B-Chat with other LLMs at a comparable level:**

|                               | Score                    |
| ----------------------------- | ------------------------ |
| **Xwen-72B-Chat** π           | **8.64** (Top-1 among π) |
| Qwen2.5-72B-Instruct π        | 8.62                     |
| Deepseek V2.5 π               | 8.43                     |
| Mistral-Large-Instruct-2407 π | 8.53                     |
| Llama3.1-70B-Instruct π       | 8.23                     |
| Llama-3.1-405B-Instruct-FP8 π | 8.36                     |
| GPT-4o-0513 π                 | 8.59                     |
| Claude-3.5-Sonnet-20240620 π  | 6.96                     |
| Yi-Lightning π                | **8.75** (Top-1 among π) |
| Yi-Large-Preview π            | 8.32                     |
**Comparison of Xwen-7B-Chat with other LLMs at a comparable level:**

|                    | Score    |
| ------------------ | -------- |
| **Xwen-7B-Chat** π | **7.98** |
| Qwen2.5-7B-Chat π  | 7.71     |
## References

[1] Yang, An, et al. "Qwen2.5 technical report." arXiv preprint arXiv:2412.15115 (2024).