All results below, except those for `Xwen-7B-Chat`, are sourced from Arena-Hard.

| Starling-LM-7B-beta π | 26.1 | (-2.6, 2.0) |

### 3.2 AlignBench-v1.1

> [!IMPORTANT]
> We replaced the original judge model in AlignBench, `GPT-4-0613`, with the more powerful `GPT-4o-0513`. For fairness, all results below were generated with `GPT-4o-0513` as the judge, so they may differ from AlignBench-v1.1 scores reported elsewhere.

|                    | Score    |
| ------------------ | -------- |
| **Xwen-7B-Chat** π | **6.88** |
| Qwen2.5-7B-Chat π  | 6.56     |
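Swapping the judge model amounts to pointing the benchmark's grading step at a different model while keeping everything else fixed. As a rough illustration only (not the actual AlignBench or MT-Bench harness; the helper name and prompt wording below are hypothetical), a single-answer grading request for the judge might be assembled like this:

```python
# Illustrative sketch: building a chat-completions style payload that asks a
# judge model to score one answer. The prompt wording and helper are
# hypothetical, not the benchmarks' actual code.

JUDGE_MODEL = "gpt-4o-2024-05-13"  # the "GPT-4o-0513" judge used in the tables above

def build_judge_request(question: str, answer: str) -> dict:
    """Return a request payload asking the judge to rate one answer 1-10."""
    system = "You are an impartial judge. Rate the assistant's answer from 1 to 10."
    user = f"[Question]\n{question}\n\n[Assistant's Answer]\n{answer}\n\n[Rating]"
    return {
        "model": JUDGE_MODEL,
        "temperature": 0.0,  # deterministic judging
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

req = build_judge_request("What is 2 + 2?", "4")
```

Because only `JUDGE_MODEL` changes between the original setup and ours, scores remain comparable within a table but not across judges, which is why the numbers here may differ from those reported elsewhere.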
### 3.3 MT-Bench

> [!IMPORTANT]
> We replaced the original judge model in MT-Bench, `GPT-4`, with the more powerful `GPT-4o-0513`. For fairness, all results below were generated with `GPT-4o-0513` as the judge, so they may differ from MT-Bench scores reported elsewhere.

|                    | Score    |
| ------------------ | -------- |
| **Xwen-7B-Chat** π | **7.98** |
| Qwen2.5-7B-Chat π  | 7.71     |

## References