Update README.md
README.md CHANGED
@@ -25,8 +25,8 @@ use any gguf connector to interact with gguf file(s), i.e., [connector](https://
 
 ### appendices: model evaluation (written by deepseek-ai)
 
-####
-
+#### deepseek-r1-evaluation
+for all our (here refers to deepseek-ai) models, the maximum generation length is set to 32,768 tokens; for benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
 
 | Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
 |----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
@@ -55,7 +55,7 @@ use any gguf connector to interact with gguf file(s), i.e., [connector](https://
 | | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
 | | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |
 
-
+#### distilled model evaluation
 
 | Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
 |------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|
@@ -69,3 +69,5 @@ use any gguf connector to interact with gguf file(s), i.e., [connector](https://
 | DeepSeek-R1-Distill-Qwen-32B | **72.6** | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
 | DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
 | DeepSeek-R1-Distill-Llama-70B | 70.0 | **86.7** | **94.5** | **65.2** | **57.5** | 1633 |
+
+\* these two tables are quoted from deepseek-ai
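
as a note on the added evaluation text: below is a minimal sketch of how the metrics in these tables can be estimated from the stated sampling protocol (64 responses per query at temperature $0.6$, top-p $0.95$) — pass@1 as mean correctness over the samples, cons@64 as a majority vote over the same samples. the helper `extract_answer`, the `reference` argument, and the exact-match scoring are illustrative assumptions, not deepseek-ai's actual harness (answer checking is benchmark-specific).

```python
from collections import Counter

def extract_answer(response: str) -> str:
    """Illustrative stub: pull the final answer out of a sampled response.
    Real benchmarks use benchmark-specific parsing and matching."""
    return response.strip().splitlines()[-1]

def pass_at_1(responses: list[str], reference: str) -> float:
    """pass@1 estimated as mean correctness over k sampled responses
    (k = 64 in the protocol above)."""
    k = len(responses)
    correct = sum(extract_answer(r) == reference for r in responses)
    return correct / k

def cons_at_k(responses: list[str], reference: str) -> float:
    """cons@64: take the most frequent answer across the k samples
    (majority vote) and score that single consensus answer."""
    votes = Counter(extract_answer(r) for r in responses)
    consensus, _ = votes.most_common(1)[0]
    return float(consensus == reference)
```

averaging `pass_at_1` (or `cons_at_k`) over every query in a benchmark yields aggregate numbers of the kind reported in the tables above.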