calcuis committed (verified)
Commit bc3a1a2 · 1 Parent(s): d2fc767

Update README.md

Files changed (1):
1. README.md (+5 −3)
README.md CHANGED
@@ -25,8 +25,8 @@ use any gguf connector to interact with gguf file(s), i.e., [connector](https://
 
 ### appendices: model evaluation (written by deekseek-ai)
 
-#### DeepSeek-R1-Evaluation
-For all our (here refer to deekseek-ai) models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
+#### deepseek-r1-evaluation
+for all our (here refer to deekseek-ai) models, the maximum generation length is set to 32,768 tokens; for benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
 
 | Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
 |----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
@@ -55,7 +55,7 @@ use any gguf connector to interact with gguf file(s), i.e., [connector](https://
 | | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
 | | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |
 
-### Distilled Model Evaluation
+#### distilled model evaluation
 
 | Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
 |------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|
@@ -69,3 +69,5 @@ use any gguf connector to interact with gguf file(s), i.e., [connector](https://
 | DeepSeek-R1-Distill-Qwen-32B | **72.6** | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
 | DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
 | DeepSeek-R1-Distill-Llama-70B | 70.0 | **86.7** | **94.5** | **65.2** | **57.5** | 1633 |
+
+\* these two tables are quoted from deepseek-ai
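
As a side note on the evaluation text quoted in this diff: pass@1 there is presumably estimated as the average correctness of the 64 sampled responses per query, averaged over all queries. A minimal sketch of that estimator (the function name and correctness flags are illustrative, not part of this repository):

```python
# Sketch of a pass@1 estimate from k sampled responses per query
# (k = 64 in the quoted evaluation text). The correctness flags are
# hypothetical inputs, not produced by this repository.

def estimate_pass_at_1(per_query_flags):
    """per_query_flags: one list of booleans per query, holding the
    correctness of that query's k sampled responses."""
    rates = [sum(flags) / len(flags) for flags in per_query_flags]
    return sum(rates) / len(rates)

# toy example with k = 4 instead of 64:
# query 1 -> 3/4 correct, query 2 -> 1/4 correct, pass@1 = 0.5
print(estimate_pass_at_1([[True, False, True, True],
                          [False, False, True, False]]))
```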