---
license: cc-by-4.0
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
pipeline_tag: text-generation
---

# Overview

This document presents the evaluation results of `DeepSeek-R1-Distill-Llama-70B`, **quantized to 4-bit with GPTQ**, evaluated with the **Language Model Evaluation Harness** on the **ARC-Challenge** benchmark.

---

## 📊 Evaluation Summary

| **Metric** | **4-bit** | **Description** | **8-bit** |
|------------|-----------|-----------------|-----------|
| **Accuracy (`acc,none`)** | `21.2%` | Raw accuracy: percentage of correctly answered questions. | `21.2%` |
| **Standard Error (`acc_stderr,none`)** | `1.19%` | Uncertainty in the accuracy estimate. | `1.2%` |
| **Normalized Accuracy (`acc_norm,none`)** | `25.4%` | Accuracy with answer choices scored by length-normalized log-likelihood. | `25.2%` |
| **Standard Error (`acc_norm_stderr,none`)** | `1.27%` | Uncertainty in the normalized-accuracy estimate. | `1.3%` |

📌 **Interpretation:**

- The model answered **21.2% of the questions** correctly.
- After **normalization**, accuracy improves slightly to **25.4%**.
- The **standard errors (~1.2–1.3%)** indicate a small margin of uncertainty.

---

## ⚙️ Model Configuration

- **Model:** `DeepSeek-R1-Distill-Llama-70B`
- **Parameters:** `70 billion`
- **Quantization:** `4-bit GPTQ`
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** `NVIDIA A100 80GB PCIe`
- **CUDA Version:** `12.4`
- **PyTorch Version:** `2.6.0+cu124`
- **Batch Size:** `1`
- **Evaluation Time:** `365.89 seconds (~6 minutes)`

📌 **Interpretation:**

- The evaluation was performed on a **high-performance GPU (A100 80GB)**.
- The model is significantly larger than the previously evaluated 8B version, with **GPTQ 4-bit quantization reducing the memory footprint**.
- A **batch size of 1** was used, which limits evaluation throughput.

---

## 📂 Dataset Information

- **Dataset:** `AI2 ARC-Challenge`
- **Task Type:** `Multiple Choice`
- **Number of Samples Evaluated:** `1,172`
- **Few-shot Examples Used:** `0` (zero-shot setting)

📌 **Interpretation:**

- This benchmark assesses **grade-school-level scientific reasoning**.
- Since **no few-shot examples** were provided, the model was evaluated in a **pure zero-shot setting**.

---

## 📈 Performance Insights

- The `"higher_is_better"` flag in the harness output confirms that **higher accuracy is preferred** for these metrics.
- The model's **raw accuracy (21.2%)** is roughly at the random-guess level for four-option multiple choice (~25%) and significantly lower than state-of-the-art models (**60–80%** on ARC-Challenge).
- **Quantization Impact:** The **4-bit GPTQ quantization** reduces memory usage but may also slightly reduce accuracy.
- **Zero-shot Limitation:** Performance could improve with **few-shot prompting** (providing worked examples before each test question); see the sketches below.

---

📌 Let us know if you need further analysis or model tuning! 🚀
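
---

## 🧪 Reproducing the Evaluation (Sketch)

For reference, the zero-shot run described above can be approximated with the harness's Python API. This is a minimal sketch, not the exact command used for this card: the repository id of the 4-bit GPTQ checkpoint is a placeholder, and argument names may differ slightly across `lm-eval` versions (this assumes v0.4+).

```python
# Minimal sketch of the zero-shot ARC-Challenge run with lm-evaluation-harness (v0.4+).
# NOTE: "your-org/DeepSeek-R1-Distill-Llama-70B-GPTQ-4bit" is a placeholder, not the
# actual repository of the quantized checkpoint evaluated here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                      # Hugging Face backend, as reported under "Source"
    model_args=(
        "pretrained=your-org/DeepSeek-R1-Distill-Llama-70B-GPTQ-4bit,"
        "dtype=float16"              # matches the torch.float16 precision above
    ),
    tasks=["arc_challenge"],
    num_fewshot=0,                   # zero-shot, as in this evaluation
    batch_size=1,                    # single-sample batches, as in this evaluation
    device="cuda:0",
)

# The harness reports the same metric keys used in the summary table above.
arc = results["results"]["arc_challenge"]
print(arc["acc,none"], arc["acc_norm,none"])
```

Loading a GPTQ checkpoint through `transformers` typically also requires a GPTQ backend (e.g. `auto-gptq` or `gptqmodel`) to be installed in the environment.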
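
To try the few-shot prompting suggested under Performance Insights, only `num_fewshot` needs to change; the value of 25 below is illustrative (it mirrors a common ARC-Challenge leaderboard setting) and was not run for this card.

```python
# Hypothetical 25-shot re-run; all other arguments are unchanged from the
# zero-shot sketch above. Results for this setting are NOT reported in this card.
results_fewshot = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/DeepSeek-R1-Distill-Llama-70B-GPTQ-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,                  # in-context examples prepended to each question
    batch_size=1,
    device="cuda:0",
)
```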