---
license: llama3.1
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Language Model Evaluation Results

## Overview

This document presents the evaluation results of `Llama-3.1-8B-Instruct-gptq-4bit` using the **Language Model Evaluation Harness** on the **ARC-Challenge** benchmark.

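For reproducibility, a run with these settings can typically be expressed through the Python API of EleutherAI's `lm-evaluation-harness` (`lm_eval.simple_evaluate` in recent 0.4.x releases). The sketch below mirrors the configuration reported in this card (zero-shot, `hf` backend, float16, batch size 1); the repository id is a placeholder, not the actual upload path.

```python
# Minimal sketch, assuming lm-eval >= 0.4 is installed and the quantized
# checkpoint is available on the Hub. "your-namespace/..." is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face backend, as used for this report
    model_args=(
        "pretrained=your-namespace/Llama-3.1-8B-Instruct-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=0,   # zero-shot, matching the dataset settings below
    batch_size=1,
)

# Per-task metrics (acc, acc_norm and their standard errors) live under "results".
print(results["results"]["arc_challenge"])
```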
---

## 📊 Evaluation Summary

| **Metric** | **Value** | **Description** |
|------------|-----------|-----------------|
| **Accuracy (acc,none)** | `47.1%` | Raw accuracy: percentage of correct answers. |
| **Standard Error (acc_stderr,none)** | `1.46%` | Uncertainty in the accuracy estimate. |
| **Normalized Accuracy (acc_norm,none)** | `49.9%` | Accuracy after dataset-specific normalization. |
| **Standard Error (acc_norm_stderr,none)** | `1.46%` | Uncertainty for the normalized accuracy. |

📌 **Interpretation:**
- The model correctly answered **47.1%** of the questions.
- After **normalization**, accuracy improves slightly to **49.9%**.
- The **standard error (~1.46%)** indicates a small margin of uncertainty; a quick confidence-interval check follows below.

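As a quick check on the reported margin of uncertainty, a normal-approximation 95% confidence interval can be computed directly from the accuracy and its standard error:

```python
# 95% confidence interval from the reported accuracy and standard error
# (normal approximation: estimate +/- 1.96 * stderr).
acc, stderr = 0.471, 0.0146
low, high = acc - 1.96 * stderr, acc + 1.96 * stderr
print(f"95% CI: {low:.1%} - {high:.1%}")  # roughly 44.2% - 50.0%
```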
---

## ⚙️ Model Configuration

- **Model:** `Llama-3.1-8B-Instruct-gptq-4bit`
- **Parameters:** `1.05 billion` as reported for the packed 4-bit checkpoint (the FP16 base model has ~8 billion parameters)
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** `NVIDIA A100 80GB PCIe`
- **CUDA Version:** `12.4`
- **PyTorch Version:** `2.6.0+cu124`
- **Batch Size:** `1`
- **Evaluation Time:** `365.89 seconds (~6 minutes)`

📌 **Interpretation:**
- The evaluation was performed on a **high-performance GPU (A100 80GB)**.
- The model is **4-bit quantized**, which reduces memory usage but may cost some accuracy (a loading sketch for the quantized checkpoint follows below).
- A **batch size of 1** was used, which slows evaluation.

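For illustration, a GPTQ checkpoint like this is typically loaded through the standard `transformers` API, assuming the repository includes a GPTQ quantization config and a GPTQ backend (e.g. `optimum` with `auto-gptq` or `gptqmodel`) is installed; the repository id below is again a placeholder:

```python
# Loading sketch (assumptions: GPTQ config in the repo, GPTQ backend installed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-namespace/Llama-3.1-8B-Instruct-gptq-4bit"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # matches the precision used in this evaluation
    device_map="auto",          # places the 4-bit weights on the available GPU
)

prompt = "Which gas do plants absorb during photosynthesis?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```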
---

## 📚 Dataset Information

- **Dataset:** `AI2 ARC-Challenge`
- **Task Type:** `Multiple Choice`
- **Number of Samples Evaluated:** `1,172`
- **Few-shot Examples Used:** `0` (zero-shot setting)

📌 **Interpretation:**
- This benchmark assesses **grade-school-level scientific reasoning**.
- Since **no few-shot examples** were provided, the model was evaluated in a **pure zero-shot setting**; a snippet for inspecting the test split follows below.

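For reference, the evaluated split can be inspected with the `datasets` library; `allenai/ai2_arc` with the `ARC-Challenge` config exposes a test split of 1,172 multiple-choice questions (field names as published on the Hub):

```python
# Sketch: inspecting the ARC-Challenge test split used for this evaluation.
from datasets import load_dataset

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
print(len(arc))             # 1172 multiple-choice questions
sample = arc[0]
print(sample["question"])   # question text
print(sample["choices"])    # answer options ("text" and "label" lists)
print(sample["answerKey"])  # gold label
```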
---

## 📈 Performance Insights

- The `"higher_is_better"` flag in the harness output confirms that **higher accuracy is preferred**.
- The model's **raw accuracy (47.1%)** is moderate compared to state-of-the-art models (**60–80%** on ARC-Challenge).
- **Quantization impact:** the **4-bit quantized model** may perform slightly worse than the full-precision `meta-llama/Llama-3.1-8B-Instruct`.
- **Zero-shot limitation:** performance could improve with **few-shot prompting** (providing worked examples before each question); see the sketch after this list.

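To explore the zero-shot limitation noted above, the same harness call can be repeated with in-context examples; the 25-shot setting below is only a hypothetical follow-up (a commonly used configuration for ARC-Challenge), not a run that was performed for this report:

```python
# Hypothetical follow-up run: same task, but with in-context examples.
import lm_eval

fewshot_results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-namespace/Llama-3.1-8B-Instruct-gptq-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,  # example few-shot setting; 0 shots were used in this report
    batch_size=1,
)
print(fewshot_results["results"]["arc_challenge"])
```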
---

🚀 Let us know if you need further analysis or model tuning! 🚀