---
license: llama3.1
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---
# Language Model Evaluation Results

## Overview
This document presents the evaluation results of `Llama-3.1-8B-Instruct-gptq-4bit`, evaluated with the **Language Model Evaluation Harness** on the **ARC-Challenge** benchmark.
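
A minimal sketch of reproducing this run with the harness's Python API, assuming the checkpoint is published on the Hugging Face Hub (the repository id below is a placeholder, not the actual repo):

```python
# Sketch: zero-shot ARC-Challenge evaluation with lm-evaluation-harness.
# The repository id is a placeholder; point it at the actual GPTQ checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=<user>/Llama-3.1-8B-Instruct-gptq-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=0,  # zero-shot, matching this report
    batch_size=1,   # matching the configuration below
)
print(results["results"]["arc_challenge"])  # acc, acc_norm and their stderrs
```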

---

## 📊 Evaluation Summary

| **Metric** | **Value** | **Description** |
|------------|-----------|-----------------|
| **Accuracy (`acc,none`)** | `47.1%` | Raw accuracy: the percentage of questions answered correctly. |
| **Standard Error (`acc_stderr,none`)** | `1.46%` | Uncertainty in the accuracy estimate. |
| **Normalized Accuracy (`acc_norm,none`)** | `49.9%` | Accuracy after dataset-specific normalization. |
| **Standard Error (`acc_norm_stderr,none`)** | `1.46%` | Uncertainty in the normalized accuracy estimate. |

📌 **Interpretation:**
- The model correctly answered **47.1% of the questions**.
- After **normalization**, accuracy improves slightly to **49.9%**.
- The **standard error (~1.46%)** corresponds to a 95% confidence interval of roughly **±2.9 percentage points**, as the sketch below works out.
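
A small, self-contained sketch of that confidence-interval arithmetic (the 1.96 multiplier is the usual normal-approximation factor; the numbers come from the table above):

```python
# Normal-approximation 95% confidence interval from accuracy and its stderr.
acc, stderr = 0.471, 0.0146          # values from the evaluation summary above
z = 1.96                             # 95% two-sided normal quantile
low, high = acc - z * stderr, acc + z * stderr
print(f"95% CI: [{low:.3f}, {high:.3f}]")  # roughly [0.442, 0.500]
```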

---

## ⚙️ Model Configuration

- **Model:** `Llama-3.1-8B-Instruct-gptq-4bit`
- **Parameters:** reported as `1.05 billion` (an artifact of how 4-bit GPTQ weights are packed and counted; the underlying base model has ~8 billion parameters)
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** `NVIDIA A100 80GB PCIe`
- **CUDA Version:** `12.4`
- **PyTorch Version:** `2.6.0+cu124`
- **Batch Size:** `1`
- **Evaluation Time:** `365.89 seconds (~6 minutes)`

📌 **Interpretation:**
- The evaluation ran on a **high-performance GPU (A100 80GB)**.
- The model is **4-bit quantized**, which reduces memory usage but may cost some accuracy.
- A **batch size of 1** was used, which slows evaluation relative to larger batches; a loading sketch follows below.
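
For reference, a minimal sketch of loading such a GPTQ checkpoint with `transformers`, assuming a GPTQ-capable backend (e.g. `optimum` with `auto-gptq`) is installed; the repository id is again a placeholder:

```python
# Sketch: load a 4-bit GPTQ checkpoint in float16 on the available GPU.
# Assumes a GPTQ backend (e.g. optimum + auto-gptq) is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<user>/Llama-3.1-8B-Instruct-gptq-4bit"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # matches the precision reported above
    device_map="auto",          # place weights on the available GPU
)
```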

---

## 📂 Dataset Information

- **Dataset:** `AI2 ARC-Challenge`
- **Task Type:** `Multiple Choice`
- **Number of Samples Evaluated:** `1,172`
- **Few-shot Examples Used:** `0` (zero-shot setting)

📌 **Interpretation:**
- This benchmark assesses **grade-school-level scientific reasoning**.
- Since **no few-shot examples** were provided, the model was evaluated in a **pure zero-shot setting**; the snippet below shows how to inspect the data.
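
A short sketch of inspecting the benchmark data, assuming the `datasets` library and the public `allenai/ai2_arc` dataset on the Hugging Face Hub:

```python
# Load the ARC-Challenge test split and peek at one multiple-choice question.
from datasets import load_dataset

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
print(len(arc))           # 1172 questions, matching the sample count above
print(arc[0]["question"]) # question text
print(arc[0]["choices"])  # answer options with their labels
```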

---

## 📈 Performance Insights

- The `higher_is_better` flag in the harness output confirms that **higher accuracy is preferred** for these metrics.
- The model's **raw accuracy (47.1%)** is moderate compared to state-of-the-art models, which typically score **60–80%** on ARC-Challenge.
- **Quantization Impact:** The **4-bit quantized model** may perform slightly worse than a full-precision version of the same checkpoint.
- **Zero-shot Limitation:** Performance could improve with **few-shot prompting** (providing worked examples before each question), as sketched below.
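
A hypothetical few-shot re-run, identical to the earlier harness call except for `num_fewshot` (25-shot is a common ARC-Challenge setting; the repository id remains a placeholder):

```python
# Sketch: the same harness call with in-context examples instead of zero-shot.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<user>/Llama-3.1-8B-Instruct-gptq-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,  # 25-shot is common for ARC-Challenge; 0 was used above
    batch_size=1,
)
print(results["results"]["arc_challenge"])
```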

---

📌 Let us know if you need further analysis or model tuning! 🚀