|
--- |
|
license: llama3.1 |
|
datasets: |
|
- allenai/c4 |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- meta-llama/Llama-3.1-8B-Instruct |
|
--- |
|
# Language Model Evaluation Results |
|
|
|
## Overview |
|
This document presents the evaluation results of `Llama-3.1-8B-Instruct-gptq-4bit` using the **Language Model Evaluation Harness** on the **ARC-Challenge** benchmark. |
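For reproducibility, here is a minimal sketch of the evaluation call using the harness's Python API (an assumption: `lm-evaluation-harness` v0.4+, where `lm_eval.simple_evaluate` is available; the settings mirror the configuration reported below):

```python
import lm_eval

# Zero-shot ARC-Challenge run, mirroring the configuration reported below
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=0,
    batch_size=1,
)
print(results["results"]["arc_challenge"])  # acc, acc_norm, and their standard errors
```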
|
|
|
--- |
|
|
|
## 📊 Evaluation Summary
|
|
|
| **Metric** | **Value** | **Description** | **[Original (unquantized)](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)** |
|----------------------|-----------|-----------------|-----------|
| **Accuracy (acc,none)** | `47.1%` | Raw accuracy: percentage of correct answers. | `53.1%` |
| **Standard Error (acc_stderr,none)** | `1.46%` | Uncertainty in the accuracy estimate. | `1.45%` |
| **Normalized Accuracy (acc_norm,none)** | `49.9%` | Accuracy after dataset-specific normalization. | `56.8%` |
| **Standard Error (acc_norm_stderr,none)** | `1.46%` | Uncertainty in the normalized accuracy estimate. | `1.45%` |
|
|
|
📌 **Interpretation:**

- The model answered **47.1% of the questions** correctly, versus **53.1%** for the full-precision original.
- After **normalization**, accuracy improves slightly to **49.9%** (original: **56.8%**).
- The **standard error (~1.46%)** indicates a small margin of uncertainty; a quick numerical check follows below.
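As a sanity check, the reported standard error matches the binomial formula `sqrt(p * (1 - p) / n)` with `n = 1172` (the sample count listed under Dataset Information), and a 95% confidence interval follows directly:

```python
import math

p = 0.471  # raw accuracy (acc,none)
n = 1172   # evaluated samples (see Dataset Information)

# Binomial standard error of the accuracy estimate
stderr = math.sqrt(p * (1 - p) / n)
print(f"stderr = {stderr:.4f}")  # ~0.0146, matching the reported 1.46%

# 95% confidence interval (normal approximation)
low, high = p - 1.96 * stderr, p + 1.96 * stderr
print(f"95% CI = [{low:.3f}, {high:.3f}]")  # ~[0.442, 0.500]
```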
|
|
|
--- |
|
|
|
## ⚙️ Model Configuration
|
|
|
- **Model:** `Llama-3.1-8B-Instruct-gptq-4bit` |
|
- **Parameters:** `8 billion` (4-bit GPTQ quantized; the harness reports ~`1.05B` because GPTQ packs eight 4-bit weights into each 32-bit integer)
|
- **Source:** Hugging Face (`hf`) |
|
- **Precision:** `torch.float16` |
|
- **Hardware:** `NVIDIA A100 80GB PCIe` |
|
- **CUDA Version:** `12.4` |
|
- **PyTorch Version:** `2.6.0+cu124` |
|
- **Batch Size:** `1` |
|
- **Evaluation Time:** `365.89 seconds (~6 minutes)` |
|
|
|
📌 **Interpretation:**

- The evaluation was performed on a **high-performance GPU (NVIDIA A100 80 GB)**.
- The model is **4-bit GPTQ quantized**, reducing memory usage at some possible cost to accuracy (see the loading sketch below).
- A **batch size of 1** was used, which slows evaluation throughput.
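For reference, a minimal loading sketch with `transformers` (assumptions: a GPTQ-capable backend such as `auto-gptq` or `gptqmodel` is installed, and a CUDA GPU is available):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# GPTQ weights stay 4-bit in memory; compute runs in float16, as in the evaluation
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # place layers on the available GPU(s)
)

inputs = tokenizer("Photosynthesis converts sunlight into", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0], skip_special_tokens=True))
```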
|
|
|
--- |
|
|
|
## 📚 Dataset Information
|
|
|
- **Dataset:** `AI2 ARC-Challenge` |
|
- **Task Type:** `Multiple Choice` |
|
- **Number of Samples Evaluated:** `1,172` |
|
- **Few-shot Examples Used:** `0` (Zero-shot setting) |
|
|
|
📌 **Interpretation:**

- This benchmark assesses **grade-school-level scientific reasoning**.
- Since **no few-shot examples** were provided, the model was evaluated in a **pure zero-shot setting**. The test split can be inspected directly, as sketched below.
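To inspect the benchmark, the test split can be pulled from the Hub (a sketch assuming the `datasets` library; `allenai/ai2_arc` is the canonical ARC repository):

```python
from datasets import load_dataset

# ARC-Challenge test split: 1,172 multiple-choice science questions
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
print(len(arc))  # 1172

sample = arc[0]
print(sample["question"])
print(sample["choices"]["label"], sample["choices"]["text"])
print("answer:", sample["answerKey"])
```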
|
|
|
--- |
|
|
|
## 📈 Performance Insights
|
|
|
- The `"higher_is_better"` flag confirms that **higher accuracy is preferred**. |
|
- The model's **raw accuracy (47.1%)** is moderate compared to state-of-the-art models (**60β80%** on ARC-Challenge). |
|
- **Quantization Impact:** The **4-bit quantized model** might perform slightly worse than a full-precision version. |
|
- **Zero-shot Limitation:** Performance could improve with **few-shot prompting** (providing examples before testing). |
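A sketch of such a few-shot re-run (25 shots is a common choice for ARC-Challenge; only `num_fewshot` changes relative to the zero-shot call shown in the Overview):

```python
import lm_eval

# Identical to the zero-shot run except for the number of in-context examples
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,  # in-context examples drawn by the harness
    batch_size=1,
)
print(results["results"]["arc_challenge"])
```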
|
|
|
--- |
|
|
|
Let us know if you need further analysis or model tuning!
|
|
|
## Citation
|
If you use this model in your research or project, please cite it as follows: |
|
|
|
📌 **Dr. Wasif Masood** (2024). *4-bit Llama-3.1-8B-Instruct*. Version 1.0.
|
Available at: [https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit](https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit) |
|
|
|
### BibTeX

```bibtex
@misc{rwmasood2024,
  author      = {Wasif Masood and Empirisch Tech GmbH},
  title       = {Llama-3.1-8B-Instruct 4-bit quantized (GPTQ)},
  year        = {2024},
  publisher   = {Hugging Face},
  url         = {https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit},
  version     = {1.0},
  license     = {llama3.1},
  institution = {Empirisch Tech GmbH}
}
```