---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
---
# About
This model is a research project by Novita AI, focusing on optimizing large language model inference efficiency while maintaining high performance. It applies weight and KV cache quantization to DeepSeek-R1-Distill-Llama-70B to achieve a 1.6× throughput improvement without compromising accuracy (see Benchmark Results below).
# Model Description
DeepSeek-R1-Distill-Llama-70B is available in two configurations:
- Standard configuration (bf16)
- Optimized configuration with weight and KV cache quantization (w8a8kv8: 8-bit weights, 8-bit activations, and an fp8 KV cache)
# Key Features
- Model Architecture: Based on the Llama architecture with 70B parameters
- Optimized Performance: Achieves 1.6× higher throughput with the w8a8kv8 configuration
- Quantization Innovation (a numeric sketch of the fp8 KV cache follows this list):
  - Weight quantization
  - KV cache optimization using fp8
- Context Length: Supports up to 131,072 tokens
- Precision Options:
  - bf16 for the standard version
  - w8a8kv8 for the optimized version
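
Storing the KV cache in fp8 halves its memory footprint relative to bf16, which frees room for longer contexts and larger batches. As a hedged illustration of the numeric idea only (not vLLM's kernel-level fp8 path), a per-tensor scaled e4m3 round-trip in PyTorch looks roughly like this:

```python
import torch

# Illustrative fp8 (e4m3) round-trip for a KV cache tensor; a sketch of
# the quantization idea only, not the serving engine's actual kernels.
kv = torch.randn(2, 8, 128, 64)                 # [batch, heads, seq_len, head_dim]
scale = kv.abs().amax() / 448.0                 # 448 = max normal value of e4m3fn
kv_fp8 = (kv / scale).to(torch.float8_e4m3fn)   # 1 byte per value (vs 2 for bf16)
kv_restored = kv_fp8.to(torch.float32) * scale  # dequantize for attention math
print((kv - kv_restored).abs().mean())          # mean quantization error
```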
# Methods
The model employs the following quantization and serving techniques:
- Weight quantization for model compression (a minimal int8 sketch follows this list)
- KV cache optimization using fp8
- The FLASHINFER attention backend for enhanced performance
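
To ground the weight quantization bullet, here is a minimal sketch of per-channel symmetric int8 quantization, the standard building block behind "w8" schemes. Novita AI's exact pipeline is not published in this card, so treat this as illustrative:

```python
import torch

# Sketch of per-channel symmetric int8 weight quantization (illustrative,
# not the exact recipe used for this checkpoint).
def quantize_int8(w: torch.Tensor):
    # w: [out_features, in_features]; one scale per output channel
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
print("max abs error:", (w - dequantize(q, scale)).abs().max().item())
```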
# Model Usage
## Quick Start
For optimal performance with the w8a8kv8 configuration:
```python
import os

# Select the FlashInfer attention backend, which provides the fp8 KV
# cache kernels (set before vLLM is imported).
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# Engine configuration for the optimized build (consumed by the usage
# sketch after this block).
model_config = {
    "max_model_len": 131072,      # full 131,072-token context window
    "tensor_parallel_size": 2,    # two GPUs for the w8a8kv8 build
    "kv_cache_dtype": "fp8",      # quantized KV cache
}
max_gen_tokens = 1024             # applied at sampling time, not engine setup
```
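
As a hedged end-to-end sketch, the configuration above plugs into vLLM's public `LLM` / `SamplingParams` API as follows. The repository id is a placeholder assumption; substitute the actual path of this checkpoint:

```python
from vllm import LLM, SamplingParams

# Placeholder repo id: point this at this model card's w8a8kv8 weights.
llm = LLM(model="novita-ai/DeepSeek-R1-Distill-Llama-70B-w8a8kv8", **model_config)

params = SamplingParams(max_tokens=max_gen_tokens)
outputs = llm.generate(["Explain fp8 KV cache quantization briefly."], params)
print(outputs[0].outputs[0].text)
```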
# Hardware Requirements
- Standard (bf16): 4 GPUs, tensor parallel size = 4
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
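
For comparison, a minimal engine configuration for the standard bf16 build, under the same assumptions as the sketch above:

```python
# Standard bf16 build: four GPUs, no KV cache quantization.
bf16_config = {
    "max_model_len": 131072,
    "tensor_parallel_size": 4,   # four GPUs, as listed above
    "dtype": "bfloat16",         # vLLM's bf16 precision flag
}
```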
# Model Evaluation
## Benchmark Results
1. Throughput Performance:
   - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
2. MMLU Benchmark Scores (exact match; a minimal scorer is sketched after this list):
   - bf16: 0.5158
   - w8a8kv8: 0.5169
3. Subject-specific Performance (w8a8kv8 relative to bf16):
   - Notable improvements in:
     - Biology (+1.11%)
     - Economics (+0.83%)
     - Physics (+0.92%)
   - Slight regressions in:
     - History (-1.57%)
     - Law (-1.46%)
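
For reference, the exact-match metric reported above reduces to a hit rate over the model's answers. A minimal, hypothetical scorer (not the actual evaluation harness used for these numbers) looks like this:

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# e.g. MMLU multiple-choice letters:
print(exact_match(["B", "C", "A"], ["B", "C", "D"]))  # 0.666...
```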
# Limitations and Bias
- The fp8 KV cache requires backend support (e.g., the FLASHINFER attention backend)
- Performance may vary depending on hardware configuration
- Accuracy varies slightly across subject domains (see Benchmark Results)
# Community
Join our community discussions and get support:
- Discord: [Novita AI Discord Community](https://discord.com/invite/YyPRAzwp7P)