---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
---
# About
This model is a research project by Novita AI, focusing on optimizing large language model inference efficiency while maintaining high performance. It applies weight and KV cache quantization to DeepSeek-R1-Distill-Llama-70B to achieve a 1.6× throughput improvement without compromising accuracy (see Benchmark Results below).
# Model Description
DeepSeek-R1-Distill-Llama-70B is available in two configurations:
- Standard configuration (bf16)
- Optimized configuration with weight and KV cache quantization (w8a8kv8: 8-bit weights, 8-bit activations, and an fp8 KV cache)
# Key Features
- Model Architecture: Based on the Llama architecture with 70B parameters
- Optimized Performance: Achieves 1.6× higher throughput with the w8a8kv8 configuration
- Quantization Innovation (a numeric sketch of the fp8 KV cache follows this list):
  - Weight quantization
  - KV cache optimization using fp8
- Context Length: Supports up to 131,072 tokens
- Precision Options:
  - bf16 for the standard version
  - w8a8kv8 for the optimized version
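
Storing the KV cache in fp8 halves its memory footprint relative to bf16, which frees room for longer contexts and larger batches. As a hedged illustration of the numeric idea only (not vLLM's kernel-level fp8 path), a per-tensor scaled e4m3 round-trip in PyTorch looks roughly like this:

```python
import torch

# Illustrative fp8 (e4m3) round-trip for a KV cache tensor; a sketch of
# the quantization idea only, not the serving engine's actual kernels.
kv = torch.randn(2, 8, 128, 64)                 # [batch, heads, seq_len, head_dim]
scale = kv.abs().amax() / 448.0                 # 448 = max normal value of e4m3fn
kv_fp8 = (kv / scale).to(torch.float8_e4m3fn)   # 1 byte per value (vs 2 for bf16)
kv_restored = kv_fp8.to(torch.float32) * scale  # dequantize for attention math
print((kv - kv_restored).abs().mean())          # mean quantization error
```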
# Methods
The model employs the following quantization and serving techniques:
- Weight quantization for model compression (a minimal int8 sketch follows this list)
- KV cache optimization using fp8
- The FLASHINFER attention backend for enhanced performance
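
To ground the weight quantization bullet, here is a minimal sketch of per-channel symmetric int8 quantization, the standard building block behind "w8" schemes. Novita AI's exact pipeline is not published in this card, so treat this as illustrative:

```python
import torch

# Sketch of per-channel symmetric int8 weight quantization (illustrative,
# not the exact recipe used for this checkpoint).
def quantize_int8(w: torch.Tensor):
    # w: [out_features, in_features]; one scale per output channel
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
print("max abs error:", (w - dequantize(q, scale)).abs().max().item())
```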
# Model Usage
## Quick Start
For optimal performance with the w8a8kv8 configuration:
```python
import os

# Select the FlashInfer attention backend, which provides the fp8 KV
# cache kernels (set before vLLM is imported).
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# Engine configuration for the optimized build (consumed by the usage
# sketch after this block).
model_config = {
    "max_model_len": 131072,      # full 131,072-token context window
    "tensor_parallel_size": 2,    # two GPUs for the w8a8kv8 build
    "kv_cache_dtype": "fp8",      # quantized KV cache
}
max_gen_tokens = 1024             # applied at sampling time, not engine setup
```
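
As a hedged end-to-end sketch, the configuration above plugs into vLLM's public `LLM` / `SamplingParams` API as follows. The repository id is a placeholder assumption; substitute the actual path of this checkpoint:

```python
from vllm import LLM, SamplingParams

# Placeholder repo id: point this at this model card's w8a8kv8 weights.
llm = LLM(model="novita-ai/DeepSeek-R1-Distill-Llama-70B-w8a8kv8", **model_config)

params = SamplingParams(max_tokens=max_gen_tokens)
outputs = llm.generate(["Explain fp8 KV cache quantization briefly."], params)
print(outputs[0].outputs[0].text)
```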
# Hardware Requirements
- Standard (bf16): 4 GPUs, tensor parallel size = 4
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
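
For comparison, a minimal engine configuration for the standard bf16 build, under the same assumptions as the sketch above:

```python
# Standard bf16 build: four GPUs, no KV cache quantization.
bf16_config = {
    "max_model_len": 131072,
    "tensor_parallel_size": 4,   # four GPUs, as listed above
    "dtype": "bfloat16",         # vLLM's bf16 precision flag
}
```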
# Model Evaluation
## Benchmark Results
1. Throughput Performance:
   - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
2. MMLU Benchmark Scores (exact match; a minimal scorer is sketched after this list):
   - bf16: 0.5158
   - w8a8kv8: 0.5169
3. Subject-specific Performance (w8a8kv8 relative to bf16):
   - Notable improvements in:
     - Biology (+1.11%)
     - Economics (+0.83%)
     - Physics (+0.92%)
   - Slight regressions in:
     - History (-1.57%)
     - Law (-1.46%)
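
For reference, the exact-match metric reported above reduces to a hit rate over the model's answers. A minimal, hypothetical scorer (not the actual evaluation harness used for these numbers) looks like this:

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# e.g. MMLU multiple-choice letters:
print(exact_match(["B", "C", "A"], ["B", "C", "D"]))  # 0.666...
```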
# Limitations and Bias
- The fp8 KV cache requires backend support (e.g., the FLASHINFER attention backend)
- Performance may vary depending on hardware configuration
- Accuracy varies slightly across subject domains (see Benchmark Results)
# Community
Join our community discussions and get support:
- Discord: [Novita AI Discord Community](https://discord.com/invite/YyPRAzwp7P)