---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
---

# About

This model is a research project by Novita AI focused on optimizing large language model inference efficiency while maintaining high performance. The DeepSeek-R1-Distill-Llama-70B model applies quantization techniques that deliver significant throughput improvements without compromising accuracy.

# Model Description

DeepSeek-R1-Distill-Llama-70B is available in two configurations:

- Standard configuration (bf16)
- Optimized configuration with weight, activation, and KV cache quantization (w8a8kv8)

# Key Features

- Model Architecture: Based on the Llama architecture with 70B parameters
- Optimized Performance: Achieves 1.6× higher throughput with the w8a8kv8 configuration
- Quantization Innovation:
  - Weight quantization
  - KV cache optimization using fp8
- Context Length: Supports up to 131,072 tokens
- Precision Options:
  - bf16 for the standard version
  - w8a8kv8 for the optimized version

# Methods

The model employs the following quantization and optimization techniques, illustrated by the sketches after this list:

- Weight quantization for model compression
- KV cache optimization using fp8
- Backend optimization with FLASHINFER for enhanced performance
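As an illustration of the weight-quantization step, the sketch below shows symmetric per-channel int8 quantization, the standard technique behind the "w8" in w8a8. It is a minimal example of the general method under simple assumptions, not Novita AI's exact recipe.

```python
# Minimal sketch of symmetric per-channel int8 weight quantization
# (the generic technique behind "w8"; not Novita AI's exact recipe).
import torch

def quantize_int8(w: torch.Tensor):
    # One scale per output channel (row): scale = max|w| / 127
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original weights
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)          # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())
```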
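The fp8 KV cache applies the same idea at inference time: cached keys and values are stored in 8-bit floating point, halving KV memory relative to bf16. The round-trip below (assuming PyTorch ≥ 2.1 for the `float8_e4m3fn` dtype) shows the memory saving and the precision cost; the real cache management lives inside vLLM and FLASHINFER.

```python
# fp8 (e4m3) round-trip for a mock KV-cache tensor: half the bytes of bf16
# at the cost of some precision. Illustrative only; vLLM manages the real cache.
import torch

kv = torch.randn(8, 128, 128, dtype=torch.bfloat16)  # [heads, seq, head_dim]
kv_fp8 = kv.to(torch.float8_e4m3fn)                  # 1 byte/element vs 2
kv_back = kv_fp8.to(torch.bfloat16)

print("bytes per element:", kv_fp8.element_size(), "vs", kv.element_size())
print("max abs error:", (kv - kv_back).abs().max().item())
```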
# Model Usage

## Quick Start

For optimal performance with the w8a8kv8 configuration, select the FLASHINFER attention backend and pass an fp8 KV cache dtype:

```python
import os

# Environment setup: FLASHINFER is required for the fp8 KV cache
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# Recommended model configuration
model_config = {
    "max_model_len": 131072,
    "max_gen_tokens": 1024,
    "tensor_parallel_size": 2,
    "kv_cache_dtype": "fp8",
}
```
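Putting that configuration to work, here is a minimal offline-inference sketch with vLLM. It assumes vLLM is installed with FLASHINFER support, and `novita/DeepSeek-R1-Distill-Llama-70B-w8a8kv8` is a placeholder for the actual quantized checkpoint id.

```python
import os

# Must be set before vLLM is imported so the backend choice takes effect
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="novita/DeepSeek-R1-Distill-Llama-70B-w8a8kv8",  # placeholder repo id
    max_model_len=131072,
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",
)

outputs = llm.generate(
    ["Briefly explain why fp8 KV caching increases serving throughput."],
    SamplingParams(max_tokens=1024),
)
print(outputs[0].outputs[0].text)
```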
# Hardware Requirements

- Standard (bf16): 4 GPUs, tensor parallel size = 4
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
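A small helper, hypothetical but mirroring the two layouts above, can choose the matching parallelism at startup:

```python
# Hypothetical helper: pick a tensor-parallel layout from available GPUs,
# mirroring the two configurations listed above.
import torch

def pick_config() -> dict:
    n_gpus = torch.cuda.device_count()
    if n_gpus >= 4:
        return {"dtype": "bfloat16", "tensor_parallel_size": 4}
    if n_gpus >= 2:
        return {"kv_cache_dtype": "fp8", "tensor_parallel_size": 2}
    raise RuntimeError("w8a8kv8 needs at least 2 GPUs; bf16 needs 4")

print(pick_config())
```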
# Model Evaluation

## Benchmark Results

1. Throughput Performance:
   - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
2. MMLU Benchmark Scores:
   - bf16: 0.5158 (exact match)
   - w8a8kv8: 0.5169 (exact match)
3. Subject-specific Performance:
   - Notable improvements in:
     - Biology (+1.11%)
     - Economics (+0.83%)
     - Physics (+0.92%)
   - Slight regressions in:
     - History (-1.57%)
     - Law (-1.46%)

# Limitations and Bias

- Requires specific backend support (FLASHINFER) for the fp8 KV cache
- Performance may vary depending on hardware configuration
- Subject-specific accuracy varies slightly across domains

# Community

Join our community discussions and get support:

- Discord: [Novita AI Discord Community](https://discord.com/invite/YyPRAzwp7P)