yxue-jamandtea committed
Commit 08c3d03 · verified · 1 Parent(s): d11432d

Update README.md

Files changed (1): README.md (+98, −3)
README.md CHANGED (previously contained only the `license: mit` front matter)
---
license: mit
language:
- en
pipeline_tag: text-generation
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
tags:
- chat
library_name: transformers
---

# Model Overview

- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 1/28/2025

Quantized version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/) to the FP8 data type, ready for inference with SGLang >= 0.3 or vLLM >= 0.5.2.
This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within the transformer blocks are quantized.
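As a rough illustration of that figure (assuming the roughly 32–33 billion parameters of the Qwen2.5-32B base): at 2 bytes per parameter the BF16 weights occupy on the order of 65 GB, while at 1 byte per parameter the FP8 weights occupy roughly half that, with the unquantized embeddings and `lm_head` adding a small amount on top.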

## Deployment

### Use with SGLang

```bash
python -m sglang.launch_server --model-path JamAndTeaStudios/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic \
  --port 30000 --host 0.0.0.0
```
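
Once the server is running, it can be queried through SGLang's OpenAI-compatible API. A minimal sketch of a request (the prompt is illustrative; the port matches the launch command above):

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "JamAndTeaStudios/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic",
    "messages": [{"role": "user", "content": "Briefly explain FP8 quantization."}]
  }'
```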
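
### Use with vLLM

The overview above also lists vLLM >= 0.5.2 as a supported backend. A minimal offline-inference sketch using vLLM's Python API (not verified against this exact checkpoint; vLLM reads the quantization settings from the model config):

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; the compressed-tensors quantization
# config is picked up from the repo automatically.
llm = LLM(model="JamAndTeaStudios/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic")

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Briefly explain FP8 quantization."], sampling_params)
print(outputs[0].outputs[0].text)
```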

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

<details>
<summary>Model Creation Code</summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

# 1) Load model.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 2) Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP8 (per-channel, via PTQ)
#   * quantize the activations to FP8 (dynamic, per-token)
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# 3) Apply quantization and save in compressed-tensors format.
OUTPUT_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
oneshot(
    model=model,
    recipe=recipe,
    tokenizer=tokenizer,
    output_dir=OUTPUT_DIR,
)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")
```
</details>

## Evaluation

TBA

## Play Retail Mage

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64f908994110f1806f2c356a/vsWXpQqgHIqN4f4BM-RfS.png)

[Retail Mage (Steam)](https://store.steampowered.com/app/3224380/Retail_Mage/) is an immersive sim that uses online LLM inference in almost all of its gameplay features!

### Reviews

> “A true to life experience detailing how customer service really works.”
> 10/10 – kpolupo

> “I enjoyed how many things were flammable in the store.”
> 5/5 – mr_srsbsns

> “I've only known that talking little crow plushie in MageMart for a day and a half but if anything happened to him I would petrify everyone in this store and then myself.”
> 7/7 – neondenki