---
license: gemma
language:
- en
pipeline_tag: text-generation
base_model:
- google/gemma-2-9b-it
tags:
- chat
library_name: transformers
---

# Model Overview

- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 1/28/2025

Quantized version of [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it/) in the FP8 data type, ready for inference with SGLang >= 0.3 or vLLM >= 0.5.2.
This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformer blocks are quantized.
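
Since the card notes compatibility with vLLM >= 0.5.2, a minimal offline-inference sketch is shown below; the prompt and sampling settings are illustrative assumptions, not part of this card.

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; vLLM reads the compressed-tensors
# quantization config stored in the model repo.
llm = LLM(model="JamAndTeaStudios/gemma-2-9b-it-FP8-Dynamic")

# Illustrative sampling settings.
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```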

## Deployment

### Use with SGLang

On SGLang, Gemma 2's context length extends to 8192 tokens thanks to its sliding-window attention support.

```bash
python -m sglang.launch_server --model-path JamAndTeaStudios/gemma-2-9b-it-FP8-Dynamic \
  --port 30000 --host 0.0.0.0
```
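
Once the server is running, it can be queried through SGLang's OpenAI-compatible endpoint. A minimal sketch with the openai Python client is below; the model name, prompt, and generation settings are illustrative assumptions, not part of this card.

```python
from openai import OpenAI

# Point the OpenAI client at the SGLang server launched above.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

response = client.chat.completions.create(
    model="JamAndTeaStudios/gemma-2-9b-it-FP8-Dynamic",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```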

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

<details>
<summary>Model Creation Code</summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "google/gemma-2-9b-it"

# 1) Load model.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 2) Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP8 with per-channel scales via PTQ
#   * quantize the activations to FP8 with dynamic per-token scales
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# 3) Apply quantization and save in compressed-tensors format.
OUTPUT_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
oneshot(
    model=model,
    recipe=recipe,
    tokenizer=tokenizer,
    output_dir=OUTPUT_DIR,
)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")
```
</details>

## Evaluation

TBA

## Play Retail Mage

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64f908994110f1806f2c356a/vsWXpQqgHIqN4f4BM-RfS.png)

[Retail Mage](https://store.steampowered.com/app/3224380/Retail_Mage/) is an immersive sim that uses online LLM inference in almost every gameplay feature!

**Reviews**

“A true-to-life experience detailing how customer service really works.”
10/10 – kpolupo

“I enjoyed how many things were flammable in the store.”
5/5 – mr_srsbsns

“I've only known that talking little crow plushie in MageMart for a day and a half, but if anything happened to him I would petrify everyone in this store and then myself.”
7/7 – neondenki