ruslanmv committed
Commit 4cd524c · verified · 1 Parent(s): 2de27b5

Update README.md

Files changed (1)
  1. README.md +61 -12
README.md CHANGED
@@ -60,22 +60,71 @@ pip install transformers
  Use the following Python snippet to load and generate text with **Granite-3.1-8B-Reasoning**:

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- device = "auto"
- model_path = "ruslanmv/granite-3.1-8b-Reasoning"
-
- tokenizer = AutoTokenizer.from_pretrained(model_path)
- model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
- model.eval()
-
- input_text = "Can you explain the difference between inductive and deductive reasoning?"
- input_tokens = tokenizer(input_text, return_tensors="pt").to(device)
-
- output = model.generate(**input_tokens, max_length=4000)
- output_text = tokenizer.batch_decode(output)
-
- print(output_text)
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+ import torch
+
+ # Model and tokenizer
+ model_name = "ruslanmv/granite-3.1-8b-Reasoning"  # or "ruslanmv/granite-3.1-2b-Reasoning"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     device_map='auto',          # or 'cuda' if you have only one GPU
+     torch_dtype=torch.float16,  # float16 for faster, less memory-intensive inference
+     load_in_4bit=True           # 4-bit quantization for lower memory usage (requires bitsandbytes)
+ )
+
+ # Build the chat prompt
+ SYSTEM_PROMPT = """
+ Respond in the following format:
+ <reasoning>
+ ...
+ </reasoning>
+ <answer>
+ ...
+ </answer>
+ """
+ text = tokenizer.apply_chat_template([
+     {"role": "system", "content": SYSTEM_PROMPT},
+     {"role": "user", "content": "Calculate pi."},
+ ], tokenize=False, add_generation_prompt=True)
+
+ inputs = tokenizer(text, return_tensors="pt").to("cuda")  # move input tensors to the GPU
+
+ # Sampling parameters
+ generation_config = GenerationConfig(
+     temperature=0.8,
+     top_p=0.95,
+     max_new_tokens=1024,  # maximum number of new tokens to generate
+ )
+
+ # Inference
+ with torch.inference_mode():  # inference mode avoids autograd overhead during generation
+     outputs = model.generate(**inputs, generation_config=generation_config)
+
+ output = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # Find the start of the actual response
+ start_index = output.find("assistant")
+ if start_index != -1:
+     # Remove the prompt part up to and including "assistant"
+     output = output[start_index + len("assistant"):].strip()
+
+ print(output)
+ ```
+
+ You will get something like:
+ ```
+ <reasoning>
+ Pi is an irrational number, which means it cannot be exactly calculated as it has an infinite number of decimal places. However, we can approximate pi using various mathematical formulas. One of the simplest methods is the Leibniz formula for pi, which is an infinite series:
+
+ pi = 4 * (1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 +...)
+
+ This series converges to pi as more terms are added.
+ </reasoning>
+
+ <answer>
+ The exact value of pi cannot be calculated due to its infinite decimal places. However, using the Leibniz formula, we can approximate pi to a certain number of decimal places. For example, after calculating the first 500 terms of the series, we get an approximation of pi as 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679.
+ </answer>
  ```

  ---
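
The updated snippet asks the model to wrap its output in `<reasoning>` and `<answer>` tags (see `SYSTEM_PROMPT`). A minimal sketch, assuming the model follows that format, for splitting the decoded `output` into the two sections; the helper `split_reasoning_answer` is illustrative and not part of the commit:

```python
# Illustrative helper (not from the commit): split the decoded output into the
# <reasoning> and <answer> sections requested by SYSTEM_PROMPT.
import re

def split_reasoning_answer(text: str) -> dict:
    """Return the <reasoning> and <answer> blocks, or "" when a tag is missing."""
    sections = {}
    for tag in ("reasoning", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else ""
    return sections

# Example with a response shaped like the sample output above
sample = "<reasoning>Use the Leibniz series.</reasoning>\n<answer>pi is approximately 3.14159.</answer>"
parts = split_reasoning_answer(sample)
print(parts["reasoning"])  # -> Use the Leibniz series.
print(parts["answer"])     # -> pi is approximately 3.14159.
```

If the model omits one of the tags in a given generation, the corresponding entry simply comes back empty rather than raising an error.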