ruslanmv committed
Commit 4cd524c · verified · 1 Parent(s): 2de27b5

Update README.md

Files changed (1)
  1. README.md +61 -12
README.md CHANGED
@@ -60,22 +60,71 @@ pip install transformers
  Use the following Python snippet to load and generate text with **Granite-3.1-8B-Reasoning**:

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- device = "auto"
- model_path = "ruslanmv/granite-3.1-8b-Reasoning"
-
- tokenizer = AutoTokenizer.from_pretrained(model_path)
- model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
- model.eval()
-
- input_text = "Can you explain the difference between inductive and deductive reasoning?"
- input_tokens = tokenizer(input_text, return_tensors="pt").to(device)
-
- output = model.generate(**input_tokens, max_length=4000)
- output_text = tokenizer.batch_decode(output)
-
- print(output_text)
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+ import torch
+
+ # Model and tokenizer
+ model_name = "ruslanmv/granite-3.1-8b-Reasoning"  # or "ruslanmv/granite-3.1-2b-Reasoning"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     device_map='auto',          # or 'cuda' if you have only one GPU
+     torch_dtype=torch.float16,  # float16 for faster, less memory-intensive inference
+     load_in_4bit=True           # 4-bit quantization for lower memory usage (requires bitsandbytes)
+ )
+
+ # Build the chat prompt
+ SYSTEM_PROMPT = """
+ Respond in the following format:
+ <reasoning>
+ ...
+ </reasoning>
+ <answer>
+ ...
+ </answer>
+ """
+ text = tokenizer.apply_chat_template([
+     {"role": "system", "content": SYSTEM_PROMPT},
+     {"role": "user", "content": "Calculate pi."},
+ ], tokenize=False, add_generation_prompt=True)
+
+ inputs = tokenizer(text, return_tensors="pt").to("cuda")  # move input tensors to the GPU
+
+ # Sampling parameters
+ generation_config = GenerationConfig(
+     temperature=0.8,
+     top_p=0.95,
+     max_new_tokens=1024,  # maximum number of new tokens to generate
+ )
+
+ # Inference
+ with torch.inference_mode():  # inference mode avoids autograd overhead during generation
+     outputs = model.generate(**inputs, generation_config=generation_config)
+
+ output = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # Find the start of the actual response
+ start_index = output.find("assistant")
+ if start_index != -1:
+     # Remove the prompt part up to and including "assistant"
+     output = output[start_index + len("assistant"):].strip()
+
+ print(output)
+ ```
+
+ You will get something like:
+ ```
+ <reasoning>
+ Pi is an irrational number, which means it cannot be exactly calculated as it has an infinite number of decimal places. However, we can approximate pi using various mathematical formulas. One of the simplest methods is the Leibniz formula for pi, which is an infinite series:
+
+ pi = 4 * (1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 +...)
+
+ This series converges to pi as more terms are added.
+ </reasoning>
+
+ <answer>
+ The exact value of pi cannot be calculated due to its infinite decimal places. However, using the Leibniz formula, we can approximate pi to a certain number of decimal places. For example, after calculating the first 500 terms of the series, we get an approximation of pi as 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679.
+ </answer>
  ```

  ---
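
The updated snippet asks the model to wrap its output in `<reasoning>` and `<answer>` tags (see `SYSTEM_PROMPT`). A minimal sketch, assuming the model follows that format, for splitting the decoded `output` into the two sections; the helper `split_reasoning_answer` is illustrative and not part of the commit:

```python
# Illustrative helper (not from the commit): split the decoded output into the
# <reasoning> and <answer> sections requested by SYSTEM_PROMPT.
import re

def split_reasoning_answer(text: str) -> dict:
    """Return the <reasoning> and <answer> blocks, or "" when a tag is missing."""
    sections = {}
    for tag in ("reasoning", "answer"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else ""
    return sections

# Example with a response shaped like the sample output above
sample = "<reasoning>Use the Leibniz series.</reasoning>\n<answer>pi is approximately 3.14159.</answer>"
parts = split_reasoning_answer(sample)
print(parts["reasoning"])  # -> Use the Leibniz series.
print(parts["answer"])     # -> pi is approximately 3.14159.
```

If the model omits one of the tags in a given generation, the corresponding entry simply comes back empty rather than raising an error.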