tags:
- DeepSeek
- CoT
- finetune
---

A fine-tuned variant of **Qwen 2.5 3B Instruct**, designed for improved **toggleable reasoning** and **instruction-following capabilities**. The model was built by engineers at [xioserv.com](https://xioserv.com) and incorporates specialized modifications to enhance performance on structured reasoning tasks.

---

## Overview

The **AaryanK/Qwen_2.5_3B_Instruct_GRPO_Reasoning_XIOSERV** model is a refined version of the Qwen 2.5 3B Instruct model, optimized to produce responses in a structured format. This makes it particularly useful for tasks that require a clear separation between reasoning steps and the final answer.

### Toggleable Reasoning Mode

- If you include the **system prompt**, the model **explicitly separates its reasoning from the final answer**.
- If you **omit the system prompt**, the model **responds naturally**, without structured reasoning.

This makes the model highly **versatile**: users can choose between structured reasoning and direct responses depending on their use case.
---

## System Prompt

To enable structured reasoning, use the following system prompt:

```
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
```

If you do not include this prompt, the model responds in a **standard, conversational** manner, without explicitly separating reasoning from the final answer.
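Downstream code can then recover the two sections from the tags. A small illustrative parser (the function name and fallback behavior are our own choices, not part of the model):

```python
import re

def parse_structured(text: str) -> tuple[str, str]:
    """Extract (reasoning, answer) from a structured response.
    Falls back to ("", text) when the tags are absent, i.e. when the
    model was run without the system prompt."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if answer is None:
        return "", text.strip()
    return (reasoning.group(1).strip() if reasoning else "",
            answer.group(1).strip())

reply = ("<reasoning>\n17 * 23 = 17 * 20 + 17 * 3 = 340 + 51\n</reasoning>\n"
         "<answer>\n391\n</answer>")
print(parse_structured(reply)[1])  # prints "391"
```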
---

## Methodology

To replicate the "aha moment," we employed **Group Relative Policy Optimization (GRPO)**, a variant of **Proximal Policy Optimization (PPO)** that enhances reasoning capabilities while reducing memory usage. This approach follows the techniques outlined in the **DeepSeekMath** paper, where GRPO was instrumental in advancing reasoning in language models. Trained with GRPO, the model autonomously refines its problem-solving strategies, mirroring the **self-reflective behavior** observed in **DeepSeek's R1**.
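The group-relative idea at the core of GRPO can be sketched in a few lines: each completion's reward is normalized against the other completions sampled for the same prompt, replacing PPO's learned value baseline. This is an illustrative simplification, not the training code used for this model:

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each completion's reward against the group sampled for
    the same prompt: advantage_i = (r_i - mean) / (std + eps).
    Dropping PPO's learned value network is what saves memory."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by a reward function
# (e.g. 1.0 for a correct, well-formatted answer, 0.0 otherwise):
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group's average get positive advantages and are reinforced; the policy update itself then proceeds as in PPO.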
---
|
54 |
|
|
|
58 |
|
59 |
To run the model with **llama.cpp**, follow the instructions in the [llama.cpp repository](https://github.com/ggerganov/llama.cpp).
|
60 |
|
61 |
+
Ensure that you include the system prompt in your input **if you want structured reasoning output**. Otherwise, the model will function like a standard instruct model.
|
|
|
62 |
|
63 |
---

## Acknowledgements

- **xioserv.com** – For the engineering efforts in fine-tuning this model.
- **Hugging Face** – For providing an accessible platform to share and deploy models.

For any questions or contributions, please open an issue or submit a pull request on our [GitHub repository](https://github.com/AaryanK/Qwen_2.5_3B_Instruct_GRPO_Reasoning_XIOSERV).

---

Happy coding!