# Model Card

Unlike the impressive DeepSeek-R1(-Zero), this project focuses on a pure reinforcement learning (RL) experiment applied to an open-domain task: creative advertisement generation.

**Objective:**

- To investigate the feasibility of applying R1-like methods to an open-domain task without a verifiable ground-truth reward, while at least demonstrating its potential.
- To explore whether `<think>` and `<answer>` rewards can be explicitly designed to provide strong guidance through RL, based on human prior knowledge.
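A rule-based `<think>`/`<answer>` reward of the kind described above can be sketched as a small scoring function. The tag pattern and score weights below are illustrative assumptions for this sketch, not the project's actual reward design:

```python
import re

# Hypothetical format reward: checks that a completion follows the
# <think>...</think><answer>...</answer> template. The pattern and the
# 0.5/0.25/0.25 weights are assumptions, not the project's real values.
TAG_PATTERN = re.compile(
    r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Return 1.0 for a well-formed completion with non-empty tags,
    partial credit for an empty tag, and 0.0 for malformed output."""
    match = TAG_PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # missing or malformed tags get no format reward
    think, answer = match.group(1).strip(), match.group(2).strip()
    # Base reward for the template, plus bonuses for non-empty content.
    return 0.5 + 0.25 * bool(think) + 0.25 * bool(answer)

if __name__ == "__main__":
    good = "<think>Target audience: runners.</think><answer>Run beyond limits.</answer>"
    print(format_reward(good))       # 1.0
    print(format_reward("no tags"))  # 0.0
```

In GRPO-style training, a reward like this would typically be combined with task-specific scores (e.g. a judge of ad quality) rather than used alone.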
**Note:**

- Our goal is **not** to induce self-reflective thinking, but to align with human thought processes purely through RL, without any supervised fine-tuning (SFT) on any constructed dataset.

Despite its small size, the resulting 1.5B-GRPO model demonstrates intriguing generative capabilities, though it is still **far from perfect**.

## Quick start