# Model Card

Unlike the impressive DeepSeek-R1(-Zero), this project focuses on a pure reinforcement learning (RL) experiment applied to an open-domain task: creative advertisement generation.

**Objective:**

- To investigate the feasibility of applying R1-like methods to an open-domain task without a verifiable ground-truth reward, while at least demonstrating its potential.
- To explore whether `<think>` and `<answer>` rewards can be explicitly designed to provide strong guidance through RL, based on human prior knowledge.
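A rule-based `<think>`/`<answer>` reward of the kind described above can be sketched as a small scoring function. The tag pattern and score weights below are illustrative assumptions for this sketch, not the project's actual reward design:

```python
import re

# Hypothetical format reward: checks that a completion follows the
# <think>...</think><answer>...</answer> template. The pattern and the
# 0.5/0.25/0.25 weights are assumptions, not the project's real values.
TAG_PATTERN = re.compile(
    r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Return 1.0 for a well-formed completion with non-empty tags,
    partial credit for an empty tag, and 0.0 for malformed output."""
    match = TAG_PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # missing or malformed tags get no format reward
    think, answer = match.group(1).strip(), match.group(2).strip()
    # Base reward for the template, plus bonuses for non-empty content.
    return 0.5 + 0.25 * bool(think) + 0.25 * bool(answer)

if __name__ == "__main__":
    good = "<think>Target audience: runners.</think><answer>Run beyond limits.</answer>"
    print(format_reward(good))       # 1.0
    print(format_reward("no tags"))  # 0.0
```

In GRPO-style training, a reward like this would typically be combined with task-specific scores (e.g. a judge of ad quality) rather than used alone.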
**Note:**

- Our goal is **not** to induce self-reflective thinking, but to align with human thought processes purely through RL, without any supervised fine-tuning (SFT) on any constructed dataset.

Despite its small size, the resulting 1.5B-GRPO model demonstrates intriguing generative capabilities, though it is still **far from perfect**.

## Quick start