mrzjy committed · Commit 9eafb6a · verified · 1 Parent(s): 3db8939

Update README.md

Files changed (1)
  1. README.md +10 -3
README.md CHANGED
@@ -14,11 +14,18 @@ base_model:
 
  # Model Card
 
- This is a pure reinforcement learning (RL) experiment applied to an open-domain task: creative advertisement generation.
+ Unlike the impressive DeepSeek-R1(-Zero), this project focuses on a pure reinforcement learning (RL) experiment applied to an open-domain task: creative advertisement generation.
 
- **Objective:** To explore the feasibility of applying R1-like methods to an open-domain task without a verifiable ground-truth reward, at the very least demonstrating its potential.
+ **Objective:**
 
- **Note**: Despite its small size, the resulting model demonstrates intriguing generative capabilities—though it's still **far from perfect**.
+ - To investigate the feasibility of applying R1-like methods to an open-domain task without a verifiable ground-truth reward, while at least demonstrating its potential.
+ - To explore whether `<think>` and `<answer>` rewards can be explicitly designed to provide strong guidance through RL based on human prior knowledge.
+
+ **Note**:
+
+ - Our goal is **not** to induce self-reflective thinking, but to align with human thought processes purely through RL, without any supervised fine-tuning (SFT) on any constructed dataset.
+
+ Despite its small size, the resulting 1.5B-GRPO model demonstrates intriguing generative capabilities—though it's still **far from perfect**.
 
  ## Quick start
 