Great work! I have two questions regarding the reward design:
How do you balance the different reward components? I assume it's through trial and error, but I'm particularly interested in:
- The scale of each reward component
- How numerical adjustments impact the RL training process
- The relative weights between different rewards
Regarding the F1 score calculation: Is it computed based on the number of entries in your graph? I'm curious about the granularity of reward design, as different reward components seem to operate at different levels of detail:
- Format reward appears to be one-dimensional
- F1 reward seems to be a composite metric derived from multiple sub-level data points
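
To make the question concrete, here is a minimal sketch of how I picture the two rewards being combined. Everything in it is my own assumption rather than something taken from your code: the function names, the `<answer>` tag check, the entry-level set F1, and the weights `w_format` and `w_f1` are all hypothetical.

```python
def f1_reward(predicted: set, gold: set) -> float:
    """Entry-level F1 between predicted and gold entries (assumed granularity)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def format_reward(output: str) -> float:
    """Binary, one-dimensional check; the <answer> tags are a hypothetical format."""
    well_formed = output.startswith("<answer>") and output.endswith("</answer>")
    return 1.0 if well_formed else 0.0


def total_reward(output: str, predicted: set, gold: set,
                 w_format: float = 0.2, w_f1: float = 1.0) -> float:
    """Weighted sum of both components; the weights are placeholders, not tuned values."""
    return w_format * format_reward(output) + w_f1 * f1_reward(predicted, gold)
```

In this sketch the format reward can only be 0 or 1, while the F1 reward changes a little with every matched or missed entry, which is the mismatch I am asking about.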
This granularity difference in reward design could potentially affect the training dynamics. Would love to hear your thoughts on handling these different scales of feedback. Thanks :)