Do you understand how the reward model is built here? The paper says it's rule-based, built on correctness — so is it only applied to prompts drawn from math problems and LeetCode-style coding problems? And how were the prompts chosen or generated for the RL phase?
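For context on what "rule-based on correctness" could mean in practice: rather than a learned reward model, the reward can be computed by deterministic checks on the model's output. Below is a minimal sketch, assuming (as an illustration, not taken from the paper's code) that math answers are emitted in a `\boxed{}` wrapper and reasoning is wrapped in `<think>` tags:

```python
import re

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Rule-based correctness check: extract the final \\boxed{} answer
    from the response and compare it to the reference answer string.
    Returns 1.0 on an exact match, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def format_reward(response: str) -> float:
    """Rule-based format check: reward responses that put their
    reasoning inside <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

response = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
print(accuracy_reward(response, "4"), format_reward(response))  # 1.0 1.0
```

A check like this only works when correctness is mechanically verifiable (a final numeric answer, or code that passes test cases), which is presumably why the question about non-verifiable prompts matters.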
Recent Activity:
- Commented on the article "From Zero to Reasoning Hero: How DeepSeek-R1 Leverages Reinforcement Learning to Master Complex Reasoning" (26 days ago)
- Updated the model ndvb/segformer-b0-finetuned-segments-sidewalk-oct-22 (6 months ago)
- Updated the collection "Text to image" (over 1 year ago)
ndvb's activity:
- Upvoted a paper (over 1 year ago)
- "Code to run the benchmark" — #2, opened over 1 year ago by ndvb
- "What about training?" — #2, opened over 1 year ago by ndvb
- "Code to run the Glue on Huggingface models?" — #11, opened over 1 year ago by ndvb
- "How can we see the code that does the training?" — #2, opened over 1 year ago by ndvb
- "Adding `safetensors` variant of this model" — #1, opened over 1 year ago by SFconvertbot (1 comment)
- "How do I export it to torchscript?" — #2, opened over 2 years ago by elavneet (2 comments)