We'd like to recommend a scientific-domain dataset that we made: EricLu/SCP-116K
https://huggingface.co/datasets/EricLu/SCP-116K
SCP-116K is a large-scale dataset containing 116,756 high-quality scientific problem-solution pairs, automatically extracted from web-crawled documents. The dataset covers multiple scientific disciplines, including physics, chemistry, and biology, with content ranging from undergraduate to doctoral level. Each problem is accompanied by its matched solution, as well as solutions generated by advanced language models (o1-mini and QwQ-32B-preview) along with validation flags.
Unfortunately, for technical reasons, we have not yet provided R1 distillation data. We will address this as soon as possible in a later iteration.
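For anyone who wants to use the validation flags mentioned above, here is a minimal sketch of filtering problem-solution records down to those where at least one model-generated solution was flagged as valid. The field names (`o1_mini_matched`, `qwq_matched`) and the sample records are hypothetical and only for illustration; please check the dataset card for the actual schema.

```python
# Hypothetical flag field names; the real column names may differ,
# so check the dataset card before relying on them.
FLAG_FIELDS = ("o1_mini_matched", "qwq_matched")


def keep_validated(records, flag_fields=FLAG_FIELDS):
    """Return the records where at least one validation flag is truthy."""
    return [r for r in records if any(r.get(f) for f in flag_fields)]


# Made-up sample records standing in for rows of the dataset.
sample = [
    {"problem": "P1", "o1_mini_matched": True, "qwq_matched": False},
    {"problem": "P2", "o1_mini_matched": False, "qwq_matched": False},
    {"problem": "P3", "o1_mini_matched": False, "qwq_matched": True},
]

validated = keep_validated(sample)
print([r["problem"] for r in validated])
```

In practice the rows would presumably come from the Hugging Face `datasets` library (for example via `load_dataset("EricLu/SCP-116K")`) rather than a hand-written list, and the same filter could then be applied to them once the real flag columns are known.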
That's very cool! Thanks for creating this dataset :)
Please also have a look at this dataset: https://agi.safe.ai/
I think the downside of thinking models is that even for simple questions they may spend a lot of thinking tokens. I think we should have a dataset to train LLMs to figure out when to use a thinking strategy and when to simply answer the question like regular LLMs do.