alkinun
AtAndDev
22 followers · 20 following
AI & ML interests
LLMs, Alignment, Merging, Unsloth, DPO, SFT, ORPO, SPIN..
Recent Activity
Reacted to AlexBodner's post with 🔥 about 13 hours ago
Just published a post explaining Monte Carlo Tree Search: the magic behind AlphaZero and now used to tackle reasoning benchmarks with LLMs. Check it out because it's a must know nowadays! https://x.com/AlexBodner_/status/1877789879398244382
Reacted to davanstrien's post with 🤝 about 22 hours ago
The https://huggingface.co/datasets/data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have got 1,000 annotations for the educational quality of data from https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1). Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pre-training.

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:
- Evaluate whether an LLM can label the educational quality of texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and heuristics for refining fineweb2 further for different languages.

This week the following languages were done:
Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod
Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate
Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community
Contribute yourself here: https://huggingface.co/spaces/data-is-better-together/fineweb-c
Reacted to davanstrien's post with 🔥 about 22 hours ago
Organizations
Posts
1
@s3nh Hey man check your discord! Got some news.
spaces
3
🐢 DeepSense.ai • Bicycle and E-Bike Detection Model • Sleeping
💻 marco-qwq-7B • Sleeping
💻 AIDC AI Marco O1 • Running on Zero
models
7
AtAndDev/marco-qwq-7B • Text Generation • Updated Dec 8, 2024 • 21
AtAndDev/Ogno-Monarch-Neurotic-9B-Passthrough • Text Generation • Updated Mar 1, 2024 • 66
AtAndDev/Ogno-Monarch-Neurotic-7B-Dare-Ties • Text Generation • Updated Mar 1, 2024 • 72
AtAndDev/Marcoro14-7B-Slerp • Text Generation • Updated Mar 1, 2024 • 10
AtAndDev/CapybaraMarcoroni-7B • Text Generation • Updated Jan 7, 2024 • 785
AtAndDev/ShortKing-3b-v0.2 • Text Generation • Updated Oct 2, 2023 • 73 • 2
AtAndDev/ShortKing-1.4b-v0.1 • Text Generation • Updated Sep 29, 2023 • 3.63k • 2
datasets
10
AtAndDev/chain-of-diffusion • Viewer • Updated 6 days ago • 6.45k • 51
AtAndDev/clip-bicycle-e-bike • Viewer • Updated 11 days ago • 6k • 22
AtAndDev/QwQ-LongCoT-59k-cleaned • Viewer • Updated Dec 6, 2024 • 59.2k • 121
AtAndDev/sedir-clean • Viewer • Updated Dec 5, 2024 • 11.8k • 55
AtAndDev/sedir-unclean • Viewer • Updated Dec 5, 2024 • 19.9k • 49
AtAndDev/ultrachat_200k_formatted • Viewer • Updated Oct 10, 2024 • 208k • 34
AtAndDev/MedInstruct • Viewer • Updated Jul 20, 2024 • 216 • 32
AtAndDev/MedRag-textbooks-stella_en_400M_v5 • Viewer • Updated Jul 14, 2024 • 126k • 36
AtAndDev/MedRag-textbooks-gte-large-en-v1.5 • Viewer • Updated Jul 14, 2024 • 126k • 34
AtAndDev/MedRag-textbooks-mxbai-embed-large-v1 • Viewer • Updated Jul 14, 2024 • 126k • 41