LeMaterial

Enterprise
non-profit

AI & ML interests

AI4Science

Recent Activity

Articles

LeMaterial's activity

msiron
in LeMaterial/LeMat-Bulk 20 days ago
msiron
updated a Space about 1 month ago
inelgnu
updated a Space about 2 months ago
msiron
updated a Space about 2 months ago
thomwolf
posted an update about 2 months ago
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo, @hynky, @lvwerra, as well as @vsabolcec, Bettina Messmer, @negar-foroutan, and @mjaggi
thomwolf
posted an update about 2 months ago
thomwolf
posted an update about 2 months ago
thomwolf
posted an update 2 months ago
thomwolf
posted an update 2 months ago