The FineWeb team is happy to finally release "FineWeb2" 🥂🥳

FineWeb 2 extends the data-driven approach to pre-training dataset design introduced in FineWeb 1, and now covers 1893 languages/scripts

Details: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2

A detailed open-science tech report is coming soon