Ooh, new FineWeb dataset just dropped: FineWeb 2 - 3T tokens of highly multilingual, top-quality filtered data, permissively licensed!
https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
Apologies to the GPU-poors (like me!) who can only imagine what they could build with it, if only they had 10^25 FLOPs lying around.
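In the meantime, poking at the data streams for free. A minimal sketch; the per-language config name and the field names here are my assumptions, so check the dataset card:

```python
# Minimal sketch: stream a handful of documents without downloading the dataset.
# The per-language config ("fra_Latn") and the "text" field are assumptions --
# check the dataset card for the actual config/column names.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="fra_Latn",     # assumed per-language config
    split="train",
    streaming=True,      # no 10^25 FLOPs (or terabytes of disk) required
)

for i, doc in enumerate(ds):
    print(doc["text"][:200].replace("\n", " "))
    if i >= 4:
        break
```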
Comments
That's admittedly not 10^25 FLOPs, but we can do a lot with "just" 10^16.
I'll be spending the interim ten years learning how to make the most of those resources once they become available.
But currently 10^16 FLOPs can't even get you 1B tokens' worth of training.
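Back-of-the-envelope with the common C ≈ 6·N·D approximation (compute ≈ 6 × params × tokens for dense transformers):

```python
# What does 10^16 FLOPs buy over 1B tokens, assuming C ~= 6 * N * D?
flops = 1e16
tokens = 1e9
params = flops / (6 * tokens)
print(f"~{params:,.0f} parameters")  # ~1,666,667 -- a toy ~1.7M-param model
```

So over 1B tokens you're compute-matched to a ~1.7M-parameter model, i.e. toy scale.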
10^16 FLOPs is woefully insufficient to train frontier models today, but you're right that we will continue to come up with better techniques. There is also a lot we can do already, like continued pretraining that updates only a few unfrozen layers.
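Roughly something like this; the model name and the "model.layers.*"/"lm_head" parameter prefixes are just for illustration (they assume a Llama-style model loaded via transformers), not a recipe:

```python
# Rough sketch of continued pretraining with most layers frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M"  # any small open base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Freeze everything, then unfreeze only the last two blocks (and lm_head if untied).
for p in model.parameters():
    p.requires_grad = False
n_layers = model.config.num_hidden_layers
unfrozen = {f"model.layers.{i}." for i in range(n_layers - 2, n_layers)}
for name, p in model.named_parameters():
    if name.startswith("lm_head") or any(name.startswith(u) for u in unfrozen):
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

# ...then a standard causal-LM loop over new data (e.g. FineWeb 2 batches):
batch = tokenizer("Continued-pretraining text goes here.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```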
I guess when I get excited about a nice open dataset, I am mostly imagining new model _lineages_, not derived from the current best opaque 'open weights' base models.