Ooh, new FineWeb dataset just dropped: FineWeb 2 - 3T tokens of highly multilingual, top-quality filtered data, permissively licensed!
https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
Apologies to the GPU-poors (like me!) who can only imagine what they could build with it, if only they had 10^25 FLOPs lying around.
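In the meantime, poking at the data streams for free. A minimal sketch; the per-language config name and the field names here are my assumptions, so check the dataset card:

```python
# Minimal sketch: stream a handful of documents without downloading the dataset.
# The per-language config ("fra_Latn") and the "text" field are assumptions --
# check the dataset card for the actual config/column names.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="fra_Latn",     # assumed per-language config
    split="train",
    streaming=True,      # no 10^25 FLOPs (or terabytes of disk) required
)

for i, doc in enumerate(ds):
    print(doc["text"][:200].replace("\n", " "))
    if i >= 4:
        break
```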
Comments
That's admittedly not 10^25 FLOPs, but we can do a lot with "just" 10^16.
I'll be spending the interim ten years learning how to make the most of those resources once they become available.
But currently 10^16 FLOPs can't even get you 1B tokens' worth of training.
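Back-of-the-envelope with the common C ≈ 6·N·D approximation (compute ≈ 6 × params × tokens for dense transformers):

```python
# What does 10^16 FLOPs buy over 1B tokens, assuming C ~= 6 * N * D?
flops = 1e16
tokens = 1e9
params = flops / (6 * tokens)
print(f"~{params:,.0f} parameters")  # ~1,666,667 -- a toy ~1.7M-param model
```

So over 1B tokens you're compute-matched to a ~1.7M-parameter model, i.e. toy scale.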
10^16 FLOPs is woefully insufficient to train frontier models today, but you're right that we will continue to come up with better techniques. There is also a lot we can do already, like continued pretraining that updates only a few unfrozen layers.
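Roughly something like this; the model name and the "model.layers.*"/"lm_head" parameter prefixes are just for illustration (they assume a Llama-style model loaded via transformers), not a recipe:

```python
# Rough sketch of continued pretraining with most layers frozen.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M"  # any small open base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Freeze everything, then unfreeze only the last two blocks (and lm_head if untied).
for p in model.parameters():
    p.requires_grad = False
n_layers = model.config.num_hidden_layers
unfrozen = {f"model.layers.{i}." for i in range(n_layers - 2, n_layers)}
for name, p in model.named_parameters():
    if name.startswith("lm_head") or any(name.startswith(u) for u in unfrozen):
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

# ...then a standard causal-LM loop over new data (e.g. FineWeb 2 batches):
batch = tokenizer("Continued-pretraining text goes here.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
```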
I guess when I get excited about a nice open dataset, I am mostly imagining new model _lineages_, not derived from the current best opaque 'open weights' base models.