🚀 Introducing the Byte Latent Transformer (BLT) – an LLM architecture that scales better than Llama 3 by using dynamically sized patches instead of tokens 🤯
Paper 📄 https://dl.fbaipublicfiles.com/blt/BLT__Patches_Scale_Better_Than_Tokens.pdf
Code 🛠️ https://github.com/facebookresearch/blt
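The core idea: instead of a fixed tokenizer vocabulary, BLT groups raw bytes into patches, opening a new patch where a small byte-level LM reports high next-byte entropy, so compute concentrates on the hard-to-predict parts of the input. Here is a minimal sketch of that entropy-threshold patching; the entropy values and threshold are fabricated for illustration, and the real boundaries come from a trained byte LM, not this toy:

```python
def entropy_patches(data: bytes, entropies: list[float], threshold: float) -> list[bytes]:
    """Group bytes into patches, starting a new patch whenever the
    next-byte entropy crosses a global threshold (BLT-style patching)."""
    assert len(data) == len(entropies)
    patches: list[bytes] = []
    current = bytearray()
    for byte, h in zip(data, entropies):
        # High entropy means the byte LM is uncertain here, so the
        # patcher opens a fresh patch and spends more compute on it.
        if current and h > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(byte)
    if current:
        patches.append(bytes(current))
    return patches

# Toy usage with made-up entropy scores (an assumption, not paper data):
text = b"Hello, world!"
fake_entropies = [3.1, 0.4, 0.3, 0.2, 0.2, 2.8, 0.5, 2.9, 0.6, 0.4, 0.3, 0.2, 2.5]
print(entropy_patches(text, fake_entropies, threshold=2.0))
# -> [b'Hello', b', ', b'world', b'!']
```

Patch sizes adapt to how predictable the data is, unlike fixed tokens, which is what lets BLT trade tokenizer vocabulary for compute allocation.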
Comments
Wouldn't be surprised if we see this in Llama 4.