Quick thread in response to a question on token packing practices when pretraining LLMs!
Reposted from Will Held
Yes! Token packing has been the standard since RoBERTa. Excerpt below!

The intuition is that the model quickly learns not to attend across [SEP] boundaries, and packing avoids "wasting" compute on the padding tokens that would otherwise be needed to give variable-length sequences in a batch a consistent length. A rough sketch of the idea is below.

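For anyone curious how packing looks in practice, here's a minimal sketch (toy token IDs, a made-up helper, and an assumed separator id — not any particular library's API): documents are concatenated with a separator token into one flat stream and sliced into fixed-length chunks, so every position in a training example is a real token rather than padding.

```python
SEP_ID = 2        # assumed separator token id (e.g. [SEP] or EOS)
CONTEXT_LEN = 8   # toy context length for illustration

def pack_documents(docs, sep_id=SEP_ID, context_len=CONTEXT_LEN):
    """Concatenate tokenized docs with a separator, then chunk the stream."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_id)
    # Slice the flat token stream into fixed-length training examples;
    # any leftover tail shorter than context_len is dropped here.
    return [
        stream[i:i + context_len]
        for i in range(0, len(stream) - context_len + 1, context_len)
    ]

docs = [[11, 12, 13, 14, 15], [21, 22, 23, 24, 25, 26], [31, 32, 33]]
for chunk in pack_documents(docs):
    print(chunk)
# [11, 12, 13, 14, 15, 2, 21, 22]
# [23, 24, 25, 26, 2, 31, 32, 33]
```

Note that every chunk is completely full, so no compute goes to padding; the trade-off is that a chunk can start or end mid-document, which is why the model has to learn not to attend across the separator boundaries.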