Quick thread in response to a question on token packing practices when pretraining LLMs!
Reposted from Will Held
Yes! Token packing has been the standard since RoBERTa. Excerpt below!

The intuition is that the model quickly learns not to attend across [SEP] boundaries, and packing avoids "wasting" compute on the padding tokens that would otherwise be needed to give variable-length sequences in a batch a consistent length. A rough sketch of the idea is below.

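For anyone curious how packing looks in practice, here's a minimal sketch (toy token IDs, a made-up helper, and an assumed separator id — not any particular library's API): documents are concatenated with a separator token into one flat stream and sliced into fixed-length chunks, so every position in a training example is a real token rather than padding.

```python
SEP_ID = 2        # assumed separator token id (e.g. [SEP] or EOS)
CONTEXT_LEN = 8   # toy context length for illustration

def pack_documents(docs, sep_id=SEP_ID, context_len=CONTEXT_LEN):
    """Concatenate tokenized docs with a separator, then chunk the stream."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(sep_id)
    # Slice the flat token stream into fixed-length training examples;
    # any leftover tail shorter than context_len is dropped here.
    return [
        stream[i:i + context_len]
        for i in range(0, len(stream) - context_len + 1, context_len)
    ]

docs = [[11, 12, 13, 14, 15], [21, 22, 23, 24, 25, 26], [31, 32, 33]]
for chunk in pack_documents(docs):
    print(chunk)
# [11, 12, 13, 14, 15, 2, 21, 22]
# [23, 24, 25, 26, 2, 31, 32, 33]
```

Note that every chunk is completely full, so no compute goes to padding; the trade-off is that a chunk can start or end mid-document, which is why the model has to learn not to attend across the separator boundaries.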