Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou

tl;dr: increasing the input vocabulary consistently helps; increasing the output vocabulary helps for bigger models.
https://arxiv.org/abs/2501.16975

Comments