A nice analysis of different tokenization strategies (BPE, WordPiece, SentencePiece) on protein sequences.
https://arxiv.org/abs/2411.17669
https://academic.oup.com/bioinformatics/article/40/4/btae196/7645044?login=false
TL;DR: performance gains from tokenization depend largely on dataset composition and on the specific task (possibly its complexity).