Does anyone know why the Llama (et al.) tokenizer has duplicate merges (e.g., "▁bas ically" and "▁basic ally")?
The vocabulary has ~32k tokens, but ~64k merges. Not all tokens have duplicate merges that form them, but some have many ("▁render" has 6 and "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁" [all spaces] has 15!).
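For reference, here is roughly how the duplicates can be counted. This is a sketch assuming the Hugging Face `tokenizer.json` conversion of the Llama tokenizer (the file path is a placeholder):

```python
import json
from collections import Counter

with open("tokenizer.json", encoding="utf-8") as f:
    model = json.load(f)["model"]

# Merges are stored either as "left right" strings or as [left, right] pairs,
# depending on the tokenizers version that wrote the file.
pairs = [m if isinstance(m, (list, tuple)) else m.split(" ", 1) for m in model["merges"]]
formed = Counter(left + right for left, right in pairs)

dupes = {tok: n for tok, n in formed.items() if n > 1}
print(f"{len(pairs)} merges, {len(model['vocab'])} vocab entries, "
      f"{len(dupes)} tokens formed by more than one merge")
print(formed.get("▁render"), formed.get("▁" * 16))  # the two examples above
```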
Comments
But most of the space merges here should still be usable since they skip the 3-space merge.
Maybe they were actually added during regular training? If they were added manually, what were the criteria? Does it actually change any real-life tokenizations?
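One way to test the "real-life tokenizations" question empirically is to drop every merge whose result token already has an earlier merge and diff the encodings. This is only a sketch: it assumes string-format merges in a Hugging Face `tokenizer.json`, "keep the first merge per token" as the dedup policy, and placeholder paths and sample text:

```python
import json
from tokenizers import Tokenizer

# Keep only the first (highest-priority) merge for each resulting token,
# write the modified spec back out, then compare encodings.
spec = json.load(open("tokenizer.json", encoding="utf-8"))
seen, kept = set(), []
for m in spec["model"]["merges"]:          # assumes "left right" string merges
    tok = "".join(m.split(" ", 1))
    if tok not in seen:
        seen.add(tok)
        kept.append(m)
spec["model"]["merges"] = kept
with open("tokenizer.dedup.json", "w", encoding="utf-8") as f:
    json.dump(spec, f, ensure_ascii=False)

original = Tokenizer.from_file("tokenizer.json")
deduped = Tokenizer.from_file("tokenizer.dedup.json")
for text in ["basically it renders fine", "    indented    code    "]:  # any corpus you care about
    same = original.encode(text).ids == deduped.encode(text).ids
    print("same" if same else "DIFFERENT", repr(text))
```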
People definitely do this for vocabulary expansion, but I have never seen it done midway through the merge list.
If the duplicates were added during training, there is no reason to believe they would have similar co-occurrence counts at similar times (and therefore similar merge priorities).
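For intuition, here is a toy sketch of a standard word-level BPE training loop (not Llama's actual SentencePiece code): at each step the single highest-count pair becomes the next merge, so merge priority is just the order in which pairs win that count, and two different decompositions of the same token would each have to win their own round at whatever counts happen to hold then.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer. `words` maps a tuple of symbols to its corpus frequency."""
    merges = []
    for _ in range(num_merges):
        # Count co-occurrences of adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # highest count wins this round
        merges.append(best)
        # Apply the winning merge everywhere before counting again.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

# Toy corpus, just to show the greedy priority order.
print(train_bpe({tuple("▁basically"): 10, tuple("▁basic"): 7, tuple("▁ally"): 3}, 8))
```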