Does anyone know why the Llama (et al.) tokenizer has duplicate merges (e.g., "▁bas ically" and "▁basic ally", both of which produce "▁basically")?

The vocabulary has ~32k tokens, but ~64k merges. Not all tokens have duplicate merges that form them, but some have many ("▁render" has 6 and "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁" [all spaces] has 15!).
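For anyone who wants to reproduce the counts, here is a rough sketch of how I tallied duplicate merges. It assumes the Hugging Face fast-tokenizer layout, where `tokenizer.json` stores the vocab under `model.vocab` and the ordered merge list under `model.merges` (the file path and the exact merge format may differ between checkpoints):

```python
import json
from collections import Counter

# Load the tokenizer.json shipped with the HF checkpoint
# (path is an assumption -- point it at the actual tokenizer file).
with open("tokenizer.json") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]    # token -> id
merges = tok["model"]["merges"]  # merge rules in priority order

print(f"vocab size:  {len(vocab)}")   # ~32k
print(f"merge count: {len(merges)}")  # ~64k

# Count how many distinct merges yield each merged token.
merged_token_counts = Counter()
for merge in merges:
    # Newer tokenizer.json files store merges as ["left", "right"];
    # older ones store them as a single "left right" string.
    left, right = merge if isinstance(merge, list) else merge.split(" ")
    merged_token_counts[left + right] += 1

# Tokens reachable by more than one merge, e.g. "▁basically"
# via "▁bas"+"ically" and via "▁basic"+"ally".
dupes = {t: n for t, n in merged_token_counts.items() if n > 1}
print(f"tokens with duplicate merges: {len(dupes)}")
for token, n in sorted(dupes.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{n:2d}  {token!r}")
```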
