Does anyone know why the Llama (et al.) tokenizer has duplicate merges (e.g., "▁bas ically" and "▁basic ally")?
The vocabulary has ~32k tokens, but ~64k merges. Not all tokens have duplicate merges that form them, but some have many ("▁render" has 6 and "▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁" [all spaces] has 15!).
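For reference, here is roughly how the duplicates can be counted. This is a sketch assuming the Hugging Face `tokenizer.json` conversion of the Llama tokenizer (the file path is a placeholder):

```python
import json
from collections import Counter

with open("tokenizer.json", encoding="utf-8") as f:
    model = json.load(f)["model"]

# Merges are stored either as "left right" strings or as [left, right] pairs,
# depending on the tokenizers version that wrote the file.
pairs = [m if isinstance(m, (list, tuple)) else m.split(" ", 1) for m in model["merges"]]
formed = Counter(left + right for left, right in pairs)

dupes = {tok: n for tok, n in formed.items() if n > 1}
print(f"{len(pairs)} merges, {len(model['vocab'])} vocab entries, "
      f"{len(dupes)} tokens formed by more than one merge")
print(formed.get("▁render"), formed.get("▁" * 16))  # the two examples above
```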
Comments
But most of the space merges here should still be usable since they skip the 3-space merge.
Maybe they were actually added during regular training? If they were added manually, what were the criteria? Does it actually change any real-life tokenizations?
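One way to test the "real-life tokenizations" question empirically is to drop every merge whose result token already has an earlier merge and diff the encodings. This is only a sketch: it assumes string-format merges in a Hugging Face `tokenizer.json`, "keep the first merge per token" as the dedup policy, and placeholder paths and sample text:

```python
import json
from tokenizers import Tokenizer

# Keep only the first (highest-priority) merge for each resulting token,
# write the modified spec back out, then compare encodings.
spec = json.load(open("tokenizer.json", encoding="utf-8"))
seen, kept = set(), []
for m in spec["model"]["merges"]:          # assumes "left right" string merges
    tok = "".join(m.split(" ", 1))
    if tok not in seen:
        seen.add(tok)
        kept.append(m)
spec["model"]["merges"] = kept
with open("tokenizer.dedup.json", "w", encoding="utf-8") as f:
    json.dump(spec, f, ensure_ascii=False)

original = Tokenizer.from_file("tokenizer.json")
deduped = Tokenizer.from_file("tokenizer.dedup.json")
for text in ["basically it renders fine", "    indented    code    "]:  # any corpus you care about
    same = original.encode(text).ids == deduped.encode(text).ids
    print("same" if same else "DIFFERENT", repr(text))
```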
People definitely do this for vocabulary expansion, but I have never seen it done midway through the merge list.
If the duplicates were added during training, there is no reason to believe they would have similar co-occurrence counts at similar times (and therefore similar merge priorities).
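For intuition, here is a toy sketch of a standard word-level BPE training loop (not Llama's actual SentencePiece code): at each step the single highest-count pair becomes the next merge, so merge priority is just the order in which pairs win that count, and two different decompositions of the same token would each have to win their own round at whatever counts happen to hold then.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer. `words` maps a tuple of symbols to its corpus frequency."""
    merges = []
    for _ in range(num_merges):
        # Count co-occurrences of adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # highest count wins this round
        merges.append(best)
        # Apply the winning merge everywhere before counting again.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

# Toy corpus, just to show the greedy priority order.
print(train_bpe({tuple("▁basically"): 10, tuple("▁basic"): 7, tuple("▁ally"): 3}, 8))
```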