
mcognetta.bsky.social • 118 days ago

#EMNLP has a nice set of tokenization/subword modeling papers this year.

It's a good mix of tokenization algorithms, tokenization evaluation, tokenization-free methods, and subword embedding probing. Lmk if I missed some!

Here is a list with links + presentation time (in chronological order).

Comments

mcognetta.bsky.social•118 days ago

On the Proper Treatment of Tokenization in Psycholinguistics - https://aclanthology.org/2024.emnlp-main.1032/ - Nov 12 (Tue) 11:00-12:30

Leading Whitespaces of Language Models’ Subword Vocabulary Pose a Confound for Calculating Word Probabilities - https://aclanthology.org/2024.emnlp-main.202/ - Nov 12 (Tue) 11:00-12:30

mcognetta.bsky.social•118 days ago

T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings - https://aclanthology.org/2024.emnlp-main.1217/ - Nov 12 (Tue) 14:00-15:30

Lexically Grounded Subword Segmentation - https://aclanthology.org/2024.emnlp-main.421/ - Nov 12 (Tue) 14:00-15:30

mcognetta.bsky.social•118 days ago

Distributional Properties of Subword Regularization - https://aclanthology.org/2024.emnlp-main.600/ - Nov 12 (Tue) 14:00-15:30

Subword Segmentation in LLMs: Looking at Inflection and Consistency - https://aclanthology.org/2024.emnlp-main.672/ - Nov 12 (Tue) 14:00-15:30

mcognetta.bsky.social•118 days ago

CUTE: Measuring LLMs’ Understanding of Their Tokens - https://aclanthology.org/2024.emnlp-main.177/ - Nov 12 (Tue) 14:00-15:30

Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs - https://aclanthology.org/2024.emnlp-main.543/ - Nov 12 (Tue) 14:00-15:30

mcognetta.bsky.social•118 days ago

Tokenization Is More Than Compression - https://aclanthology.org/2024.emnlp-main.40/ - Nov 13 (Wed) 10:30-12:00

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training - https://aclanthology.org/2024.emnlp-main.925/ - Nov 14 (Thu) 10:30-12:00

mcognetta.bsky.social•118 days ago

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models - https://aclanthology.org/2024.emnlp-main.649/ - Nov 14 (Thu) 14:00-15:30

johnegan.bsky.social•105 days ago

hi marco,

am trying to develop options for probabilistic firewalls
Q: what are the best security measures that you are aware of to help stop or mitigate probabilistic injection?
the simplest form of probabilistic injection is a ‘prompt injection’

tomlim.bsky.social•118 days ago

Also this one:

Lexically Grounded Subword Segmentation
https://aclanthology.org/2024.emnlp-main.421/

Poster Session Nov 12 (Tue) 2 pm 🙂

tomlim.bsky.social•118 days ago

Fantastic list, thank you!

adamwiemerslage.bsky.social•118 days ago

I was also skimming all the tokenization papers today! Here is one more:

https://arxiv.org/pdf/2406.20086

mcognetta.bsky.social•118 days ago

Got that one covered already on the list (but maybe it's easy to miss because the thumbnail shows up only for one link)!

adamwiemerslage.bsky.social•118 days ago

Oh oops!

craigschmidt.com•105 days ago

Great list. People talking about our paper on here inspired me to open a Bluesky account.

