#EMNLP has a nice set of tokenization/subword modeling papers this year.
It's a good mix of tokenization algorithms, tokenization evaluation, tokenization-free methods, and subword embedding probing. Lmk if I missed some!
Here is a list with links + presentation time (in chronological order).
Comments
Leading Whitespaces of Language Models’ Subword Vocabulary Pose a Confound for Calculating Word Probabilities - https://aclanthology.org/2024.emnlp-main.202/ - Nov 12 (Tue) 11:00-12:30
Lexically Grounded Subword Segmentation - https://aclanthology.org/2024.emnlp-main.421/ - Nov 12 (Tue) 14:00-15:30
Subword Segmentation in LLMs: Looking at Inflection and Consistency - https://aclanthology.org/2024.emnlp-main.672/ - Nov 12 (Tue) 14:00-15:30
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs - https://aclanthology.org/2024.emnlp-main.543/ - Nov 12 (Tue) 14:00-15:30
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training - https://aclanthology.org/2024.emnlp-main.925/ - Nov 14 (Thu) 10:30-12:00
I am trying to develop options for probabilistic firewalls.
Q: What are the best security measures that you are aware of to help stop or mitigate probabilistic injection?
The simplest form of probabilistic injection is a ‘prompt injection’.
Lexically Grounded Subword Segmentation
https://aclanthology.org/2024.emnlp-main.421/
Poster Session Nov 12 (Tue) 2 pm 🙂
https://arxiv.org/pdf/2406.20086