craigschmidt.com
Interested in ML, AI, and NLP, particularly tokenization. Live in the Boston area and work at Kensho Technologies.
22 posts 413 followers 2,185 following
Regular Contributor
Conversation Starter

If you have an interest in tokenization in Natural Language Processing (NLP), this is a nice Discord. Come say hi.

Super honored that this paper received the best paper award at #COLING2025!

I got a 70, despite all the time I spent reading The Economist this year.

The final entry in my #EMNLP2024 fav papers was this paper aclanthology.org/2024.finding... from Thomas L. Griffiths' keynote. It used rotational ciphers like ROT-13 and ROT-3 to disentangle forms of reasoning in Chain-of-Thought. Good cipher joke in the keynote! (see p. 24 arxiv.org/abs/2309.13638)
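
For anyone who wants to poke at the idea, here's a minimal ROT-N sketch (my own illustration, not the paper's code). ROT-13 is common in web text, so a model can pattern-match it; rarer shifts like ROT-3 force actual reasoning.

def rot_n(text: str, n: int) -> str:
    """Shift each letter n places, wrapping within the alphabet."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + n) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(rot_n("hello", 13))  # uryyb (ROT-13: frequent online, easy to memorize)
print(rot_n("hello", 3))   # khoor (ROT-3: rare, so the model has to reason)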

This #EMNLP2024 best paper aclanthology.org/2024.emnlp-m... showed large gains over their (somewhat weak) baseline at determining whether a given document was in an LLM's pre-training data. Progress on an important problem.
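
For context, a generic loss-based membership baseline looks roughly like this (my sketch, not the paper's method; "gpt2" is just a stand-in model): documents seen during pre-training tend to get lower loss, so you threshold the average per-token negative log-likelihood.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_nll(text: str) -> float:
    # mean per-token negative log-likelihood of the document under the LM
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# lower NLL -> more likely the document appeared in the training data
print(avg_nll("To be, or not to be, that is the question."))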

This #EMNLP2024 outstanding paper (aclanthology.org/2024.emnlp-m..., underline.io/events/469/s...) shows that LMs can learn a rare grammatical construction like "a beautiful five days", even without any examples in the training data, by generalizing from more common phenomena.

This #EMNLP2024 poster (aclanthology.org/2024.emnlp-m..., underline.io/events/469/p...) was about avoiding hallucination without human feedback. If you compare an answer sampled at a higher temperature to a beam-search generation, the latter tends to be more factual, which gives you preference pairs for DPO.
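
Roughly, as I understand the recipe (my sketch, not the authors' code; "gpt2" is a stand-in model):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def make_preference_pair(prompt: str) -> dict:
    inputs = tok(prompt, return_tensors="pt")
    # beam search stays close to high-probability text -> more factual -> "chosen"
    beam = model.generate(**inputs, num_beams=5, do_sample=False, max_new_tokens=50)
    # high-temperature sampling drifts more -> more hallucination -> "rejected"
    sampled = model.generate(**inputs, do_sample=True, temperature=1.2, max_new_tokens=50)
    return {
        "prompt": prompt,
        "chosen": tok.decode(beam[0], skip_special_tokens=True),
        "rejected": tok.decode(sampled[0], skip_special_tokens=True),
    }

# pairs like this can then be fed to a standard DPO trainer
pair = make_preference_pair("The capital of Australia is")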

This paper underline.io/events/469/s... at #EMNLP2024 had one of my favorite takeaways: if you fine-tune an LLM on new knowledge it doesn't already know, you encourage hallucinations.

I wanted to post about a few of my favorite #EMNLP2024 papers, starting with a couple in tokenization. Fishing for Magikarp explores the problem of undertrained "glitch" tokens, and how they can be identified from their embedding vectors. aclanthology.org/2024.emnlp-m...
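
A crude version of the embedding idea (my illustration, not the paper's exact method): under-trained tokens barely move from their initialization, so their embedding rows look anomalous, e.g. unusually small norms.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

emb = model.get_input_embeddings().weight.detach()  # (vocab_size, hidden_dim)
norms = emb.norm(dim=1)
for idx in torch.argsort(norms)[:20].tolist():  # 20 lowest-norm tokens
    print(idx, repr(tok.convert_ids_to_tokens(idx)), round(float(norms[idx]), 3))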

Hey @blueskystarterpack.com, please add: go.bsky.app/8P9ftjL

I made a starter pack for people in NLP working in the area of tokenization. Let me know if you'd like to be added: go.bsky.app/8P9ftjL

I really enjoyed #EMNLP2024. It was an honor to present our tokenization paper aclanthology.org/2024.emnlp-m.... I’m planning to post about some of my favorite papers soon, but here is a nice write-up.

As @marcoher.bsky.social noted, this is an interesting confirmation of arxiv.org/pdf/2402.14903. I wonder if Llama 3's tokenizer has all 3-digit numbers in the vocab? (GPT-4's does, Claude's doesn't.) If not, it would be fun to look at errors on those missing numbers.
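
Checking is easy enough, something like this (a sketch; swap in the Llama 3 tokenizer, e.g. "meta-llama/Meta-Llama-3-8B"; "gpt2" here is just an ungated stand-in):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
missing = [n for n in range(100, 1000)
           if len(tok.encode(str(n), add_special_tokens=False)) > 1]
print(f"{len(missing)} of 900 three-digit numbers need more than one token")
# caveat: results can differ with a leading space, e.g. tok.encode(" 123")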

There's a known bug in how we compute "word" probabilities with subword-based LMs that mark beginnings of words, as pointed out by Byung-doh Oh and Will Schuler, and by @tpimentel.bsky.social and Clara Meister. I'm pleased to announce that minicons now includes a fix which runs batch-wise!
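
A toy single-example version of the correction as I understand it (minicons does this properly and batch-wise): with a tokenizer that marks word beginnings, like GPT-2's "Ġ" prefix, the event "the next word is w" is "w's subtokens appear AND the following token starts a new word", so the naive subtoken product has to be multiplied by the probability mass of word-initial tokens at the next position.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# ids whose string form starts a new word ("Ġ" prefix), plus end-of-text
bow_ids = torch.tensor(
    [i for t, i in tok.get_vocab().items() if t.startswith("Ġ")]
    + [tok.eos_token_id]
)

def word_logprob(context: str, word: str) -> float:
    ids = tok.encode(context + " " + word)  # " word" picks up the Ġ marker
    n_ctx = len(tok.encode(context))
    with torch.no_grad():
        logprobs = model(torch.tensor([ids])).logits[0].log_softmax(-1)
    # naive part: sum of the word's subtoken log-probabilities
    lp = sum(logprobs[i - 1, ids[i]].item() for i in range(n_ctx, len(ids)))
    # fix: the *next* token must begin a new word (or the text must end)
    return lp + logprobs[-1, bow_ids].logsumexp(-1).item()

print(word_logprob("I saw a", "cat"))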

#EMNLP has a nice set of tokenization/subword modeling papers this year. It's a good mix of tokenization algorithms, tokenization evaluation, tokenization-free methods, and subword embedding probing. Lmk if I missed some! Here is a list with links + presentation times (in chronological order).