craigschmidt.com
Interested in ML, AI, and NLP, particularly tokenization. Live in the Boston area and work at Kensho Technologies.
22 posts 413 followers 2,185 following
Regular Contributor
Conversation Starter

If you have an interest in tokenization in Natural Language Processing (NLP), this is a nice Discord. Come say hi.

Super honored that this paper received the best paper award at #COLING2025!

I got a 70, despite all the time I spent reading The Economist this year.

The final entry in my #EMNLP2024 fav papers was this paper aclanthology.org/2024.finding... from Thomas L. Griffiths' keynote. It used rotational ciphers like ROT-13 and ROT-3 to disentangle forms of reasoning in Chain-of-Thought. Good cipher joke in the keynote! (see p. 24 arxiv.org/abs/2309.13638)
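
For anyone who wants to poke at the idea, here's a minimal ROT-N sketch (my own illustration, not the paper's code). ROT-13 is common in web text, so a model can pattern-match it; rarer shifts like ROT-3 force actual reasoning.

def rot_n(text: str, n: int) -> str:
    """Shift each letter n places, wrapping within the alphabet."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + n) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(rot_n("hello", 13))  # uryyb (ROT-13: frequent online, easy to memorize)
print(rot_n("hello", 3))   # khoor (ROT-3: rare, so the model has to reason)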

This #EMNLP2024 best paper aclanthology.org/2024.emnlp-m... showed large gains over their (somewhat weak) baseline at determining whether a given document was in an LLM's pre-training data. Progress on an important problem.
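
For context, a generic loss-based membership baseline looks roughly like this (my sketch, not the paper's method; "gpt2" is just a stand-in model): documents seen during pre-training tend to get lower loss, so you threshold the average per-token negative log-likelihood.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_nll(text: str) -> float:
    # mean per-token negative log-likelihood of the document under the LM
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# lower NLL -> more likely the document appeared in the training data
print(avg_nll("To be, or not to be, that is the question."))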

This #EMNLP2024 outstanding paper (aclanthology.org/2024.emnlp-m..., underline.io/events/469/s...) shows that LMs can learn a rare grammatical construction like "a beautiful five days", even without any examples in the training data, by generalizing from more common phenomena.

This #EMNLP2024 poster (aclanthology.org/2024.emnlp-m..., underline.io/events/469/p...) was about avoiding hallucination without human feedback. If you compare an answer sampled at a higher temperature to a beam-search generation, the latter tends to be more factual, which gives you preference pairs for DPO.
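
Roughly, as I understand the recipe (my sketch, not the authors' code; "gpt2" is a stand-in model):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def make_preference_pair(prompt: str) -> dict:
    inputs = tok(prompt, return_tensors="pt")
    # beam search stays close to high-probability text -> more factual -> "chosen"
    beam = model.generate(**inputs, num_beams=5, do_sample=False, max_new_tokens=50)
    # high-temperature sampling drifts more -> more hallucination -> "rejected"
    sampled = model.generate(**inputs, do_sample=True, temperature=1.2, max_new_tokens=50)
    return {
        "prompt": prompt,
        "chosen": tok.decode(beam[0], skip_special_tokens=True),
        "rejected": tok.decode(sampled[0], skip_special_tokens=True),
    }

# pairs like this can then be fed to a standard DPO trainer
pair = make_preference_pair("The capital of Australia is")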

This paper underline.io/events/469/s... at #EMNLP2024 had one of my favorite takeaways: if you fine-tune an LLM on new knowledge it doesn't already know, you encourage hallucinations.

I wanted to post about a few of my favorite #EMNLP2024 papers, starting with a couple in tokenization. Fishing for Magikarp explores the problem of undertrained "glitch" tokens, and how they can be identified from their embedding vectors. aclanthology.org/2024.emnlp-m...
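
A crude version of the embedding idea (my illustration, not the paper's exact method): under-trained tokens barely move from their initialization, so their embedding rows look anomalous, e.g. unusually small norms.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

emb = model.get_input_embeddings().weight.detach()  # (vocab_size, hidden_dim)
norms = emb.norm(dim=1)
for idx in torch.argsort(norms)[:20].tolist():  # 20 lowest-norm tokens
    print(idx, repr(tok.convert_ids_to_tokens(idx)), round(float(norms[idx]), 3))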

Hey @blueskystarterpack.com, please add: go.bsky.app/8P9ftjL

I made a starter pack for people in NLP working in the area of tokenization. Let me know if you'd like to be added: go.bsky.app/8P9ftjL

I really enjoyed #EMNLP2024. It was an honor to present our tokenization paper aclanthology.org/2024.emnlp-m.... I’m planning to post about some of my favorite papers soon, but here is a nice write-up.

As @marcoher.bsky.social noted, this is an interesting confirmation of arxiv.org/pdf/2402.14903. I wonder if Llama 3's tokenizer has all 3-digit numbers in the vocab? (GPT-4's does, Claude's doesn't.) If not, it would be fun to look at errors on those missing numbers.
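
Checking is easy enough, something like this (a sketch; swap in the Llama 3 tokenizer, e.g. "meta-llama/Meta-Llama-3-8B"; "gpt2" here is just an ungated stand-in):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
missing = [n for n in range(100, 1000)
           if len(tok.encode(str(n), add_special_tokens=False)) > 1]
print(f"{len(missing)} of 900 three-digit numbers need more than one token")
# caveat: results can differ with a leading space, e.g. tok.encode(" 123")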

There's a known bug in how we compute "word" probabilities with subword-based LMs that mark beginnings of words, as pointed out by Byung-doh Oh and Will Schuler, and by @tpimentel.bsky.social and Clara Meister. I'm pleased to announce that minicons now includes a fix which runs batch-wise!
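
A toy single-example version of the correction as I understand it (minicons does this properly and batch-wise): with a tokenizer that marks word beginnings, like GPT-2's "Ġ" prefix, the event "the next word is w" is "w's subtokens appear AND the following token starts a new word", so the naive subtoken product has to be multiplied by the probability mass of word-initial tokens at the next position.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# ids whose string form starts a new word ("Ġ" prefix), plus end-of-text
bow_ids = torch.tensor(
    [i for t, i in tok.get_vocab().items() if t.startswith("Ġ")]
    + [tok.eos_token_id]
)

def word_logprob(context: str, word: str) -> float:
    ids = tok.encode(context + " " + word)  # " word" picks up the Ġ marker
    n_ctx = len(tok.encode(context))
    with torch.no_grad():
        logprobs = model(torch.tensor([ids])).logits[0].log_softmax(-1)
    # naive part: sum of the word's subtoken log-probabilities
    lp = sum(logprobs[i - 1, ids[i]].item() for i in range(n_ctx, len(ids)))
    # fix: the *next* token must begin a new word (or the text must end)
    return lp + logprobs[-1, bow_ids].logsumexp(-1).item()

print(word_logprob("I saw a", "cat"))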

#EMNLP has a nice set of tokenization/subword modeling papers this year. It's a good mix of tokenization algorithms, tokenization evaluation, tokenization-free methods, and subword embedding probing. Lmk if I missed some! Here is a list with links + presentation times (in chronological order).