annawegmann.bsky.social
PhD candidate in NLP at Utrecht University | Accounting for language variation in ML/NLP | Tokenizers! | Paraphrases | she/her https://annawegmann.github.io/

🚨 NEW WORKSHOP ALERT 🚨 We're thrilled to announce the first-ever Tokenization Workshop (TokShop) at #ICML2025 @icmlconf.bsky.social! 🎉 Submissions are open for work on tokenization across all areas of machine learning. 📅 Submission deadline: May 30, 2025 🔗 tokenization-workshop.github.io

PhD thesis submitted ✅

(1) *Insults Welsh language* (2) *Excitedly studies increased use of less than socially acceptable Welsh in Welsh participants*

What encoding error is this? It can't be the language I spent five years learning. Tokenizers, we stand no chance

📢 Closing this week! 📢 Supervised by @zeerak.bsky.social and starting Sept 2025, the project will examine the ethical implications of natural language processing. Apply now ▶️ edin.ac/40PAXEq

First step, identify all English variation and collect texts representing it. No biggie

buuuurn

It's great to read old methods sections. Pretend to be a lost shopper and secretly study language. Thanks Labov.

When I say my name, people start speaking French to me, although my French is basic. That also happens with AI systems. We wrote a whole paper on that, testing across models for presumed cultural identity based on names w/ Siddhesh Pawar @rnv.bsky.social @iaugenstein.bsky.social

Today we are launching a server dedicated to Tokenization research! Come join us! discord.gg/CDJhnSvU

I wrote down some thoughts about what sociolinguistics can contribute to LLMs and vice versa, now available dx.doi.org/10.1111/lnc3...

Hey @blueskystarterpack.com please add: go.bsky.app/8P9ftjL

It's Sunday morning, so taking a minute for a nerdy thread (on math, tokenizers and LLMs) on the work of our intern Garreth. By adding a few lines of code to the base Llama 3 tokenizer, he got a free boost in arithmetic performance 😮 [thread]

✨New pre-print!✨ Successful language technologies should work for a wide variety of languages. But some languages have systematically worse performance than others. In this paper we ask whether performance differences are due to morphological typology. Spoiler: I don’t think so! #NLP #linguistics

Include more diversity in NLP! The project is super relevant and timely. Plus Dong is a great advisor! Utrecht is a beautiful city. We have a growing number of cool NLP people. I’ll also be part of the project 🎓 ➡️ ✉️🧑‍⚕️

Turns out I'm terrible at stopping things. Current tasks feel like **the** most important thing ever
• Email: networking genius at work
• Code: shit happens; must refactor the entire universe
• Write: find all the meaning; clearly my true calling
• Advise: help others thrive; Future = Shaped

The Netherlands feel cold and dark after #EMNLP and I am not sure I needed to know

Our paper on the effect of ChatGPT on activity on @stackoverflow.com.web.brid.gy is out: academic.oup.com/pnasnexus/ar... @maria-drc.bsky.social, Nadzeya Laurentsyeva & I find a 25% decrease in activity on SO within 6 months of #ChatGPT's release vs counterfactuals. Why does it matter?

Cool work by @jhuclsp colleagues Rafael Rivera Soto and Nick Andrews on how AI-generated text carries unique stylistic fingerprints, enabling the detection and identification of specific language models. Based on ICLR paper: arxiv.org/pdf/2401.06712 hub.jhu.edu/2024/11/18/a...

#EMNLP has a nice set of tokenization/subword modeling papers this year. It's a good mix of tokenization algorithms, tokenization evaluation, tokenization-free methods, and subword embedding probing. Lmk if I missed some! Here is a list with links + presentation time (in chronological order).

Since the starter pack only allows for 150 places, I figure it'll be nice to create some specific ones 🤗 I've put a few women in NLP but please let me know if you're in, I'll add you 🫶 go.bsky.app/FCQ134m

If you're an NLP researcher and haven't made it into either Starter Pack yet, please let me know! We're over halfway full at this point 😧 go.bsky.app/JgneRQk

A starter pack for #NLP #NLProc researchers! 🎉 go.bsky.app/SngwGeS

Work in progress -- suggestions for NLP-ers based in the EU/Europe & already on Bluesky very welcome! go.bsky.app/NZDc31B

I've filled up the first Women in AI starter pack (thanks for all your nominations!) so here's part 2 go.bsky.app/2wr669L

Here’s my women in AI starter pack, to help bring some diversity to your feed go.bsky.app/LaGDpqg

EMNLP was a blast

Measure the style of your texts using our popular style embedding model huggingface.co/AnnaWegmann/...

Interested in whether people👂 each other in a conversation? 🚨 #EMNLP2024 with Tijs van den Broek and Dong Nguyen about detecting paraphrases between speakers 🤖 Detect? huggingface.co/AnnaWegmann/... 📊 Analyze? huggingface.co/datasets/Ann... 📄 Read? aclanthology.org/2024.emnlp-m...