annawegmann.bsky.social - Profile | ThreadSky | a Reddit-style client for Bluesky

🚨 NEW WORKSHOP ALERT 🚨 We're thrilled to announce the first-ever Tokenization Workshop (TokShop) at #ICML2025 @icmlconf.bsky.social! 🎉 Submissions are open for work on tokenization across all areas of machine learning. 📅 Submission deadline: May 30, 2025 🔗 tokenization-workshop.github.io

submitted 64 days ago • 1 comment

PhD thesis submitted ✅

submitted 57 days ago • 2 comments

(1) Insults Welsh language (2) Excitedly studies increased use of less than socially acceptable Welsh in Welsh participants

submitted 90 days ago • 0 comments

What encoding error is this? It can't be the language I spent five years learning. Tokenizers, we stand no chance

submitted 96 days ago • 2 comments

📢 Closing this week! 📢 Supervised by @zeerak.bsky.social and starting Sept 2025, the project will examine the ethical implications of natural language processing. Apply now ▶️ edin.ac/40PAXEq

submitted 99 days ago • 1 comment

First step, identify all English variation and collect texts representing it. No biggie

submitted 107 days ago • 1 comment

buuuurn

submitted 110 days ago • 1 comment

It's great to read old methods sections. Pretend to be a lost shopper and secretly study language. Thanks Labov.

submitted 111 days ago • 1 comment

When I say my name, people start speaking French to me, although my French is basic. That also happens with AI systems. We wrote a whole paper on that, testing across models for presumed cultural identity based on names w/ Siddhesh Pawar @rnv.bsky.social @iaugenstein.bsky.social

submitted 118 days ago • 3 comments

Today we are launching a server dedicated to Tokenization research! Come join us! discord.gg/CDJhnSvU

submitted 128 days ago • 3 comments

I wrote down some thoughts about what sociolinguistics can contribute to LLMs and vice versa, now available dx.doi.org/10.1111/lnc3...

submitted 134 days ago • 0 comments

Hey @blueskystarterpack.com please add: go.bsky.app/8P9ftjL

submitted 180 days ago • 0 comments

It's Sunday morning, so I'm taking a minute for a nerdy thread (on math, tokenizers and LLMs) about the work of our intern Garreth. By adding a few lines of code to the base Llama 3 tokenizer, he got a free boost in arithmetic performance 😮 [thread]

submitted 206 days ago • 5 comments

✨New pre-print!✨ Successful language technologies should work for a wide variety of languages. But some languages have systematically worse performance than others. In this paper we ask whether performance differences are due to morphological typology. Spoiler: I don’t think so! #NLP #linguistics

submitted 208 days ago • 2 comments

Include more diversity in NLP! The project is super relevant and timely. Plus Dong is a great advisor! Utrecht is a beautiful city. We have a growing number of cool NLP people. I’ll also be part of the project 🎓 ➡️ ✉️🧑‍⚕️

submitted 208 days ago • 0 comments

Turns out I'm terrible at stopping things. Current tasks feel like the most important thing ever • Email: networking genius at work • Code: shit happens; must refactor the entire universe • Write: find all the meaning; clearly my true calling • Advise: help others thrive; Future = Shaped

submitted 208 days ago • 0 comments

The Netherlands feel cold and dark after #EMNLP and I am not sure I needed to know

submitted 209 days ago • 0 comments

Our paper on the effect of ChatGPT on activity on @stackoverflow.com.web.brid.gy is out: academic.oup.com/pnasnexus/ar... @maria-drc.bsky.social, Nadzeya Laurentsyeva & I find a 25% decrease in activity on SO within 6 months of #ChatGPT 's release vs counterfactuals. Why does it matter?

submitted 215 days ago • 3 comments

Cool work by @jhuclsp colleagues Rafael Rivera Soto and Nick Andrews on how AI-generated text carries unique stylistic fingerprints, enabling the detection and identification of specific language models. Based on ICLR paper: arxiv.org/pdf/2401.06712 hub.jhu.edu/2024/11/18/a...

submitted 211 days ago • 0 comments

#EMNLP has a nice set of tokenization/subword modeling papers this year. It's a good mix of tokenization algorithms, tokenization evaluation, tokenization-free methods, and subword embedding probing. Lmk if I missed some! Here is a list with links + presentation time (in chronological order).

submitted 218 days ago • 5 comments

Since the starter pack only allows for 150 places, I figure it'll be nice to create some specific ones 🤗 I've put a few women in NLP but please let me know if you're in, I'll add you 🫶 go.bsky.app/FCQ134m

submitted 213 days ago • 11 comments

If you're an NLP researcher and haven't made it into either Starter Pack yet, please let me know! We're over halfway full at this point 😧 go.bsky.app/JgneRQk

submitted 212 days ago • 40 comments

A starter pack for #NLP #NLProc researchers! 🎉 go.bsky.app/SngwGeS

submitted 226 days ago • 45 comments

Work in progress -- suggestions for NLP-ers based in the EU/Europe & already on Bluesky very welcome! go.bsky.app/NZDc31B

submitted 220 days ago • 48 comments

I've filled up the first Women in AI starter pack (thanks for all your nominations!) so here's part 2 go.bsky.app/2wr669L

submitted 212 days ago • 10 comments

Here’s my women in AI starter pack, to help bring some diversity to your feed go.bsky.app/LaGDpqg

submitted 220 days ago • 13 comments

EMNLP was a blast

submitted 211 days ago • 0 comments

Measure the style of your texts using our popular style embedding model huggingface.co/AnnaWegmann/...

submitted 211 days ago • 1 comment

Interested in whether people👂 each other in a conversation? 🚨 #EMNLP2024 with Tijs van den Broek and Dong Nguyen about detecting paraphrases between speakers 🤖 Detect? huggingface.co/AnnaWegmann/... 📊 Analyze? huggingface.co/datasets/Ann... 📄 Read? aclanthology.org/2024.emnlp-m...

submitted 211 days ago • 1 comment