siree.sh
PhD student @ltiatcmu.bsky.social. Working on NLP that centers worker agency. Otherwise: coffee, fly fishing, and keeping peach pits around, for...some reason https://siree.sh
120 posts 2,372 followers 2,745 following

Gently, I would like to say: When people tell you that they would appreciate a feature that does something automatically, it's not responsive to that concern to explain that by going through several steps for every individual instance, they can get the same result in each instance.

When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs: 🧵1/9

I for one am grateful for the opportunity to meditate on the meaning of “scientific artifact” at 2:15am

Great points in the replies here re: civility being an unequally applied, generally awful standard. It's also a great example of how common non-contextual formulations of "sentiment analysis" or "toxicity detection" in NLP lead to systems that stop sounding good under even slight scrutiny.

AI as another pivot to video is a thought I've also had, and this article is such a great articulation. The way "AI" is framed, pitched, and deployed by big players reflects at best ignorance, and at worst a real contempt for the social nature and humanity of our jobs and online spaces

This talk was such a joy to do! If you'd like to read the paper, it's here: arxiv.org/abs/2411.17840. Thank you for having us, @patrickbriansmith.bsky.social!

If you're at NAACL this week (or just want to keep track), I have a feed for you: bsky.app/profile/did:... Currently pulling everyone that mentions NAACL, posts a link from the ACL Anthology, or has NAACL in their username. Happy conferencing!

Ever trusted a metric that works great on average, only for it to fail in your specific use case? In our #NAACL2025 paper (w/ @841io.bsky.social), we show why global evaluations are not enough and why context matters more than you think. 📄 aclanthology.org/2025.finding... #NLP #Evaluation (🧵1/9)

🚀 Excited to share a new interp+agents paper: 🐭🐱 MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools, appearing at #NAACL2025. This was work done at @msftresearch.bsky.social last summer with Jason Eisner, Justin Svegliato, Ben Van Durme, Yu Su, and Sam Thomson 1/🧵

This is absurdly great, but I haven't read a single news article about it. A fully open source, offline-first alternative to Notion that's a collab between the French and German governments because they want to host docs securely and on their own terms. THIS is what Europe should be doing.

one thing in AI is not new -- people taking one small part of a job, mischaracterizing it, ignoring all the other stuff, and then assuming the AI can do the whole job on its own