chautmpham.bsky.social
PhD student @umdcs @ClipUMD | Previously @manningcs @MSFTResearch | Long-form Generation & Long-context Reasoning | https://chtmp223.github.io
33 posts 2,057 followers 555 following

I see this work as our answer to the "cultural alignment" and "cultural benchmarking" trends in NLP research. Instead of making decisions for people, we consider "culture" in a specific setting with specific people for a specific task, and we ask people directly about their cultural adaptations.

🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts? 🧟 You get what we call a Frankentext! 💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.

We find that LLMs (e.g. GPT-4o, LLaMA-3.1) consistently recall book content across languages, even for texts without an official translation in the pre-training data! Great work led by undergrads at UMass NLP 🥳

One of the ways that LLMs can be inconsistent is the "generator-validator gap," where LLMs deem their own answers incorrect. 🎯 We demonstrate that ranking-based discriminator training can significantly reduce this gap, and improvements on one task often generalize to others! 🧵👇

📚 Check out the newest JCA article by Li Lucy (@lucy3.bsky.social), Camilla Griffiths, Claire Ying, JJ Kim-Ebio, Sabrina Baur, Sarah Levine, Jennifer L. Eberhardt, David Bamman (@dbamman.bsky.social), and Dorottya Demszky. culturalanalytics.org/article/1316...

A very cool paper shows that you can use an RL loss to improve story generation via some clever training setups on known texts (e.g. grounding predictions against a next chapter you already know). RL is starting to generalize already!

We have updated #nocha, a leaderboard for reasoning over long-context narratives 📖, with some new models, including #Gemini 2.5 Pro, which shows massive improvements over the previous version! Congrats to the #Gemini team 🪄 🧙 Check 🔗 novelchallenge.github.io for details :)

New paper from our team @GoogleDeepMind! 🚨 We've put LLMs to the test as writing co-pilots – how good are they really at helping us write? LLMs are increasingly used for open-ended tasks like writing assistance, but how do we assess their effectiveness? 🤔 arxiv.org/pdf/2503.19711

Our lab had a #dogathon 🐕 yesterday where we analyzed NYC Open Data on dog licenses. We learned a lot of dog facts, which I’ll share in this thread 🧵 1) Geospatial trends: Cavalier King Charles Spaniels are common in Manhattan; the opposite is true for Yorkshire Terriers.

💡New preprint & Python package: We use sparse autoencoders to generate hypotheses from large text datasets. Our method, HypotheSAEs, produces interpretable text features that predict a target variable, e.g. features in news headlines that predict engagement. 🧵1/

Ask OpenAI Operator for bus routes from your home in Vietnam to a university and it likely fails because it refuses to use Google Maps! Our new BEARCUBS 🐻 benchmark shows that computer-using agents still struggle with seemingly straightforward multimodal questions.

Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers? We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all! Our analysis across 26 languages 🧵👇

Excited to share our preprint "Provocations from the Humanities for Generative AI Research" We're open to feedback—read & share thoughts! @laurenfklein.bsky.social @mmvty.bsky.social @docdre.distributedblackness.net @mariaa.bsky.social @jmjafrx.bsky.social @nolauren.bsky.social @dmimno.bsky.social

🚨 New Position Paper 🚨 Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬 We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠 Here's why MCQA evals are broken, and how to fix them 🧵

⚠️Current methods for generating instruction-following data fall short for long-range reasoning tasks like narrative claim verification. We present CLIPPER ✂️, a compression-based pipeline that produces grounded instructions for ~$0.5 each, 34x cheaper than human annotations.

🤖🍲 What can LLMs do for sustainable food? 🤖🍲 We collaborated with domain experts (food scientists and chefs) to define a typology of food design and prediction tasks. LLMs can assist in food and menu development, saving food scientists' time and reducing emissions! URL: bit.ly/3ERJbUV

People often claim they know when ChatGPT wrote something, but are they as accurate as they think? Turns out that while the general population is unreliable, those who frequently use ChatGPT for writing tasks can spot even "humanized" AI-generated text with near-perfect accuracy 🎯

Excited to share that today our paper recommender platform www.scholar-inbox.com has reached 20k users! We hope to reach 100k by the end of the year. Lots of new features are currently in the works and will roll out soon.

During my time on the SIGGEN board, we received a request from the @aclmeeting.bsky.social executive board to create an overview of dual-use issues in Natural Language Generation. In response, I carried out a survey. The results are here: arxiv.org/abs/2501.06636 Feedback is very welcome.

📢 The 7th Workshop on Narrative Understanding (WNU) will happen with #NAACL2025 and is open for submissions. 🌐: tinyurl.com/wnu25 Direct Submission: February 17 Pre-Reviewed (ARR) papers: March 10 Excited to organize this again and hope to see you in Albuquerque 🌵 early this May! #wnu2025 #NLProc

Something I don't understand is: why can't LLMs write novel-length fiction yet? They've got the context length for it. And new models seem capable of the multi-hop reasoning required for plot. So why hasn't anyone demoed a model that can write long interesting stories? I do have a theory ... +

A short list of tips for keeping a clean, organized ML codebase for new researchers: eugenevinitsky.com/posts/quick-...

🚨I too am on the job market‼️🤯 I'm searching for faculty positions/postdocs in multilingual/multicultural NLP, vision+language models, and eval for genAI! I'll be at #NeurIPS2024 presenting our work on meta-evaluation for text-to-image faithfulness! Let's chat there! Papers in🧵, see more: saxon.me

BERTopic users: how do you retrieve the documents most associated with a given topic? I can see some possible options from the documentation, but I'm most interested in standard practice (NB: please don't take this question as a tacit endorsement of BERTopic, I'm just trying to evaluate it fairly)
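(For context on the options in play: two routes the BERTopic documentation describes are `get_representative_docs(topic)` and sorting the DataFrame from `get_document_info(docs)` by its probability column. Framework-agnostic, the operation reduces to ranking documents by their topic probability. A minimal sketch with toy data and a hypothetical helper, not BERTopic's own code:)

```python
import numpy as np

# Toy topic-document probability matrix: rows = documents, cols = topics.
# In BERTopic, such probabilities come from fit_transform() when
# calculate_probabilities=True, or from approximate_distribution().
docs = ["doc A", "doc B", "doc C", "doc D"]
doc_topic_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.9],
])

def top_docs_for_topic(topic_id, probs, documents, k=2):
    """Return the k documents with the highest probability for topic_id."""
    order = np.argsort(probs[:, topic_id])[::-1][:k]  # descending by probability
    return [documents[i] for i in order]

print(top_docs_for_topic(0, doc_topic_probs, docs))  # → ['doc A', 'doc C']
```

Whether this ranking matches `get_representative_docs` depends on the model's internals (that method samples from cluster-representative documents rather than sorting all documents), so the two can disagree at the margins.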

Hi everyone, I am excited to share our large-scale survey study with 800+ researchers, which reveals researchers’ usage and perceptions of LLMs as research tools, and how the usage and perceptions differ based on demographics. See results in comments! 🔗 Arxiv link: arxiv.org/abs/2411.05025

Hi, so I've spent the past almost-decade studying research uses of public social media data, like e.g. ML researchers using content from Twitter, Reddit, and Mastodon. Anyway, buckle up this is about to be a VERY long thread with lots of thoughts and links to papers. 🧵

TRL is a cornerstone of LLM post-training and imo it's the default to learn. There are great alternatives like Unsloth, Axolotl, and AutoTrain. But if you want a daily driver that takes you from experimentation to production, it's TRL. 🧵 these community notebooks guide you through TRL's core:

Papers from our group! 🤓 - Queer culture in television: 2024.computational-humanities-research.org/papers/paper... - Acting in American film: naitian.org/once-more-wi... - Classification w/ LLMs in cultural analytics: 2024.computational-humanities-research.org/papers/paper...

Mat is not on 🦋—posting on his behalf! It's time to revisit common assumptions in IR! Embeddings have improved drastically, but mainstream IR evals have stagnated since MSMARCO + BEIR. We ask: on private or tricky IR tasks, are rerankers better? Surely, reranking many docs is best?

💬 Have you or a loved one compared LM probabilities to human linguistic acceptability judgments? You may be overcompensating for the effect of frequency and length! 🌟 In our new paper, we rethink how we should be controlling for these factors 🧵:

Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives https://arxiv.org/abs/2410.05558