chautmpham.bsky.social
PhD student @umdcs @ClipUMD | Previously @manningcs @MSFTResearch | Long-form Generation & Long-context Reasoning | https://chtmp223.github.io
33 posts 2,057 followers 555 following

I see this work as our answer to the "cultural alignment" and "cultural benchmarking" trends in NLP research. Instead of making decisions for people, we consider "culture" in a specific setting with specific people for a specific task, and we ask people directly about their cultural adaptations.

🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts? 🧟 You get what we call a Frankentext! 💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.

We find that LLMs (e.g. GPT-4o, LLaMA-3.1) consistently recall book content across languages, even for texts without an official translation in the pre-training data! Great work led by undergrads at UMass NLP 🥳

One of the ways that LLMs can be inconsistent is the "generator-validator gap," where LLMs deem their own answers incorrect. 🎯 We demonstrate that ranking-based discriminator training can significantly reduce this gap, and improvements on one task often generalize to others! 🧵👇

📚 Check out the newest JCA article by Li Lucy (@lucy3.bsky.social), Camilla Griffiths, Claire Ying, JJ Kim-Ebio, Sabrina Baur, Sarah Levine, Jennifer L. Eberhardt, David Bamman (@dbamman.bsky.social), and Dorottya Demszky. culturalanalytics.org/article/1316...

A very cool paper shows that you can use an RL loss to improve story generation via some clever training setups on known texts (e.g. grounding predictions against a next chapter you already know). RL is starting to generalize already!

We have updated #nocha, a leaderboard for reasoning over long-context narratives 📖, with some new models, including #Gemini 2.5 Pro, which shows massive improvements over the previous version! Congrats to the #Gemini team 🪄 🧙 Check 🔗 novelchallenge.github.io for details :)

New paper from our team @GoogleDeepMind! 🚨 We've put LLMs to the test as writing co-pilots – how good are they really at helping us write? LLMs are increasingly used for open-ended tasks like writing assistance, but how do we assess their effectiveness? 🤔 arxiv.org/pdf/2503.19711

Our lab had a #dogathon 🐕 yesterday where we analyzed NYC Open Data on dog licenses. We learned a lot of dog facts, which I’ll share in this thread 🧵 1) Geospatial trends: Cavalier King Charles Spaniels are common in Manhattan; the opposite is true for Yorkshire Terriers.

💡New preprint & Python package: We use sparse autoencoders to generate hypotheses from large text datasets. Our method, HypotheSAEs, produces interpretable text features that predict a target variable, e.g. features in news headlines that predict engagement. 🧵1/

Ask OpenAI Operator for bus routes from your home in Vietnam to a university and it likely fails because it refuses to use Google Maps! Our new BEARCUBS 🐻 benchmark shows that computer-using agents still struggle with seemingly straightforward multimodal questions.

Is the needle-in-a-haystack test still meaningful given the giant green heatmaps in modern LLM papers? We create ONERULER 💍, a multilingual long-context benchmark that allows for nonexistent needles. Turns out NIAH isn't so easy after all! Our analysis across 26 languages 🧵👇

Excited to share our preprint "Provocations from the Humanities for Generative AI Research" We're open to feedback—read & share thoughts! @laurenfklein.bsky.social @mmvty.bsky.social @docdre.distributedblackness.net @mariaa.bsky.social @jmjafrx.bsky.social @nolauren.bsky.social @dmimno.bsky.social

🚨 New Position Paper 🚨 Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬 We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠 Here's why MCQA evals are broken, and how to fix them 🧵

⚠️Current methods for generating instruction-following data fall short for long-range reasoning tasks like narrative claim verification. We present CLIPPER ✂️, a compression-based pipeline that produces grounded instructions for ~$0.5 each, 34x cheaper than human annotations.

🤖🍲 What can LLMs do for sustainable food? 🤖🍲 We collaborated with domain experts (food scientists and chefs) to define a typology of food design and prediction tasks. LLMs can assist in food and menu development, saving food scientists' time and reducing emissions! URL: bit.ly/3ERJbUV

People often claim they know when ChatGPT wrote something, but are they as accurate as they think? Turns out that while the general population is unreliable, those who frequently use ChatGPT for writing tasks can spot even "humanized" AI-generated text with near-perfect accuracy 🎯

Excited to share that today our paper recommender platform www.scholar-inbox.com has reached 20k users! We hope to reach 100k by the end of the year. Lots of new features are currently in the works and will roll out soon.

During my time on the SIGGEN board, we received a request from the @aclmeeting.bsky.social executive board to create an overview of dual-use issues in Natural Language Generation. In response, I carried out a survey. The results are here: arxiv.org/abs/2501.06636 Feedback is very welcome.

📢 The 7th Workshop on Narrative Understanding (WNU) will happen with #NAACL2025 and is open for submissions. 🌐: tinyurl.com/wnu25 Direct Submission: February 17 Pre-Reviewed (ARR) papers: March 10 Excited to organize this again and hope to see you in Albuquerque 🌵 early this May! #wnu2025 #NLProc

Something I don't understand is: why can't LLMs write novel-length fiction yet? They've got the context length for it. And new models seem capable of the multi-hop reasoning required for plot. So why hasn't anyone demoed a model that can write long interesting stories? I do have a theory ... +

A short list of tips for keeping a clean, organized ML codebase for new researchers: eugenevinitsky.com/posts/quick-...

🚨I too am on the job market‼️🤯 I'm searching for faculty positions/postdocs in multilingual/multicultural NLP, vision+language models, and eval for genAI! I'll be at #NeurIPS2024 presenting our work on meta-evaluation for text-to-image faithfulness! Let's chat there! Papers in🧵, see more: saxon.me

BERTopic users: how do you retrieve the documents most associated with a given topic? I can see some possible options from the documentation, but I'm most interested in standard practice (NB: please don't take this question as a tacit endorsement of BERTopic, I'm just trying to evaluate it fairly)
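(For context on the options in play: two routes the BERTopic documentation describes are `get_representative_docs(topic)` and sorting the DataFrame from `get_document_info(docs)` by its probability column. Framework-agnostic, the operation reduces to ranking documents by their topic probability. A minimal sketch with toy data and a hypothetical helper, not BERTopic's own code:)

```python
import numpy as np

# Toy topic-document probability matrix: rows = documents, cols = topics.
# In BERTopic, such probabilities come from fit_transform() when
# calculate_probabilities=True, or from approximate_distribution().
docs = ["doc A", "doc B", "doc C", "doc D"]
doc_topic_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.9],
])

def top_docs_for_topic(topic_id, probs, documents, k=2):
    """Return the k documents with the highest probability for topic_id."""
    order = np.argsort(probs[:, topic_id])[::-1][:k]  # descending by probability
    return [documents[i] for i in order]

print(top_docs_for_topic(0, doc_topic_probs, docs))  # → ['doc A', 'doc C']
```

Whether this ranking matches `get_representative_docs` depends on the model's internals (that method samples from cluster-representative documents rather than sorting all documents), so the two can disagree at the margins.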

Hi everyone, I am excited to share our large-scale survey study with 800+ researchers, which reveals researchers’ usage and perceptions of LLMs as research tools, and how the usage and perceptions differ based on demographics. See results in comments! 🔗 Arxiv link: arxiv.org/abs/2411.05025

Hi, so I've spent the past almost-decade studying research uses of public social media data, like e.g. ML researchers using content from Twitter, Reddit, and Mastodon. Anyway, buckle up this is about to be a VERY long thread with lots of thoughts and links to papers. 🧵

TRL is a cornerstone of LLM post-training and imo it's the default to learn. There are great alternatives like Unsloth, Axolotl, and AutoTrain. But if you want a daily driver that takes you from experimentation to production, it's TRL. 🧵 these community notebooks guide you through TRL's core:

Papers from our group! 🤓 - Queer culture in television: 2024.computational-humanities-research.org/papers/paper... - Acting in American film: naitian.org/once-more-wi... - Classification w/ LLMs in cultural analytics: 2024.computational-humanities-research.org/papers/paper...

Mat is not on 🦋—posting on his behalf! It's time to revisit common assumptions in IR! Embeddings have improved drastically, but mainstream IR evals have stagnated since MSMARCO + BEIR. We ask: on private or tricky IR tasks, are rerankers better? Surely, reranking many docs is best?

💬 Have you or a loved one compared LM probabilities to human linguistic acceptability judgments? You may be overcompensating for the effect of frequency and length! 🌟 In our new paper, we rethink how we should be controlling for these factors 🧵:

Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives https://arxiv.org/abs/2410.05558