navitagoyal.bsky.social
PhD student @umdcs, Member of @ClipUmd lab | Earlier @AdobeResearch, @IITRoorkee
3 posts · 246 followers · 186 following

🚨 New Position Paper 🚨 Multiple choice evals for LLMs are simple and popular, but we know they are awful 😬 We complain they're full of errors, saturated, and test nothing meaningful, so why do we still use them? 🫠 Here's why MCQA evals are broken, and how to fix them 🧵

How can we generate synthetic data for a task that requires global reasoning over a long context (e.g., verifying claims about a book)? LLMs aren't good at *solving* such tasks, let alone generating data for them. Check out our paper for a compression-based solution!

This paper is really cool. They decompose NLI (and defeasible NLI) hypotheses into atoms, and then use these atoms to measure the logical consistency of LLMs. E.g. for an entailment NLI example, each hypothesis atom should also be entailed by the premise. Very nice idea 👏👏
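A minimal sketch of that atom-level consistency check, assuming the hypothesis atoms are already extracted (e.g., by an LLM prompt, which the paper handles separately) and using the off-the-shelf roberta-large-mnli model rather than whatever the authors actually used:

```python
# Sketch: for an entailment example, a logically consistent model should
# also predict ENTAILMENT for every atom of the decomposed hypothesis.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts ENTAILMENT for (premise, hypothesis)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Read the label name from the config rather than hardcoding an index.
    return model.config.id2label[logits.argmax(dim=-1).item()] == "ENTAILMENT"

def consistent_entailment(premise: str, hypothesis_atoms: list[str]) -> bool:
    # Consistency check: the full hypothesis is entailed, so each atom must be too.
    return all(entails(premise, atom) for atom in hypothesis_atoms)

premise = "A man in a red shirt is playing guitar on stage."
atoms = ["A man is playing guitar.", "The man is on stage.", "The man wears a red shirt."]
print(consistent_entailment(premise, atoms))
```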

Please join us for AI at Work: Building and Evaluating Trust, presented by our Trustworthy AI in Law & Society (TRAILS) institute. Feb 3-4, Washington, DC. Open to all! Details and registration at trails.gwu.edu/trailscon-2025; sponsorship details at trails.gwu.edu/media/556

This is my first time serving as an AC for a big conference. Just read this great work by Goyal et al. (arxiv.org/abs/2411.11437). I'm optimizing for high coverage and low redundancy; assigning reviewers based on relevant topics or affinity scores alone feels off. Seniority and diversity matter!
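For a toy illustration of what trading raw affinity for coverage might look like (not the cited paper's method, and all names and data below are hypothetical): greedily pick the reviewer who covers the most not-yet-covered paper topics, breaking ties by affinity.

```python
# Toy sketch: greedy reviewer assignment favoring topic coverage over pure affinity.
def assign_reviewers(paper_topics: set[str],
                     reviewer_topics: dict[str, set[str]],
                     affinity: dict[str, float],
                     k: int = 3) -> list[str]:
    chosen: list[str] = []
    covered: set[str] = set()
    candidates = set(reviewer_topics)
    while len(chosen) < k and candidates:
        # Greedy step: maximize newly covered topics, then affinity score.
        best = max(candidates,
                   key=lambda r: (len(reviewer_topics[r] & (paper_topics - covered)),
                                  affinity[r]))
        chosen.append(best)
        covered |= reviewer_topics[best] & paper_topics
        candidates.remove(best)
    return chosen

paper = {"NLI", "consistency", "evaluation"}
reviewers = {"r1": {"NLI", "evaluation"}, "r2": {"NLI"}, "r3": {"consistency", "fairness"}}
aff = {"r1": 0.9, "r2": 0.95, "r3": 0.6}
print(assign_reviewers(paper, reviewers, aff, k=2))  # ['r1', 'r3']: r3 adds coverage despite lower affinity
```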