New preprint! Randomly Sampled Language Reasoning Problems Reveal Limits of LLMs. In this paper with @kesnet50.bsky.social and my advisor Armando Solar-Lezama, we investigate how LLMs perform on randomly selected simple language reasoning problems. arxiv.org/abs/2501.02825 - ThreadSky

kavi.bsky.social • 50 days ago

New preprint! Randomly Sampled Language Reasoning Problems Reveal Limits of LLMs.

In this paper with @kesnet50.bsky.social and my advisor Armando Solar-Lezama, we investigate how LLMs perform on randomly selected simple language reasoning problems.

https://arxiv.org/abs/2501.02825

Comments

projectglint.com•48 days ago

This makes me wonder if a LLM can, given the description of a small DFA, simulate that DFA? When do they break down, is it just at context limits or earlier?

I'm also curious why LLMs don't already have this ability. If we trained them on related data, would they gain this skill (generalizably?)

kavi.bsky.social•50 days ago

TL;DR: they perform extremely poorly, falling behind even quite simple ngram-heuristic baselines.

kavi.bsky.social•50 days ago

Specifically, we sample 3-state DFAs and then see how well LLMs can generate a new accepted string given several accepted strings, or how well LLMs can compute whether a string is accepted, given the "transducer" output on the prefix (i.e., whether each prefix is accepted).

kavi.bsky.social•50 days ago

Why randomly sampled DFAs? Dataset leakage. Tasks that are off the shelf might literally be in the training dataset, and tasks that you invent off the top of your head might actually be similar to ones you have seen before. We wish to test LLMs on randomly selected problems to avoid leakage issues.

kavi.bsky.social•50 days ago

The original purpose of this project was to create a metric for LLM performance and see how far up the language reasoning difficulty ladder LLMs could climb. However, we found that the simplest possible nontrivial language family, 3-state DFAs, proved too complex.

kavi.bsky.social•50 days ago

Our conclusion is that LLMs are not fully general models of language that can accomplish in-context generalization on unseen languages, and high LLM performance on a variety of language tasks (e.g., being able to write in novel DSLs) is likely due at least in part to familiarity.

nsaphra.bsky.social•50 days ago

Is the ngram baseline trained on the specific DFA language while the LLM reasons in-context? Or is it trained on a collection of randomly generated DFAs?

tektology-enjoyer.bsky.social•49 days ago

How would ya contrast this vs Chollet's ARC Test?

kavi.bsky.social•49 days ago

ARC is far more complex and more humanlike (which is a desirable property in many ways for people who want LLMs to be more humanlike) but also more liable to unintentional dataset concept leakage in a way that sampling from a large parsimoniously defined set is not

kavi.bsky.social•49 days ago

I think the main benefit of our approach is the transparency and parsimony in how the problems are created, which excludes the possibility of, e.g., "problems that researchers come up with when they want to test something" being a small subspace of possible useful problems

kavi.bsky.social•49 days ago

The main disadvantage is maybe useful problems are a small subspace of all possible problems

tektology-enjoyer.bsky.social•49 days ago

Though it does get at the core of "Reasoning" (vs "the LSAT questions it's been trained on can give a good answer to your Symbolic Logic question"). While still being "fair" questions imho. Also kinda reminds me of the problems they gave to Copycat.

alloverthplacehold.bsky.social•49 days ago

Thanks for sharing!

Would love to see the test results for the LLMs specifically marketed as reasoning-focused (Gemini 2.0 Flash Thinking, OpenAI o1 (pro), DeepSeek Thinking) though.

kavi.bsky.social•49 days ago

Preliminary results on o1 suggest it's not good at this task. I'll see if I can get approval from my advisor to blow a few hundred bucks validating this

benalpers.bsky.social•49 days ago

📌

selectstarman.bsky.social•49 days ago

🎉

Comments

Posting Rules

Reply