New preprint w/ @jennhu.bsky.social @kmahowald.bsky.social : Can LLMs introspect about their knowledge of language?
Across models and domains, we did not find evidence that LLMs have privileged access to their own predictions. 🧵(1/8)
Practical reasons: an LLM that can report its internal states would be safer + more reliable.
Scientific reasons: we shouldn’t use meta-linguistic prompts (e.g. acceptability judgments) w/ LLMs unless they can introspect about their linguistic knowledge! (2/8)
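To make the contrast concrete, here is a minimal sketch (not the paper's code) of the two ways of probing an LLM's linguistic knowledge: a direct string-probability comparison on a minimal pair vs. a meta-linguistic acceptability prompt. The model name and sentences are just illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def log_prob(sentence: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # loss is the mean negative log-likelihood per predicted token; undo the averaging
    return -out.loss.item() * (ids.shape[1] - 1)

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."

# Direct measure: which sentence does the model itself assign higher probability?
direct_pref = log_prob(good) > log_prob(bad)

# Meta-linguistic measure: ask the model to judge acceptability via a prompt.
prompt = f'Is the following sentence grammatical? Answer Yes or No.\nSentence: "{bad}"\nAnswer:'
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]
yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
no_id = tok(" No", add_special_tokens=False).input_ids[0]
meta_judgment = "Yes" if logits[yes_id] > logits[no_id] else "No"

print(direct_pref, meta_judgment)
```

The introspection question is whether the second kind of answer tracks the first within the same model.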
There is also a takeaway for linguistics: meta-linguistic prompting does not necessarily tap into the linguistic generalizations reflected in an LLM’s internal model of language. (7/8)
Can this be thought of as model consistency? I ask because I know folks saw this as a recurring issue in the past.
Maybe more interesting: is there a scaling trend?
But our focus here is more on the *within-model vs. across-model* correlation, which I think is different from the consistency typically discussed.
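For what that comparison looks like in practice, here is a hypothetical sketch (illustrative arrays, not the paper's data or analysis): "privileged access" would predict that a model's prompted judgments correlate more strongly with its *own* direct probabilities than with another model's.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 200

# Direct measurements (e.g., log-probability differences on minimal pairs)
direct_model_A = rng.normal(size=n_items)
direct_model_B = direct_model_A + rng.normal(scale=0.5, size=n_items)

# Meta-linguistic judgments elicited from model A by prompting
meta_model_A = direct_model_A + rng.normal(scale=0.8, size=n_items)

# Within-model: do A's prompted judgments track A's own probabilities...
within, _ = spearmanr(meta_model_A, direct_model_A)
# ...better than they track another model's probabilities (across-model)?
across, _ = spearmanr(meta_model_A, direct_model_B)

print(f"within-model rho = {within:.2f}, across-model rho = {across:.2f}")
```

The paper's reported result is that the within-model correlation is not reliably higher than the across-model one.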