Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?
We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵
Given access only to LLMs that have random few-shot accuracy on a task, can we predict the point in scaling (e.g., pretraining loss) at which performance will jump above random chance?
Our key insight: finetuning LLMs on a given task shifts the point in scaling at which emergence occurs toward less capable LLMs, and the magnitude of this shift is modulated by the amount of finetuning data.
We find that our emergence law can accurately predict the point of emergence up to 4x the FLOPs in advance.
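To make the idea concrete, here is a minimal, hypothetical sketch of the fitting-and-extrapolation step: finetune pre-emergence checkpoints with varying amounts of task data, fit accuracy as a function of pretraining loss and finetuning data size, then extrapolate to the few-shot (zero-data) limit. The sigmoidal form, the log-data shift, and all names and numbers (emergence_curve, CHANCE, the synthetic data) are illustrative assumptions, not the paper's actual emergence law or results.

```python
# Hedged sketch of an emergence-law fit (illustrative, not the paper's exact parameterization).
import numpy as np
from scipy.optimize import curve_fit

CHANCE = 0.25  # random-chance accuracy, e.g. 4-way multiple choice (assumed)

def emergence_curve(X, L_emerge, shift, slope):
    """Accuracy vs. pretraining loss; more finetuning data shifts the
    emergence point toward higher (worse) pretraining loss."""
    loss, n_finetune = X
    threshold = L_emerge + shift * np.log1p(n_finetune)
    return CHANCE + (1.0 - CHANCE) / (1.0 + np.exp(slope * (loss - threshold)))

# Observations from finetuning a series of small checkpoints on the task
# (placeholder numbers): (pretraining loss, # finetuning examples) -> accuracy.
pretrain_loss = np.array([2.9, 2.9, 2.7, 2.7, 2.5, 2.5, 2.3, 2.3])
n_examples    = np.array([500, 2000, 500, 2000, 500, 2000, 500, 2000])
accuracy      = np.array([0.26, 0.31, 0.30, 0.42, 0.41, 0.58, 0.55, 0.70])

params, _ = curve_fit(
    emergence_curve,
    (pretrain_loss, n_examples),
    accuracy,
    p0=[2.0, 0.05, 10.0],
    maxfev=10_000,
)
L_emerge, shift, slope = params

# In the few-shot limit (n_finetune -> 0) the fitted threshold reduces to
# L_emerge: the predicted pretraining loss at which few-shot accuracy
# leaves random chance.
print(f"Predicted few-shot emergence at pretraining loss ~ {L_emerge:.2f}")
```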
We also demonstrate two real-world uses of emergence prediction:
1) cheaply assessing pretraining data quality (left);
2) predicting more complex capabilities, closer to those of future frontier models, using the difficult APPS coding benchmark (right).
An early version of this work also appeared in COLM 2024.
Paper link: https://arxiv.org/abs/2411.16035