The circuit hypothesis proposes that LLM capabilities emerge from small subnetworks within the model. But how can we actually test this? 🤔
joint work with @velezbeltran.bsky.social @maggiemakar.bsky.social @anndvision.bsky.social @bleilab.bsky.social Adria @far.ai Achille and Caro
https://arxiv.org/abs/2312.06581
We distill the circuit hypothesis into three testable criteria:
1️⃣ Mechanism Preservation: The circuit alone should preserve the model's behavior on the task
2️⃣ Localization: Removing the circuit disables the task
3️⃣ Minimality: The circuit contains no redundant parts
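To make the first two criteria concrete, here is a toy sketch (not the paper's setup): treat the model as a weighted sum over "edges," mean-ablate everything outside the circuit to check preservation, and ablate the circuit itself to check localization. The `run` helper, the weights, and the mean-ablation scheme are all illustrative assumptions.

```python
import statistics

def run(weights, x, keep):
    """Toy 'model' = weighted sum over edges. Edges not in `keep` are
    mean-ablated (their input is replaced by the average input).
    Hypothetical illustration, not the paper's ablation scheme."""
    mean_x = statistics.mean(x)
    return sum(w * (xi if i in keep else mean_x)
               for i, (w, xi) in enumerate(zip(weights, x)))

weights = [2.0, 0.0, 0.0, 3.0]   # edges 0 and 3 do all the work
x = [1.0, 4.0, -2.0, 5.0]
circuit = {0, 3}

full = run(weights, x, keep={0, 1, 2, 3})
kept = run(weights, x, keep=circuit)   # criterion 1: matches the full model
knocked = run(weights, x, keep={1, 2}) # criterion 2: task behavior is lost
```

Here ablating outside the circuit leaves the output unchanged, while ablating the circuit changes it, exactly the pattern the criteria ask for.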
Equivalence Test: The circuit and the original model have the same chance of outperforming each other
Independence Test: Removing the circuit should render the rest of the model's output independent of the circuit's output
Minimality Test: All edges in the circuit are necessary for the task
Sufficiency Test: How faithful is faithful enough?
Partial Necessity Test: How much knockdown effect is significant?
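To give a flavor of the equivalence test (a minimal sketch, not the paper's implementation): score the circuit and the full model on the same inputs, then run an exact sign test of the null that each outperforms the other equally often. The helper `sign_test_equivalence` and the Gaussian toy scores are assumptions for illustration.

```python
import math
import random

def sign_test_equivalence(model_scores, circuit_scores):
    """Exact two-sided sign test of H0: P(circuit outperforms model) = 0.5.
    Hypothetical helper; the paper's actual test statistic may differ."""
    n = len(model_scores)
    wins = sum(c > m for m, c in zip(model_scores, circuit_scores))
    # Sum binomial(n, 1/2) mass at least as extreme as the observed win count.
    return sum(math.comb(n, k) for k in range(n + 1)
               if abs(k - n / 2) >= abs(wins - n / 2)) / 2 ** n

random.seed(0)
model = [random.gauss(0.0, 1.0) for _ in range(200)]
circuit = [m + random.gauss(0.0, 0.1) for m in model]  # near-equivalent circuit
p = sign_test_equivalence(model, circuit)  # a large p-value is consistent with equivalence
```

A circuit that is systematically worse than the model (e.g. scores shifted down) would instead win almost never and yield a tiny p-value, rejecting equivalence.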
We apply our tests to six benchmark circuits from the literature: two synthetic circuits, two semi-synthetic circuits (circuits discovered on toy transformer models), and two circuits in the wild (circuits discovered on transformer models such as GPT-2).