The circuit hypothesis proposes that LLM capabilities emerge from small subnetworks within the model. But how can we actually test this? 🤔
joint work with @velezbeltran.bsky.social @maggiemakar.bsky.social @anndvision.bsky.social @bleilab.bsky.social Adria @far.ai Achille and Caro
https://arxiv.org/abs/2312.06581
We distill the circuit hypothesis into three testable criteria:
1️⃣ Mechanism Preservation: The circuit alone should preserve the model's behavior on the task
2️⃣ Localization: Removing the circuit disables the task
3️⃣ Minimality: The circuit contains no redundant parts
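To make the first two criteria concrete, here is a toy sketch (not the paper's setup): treat the model as a weighted sum over "edges," mean-ablate everything outside the circuit to check preservation, and ablate the circuit itself to check localization. The `run` helper, the weights, and the mean-ablation scheme are all illustrative assumptions.

```python
import statistics

def run(weights, x, keep):
    """Toy 'model' = weighted sum over edges. Edges not in `keep` are
    mean-ablated (their input is replaced by the average input).
    Hypothetical illustration, not the paper's ablation scheme."""
    mean_x = statistics.mean(x)
    return sum(w * (xi if i in keep else mean_x)
               for i, (w, xi) in enumerate(zip(weights, x)))

weights = [2.0, 0.0, 0.0, 3.0]   # edges 0 and 3 do all the work
x = [1.0, 4.0, -2.0, 5.0]
circuit = {0, 3}

full = run(weights, x, keep={0, 1, 2, 3})
kept = run(weights, x, keep=circuit)   # criterion 1: matches the full model
knocked = run(weights, x, keep={1, 2}) # criterion 2: task behavior is lost
```

Here ablating outside the circuit leaves the output unchanged, while ablating the circuit changes it, exactly the pattern the criteria ask for.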
Equivalence Test: The circuit and the original model have the same chance of outperforming each other
Independence Test: Removing the circuit should render the rest of the model's output independent of the circuit's output
Minimality Test: All edges in the circuit are necessary for the task
Sufficiency Test: How faithful is faithful enough?
Partial Necessity Test: How much knockdown effect is significant?
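To give a flavor of the equivalence test (a minimal sketch, not the paper's implementation): score the circuit and the full model on the same inputs, then run an exact sign test of the null that each outperforms the other equally often. The helper `sign_test_equivalence` and the Gaussian toy scores are assumptions for illustration.

```python
import math
import random

def sign_test_equivalence(model_scores, circuit_scores):
    """Exact two-sided sign test of H0: P(circuit outperforms model) = 0.5.
    Hypothetical helper; the paper's actual test statistic may differ."""
    n = len(model_scores)
    wins = sum(c > m for m, c in zip(model_scores, circuit_scores))
    # Sum binomial(n, 1/2) mass at least as extreme as the observed win count.
    return sum(math.comb(n, k) for k in range(n + 1)
               if abs(k - n / 2) >= abs(wins - n / 2)) / 2 ** n

random.seed(0)
model = [random.gauss(0.0, 1.0) for _ in range(200)]
circuit = [m + random.gauss(0.0, 0.1) for m in model]  # near-equivalent circuit
p = sign_test_equivalence(model, circuit)  # a large p-value is consistent with equivalence
```

A circuit that is systematically worse than the model (e.g. scores shifted down) would instead win almost never and yield a tiny p-value, rejecting equivalence.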
We apply our tests to six benchmark circuits from the literature: two synthetic circuits, two semi-synthetic circuits (circuits discovered on toy transformer models), and two circuits in the wild (circuits discovered on transformer models such as GPT-2).