rajmovva.bsky.social • 102 days ago

This simple pipeline works shockingly well: we substantially outperform (find more interpretable+predictive hypotheses) two recent baselines which use LLMs alone for hypothesis generation (no SAE), and also BERTopic, a classic embedding clustering method. 4/

Comments

ddofer.bsky.social•102 days ago

How did you evaluate interpretability?

rajmovva.bsky.social•102 days ago

Once we generate a hypothesized concept, we ask an LLM to annotate that concept on ~10K examples and measure whether those annotations predict the target. This sort-of gets at interpretability, because it requires that the hypothesis, when written in natural language, can actually be used. 1/
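A minimal sketch of that evaluation idea, assuming the LLM annotations are already available as 0/1 labels and the target is numeric. The thread does not say which predictiveness measure the paper uses; Pearson correlation is swapped in here purely for illustration, and the `pearson` helper and the toy data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant annotation or target carries no signal
    return cov / sqrt(vx * vy)

# Toy data (made up): 1 means the LLM judged the concept present in the text.
annotations = [1, 1, 0, 0, 1, 0]
stars = [5, 4, 2, 1, 5, 3]  # assumed target, e.g. a review's star rating

# Higher absolute values mean the annotated concept tracks the target more closely.
score = pearson(annotations, stars)
```

A hypothesis whose annotations give a score near zero is either not predictive or too vague for the annotator to apply consistently, which is why this measure only partly captures interpretability.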

ddofer.bsky.social•101 days ago

My question is about the quantification/measurement/comparison.
I understand your method and approach :). My problem is convincing reviewers that one interpretability method is better than another.

rajmovva.bsky.social•102 days ago

But, it's not totally sufficient because the hypothesis might somehow be interpretable to an LLM, but not to a human. So, we also have some qualitative discussion on this point--see 6.2/6.3. (Though, anecdotally, so far it seems that if an LLM can interpret a hypothesis, a human also can.) 2/2

rajmovva.bsky.social•102 days ago

Despite using OpenAI LLMs, our method is cheap: for example, outputting 20 hypotheses on a dataset of 20K Yelp reviews costs ~$0.40. It’s *much* cheaper than prior LLM baselines, because unlike prior methods, the LLM doesn’t have to do much; it’s mostly the SAE (which trains on a laptop). 5/

rajmovva.bsky.social•102 days ago

Why are we excited? Out of the box, the method works well on most datasets we’ve tested, including several that didn’t make the paper. We’re just scratching the surface of methods here (o1/r1 for autointerp, better embeddings, etc.), and results already look promising. 6/

ddofer.bsky.social•101 days ago

Re interp/3: A trick that worked great was adding positive, negative and neutral activating examples.
e.g. in "Automated Annotation of Disease Subtypes"
https://www.sciencedirect.com/science/article/abs/pii/S1532046424000686

rajmovva.bsky.social•102 days ago

We built an easy-to-use, pip-installable package (“pip install hypothesaes”). To run on your own data, all you need is a list of texts with an associated target variable and an LLM API key.

https://github.com/rmovva/HypotheSAEs 7/

rajmovva.bsky.social•102 days ago

@kennypeng.bsky.social also built a website to explore results on Yelp, headlines, & Congress datasets: https://hypothesaes.org.

You can see every SAE neuron in UMAP space, colored by whether the neuron correlates positively or negatively with the target variable. 8/
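A rough sketch of the coloring rule described above, assuming each neuron's activations over the texts are available as a plain list. The sign of the covariance matches the sign of the correlation, so that is all the code computes. The `correlation_sign` function, the toy data, and the color names are hypothetical and are not taken from the site or the package.

```python
def correlation_sign(activations, target):
    """Return +1, -1, or 0: the sign of the covariance (and hence of the
    Pearson correlation) between one neuron's activations and the target."""
    n = len(target)
    mean_a = sum(activations) / n
    mean_t = sum(target) / n
    cov = sum((a - mean_a) * (t - mean_t) for a, t in zip(activations, target))
    return (cov > 0) - (cov < 0)

# Hypothetical toy data: one activation per text for each neuron.
neuron_activations = {
    "neuron_a": [0.9, 0.0, 0.7, 0.0],
    "neuron_b": [0.0, 0.8, 0.0, 0.6],
}
target = [5, 1, 4, 2]

palette = {1: "blue", -1: "red", 0: "gray"}  # arbitrary color choice
colors = {
    name: palette[correlation_sign(acts, target)]
    for name, acts in neuron_activations.items()
}
```

Here `neuron_a` fires on the high-target texts and `neuron_b` on the low-target ones, so they end up with opposite colors.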

rajmovva.bsky.social•102 days ago

It was lots of fun to co-lead this with @kennypeng.bsky.social, with coauthors @nkgarg.bsky.social, Jon Kleinberg, and @emmapierson.bsky.social! Feel free to reach out if we can be helpful. Links:

Draft: https://arxiv.org/abs/2502.04382
Python package: https://github.com/rmovva/HypotheSAEs
Demo: https://hypothesaes.org

9/9

ddofer.bsky.social•101 days ago

Any immediate plans to update the package to support local LLMs? (In our unpublished work, they work great for this sort of explanation, even 8B models.)
