rajmovva.bsky.social • 99 days ago

💡New preprint & Python package: We use sparse autoencoders to generate hypotheses from large text datasets.

Our method, HypotheSAEs, produces interpretable text features that predict a target variable, e.g. features in news headlines that predict engagement. 🧵1/

Comments

rajmovva.bsky.social•99 days ago

Starting from blackbox text embeddings—which are expressive, but uninterpretable—we (1) use SAEs to map the embeddings to an interpretable space; (2) filter for features that predict the target (engagement, etc.); (3) interpret those features using LLMs. 2/
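Steps (1) and (2) above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the HypotheSAEs implementation: the top-k sparse autoencoder, the function names (`topk_activations`, `train_topk_sae`, `select_predictive`), the hyperparameters, and the correlation-based selection rule are all hypothetical choices made for the example, and the random matrix `X` stands in for real text embeddings.

```python
import numpy as np

def topk_activations(pre, k):
    """ReLU, then keep only the k largest activations in each row."""
    acts = np.maximum(pre, 0.0)
    idx = np.argpartition(acts, -k, axis=1)[:, -k:]
    mask = np.zeros_like(acts)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    return acts * mask

def train_topk_sae(X, n_features=32, k=4, lr=0.01, steps=200, seed=0):
    """Full-batch gradient descent on the reconstruction error (toy scale)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = 0.1 * rng.standard_normal((d, n_features))
    b_enc = np.zeros(n_features)
    W_dec = W_enc.T.copy()
    b_dec = X.mean(axis=0)
    losses = []
    for _ in range(steps):
        z = topk_activations(X @ W_enc + b_enc, k)
        err = z @ W_dec + b_dec - X
        losses.append(float((err ** 2).sum(axis=1).mean()))
        g_out = 2.0 * err / n                 # gradient w.r.t. the reconstruction
        g_z = (g_out @ W_dec.T) * (z > 0)     # gradient flows only through active units
        W_dec -= lr * (z.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_z)
        b_enc -= lr * g_z.sum(axis=0)
    return (W_enc, b_enc), losses

def select_predictive(Z, y, n_select=5):
    """Rank SAE features by absolute Pearson correlation with the target."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Zc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Features that never fire have zero variance; give them correlation 0.
    corr = np.divide(Zc.T @ yc, denom, out=np.zeros(Z.shape[1]), where=denom > 0)
    return np.argsort(-np.abs(corr))[:n_select], corr

# Synthetic stand-ins: X plays the role of text embeddings, y the target.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
y = X[:, 0] + 0.1 * rng.standard_normal(200)

(W_enc, b_enc), losses = train_topk_sae(X)
Z = topk_activations(X @ W_enc + b_enc, k=4)
top, corr = select_predictive(Z, y)
print("selected feature indices:", top)
```

Step (3), asking an LLM to describe what the top-activating texts of each selected feature have in common, is left out here because it depends on the API and prompt used.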

rajmovva.bsky.social•99 days ago

These natural language interpretations are our hypotheses, which we validate on held-out data. On well-studied social science datasets, we add to prior work: for example, we find that news headlines about social issues or the environment decrease engagement. 3/

rajmovva.bsky.social•99 days ago

This simple pipeline works shockingly well: we substantially outperform (find more interpretable+predictive hypotheses) two recent baselines which use LLMs alone for hypothesis generation (no SAE), and also BERTopic, a classic embedding clustering method. 4/

ddofer.bsky.social•98 days ago

How did you evaluate interpretability?

rajmovva.bsky.social•98 days ago

Once we generate a hypothesized concept, we ask an LLM to annotate that concept on ~10K examples and measure whether those annotations predict the target. This sort-of gets at interpretability, because it requires that the hypothesis, when written in natural language, can actually be used. 1/
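One way to picture the "do the annotations predict the target" check is a difference in mean target between annotated and non-annotated examples, with a permutation test for significance. This is only a sketch under assumptions: the annotations below are simulated rather than produced by an LLM, the function name `annotation_effect` is hypothetical, and the paper's actual metrics may differ.

```python
import numpy as np

def annotation_effect(annotations, target, n_perm=1000, seed=0):
    """Mean target difference (annotated minus not) and a permutation p-value."""
    a = np.asarray(annotations, dtype=bool)
    y = np.asarray(target, dtype=float)
    observed = y[a].mean() - y[~a].mean()
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(a)  # shuffle labels to break any link to the target
        if abs(y[p].mean() - y[~p].mean()) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Simulated held-out data: the concept is present in ~30% of examples
# and lowers the target by 0.5 on average.
rng = np.random.default_rng(1)
annotations = rng.random(1000) < 0.3
target = 1.0 - 0.5 * annotations + 0.5 * rng.standard_normal(1000)

effect, p_value = annotation_effect(annotations, target)
print(f"effect = {effect:.2f}, p = {p_value:.4f}")
```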

ddofer.bsky.social•98 days ago

My question is on the quantification/measurement/comparison.
I understand your method and approach :) My problem is convincing reviewers that one interpretability method is better than another when comparing them.

rajmovva.bsky.social•98 days ago

But, it's not totally sufficient because the hypothesis might somehow be interpretable to an LLM, but not to a human. So, we also have some qualitative discussion on this point--see 6.2/6.3. (Though, anecdotally, so far it seems that if an LLM can interpret a hypothesis, a human also can.) 2/2

rajmovva.bsky.social•99 days ago

Despite using OpenAI LLMs, our method is cheap: for example, outputting 20 hypotheses on a dataset of 20K Yelp reviews costs ~$0.40. It’s *much* cheaper than prior LLM baselines, because unlike prior methods, the LLM doesn’t have to do much; it’s mostly the SAE (which trains on a laptop). 5/

rajmovva.bsky.social•99 days ago

Why are we excited? Out of the box, the method works well on most datasets we’ve tested, including several that didn’t make the paper. We’re just scratching the surface of methods here (o1/r1 for autointerp, better embeddings, etc.), and results already look promising. 6/

ddofer.bsky.social•98 days ago

Re interp/3: A trick that worked great was adding positive, negative and neutral activating examples.
e.g. in "Automated Annotation of Disease Subtypes"
https://www.sciencedirect.com/science/article/abs/pii/S1532046424000686
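One possible reading of this trick, sketched below, is to show the interpreting LLM three groups of texts for a feature: the highest-activating ones, the lowest-activating ones, and some from the middle of the ranking. The grouping rule, the function names, the prompt wording, and the toy headlines are all assumptions made for illustration; the linked paper may define the groups differently.

```python
import numpy as np

def pick_examples(texts, activations, n=2):
    """Split texts into highest-, lowest-, and mid-ranked groups by activation."""
    acts = np.asarray(activations, dtype=float)
    order = np.argsort(acts)                 # ascending by activation
    start = max(len(order) // 2 - n // 2, 0)
    groups = {
        "positive": order[::-1][:n],         # strongest activations
        "negative": order[:n],               # weakest activations
        "neutral": order[start:start + n],   # middle of the ranking
    }
    return {name: [texts[i] for i in idx] for name, idx in groups.items()}

def build_prompt(groups):
    """Format the three groups into a single interpretation prompt."""
    lines = ["Describe the concept that separates these groups of texts."]
    for name in ("positive", "neutral", "negative"):
        lines.append(f"{name.upper()} examples:")
        lines.extend(f"- {t}" for t in groups[name])
    return "\n".join(lines)

# Invented toy data: six headlines and one feature's activation on each.
texts = [
    "New climate report warns of rising seas",
    "Local bakery wins pie contest",
    "City council debates budget",
    "Activists rally for clean energy",
    "Ten tips for a tidy garage",
    "Mayor comments on flood defenses",
]
activations = [0.9, 0.0, 0.4, 0.8, 0.05, 0.5]

groups = pick_examples(texts, activations)
prompt = build_prompt(groups)
print(prompt)
```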

rajmovva.bsky.social•99 days ago

We built an easy-to-use, pip-installable package (“pip install hypothesaes”). To run on your own data, all you need is a list of texts with an associated target variable and an LLM API key.

https://github.com/rmovva/HypotheSAEs 7/
