Another neat alignment-motivated complexity theory conjecture from ARC! I am excited for more theory folk to work on alignment, and crisply defined conjectures are a great starting point.
Some thoughts on how this conjecture relates to the overall problem. 🧵
https://www.alignment.org/blog/a-computational-no-coincidence-principle
The high-level plan, as I read it:
1. Follow AI reasoning where we can.
2. Admit we will sometimes fail to follow AI reasoning.
3. Define a notion of "heuristic explanation" that still applies in case (2), i.e., when we cannot follow the reasoning (see the sketch below).
4. Use (3) to distinguish case (1) from case (2).
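To make (3) concrete: as I understand the linked post, the conjecture asks for a polynomial-time verifier V(C, π) that accepts some explanation π whenever a surprising property P(C) holds, yet rejects every π for almost all random circuits C. Here is a minimal interface sketch of that shape, under those assumptions; every name in it (ReversibleCircuit, verify, explanation) is my own illustration, not an API from the post or from ARC.

```python
# Minimal sketch of the verifier interface the conjecture asks for.
# All names here are hypothetical, chosen purely for illustration.

from dataclasses import dataclass

@dataclass
class ReversibleCircuit:
    """Stand-in for a reversible circuit C on 3n wires."""
    n: int
    gates: list  # e.g., a list of Toffoli gates; representation is illustrative

def verify(circuit: ReversibleCircuit, explanation: bytes) -> bool:
    """Hypothetical polynomial-time verifier V(C, pi).

    The conjecture (roughly) asks for a V such that:
      * Completeness: if the surprising property P(C) holds, then SOME
        polynomial-length pi makes verify(C, pi) return True.
      * Soundness on average: for a uniformly random circuit C, with
        probability >= 99%, NO pi makes verify(C, pi) return True.

    Note the one-sided guarantee: verify(C, pi) == True does not certify
    that P(C) holds; it only rules out "unexplained coincidences".
    """
    raise NotImplementedError  # the conjecture asserts such a V exists
```

The asymmetry is the interesting design choice: V is never required to be sound on any fixed adversarial circuit, only on almost all random ones, which (on my reading) is what separates a "heuristic explanation" from a proof.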