In AI safety, we have inner misalignment (the trained model's behavior doesn't actually minimize the loss function) and outer misalignment (the loss function itself is misspecified and doesn't capture what we actually want).
But I do think that inner misalignment (~imperfectly learned features) tends to act as a protective mechanism against the consequences of outer misalignment.
I, er, really hope.
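To make that concrete, here is a toy numeric sketch. This is my own illustration, not anything from a real system: the one-dimensional "action", the hand-picked proxy loss, and the few-gradient-steps stand-in for inner misalignment are all assumptions, just enough to show how an imperfect optimizer of a misspecified loss can end up less harmful than a perfect one.

```python
import numpy as np

def proxy_loss(a: float) -> float:
    # Misspecified objective (outer misalignment): "bigger is always better".
    return -a

def true_utility(a: float) -> float:
    # What we actually want: best around a = 3, harmful far beyond it.
    return -(a - 3.0) ** 2

actions = np.linspace(0.0, 10.0, 1001)

# Perfectly inner-aligned model: picks the exact minimizer of the proxy loss.
a_perfect = actions[np.argmin([proxy_loss(a) for a in actions])]

# Inner-misaligned model: only a few small gradient steps on the proxy from a = 0,
# i.e. it never fully minimizes the loss it was trained on.
a_imperfect, lr, steps = 0.0, 0.5, 8
for _ in range(steps):
    a_imperfect -= lr * (-1.0)  # d(proxy_loss)/da = -1, so each step nudges a upward

print(f"perfect proxy optimizer: a = {a_perfect:.1f}, true utility = {true_utility(a_perfect):.1f}")
print(f"imperfect optimizer:     a = {a_imperfect:.1f}, true utility = {true_utility(a_imperfect):.1f}")
```

Under these assumptions the perfect proxy optimizer drives the action to the extreme (a = 10, true utility -49), while the imperfect one stops short (a = 4, true utility -1); the point only holds to the extent that the imperfection happens to keep the model near sensible behavior.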
Comments
Assuming outer misalignment, an inner-misaligned model can be seen as safer than one that perfectly minimizes the misspecified loss.
That being said, the better the model gets at minimizing its loss, the less of this protective effect remains.