In AI safety, we distinguish inner misalignment (the model's learned behavior doesn't actually minimize the loss function) and outer misalignment (the loss function itself is misspecified).

But I do think that inner misalignment (~the learned features) tends to act as a protective mechanism against the consequences of outer misalignment.

I, er, really hope so.
