In AI safety, we have inner misalignment (the trained model's behavior doesn't actually minimize the loss function) and outer misalignment (the loss function itself is misspecified and doesn't capture what we actually want).
But I do think that inner misalignment (~imperfectly learned features) tends to act as a protective mechanism against the consequences of outer misalignment.
I, er, really hope.
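To make that concrete, here is a toy numeric sketch. This is my own illustration, not anything from a real system: the one-dimensional "action", the hand-picked proxy loss, and the few-gradient-steps stand-in for inner misalignment are all assumptions, just enough to show how an imperfect optimizer of a misspecified loss can end up less harmful than a perfect one.

```python
import numpy as np

def proxy_loss(a: float) -> float:
    # Misspecified objective (outer misalignment): "bigger is always better".
    return -a

def true_utility(a: float) -> float:
    # What we actually want: best around a = 3, harmful far beyond it.
    return -(a - 3.0) ** 2

actions = np.linspace(0.0, 10.0, 1001)

# Perfectly inner-aligned model: picks the exact minimizer of the proxy loss.
a_perfect = actions[np.argmin([proxy_loss(a) for a in actions])]

# Inner-misaligned model: only a few small gradient steps on the proxy from a = 0,
# i.e. it never fully minimizes the loss it was trained on.
a_imperfect, lr, steps = 0.0, 0.5, 8
for _ in range(steps):
    a_imperfect -= lr * (-1.0)  # d(proxy_loss)/da = -1, so each step nudges a upward

print(f"perfect proxy optimizer: a = {a_perfect:.1f}, true utility = {true_utility(a_perfect):.1f}")
print(f"imperfect optimizer:     a = {a_imperfect:.1f}, true utility = {true_utility(a_imperfect):.1f}")
```

Under these assumptions the perfect proxy optimizer drives the action to the extreme (a = 10, true utility -49), while the imperfect one stops short (a = 4, true utility -1); the point only holds to the extent that the imperfection happens to keep the model near sensible behavior.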
Comments
Assuming outer misalignment, an inner-misaligned model can be seen as safer than one that perfectly minimizes the misspecified loss.
That being said, the better the model gets at minimizing its loss, the less of this protective effect remains.