Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models

"We introduce a simple strategy that makes refusal behavior controllable at test-time without retraining: the refusal token."

https://arxiv.org/abs/2412.06748
