Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
"We introduce a simple strategy that makes refusal behavior controllable at test-time without retraining: the refusal token."
https://arxiv.org/abs/2412.06748