Anthropic's "Towards Understanding Sycophancy in Language Models" https://arxiv.org/pdf/2310.13548
TLDR: LLMs tend to generate sycophantic responses.
Human feedback & preference models encourage this behavior.
I also think this is just the nature of training on internet writing... We write in social clusters:
Comments
It's extremely hard to take sycophancy out of an LLM trained the way we train them.