We've released some preliminary research today demonstrating fine-tuning attacks that can bypass the safety mechanisms of DeepSeek-R1. These simple attacks show that DeepSeek's safety measures can be easily removed so that the model provides harmful content, potentially to a worse extent than non-reasoning LLMs.
A full write-up is available here:
https://github.com/Bristol-Cyber-Security-Group/hackingdeepseek
(arXiv release incoming)
And a university press release here:
https://bristol.ac.uk/news/2025/february-/deepseek-risks.html