The DeepSeek-V3 paper is out, and the training is very interesting.
1. They use multi-token prediction during training, a technique Meta published a paper on a few months ago.
2. They used their R1 reasoning models to distill reasoning into V3.
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
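
For anyone unfamiliar with multi-token prediction, here is a rough toy sketch of the general idea: instead of training each position to predict only the next token, you also train it to predict a few tokens further ahead and add those losses together. This is not DeepSeek's actual implementation. The function names (`mtp_targets`, `mtp_loss`), the plain average over depths, and the `lam` weight are all assumptions made for illustration; see the paper for the real module design and loss weighting.

```python
import math


def mtp_targets(tokens, depth):
    """For each position i, return the next `depth` tokens as targets.

    Positions too close to the end to have all `depth` targets are dropped.
    """
    last = len(tokens) - depth
    return [tokens[i + 1 : i + 1 + depth] for i in range(last)]


def cross_entropy(probs, target):
    """Negative log-probability of the target token id under `probs`."""
    return -math.log(probs[target])


def mtp_loss(pred_probs, targets, lam=1.0):
    """Average cross-entropy over all positions and depths, scaled by `lam`.

    pred_probs[i][k] is a probability distribution (a list indexed by
    token id) for the token k+1 steps ahead of position i.
    `lam` is a hypothetical weighting factor, not a value from the paper.
    """
    total, count = 0.0, 0
    for probs_at_pos, tgt_at_pos in zip(pred_probs, targets):
        for probs, tgt in zip(probs_at_pos, tgt_at_pos):
            total += cross_entropy(probs, tgt)
            count += 1
    return lam * total / count


# Toy usage: a 4-token sequence over a vocabulary of size 4,
# predicting 2 tokens ahead at each position.
tokens = [3, 1, 0, 2]
targets = mtp_targets(tokens, depth=2)
uniform = [0.25, 0.25, 0.25, 0.25]
pred_probs = [[uniform, uniform] for _ in targets]
loss = mtp_loss(pred_probs, targets)
```

With a uniform "model" like this one, the loss is just `log(vocab_size)` at every depth. A real model would produce the per-depth distributions from extra prediction heads or modules, and that part is where the actual architectures differ.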

Comments