Really cool to see theory connect to practice! We observed this phenomenon when trying to do deeper WSD cooldowns of our 8B model in the https://marin.community project!
We Z-Lossed our way through the pain, but cool to see some stronger theory: https://marin.readthedocs.io/en/latest/reports/marin-8b-retro/#raccoon-debugging-sft-ability