Learning rate schedules seem mysterious? Why is the loss going down so fast during cooldown?
Turns out that this behaviour can be described with a bound from *convex, nonsmooth* optimization.
A short thread on our latest paper.
https://arxiv.org/abs/2501.18965
The second term of the bound suggests that the sudden drop in loss during cooldown happens when gradient norms do not go to zero.
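To give a flavour of the kind of bound involved (this is the classical averaged-iterate bound for the subgradient method, not the refined last-iterate bound the paper actually analyses): for a convex, G-Lipschitz objective f with minimiser x_*, step sizes η_t, and D = ||x_0 − x_*||,

\[
\mathbb{E}\, f(\bar{x}_T) - f(x_*)
\;\le\;
\underbrace{\frac{D^2}{2\sum_{t=1}^{T}\eta_t}}_{\text{distance term}}
\;+\;
\underbrace{\frac{G^2 \sum_{t=1}^{T}\eta_t^2}{2\sum_{t=1}^{T}\eta_t}}_{\text{gradient-norm term}},
\qquad
\bar{x}_T = \frac{\sum_{t=1}^{T}\eta_t x_t}{\sum_{t=1}^{T}\eta_t}.
\]

The second term scales with G², so it only matters when gradient norms stay bounded away from zero; in the refined last-iterate bound used in the paper, the analogous schedule-dependent term is the one that collapses during cooldown.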
Using the theoretically optimal schedule (which can be computed for free), we obtain a noticeable improvement when training 124M- and 210M-parameter models.
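For readers who haven't seen the cooldown pattern before, here is a minimal, generic sketch of a warmup-stable-decay ("cooldown") schedule of the kind discussed above. The function name, warmup/cooldown fractions, and base learning rate are illustrative assumptions; the exact schedules used in the experiments are in the repo linked below.

```python
def wsd_schedule(step, total_steps, base_lr=1e-3,
                 warmup_frac=0.01, cooldown_frac=0.2):
    """Warmup-stable-decay learning rate: linear warmup,
    constant plateau, then linear cooldown towards zero."""
    warmup_steps = max(int(warmup_frac * total_steps), 1)
    cooldown_start = int((1.0 - cooldown_frac) * total_steps)

    if step < warmup_steps:
        # linear warmup from 0 to base_lr
        return base_lr * step / warmup_steps
    if step < cooldown_start:
        # stable phase at the base learning rate
        return base_lr
    # linear cooldown from base_lr to (near) zero at the final step
    return base_lr * (total_steps - step) / (total_steps - cooldown_start)

# shape check: constant in the stable phase, decaying to ~0 at the end
lrs = [wsd_schedule(t, total_steps=10_000) for t in range(10_000)]
```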
Paper: https://arxiv.org/abs/2501.18965
Code: https://github.com/fabian-sp/lr-scheduling