Learning rate schedules seem mysterious? Why is the loss going down so fast during cooldown?
Turns out that this behaviour can be described with a bound from *convex, nonsmooth* optimization.
A short thread on our latest paper.
https://arxiv.org/abs/2501.18965
The second term of the bound suggests that the sudden drop in loss during cooldown happens when gradient norms do not go to zero.
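To give a flavour of the kind of bound involved (this is the classical averaged-iterate bound for the subgradient method, not the refined last-iterate bound the paper actually analyses): for a convex, G-Lipschitz objective f with minimiser x_*, step sizes η_t, and D = ||x_0 − x_*||,

\[
\mathbb{E}\, f(\bar{x}_T) - f(x_*)
\;\le\;
\underbrace{\frac{D^2}{2\sum_{t=1}^{T}\eta_t}}_{\text{distance term}}
\;+\;
\underbrace{\frac{G^2 \sum_{t=1}^{T}\eta_t^2}{2\sum_{t=1}^{T}\eta_t}}_{\text{gradient-norm term}},
\qquad
\bar{x}_T = \frac{\sum_{t=1}^{T}\eta_t x_t}{\sum_{t=1}^{T}\eta_t}.
\]

The second term scales with G², so it only matters when gradient norms stay bounded away from zero; in the refined last-iterate bound used in the paper, the analogous schedule-dependent term is the one that collapses during cooldown.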
Using the theoretically optimal schedule (which can be computed for free), we obtain a noticeable improvement when training 124M- and 210M-parameter models.
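For readers who haven't seen the cooldown pattern before, here is a minimal, generic sketch of a warmup-stable-decay ("cooldown") schedule of the kind discussed above. The function name, warmup/cooldown fractions, and base learning rate are illustrative assumptions; the exact schedules used in the experiments are in the repo linked below.

```python
def wsd_schedule(step, total_steps, base_lr=1e-3,
                 warmup_frac=0.01, cooldown_frac=0.2):
    """Warmup-stable-decay learning rate: linear warmup,
    constant plateau, then linear cooldown towards zero."""
    warmup_steps = max(int(warmup_frac * total_steps), 1)
    cooldown_start = int((1.0 - cooldown_frac) * total_steps)

    if step < warmup_steps:
        # linear warmup from 0 to base_lr
        return base_lr * step / warmup_steps
    if step < cooldown_start:
        # stable phase at the base learning rate
        return base_lr
    # linear cooldown from base_lr to (near) zero at the final step
    return base_lr * (total_steps - step) / (total_steps - cooldown_start)

# shape check: constant in the stable phase, decaying to ~0 at the end
lrs = [wsd_schedule(t, total_steps=10_000) for t in range(10_000)]
```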
Paper: https://arxiv.org/abs/2501.18965
Code: https://github.com/fabian-sp/lr-scheduling