Do you want to get the most out of your samples, but find that increasing the update steps just destabilizes RL training? Our #ICLR2025 spotlight πŸŽ‰ paper shows that using the values of unseen actions causes instability in continuous state-action domains, and shows how to combat this problem with learned models!
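
For the gist: below is a minimal sketch of the failure mode, assuming a standard off-policy actor-critic critic update (the names, network sizes, and hyperparameters are illustrative, not from the paper). The TD target bootstraps from Q(s', π(s')), where π(s') is an action the critic never saw in the replay data; raising the update-to-data (UTD) ratio queries those unseen-action values more often per fresh sample, so errors there can compound.

```python
# Sketch of a standard off-policy critic update (not the paper's method):
# the TD target bootstraps from Q(s', pi(s')), and pi(s') is an action that
# never appears in the replay data. All shapes/values here are illustrative.
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99
q = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
pi = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                   nn.Linear(64, act_dim), nn.Tanh())
opt = torch.optim.Adam(q.parameters(), lr=3e-4)

def critic_update(s, a, r, s2, done):
    with torch.no_grad():
        a2 = pi(s2)  # unseen action: proposed by the policy, not drawn
                     # from the replay buffer, so Q was never fit there
        target = r + gamma * (1 - done) * q(torch.cat([s2, a2], -1)).squeeze(-1)
    loss = ((q(torch.cat([s, a], -1)).squeeze(-1) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# A high UTD ratio re-uses the same batch for many critic updates, repeatedly
# bootstrapping from Q at actions with no data support.
batch = 32
s, s2 = torch.randn(batch, obs_dim), torch.randn(batch, obs_dim)
a = torch.rand(batch, act_dim) * 2 - 1
r, done = torch.randn(batch), torch.zeros(batch)
for _ in range(8):  # 8 updates per batch of fresh samples
    critic_update(s, a, r, s2, done)
```

As the post says, the paper's remedy grounds those unseen-action values with learned models rather than trusting the raw bootstrapped estimates; see the paper for the actual method.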
