This is how our work on classification vs. regression started: it was folk knowledge in many circles, but I had overestimated the extent to which the wider community knew it. https://openreview.net/forum?id=dVpFKfqF3R
Yeah! I’d been hearing suggestions for a while about how helpful two-hot was, and then your paper came out and both put that in writing and added a bunch of other tricks
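For anyone who hasn't seen "two-hot" written down: the idea is to turn a scalar regression target into a categorical one by splitting its probability mass across the two neighboring bin centers and training with cross-entropy instead of MSE. A minimal sketch, with the value range and bin count chosen arbitrarily here:

```python
import torch

def two_hot(targets: torch.Tensor, v_min: float, v_max: float, num_bins: int) -> torch.Tensor:
    """Encode scalar targets as "two-hot" categorical distributions.

    Each target's probability mass is split between the two bin centers that
    bracket it, with weights given by linear interpolation.
    """
    centers = torch.linspace(v_min, v_max, num_bins, device=targets.device)
    bin_width = (v_max - v_min) / (num_bins - 1)
    targets = targets.clamp(v_min, v_max)
    lower_idx = ((targets - v_min) / bin_width).floor().long().clamp(max=num_bins - 2)
    upper_weight = (targets - centers[lower_idx]) / bin_width
    lower_weight = 1.0 - upper_weight
    probs = torch.zeros(*targets.shape, num_bins, device=targets.device)
    probs.scatter_(-1, lower_idx.unsqueeze(-1), lower_weight.unsqueeze(-1))
    probs.scatter_(-1, (lower_idx + 1).unsqueeze(-1), upper_weight.unsqueeze(-1))
    return probs

# Then train with cross-entropy against the soft targets, e.g.
# loss = -(two_hot(returns, -10.0, 10.0, 51) * logits.log_softmax(-1)).sum(-1).mean()
```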
interestingly, this is the second time that public dunking on Octo has informed me of someone who actually does use it (like @tomdupuis.bsky.social), revising my opinion of its utility upwards rather than downwards.
Tbh I think the biggest added values of Octo were:
- an open-source implementation already coupled to a cheap table-top arm that I could buy (ViperX); integration is so hard, and it saved me a lot of time
- proof of existence of a good robot policy with few parameters (< 100M)
- a good init for fine-tuning
super happy to hear the first one, since we worked so hard on open-sourcing! the last one is the one that I'm less confident about -- it's very hard to beat a well-tuned from-scratch baseline. but if it works for you, then perhaps I've been too pessimistic!
I agree, and adapting it to other robot setups is also not too difficult. Compared to some other repos in the field, the codebase is much cleaner and better documented
we used language that implied that the pretrained Octo weights would definitely actually be useful for YOU in YOUR lab. in reality, the real world is much bigger than OXE and some carefully-tuned evals. end-to-end robot learning is not even at the GPT-2 level
in retrospect, we should have done more to temper expectations and highlight the difference between the method/architecture/open-source code and the actual pretrained weights, which we def want ppl to play with but which come with absolutely no guarantees.
i think one of the biggest common confusions and failure points i’ve found from helping parallelize the SIMPLER benchmark was understanding the robot controller. Many people trying to test their VLAs against others miss controller details and get bad results when they could’ve been better
In general, I think a thorough comparison of action spaces would also be nice, for example comparing absolute EE poses, absolute EE deltas, and relative EE poses (UMI-style).
This is like another hyperparameter search, but we rely on real evals to get good feedback
i haven’t met a single person who didn’t originally work on OXE or adjacent projects who knew that the implicitly assumed controller/actions in the dataset are a kind of “delta target ee pose” control
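To make the “delta target ee pose” convention concrete: the action is an offset on top of the current end-effector pose, and the controller servos to the resulting pose target rather than treating the action as a velocity or torque command. The layout and frame conventions in this sketch are illustrative assumptions, not the exact OXE spec:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def apply_delta_ee_action(pos: np.ndarray, rot: np.ndarray, action: np.ndarray):
    """Interpret a 7-D "delta target EE pose" action as an absolute pose target.

    pos:    (3,)  current end-effector position
    rot:    (3,3) current end-effector rotation matrix
    action: (7,)  [dx, dy, dz, axis-angle rotation delta, gripper]
    """
    d_pos, d_rot, gripper = action[:3], action[3:6], action[6]
    target_pos = pos + d_pos
    # One common convention: the rotation delta is expressed in the base frame
    # and composed on the left of the current orientation. Base-frame vs.
    # EE-frame deltas (and left vs. right composition) is exactly the kind of
    # controller detail that silently differs across setups.
    target_rot = R.from_rotvec(d_rot).as_matrix() @ rot
    return target_pos, target_rot, gripper
```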
Sooooo much of the recent “big VLM learns aaaall the policies” stuff is super brittle. I think robotics needs to do some big introspection on its replicability problems, similar to how RL has started thinking about these issues.
I agree but the only way we can do that is with common platforms, both real world and sim. And it's too easy to make progress by customizing the platform instead of improving the method
What do you mean by that?
If you try zero-shot on new objects, sure, I'm not surprised; the generalization capabilities are oversold, but it's not hard to figure out why: domain gap.
But with proper fine-tuning, Octo Small is very good on any task we tried, and Octo Base is even better.
1. Sure, but that's true for any robotics model honestly... Too many moving parts
2. Define too much. In my experience, with a few hundred demos (takes a day) you can solve some nice non-trivial tasks
But we are using a secret sauce for fine-tuning that I can't spill yet
3. Depends on the generalization type
For OOD generalization, most of the degradation comes from lack of VISUAL generalization, which is entirely to be expected currently... We need better vision backbones, and such don't exist yet (the closest I can think of is Prismatic-style models: SigLIP + DINOv2 for both semantic and geometric info)
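For context, the Prismatic-style fusion mentioned above is simply running both encoders on the same image and concatenating their patch tokens channel-wise before a projector. A generic sketch, with the encoder modules and feature dimensions left as placeholders:

```python
import torch
import torch.nn as nn

class FusedVisionBackbone(nn.Module):
    """Concatenate patch features from a semantic and a geometric encoder.

    `siglip` and `dinov2` stand in for any modules that map an image batch
    (B, 3, H, W) to patch tokens of shape (B, N, d_siglip) / (B, N, d_dino)
    with a matching number of patches N.
    """

    def __init__(self, siglip: nn.Module, dinov2: nn.Module,
                 d_siglip: int, d_dino: int, d_out: int):
        super().__init__()
        self.siglip, self.dinov2 = siglip, dinov2
        self.proj = nn.Linear(d_siglip + d_dino, d_out)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        semantic = self.siglip(images)    # language-aligned semantic features
        geometric = self.dinov2(images)   # dense, geometry-aware features
        fused = torch.cat([semantic, geometric], dim=-1)  # (B, N, d_siglip + d_dino)
        return self.proj(fused)
```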
Regarding better vision backbones, it’s worth checking Theia and RADIO. I'm not sure why SigLIP+DINOv2 became popular, since OpenVLA did not include ablation studies on the choice of vision encoder.
I think there are some that do better, but they're much more bounded! Robot Utility Models by @notmahi.bsky.social do pretty well. But they're extremely constrained compared to OpenVLA.
I'm less convinced about vision backbones. I think they cap out at reasonably good performance but will never
There is! The submission deadline just wrapped up for ICLR 2025 unfortunately, but the intention is for us to do it again for ICLR 2026, and perhaps at some point other conferences might adopt it!
We took a bunch of them in robot learning and made a tutorial about them! I tried to put everything I find myself regularly telling my students somewhere in there. I really think it can save days to months of a new grad student's life.
I think some variants of this would be things like reporting that baselines are known to be consistently weak https://arxiv.org/abs/2407.07218
or that commonly used benchmarks have been solved for a while. Thinking about examples in other fields
Casually plugging https://arxiv.org/abs/2410.08870 here. We got lots of nodding and "yea of course" for it, but nobody seems to want to change anything. cc @cvoelcker.bsky.social
Then the "Ten Simple Rules" series from PLOS might count? It is a little more about how science is done, but insofar as it aims to write down practical tips that are often just spread by word of mouth, it could be a good inspiration: https://collections.plos.org/collection/ten-simple-rules/
IQL and BCQ are still the most consistent, reliable offline RL algorithms. Interestingly, IQL also optimizes for the optimal batch-constrained policy (just without the behavior policy model that BCQ needs).
Many other algorithms seem to work “better” since they overfit hyperparams for D4RL.
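To spell out why IQL stays implicitly batch-constrained: its value function is fit by expectile regression against Q-values of dataset actions only, so the max over actions is approximated without ever querying Q on out-of-distribution actions. A minimal sketch of that value loss (tau is the expectile, e.g. 0.7-0.9 in the paper):

```python
import torch

def iql_value_loss(q_values: torch.Tensor, v_values: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Expectile regression loss for the IQL value function.

    q_values: Q(s, a) evaluated only at (s, a) pairs from the dataset.
    v_values: V(s) predicted by the value network for the same states.
    tau > 0.5 pushes V(s) toward an upper expectile of Q over dataset actions,
    approximating a max over in-support actions without a behavior model.
    """
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())  # |tau - 1{u < 0}|
    return (weight * diff.pow(2)).mean()
```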
Exactly why I hate the weirdly common take that there is no reproducibility crisis in machine learning because everyone just knows the secret gossip about what methods work in practice
I collected some folk knowledge for RL and stuck them in my lecture slides a couple weeks back: https://web.mit.edu/6.7920/www/lectures/L18-2024fa-Evaluation.pdf#page=55 See Appendix B... sorry, I know, appendix of a lecture slide deck is not the best for discovery. Suggestions very welcome.
This is awesome, thanks! 🙏 Forwarding to my students immediately!
I have a small note to add, which is a pet peeve of mine: when tuning hyperparameters, make sure to tune and report on different seeds! I think newbies especially might miss that, but it can make up to a factor of 8 difference as far as I've seen
That can also help! My point is more about the fact that by tuning, we're inducing an optimization bias (even with grid search, I'd say), so usually your performance will look much better on the exact setting you tune on.
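A concrete version of the seed point above, so that selection bias doesn't leak into the reported number: choose the configuration on one set of seeds, then report on a disjoint set. Everything named below (the configs dict, train_and_eval, the seed ranges) is a placeholder:

```python
import numpy as np

def tune_then_report(configs, train_and_eval, tune_seeds=range(3), report_seeds=range(100, 105)):
    """Select hyperparameters on one set of seeds, report on a disjoint set.

    `train_and_eval(config, seed)` stands in for a full training + evaluation
    run that returns a scalar score.
    """
    # 1) Selection: average performance over the tuning seeds only.
    tune_scores = {
        name: np.mean([train_and_eval(cfg, seed) for seed in tune_seeds])
        for name, cfg in configs.items()
    }
    best = max(tune_scores, key=tune_scores.get)

    # 2) Reporting: fresh seeds that the selection step never saw.
    scores = [train_and_eval(configs[best], seed) for seed in report_seeds]
    return best, float(np.mean(scores)), float(np.std(scores))
```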
You probably win in the long run when your expository article accumulates 10,000 citations because there's no other source for basic techniques in the field
Do you think folks from academia would spend time putting together content like this for an online publication of some sort if it weren’t officially sanctioned? Asking for a friend who would love to see something like this collected across all ML fields
In my very niche field of numerical conformal bootstrap, there’s a lot of “community knowledge” that is never published. There are only 4-5 people who possess that knowledge. I’ve been very lucky to collaborate with them, but I’m writing a giant appendix in my paper with all of these tips and tricks
in e.g. regularity theory for PDEs it used to be true to a tragic level... basically either you went to talk to the “gurus who tell you the sh*t orally” or the papers were so full of skipped steps that decrypting them + making some contribution was akin to decoding a hash function + continuing a blockchain
Comments
(Re the dunking on Octo above: the first time, I was the one doing the dunking.)
Re “we rely on real evals to get good feedback”: everyone being like “I tried it and it doesn't work” actually kinda is that feedback.
Re fine-tuning Octo:
- requires too much data to fine-tune
- once fine-tuned, it does not generalize even as well as the original
- I'm sure there are success stories, I've just heard so many failures
Not sure if there is a blog track for the upcoming iclr, though.
More info on this year's track here:
https://iclr-blogposts.github.io/2025/about/
https://distill.pub/
https://supervised-robot-learning.github.io/
(I agree with your opinion, disagree with criteria for conference acceptance and the value we put on such acceptances.)
In fact, writing good Wikipedia articles for your field might be the best way to spread this knowledge.
do you just upload a pdf to arxiv?