This is how our work on classification vs. regression started: it was folk knowledge in many circles, but I had overestimated the extent to which the wider community knew it. https://openreview.net/forum?id=dVpFKfqF3R
Yeah! I’d been hearing suggestions for a while about how helpful two-hot was, and then your paper came out and both put that in writing and added a bunch of other tricks
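For anyone who hasn't seen "two-hot" written down: the idea is to turn a scalar regression target into a categorical one by splitting its probability mass across the two neighboring bin centers and training with cross-entropy instead of MSE. A minimal sketch, with the value range and bin count chosen arbitrarily here:

```python
import torch

def two_hot(targets: torch.Tensor, v_min: float, v_max: float, num_bins: int) -> torch.Tensor:
    """Encode scalar targets as "two-hot" categorical distributions.

    Each target's probability mass is split between the two bin centers that
    bracket it, with weights given by linear interpolation.
    """
    centers = torch.linspace(v_min, v_max, num_bins, device=targets.device)
    bin_width = (v_max - v_min) / (num_bins - 1)
    targets = targets.clamp(v_min, v_max)
    lower_idx = ((targets - v_min) / bin_width).floor().long().clamp(max=num_bins - 2)
    upper_weight = (targets - centers[lower_idx]) / bin_width
    lower_weight = 1.0 - upper_weight
    probs = torch.zeros(*targets.shape, num_bins, device=targets.device)
    probs.scatter_(-1, lower_idx.unsqueeze(-1), lower_weight.unsqueeze(-1))
    probs.scatter_(-1, (lower_idx + 1).unsqueeze(-1), upper_weight.unsqueeze(-1))
    return probs

# Then train with cross-entropy against the soft targets, e.g.
# loss = -(two_hot(returns, -10.0, 10.0, 51) * logits.log_softmax(-1)).sum(-1).mean()
```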
interestingly, this is the second time that public dunking on Octo has informed me of someone who actually does use it (like @tomdupuis.bsky.social), revising my opinion of its utility upwards rather than downwards.
Tbh I think the biggest added values of Octo were:
- an open-source implementation already coupled to a cheap table-top arm that I could buy (ViperX); integration is so hard, and it saved me a lot of time
- proof of existence of a good robot policy with few parameters (< 100M)
- a good init for fine-tuning
super happy to hear the first one, since we worked so hard on open-sourcing! the last one is the one that I'm less confident about -- it's very hard to beat a well-tuned from-scratch baseline. but if it works for you, then perhaps I've been too pessimistic!
I agree, and adapting it to other robot setups is also not too difficult. Compared to some other repos in the field, the codebase is much cleaner and better documented
we used language that implied that the pretrained Octo weights would definitely actually be useful for YOU in YOUR lab. in reality, the real world is much bigger than OXE and some carefully-tuned evals. end-to-end robot learning is not even at the GPT-2 level
in retrospect, we should have done more to temper expectations and highlight the difference between the method/architecture/open-source code and the actual pretrained weights, which we def want ppl to play with but which come with absolutely no guarantees.
i think one of the biggest common confusions and failure points i’ve found from helping parallelize the SIMPLER benchmark was understanding the robot controller. Many people trying to test their VLAs against others miss controller details and get bad results when they could’ve been better
In general, I think a thorough comparison of action spaces would also be nice, for example comparing absolute EE poses, absolute EE deltas, and relative EE poses (UMI-style).
This is like another hyperparameter search, but we rely on real evals to get good feedback
i haven’t met a single person who didn’t originally work on OXE or adjacent projects who knew that the implicitly assumed controller/actions in the dataset are a kind of “delta target ee pose” control
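To make the “delta target ee pose” convention concrete: the action is an offset on top of the current end-effector pose, and the controller servos to the resulting pose target rather than treating the action as a velocity or torque command. The layout and frame conventions in this sketch are illustrative assumptions, not the exact OXE spec:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def apply_delta_ee_action(pos: np.ndarray, rot: np.ndarray, action: np.ndarray):
    """Interpret a 7-D "delta target EE pose" action as an absolute pose target.

    pos:    (3,)  current end-effector position
    rot:    (3,3) current end-effector rotation matrix
    action: (7,)  [dx, dy, dz, axis-angle rotation delta, gripper]
    """
    d_pos, d_rot, gripper = action[:3], action[3:6], action[6]
    target_pos = pos + d_pos
    # One common convention: the rotation delta is expressed in the base frame
    # and composed on the left of the current orientation. Base-frame vs.
    # EE-frame deltas (and left vs. right composition) is exactly the kind of
    # controller detail that silently differs across setups.
    target_rot = R.from_rotvec(d_rot).as_matrix() @ rot
    return target_pos, target_rot, gripper
```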
Sooooo much of the recent “big VLM learns aaaall the policies” stuff is super brittle. I think robotics needs to do some big introspection on its replicability problems, similar to how RL has started thinking about these issues.
I agree but the only way we can do that is with common platforms, both real world and sim. And it's too easy to make progress by customizing the platform instead of improving the method
What do you mean by that?
If you try zero-shot on new objects, sure, I'm not surprised; the generalization capabilities are oversold, but it's not hard to figure out why: domain gap.
But with proper fine-tuning, Octo Small is very good on any task we tried, and Octo Base is even better.
1. Sure, but that's true for any robotics model honestly... Too many moving parts
2. Define too much. In my experience, with a few hundred demos (takes a day) you can solve some nice non-trivial tasks
But we are using a secret sauce for fine-tuning that I can't spill yet
3. Depends on the generalization type
For OOD generalization, most of the degradation comes from lack of VISUAL generalization, which is entirely to be expected currently... We need better vision backbones, and such don't exist yet (the closest I can think of is Prismatic-style models: SigLIP + DINOv2 for both semantic and geometric info)
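For context, the Prismatic-style fusion mentioned above is simply running both encoders on the same image and concatenating their patch tokens channel-wise before a projector. A generic sketch, with the encoder modules and feature dimensions left as placeholders:

```python
import torch
import torch.nn as nn

class FusedVisionBackbone(nn.Module):
    """Concatenate patch features from a semantic and a geometric encoder.

    `siglip` and `dinov2` stand in for any modules that map an image batch
    (B, 3, H, W) to patch tokens of shape (B, N, d_siglip) / (B, N, d_dino)
    with a matching number of patches N.
    """

    def __init__(self, siglip: nn.Module, dinov2: nn.Module,
                 d_siglip: int, d_dino: int, d_out: int):
        super().__init__()
        self.siglip, self.dinov2 = siglip, dinov2
        self.proj = nn.Linear(d_siglip + d_dino, d_out)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        semantic = self.siglip(images)    # language-aligned semantic features
        geometric = self.dinov2(images)   # dense, geometry-aware features
        fused = torch.cat([semantic, geometric], dim=-1)  # (B, N, d_siglip + d_dino)
        return self.proj(fused)
```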
Regarding better vision backbones, it’s worth checking Theia and RADIO. I'm not sure why SigLIP+DINOv2 became popular, since OpenVLA did not include ablation studies on the choice of vision encoder.
I think there are some that do better, but they're much more bounded! Robot Utility Models by @notmahi.bsky.social do pretty well. But they're extremely constrained compared to OpenVLA.
I'm less convinced about vision backbones. I think they cap out at reasonably good performance but will never
There is! The submission deadline just wrapped up for ICLR 2025 unfortunately, but the intention is for us to do it again for ICLR 2026, and perhaps at some point other conferences might adopt it!
We took a bunch of them in robot learning and made a tutorial about them! I tried to put everything I find myself regularly telling my students somewhere in there. I really think it can save days to months of a new grad student's life.
I think some variants of this would be things like reporting that baselines are known to be consistently weak https://arxiv.org/abs/2407.07218
or that commonly used benchmarks have been solved for a while. Thinking about examples in other fields
Casually plugging https://arxiv.org/abs/2410.08870 here. We got lots of nodding and "yea of course" for it, but nobody seems to want to change anything. cc @cvoelcker.bsky.social
Then the "Ten Simple Rules" series from PLOS might count? It is a little more about how science is done, but insofar as it aims to write down practical tips that are often just spread by word of mouth, it could be a good inspiration: https://collections.plos.org/collection/ten-simple-rules/
IQL and BCQ are still the most consistent, reliable offline RL algorithms. Interestingly, IQL also optimizes for the optimal batch-constrained policy (just without the behavior policy model that BCQ needs).
Many other algorithms seem to work “better” since they overfit hyperparams for D4RL.
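To spell out why IQL stays implicitly batch-constrained: its value function is fit by expectile regression against Q-values of dataset actions only, so the max over actions is approximated without ever querying Q on out-of-distribution actions. A minimal sketch of that value loss (tau is the expectile, e.g. 0.7-0.9 in the paper):

```python
import torch

def iql_value_loss(q_values: torch.Tensor, v_values: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Expectile regression loss for the IQL value function.

    q_values: Q(s, a) evaluated only at (s, a) pairs from the dataset.
    v_values: V(s) predicted by the value network for the same states.
    tau > 0.5 pushes V(s) toward an upper expectile of Q over dataset actions,
    approximating a max over in-support actions without a behavior model.
    """
    diff = q_values - v_values
    weight = torch.abs(tau - (diff < 0).float())  # |tau - 1{u < 0}|
    return (weight * diff.pow(2)).mean()
```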
Exactly why I hate the weirdly common take that there is no reproducibility crisis in machine learning because everyone just knows the secret gossip about what methods work in practice
I collected some folk knowledge for RL and stuck them in my lecture slides a couple weeks back: https://web.mit.edu/6.7920/www/lectures/L18-2024fa-Evaluation.pdf#page=55 See Appendix B... sorry, I know, appendix of a lecture slide deck is not the best for discovery. Suggestions very welcome.
This is awesome, thanks! 🙏 Forwarding to my students immediately!
I have a small note to add, which is a pet peeve of mine: when tuning hyperparameters, make sure to tune and report on different seeds! I think newbies especially might miss that, but it can make up to a factor of 8 difference as far as I've seen
That can also help! My point is more about the fact that by tuning, we're inducing an optimization bias (even with grid search, I'd say), so usually your performance will look much better on the exact setting you tune on.
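A concrete version of the seed point above, so that selection bias doesn't leak into the reported number: choose the configuration on one set of seeds, then report on a disjoint set. Everything named below (the configs dict, train_and_eval, the seed ranges) is a placeholder:

```python
import numpy as np

def tune_then_report(configs, train_and_eval, tune_seeds=range(3), report_seeds=range(100, 105)):
    """Select hyperparameters on one set of seeds, report on a disjoint set.

    `train_and_eval(config, seed)` stands in for a full training + evaluation
    run that returns a scalar score.
    """
    # 1) Selection: average performance over the tuning seeds only.
    tune_scores = {
        name: np.mean([train_and_eval(cfg, seed) for seed in tune_seeds])
        for name, cfg in configs.items()
    }
    best = max(tune_scores, key=tune_scores.get)

    # 2) Reporting: fresh seeds that the selection step never saw.
    scores = [train_and_eval(configs[best], seed) for seed in report_seeds]
    return best, float(np.mean(scores)), float(np.std(scores))
```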
You probably win in the long run when your expository article accumulates 10,000 citations because there's no other source for basic techniques in the field
Do you think folks from academia would spend time putting together content like this for an online publication of some sort if it weren’t officially sanctioned? Asking for a friend who would love to see something like this collected across all ML fields
In my very niche field of numerical conformal bootstrap, there’s a lot of “community knowledge” that is never published. There are only 4-5 people who possess that knowledge. I’ve been very lucky to collaborate with them, but I’m writing a giant appendix in my paper with all of these tips and tricks
in e.g. regularity theory for PDEs it used to be true to a tragic level... basically either you went to talk to the “gurus who tell you the sh*t orally” or the papers were so full of skipped steps that decrypting them + making some contribution was akin to decoding a hash function + continuing a blockchain
Comments
(Re the dunking on Octo above: the first time, I was the one doing the dunking.)
Re “we rely on real evals to get good feedback”: everyone being like “I tried it and it doesn't work” actually kinda is that feedback.
Re fine-tuning Octo:
- requires too much data to fine-tune
- once fine-tuned, it does not generalize even as well as the original
- I'm sure there are success stories, I've just heard so many failures
Not sure if there is a blog track for the upcoming iclr, though.
More info on this year's track here:
https://iclr-blogposts.github.io/2025/about/
https://distill.pub/
https://supervised-robot-learning.github.io/
(I agree with your opinion, disagree with criteria for conference acceptance and the value we put on such acceptances.)
In fact, writing good Wikipedia articles for your field might be the best way to spread this knowledge.
do you just upload a pdf to arxiv?