RL promises "systems that can adapt to their environment". However, no RL system that I know of actually fulfills anything close to this goal, and, furthermore, I'd argue that all the current RL methodologies are actively hostile to it. Prove me wrong.
Comments
However, combining RL in distributed training environments with federated learning and semi-manual feedback loops has given us decent results in relatively constrained exercises.
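A rough sketch of the kind of setup I mean: workers run local (here tabular) RL, their learned parameters are periodically averaged FedAvg-style, and a manual feedback hook shapes the reward. The toy chain task, the averaging scheme, and the feedback bonus below are all illustrative assumptions, not a description of any real production system.

    # Sketch: tabular Q-learning workers + FedAvg-style averaging + a semi-manual
    # feedback hook. Everything here (toy chain env, averaging scheme, feedback
    # bonus) is an illustrative assumption, not any particular production system.
    import numpy as np

    N_STATES, N_ACTIONS, N_WORKERS = 10, 2, 4
    rng = np.random.default_rng(0)

    def step(state, action):
        """Toy chain environment: action 1 moves right, action 0 resets."""
        state = min(state + 1, N_STATES - 1) if action == 1 else 0
        reward = 1.0 if state == N_STATES - 1 else 0.0
        return state, reward

    def manual_feedback(state, action):
        """Stand-in for a semi-manual feedback loop: occasional human-style bonus."""
        return 0.5 if (action == 1 and rng.random() < 0.1) else 0.0

    def local_rollout(q, episodes=20, alpha=0.1, gamma=0.95, eps=0.1):
        for _ in range(episodes):
            s = 0
            for _ in range(50):
                a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(q[s]))
                s2, r = step(s, a)
                r += manual_feedback(s, a)                    # inject feedback signal
                q[s, a] += alpha * (r + gamma * q[s2].max() - q[s, a])
                s = s2
        return q

    # Federated loop: each worker learns locally, then tables are averaged.
    global_q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(10):
        local_qs = [local_rollout(global_q.copy()) for _ in range(N_WORKERS)]
        global_q = np.mean(local_qs, axis=0)

    print(global_q.argmax(axis=1))  # greedy policy after federated training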
Practically speaking, from what I've seen of large-scale RL projects, it's never been a cost-effective goal.
Constraining the environment always makes sense there.
https://finale.seas.harvard.edu/publications/direct-policy-transfer-hidden-parameter-markov-decision-processes
Otherwise there is "continual learning", and things like MAML.
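For reference, the core of MAML is a two-level update: an inner gradient step adapted per task, and a meta-update that backpropagates through that adaptation. A minimal PyTorch sketch of that idea follows; `sample_tasks` and `loss_fn` are hypothetical placeholders, and this is not the reference implementation.

    # Minimal MAML-style meta-update (illustrative sketch, not the official code).
    # Assumes `sample_tasks(n)` yields (x_support, y_support, x_query, y_query)
    # tuples and `loss_fn` is e.g. torch.nn.functional.mse_loss; both placeholders.
    import torch
    from torch.func import functional_call

    def maml_meta_step(model, sample_tasks, loss_fn,
                       inner_lr=0.01, meta_lr=1e-3, n_tasks=4):
        meta_opt = torch.optim.SGD(model.parameters(), lr=meta_lr)
        params = dict(model.named_parameters())
        meta_loss = 0.0
        for x_s, y_s, x_q, y_q in sample_tasks(n_tasks):
            # Inner loop: one gradient step adapted to this task (support set).
            inner_loss = loss_fn(functional_call(model, params, (x_s,)), y_s)
            grads = torch.autograd.grad(inner_loss, list(params.values()),
                                        create_graph=True)
            adapted = {name: p - inner_lr * g
                       for (name, p), g in zip(params.items(), grads)}
            # Outer objective: adapted parameters evaluated on the query set.
            meta_loss = meta_loss + loss_fn(functional_call(model, adapted, (x_q,)), y_q)
        # Meta-update: backprop through the inner adaptation step.
        meta_opt.zero_grad()
        meta_loss.backward()
        meta_opt.step()
        return meta_loss.item()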
But even in, e.g., quadruped locomotion, people do domain adaptation (a minimal sketch of the idea follows after the links below).
But it's generally still "in distribution"; encountering something very different for the first time and learning it is generally done offline, as far as I know.
https://ieeexplore.ieee.org/abstract/document/9320226
https://ieeexplore.ieee.org/abstract/document/10086005
https://arxiv.org/abs/2410.13852
https://youtu.be/Ogc7kQBEndg
(the key difference is that the more recent stuff draws the signal from the interaction itself, so it doesn't need explicit feedback for the reward signal)
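The domain-adaptation sketch promised above: domain randomization in its simplest form, i.e. train one policy across a distribution of dynamics parameters so the variation seen at deployment stays "in distribution". The 1-D point-mass task, the parameter ranges, and the random-search trainer are all illustrative assumptions, not taken from the linked papers.

    # Domain randomization sketch: train one policy across randomly sampled
    # dynamics (mass, friction), so deployment-time variation stays in-distribution.
    # Toy task, parameter ranges, and random-search trainer are illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)

    def rollout(policy_w, mass, friction, steps=200, dt=0.05):
        """Push a 1-D point mass toward the origin; return negative quadratic cost."""
        pos, vel, total = 1.0, 0.0, 0.0
        for _ in range(steps):
            force = float(np.clip(policy_w @ np.array([pos, vel]), -1.0, 1.0))
            acc = (force - friction * vel) / mass
            vel += acc * dt
            pos += vel * dt
            total -= pos ** 2 + 0.1 * force ** 2
        return total

    def sample_dynamics():
        """Randomize physical parameters each episode (the 'domain' distribution)."""
        return rng.uniform(0.5, 2.0), rng.uniform(0.0, 0.5)   # mass, friction

    # Simple random-search training over the randomized dynamics distribution.
    w = np.zeros(2)
    best = np.mean([rollout(w, *sample_dynamics()) for _ in range(5)])
    for _ in range(200):
        cand = w + 0.1 * rng.standard_normal(2)
        score = np.mean([rollout(cand, *sample_dynamics()) for _ in range(5)])
        if score > best:
            w, best = cand, score

    print("policy weights:", w, "avg return:", best)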
Some design choices that arguably work against this kind of adaptation:
- reliance on (numerical) reward as the only signal
- storing the policy in model weights / tables
- having a relatively "clean" mathematical formulation that does not naturally account for "processes" such as memory, imagination, simulation, etc.
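For concreteness, the formulation those bullets point at is roughly the standard textbook one (my rendering): a single scalar reward defines the objective, and the policy is nothing more than a parameter vector or table.

    % Standard MDP objective: one scalar reward r(s_t, a_t); the policy lives
    % entirely in the parameters \theta (network weights or a lookup table).
    J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
    \qquad \theta \in \mathbb{R}^{d} \ \text{(weights)} \quad \text{or} \quad \theta \in \mathbb{R}^{|S|\times|A|} \ \text{(table)}.

Everything the agent retains between interactions has to be squeezed into theta, which is part of why processes like memory, imagination, and simulation feel bolted on.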
Even RL pioneers like Sutton advocate for many (all?) of these. E.g., see his Alberta agent research plan.
https://arxiv.org/pdf/2208.11173
While reward is present, it's not a singular reward signal.
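One way to read "not a singular reward signal" is the GVF/Horde idea from Sutton's group: the agent learns many predictions about different cumulants in parallel, not just a value estimate for one reward. A toy TD(0) sketch of that idea; the chain environment and the particular cumulants are my illustrative assumptions, not code from the plan.

    # Toy sketch of many prediction signals (GVF-style): several TD learners, each
    # predicting the discounted sum of a different cumulant, not just the reward.
    # The chain environment and the chosen cumulants are illustrative assumptions.
    import numpy as np

    N_STATES, GAMMA, ALPHA = 6, 0.9, 0.1
    rng = np.random.default_rng(0)

    def cumulants(state, next_state):
        """Three signals about the same stream of experience."""
        reward   = 1.0 if next_state == N_STATES - 1 else 0.0   # the usual reward
        moved    = float(next_state != state)                    # "did I move?"
        near_end = float(next_state >= N_STATES - 2)             # "am I near the goal?"
        return np.array([reward, moved, near_end])

    # One value table per cumulant: the "horde" of predictions.
    V = np.zeros((3, N_STATES))

    for _ in range(2000):
        s = rng.integers(N_STATES - 1)
        s2 = min(s + rng.integers(0, 2), N_STATES - 1)           # random walk right
        c = cumulants(s, s2)
        # TD(0) update applied to every prediction in parallel.
        V[:, s] += ALPHA * (c + GAMMA * V[:, s2] - V[:, s])

    print(np.round(V, 2))   # row 0: reward predictions; rows 1-2: auxiliary signals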
Many RL researchers work on more real-time adaptation and have for some time. Progress is made slowly.
https://arxiv.org/abs/1701.06049
The domain is simplistic, but the ideas are really cool. We tried to apply COACH (core update sketched below), but left it as an open problem and opted for simpler methods, e.g.:
https://arxiv.org/abs/2212.09710
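For context, the core of COACH as I read the paper: human feedback is policy-dependent and behaves like an advantage, so it is plugged into a policy-gradient update in place of the advantage term. A minimal sketch with a softmax policy; the toy states, the simulated "human", and the omission of eligibility traces are my simplifications.

    # Minimal sketch of a COACH-style update: human feedback f in {-1, +1} is used
    # in place of the advantage in a policy-gradient step (eligibility traces
    # omitted). The discrete toy setup and feedback source are illustrative only.
    import numpy as np

    N_STATES, N_ACTIONS, LR = 4, 3, 0.5
    rng = np.random.default_rng(0)
    theta = np.zeros((N_STATES, N_ACTIONS))     # softmax policy parameters

    def policy(state):
        logits = theta[state]
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def human_feedback(state, action):
        """Stand-in for a human trainer: likes action == state % N_ACTIONS."""
        return 1.0 if action == state % N_ACTIONS else -1.0

    for _ in range(500):
        s = rng.integers(N_STATES)
        probs = policy(s)
        a = rng.choice(N_ACTIONS, p=probs)
        f = human_feedback(s, a)                # feedback plays the role of the advantage
        grad_log_pi = -probs                    # d log pi(a|s) / d theta[s, :]
        grad_log_pi[a] += 1.0
        theta[s] += LR * f * grad_log_pi        # COACH-style update

    print(np.array([policy(s).argmax() for s in range(N_STATES)]))  # learned preferences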
Other work I did included simultaneously grounding language so you could interactively train with feedback and language commands.
Beyond my stuff...
Sergey Levine has a bunch of work in this space too. Meta World might be a good primer for some of that: https://meta-world.github.io/
https://arxiv.org/abs/2301.07608 - all tasks are technically in-distribution, but they are plausibly distinct from the training examples, and active exploration is plausibly happening