RL promises "systems that can adapt to their environment". However, no RL system that I know of actually fulfills anything close to this goal, and, furthermore, I'd argue that all the current RL methodologies are actively hostile to it. Prove me wrong.
Comments
However, combining RL in distributed training environments with federated learning and semi-manual feedback loops has given us decent results in relatively constrained exercises.
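A rough sketch of the kind of setup I mean: workers run local (here tabular) RL, their learned parameters are periodically averaged FedAvg-style, and a manual feedback hook shapes the reward. The toy chain task, the averaging scheme, and the feedback bonus below are all illustrative assumptions, not a description of any real production system.

    # Sketch: tabular Q-learning workers + FedAvg-style averaging + a semi-manual
    # feedback hook. Everything here (toy chain env, averaging scheme, feedback
    # bonus) is an illustrative assumption, not any particular production system.
    import numpy as np

    N_STATES, N_ACTIONS, N_WORKERS = 10, 2, 4
    rng = np.random.default_rng(0)

    def step(state, action):
        """Toy chain environment: action 1 moves right, action 0 resets."""
        state = min(state + 1, N_STATES - 1) if action == 1 else 0
        reward = 1.0 if state == N_STATES - 1 else 0.0
        return state, reward

    def manual_feedback(state, action):
        """Stand-in for a semi-manual feedback loop: occasional human-style bonus."""
        return 0.5 if (action == 1 and rng.random() < 0.1) else 0.0

    def local_rollout(q, episodes=20, alpha=0.1, gamma=0.95, eps=0.1):
        for _ in range(episodes):
            s = 0
            for _ in range(50):
                a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(q[s]))
                s2, r = step(s, a)
                r += manual_feedback(s, a)                    # inject feedback signal
                q[s, a] += alpha * (r + gamma * q[s2].max() - q[s, a])
                s = s2
        return q

    # Federated loop: each worker learns locally, then tables are averaged.
    global_q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(10):
        local_qs = [local_rollout(global_q.copy()) for _ in range(N_WORKERS)]
        global_q = np.mean(local_qs, axis=0)

    print(global_q.argmax(axis=1))  # greedy policy after federated training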
Practically speaking, from what I've seen of large-scale RL projects, it's never been a cost-effective goal.
Constraining the environment always makes sense there.
https://finale.seas.harvard.edu/publications/direct-policy-transfer-hidden-parameter-markov-decision-processes
Otherwise there is "continual learning", and things like MAML.
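For reference, the core of MAML is a two-level update: an inner gradient step adapted per task, and a meta-update that backpropagates through that adaptation. A minimal PyTorch sketch of that idea follows; `sample_tasks` and `loss_fn` are hypothetical placeholders, and this is not the reference implementation.

    # Minimal MAML-style meta-update (illustrative sketch, not the official code).
    # Assumes `sample_tasks(n)` yields (x_support, y_support, x_query, y_query)
    # tuples and `loss_fn` is e.g. torch.nn.functional.mse_loss; both placeholders.
    import torch
    from torch.func import functional_call

    def maml_meta_step(model, sample_tasks, loss_fn,
                       inner_lr=0.01, meta_lr=1e-3, n_tasks=4):
        meta_opt = torch.optim.SGD(model.parameters(), lr=meta_lr)
        params = dict(model.named_parameters())
        meta_loss = 0.0
        for x_s, y_s, x_q, y_q in sample_tasks(n_tasks):
            # Inner loop: one gradient step adapted to this task (support set).
            inner_loss = loss_fn(functional_call(model, params, (x_s,)), y_s)
            grads = torch.autograd.grad(inner_loss, list(params.values()),
                                        create_graph=True)
            adapted = {name: p - inner_lr * g
                       for (name, p), g in zip(params.items(), grads)}
            # Outer objective: adapted parameters evaluated on the query set.
            meta_loss = meta_loss + loss_fn(functional_call(model, adapted, (x_q,)), y_q)
        # Meta-update: backprop through the inner adaptation step.
        meta_opt.zero_grad()
        meta_loss.backward()
        meta_opt.step()
        return meta_loss.item()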
But even in, e.g., quadruped locomotion, people do domain adaptation (a minimal sketch of the idea follows after the links below).
But it's generally still "in distribution"; encountering something very different for the first time and learning it is generally done offline, as far as I know.
https://ieeexplore.ieee.org/abstract/document/9320226
https://ieeexplore.ieee.org/abstract/document/10086005
https://arxiv.org/abs/2410.13852
https://youtu.be/Ogc7kQBEndg
(the key difference is that the more recent stuff draws the signal from the interaction itself, so it doesn't need explicit feedback for the reward signal)
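The domain-adaptation sketch promised above: domain randomization in its simplest form, i.e. train one policy across a distribution of dynamics parameters so the variation seen at deployment stays "in distribution". The 1-D point-mass task, the parameter ranges, and the random-search trainer are all illustrative assumptions, not taken from the linked papers.

    # Domain randomization sketch: train one policy across randomly sampled
    # dynamics (mass, friction), so deployment-time variation stays in-distribution.
    # Toy task, parameter ranges, and random-search trainer are illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)

    def rollout(policy_w, mass, friction, steps=200, dt=0.05):
        """Push a 1-D point mass toward the origin; return negative quadratic cost."""
        pos, vel, total = 1.0, 0.0, 0.0
        for _ in range(steps):
            force = float(np.clip(policy_w @ np.array([pos, vel]), -1.0, 1.0))
            acc = (force - friction * vel) / mass
            vel += acc * dt
            pos += vel * dt
            total -= pos ** 2 + 0.1 * force ** 2
        return total

    def sample_dynamics():
        """Randomize physical parameters each episode (the 'domain' distribution)."""
        return rng.uniform(0.5, 2.0), rng.uniform(0.0, 0.5)   # mass, friction

    # Simple random-search training over the randomized dynamics distribution.
    w = np.zeros(2)
    best = np.mean([rollout(w, *sample_dynamics()) for _ in range(5)])
    for _ in range(200):
        cand = w + 0.1 * rng.standard_normal(2)
        score = np.mean([rollout(cand, *sample_dynamics()) for _ in range(5)])
        if score > best:
            w, best = cand, score

    print("policy weights:", w, "avg return:", best)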
Some design choices that arguably work against this kind of adaptation:
- reliance on (numerical) reward as the only signal
- storing the policy in model weights / tables
- having a relatively "clean" mathematical formulation that does not naturally account for "processes" such as memory, imagination, simulation, etc.
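For concreteness, the formulation those bullets point at is roughly the standard textbook one (my rendering): a single scalar reward defines the objective, and the policy is nothing more than a parameter vector or table.

    % Standard MDP objective: one scalar reward r(s_t, a_t); the policy lives
    % entirely in the parameters \theta (network weights or a lookup table).
    J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
    \qquad \theta \in \mathbb{R}^{d} \ \text{(weights)} \quad \text{or} \quad \theta \in \mathbb{R}^{|S|\times|A|} \ \text{(table)}.

Everything the agent retains between interactions has to be squeezed into theta, which is part of why processes like memory, imagination, and simulation feel bolted on.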
Even RL pioneers like Sutton advocate for many (all?) of these. E.g., see his Alberta agent research plan.
https://arxiv.org/pdf/2208.11173
While reward is present, it's not a singular reward signal.
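One way to read "not a singular reward signal" is the GVF/Horde idea from Sutton's group: the agent learns many predictions about different cumulants in parallel, not just a value estimate for one reward. A toy TD(0) sketch of that idea; the chain environment and the particular cumulants are my illustrative assumptions, not code from the plan.

    # Toy sketch of many prediction signals (GVF-style): several TD learners, each
    # predicting the discounted sum of a different cumulant, not just the reward.
    # The chain environment and the chosen cumulants are illustrative assumptions.
    import numpy as np

    N_STATES, GAMMA, ALPHA = 6, 0.9, 0.1
    rng = np.random.default_rng(0)

    def cumulants(state, next_state):
        """Three signals about the same stream of experience."""
        reward   = 1.0 if next_state == N_STATES - 1 else 0.0   # the usual reward
        moved    = float(next_state != state)                    # "did I move?"
        near_end = float(next_state >= N_STATES - 2)             # "am I near the goal?"
        return np.array([reward, moved, near_end])

    # One value table per cumulant: the "horde" of predictions.
    V = np.zeros((3, N_STATES))

    for _ in range(2000):
        s = rng.integers(N_STATES - 1)
        s2 = min(s + rng.integers(0, 2), N_STATES - 1)           # random walk right
        c = cumulants(s, s2)
        # TD(0) update applied to every prediction in parallel.
        V[:, s] += ALPHA * (c + GAMMA * V[:, s2] - V[:, s])

    print(np.round(V, 2))   # row 0: reward predictions; rows 1-2: auxiliary signals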
Many RL researchers work on more real-time adaptation and have for some time. Progress is made slowly.
https://arxiv.org/abs/1701.06049
The domain is simplistic, but the ideas are really cool. We tried to apply COACH (core update sketched below), but left it as an open problem and opted for simpler methods, e.g.:
https://arxiv.org/abs/2212.09710
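For context, the core of COACH as I read the paper: human feedback is policy-dependent and behaves like an advantage, so it is plugged into a policy-gradient update in place of the advantage term. A minimal sketch with a softmax policy; the toy states, the simulated "human", and the omission of eligibility traces are my simplifications.

    # Minimal sketch of a COACH-style update: human feedback f in {-1, +1} is used
    # in place of the advantage in a policy-gradient step (eligibility traces
    # omitted). The discrete toy setup and feedback source are illustrative only.
    import numpy as np

    N_STATES, N_ACTIONS, LR = 4, 3, 0.5
    rng = np.random.default_rng(0)
    theta = np.zeros((N_STATES, N_ACTIONS))     # softmax policy parameters

    def policy(state):
        logits = theta[state]
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def human_feedback(state, action):
        """Stand-in for a human trainer: likes action == state % N_ACTIONS."""
        return 1.0 if action == state % N_ACTIONS else -1.0

    for _ in range(500):
        s = rng.integers(N_STATES)
        probs = policy(s)
        a = rng.choice(N_ACTIONS, p=probs)
        f = human_feedback(s, a)                # feedback plays the role of the advantage
        grad_log_pi = -probs                    # d log pi(a|s) / d theta[s, :]
        grad_log_pi[a] += 1.0
        theta[s] += LR * f * grad_log_pi        # COACH-style update

    print(np.array([policy(s).argmax() for s in range(N_STATES)]))  # learned preferences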
Other work I did included simultaneously grounding language so you could interactively train with feedback and language commands.
Beyond my stuff...
Sergey Levine has a bunch of work in this space too. Meta World might be a good primer for some of that: https://meta-world.github.io/
https://arxiv.org/abs/2301.07608 - all tasks are technically in-distribution, but they are plausibly distinct from the training examples, and active exploration is plausibly happening