In a similar vein to Extreme Learning Machines, we introduce eXtreme Policy Optimization. We start from a randomly initialized policy that is kept fixed; during learning, a trainable transformation layer shapes this random policy into the final policy. We show that fewer gradient updates are needed to reach a reasonable policy.
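A minimal sketch of how this could look, assuming the Extreme-Learning-Machine analogy means a frozen random policy network with only a small transformation head being trained (all class and parameter names below, such as `FrozenRandomBase` and `TransformLayer`, are illustrative assumptions, not from the original text):

```python
import torch
import torch.nn as nn

# Sketch only: a randomly initialized base policy is frozen (as in Extreme
# Learning Machines); a small trainable transformation layer maps its
# features to the final action distribution.

class FrozenRandomBase(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        for p in self.parameters():          # never updated
            p.requires_grad_(False)

    def forward(self, obs):
        return self.net(obs)

class TransformLayer(nn.Module):
    """The only trainable part: frozen features -> action logits."""
    def __init__(self, hidden: int, n_actions: int):
        super().__init__()
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, feats):
        return self.head(feats)

obs_dim, n_actions = 4, 2
base = FrozenRandomBase(obs_dim)
transform = TransformLayer(64, n_actions)
opt = torch.optim.Adam(transform.parameters(), lr=1e-2)

# One REINFORCE-style update on a dummy batch (obs, action, return):
obs = torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
returns = torch.randn(32)

logits = transform(base(obs))
log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
loss = -(log_probs * returns).mean()        # policy-gradient surrogate loss
opt.zero_grad(); loss.backward(); opt.step()
```

Since only the small head receives gradients, each update touches far fewer parameters, which is one plausible reading of the "fewer updates" claim.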
"We introduce 'Cross Policy Optimization' (XPO): a novel approach that simultaneously trains two diverse policies. XPO uses a bias-free gradient that uses the on-policy \emph{and} off-policy trajectories from both policies to train each."
Comments
It could happen.