I collected some folk knowledge for RL and stuck it in my lecture slides a couple of weeks back: https://web.mit.edu/6.7920/www/lectures/L18-2024fa-Evaluation.pdf#page=55 See Appendix B... sorry, I know, the appendix of a lecture slide deck is not the best place for discovery. Suggestions very welcome.
This is awesome, thanks! 🙏 Forwarding to my students immediately!
I have a small note on something that is a pet peeve of mine: when tuning hyperparameters, make sure to tune and report on different seeds! I think newbies especially might miss that, but it can make a difference of up to a factor of 8 as far as I've seen.
That can also help! My point is more about the fact that by tuning, we're inducing an optimization bias (even with grid search, I'd say), so usually your performance will look much better on the exact setting you tune on.
The problem is that, just as with any other optimization, generalization to other settings is then limited and not necessarily predictable, potentially leading to much better or much worse performance. That's the difference between the similarly colored bars in this plot.
So basically, reporting the direct outcome of tuning is like reporting only training performance. It's better practice in the AutoML community to use a validation setting (e.g. fresh seeds) instead, to get a more realistic picture of the algorithm's performance.
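In case it helps to make this concrete, here's a minimal sketch of the pattern: tune on one set of seeds, then re-evaluate the chosen config on fresh seeds and report that number. Everything here is illustrative; `run_experiment`, the search space, and the seed lists are hypothetical stand-ins for your actual training/evaluation pipeline.

```python
import itertools
import random
import statistics

def run_experiment(config: dict, seed: int) -> float:
    """Stand-in for a real train-and-evaluate run; replace with your own pipeline."""
    rng = random.Random(hash((config["lr"], config["gamma"], seed)))
    return rng.gauss(100.0, 10.0)  # pretend final return of the trained agent

SEARCH_SPACE = {"lr": [1e-4, 3e-4, 1e-3], "gamma": [0.99, 0.995]}
TUNING_SEEDS = [0, 1, 2]                 # seeds used during the search
VALIDATION_SEEDS = [10, 11, 12, 13, 14]  # fresh seeds, never seen while tuning

def configs(space):
    """Enumerate all hyperparameter combinations (plain grid search)."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

# Tuning: pick the config with the best mean return across the tuning seeds.
best_config = max(
    configs(SEARCH_SPACE),
    key=lambda cfg: statistics.mean(run_experiment(cfg, s) for s in TUNING_SEEDS),
)

# Reporting: re-run the chosen config on fresh seeds and report that number,
# not the (optimistically biased) score it achieved on the tuning seeds.
tuning_score = statistics.mean(run_experiment(best_config, s) for s in TUNING_SEEDS)
validation_score = statistics.mean(run_experiment(best_config, s) for s in VALIDATION_SEEDS)
print(f"best config: {best_config}")
print(f"score on tuning seeds:     {tuning_score:.1f}  (optimistic)")
print(f"score on validation seeds: {validation_score:.1f}  (report this)")
```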