Interesting! Would be cool to have these datasets on OpenML as well so they are easy to use in tabular benchmarks.
Here are some more recommendations for stronger tabular baselines:
1. For CatBoost and XGBoost, you'd want at least early stopping to select the best iteration. Using my library https://github.com/dholzmueller/pytabkit, you could, for example, use `CatBoost_TD_Regressor(n_cv=5)`, which will use better default parameters for regression, train five models in a cross-validation setup, select the best iteration for each, and ensemble them. The library offers the same for XGBoost and LightGBM.
2. The library also includes some of the best tabular DL models, such as RealTabR, TabR, RealMLP, and TabM, that could be interesting to try. (ModernNCA is also very good but not included.)
3. Finally, if you just want the best performance for a given (large) time budget, AutoGluon combines many tabular models. It does not include some of the latest models (yet), but it has a very good CatBoost, for example, and will likely outperform individual models.
This would be a nice "baseline" to have for the associated Polaris benchmark: https://polarishub.io/benchmarks/biogen/adme-fang-reg-v1.
We're aiming to serve as a source of truth for machine learning in drug discovery. For context, see: https://polarishub.io/blog/reproducible-machine-learning-in-drug-discovery-how-polaris-serves-as-a-single-source-of-truth
Would love to hear your feedback!