dholzmueller.bsky.social
Postdoc in machine learning with Francis Bach & @GaelVaroquaux: neural networks, tabular data, uncertainty, active learning, atomistic ML, learning theory. https://dholzmueller.github.io
157 posts 722 followers 140 following
comment in response to post
What about work on adaptive learning rates (in the sense of convergence rates, not step sizes) that studies methods with hyperparameter optimization on a holdout set to achieve optimal/good convergence rates simultaneously for different classes of functions? E.g. projecteuclid.org/journals/ann...
comment in response to post
By the way, I think an intercept in this case is necessary because the logistic regression model does not have an intercept. For more realistic models that can learn an intercept themselves, I think an intercept for TS is probably not very important.
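For illustration, here is a minimal sketch (not code from the thread; the function name is made up) of binary temperature scaling with an added intercept, fit by maximum likelihood on held-out logits:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_temp_scaling_with_intercept(logits, labels):
    """Fit p = sigmoid(logits / T + b) by maximum likelihood on a held-out calibration set."""
    def nll(params):
        inv_t, b = params  # parametrize by 1/T for numerical stability
        p = np.clip(expit(inv_t * logits + b), 1e-12, 1 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    inv_t, b = minimize(nll, x0=[1.0, 0.0], method="Nelder-Mead").x
    return 1.0 / inv_t, b

# toy usage on synthetic calibration data
rng = np.random.default_rng(0)
logits = rng.normal(size=1000)
labels = (rng.random(1000) < expit(2.0 * logits - 0.5)).astype(float)
T, b = fit_temp_scaling_with_intercept(logits, labels)
calibrated_probs = expit(logits / T + b)
```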
comment in response to post
Finally, if you just want to have the best performance for a given (large) time budget, AutoGluon combines many tabular models. It does not include some of the latest models (yet), but has a very good CatBoost, for example, and will likely outperform individual models.
comment in response to post
The library offers the same for XGBoost and LightGBM. Plus, it includes some of the best tabular DL models like RealTabR, TabR, RealMLP, and TabM that could also be interesting to try. (ModernNCA is also very good but not included.)
comment in response to post
Using my library github.com/dholzmueller... you could, for example, use CatBoost_TD_Regressor(n_cv=5), which will use better default parameters for regression, train five models in a cross-validation setup, select the best iteration for each, and ensemble them.
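Roughly how that could look in code, assuming pytabkit's scikit-learn-style interface (the exact import path may differ):

```python
from pytabkit import CatBoost_TD_Regressor            # pip install pytabkit
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_cv=5: five CatBoost models with tuned regression defaults, each early-stopped
# on its own cross-validation fold and ensembled at prediction time
model = CatBoost_TD_Regressor(n_cv=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```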
comment in response to post
Interesting! Would be cool to have these datasets on OpenML as well so they are easy to use in tabular benchmarks. Here are some more recommendations for stronger tabular baselines: 1. For CatBoost and XGBoost, you'd want at least early stopping to select the best iteration.
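As a hedged example of what that means in practice, using XGBoost's scikit-learn interface (recent versions; older ones take early_stopping_rounds in fit() instead):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# train with many rounds and let early stopping pick the best iteration on a validation set
clf = XGBClassifier(n_estimators=1000, learning_rate=0.05,
                    early_stopping_rounds=50, eval_metric="logloss")
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(clf.best_iteration)   # the selected boosting round
```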
comment in response to post
github.com/EFS-OpenSour... has some calibration methods like this implemented, but their temperature scaling MLE version has a bug where it doesn't optimize, so I didn't include it in our benchmark.
comment in response to post
I think Dirichlet scaling (or the binary version Beta scaling) also includes an intercept but I'm not sure. In my experience it's very slow, though, and not better than temperature scaling at least on smaller datasets (~1K-10K calibration samples).
comment in response to post
There is an adapter for Dirichlet scaling, which is basically regularized matrix scaling. (Except that matrix scaling can exploit shifts in the logits, which a true post-hoc calibration method like Dirichlet scaling can't IIUC).
comment in response to post
In case anyone is wondering about the name RealMLP, it is motivated by the “Real MVP” meme (which probably also inspired the RealNVP method). 6/6
comment in response to post
The benchmark: arxiv.org/abs/2407.00956 RealMLP: github.com/dholzmueller... 5/ bsky.app/profile/dhol...
comment in response to post
It is surprising how many DL methods perform worse than the simple MLP baseline by Gorishniy, @puhsu.bsky.social et al. This highlights the benchmarking problems in the field (and potentially the difficulty in using many of these models correctly). The situation is slowly improving. 4/
comment in response to post
When including more baselines, RealMLP’s average rank slightly improves to make it the top-performing method overall, with a fifth place on binary classification, first place on multi-class, and second place on regression. 3/
comment in response to post
Some caveats: All DL models are trained with a batch size of 1024, while we recommend using 256 for RealMLP on medium-sized datasets. Other choices (selection of datasets, not using bagging, choice of metrics, search spaces for baselines) can of course also influence results. 2/
comment in response to post
by "classical" I mean deep learning models, just for supervised learning
comment in response to post
Is "classical" supervised tabular learning also part of the workshop?
comment in response to post
We also have results for LightGBM with our tuned default hyperparameters (LGBM-TD), but they are somewhat similar and the behavior might depend on the “subsample” hyperparameter (which is related to bagging). 4/
comment in response to post
It is reassuring that the best (average or individual) stopping epoch from bagging works well for RealMLP in the refitting setting, where no validation set is available. It would be interesting to see if this holds up in the non-iid setting with time-based splits. 3/
comment in response to post
The result? Refitting is a bit better, but only if you fit an ensemble during refitting. But: it’s slower, you don’t get validation scores for the refitted models, the result might change with more folds, and tuning the hyperparameters on the CV scores may favor bagging. 2/
comment in response to post
No problem, and thanks for thinking of me 🙂
comment in response to post
Not really...
comment in response to post
Yes
comment in response to post
A reason for the different sensitivities may also be that val metrics that are more similar to the train loss are more likely to decrease monotonically, and therefore have less risk of stopping too early. For regression with MSE we found little sensitivity to the patience. 3/
comment in response to post
For early stopping on boosted trees, using accuracy as the val metric requires high patience. Brier loss yields similar test accuracy for high patience but is less sensitive to patience. Cross-entropy (the train metric) is even less sensitive but not as good for test accuracy. 2/
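A sketch of this setup with LightGBM's scikit-learn interface; here "patience" is the stopping_rounds argument, accuracy corresponds to the built-in binary_error metric, and a Brier-score metric would need a custom eval function:

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="binary_error",                           # accuracy-based val metric ...
    callbacks=[lgb.early_stopping(stopping_rounds=300)],  # ... needs a high patience
)
print(clf.best_iteration_)
```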
comment in response to post
I did a small test with TabM-mini and 5-fold bagging, only default parameters with numerical embeddings. It seems that it's roughly comparable with RealMLP. But then maybe RealMLP can benefit more from additional ensembling or the two could be combined. A fair comparison with ensembling is hard.
comment in response to post
I think I would fit here 🙂
comment in response to post
This paper was a huge effort and we have many more results, insights, and plots that didn’t make it into this thread. I’ll post some of them soon, so please consider following me if you’re interested! 15/
comment in response to post
Finally, there are some limitations, partially due to the cost of running all of the benchmarks. 14/
comment in response to post
For training, we use AdamW with a multi-cycle learning rate schedule. Since it makes early stopping more difficult, we always train for the full 256 epochs and revert to the best epoch afterwards. Unfortunately, this makes RealMLP quite a bit slower on average. 13/
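A rough PyTorch sketch of this training scheme, with cosine annealing with warm restarts as a stand-in for the paper's actual multi-cycle schedule, and toy data/model only to keep it runnable:

```python
import copy
import torch

# toy data and model, purely to make the sketch self-contained
X, y = torch.randn(512, 16), torch.randn(512, 1)
X_val, y_val = torch.randn(128, 16), torch.randn(128, 1)
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.SELU(), torch.nn.Linear(64, 1))

opt = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-2)
# stand-in for the paper's multi-cycle schedule
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=64)

best_val, best_state = float("inf"), None
for epoch in range(256):                      # always train for the full 256 epochs
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
    sched.step()
    with torch.no_grad():
        val = torch.nn.functional.mse_loss(model(X_val), y_val).item()
    if val < best_val:                        # track the best epoch on validation data
        best_val, best_state = val, copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)             # revert to the best epoch afterwards
```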
comment in response to post
For classification, using label smoothing in the cross-entropy loss improves the results for classification error, but hurts other metrics like AUROC (see below) or cross-entropy itself. This discrepancy is inconvenient, and I hope it can be resolved in future research. 12/
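(For reference, label smoothing is directly supported by PyTorch's cross-entropy loss via the label_smoothing argument; the specific value is a tuning choice, not a recommendation from the paper.)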
comment in response to post
To encourage feature selection, we introduce a diagonal weight layer, which we call the scaling layer, after the embedding layer. Luckily, we found that it is much more effective with a much larger layer-wise learning rate (96x for RealTabR-D). 11/
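A minimal sketch of the idea (not the paper's implementation): a diagonal scaling layer whose parameters get a much larger learning rate via optimizer parameter groups:

```python
import torch

class ScalingLayer(torch.nn.Module):
    """Diagonal weight layer: one learnable scale per feature/embedding dimension."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * self.scale

model = torch.nn.Sequential(ScalingLayer(16), torch.nn.Linear(16, 64),
                            torch.nn.SELU(), torch.nn.Linear(64, 1))

base_lr = 1e-3
# give the scaling layer a much larger layer-wise learning rate via parameter groups
opt = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": 96 * base_lr},  # factor mentioned in the post
    {"params": model[1:].parameters(), "lr": base_lr},
])
```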
comment in response to post
Architecturally, we modify numerical embedding layers (arxiv.org/abs/2203.05556) by introducing first-layer biases and a DenseNet-style skip connection, which yields good results even at (CPU-friendly) small embedding sizes. 10/
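A loose sketch of the general idea: per-feature embeddings with first-layer biases and a skip connection that concatenates the raw value. The actual RealMLP embedding variant differs in its details:

```python
import torch

class NumEmbedding(torch.nn.Module):
    """Loose sketch: per-feature periodic embeddings with first-layer biases and a
    DenseNet-style skip that concatenates the raw feature value."""
    def __init__(self, n_features: int, emb_dim: int = 4):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(n_features, emb_dim))
        self.bias = torch.nn.Parameter(torch.rand(n_features, emb_dim))  # first-layer biases

    def forward(self, x):                       # x: (batch, n_features)
        h = torch.cos(2 * torch.pi * (x[..., None] * self.weight + self.bias))
        return torch.cat([x[..., None], h], dim=-1)  # skip: keep the raw value

emb = NumEmbedding(n_features=8, emb_dim=4)
out = emb(torch.randn(32, 8))                   # shape: (32, 8, 5)
```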
comment in response to post
We introduce robust scaling + smooth clipping (RS+SC), an outlier-robust preprocessing method combining quantile-based rescaling and soft clipping to (-3, 3). It is more robust than a StandardScaler but preserves more distributional information than a QuantileTransformer. 9/
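A minimal NumPy sketch of the idea behind RS+SC (the exact quantiles and clipping formula in the paper may differ):

```python
import numpy as np

def robust_scale_smooth_clip(x_train, x):
    """Rescale using robust (quantile-based) statistics, then smoothly clip to (-3, 3)."""
    median = np.median(x_train, axis=0)
    q25, q75 = np.percentile(x_train, [25, 75], axis=0)
    scale = np.where(q75 > q25, q75 - q25, 1.0)   # avoid division by zero
    z = (x - median) / scale
    return z / np.sqrt(1.0 + (z / 3.0) ** 2)      # smooth clipping: bounded by (-3, 3)
```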