dholzmueller.bsky.social
Postdoc in machine learning with Francis Bach & @GaelVaroquaux: neural networks, tabular data, uncertainty, active learning, atomistic ML, learning theory. https://dholzmueller.github.io
157 posts 722 followers 140 following
comment in response to post
What about work on adaptive learning rates (in the sense of convergence rates, not step sizes) that studies methods with hyperparameter optimization on a holdout set to achieve optimal/good convergence rates simultaneously for different classes of functions? E.g. projecteuclid.org/journals/ann...
comment in response to post
By the way, I think an intercept in this case is necessary because the logistic regression model does not have an intercept. For more realistic models that can learn an intercept themselves, I think an intercept for TS is probably not very important.
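For illustration, here is a minimal sketch (not code from the thread; the function name is made up) of binary temperature scaling with an added intercept, fit by maximum likelihood on held-out logits:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_temp_scaling_with_intercept(logits, labels):
    """Fit p = sigmoid(logits / T + b) by maximum likelihood on a held-out calibration set."""
    def nll(params):
        inv_t, b = params  # parametrize by 1/T for numerical stability
        p = np.clip(expit(inv_t * logits + b), 1e-12, 1 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    inv_t, b = minimize(nll, x0=[1.0, 0.0], method="Nelder-Mead").x
    return 1.0 / inv_t, b

# toy usage on synthetic calibration data
rng = np.random.default_rng(0)
logits = rng.normal(size=1000)
labels = (rng.random(1000) < expit(2.0 * logits - 0.5)).astype(float)
T, b = fit_temp_scaling_with_intercept(logits, labels)
calibrated_probs = expit(logits / T + b)
```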
comment in response to post
Finally, if you just want to have the best performance for a given (large) time budget, AutoGluon combines many tabular models. It does not include some of the latest models (yet), but has a very good CatBoost, for example, and will likely outperform individual models.
comment in response to post
The library offers the same for XGBoost and LightGBM. Plus, it includes some of the best tabular DL models like RealTabR, TabR, RealMLP, and TabM that could also be interesting to try. (ModernNCA is also very good but not included.)
comment in response to post
Using my library github.com/dholzmueller... you could, for example, use CatBoost_TD_Regressor(n_cv=5), which will use better default parameters for regression, train five models in a cross-validation setup, select the best iteration for each, and ensemble them.
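Roughly how that could look in code, assuming pytabkit's scikit-learn-style interface (the exact import path may differ):

```python
from pytabkit import CatBoost_TD_Regressor            # pip install pytabkit
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_cv=5: five CatBoost models with tuned regression defaults, each early-stopped
# on its own cross-validation fold and ensembled at prediction time
model = CatBoost_TD_Regressor(n_cv=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```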
comment in response to post
Interesting! Would be cool to have these datasets on OpenML as well so they are easy to use in tabular benchmarks. Here are some more recommendations for stronger tabular baselines: 1. For CatBoost and XGBoost, you'd want at least early stopping to select the best iteration.
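As a hedged example of what that means in practice, using XGBoost's scikit-learn interface (recent versions; older ones take early_stopping_rounds in fit() instead):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# train with many rounds and let early stopping pick the best iteration on a validation set
clf = XGBClassifier(n_estimators=1000, learning_rate=0.05,
                    early_stopping_rounds=50, eval_metric="logloss")
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(clf.best_iteration)   # the selected boosting round
```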
comment in response to post
github.com/EFS-OpenSour... has some calibration methods like this implemented, but their temperature scaling MLE version has a bug where it doesn't optimize, so I didn't include it in our benchmark.
comment in response to post
I think Dirichlet scaling (or the binary version Beta scaling) also includes an intercept but I'm not sure. In my experience it's very slow, though, and not better than temperature scaling at least on smaller datasets (~1K-10K calibration samples).
comment in response to post
There is an adapter for Dirichlet scaling, which is basically regularized matrix scaling. (Except that matrix scaling can exploit shifts in the logits, which a true post-hoc calibration method like Dirichlet scaling can't IIUC).
comment in response to post
In case anyone is wondering about the name RealMLP, it is motivated by the “Real MVP” meme (which probably also inspired the RealNVP method). 6/6
comment in response to post
The benchmark: arxiv.org/abs/2407.00956 RealMLP: github.com/dholzmueller... 5/ bsky.app/profile/dhol...
comment in response to post
It is surprising how many DL methods perform worse than the simple MLP baseline by Gorishniy, @puhsu.bsky.social et al. This highlights the benchmarking problems in the field (and potentially the difficulty in using many of these models correctly). The situation is slowly improving. 4/
comment in response to post
When including more baselines, RealMLP’s average rank slightly improves to make it the top-performing method overall, with a fifth place on binary classification, first place on multi-class, and second place on regression. 3/
comment in response to post
Some caveats: All DL models are trained with a batch size of 1024, while we recommend using 256 for RealMLP on medium-sized datasets. Other choices (selection of datasets, not using bagging, choice of metrics, search spaces for baselines) can of course also influence results. 2/
comment in response to post
by "classical" I mean deep learning models, just for supervised learning
comment in response to post
Is "classical" supervised tabular learning also part of the workshop?
comment in response to post
We also have results for LightGBM with our tuned default hyperparameters (LGBM-TD), but they are somewhat similar and the behavior might depend on the “subsample” hyperparameter (which is related to bagging). 4/
comment in response to post
It is reassuring that the best (average or individual) stopping epoch from bagging works well for RealMLP in the refitting setting, where no validation set is available. It would be interesting to see if this holds up in the non-iid setting with time-based splits. 3/
comment in response to post
The result? Refitting is a bit better, but only if you fit an ensemble during refitting. But: it’s slower, you don’t get validation scores for the refitted models, the result might change with more folds, and tuning the hyperparameters on the CV scores may favor bagging. 2/
comment in response to post
No problem, and thanks for thinking of me 🙂
comment in response to post
Not really...
comment in response to post
Yes
comment in response to post
A reason for the different sensitivities may also be that val metrics that are more similar to the train loss are more likely to decrease monotonically, and therefore have less risk of stopping too early. For regression with MSE we found little sensitivity to the patience. 3/
comment in response to post
For early stopping on boosted trees, using accuracy as the val metric requires high patience. Brier loss yields similar test accuracy for high patience but is less sensitive to patience. Cross-entropy (the train metric) is even less sensitive but not as good for test accuracy. 2/
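A sketch of this setup with LightGBM's scikit-learn interface; here "patience" is the stopping_rounds argument, accuracy corresponds to the built-in binary_error metric, and a Brier-score metric would need a custom eval function:

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    eval_metric="binary_error",                           # accuracy-based val metric ...
    callbacks=[lgb.early_stopping(stopping_rounds=300)],  # ... needs a high patience
)
print(clf.best_iteration_)
```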
comment in response to post
I did a small test with TabM-mini and 5-fold bagging, only default parameters with numerical embeddings. It seems that it's roughly comparable with RealMLP. But then maybe RealMLP can benefit more from additional ensembling or the two could be combined. A fair comparison with ensembling is hard.
comment in response to post
I think I would fit here 🙂
comment in response to post
This paper was a huge effort and we have many more results, insights, and plots that didn’t make it into this thread. I’ll post some of them soon, so please consider following me if you’re interested! 15/
comment in response to post
Finally, there are some limitations, partially due to the cost of running all of the benchmarks. 14/
comment in response to post
For training, we use AdamW with a multi-cycle learning rate schedule. Since it makes early stopping more difficult, we always train for the full 256 epochs and revert to the best epoch afterwards. Unfortunately, this makes RealMLP quite a bit slower on average. 13/
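A rough PyTorch sketch of this training scheme, with cosine annealing with warm restarts as a stand-in for the paper's actual multi-cycle schedule, and toy data/model only to keep it runnable:

```python
import copy
import torch

# toy data and model, purely to make the sketch self-contained
X, y = torch.randn(512, 16), torch.randn(512, 1)
X_val, y_val = torch.randn(128, 16), torch.randn(128, 1)
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.SELU(), torch.nn.Linear(64, 1))

opt = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-2)
# stand-in for the paper's multi-cycle schedule
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=64)

best_val, best_state = float("inf"), None
for epoch in range(256):                      # always train for the full 256 epochs
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
    sched.step()
    with torch.no_grad():
        val = torch.nn.functional.mse_loss(model(X_val), y_val).item()
    if val < best_val:                        # track the best epoch on validation data
        best_val, best_state = val, copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)             # revert to the best epoch afterwards
```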
comment in response to post
For classification, using label smoothing in the cross-entropy loss improves the results for classification error, but hurts other metrics like AUROC (see below) or cross-entropy itself. This discrepancy is inconvenient, and I hope it can be resolved in future research. 12/
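(For reference, label smoothing is directly supported by PyTorch's cross-entropy loss via the label_smoothing argument; the specific value is a tuning choice, not a recommendation from the paper.)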
comment in response to post
To encourage feature selection, we introduce a diagonal weight layer, which we call the scaling layer, after the embedding layer. Luckily, we found that it is much more effective with a much larger layer-wise learning rate (96x for RealTabR-D). 11/
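A minimal sketch of the idea (not the paper's implementation): a diagonal scaling layer whose parameters get a much larger learning rate via optimizer parameter groups:

```python
import torch

class ScalingLayer(torch.nn.Module):
    """Diagonal weight layer: one learnable scale per feature/embedding dimension."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * self.scale

model = torch.nn.Sequential(ScalingLayer(16), torch.nn.Linear(16, 64),
                            torch.nn.SELU(), torch.nn.Linear(64, 1))

base_lr = 1e-3
# give the scaling layer a much larger layer-wise learning rate via parameter groups
opt = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": 96 * base_lr},  # factor mentioned in the post
    {"params": model[1:].parameters(), "lr": base_lr},
])
```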
comment in response to post
Architecturally, we modify numerical embedding layers (arxiv.org/abs/2203.05556) by introducing first-layer biases and a DenseNet-style skip connection, which yields good results even at (CPU-friendly) small embedding sizes. 10/
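A loose sketch of the general idea: per-feature embeddings with first-layer biases and a skip connection that concatenates the raw value. The actual RealMLP embedding variant differs in its details:

```python
import torch

class NumEmbedding(torch.nn.Module):
    """Loose sketch: per-feature periodic embeddings with first-layer biases and a
    DenseNet-style skip that concatenates the raw feature value."""
    def __init__(self, n_features: int, emb_dim: int = 4):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(n_features, emb_dim))
        self.bias = torch.nn.Parameter(torch.rand(n_features, emb_dim))  # first-layer biases

    def forward(self, x):                       # x: (batch, n_features)
        h = torch.cos(2 * torch.pi * (x[..., None] * self.weight + self.bias))
        return torch.cat([x[..., None], h], dim=-1)  # skip: keep the raw value

emb = NumEmbedding(n_features=8, emb_dim=4)
out = emb(torch.randn(32, 8))                   # shape: (32, 8, 5)
```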
comment in response to post
We introduce robust scaling + smooth clipping (RS+SC), an outlier-robust preprocessing method combining quantile-based rescaling and soft clipping to (-3, 3). It is more robust than a StandardScaler but preserves more distributional information than a QuantileTransformer. 9/
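A minimal NumPy sketch of the idea behind RS+SC (the exact quantiles and clipping formula in the paper may differ):

```python
import numpy as np

def robust_scale_smooth_clip(x_train, x):
    """Rescale using robust (quantile-based) statistics, then smoothly clip to (-3, 3)."""
    median = np.median(x_train, axis=0)
    q25, q75 = np.percentile(x_train, [25, 75], axis=0)
    scale = np.where(q75 > q25, q75 - q25, 1.0)   # avoid division by zero
    z = (x - median) / scale
    return z / np.sqrt(1.0 + (z / 3.0) ** 2)      # smooth clipping: bounded by (-3, 3)
```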