Architecturally, we modify numerical embedding layers (https://arxiv.org/abs/2203.05556) by introducing first-layer biases and a DenseNet-style skip connection, which yields good results even at (CPU-friendly) small embedding sizes. 10/
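For illustration, here is a minimal PyTorch sketch of what such an embedding layer could look like: per-feature periodic features with first-layer biases, a small per-feature linear layer, and a DenseNet-style skip that concatenates the raw feature value. The dimensions, initialization, and exact form of the periodic part are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn


class NumericalEmbedding(nn.Module):
    """Sketch of a numerical embedding with first-layer biases and a
    DenseNet-style skip connection (concatenating the raw feature value).
    Hyperparameters and initialization are illustrative only."""

    def __init__(self, num_features: int, emb_dim: int = 8, n_freq: int = 4):
        super().__init__()
        # per-feature frequencies and first-layer biases for the periodic part
        self.freq = nn.Parameter(torch.randn(num_features, n_freq))
        self.bias = nn.Parameter(torch.zeros(num_features, n_freq))
        # per-feature linear layer mapping periodic features to emb_dim - 1 dims
        self.weight = nn.Parameter(0.1 * torch.randn(num_features, 2 * n_freq, emb_dim - 1))
        self.out_bias = nn.Parameter(torch.zeros(num_features, emb_dim - 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_features)
        angles = 2 * torch.pi * self.freq * x[..., None] + self.bias  # first-layer bias
        periodic = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
        emb = torch.einsum("bfk,fkd->bfd", periodic, self.weight) + self.out_bias
        # DenseNet-style skip: concatenate the raw feature value to its embedding
        return torch.cat([emb, x[..., None]], dim=-1)  # (batch, num_features, emb_dim)
```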
To encourage feature selection, we introduce a diagonal weight layer after the embedding layer, which we call the scaling layer. We found that it works much better when given a much larger layer-wise learning rate (96x for RealTabR-D). 11/
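A minimal sketch of what a diagonal scaling layer with its own, larger learning rate could look like in PyTorch. The toy model, `base_lr`, and the use of parameter groups are illustrative assumptions; only the 96x factor is taken from the post above.

```python
import torch
import torch.nn as nn


class ScalingLayer(nn.Module):
    """Diagonal weight layer: each input dimension is multiplied by a learnable
    scale, softly encouraging feature selection. Illustrative sketch."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale


# Give the scaling layer its own (much larger) learning rate via parameter groups.
model = nn.Sequential(ScalingLayer(64), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
base_lr = 1e-3  # placeholder value
optimizer = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": 96 * base_lr},
    {"params": model[1:].parameters(), "lr": base_lr},
])
```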
For classification, using label smoothing in the cross-entropy loss improves classification error, but hurts other metrics such as AUROC (see below) and cross-entropy itself. This discrepancy is inconvenient, and I hope it can be resolved in future research. 12/
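In PyTorch, label smoothing can be switched on directly in the cross-entropy loss; the 0.1 value below is a placeholder, not necessarily the setting used in the paper.

```python
import torch
import torch.nn as nn

# Label-smoothed cross-entropy via PyTorch's built-in option.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(16, 3)            # (batch, num_classes)
targets = torch.randint(0, 3, (16,))   # hard class labels
loss = criterion(logits, targets)
```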
For training, we use AdamW with a multi-cycle learning rate schedule. Since it makes early stopping more difficult, we always train for the full 256 epochs and revert to the best epoch afterwards. Unfortunately, this makes RealMLP quite a bit slower on average. 13/
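A rough sketch of this training setup, assuming hypothetical `model`, `train_one_epoch`, and `validation_score` helpers; the warm-restarts scheduler and all hyperparameters below are stand-ins for some multi-cycle schedule, not the exact one from the paper.

```python
import copy
import torch

# Multi-cycle LR schedule, no early stopping: always train the full 256 epochs,
# track the best epoch, and restore its weights at the end.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=64)

best_score, best_state = float("inf"), copy.deepcopy(model.state_dict())
for epoch in range(256):
    train_one_epoch(model, optimizer)
    scheduler.step()
    score = validation_score(model)  # lower is better, e.g. validation loss
    if score < best_score:
        best_score = score
        best_state = copy.deepcopy(model.state_dict())

# revert to the best epoch after training has finished
model.load_state_dict(best_state)
```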
This paper was a huge effort and we have many more results, insights, and plots that didn’t make it into this thread. I’ll post some of them soon, so please consider following me if you’re interested! 15/