This is very similar to the work I have been doing. Super cool to see that I am not the only one still believing in BERT. There is so much left to gain here. I will be interested to look into ablations around their local vs. global attention tradeoff and see their training data.
Reposted from Jeremy Howard
I'll get straight to the point.

We trained 2 new models. Like BERT, but modern. ModernBERT.

Not some hypey GenAI thing, but a proper workhorse model, for retrieval, classification, etc. Real practical stuff.

It's much faster, more accurate, longer context, and more useful. 🧵
