Efficiency is not only about speed. ModernBERT is also memory-friendly and can handle larger batch sizes than previous encoders, which comes in handy for contrastive learning or for running on smaller GPUs (an important use case for encoders)
https://huggingface.co/answerdotai/ModernBERT-base
https://huggingface.co/answerdotai/ModernBERT-large
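If you just want to poke at a checkpoint, here is a minimal masked-LM sketch (assuming a transformers release recent enough to ship ModernBERT support; the example sentence is just a placeholder):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill a masked token, just to check the checkpoint is wired up correctly
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the most likely token at the masked position
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```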
If you want all the details, please have a look at the nicely written blog post and the very detailed paper
I'll go on with some less general, more personal notes
We ran a lot of experiments on ColBERT models for the paper, with tons of different base models
PyLate handled it all, even models using half-baked remote code
This was a really cool stress test and I am really happy it went so smoothly
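For a sense of what that workflow looks like, here is a rough sketch of wrapping one of the checkpoints with PyLate (the query and document strings are made-up placeholders):

```python
from pylate import models

# Wrap a ModernBERT checkpoint as a ColBERT-style late-interaction model
model = models.ColBERT(model_name_or_path="answerdotai/ModernBERT-base")

# Queries and documents are encoded separately; is_query switches the query-side processing
query_embeddings = model.encode(["what is late interaction retrieval?"], is_query=True)
document_embeddings = model.encode(
    ["ColBERT keeps one vector per token and scores with a MaxSim over the query tokens."],
    is_query=False,
)
```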
The ModernBERT-base checkpoint achieves a BEIR average of 51.3
This means we beat e5 with less than 45 minutes of training on MS MARCO only (while using only half of the memory of our 8x100)
But I am very grateful to everyone involved in this project: I truly learned a lot, and I am so proud of the models we managed to build together
Besides the models, I think we showed that there is still a lot to be done, and I hope we succeed in reigniting interest in encoder pre-training 🔥