Check out the full blogpost if you'd like to 1) use these lightning-fast models or 2) learn how to train them with consumer-level hardware: https://hf.co/blog/static-embeddings
Or read more in this thread first 🧵
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (classification, clustering, etc.), both Apache 2.0. Fully integrated in Sentence Transformers, etc.
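For example, a minimal usage sketch with Sentence Transformers (the model IDs below are the ones from the blog post; everything else is illustrative):

```python
from sentence_transformers import SentenceTransformer

# English retrieval model; swap in static-similarity-mrl-multilingual-v1 for the multilingual one
model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")

queries = ["What are static embedding models?"]
documents = [
    "Static embedding models map each token to a fixed vector and pool them.",
    "Transformer models recompute token representations for every input.",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

# Cosine similarity between each query and each document
print(model.similarity(query_embeddings, document_embeddings))
```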
🧵
📜 my training scripts, using the Sentence Transformers library
📊 my Weights & Biases reports with losses & metrics
📚 my list of 30 training and 13 evaluation datasets
🧵
I apply this simple architecture, but train it like a modern embedding model: Contrastive Learning with Matryoshka support.
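Roughly, that recipe looks like this in Sentence Transformers (a sketch, not the exact training config; the tokenizer choice and dimensions are placeholders):

```python
from tokenizers import Tokenizer
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# The "simple architecture": a static token-embedding table with mean pooling, no transformer
tokenizer = Tokenizer.from_pretrained("google-bert/bert-base-uncased")
static = StaticEmbedding(tokenizer, embedding_dim=1024)
model = SentenceTransformer(modules=[static])

# Contrastive learning with in-batch negatives, wrapped in MatryoshkaLoss so the
# embeddings stay useful when truncated to smaller dimensions
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[1024, 512, 256, 128, 64, 32])

# `loss` is then passed to a SentenceTransformerTrainer together with (anchor, positive) pairs
```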
🧵
🏎️ Extremely fast, e.g. 107,500 sentences per second on a consumer CPU, compared to 270 for all-mpnet-base-v2 and 56 for gte-large-en-v1.5
📏 No maximum sequence length! Embed texts at any length (at your own risk)
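You can check both claims yourself with something like this (throughput will depend on your CPU and batch size; the model ID is assumed from the blog post):

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")

# Throughput: encode many short sentences and measure sentences per second
sentences = ["Static embedding models are extremely fast on CPU."] * 10_000
start = time.perf_counter()
model.encode(sentences, batch_size=2048)
print(f"{len(sentences) / (time.perf_counter() - start):.0f} sentences/sec")

# No transformer and no positional embeddings, so there is no hard sequence limit
very_long_text = "word " * 100_000
print(model.encode(very_long_text).shape)  # (1024,)
```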
🧵
🪆 Matryoshka support: allows you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% performance decrease for English Similarity tasks)
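A sketch of what that looks like in practice, using the `truncate_dim` argument to cut the embeddings from the full 1024 dimensions down to 256, i.e. 4x smaller (model ID assumed from the blog post):

```python
from sentence_transformers import SentenceTransformer

model_id = "sentence-transformers/static-retrieval-mrl-en-v1"  # assumed model ID
full_model = SentenceTransformer(model_id)
small_model = SentenceTransformer(model_id, truncate_dim=256)

text = "Matryoshka embeddings nest smaller embeddings inside larger ones."
print(full_model.encode(text).shape)   # (1024,)
print(small_model.encode(text).shape)  # (256,)
```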
🧵
- @langchain.bsky.social