vision+language people: Does anyone have a good sense of why most recent SOTA VLMs now use a simple MLP as the mapping network between vision and LLM embeddings? Why does this work better? Is learning more efficient?
Over time people slowly dropped the more elaborate Q-Former/Perceiver architectures.
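For concreteness, the mapping network in question is usually just something like the sketch below (a LLaVA-style 2-layer MLP projector; the dimensions and names here are placeholders, not tied to any particular released model):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps vision-encoder tokens into the LLM embedding space.

    Assumed dims: 1024-d ViT features -> 4096-d LLM hidden size
    (illustrative placeholders only).
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim)
        # output:        (batch, num_patches, llm_dim) -- one LLM "soft token" per patch
        return self.proj(vision_tokens)


# Example: 576 patch tokens from a ViT, projected into the LLM's input space.
tokens = torch.randn(2, 576, 1024)
soft_prompt = MLPProjector()(tokens)
print(soft_prompt.shape)  # torch.Size([2, 576, 4096])
```

Note it keeps one LLM token per image patch and adds no attention or token compression of its own.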
Comments
So in that case the vision encoder itself probably learns to adapt its final tokens to the LLM embedding space, doing more of that work than a 3-layer MLP could.
https://arxiv.org/abs/2405.16700
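For contrast with the MLP above, here is a rough sketch of the kind of Q-Former/Perceiver-style connector that got dropped: a small set of learned query tokens cross-attends to the vision tokens and becomes the LLM's visual prefix. All names and sizes are illustrative; this is not the architecture from the linked paper.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Illustrative Q-Former/Perceiver-style connector: N learned queries
    cross-attend to the vision tokens and become the LLM's visual prefix."""
    def __init__(self, vision_dim=1024, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(vision_dim),
            nn.Linear(vision_dim, 4 * vision_dim),
            nn.GELU(),
            nn.Linear(4 * vision_dim, vision_dim),
        )
        self.out_proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens):
        # vision_tokens: (batch, num_patches, vision_dim)
        q = self.queries.unsqueeze(0).expand(vision_tokens.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, vision_tokens, vision_tokens)
        x = q + attn_out          # residual around cross-attention
        x = x + self.ffn(x)       # residual around feed-forward
        return self.out_proj(x)   # (batch, num_queries, llm_dim)


# Example: 576 patch tokens compressed to 64 query tokens for the LLM.
out = PerceiverResampler()(torch.randn(2, 576, 1024))
print(out.shape)  # torch.Size([2, 64, 4096])
```

The design contrast is the point: the MLP passes every patch token through unchanged in count and leaves any adaptation to the vision encoder (and the LLM), while the resampler inserts its own trainable attention and token compression between the two.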