vision+language people: Does anyone have a good sense of why most recent SOTA VLMs now use a simple MLP as the mapping network between vision and LLM embeddings? Why does this work better? Is learning more efficient?
Over time people slowly dropped the more elaborate Q-Former/Perceiver architectures.
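For concreteness, the mapping network in question is usually just something like the sketch below (a LLaVA-style 2-layer MLP projector; the dimensions and names here are placeholders, not tied to any particular released model):

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps vision-encoder tokens into the LLM embedding space.

    Assumed dims: 1024-d ViT features -> 4096-d LLM hidden size
    (illustrative placeholders only).
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim)
        # output:        (batch, num_patches, llm_dim) -- one LLM "soft token" per patch
        return self.proj(vision_tokens)


# Example: 576 patch tokens from a ViT, projected into the LLM's input space.
tokens = torch.randn(2, 576, 1024)
soft_prompt = MLPProjector()(tokens)
print(soft_prompt.shape)  # torch.Size([2, 576, 4096])
```

Note it keeps one LLM token per image patch and adds no attention or token compression of its own.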
Comments
So in that case the vision encoder itself probably learns to adapt its final tokens to the LLM embedding space, doing more of that work than a 3-layer MLP could.
https://arxiv.org/abs/2405.16700
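For contrast with the MLP above, here is a rough sketch of the kind of Q-Former/Perceiver-style connector that got dropped: a small set of learned query tokens cross-attends to the vision tokens and becomes the LLM's visual prefix. All names and sizes are illustrative; this is not the architecture from the linked paper.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Illustrative Q-Former/Perceiver-style connector: N learned queries
    cross-attend to the vision tokens and become the LLM's visual prefix."""
    def __init__(self, vision_dim=1024, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(vision_dim),
            nn.Linear(vision_dim, 4 * vision_dim),
            nn.GELU(),
            nn.Linear(4 * vision_dim, vision_dim),
        )
        self.out_proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens):
        # vision_tokens: (batch, num_patches, vision_dim)
        q = self.queries.unsqueeze(0).expand(vision_tokens.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, vision_tokens, vision_tokens)
        x = q + attn_out          # residual around cross-attention
        x = x + self.ffn(x)       # residual around feed-forward
        return self.out_proj(x)   # (batch, num_queries, llm_dim)


# Example: 576 patch tokens compressed to 64 query tokens for the LLM.
out = PerceiverResampler()(torch.randn(2, 576, 1024))
print(out.shape)  # torch.Size([2, 64, 4096])
```

The design contrast is the point: the MLP passes every patch token through unchanged in count and leaves any adaptation to the vision encoder (and the LLM), while the resampler inserts its own trainable attention and token compression between the two.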