vision+language people: Does anyone have a good sense of why most recent SOTA VLMs now use a simple MLP as the mapping network between vision and LLM embeddings? Why does this work better? Is it just more training-efficient?

Over time, people have gradually dropped the more elaborate Q-Former/Perceiver-style architectures.
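
For concreteness, here is a minimal sketch of the two connector styles being discussed (dimensions, names, and hyperparameters are illustrative placeholders, not taken from any specific model): the MLP projector keeps every visual token and only changes its width, while a Q-Former/Perceiver-style resampler compresses the patch tokens into a fixed number of learned query tokens via cross-attention.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """LLaVA-1.5-style connector: a small per-token MLP from vision dim to LLM dim."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):       # (B, N_patches, vision_dim)
        return self.proj(vision_tokens)     # (B, N_patches, llm_dim) -- token count unchanged


class PerceiverLikeResampler(nn.Module):
    """Q-Former/Perceiver-style connector (simplified): a fixed set of learned
    queries cross-attends to the visual tokens, compressing them to n_query tokens."""

    def __init__(self, vision_dim=1024, llm_dim=4096, n_query=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_query, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(vision_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, vision_tokens):        # (B, N_patches, vision_dim)
        kv = self.kv_proj(vision_tokens)
        q = self.queries.unsqueeze(0).expand(vision_tokens.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)        # (B, n_query, llm_dim) -- token count compressed
        return out
```

One commonly cited trade-off: the MLP preserves all visual tokens (so no information bottleneck, at the cost of a longer LLM context), whereas the resampler adds extra parameters and a learned compression step that the LLM has to adapt to.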

Comments