good questions! from what I see, some folks still use complex mappers like Perceivers, but often a simple MLP works well enough. the variable that induces the biggest improvement is almost always the alignment data.
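for context, the "simple MLP" mapper here is just a small projection from the vision encoder's feature dimension into the LLM's hidden dimension. a minimal numpy sketch (all dimensions and the ReLU choice are illustrative, not from any specific model):

```python
import numpy as np

def mlp_mapper(vision_feats, w1, b1, w2, b2):
    # vision_feats: (num_visual_tokens, d_vision)
    h = np.maximum(vision_feats @ w1 + b1, 0.0)  # nonlinearity (ReLU here for simplicity)
    return h @ w2 + b2                            # (num_visual_tokens, d_llm)

rng = np.random.default_rng(0)
d_vision, d_llm = 1024, 4096                      # illustrative sizes
tokens = rng.standard_normal((576, d_vision))     # e.g. a 24x24 patch grid
w1 = rng.standard_normal((d_vision, d_llm)) * 0.02
b1 = np.zeros(d_llm)
w2 = rng.standard_normal((d_llm, d_llm)) * 0.02
b2 = np.zeros(d_llm)

out = mlp_mapper(tokens, w1, b1, w2, b2)
print(out.shape)  # (576, 4096)
```

the projected tokens are then concatenated with the text embeddings and fed to the LLM; everything else is standard decoding.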
one hypothesis for why simple mappers work: 1. unfreezing the LLM provides enough parameters for the mapping, 2. richer vision representations are closer to the LLM's internal latent space https://arxiv.org/abs/2405.07987
another factor that makes simple MLPs work is visual token length. if you care about shorter token sequences, you need a better mapper. these days most LLMs are capable of long context, which reduces the need to compress visual tokens.
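to make the compression point concrete: the cheapest form of visual-token compression is just pooling adjacent tokens before projection, trading sequence length for per-token detail. a hedged sketch (the pooling factor and shapes are made up for illustration):

```python
import numpy as np

def pool_visual_tokens(tokens, factor):
    # tokens: (n, d); merge every `factor` consecutive tokens by averaging,
    # shortening the sequence the LLM has to attend over by that factor.
    n, d = tokens.shape
    n_keep = (n // factor) * factor       # drop any ragged tail
    return tokens[:n_keep].reshape(-1, factor, d).mean(axis=1)

tokens = np.random.default_rng(1).standard_normal((576, 1024))
short = pool_visual_tokens(tokens, 4)
print(short.shape)  # (144, 1024)
```

fancier mappers (Perceiver-style resamplers) do the same length reduction but with learned queries instead of fixed averaging, which is why they matter more when the token budget is tight.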
Those all make sense! And that's what I'm gathering so far as well. Longer videos might be the only case where smarter compression will be needed for a while