One hypothesis for why simple mappers work: 1. unfreezing the LLM provides enough parameters for the mapping, 2. richer vision representations are already closer to the LLM's internal latent space https://arxiv.org/abs/2405.07987
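For concreteness, a minimal sketch of what such a simple mapper looks like — a LLaVA-style two-layer MLP projecting vision-encoder patch features into the LLM's embedding space. The dimensions and patch count here are illustrative assumptions, not taken from any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: ViT feature dim and LLM hidden dim (illustrative assumptions)
VIT_DIM, LLM_DIM = 1024, 4096

# Two-layer MLP mapper: Linear -> GELU -> Linear
W1 = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.02
b1 = np.zeros(LLM_DIM)
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(vision_tokens):
    # (num_patches, VIT_DIM) -> (num_patches, LLM_DIM): one LLM token per patch
    return gelu(vision_tokens @ W1 + b1) @ W2 + b2

patches = rng.standard_normal((576, VIT_DIM))  # e.g. a 24x24 patch grid
llm_tokens = project(patches)
print(llm_tokens.shape)  # (576, 4096)
```

Note the mapper does nothing to shorten the sequence: every vision patch becomes one LLM token, which is fine as long as the LLM's context window can absorb it.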
Comments
Another factor that makes simple MLPs work is visual token length. If you care about shorter token sequences, you need a better mapper. These days most LLMs are capable of long context, which reduces the need to compress visual tokens.
Those all make sense! And that's what I'm gathering so far as well. Longer videos might be the only case where smarter compression will be needed for a while