The return of the Autoregressive Image Model: AIMv2 now going multimodal.
Excellent work by @alaaelnouby.bsky.social & team with code and checkpoints already up:
https://arxiv.org/abs/2411.14402
Comments
AIMv2 mimics such a pipeline: a vision encoder extracts patch features, these are concatenated with text tokens, and a decoder then predicts the next vision patches & text tokens
-simple & dense supervision
-alignment w/ downstream multi-modal tasks
-no need for negatives
-scaling in data, parameters, resolution
Exciting work!
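
To make the pipeline described above concrete, here is a minimal toy sketch in plain Python. It is not the AIMv2 code: the function names (`build_sequence`, `next_element_targets`, `causal_mask`) and the tiny 2-d "patch features" are hypothetical, and it assumes a plain causal decoder over the concatenated sequence. It only shows how the sequence and its next-element targets could be laid out, not the model itself.

```python
from typing import List, Tuple, Union

Patch = Tuple[float, ...]                 # stand-in for one patch feature vector
Element = Tuple[str, Union[Patch, int]]   # ("patch", features) or ("text", token_id)


def build_sequence(patch_features: List[Patch], text_tokens: List[int]) -> List[Element]:
    """Concatenate patch features with text tokens into one multimodal sequence."""
    return [("patch", p) for p in patch_features] + [("text", t) for t in text_tokens]


def next_element_targets(sequence: List[Element]) -> List[Tuple[int, Element]]:
    """Pair each position i with the element at i + 1 that the decoder should predict."""
    return [(i, sequence[i + 1]) for i in range(len(sequence) - 1)]


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy inputs (made up for illustration): three patch features, two text token ids.
patches = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
tokens = [7, 8]

seq = build_sequence(patches, tokens)
targets = next_element_targets(seq)
mask = causal_mask(len(seq))

for position, (kind, value) in targets:
    print(f"position {position} -> predict next {kind}: {value}")
```

Every position except the last gets its own target, which is one way to read "dense supervision", and the targets come from the sequence itself, so no negative pairs are needed. In a real model, patch targets would presumably be scored with a regression-style loss and text targets with cross-entropy; that part is left out here.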