The multimodality space is now evolving in a much better way. The focus has shifted to finding the bottlenecks and fixing things on the fundamental level. This paper from **Apple** introduces **AIMv2**, and effort is in a similar direction, except that they only do it for the autoregressive models.
Post image

Comments