📄👀: cross-modal information flow in multimodal large language models

neat interpretability work on how visual and linguistic information is integrated in MLLMs: "the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens."
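a common way to probe this kind of flow (and, as far as I can tell, the kind of intervention this line of work relies on) is attention knockout: block attention from the question-token positions to the image-token positions and see how much the downstream representations change. here is a minimal toy sketch of that idea; the shapes, position ranges, and single random attention layer are all illustrative assumptions, not the paper's actual code or model.

```python
# Minimal attention-knockout sketch (toy, illustrative only).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_model = 16
n_tokens = 11
# Assumed toy layout: image tokens first, then question tokens, then the final position.
positions = {"image": list(range(0, 6)), "question": list(range(6, 10)), "last": [10]}

# Random hidden states and projections; a real analysis would use the MLLM's own
# weights and activations at a chosen layer.
x = torch.randn(n_tokens, d_model)
W_q = torch.randn(d_model, d_model) / d_model**0.5
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5

def attention(x, block_from=None, block_to=None):
    """Single-head causal self-attention; optionally knock out attention
    from `block_from` query positions to `block_to` key positions."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / d_model**0.5
    causal = torch.triu(torch.ones(n_tokens, n_tokens, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    if block_from is not None and block_to is not None:
        # The knockout: sever the pathway by setting these scores to -inf pre-softmax.
        scores[torch.tensor(block_from).unsqueeze(1), torch.tensor(block_to)] = float("-inf")
    return F.softmax(scores, dim=-1) @ v

clean = attention(x)
# Block question tokens from reading image tokens: if the image -> question pathway
# carries information, the question-position representations should change.
knocked = attention(x, block_from=positions["question"], block_to=positions["image"])

drift = (clean - knocked).norm(dim=-1)
for name, idx in positions.items():
    print(f"{name:>8}: mean change after knockout = {drift[idx].mean():.3f}")
```

note that in this single toy layer only the blocked question positions move; in a real multi-layer model the perturbation propagates to the final answer position across layers, and that downstream effect (e.g. the drop in the correct answer's probability) is the quantity one would actually track.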
