📄👀: cross-modal information flow in multimodal large language models
neat interpretability work on how visual and linguistic information is integrated in MLLMs: "the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens."
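To make the claim concrete, here is a minimal sketch (not the paper's code) of the kind of attention-knockout probe used in this line of interpretability work: block attention from question-token queries to image-token keys over a chosen layer range and measure how the answer degrades. The sequence length, token positions, and layer choice below are illustrative assumptions.

```python
import torch

def knockout_mask(seq_len, image_pos, question_pos):
    """Additive attention mask: -inf where question tokens would attend to image tokens."""
    mask = torch.zeros(seq_len, seq_len)
    for q in question_pos:
        for k in image_pos:
            mask[q, k] = float("-inf")
    return mask

# Example: 6 image tokens followed by 4 question tokens in a 10-token sequence.
mask = knockout_mask(10, image_pos=range(0, 6), question_pos=range(6, 10))
print(mask[7])  # row for one question token: -inf over the image positions
```

Adding this mask to the attention logits in early layers (vs. late layers) and comparing answer probabilities is how one would test whether the image-to-question transfer happens first.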