Very excited to share this work in which we use classic cognitive tasks to understand the limitations of vision language models. It turns out that many of the failures of VLMs can be explained as resulting from the classic 'binding problem' in cognitive science.
Reposted from Declan Campbell:
(1) Vision language models can explain complex charts & decode memes, but struggle with simple tasks young kids find easy - like counting objects or finding items in cluttered scenes! Our #NeurIPS2024 paper shows why: they face the same 'binding problem' that constrains human vision!