taylorwwebb.bsky.social • 133 days ago

Very excited to share this work in which we use classic cognitive tasks to understand the limitations of vision language models. It turns out that many of the failures of VLMs can be explained as resulting from the classic 'binding problem' in cognitive science.

Reposted from Declan Campbell

(1) Vision language models can explain complex charts & decode memes, but struggle with simple tasks young kids find easy - like counting objects or finding items in cluttered scenes! Our 🆒🆕 #NeurIPS2024 paper shows why: they face the same 'binding problem' that constrains human vision! 🧵👇

Comments

willjharrison.bsky.social•133 days ago

Is this different from the model not really encoding space? Most operations will be spatially invariant, which means binding will be problematic, but not THE problem.

taylorwwebb.bsky.social•133 days ago

Can you clarify what you mean about spatial invariance? Spatial judgments definitely seem to be a problem for these models, but this seems to be a separate issue from the binding failures we looked at here (many involving tasks that don’t have a spatial component).

willjharrison.bsky.social•133 days ago

These models excel at spotting a cat no matter where it is in the image. The higher order statistics that make this possible don’t “need” the lower order statistics involved in simple binding of, e.g., shape and colour.

taylorwwebb.bsky.social•133 days ago

Identifying whether a cat is present in an image is an instance of the disjunctive search task (identifying the presence of a single feature) that we show VLMs excel at, and that human observers can do rapidly, even for large numbers of objects.

taylorwwebb.bsky.social•133 days ago

Though we don’t test it, one could envision a conjunctive search task involving conjunctions of real-world object categories (such as cats) and some other feature (such as color), and I would expect VLMs to struggle with this task because of the binding problem.
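For readers unfamiliar with the distinction drawn in the two comments above, here is a minimal sketch of disjunctive versus conjunctive search displays. It is not the paper's stimulus code: the function names, the red-circle target, and the two toy "detectors" are hypothetical choices made only for illustration. The idea is that a detector which pools features across the whole display without binding them to objects answers correctly in the disjunctive case but false-alarms in the conjunctive case.

```python
import random

TARGET = ("red", "circle")  # hypothetical target: a (color, shape) pair


def make_display(kind, n_items, target_present, seed=0):
    """Build one search display as a list of (color, shape) items.

    disjunctive: distractors differ from the target in color only,
                 so a single feature (red) signals the target.
    conjunctive: each distractor shares exactly one feature with the
                 target, so only the bound pair identifies it.
    """
    if kind == "disjunctive":
        pool = [("green", "circle")]
    elif kind == "conjunctive":
        pool = [("green", "circle"), ("red", "square")]
    else:
        raise ValueError(f"unknown search type: {kind}")

    n_distractors = n_items - 1 if target_present else n_items
    items = [pool[i % len(pool)] for i in range(n_distractors)]
    if target_present:
        items.append(TARGET)
    random.Random(seed).shuffle(items)
    return items


def unbound_detector(items):
    """Pools features over the display, ignoring which object has which."""
    colors = {color for color, _ in items}
    shapes = {shape for _, shape in items}
    return TARGET[0] in colors and TARGET[1] in shapes


def bound_detector(items):
    """Checks for an object carrying both target features together."""
    return TARGET in items


if __name__ == "__main__":
    for kind in ("disjunctive", "conjunctive"):
        for present in (True, False):
            display = make_display(kind, 6, present)
            print(kind, "present" if present else "absent",
                  "| unbound:", unbound_detector(display),
                  "| bound:", bound_detector(display))
```

On target-absent conjunctive displays the unbound detector still reports a target, because "red" and "circle" are both present somewhere in the display, just not on the same object. That is the kind of error the binding problem predicts.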

willjharrison.bsky.social•133 days ago

A simpler test would be binding with much larger objects, e.g. a red square and a blue triangle, with the shapes so big that the square takes up the entire left half of the image and the triangle takes up the right half. I suspect binding errors would drop.

neuroai.bsky.social•133 days ago

Such great work!

taylorwwebb.bsky.social•133 days ago

We find that VLMs behave very much like human vision when people are forced to respond quickly, thus relying on feedforward processing alone. This has implications for the source of difficulty in visual reasoning tasks, and suggests the need for object-centric approaches.

neurograce.bsky.social•133 days ago

Hi, can you share a link to the paper? The thread doesn't seem to have one

taylorwwebb.bsky.social•133 days ago

Thanks for pointing this out! Here’s a link: https://arxiv.org/abs/2411.00238

neurograce.bsky.social•133 days ago

Thanks!
