Is this different from the model not really encoding space? Most operations will be spatially invariant, which means binding will be problematic, but not THE problem.
Can you clarify what you mean about spatial invariance? Spatial judgments definitely seem to be a problem for these models, but that looks like a separate issue from the binding failures we examined here (many of which involve tasks with no spatial component).
These models excel at spotting a cat no matter where it is in the image. The higher-order statistics that make this possible don’t “need” the lower-order statistics involved in simple binding of, e.g., shape and colour.
Identifying whether a cat is present in an image is an instance of the disjunctive search task (detecting the presence of a single feature) that we show VLMs excel at, and that human observers can perform rapidly, even for large numbers of objects.
Though we don’t test it, one could envision a conjunctive search task pairing real-world object categories (such as cats) with some other feature (such as color), and I would expect VLMs to struggle with it because of the binding problem. A sketch of such a task follows below.
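For concreteness, here is a minimal sketch of how disjunctive vs. conjunctive search stimuli might be generated, using simple colored shapes rather than real-world categories like cats. It uses Pillow; all sizes, item counts, and feature choices are illustrative assumptions, not the stimuli from the study.

```python
import random
from PIL import Image, ImageDraw

def draw_item(draw, x, y, shape, color, size=30):
    """Draw a single search item (square or triangle) at (x, y)."""
    if shape == "square":
        draw.rectangle([x, y, x + size, y + size], fill=color)
    else:  # triangle
        draw.polygon([(x, y + size), (x + size, y + size),
                      (x + size // 2, y)], fill=color)

def make_search_display(n_items=12, conjunctive=True, img_size=512, seed=0):
    """Generate a visual search display.

    Disjunctive: target is a red square among blue squares
    (a single feature, color, identifies the target).
    Conjunctive: target is a red square among blue squares AND
    red triangles (only the color+shape conjunction identifies it).
    """
    rng = random.Random(seed)
    img = Image.new("RGB", (img_size, img_size), "white")
    draw = ImageDraw.Draw(img)

    # Place items on a jittered grid to avoid overlap.
    cells = [(cx, cy) for cx in range(4) for cy in range(4)]
    rng.shuffle(cells)
    positions = [(cx * 120 + rng.randint(10, 60),
                  cy * 120 + rng.randint(10, 60))
                 for cx, cy in cells[:n_items]]

    # First position holds the target: a red square.
    draw_item(draw, *positions[0], "square", "red")
    for x, y in positions[1:]:
        if conjunctive:
            # Each distractor shares exactly one feature with the target.
            shape, color = rng.choice([("square", "blue"), ("triangle", "red")])
        else:
            shape, color = "square", "blue"  # differs from target in color only
        draw_item(draw, x, y, shape, color)
    return img

make_search_display(conjunctive=True).save("conjunctive_search.png")
```

Querying a VLM on matched `conjunctive=True` and `conjunctive=False` displays would then probe whether accuracy drops specifically when the target is defined by a feature conjunction rather than a single feature.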
Simpler would be to test for binding with much larger objects: e.g., a red square and a blue triangle, where the shapes are so big that the square takes up the entire left half of the image and the triangle the right half. I suspect binding errors would drop.
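A quick sketch of that proposed stimulus, again using Pillow; the exact geometry is an assumption about what is being described:

```python
from PIL import Image, ImageDraw

def make_large_shape_stimulus(img_size=512):
    """Two maximally large objects: a red square occupying the left half
    of the image and a blue triangle occupying the right half."""
    img = Image.new("RGB", (img_size, img_size), "white")
    draw = ImageDraw.Draw(img)
    half = img_size // 2
    # Red square inscribed in the left half (side = half the image width,
    # vertically centered so it remains a true square).
    top = (img_size - half) // 2
    draw.rectangle([0, top, half, top + half], fill="red")
    # Blue triangle spanning the right half, apex at the top.
    draw.polygon([(half, img_size), (img_size, img_size),
                  (half + half // 2, 0)], fill="blue")
    return img

make_large_shape_stimulus().save("large_shapes.png")
```

If binding errors do drop for such coarse, spatially separated objects, that would bear on the spatial-invariance account raised earlier in the thread.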