Is this different from the model not really encoding space? Most operations will be spatially invariant, which means binding will be problematic, but not THE problem.
Can you clarify what you mean about spatial invariance? Spatial judgments definitely seem to be a problem for these models, but that looks like a separate issue from the binding failures we examined here (many of which involve tasks with no spatial component).
These models excel at spotting a cat no matter where it is in the image. The higher-order statistics that make this possible don’t “need” the lower-order statistics involved in simple binding of, e.g., shape and colour.
Identifying whether a cat is present in an image is an instance of the disjunctive search task (detecting the presence of a single feature) that we show VLMs excel at, and that human observers can perform rapidly, even for large numbers of objects.
Though we don’t test it, one could envision a conjunctive search task pairing real-world object categories (such as cats) with some other feature (such as color), and I would expect VLMs to struggle with it because of the binding problem. A sketch of such a task follows below.
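For concreteness, here is a minimal sketch of how disjunctive vs. conjunctive search stimuli might be generated, using simple colored shapes rather than real-world categories like cats. It uses Pillow; all sizes, item counts, and feature choices are illustrative assumptions, not the stimuli from the study.

```python
import random
from PIL import Image, ImageDraw

def draw_item(draw, x, y, shape, color, size=30):
    """Draw a single search item (square or triangle) at (x, y)."""
    if shape == "square":
        draw.rectangle([x, y, x + size, y + size], fill=color)
    else:  # triangle
        draw.polygon([(x, y + size), (x + size, y + size),
                      (x + size // 2, y)], fill=color)

def make_search_display(n_items=12, conjunctive=True, img_size=512, seed=0):
    """Generate a visual search display.

    Disjunctive: target is a red square among blue squares
    (a single feature, color, identifies the target).
    Conjunctive: target is a red square among blue squares AND
    red triangles (only the color+shape conjunction identifies it).
    """
    rng = random.Random(seed)
    img = Image.new("RGB", (img_size, img_size), "white")
    draw = ImageDraw.Draw(img)

    # Place items on a jittered grid to avoid overlap.
    cells = [(cx, cy) for cx in range(4) for cy in range(4)]
    rng.shuffle(cells)
    positions = [(cx * 120 + rng.randint(10, 60),
                  cy * 120 + rng.randint(10, 60))
                 for cx, cy in cells[:n_items]]

    # First position holds the target: a red square.
    draw_item(draw, *positions[0], "square", "red")
    for x, y in positions[1:]:
        if conjunctive:
            # Each distractor shares exactly one feature with the target.
            shape, color = rng.choice([("square", "blue"), ("triangle", "red")])
        else:
            shape, color = "square", "blue"  # differs from target in color only
        draw_item(draw, x, y, shape, color)
    return img

make_search_display(conjunctive=True).save("conjunctive_search.png")
```

Querying a VLM on matched `conjunctive=True` and `conjunctive=False` displays would then probe whether accuracy drops specifically when the target is defined by a feature conjunction rather than a single feature.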
Simpler would be to test for binding with much larger objects: e.g., a red square and a blue triangle, where the shapes are so big that the square takes up the entire left half of the image and the triangle the right half. I suspect binding errors would drop.
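A quick sketch of that proposed stimulus, again using Pillow; the exact geometry is an assumption about what is being described:

```python
from PIL import Image, ImageDraw

def make_large_shape_stimulus(img_size=512):
    """Two maximally large objects: a red square occupying the left half
    of the image and a blue triangle occupying the right half."""
    img = Image.new("RGB", (img_size, img_size), "white")
    draw = ImageDraw.Draw(img)
    half = img_size // 2
    # Red square inscribed in the left half (side = half the image width,
    # vertically centered so it remains a true square).
    top = (img_size - half) // 2
    draw.rectangle([0, top, half, top + half], fill="red")
    # Blue triangle spanning the right half, apex at the top.
    draw.polygon([(half, img_size), (img_size, img_size),
                  (half + half // 2, 0)], fill="blue")
    return img

make_large_shape_stimulus().save("large_shapes.png")
```

If binding errors do drop for such coarse, spatially separated objects, that would bear on the spatial-invariance account raised earlier in the thread.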