(4) Both multimodal LMs & text-to-image models show strict capacity limits - similar to human 'subitizing' limits during rapid parallel processing. Key finding: They improve with visually distinct objects, suggesting failures stem from feature interference. - ThreadSky

About ThreadSky

(4) Both multimodal LMs & text-to-image models show strict capacity limits - similar to human 'subitizing' limits during rapid parallel processing. Key finding: They improve with visually distinct objects, suggesting failures stem from feature interference.

Comments

thisisadax.bsky.social•142 days ago

(5) We developed a scene description benchmark inspired by visual working memory tasks to more directly evaluate how feature overlap affects performance. Key finding: Errors spike when objects share overlapping features - driven by 'illusory conjunctions' where features get mixed up!

thisisadax.bsky.social•142 days ago

(6) Finally, we found that breaking 🪚🔨 visual analogy tasks into smaller chunks (i.e. performing object segmentation) to mitigate the influence of feature interference improves performance on those tasks.

thisisadax.bsky.social•142 days ago

(7) The punchline? Capacity limits aren't just about the number of objects - they stem from interference between representations when processing multiple things at once. This 'binding problem' creates fundamental constraints on parallel processing in both humans and VLMs🧍‍♂️🤖 .

thisisadax.bsky.social•142 days ago

(8) This work wouldn't have been possible without my amazing collaborators Sunayana Rane, Tyler Giallanza, Nicolò De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, @frankland.bsky.social, @cocoscilab.bsky.social, Jonathan Cohen, and @taylorwwebb.bsky.social.

neurostats.org•123 days ago

📌

Posting Rules

Be respectful to others
No spam or self-promotion
Stay on topic
Follow Bluesky's terms of service

Comments

Posting Rules

Reply