What you'd need is some kind of actual concept model capable of reasoning through the abstract visual features, i.e. “the minute hand is three ticks past 2, so that must be thirteen past the hour.” Basically an AGI-level task, TBH.
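For what it's worth, the arithmetic in that "three ticks past 2" step is trivial once the hand positions are known. The hard part is getting from pixels to those positions. Here's a rough sketch of the symbolic half only. The function names and the angle convention (degrees clockwise from 12) are my own assumptions for illustration, not anything a real vision model exposes.

```python
def minutes_from_ticks(numeral, ticks_past):
    """Minute value for a hand sitting `ticks_past` minute marks after `numeral`.

    Hypothetical helper: assumes a standard 12-numeral dial with 60 minute marks.
    """
    if not 1 <= numeral <= 12:
        raise ValueError("numeral must be between 1 and 12")
    if not 0 <= ticks_past <= 4:
        raise ValueError("ticks_past must be between 0 and 4")
    # Numerals are 5 minute marks apart; 12 wraps around to minute 0.
    return (numeral % 12) * 5 + ticks_past


def read_clock(hour_angle, minute_angle):
    """Turn hand angles (degrees clockwise from 12) into (hour, minute).

    Assumes the angles are already known exactly, which is the part
    a vision model would actually have to solve.
    """
    # The minute hand moves 6 degrees per minute.
    minute = round(minute_angle / 6) % 60
    # The hour hand moves 30 degrees per hour; 0 on the dial reads as 12.
    hour = int(hour_angle // 30) % 12 or 12
    return hour, minute


print(minutes_from_ticks(2, 3))   # the example from the comment above
print(read_clock(66.5, 78.0))     # hand angles for 2:13
```

So the "concept" is a couple of lines of modular arithmetic; the open question is whether the model builds anything like this internally or just memorizes what each clock face looks like.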
You could cheat by showing it a bunch of examples of every possible time, but in that case it’s not really understanding the concepts, just pattern matching.
Also, I think the current way vision adapters are hooked up to LLMs is inadequate. Representing high-density visual features in a linear embedding is problematic.