ByteDance's UI-TARS is an end-to-end GUI agent model built on a vision-language model (VLM) architecture. It takes only screenshots as input and performs human-like interactions, such as keyboard and mouse operations. https://huggingface.co/papers/2501.12326
