[1/8] While several human-in-the-loop (HITL) systems exist (e.g., OpenAI Canvas and our Collaborative STORM), what makes human-agent collab special? Agents need autonomy to be useful, yet the goal is to empower humans.
We start with three tasks: travel planning, surveying related work, and tabular analysis.
[2/8] Excitingly, collaborative agents consistently outperform their fully autonomous counterparts in terms of task performance, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users.
[4/8] Our vision builds on a long-standing dream in AI: to develop machines that act as teammates, not mere tools.
This demands situational intelligence to take initiative, communicate, and adapt. Co-Gym offers an evaluation framework that assesses both collab outcomes and processes.
[5/8] We built a user simulator and web UI to instantiate Co-Gym in simulated and real settings.
Experiments reveal human-like patterns: collaborative inertia, where poor communication hinders delivery; and collaborative advantage, where human-agent teams outperform autonomous agents.
[6/8] We conducted a detailed error analysis by having the authors annotate 300 trajectories. The analysis exposes significant limitations in current LMs and agent scaffolding: communication failures occur in 65% of real trajectories and situational awareness failures in 40%.
We define primitives for public and private components in the shared environment, along with collaboration actions and a notification protocol.
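Purely to illustrate how primitives like these might fit together, here is a minimal Python sketch. Every name in it (`SharedEnv`, `join`, `observe`, `send_message`, `edit_public`, `edit_private`, and the callback-based notification) is hypothetical and is not Co-Gym's actual API. It only assumes that public components are visible to all teammates, private components are visible only to their owner, and teammates are notified when shared state changes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical callback type: (recipient, event description) -> None
Notify = Callable[[str, str], None]


@dataclass
class SharedEnv:
    """Hypothetical shared workspace for a human-agent team (illustration only)."""

    public: Dict[str, str] = field(default_factory=dict)  # visible to every teammate
    private: Dict[str, Dict[str, str]] = field(default_factory=dict)  # visible only to its owner
    chat: List[Tuple[str, str]] = field(default_factory=list)  # (sender, text) messages
    subscribers: Dict[str, Notify] = field(default_factory=dict)

    def join(self, name: str, on_notify: Notify) -> None:
        """Register a teammate with an empty private component and a notification callback."""
        self.private[name] = {}
        self.subscribers[name] = on_notify

    def observe(self, name: str) -> dict:
        """A teammate sees all public state but only its own private state."""
        return {
            "public": dict(self.public),
            "private": dict(self.private[name]),
            "chat": list(self.chat),
        }

    def _notify(self, sender: str, event: str) -> None:
        # Assumed notification protocol: every teammate except the sender
        # is told that shared (public) state changed.
        for name, callback in self.subscribers.items():
            if name != sender:
                callback(name, event)

    # --- Collaboration actions (hypothetical names) ---
    def send_message(self, sender: str, text: str) -> None:
        self.chat.append((sender, text))
        self._notify(sender, f"message from {sender}")

    def edit_public(self, sender: str, key: str, value: str) -> None:
        self.public[key] = value
        self._notify(sender, f"{sender} edited {key}")

    def edit_private(self, sender: str, key: str, value: str) -> None:
        # Private edits stay invisible to teammates, so no notification is sent.
        self.private[sender][key] = value


# Usage: the agent drafts privately, then shares the result and asks the human.
inbox: List[Tuple[str, str]] = []
env = SharedEnv()
env.join("human", lambda who, event: inbox.append((who, event)))
env.join("agent", lambda who, event: inbox.append((who, event)))
env.edit_private("agent", "draft", "Day 1: museum")
env.edit_public("agent", "itinerary", "Day 1: museum")
env.send_message("agent", "Does this plan work?")
print(inbox)  # only the two public events reach the human
```

Per-teammate callbacks keep the sketch synchronous and easy to follow; a system with a live web UI would more plausibly push the same events over a queue or socket.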