New #ICLR2025 Paper Alert!
Can Audio Foundation Models like Moshi and GPT-4o truly engage in natural conversations?
We benchmark their turn-taking abilities and uncover major gaps in conversational AI.
Paper: https://arxiv.org/abs/2503.01174
In human dialogue, we listen, speak, and backchannel in real time.
An AI should do the same: know when to listen, speak, backchannel, or interrupt; signal when it wants to keep the conversation floor; and handle user interruptions.
(2/9)
Recent audio FMs claim conversational abilities, but there have been limited efforts to evaluate their turn-taking capabilities.
(3/9)
Moshi: small gaps and some overlap, but less than in natural dialogue.
Cascaded: higher latency, minimal overlap.
(4/9)
Moshi generates overlapping speech, but is it helpful or disruptive to the natural flow of the conversation?
(5/9)
Our automatic judge shows strong OOD generalization -> a reliable proxy for human judgment!
No need for costly human judgments: our model judges the timing of turn-taking events automatically!
(6/9)
- Both systems fail to speak up when they should, and give the user too few cues when they want to keep the conversation floor.
- Moshi interrupts too aggressively.
- Both systems rarely backchannel.
- User interruptions are poorly managed.
(7/9)