🚀 New #ICLR2025 Paper Alert! 🚀 Can Audio Foundation Models like Moshi and GPT-4o truly engage in natural conversations? 🗣️🔊 We benchmark their turn-taking abilities and uncover major gaps in conversational AI. 🧵👇 📜: arxiv.org/abs/2503.01174 - ThreadSky

🚀 New #ICLR2025 Paper Alert! 🚀

Can Audio Foundation Models like Moshi and GPT-4o truly engage in natural conversations? 🗣️🔊

We benchmark their turn-taking abilities and uncover major gaps in conversational AI. 🧵👇

📜: https://arxiv.org/abs/2503.01174

Comments

siddhant-arora.bsky.social•7 days ago

💡 Why does turn-taking matter?

In human dialogue, we listen, speak, and backchannel in real-time.

Similarly, an AI system should know when to listen, speak, backchannel, and interrupt; signal to the user when it wants to keep the conversational floor; and handle user interruptions.

(2/9)

siddhant-arora.bsky.social•7 days ago

Silence ≠ turn-switching cue! 🚫 Pauses (silences within one speaker's turn) are often longer than gaps (silences between speakers) in real conversations. 🤦‍♂️

Recent audio FMs claim to have conversational abilities, but there have been only limited efforts to evaluate their turn-taking capabilities.

(3/9)

siddhant-arora.bsky.social•7 days ago

We compare E2E (Moshi https://us.moshi.chat) & cascaded (https://github.com/huggingface/speech-to-speech) dialogue systems through a user study, using global corpus-level statistics!

Moshi: small gaps, some overlap—but less than natural dialogue
Cascaded: higher latency, minimal overlap.

(4/9)
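To make "global corpus-level statistics" concrete, here is a minimal sketch of how gaps, pauses, and overlaps could be tallied from timestamped speech segments. The `Segment` type, the speaker labels, and the function name are illustrative assumptions, not the paper's code, and only consecutive segments are compared, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # hypothetical labels, e.g. "user" or "system"
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds


def turn_taking_stats(segments):
    """Collect pause, gap, and overlap durations between consecutive segments."""
    segs = sorted(segments, key=lambda s: s.start)
    stats = {"pauses": [], "gaps": [], "overlaps": []}
    for prev, cur in zip(segs, segs[1:]):
        delta = cur.start - prev.end
        if cur.speaker == prev.speaker:
            # Silence inside one speaker's turn counts as a pause.
            if delta > 0:
                stats["pauses"].append(delta)
        elif delta >= 0:
            # Silence between two different speakers counts as a gap.
            stats["gaps"].append(delta)
        else:
            # The next speaker started before the previous one finished.
            stats["overlaps"].append(min(prev.end, cur.end) - cur.start)
    return stats


if __name__ == "__main__":
    dialogue = [
        Segment("user", 0.0, 2.0),
        Segment("user", 2.5, 4.0),
        Segment("system", 4.2, 6.0),
        Segment("user", 5.5, 7.0),
    ]
    print(turn_taking_stats(dialogue))
```

Averaging each list over a whole corpus gives the kind of aggregate numbers compared above: small gaps with some overlap for one system, longer gaps with almost no overlap for the other.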

siddhant-arora.bsky.social•7 days ago

Global metrics fail to evaluate when turn-taking events happen!

Moshi generates overlapping speech—but is it helpful or disruptive to the natural flow of the conversation? 🤔

(5/9)

siddhant-arora.bsky.social•7 days ago

We train a causal judge model on real human-human conversations that predicts turn-taking events. ⚡

Strong OOD generalization -> a reliable proxy for human judgment!

No need for costly human judgments—our model judges the timing of turn taking events automatically!

(6/9)
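One way to picture a judge model standing in for human raters: at each candidate moment (say, a user pause), the judge outputs a probability that a turn change should happen, and that is compared with what the system actually did. The sketch below is only an illustration. The function name, the input format, and the 0.5 threshold are assumptions, not the protocol from the paper.

```python
def judge_agreement(events, threshold=0.5):
    """Fraction of events where the system's action matches the judge.

    events: list of (judge_prob_turn_change, system_took_turn) pairs, where
    the probability comes from a hypothetical judge model and the boolean
    records whether the system actually took the turn at that moment.
    """
    if not events:
        return None  # no events, so agreement is undefined
    matches = sum((prob >= threshold) == took_turn for prob, took_turn in events)
    return matches / len(events)


if __name__ == "__main__":
    observed = [(0.9, True), (0.2, False), (0.8, False), (0.1, True)]
    print(judge_agreement(observed))
```

A low agreement rate at moments where the judge strongly expects a turn change would point to the system failing to speak up, while low agreement where the judge expects the user to continue would point to overly aggressive interruptions.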

siddhant-arora.bsky.social•7 days ago

🤯 What did we find?

❌ Both systems fail to speak up when they should and do not give the user enough cues when they want to keep the conversational floor.
❌ Moshi interrupts too aggressively.
❌ Both systems rarely backchannel.
❌ User interruptions are poorly managed.

(7/9)
