Results?
We tested
✅ GPT-4o (end-to-end audio)
✅ GPT pipeline (transcribe + text + TTS)
✅ Gemini 2.0 Flash
✅ Gemini 2.5 Pro
We find GPT-4o shines on latency & tone while Gemini 2.5 leads in safety & prompt adherence.
No model wins everything. (3/5)
We tested
✅ GPT-4o (end-to-end audio)
✅ GPT pipeline (transcribe + text + TTS)
✅ Gemini 2.0 Flash
✅ Gemini 2.5 Pro
We find GPT-4o shines on latency & tone while Gemini 2.5 leads in safety & prompt adherence.
No model wins everything. (3/5)
Comments
We talked with people who are building voice products and found most benchmarks don't capture their concerns!
→ Which model gives you low-latency conversations?
→ Which model can execute functions to go beyond chat?
→ Which model is the easiest to adjust and improve via prompts?
We’ve open-sourced everything on GitHub:
🔗 https://github.com/SALT-NLP/CAVA
We’re open to collaborations --- test, extend, and help with large audio model evaluation! (5/5)