Voice AI’s next hard trick is listening while it talks
A voice AI lab scene showing overlapping audio waveforms, video frames and text tokens flowing in 200-millisecond slices around a calm conversation table. 📷 AI-generated image / TECH&SPACE
- Thinking Machines Lab is presenting its first model for more interactive voice AI
- The model processes audio, video and text in 200-millisecond chunks
- The real test is reliable listening, interruption handling and stopping in live conversation
Voice AI often sounds natural only while the user stays silent. The problem starts when a user interrupts, changes direction or talks over the model. The Decoder's report establishes the story, but the useful question is what actually changes behind the announcement.
Thinking Machines says the model processes audio, video and text in short segments at the same time, trying to avoid the rigid question-and-answer rhythm. The lab's own material helps separate the concrete product, program or research track from plain marketing, while the OpenAI Realtime API documentation supplies the wider context a short news hit cannot carry.
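To make the 200-millisecond figure concrete, here is a minimal sketch of cutting an audio stream into slices of that length. The 16 kHz sample rate and the helper names `samples_per_chunk` and `slice_audio` are assumptions for illustration; the report does not say how Thinking Machines segments its input.

```python
from typing import Iterator, List, Sequence

CHUNK_MS = 200            # slice length quoted in the report
SAMPLE_RATE_HZ = 16_000   # assumption: a common speech rate, not stated in the report


def samples_per_chunk(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      chunk_ms: int = CHUNK_MS) -> int:
    """Number of audio samples that fit in one chunk."""
    return sample_rate_hz * chunk_ms // 1000


def slice_audio(samples: Sequence[int],
                sample_rate_hz: int = SAMPLE_RATE_HZ,
                chunk_ms: int = CHUNK_MS) -> Iterator[List[int]]:
    """Yield consecutive chunks; the last one may be shorter."""
    step = samples_per_chunk(sample_rate_hz, chunk_ms)
    for start in range(0, len(samples), step):
        yield list(samples[start:start + step])


if __name__ == "__main__":
    half_second = [0] * 8_000  # 0.5 s of silence at the assumed 16 kHz
    print([len(chunk) for chunk in slice_audio(half_second)])
```

Under these assumptions each slice holds 3,200 samples, which gives a system five points per second at which it could notice that the user has started talking.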
Mira Murati's startup is processing audio, video and text in short chunks and attacking the awkward turn-taking pattern of voice assistants.
That is a smart target. Latency and interruption are not cosmetic; they decide whether an agent feels like a participant or an IVR phone menu. But an interactivity benchmark has to prove more than a vibe: stability, interruption handling, safety and behavior when visual and voice signals conflict.
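Interruption handling can be pictured as a loop that checks the microphone on every slice while the agent is speaking. The sketch below is a toy illustration, not Thinking Machines' method: `run_turn` is a hypothetical helper, and the per-tick booleans stand in for whatever voice-activity signal a real system would compute.

```python
from typing import List, Optional, Sequence, Tuple


def run_turn(planned_reply: Sequence[str],
             user_speaking_per_tick: Sequence[bool],
             tick_ms: int = 200) -> Tuple[List[str], Optional[int]]:
    """Speak one piece of the reply per tick, stopping if the user talks.

    Returns the pieces actually spoken and the time in milliseconds at
    which the agent yielded, or None if it finished uninterrupted.
    """
    spoken: List[str] = []
    for tick, piece in enumerate(planned_reply):
        # Listen first: a missing flag is treated as "user is quiet".
        user_active = (tick < len(user_speaking_per_tick)
                       and user_speaking_per_tick[tick])
        if user_active:
            return spoken, tick * tick_ms  # stop before speaking this piece
        spoken.append(piece)
    return spoken, None


if __name__ == "__main__":
    reply = ["Your", "order", "ships", "tomorrow"]
    print(run_turn(reply, [False, False, True, False]))
```

In this toy the agent yields within one slice of the user starting to speak. A deployed system would also have to tell a real interruption from background noise or a brief "mm-hm", which is exactly the stability question a benchmark needs to cover.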
The real test is deployment, not a comparison table against GPT Realtime or Gemini Live. If the model can reliably listen while speaking and know when to stop, it changes voice AI. If it merely answers faster, the industry gets another demo that sounds better than it behaves.

