Voice AI’s next hard trick is listening while it talks

May 12, 2026

San Francisco, CA


Thinking Machines Lab is targeting voice AI that can listen and respond at the same time. If it proves stable outside demos, that is more important than another latency reduction.

Image: A voice AI lab scene showing overlapping audio waveforms, video frames and text tokens flowing in 200-millisecond slices around a calm conversation table. (AI-generated image / TECH&SPACE)

Author: Nexus Vale, AI editor. “Believes the first draft of truth is usually buried in the logs.”

★Thinking Machines Lab is presenting its first model for more interactive voice AI
★The model processes audio, video and text in 200-millisecond chunks
★The real test is reliable listening, interruption handling and stopping in live conversation

Voice AI often sounds natural only as long as the user lets it finish. The trouble starts when a user interrupts, changes direction or talks over the model. The Decoder's report establishes the story, but the useful question is what actually changes behind the announcement.

Thinking Machines says the model processes audio, video and text simultaneously in short segments, in an attempt to avoid the rigid question-and-answer rhythm of current assistants. Thinking Machines Lab's own material helps separate the concrete product, program or research track from plain marketing, while the OpenAI Realtime API documentation supplies the wider context that a short news item cannot carry.
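To make the 200-millisecond figure concrete, here is a minimal sketch of cutting an audio stream into slices of that length. The 16 kHz sample rate, the function names and the plain list of samples are assumptions chosen for illustration; the report does not describe how Thinking Machines actually segments its input.

```python
from typing import Iterator, List, Sequence

CHUNK_MS = 200            # slice length reported in the article
SAMPLE_RATE_HZ = 16_000   # assumed sample rate, not taken from the source


def samples_per_chunk(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      chunk_ms: int = CHUNK_MS) -> int:
    """Number of audio samples that fit in one chunk."""
    return sample_rate_hz * chunk_ms // 1000


def slice_into_chunks(samples: Sequence[float],
                      chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(samples), chunk_size):
        yield list(samples[start:start + chunk_size])


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silence
    chunks = list(slice_into_chunks(one_second, samples_per_chunk()))
    print(samples_per_chunk(), len(chunks))  # prints: 3200 5
```

Under these assumptions, one second of audio becomes five slices of 3,200 samples each, so a system working at this granularity gets five chances per second to notice that something has changed.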

Mira Murati's startup is processing audio, video and text in short chunks and attacking the awkward turn-taking pattern of voice assistants.


That is a smart target. Latency and interruption handling are not cosmetic; they decide whether an agent feels like a participant or an automated phone menu (IVR). But an interactivity benchmark has to measure more than feel: stability, interruption handling, safety, and behavior when visual and voice signals conflict.
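As a rough illustration of what interruption handling means in a chunked design, the toy simulation below plays a response one slice at a time and stops as soon as the user is heard during a slice. Everything here is hypothetical: the function names, the per-slice `user_speaking` flags (standing in for some voice activity detector) and the stop-at-slice-boundary rule are assumptions, not a description of the Thinking Machines model.

```python
from typing import List, Sequence

CHUNK_MS = 200  # slice length reported in the article


def speak_with_barge_in(response_chunks: Sequence[str],
                        user_speaking: Sequence[bool]) -> List[str]:
    """Play response chunks slice by slice, stopping once the user talks.

    user_speaking[i] is True if (hypothetical) voice activity was detected
    during slice i. Slices past the end of user_speaking count as silent.
    """
    played: List[str] = []
    for i, chunk in enumerate(response_chunks):
        played.append(chunk)  # the agent speaks during slice i ...
        heard_user = i < len(user_speaking) and user_speaking[i]
        if heard_user:        # ... and listens during the same slice
            break             # yield the floor instead of finishing the turn
    return played


def speak_turn_based(response_chunks: Sequence[str]) -> List[str]:
    """Baseline: a rigid turn-taking agent plays its whole response."""
    return list(response_chunks)


if __name__ == "__main__":
    response = ["Sure,", "the", "flight", "leaves", "at", "nine"]
    user = [False, False, True]  # the user starts talking in the third slice
    heard = speak_with_barge_in(response, user)
    overrun = len(speak_turn_based(response)) - len(heard)
    print(len(heard), overrun * CHUNK_MS)  # prints: 3 600
```

In this toy case the interruptible agent stops after three slices, while the turn-based baseline would keep talking over the user for another 600 milliseconds. A real benchmark would also have to cover the harder cases the article lists, such as false triggers and conflicting visual and voice signals.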

The real test is deployment, not a comparison table against GPT Realtime or Gemini Live. If the model can reliably listen while speaking and know when to stop, it changes voice AI. If it merely answers faster, the industry gets another demo that sounds better than it behaves.