Apoorv Singh is a researcher at smallest.ai, which builds ultra-efficient AI models for real-time applications including voice AI.
Talk to any voice AI product today, and you'll notice something strange. It's smart, it understands you, and it gives good answers -- but it doesn't feel like a conversation. It feels like a very polite game of walkie-talkie. You speak, you wait, it speaks, you wait. There's a rhythm to real conversation that no AI product has truly captured yet, and the reason is architectural, not intellectual.
The Half-Duplex Problem
Almost every voice AI system in production today runs on what's called a cascaded pipeline. Speech goes in, an ASR model converts it to text, that text gets fed to an LLM that generates a response, and a TTS model converts that response back to audio. It's three separate systems stitched together, running in sequence.
This produces coherent -- often impressive -- responses, but it's fundamentally half-duplex. The system can either listen or speak at any given moment. It can't do both. While it's generating a response, it's deaf. While it's listening, it's silent. There's no overlap, no interruption, no back-channel feedback. The "uh-huh" and "right" and "wait, actually" that make human conversation feel alive simply don't exist.
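To make the constraint concrete, here is a minimal sketch of that loop in Python. The mic, speaker, asr, llm and tts objects and their method names are hypothetical placeholders, not any particular vendor's API; the point is the strictly sequential control flow.

```python
# Minimal sketch of a cascaded, half-duplex voice turn.
# All objects and method names below are illustrative placeholders.

def cascaded_turn(mic, speaker, asr, llm, tts):
    audio_in = mic.record_until_silence()     # listen: the system says nothing
    user_text = asr.transcribe(audio_in)      # speech -> text (tone and timing are lost here)
    reply_text = llm.generate(user_text)      # text -> text
    reply_audio = tts.synthesize(reply_text)  # text -> speech
    speaker.play(reply_audio)                 # speak: the system hears nothing

# Each call blocks the next, so the system is always either listening or
# speaking, never both. An interruption mid-reply has nowhere to go.
```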
The latency also compounds. Each handoff between ASR, LLM and TTS adds processing time. By the time the system responds, one to three seconds have passed. Human conversational turn-taking operates in a window of 200 to 500 milliseconds. That gap between what the system delivers and what the human ear expects is where the illusion breaks.
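A rough latency budget shows why the cascade overshoots that window. The per-stage figures below are illustrative assumptions, not benchmarks of any specific system.

```python
# Back-of-the-envelope latency for a cascaded pipeline (illustrative figures).
stage_ms = {
    "endpoint detection": 300,   # waiting to confirm the user has stopped speaking
    "ASR": 200,
    "LLM first token": 400,
    "TTS first audio": 250,
    "network and buffering": 150,
}

total = sum(stage_ms.values())
print(f"estimated time to first response audio: ~{total} ms")  # ~1300 ms
print("human turn-taking window: roughly 200-500 ms")
```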
What Full-Duplex Actually Means
In telecommunications, full-duplex means both parties can transmit simultaneously. A phone call is full-duplex. A walkie-talkie is half-duplex. Human conversation is full-duplex in the deepest sense.
We don't just tolerate overlapping speech; we rely on it. A Swedish study of natural dialogue published in 2010 found that overlap, whether interruptions, back channels or collaborative completions, occurred at rates of 44% in face-to-face conversation and 52% in telephone conversation.
Full-duplex spoken dialogue in AI means building a system that can listen and generate simultaneously, on parallel channels, in real time. It means the model can hear you say "actually, wait" mid-sentence and stop. It means it can produce a quiet "mm-hmm" while you're still talking, signaling that it's following along. It means there are no rigid speaker turns, just two streams of audio flowing in both directions at once.
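In engineering terms, the difference shows up in the control flow: listening and speaking become two concurrent loops over the same model rather than alternating phases. The sketch below assumes a hypothetical model object with ingest and next_audio_frame methods; the device interfaces are placeholders as well.

```python
# Sketch of full-duplex audio I/O: both directions run at once.
# The model, mic and speaker interfaces are hypothetical.
import asyncio

async def listen(model, mic):
    while True:
        frame = await mic.read_frame()       # e.g. an 80 ms audio frame
        model.ingest(frame)                  # the model keeps hearing, even mid-sentence

async def speak(model, speaker):
    while True:
        frame = model.next_audio_frame()     # may be silence, an "mm-hmm", or speech
        await speaker.write_frame(frame)

async def full_duplex_session(model, mic, speaker):
    # No turn boundary exists in the control flow; the model decides, frame by
    # frame, whether to stay quiet, back-channel, or talk over the user.
    await asyncio.gather(listen(model, mic), speak(model, speaker))
```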
This is a fundamental rethinking of how spoken AI systems are architected.
The Research Frontier
The research community has been converging on this problem from multiple angles.
Kyutai's Moshi, published in late 2024, was arguably the first real-time full-duplex spoken language model. It treats dialogue as speech-to-speech generation, modeling the user's audio and its own audio as two parallel token streams processed jointly. By removing the concept of explicit speaker turns entirely, Moshi can handle overlapping speech, interruptions and interjections natively. It achieves a practical latency of around 200 milliseconds, comparable to the gap measured in human-to-human conversation.
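Conceptually, the two-stream idea looks something like the sketch below: at every timestep the model conditions on both the user's audio tokens and its own, then predicts its next token. This is a deliberately simplified illustration, not Kyutai's implementation; the model interface and tensor shapes are assumptions.

```python
# Simplified illustration of joint prediction over two parallel token streams,
# in the spirit of Moshi's dual-stream modeling. Not the actual architecture.
import torch

def next_assistant_token(model, user_tokens, assistant_tokens):
    # Stack the two streams so every timestep carries what the user was saying
    # and what the assistant was saying at that same instant.
    joint = torch.stack([user_tokens, assistant_tokens], dim=-1)  # (batch, time, 2)
    logits = model(joint)                                         # (batch, time, vocab), assumed
    # Because the user's stream is always present, "the user is talking over me"
    # is just another input pattern, not a special interruption event.
    return logits[:, -1].argmax(dim=-1)  # the assistant's next audio token
```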
Meta's SyncLLM work took a different approach, integrating time information directly into a language model so it runs synchronously with the real-world clock. The model predicts in 160 to 240 millisecond chunks, maintaining alignment with the pace of actual speech.
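A rough sketch of that clock-synchronous behavior: the model produces exactly one fixed-duration chunk per tick and waits until the next deadline, so generated audio never drifts ahead of or behind real time. The chunk size and the model and audio_io interfaces here are assumptions for illustration.

```python
# Sketch of clock-synchronous, chunked decoding in the spirit of SyncLLM.
# Interfaces and the exact chunk size are illustrative assumptions.
import time

CHUNK_MS = 200  # the published work describes chunks in the 160-240 ms range

def run_synchronous(model, audio_io):
    deadline = time.monotonic()
    while True:
        heard = audio_io.read(CHUNK_MS)          # user audio from this window
        spoken = model.predict_chunk(heard)      # model audio for the same window
        audio_io.write(spoken)
        deadline += CHUNK_MS / 1000.0            # stay locked to the wall clock
        time.sleep(max(0.0, deadline - time.monotonic()))
```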
Nvidia's PersonaPlex builds on Moshi's architecture and adds voice and role control, demonstrating that full-duplex systems can maintain consistent personas while handling interruptions and back channels naturally.
Tencent's work on semantic voice activity detection introduced a small language model that emits control tokens (continue listening, start speaking, start listening, continue speaking) to manage conversational flow. This treats turn-taking not as a binary switch but as a continuous decision process driven by semantic understanding.
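The mechanics reduce to a small state machine driven by those tokens: at each step, the predictor looks at the incoming audio and the current state and emits one of the four control tokens. The token names below come from the description above; the predictor interface is a hypothetical placeholder.

```python
# Sketch of turn-taking as a continuous, semantically driven decision process.
# The four control tokens are from the description above; the predictor
# interface is a hypothetical placeholder.

CONTINUE_LISTENING = "continue_listening"
START_SPEAKING = "start_speaking"
START_LISTENING = "start_listening"
CONTINUE_SPEAKING = "continue_speaking"

def apply_token(state, token):
    if token == START_SPEAKING:
        return "speaking"
    if token == START_LISTENING:
        return "listening"
    if token in (CONTINUE_LISTENING, CONTINUE_SPEAKING):
        return state
    raise ValueError(f"unknown control token: {token}")

def conversation_loop(predictor, frames):
    state = "listening"
    for frame in frames:
        # The decision uses meaning, not just silence: a pause after
        # "so what I'd do is..." should yield CONTINUE_LISTENING, not START_SPEAKING.
        token = predictor.next_control_token(frame, state)
        state = apply_token(state, token)
        yield state
```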
What all of this research shares is a recognition that the cascaded pipeline has reached its ceiling. You can optimize each component individually, shave milliseconds off ASR, make the LLM faster and reduce TTS latency. However, the architecture itself imposes constraints that no amount of optimization can overcome. Speech gets flattened to text at the first handoff, and tone, hesitation, urgency and emotion get flattened away with it. All of the paralinguistic information that humans use to navigate conversation is destroyed before the LLM ever sees it.
The Speech-To-Speech Shift
The alternative is what researchers call native speech-to-speech models. Audio goes in and audio comes out through a single unified model. There's no intermediate text representation or sequential handoffs. The model processes acoustic and semantic information jointly, which means it can preserve the emotional and tonal content of speech throughout the entire pipeline.
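The interface contrast with the cascaded loop sketched earlier is stark: one model, one call per audio frame, no text in between. As before, the model and device objects here are hypothetical placeholders, not any specific system's API.

```python
# Sketch of a native speech-to-speech loop: audio frames in, audio frames out,
# through a single model. Objects and method names are illustrative.

def speech_to_speech_loop(model, mic, speaker, frame_ms=80):
    while True:
        frame_in = mic.read(frame_ms)     # raw audio, with tone and hesitation intact
        frame_out = model.step(frame_in)  # one unified model; no intermediate transcript
        speaker.write(frame_out)          # output can overlap the user's speech
```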
This is where the field is heading, and it's where we're investing heavily at smallest.ai. Our model, Hydra, is a native speech-to-speech system: audio in, audio out, with sub-300-millisecond latency and true full-duplex capability. It can hear you while it's speaking. It preserves emotional fidelity because speech never gets flattened to text. It handles interruptions and overlaps natively -- not through bolted-on voice activity detection, but because the architecture models both streams simultaneously.
What's Still Missing
Full-duplex is necessary but not sufficient. The research community is still wrestling with several hard problems.
High-quality dual-channel conversational training data is scarce. Evaluation benchmarks are still maturing. Most metrics measure transcription accuracy or response quality in isolation, but what makes a conversation feel real is how those qualities interact with timing, rhythm and responsiveness. We don't yet have great ways to measure whether something feels like talking to a person.
Safety is another open question. Voice-based systems that can speak fluidly and emotionally raise new concerns around impersonation and manipulation that text systems don't face as acutely.
However, the trajectory is clear. The half-duplex era of voice AI, the walkie-talkie era, is ending. The systems that will define the next generation of voice products won't just understand language. They'll understand conversation. The difference is larger than it sounds.