Chatbots find their voice as audio assistants rise


This is a Reuters Breakingviews prediction for 2026.

LONDON, Dec 26 (Reuters Breakingviews) - Tech visionaries often cite the 2013 sci-fi movie "Her" as a blueprint for where artificial intelligence is headed. In the film, Joaquin Phoenix's character spends his days talking through a small earpiece to Samantha, his ever-present virtual assistant and eventual romantic interest, voiced by Scarlett Johansson. The story is hardly a love letter to AI, though it does hint at where investors' money and users' attention may be headed in 2026.

Voice-based software isn't new. Many people already use Apple's (AAPL.O) Siri assistant, for example. Amazon.com (AMZN.O) claimed in early 2025 that there were 600 million Alexa-enabled devices in the world, helping users look up facts, play music or control the living-room lights. But these experiences have historically been clunky. The voices typically sounded robotic, and the software ran on rigid, pre-programmed rules, making it hard to process new information or gauge the context of a query in the way that OpenAI's ChatGPT and Anthropic's Claude can.

AI is changing all that. Alexa and Siri now pack the punch of large language models (LLMs), significantly improving their usefulness. Meanwhile, OpenAI's Sam Altman and Jony Ive are working on a device that seems likely to be screenless, with a strong audio element. Startups like Sequoia-backed ElevenLabs are part of the trend too. The $6.6 billion company specializes in making computer voices sound real and has paid $11 million to people who uploaded short voice clips. These 10,000 samples help train systems that can replicate a wide variety of tones, accents and emotions.

As voice-enabled AI gets smarter and more human-sounding, consumers will lap it up. Speaking is about three times faster than typing for both English and Mandarin Chinese, according to a 2016 academic study. And leading speech-recognition models, like OpenAI's Whisper, claim error rates as low as 3%, meaning they get 97% of words right. That is roughly as accurate as typing on a smartphone keyboard, where users make typos in about 2% of words, according to a 2019 experiment.
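
That accuracy is also widely accessible: Whisper is released as open-source software, and a transcription takes only a few lines of Python. The sketch below is illustrative, assuming the openai-whisper package is installed along with ffmpeg; the file name and model size are placeholders, not a recommendation.

```python
# Minimal sketch using the open-source openai-whisper package (pip install openai-whisper).
# Requires ffmpeg for audio decoding. "voice_note.mp3" and the "base" model are placeholders.
import whisper

model = whisper.load_model("base")           # smaller models trade accuracy for speed
result = model.transcribe("voice_note.mp3")  # returns a dict including the recognized text
print(result["text"])
```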

Rather than using a web browser or a mobile application to order food or call a cab, it will become increasingly common to just speak to an AI assistant instead. Uber Technologies (UBER.N) already supports voice commands for Siri users in English, German, Japanese, French, Hindi and Portuguese. In theory, a customer wearing earbuds could order their favourite sushi dish without having to take their phone out of their pocket. That should also appeal to older or visually impaired users, who are sometimes less comfortable texting.

Consumers are already primed for audio AI. Wearing headphones for large parts of the day is increasingly common. WhatsApp users send more than 7 billion voice messages daily, while nearly half of young adults use voice notes weekly, according to GV's Tom Hulme. Next Move Strategy Consulting expects the revenue of the total voice AI market, including smart earbuds, to more than triple in size between 2025 and 2030, reaching $34 billion by the end of the decade. Meanwhile, venture capital firms invested $6.6 billion in voice AI startups in 2025 - up from $4 billion in 2023, according to PitchBook.

The bigger question is which companies stand to benefit as chatbots become audiobots. Greater demand for natural-sounding voices seems likely, which will help ElevenLabs. The startup claims a dominant 70% to 80% share of the synthetic voice market. It expects $300 million in annual recurring revenue by the end of 2025, and has a 60% operating profit margin.

Tech giants are already finding ways to move AI from the screen to the ear. Apple's AirPods now offer live translation in five languages, letting users understand what a foreign speaker is saying in real time. Alphabet (GOOGL.O) is putting similar functions from its Gemini assistant into the Pixel Buds earphones.

The bigger prize, however, may lie in developing more specialist audio AI models, as distinct from primarily text-based systems. The status quo for many voice-based assistants involves translating speech to text, feeding it into an LLM, and then reading the result aloud. A potentially better, albeit more expensive, alternative is to build "unified audio" systems that can listen, reason and respond directly in sound. This opens up a new field of possibilities, like incorporating a user's intonation and contextual background noise into an answer. In other words, it's a step closer to the sci-fi vision of "Her".
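
In code terms, the contrast looks roughly like the sketch below. It is purely illustrative: the stub functions stand in for a speech-to-text model, a text-only LLM and a text-to-speech engine, and do not correspond to any particular vendor's API.

```python
# Hypothetical sketch contrasting today's cascaded voice pipeline with a unified audio model.
# transcribe(), generate_reply() and synthesize_speech() are placeholder stubs.

def transcribe(audio_in: bytes) -> str:
    """Speech recognition: audio -> text (stub)."""
    return "order my usual sushi"

def generate_reply(user_text: str) -> str:
    """Text-only LLM reasons over the transcript (stub)."""
    return f"Placing your order: {user_text}"

def synthesize_speech(reply_text: str) -> bytes:
    """Text-to-speech: text -> audio (stub)."""
    return reply_text.encode("utf-8")

def cascaded_assistant(audio_in: bytes) -> bytes:
    # Each hop discards information: intonation, pauses and background sound
    # never reach the LLM, because it only ever sees the transcript.
    text = transcribe(audio_in)
    reply = generate_reply(text)
    return synthesize_speech(reply)

def unified_assistant(audio_in: bytes, audio_model) -> bytes:
    # A unified audio model listens, reasons and responds in sound directly,
    # so cues like tone of voice survive end to end. audio_model is a
    # hypothetical end-to-end speech model, not a real library object.
    return audio_model.generate(audio_in)
```

The trade-off is cost: running a single end-to-end audio model is more expensive than chaining cheaper specialized components, which is why the cascaded approach remains the default today.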

Another big question is who loses as audiobots rise. Ive and Altman's intentions with the secretive OpenAI device may offer a hint: the Wall Street Journal has reported that the pair hope to reduce users' screentime. It follows that social media apps like TikTok, Instagram and WhatsApp may suffer unless they can adapt.

The biggest problem for voice AI, however, may be privacy. Regulators and the general public won't like people walking around with headphones or other devices that are always listening. OpenAI and others may have to find a way around that obstacle. Still, the history of social media suggests that users will happily surrender personal information in return for a product they love. In 2026, the future of AI may therefore be heard as much as it's seen.