At Google I/O 2018, Google Duplex called a hair salon and scheduled an appointment in real time. The stylist on the other end had no idea she was speaking with software. The AI said "Mmm-hmm" at the right moment, handled an unexpected scheduling question, and booked the slot. The crowd gasped. But what the audience did not see was the pipeline underneath: audio capture, streaming speech recognition, a language model generating a reply, a text-to-speech voice renderer, all stitched together in a loop that had to complete in under a second — or the illusion would collapse.
Every real-time conversational AI system — from Google Duplex to Amazon Alexa to OpenAI's real-time API — runs on the same conceptual pipeline. The stages are Audio Capture → Speech Recognition (ASR) → Language Model Inference → Speech Synthesis (TTS). What varies is how fast each stage runs, how they are chained, and whether any stage is eliminated to save latency.
Stage 1 — Audio Capture: A microphone samples air pressure thousands of times per second (typically 16 kHz for voice). The raw PCM waveform is chunked into small frames (10–30 ms) and streamed to the ASR engine. Good voice activity detection (VAD) is essential here — you do not want to forward silence to a paid inference API.
Stage 2 — Automatic Speech Recognition (ASR): A model converts acoustic frames to text. Streaming ASR systems like OpenAI Whisper, Google Speech-to-Text, or Deepgram Nova emit partial transcripts as the user speaks, rather than waiting for the full utterance. This shaves hundreds of milliseconds from perceived latency because the language model can begin pre-processing before speech ends.
Stage 3 — Language Model Inference: The transcript arrives at a large language model. The model generates a reply token-by-token. In a real-time system the first token must emerge quickly (low time-to-first-token, or TTFT). A reply that arrives 3 seconds after the user stops speaking feels broken; under 500 ms feels nearly natural.
Stage 4 — Text-to-Speech (TTS) Synthesis: The reply text is converted to audio using a neural voice model. Modern TTS systems can stream audio — playing the first syllable before the full sentence is synthesized — which again slashes perceived latency dramatically.
Human back-and-forth conversation operates on gaps of roughly 200 ms. Research by psycholinguists Levinson and Torreira (2015) found that delays beyond 700 ms cause listeners to infer the speaker is uncertain or uncooperative. In voice AI, delays beyond ~1.2 s cause abandonment in commercial applications. Every architecture decision is, ultimately, a decision about where milliseconds are spent.
The fundamental architecture choice is batch vs. streaming. In a batch system, the user speaks, the system records the complete utterance, processes it end-to-end, and returns audio. In a streaming system, each stage starts processing as soon as data arrives from the previous stage. Batch is simpler to build but adds hundreds to thousands of milliseconds of latency. Streaming reduces latency but requires careful state management and error recovery.
Amazon Alexa was originally batch-based: record wake word, stream audio to cloud, return result. Early versions had noticeable pauses that users tolerated because accuracy was high. As competition intensified, Amazon invested heavily in streaming ASR and streaming TTS to close the gap. By 2023 Alexa's end-to-end latency on fast queries had dropped below 600 ms on modern Echo hardware.
OpenAI's Realtime API, announced in October 2024, takes this further by eliminating the text layer entirely for some use cases. Audio goes in; audio comes out. The model handles ASR, language understanding, and TTS as a single end-to-end neural process, removing inter-stage serialization entirely.
A pipeline that only processes speech sequentially cannot handle the most natural human behavior: interruption. When you say "stop" or begin a new sentence while the AI is still speaking, a real-time system must detect the barge-in, stop the outgoing audio, and reset its context. This requires voice activity detection running concurrently with TTS playback — a non-trivial engineering challenge that distinguishes production-grade systems from demos.
During the development of Google Assistant's duplex-style calling feature, Google engineers described needing a dedicated barge-in detection pathway that ran independently of the main inference pipeline. Without it, the AI would continue speaking over an interrupted human, which destroyed the naturalistic illusion immediately.
Real-time conversation AI is not a single model — it is a tightly coupled system of models operating concurrently against a shared latency budget. Understanding where each millisecond is spent is the prerequisite to making any one component better.
You are an engineer designing a voice-first customer support bot for a major airline. The bot must handle booking changes, cancellations, and gate inquiries over phone. End-to-end response latency must be under 1.2 seconds for 90% of queries. Explore the trade-offs with the assistant below.
In 2021 and 2022, journalists and researchers documented that Otter.ai — one of the most widely used real-time transcription services — performed significantly worse on speakers with non-American accents and on women's voices. An MIT Media Lab study found that commercial ASR systems from Amazon, Google, IBM, and Microsoft had word-error rates up to five times higher for Black American speakers compared to white American speakers. The real-time pipeline, optimized for average-user accuracy, was delivering systematically degraded experiences to large portions of its users without surfacing any indication that it was less confident on those inputs.
Streaming ASR systems process audio in overlapping chunks and emit partial hypotheses — best guesses that may be revised as more audio arrives. A system might transcribe "I'd like to res—" and then revise to "I'd like to reschedule" once the full word arrives. This creates a fundamental challenge: the language model downstream must decide when to act on a transcript that might still change.
Beam search is the standard decoding algorithm. The ASR model maintains multiple candidate transcription paths simultaneously, pruning unlikely branches as evidence accumulates. The finalized segment of a transcript — the portion the model is confident won't change — grows leftward as more audio arrives. Everything to the right of that boundary is provisional.
Modern systems like Deepgram Nova-2 and Google Cloud Speech-to-Text v2 allow developers to distinguish between interim and final results via API flags. Downstream components should generally only act on final results, unless latency requirements are so strict that acting on interim results — and occasionally correcting course — is worth the complexity.
ASR models typically output a confidence score alongside each transcript segment. These scores, however, are imperfectly calibrated. A score of 0.95 does not reliably mean 95% accurate. Research from Google Brain (published 2021) showed that confidence calibration varies significantly by speaker demographics, acoustic environment, and domain vocabulary.
In a production voice system, low-confidence ASR output should trigger a clarification strategy rather than proceeding on a potentially incorrect transcript. A flight-booking bot that mishears "Denver" as "Dover" and books the wrong city has committed a serious error downstream of a small upstream uncertainty. The smart system asks: "Did you say Denver, Colorado?" rather than silently committing to a wrong transcription.
Generic ASR models are trained on conversational speech. They often struggle with domain-specific vocabulary: medical terms, product names, proper nouns, technical jargon. Whisper's open-source model, for example, frequently mis-transcribes unusual proper nouns. Production systems address this through custom vocabulary injection (biasing) — providing a list of expected terms the ASR should favor. Google, Amazon, and Deepgram all offer boosting APIs for this purpose.
Close-talking microphones (headsets, phone handsets) yield relatively clean audio. Far-field microphones — as used by smart speakers — capture audio mixed with room reverberation, appliance noise, TV audio, and multiple speakers. Amazon's Alexa devices use a seven-microphone array with hardware beamforming to isolate the primary speaker before audio even reaches the ASR model. Without such preprocessing, far-field ASR error rates can be three to four times higher than close-talking.
In 2023, during testing of Amazon Echo's new-generation hardware, internal benchmarks released in an FTC filing showed that Alexa's word-error rate on far-field queries was approximately 9.8% under normal home conditions — rising to 22% in high-noise environments like active kitchens. These numbers drove the team to invest in neural acoustic front-ends that model room acoustics before ASR decoding begins.
Streaming ASR forces a fundamental trade-off: the longer you wait before emitting a hypothesis, the more context you have and the more accurate the transcript. But the user is waiting. A system can be tuned toward low-latency mode (emit partial results aggressively, accept more corrections) or accuracy mode (hold results longer before finalizing). Most production systems expose this as a configurable parameter. The right setting depends on use case: a transcription service for court reporting tolerates more latency; a real-time voice chatbot does not.
Streaming ASR is not a binary transcript — it is a probability distribution over possible words that collapses into text as evidence arrives. A robust real-time voice system treats ASR output as uncertain until finalized, and builds clarification logic for the cases where confidence is genuinely low.
You are configuring the ASR layer for a telehealth platform where patients call in to describe symptoms and request prescriptions. Speakers include elderly patients, speakers with non-native accents, and patients in noisy home environments. Misrecognitions of medication names could have serious consequences.
Throughout 2024, Apple struggled to integrate large language model capabilities into Siri in a way that met its voice latency requirements. Reports from Bloomberg journalist Mark Gurman described internal debates about whether to run inference on-device (fast, private, but limited capability) or in the cloud (more capable, but adding network round-trips). Apple's announced Apple Intelligence framework — revealed at WWDC 2024 — ultimately used a hybrid: small models on-device for fast, low-complexity responses, and larger cloud models (including an optional ChatGPT integration) for complex queries that users would tolerate waiting for. The system was designed to make the latency tier invisible to the user — you never see "routing to cloud model."
A conversational voice session accumulates turns over time. Each user utterance and AI reply adds tokens to the conversation history. Most LLMs have context window limits — GPT-4 Turbo supports 128K tokens, Claude 3 supports 200K — but even within those limits, longer contexts mean longer inference times, which compound the latency problem.
Production voice systems typically implement rolling context windows: older turns are summarized or dropped to keep the active context manageable. The challenge is deciding what to keep. A booking bot that drops the user's stated destination from three turns ago will ask again, frustrating the user. A summarization strategy that compresses early turns into a brief note ("User wants to fly to Denver, November 14") preserves intent without ballooning the context.
OpenAI's Realtime API (October 2024) manages its own session context automatically, but exposes a session object developers can inspect and modify. This lets developers inject background context (e.g., the customer's booking history retrieved from a database) without it appearing in the spoken conversation.
Text responses and voice responses are not interchangeable. A 200-word text answer is readable in 45 seconds; spoken aloud at natural pace it takes over 90 seconds and feels interminable. Voice responses must be shorter, denser with information, and structured for listening rather than reading. Bullet points, tables, and headers are meaningless in audio. The LLM system prompt for a voice deployment must explicitly constrain response length and style.
Amazon's research team published findings in 2022 showing that Alexa users rated responses over 40 words as "too long" at a significantly higher rate than responses of 20–35 words, even when the longer responses were more informative. The optimal response length for voice was domain-specific but almost always shorter than the text equivalent users were comfortable reading.
Beyond length, voice responses require prosody-friendly language: natural sentence boundaries, avoidance of long parentheticals, and structures that TTS engines can deliver with appropriate phrasing. A sentence like "The flight — subject to availability, which changes hourly — departs at 6:00 PM" is awkward for TTS; "The flight departs at 6:00 PM. Availability can change by the hour." is not.
On-device models (Apple's 3B parameter on-device model, Google's Gemini Nano) have latency advantages — no network round-trip — but capability limits. Cloud models (GPT-4o, Claude 3.5 Sonnet) are more capable but add 100–400 ms of network latency per request. Apple's hybrid routing strategy — fast on-device for simple queries, cloud for complex — represents the current state of the art in balancing this trade-off. The routing decision itself must be made in milliseconds.
Speculative decoding is a technique where a small, fast "draft" model generates candidate tokens that a larger model verifies in parallel. When the large model agrees with the draft (which happens for common token sequences), you get the large model's quality at near the small model's speed. Google DeepMind published results in 2023 showing speculative decoding could achieve 2–3× token throughput improvements on LLaMA-class models without quality degradation — directly translating to faster TTFT in voice systems.
For voice deployments, another latency technique is reply pre-generation: while the user is still speaking, the system predicts the most likely query type and pre-generates partial responses. If the prediction is correct, the LLM portion of the pipeline is nearly eliminated. If wrong, the cached response is discarded. This technique requires accurate intent classification from partial ASR output and is most effective in narrow-domain applications where the set of possible queries is limited.
Most real-world voice applications require the LLM to do more than generate text — it must query databases, call APIs, check inventory, retrieve account information. Function calling (OpenAI's term) or tool use allows the LLM to emit structured requests for external data within its generation stream. The pipeline pauses, executes the tool, and resumes with the result injected into context.
In voice, each tool call adds a round-trip to the latency budget. A voice bot that needs to check flight availability, retrieve passenger preferences, and confirm pricing before responding might spend 800–1,200 ms on tool calls alone, before the LLM even starts generating the reply. Production systems mitigate this by parallelizing tool calls where dependencies allow and pre-fetching likely-needed data during ASR processing.
Voice LLM deployment is not just about making a good LLM go fast — it requires rethinking output length, style, context management, and tool-call parallelism simultaneously. The system that wins is the one that respects the user's perception of time across the entire pipeline.
You are configuring the LLM layer for a retail banking voice bot that handles balance inquiries, transaction disputes, and card freeze requests. The bot must call three backend APIs (account data, fraud detection, card management) and respond in under 1.5 seconds. Average query complexity is low, but disputes require nuanced multi-turn conversations.
In early 2023, ElevenLabs launched a voice cloning product that could reproduce any voice from a short audio sample with remarkable fidelity. Within days, anonymous users had generated audio clips mimicking celebrities and politicians saying things they never said. A clip of Emma Watson reading Adolf Hitler was shared on 4chan. ElevenLabs introduced identity verification and content moderation in response — but the episode underscored how far neural TTS had advanced past the uncanny valley in terms of naturalness, while advancing much faster than the safety infrastructure designed to contain it.
Modern neural TTS systems are generative models, typically consisting of two stages: an acoustic model that converts text (or phoneme sequences) to a mel-spectrogram (a compressed representation of audio frequencies over time), and a vocoder that converts the mel-spectrogram to raw audio waveforms. Systems like Google's WaveNet (2016) pioneered neural vocoders; subsequent systems like FastSpeech 2, VITS, and Voicebox (Meta, 2023) have dramatically improved quality and inference speed.
The latest systems, including ElevenLabs, OpenAI TTS, and Eleven Multilingual v2, are end-to-end: text goes in, waveform comes out, with the entire system trained jointly. Quality on well-represented speakers and languages has surpassed what most listeners can identify as synthetic in controlled tests.
In a real-time pipeline, waiting for the full TTS rendering before playing audio is unacceptable. Streaming TTS breaks synthesis into sentence or even sub-sentence chunks and streams audio frames as they are generated. The user hears the beginning of a response while synthesis of the end of the response is still in progress.
The challenge is that prosody — the natural melody of speech — depends on the whole sentence or even paragraph. When you synthesize sentence-by-sentence, you lose the intonation cues that span sentence boundaries: a list sounds wrong because each item is synthesized without knowledge of the others; a question that follows a statement sounds wrong because the system doesn't know a question is coming. OpenAI's TTS API and ElevenLabs Streaming address this partly by using look-ahead buffers — synthesizing slightly ahead of what is currently playing to give the model context about what comes next.
The biggest remaining quality gap in streaming TTS is prosodic coherence across chunk boundaries. Human speech naturally uses rising intonation to signal continuation and falling intonation to signal completion. A streaming TTS system that processes one sentence at a time may synthesize each sentence as if it is the last one — producing a monotone, declarative rhythm that feels robotic even when individual phonemes are perfect. Solutions include sentence-boundary look-ahead, explicit prosody control (SSML tags), and full-paragraph synthesis with playback streaming (higher latency but better quality).
In a multi-session voice application — such as a banking bot a user calls repeatedly — voice consistency builds trust. If the voice changes between calls, sounds different at different speaking rates, or changes timbre when stressed phonemes appear, users notice. Production TTS deployments invest in voice consistency testing: running standardized test sentences through the TTS system across updates to verify that the perceptual quality and identity of the voice remains stable.
Amazon's Alexa team maintains internal voice consistency benchmarks and runs A/B tests before any change to the Alexa TTS voice reaches production. A 2022 incident where a TTS model update accidentally altered Alexa's vocal quality was caught in staging by these benchmarks before it reached users — an example of how the quality control infrastructure around TTS has become as important as the model itself.
Speech Synthesis Markup Language (SSML) is an XML-based standard that lets developers annotate text with prosodic instructions: pauses, emphasis, rate changes, pitch modifications, and pronunciation hints. All major TTS platforms (Amazon Polly, Google Cloud TTS, Microsoft Azure Speech) support SSML. A well-constructed SSML response can significantly improve the naturalness of synthesized voice output by giving the TTS engine explicit guidance rather than relying on it to infer prosody from text alone.
For example, a number like "1,200" might be synthesized as "one thousand two hundred" or "twelve hundred" depending on context. SSML's <say-as> tag lets developers specify: read this as a cardinal number, an ordinal, a date, a telephone number, or a dollar amount. Without such hints, TTS systems guess from context — and sometimes guess wrong in ways that are subtly confusing.
The ElevenLabs incident was not isolated. As TTS fidelity crosses the threshold of perceptual indistinguishability from human speech, the applications for fraud, disinformation, and non-consensual voice cloning multiply. In 2023, the FTC began studying synthetic voice regulations. Several states passed laws criminalizing non-consensual voice cloning for fraudulent purposes. The voice AI industry has moved toward watermarking standards — embedding imperceptible signals in synthesized audio that allow detection — though no universal standard has yet emerged.
For product teams deploying voice AI, the practical implication is disclosure: should users know they are speaking to an AI? California's BOT Disclosure Act (SB 1001, effective 2019) requires disclosure in consumer contexts. The ethical norm across the industry has shifted clearly toward disclosure, even where not legally required, after the backlash to deceptive AI calling demos.
Neural TTS has crossed the naturalness threshold — synthetic speech can now fool listeners in controlled conditions. This makes the surrounding infrastructure — disclosure norms, watermarking, voice consistency testing, prosody control — as important as the model quality itself. The best TTS deployment is not the most realistic one; it is the most appropriate one.
<say-as> tag do, and why does it matter for voice AI applications?<say-as>, a TTS system guesses how to read "10/11" — is it October 11th, the fraction ten-elevenths, or two separate numbers? In a banking or travel application, such ambiguity causes errors that damage user trust. <say-as> removes the ambiguity entirely.<say-as> specifically addresses interpretation ambiguity for tokens that could be read multiple ways. "1,200" could be cardinal, a year, or something else. "10/11" could be a date, a fraction, or a ratio. Telling the TTS which interpretation to use prevents the awkward mis-readings that erode user trust in voice applications.You are the voice AI lead for a healthcare company launching a patient outreach bot that calls patients to remind them of appointments, follow up on medication adherence, and escalate concerning symptoms to human nurses. The bot must sound natural and empathetic. You need to decide on voice, streaming strategy, SSML usage, and disclosure approach.
<say-as> tag specifically address in TTS synthesis?<say-as>, the TTS must guess from context how to read ambiguous tokens. "10/11" could be a date, a fraction, or two separate numbers. "1200" could be twelve hundred, one thousand two hundred, or a year. <say-as> eliminates this ambiguity with explicit interpretation instructions.<say-as> addresses interpretation ambiguity. Many tokens can be read multiple valid ways — as a number, a date, a currency amount, a phone number. Without explicit instruction, TTS engines make context-based guesses that are sometimes wrong. <say-as> tells the engine exactly which interpretation to apply.