Module 2 · Lesson 1

How ASR Works: From Sound Waves to Text

The acoustic journey — how machines turn vibrating air into readable words

What actually happens inside an ASR system between "Hello" and the letter H?

When OpenAI released Whisper as open source in September 2022, researchers immediately tested it against every benchmark available. On the LibriSpeech clean test set, long the gold standard, Whisper large-v2 scored a 2.7% Word Error Rate, at or below commonly cited human error rates. The achievement was notable not because of a new architecture but because Whisper was trained on 680,000 hours of weakly supervised internet audio, of which roughly 117,000 hours cover 96 languages other than English. Scale, not innovation, had cracked near-human ASR.

The ASR Pipeline in Five Stages

Automatic Speech Recognition transforms a continuous audio waveform into a discrete sequence of words. Every production system — from a Google Assistant query to a hospital dictation tool — passes audio through the same conceptual pipeline, even if implementation details differ dramatically.

1. Audio Capture: microphone → PCM samples at 16 kHz

2. Pre-processing: noise reduction, VAD, normalization

3. Feature Extraction: Mel-filterbank / MFCC spectrograms

4. Acoustic Model: neural net → phoneme or token probabilities

5. Decoder: language model + beam search → text

Stage 1 & 2: Capture and Pre-processing

A microphone converts air-pressure variations into an electrical analog signal. An analog-to-digital converter (ADC) samples this signal, typically at 16,000 samples per second (16 kHz) for speech — enough to faithfully represent the 80 Hz–8 kHz range that carries intelligibility. Each sample is a 16-bit integer recording instantaneous amplitude.
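As a rough illustration of the capture step, the sketch below samples a synthetic sine tone at 16 kHz and quantizes it to 16-bit integers. The function names, the 440 Hz tone, and the 0.5 amplitude are illustrative assumptions, not part of any particular ASR toolkit.

```python
import math

def sample_tone(freq_hz, duration_s, sample_rate=16_000, amplitude=0.5):
    """Sample a sine tone and quantize each value to a signed 16-bit integer."""
    n_samples = int(round(duration_s * sample_rate))
    full_scale = 32767  # largest positive value of a signed 16-bit integer
    pcm = []
    for n in range(n_samples):
        value = amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
        pcm.append(int(round(value * full_scale)))
    return pcm

def nyquist_limit(sample_rate):
    """Highest frequency (Hz) that can be represented without aliasing."""
    return sample_rate / 2

pcm = sample_tone(440.0, 1.0)          # one second of a hypothetical 440 Hz tone
print(len(pcm), nyquist_limit(16_000))
```

One second of audio yields 16,000 integers, and the Nyquist limit of 8 kHz matches the upper edge of the intelligibility band quoted above.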

Voice Activity Detection (VAD) discards silence so the acoustic model doesn't waste computation. Traditional VAD used energy thresholds; modern systems use tiny neural nets (e.g., Silero VAD, ~1 MB) that classify 10–30 ms frames as speech or non-speech with high accuracy even in noisy environments.
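The traditional energy-threshold approach can be sketched in a few lines. This is a minimal illustration on synthetic audio, not how Silero VAD or any neural detector works; the 20 ms frame size and the threshold of 500 are assumed values that a real system would tune.

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one frame of PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_vad(samples, sample_rate=16_000, frame_ms=20, threshold=500.0):
    """Label each non-overlapping frame True (speech) or False (non-speech)."""
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        flags.append(frame_rms(samples[start:start + frame_len]) > threshold)
    return flags

# Synthetic input: 100 ms of silence, 100 ms of a 200 Hz tone, 100 ms of silence.
silence = [0] * 1600
tone = [int(8000 * math.sin(2 * math.pi * 200 * n / 16_000)) for n in range(1600)]
flags = energy_vad(silence + tone + silence)
print(flags)
```

A fixed threshold like this breaks down as soon as background noise is louder than quiet speech, which is the gap learned detectors are meant to close.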

Noise reduction — beamforming on multi-mic arrays, spectral subtraction, or learned noise suppression — happens here too. Amazon showed in 2017 that the far-field performance of Echo devices depended almost entirely on this stage, not the downstream model.

Stage 3: Feature Extraction — The Mel Spectrogram

Raw waveform samples are too dense to feed directly into a neural network efficiently. Instead, the audio is sliced into overlapping 25 ms frames spaced 10 ms apart, and a Short-Time Fourier Transform (STFT) converts each frame from time-domain amplitude to frequency-domain magnitude. The result is filtered through a Mel filterbank — a set of triangular filters spaced on the Mel scale, which mimics the logarithmic frequency perception of the human ear.

Whisper uses 80 Mel bins. Each 10 ms frame becomes a vector of 80 numbers. Thirty seconds of audio becomes an 80 × 3000 matrix — that's the spectrogram fed to the encoder. Older systems used Mel-Frequency Cepstral Coefficients (MFCCs), a further compression step; most transformer-based systems now skip cepstral compression and feed log-Mel spectrograms directly.
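The 80 × 3000 figure follows from simple framing arithmetic. The helper names below are hypothetical, and the sketch assumes the input is padded so that exactly one frame is produced per 10 ms hop.

```python
def stft_frame_params(sample_rate=16_000, window_ms=25, hop_ms=10):
    """Window and hop sizes in samples for the framing described above."""
    window = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return window, hop

def spectrogram_shape(duration_s, n_mels=80, sample_rate=16_000, hop_ms=10):
    """Shape (n_mels, n_frames), assuming one frame per hop after padding."""
    hop = sample_rate * hop_ms // 1000
    n_frames = int(duration_s * sample_rate) // hop
    return n_mels, n_frames

print(stft_frame_params())      # window and hop in samples
print(spectrogram_shape(30))    # shape for a 30-second clip
```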

Why the Mel Scale?

Humans perceive pitch roughly logarithmically: a 100 Hz step is far more noticeable near 200 Hz than near 4,000 Hz. The Mel scale approximates this perceptual spacing mathematically; it is close to linear below about 1 kHz and close to logarithmic above. Features on the Mel scale are far more correlated with phoneme identity than raw linear-frequency features, which is why even deep neural nets trained end-to-end still benefit from Mel filterbank inputs.
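To make the spacing concrete, the sketch below uses one common Hz-to-Mel formula, m = 2595 · log10(1 + f / 700). The lesson does not say which variant a given system uses, so treat the formula choice and the function names as assumptions.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to Mels (one widely used formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of filters spaced evenly along the Mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centers = mel_center_frequencies(80)
print(round(centers[1] - centers[0], 1), round(centers[-1] - centers[-2], 1))
```

Evenly spaced Mel filters sit close together at low frequencies and far apart at high frequencies, so most of the 80 bins are spent where speech cues are densest.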

Stage 4 & 5: Acoustic Model and Decoder

The acoustic model takes the spectrogram and produces a probability distribution over linguistic units — classically phonemes, now often byte-pair-encoded (BPE) subword tokens. Modern systems are largely encoder-decoder transformers. Whisper's encoder is a stack of transformer blocks that attends over the entire 30-second context; its decoder uses cross-attention to produce tokens autoregressively.

The decoder stage traditionally combined an acoustic model score with a separately trained language model (LM) via beam search — searching the lattice of possible word sequences for the highest combined probability. End-to-end neural systems like Whisper internalize the language model in the decoder weights, removing the need for a separate LM at inference. This simplifies deployment but makes domain adaptation harder.

WER: Word Error Rate, defined as (Substitutions + Deletions + Insertions) / Total reference words. The primary ASR accuracy metric. Human transcription is typically 4–6% WER on conversational speech; near-human ASR is ≤5%.
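The WER formula can be computed with a word-level edit distance. The sketch below is a minimal version that splits on whitespace and does no text normalization (casing, punctuation), which real evaluation scripts usually add.

```python
def wer(reference, hypothesis):
    """Word Error Rate: minimum word-level edits divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

print(wer("the meeting is at three", "the meeting is at free"))
```

One substituted word out of five reference words gives a WER of 0.2, or 20%.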

CTC: Connectionist Temporal Classification, a loss function that allows the model to produce variable-length output from variable-length input without explicit alignment. It adds a blank token that means "no new label at this frame" and separates genuinely repeated labels; the blank is not a silence marker.
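At inference time, a CTC output is turned into text by merging repeated frame labels and then dropping blanks. The sketch below shows only that collapse rule on a hand-written frame sequence; the underscore used for the blank is a hypothetical stand-in.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank token

def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive duplicate labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]

frames = list("hh_e_ll_lo__")   # one label per audio frame
print("".join(ctc_collapse(frames)))
```

The blank between the two runs of "l" is what allows a genuine double letter to survive the merge step.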

BPE tokens: Byte-Pair Encoding sub-words. Whisper's multilingual models use a vocabulary of roughly 51,865 tokens (51,864 for the English-only models), including multilingual sub-words. Predicting tokens rather than phonemes reduces decoder steps and allows the same model to handle punctuation, capitalization, and multiple languages.

Real Benchmark: LibriSpeech

LibriSpeech (Panayotov et al., 2015) — 1,000 hours of audiobook speech — became the standard ASR benchmark. Human WER on its clean test set is ~5.8%. Whisper large-v2 achieves ~2.7% WER on clean and ~5.2% on other (noisier) test sets. Google's Universal Speech Model (USM, 2023) reports similar numbers on 300+ languages using 12 million hours of training audio.

Lesson 1 Quiz

What sample rate is standard for speech ASR systems, and why is a higher rate not necessary?

Correct. Nyquist's theorem requires sampling at twice the maximum frequency of interest. Speech intelligibility lives below 8 kHz, so 16 kHz sampling fully captures it. Higher rates waste compute without accuracy benefit for standard ASR.

Not quite. The key is Nyquist's theorem: you need double the highest frequency you care about. Speech intelligibility lives below 8 kHz, making 16 kHz the practical standard.

Why does Whisper use an 80-bin Mel filterbank rather than raw STFT magnitude bins?

Correct. The Mel scale compresses frequency bins according to perceptual logarithmic spacing, so the resulting features cluster naturally around phoneme-discriminating spectral patterns — giving the neural network a much better starting point.

The reason is perceptual. Mel spacing mirrors how the human ear groups frequencies; raw linear-frequency STFT bins waste resolution on high frequencies that carry little phoneme information.

OpenAI Whisper large-v2 achieved ~2.7% WER on LibriSpeech clean primarily because of:

Correct. OpenAI's own paper highlighted that scale of training data — not architectural novelty — was the key driver. Whisper used a standard encoder-decoder transformer trained on massive weakly-labeled internet audio.

The breakthrough was data scale. Whisper used a fairly standard transformer architecture but trained it on 680,000 hours of multilingual audio — an unprecedented amount for a publicly released ASR model.

What does CTC (Connectionist Temporal Classification) solve in ASR training?

Correct. Before CTC (Graves et al., 2006), ASR training required manually aligning each audio frame to its phoneme label — expensive and error-prone. CTC's blank token lets the model learn alignment implicitly from input-output pairs alone.

CTC's core contribution is removing the need for explicit temporal alignment. It introduces a "blank" output token so the model can be trained end-to-end from (audio, transcript) pairs without frame-level annotation.

Lab 1 — ASR Pipeline Dissection

Discuss the acoustic pipeline with your AI lab partner

Your Mission

You are working with an AI tutor that specializes in ASR internals. Explore how the pipeline stages connect, why each design choice was made, and what happens when things go wrong.

Starter questions: "Walk me through what happens to a 30-second audio clip inside Whisper." · "Why would a VAD failure break the whole pipeline?" · "What's the difference between a phoneme-based and token-based decoder?"

ASR Pipeline Tutor

Welcome to Lab 1. I'm here to help you understand what's happening inside an ASR system — from the moment audio hits the microphone to the moment text appears on screen. What part of the pipeline would you like to dig into first?

Module 2 · Lesson 2

Transformer-Based ASR: Whisper, Wav2Vec, and USM

How self-supervised and weakly-supervised learning rewrote the ASR record books

Why did transformer-based models suddenly obsolete decades of HMM-GMM engineering?

Facebook AI Research released wav2vec 2.0 and showed that fine-tuning on just 10 minutes of labeled English speech could achieve 4.8% WER on LibriSpeech clean — a result that would have required thousands of hours of labeled data five years earlier. The trick: pre-training on 53,000 hours of unlabeled audio using a contrastive self-supervised objective, learning rich speech representations before seeing a single transcript.

The Pre-Deep-Learning Baseline: HMM-GMMs

For three decades, the dominant ASR architecture combined Hidden Markov Models (HMMs) with Gaussian Mixture Models (GMMs). HMMs modeled the temporal dynamics of speech (phonemes transition probabilistically), while GMMs estimated the acoustic likelihood of each HMM state. A separate n-gram language model re-scored beam search hypotheses. The pipeline worked but required careful hand-engineering: pronunciation dictionaries, phoneme sets, speaker adaptation, and separate acoustic/language model training.

Deep Neural Network acoustic models (DNN-HMMs) arrived around 2011–2012, replacing GMMs with small feedforward nets. The hybrid DNN-HMM dominated commercial ASR through 2018. Google's voice search used DNN-HMMs at scale from approximately 2012 onward.

End-to-End Architectures: CTC and Attention

The first truly end-to-end ASR models used either CTC loss (Graves et al., 2006; applied to speech recognition with deep recurrent networks by Graves et al., 2013, and scaled up in Baidu's "Deep Speech", Hannun et al., 2014) or attention-based encoder-decoder models (Chan et al., 2016, "Listen, Attend and Spell"). Baidu Research's Deep Speech 2 (2015) showed that a deep RNN trained with CTC on 11,940 hours of English and 9,400 hours of Mandarin could outperform the HMM-GMM pipeline without pronunciation dictionaries, a landmark result that shifted the field.

Transformers replaced RNNs starting around 2019. Self-attention over the full sequence outperformed recurrent models because it directly models long-range dependencies — crucial when a late word in a sentence disambiguates an earlier phoneme ambiguity ("I'll have the scone" vs. "I'll have this cone").

Three Landmark Models Compared

Meta AI · 2020

wav2vec 2.0

Self-supervised contrastive pre-training on unlabeled audio. Learns discrete speech units via quantization. Fine-tunes to near-SOTA with minutes of labeled data. Key innovation: self-supervised pre-training on raw waveforms.

OpenAI · 2022

Whisper

Weakly-supervised training on 680K hours of internet audio with noisy transcripts. Encoder-decoder transformer with log-Mel spectrogram input. Trained jointly on transcription, translation, language ID, and timestamps.

Google · 2023

USM (Universal Speech Model)

12 million hours of audio across 300+ languages. Two-stage training: self-supervised pre-training, then supervised fine-tuning. Achieves strong performance even on very low-resource languages with only a few hundred hours of labeled data.

Self-Supervised vs. Weakly-Supervised ASR

Self-supervised learning (SSL) — used by wav2vec 2.0, HuBERT, and WavLM — pre-trains on raw unlabeled audio by masking portions of the input and predicting discrete targets. The model never sees a transcript during pre-training. Fine-tuning on small labeled datasets then achieves remarkable WER. This approach is ideal when transcribed speech is scarce (low-resource languages, specialized domains).

Weakly-supervised learning — Whisper's approach — trains directly on (audio, transcript) pairs scraped from the internet, even if transcript quality is imperfect. The noise is averaged out at scale. Whisper requires no fine-tuning step for general English; it generalizes zero-shot to new domains because the training data is so diverse.

The practical tradeoff: SSL models like wav2vec 2.0 are better when you have a specific domain with limited labeled data (medical, legal). Weakly-supervised large models like Whisper are better for general use without fine-tuning.

Deep Speech 2 — The Turning Point

Baidu's Deep Speech 2 paper (Amodei et al., 2015) reported that on Mandarin Chinese, the system outperformed human transcription in noisy environments, the first credible claim of superhuman ASR in any language. The system used 9 recurrent layers with batch normalization and was trained on 11,940 hours of English and 9,400 hours of Mandarin. Critically, it used no pronunciation dictionary, no hand-crafted phoneme sets, and no HMM, demonstrating that purely data-driven end-to-end learning was viable at scale.

Contrastive Loss: In wav2vec 2.0, the model masks spans of the latent speech representations before they enter the transformer and, for each masked position, must identify the correct quantized unit among distractors sampled from other masked positions. This forces the encoder to learn phoneme-discriminative features without any labels.
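The objective can be sketched as a softmax over similarities between a context vector and a set of candidate units. This is a simplified, pure-Python illustration on two-dimensional toy vectors; the temperature of 0.1 and the vector values are assumptions, and the real model adds further loss terms not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among target + distractors."""
    logits = [cosine(context, q) / temperature for q in [target] + distractors]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

context = [1.0, 0.0]
loss_good = contrastive_loss(context, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_bad = contrastive_loss(context, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(loss_good < loss_bad)
```

The loss is near zero when the context vector already points at the true unit and large when a distractor is a better match, which is the pressure that shapes the encoder during pre-training.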

Beam Search: A decoding algorithm that maintains the top-k hypotheses at each step rather than greedily selecting the single best token. Wider beams find better transcripts but increase latency, a critical tradeoff for real-time ASR.
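A toy example shows why keeping more than one hypothesis can help. The probability table below is entirely hypothetical and stands in for a real model's next-token distribution; it is built so that the locally best first word leads to a worse overall transcript.

```python
import math

# Hypothetical toy model: next-token probabilities conditioned on the prefix.
TOY_MODEL = {
    (): {"this": 0.6, "the": 0.4},
    ("this",): {"cone": 0.55, "scone": 0.45},
    ("the",): {"scone": 0.9, "cone": 0.1},
}

def beam_search(model, n_steps, beam_width):
    """Return the surviving (prefix, log-probability) pairs, best first."""
    beams = [((), 0.0)]
    for _ in range(n_steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the k best hypotheses
    return beams

greedy = beam_search(TOY_MODEL, n_steps=2, beam_width=1)[0][0]
wide = beam_search(TOY_MODEL, n_steps=2, beam_width=2)[0][0]
print(greedy, wide)
```

With a beam of 1 the search commits to "this" and ends at probability 0.33; with a beam of 2 it keeps "the" alive and finds the 0.36 path. Every extra hypothesis costs another model evaluation per step, which is where the latency penalty comes from.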

Speaker Diarization: The task of segmenting audio by speaker identity ("who spoke when"). Modern pipelines combine ASR with diarization to produce speaker-labeled transcripts, which is essential for meeting summarization and call-center analytics.

Lesson 2 Quiz

What was the key innovation that allowed wav2vec 2.0 to achieve ~4.8% WER with only 10 minutes of labeled speech?

Correct. wav2vec 2.0 learned rich speech representations from unlabeled audio via a contrastive objective before fine-tuning. This "pre-train then fine-tune" approach is why it needed so little labeled data.

The key was self-supervised pre-training on raw unlabeled audio. The model learned to distinguish correct speech units from distractors without any transcripts, then only needed a tiny labeled set to adapt.

Baidu's Deep Speech 2 (2015) was significant primarily because it:

Correct. Deep Speech 2 removed all hand-crafted linguistic components — no phoneme sets, no pronunciation dictionaries, no HMMs — and still outperformed human transcription on noisy Mandarin, proving the viability of purely data-driven end-to-end ASR.

Deep Speech 2's landmark contribution was demonstrating that purely data-driven end-to-end ASR could surpass human performance on noisy Mandarin without any hand-crafted linguistic knowledge — a decisive argument against HMM-GMM systems.

When would you prefer a self-supervised model like wav2vec 2.0 over Whisper's weakly-supervised approach?

Correct. SSL models shine when labeled data is scarce. Pre-trained representations can be fine-tuned on hundreds of labeled examples in a specialized domain (radiology dictation, courtroom transcripts) to achieve accuracy that Whisper's zero-shot performance can't match.

The SSL advantage is in low-resource fine-tuning. If you have a specialized domain (medical, legal) and can only afford to label a few hundred examples, pre-training on unlabeled audio first extracts enormously useful general representations.

Google's Universal Speech Model (USM) is trained on 12 million hours of audio primarily to:

Correct. Scale enables coverage. By training on 12 million hours spanning 300+ languages, USM can bootstrap reasonable performance even for languages with minimal labeled data — a critical capability for global products like Google Translate and YouTube captions.

USM's massive scale enables multilingual breadth. Training on 300+ languages at 12 million hours means that even very low-resource languages benefit from representations learned in related high-resource languages.

Lab 2 — Model Architecture Comparison

Compare ASR architectures with your AI lab partner

Your Mission

Your AI tutor is an ASR researcher. Probe the differences between HMM-GMM, end-to-end CTC, and transformer-based systems. Challenge assumptions about when one approach beats another.

Try: "Why didn't we just use transformers from the start?" · "What does wav2vec 2.0 actually learn during pre-training?" · "Explain beam search like I'm deploying this in production tomorrow."

ASR Architecture Tutor

Ready to dig into ASR architectures. Whether you want to understand why HMMs dominated for 30 years or how self-supervised learning changed everything, ask away. What's your starting question?

Module 2 · Lesson 3

WER, Accuracy Gaps, and Real-World Failure Modes

Why near-human benchmark scores don't mean near-human real-world performance

If Whisper achieves 2.7% WER on LibriSpeech, why does it still misfire on accented speech, names, and domain jargon?

A Stanford HAI study published in March 2023 tested Whisper and five other leading ASR systems on voice data from Black Americans, finding WERs 2× to 3× higher than on standard test sets. For speakers with heavy African-American Vernacular English (AAVE) features, Whisper large-v2 produced WER exceeding 30%, far above its published 2.7% benchmark. The disparity was not unique to any single model; it reflected the demographics of the training corpora, which overrepresent formal broadcast English.

WER Is Not the Whole Story

Word Error Rate counts word-level substitutions, deletions, and insertions against a reference transcript. It is a necessary benchmark but conceals several critical failure dimensions. A system with 5% average WER can still have 25% WER on one accent group while achieving 2% on another — the average hides the disparity.
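The averaging effect is easy to reproduce with made-up numbers. The error and word counts below are hypothetical, chosen only so that one group sits at 2% WER, the other at 25%, and the pooled figure lands near 5%.

```python
def per_group_wer(groups):
    """groups maps a group name to (error_count, reference_word_count)."""
    return {name: errors / words for name, (errors, words) in groups.items()}

def aggregate_wer(groups):
    """Pooled WER across all groups, weighted by reference word count."""
    total_errors = sum(errors for errors, _ in groups.values())
    total_words = sum(words for _, words in groups.values())
    return total_errors / total_words

# Hypothetical evaluation set: group_b supplies only 13% of the words.
counts = {"group_a": (174, 8700), "group_b": (325, 1300)}
print(per_group_wer(counts), aggregate_wer(counts))
```

Because the smaller group contributes few words, its 25% error rate barely moves the pooled number, which is why per-group reporting is needed alongside the headline WER.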

Additional limitations of WER as a sole metric:

Equal weight to all words: "The meeting is at three" and "The meeting is at free" differ by one word out of five, a 20% WER. Misrecognizing "Smith" as "Sith" in a legal transcript carries the same numeric cost but a vastly different practical consequence.

No semantic weighting: Deleting a filler word ("um") and deleting the name of a drug in a prescription are identical in WER.

No measurement of downstream harm: In voice-activated medical devices, wrong words cause adverse events. WER doesn't capture this.

The Big Four Failure Modes

Failure Mode 1

Accent & Dialect Bias

Training corpora skew toward broadcast English, standard American accents, and high-resource languages. Non-native speakers, regional dialects (Scottish, Nigerian, Indian English), and AAVE consistently see higher WER — sometimes 3–5× the baseline.

Failure Mode 2

Out-of-Vocabulary Terms

New proper nouns, product names, and domain jargon that didn't appear in training data are routinely mangled. "Ozempic" → "Ozambic." "ChatGPT" → "chat GPT." Medical brand names, legal terminology, and technical acronyms are especially vulnerable.

Failure Mode 3

Noise & Channel Degradation

WER degrades non-linearly with noise. At 0 dB SNR (signal power equals noise power), even Whisper large-v2 WER jumps from 3% to 40%+. Telephone-quality audio (8 kHz, codec artifacts), far-field microphones, and reverberant rooms all systematically degrade accuracy.

Failure Mode 4

Hallucination

Whisper occasionally generates plausible-sounding text that was never spoken — a known artifact of the decoder generating tokens based on learned priors. OpenAI's own documentation warns against using Whisper in high-stakes applications without verification because of hallucinated segments.

Whisper Hallucination: A Documented Problem

In 2023, multiple independent researchers documented Whisper producing entire fabricated sentences on silent or near-silent audio segments. A paper by Koenecke et al. (2024) found that Whisper sometimes inserted content warnings, disclaimers, or entirely invented phrases into long pauses. The root cause: the decoder is trained to always produce output, and its language model priors fill silence with statistically likely completions. This is categorically different from a DNN-HMM system, which would simply produce no output in silence.

For medical and legal transcription — two of the highest-growth ASR application areas — hallucination is not acceptable. Production deployments in these domains either use hallucination-detection post-processing or fall back to specialized models (Nuance DAX for medical, for example) that accept higher WER in exchange for zero hallucination on critical terms.
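One possible post-processing heuristic, offered here as an assumption rather than a description of any vendor's method, is to cross-check ASR segments against an independent VAD: text emitted where the VAD heard little or no speech is flagged for review. The tuple layouts and the 0.5 threshold are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def flag_suspect_segments(asr_segments, speech_intervals, min_speech_fraction=0.5):
    """Return ASR segments whose time span contains too little detected speech.

    asr_segments: list of (start_s, end_s, text)
    speech_intervals: non-overlapping list of (start_s, end_s) from a VAD
    """
    suspects = []
    for start, end, text in asr_segments:
        duration = end - start
        if duration <= 0:
            continue
        speech = sum(overlap(start, end, s, e) for s, e in speech_intervals)
        if speech / duration < min_speech_fraction:
            suspects.append((start, end, text))
    return suspects

segments = [(0.0, 2.0, "hello there"), (2.0, 6.0, "thanks for watching")]
vad_speech = [(0.1, 1.9)]
print(flag_suspect_segments(segments, vad_speech))
```

A check like this only catches text produced over silence; fabricated words inside genuine speech need other signals, such as decoder confidence or a second recognizer.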

Domain Adaptation Strategies

Fine-tuning on domain-specific labeled audio remains the most reliable adaptation method. A hospital that fine-tunes Whisper medium on 200 hours of physician dictation typically reduces medical-term WER by 40–60%.

Custom language model integration, meaning a domain vocabulary or a custom n-gram LM, works with hybrid systems. For end-to-end models like Whisper, shallow fusion (adding weighted LM log-probabilities to the decoder's token log-probabilities during beam search) or rescoring with a medical LM at the hypothesis level can boost domain accuracy without full fine-tuning.
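Shallow fusion reduces to a weighted sum of log-probabilities at each decoding step. The sketch below applies it to a single step with two whole-word candidates; the probabilities, the LM weight of 0.3, and the floor for unseen tokens are all illustrative assumptions.

```python
import math

def shallow_fusion(asr_log_probs, lm_log_probs, lm_weight=0.3, floor=-20.0):
    """Score each candidate as log p_asr + lm_weight * log p_lm."""
    return {
        token: log_prob + lm_weight * lm_log_probs.get(token, floor)
        for token, log_prob in asr_log_probs.items()
    }

def best_token(scores):
    """Candidate with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical single decoding step: the ASR model slightly prefers the wrong word.
asr = {"ozambic": math.log(0.55), "ozempic": math.log(0.45)}
medical_lm = {"ozambic": math.log(0.01), "ozempic": math.log(0.99)}

fused = shallow_fusion(asr, medical_lm)
print(best_token(asr), best_token(fused))
```

The LM weight controls how far the domain model can override the acoustics; set it too high and the system starts hearing domain terms that were never spoken.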

Hotword boosting is supported by streaming APIs (Google Speech-to-Text, AWS Transcribe): a list of domain terms is upweighted in the decoder. This is faster to deploy than fine-tuning and effective for proper nouns, but can introduce false positives (the system hears the term where another word was spoken).

The 2016 Microsoft Milestone — and Its Caveat

In October 2016, Microsoft announced that its ASR system achieved 5.9% WER on the Switchboard conversational speech benchmark, matching what the company measured as human transcriber performance. Headlines read "AI beats humans at speech recognition." The caveat: the human baseline came from professional transcribers working on telephone-quality English from a narrow demographic sample. It says nothing about accented speech, noisy environments, or specialized domains. The "human parity" framing misled product teams who deployed the models to contexts far outside the benchmark distribution.

SNR: Signal-to-Noise Ratio in decibels. 30 dB SNR is a quiet office; 10 dB is a busy café; 0 dB is signal and noise at equal power. Most published WER benchmarks are measured at ≥20 dB SNR, far cleaner than real-world deployments.
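SNR in decibels is 10 · log10(signal power / noise power). The sketch below computes it for two sample lists and derives the gain needed to mix noise at a chosen SNR, a common way to build noisy test sets; the function names and sample values are illustrative assumptions.

```python
import math

def power(samples):
    """Mean squared amplitude."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(power(signal) / power(noise))

def noise_gain_for_snr(signal, noise, target_snr_db):
    """Amplitude factor to apply to `noise` so the mix reaches the target SNR."""
    return 10.0 ** ((snr_db(signal, noise) - target_snr_db) / 20.0)

signal = [1.0, -1.0] * 8
noise = [0.1, -0.1] * 8
gain = noise_gain_for_snr(signal, noise, target_snr_db=0.0)
scaled_noise = [gain * n for n in noise]
print(round(snr_db(signal, noise), 3), round(snr_db(signal, scaled_noise), 3))
```

The gain uses a divisor of 20 rather than 10 because it scales amplitude, and power grows with the square of amplitude.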

Hallucination (ASR): Tokens generated by the ASR decoder that were not present in the input audio. Distinguished from substitution errors (which replace real words with wrong words), hallucination inserts entirely invented content, often during silence or very low SNR.

Lesson 3 Quiz

The 2023 Stanford study on Whisper and AAVE found WER rates 2–3× higher than published benchmarks. What is the primary structural cause?

Correct. The model learned from data that skews demographically toward formal, broadcast, standard-variety English. Dialectal features not well-represented in training data produce higher error rates — this is a data distribution problem, not a fundamental acoustic one.

The root cause is training data demographics. The internet audio Whisper was trained on skews heavily toward formal broadcast English. Dialectal features of AAVE are underrepresented, so the model's learned acoustic-to-token mappings generalize poorly.

Whisper hallucination during silence is best explained by:

Correct. End-to-end decoder models are trained to always produce output. In the absence of strong acoustic signal, the decoder falls back on its internal language model — generating text it "expects" to see, which may have nothing to do with the actual audio.

The mechanism is the decoder's prior. End-to-end models are trained to produce tokens; when acoustic signal is weak, the decoder uses its learned language model to generate plausible continuations rather than remaining silent.

Hotword boosting in ASR APIs (Google, AWS) addresses which failure mode most directly?

Correct. Hotword boosting increases the decoder's probability weight for specified terms, making the system more likely to recognize product names, medical brands, or personal names it has never encountered in training. It's fast to deploy but doesn't fix accent bias or noise problems.

Hotword boosting specifically targets OOV (out-of-vocabulary) terms. By explicitly telling the decoder to favor certain words or phrases, you compensate for their absence from the training distribution — a lightweight alternative to full domain fine-tuning.

Microsoft's October 2016 "human parity" claim on Switchboard benchmark was misleading because:

Correct. Switchboard's test conditions — professional transcribers, clear telephone audio, mostly standard American English — are far removed from real-world ASR deployment scenarios. Equating benchmark performance with real-world capability misled many product teams.

The benchmark gap is the issue. Switchboard tests a very narrow slice of the speech recognition problem. "Human parity" on Switchboard does not translate to human parity on accented speech, noisy environments, medical jargon, or any other real-world distribution shift.

Lab 3 — WER Failure Mode Analyst

Diagnose real-world ASR failures with your AI lab partner

Your Mission

Your AI tutor plays the role of an ASR quality engineer. Bring it real or hypothetical transcription failures and work through root-cause analysis: is it accent bias, OOV terms, noise, or hallucination? Then discuss what mitigation applies.

Try: "My ASR system keeps mishearing 'Ozempic' as 'Ozambic' — is that OOV or noise?" · "How do I detect hallucinations in production?" · "A client reports 22% WER on Indian-accented English. What do I do?"

ASR Failure Mode Analyst

I'm your ASR quality engineer today. Describe a transcription failure you're seeing — or a hypothetical deployment scenario — and we'll work through what's causing it and what to do about it. What failure are we diagnosing?

Module 2 · Lesson 4

Streaming ASR, Latency, and Production Architecture

How real-time constraints reshape every design decision in ASR pipelines

Why does a model that achieves 2% WER in batch mode often fail in real-time streaming contexts?

At Google I/O 2019, Google demonstrated Live Caption — on-device real-time captions for any audio on an Android phone, running entirely offline in under 80 ms latency using a model compressed to fit in 80 MB. The feat required a completely different architecture than Whisper: a streaming RNN-T (Recurrent Neural Network Transducer) that processed audio in 80 ms chunks, with no future context, rather than attending to full 30-second windows. The demo was live and unrehearsed on stage. It worked.

The Fundamental Streaming Constraint

Whisper and most high-accuracy ASR models are offline systems: they ingest a complete audio segment (up to 30 seconds), run the full encoder over it, then decode. This requires the complete audio to be available before any transcription begins. Latency is the audio duration plus inference time — unacceptable for live captioning, real-time voice assistants, or call-center analytics.

Streaming ASR must produce partial transcripts incrementally while audio is still being received. Every design decision — model architecture, context window, beam search width, vocabulary size — is constrained by a target latency budget, typically 100–300 ms for natural conversation.

RNN-T: The Streaming Architecture

The RNN Transducer (RNN-T), introduced by Graves (2012) and scaled up by Google (2019, 2021), is the dominant architecture for streaming ASR. Unlike CTC, which assumes outputs are conditionally independent given the input, RNN-T uses a prediction network (analogous to a language model) that conditions on previously emitted tokens. This yields higher accuracy than CTC at comparable latency budgets.

RNN-T's key property: it produces output frame by frame as audio arrives, emitting zero or more tokens for each frame before a blank advances it to the next one. There is no fixed output length and no need to wait for future audio. The joint network combines the acoustic encoder state with the prediction network state to produce the output distribution at each frame. Google's production Live Caption model is an 80 ms chunk RNN-T; Google's larger production ASR is a conformer-based RNN-T.
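The frame-by-frame emission loop can be sketched as greedy RNN-T decoding. In the sketch below the encoder, prediction network, and joint network are replaced by a hypothetical `toy_joint` function so the control flow is runnable; the per-frame symbol cap is also an assumed safeguard.

```python
BLANK = None  # stand-in for the RNN-T blank symbol

def rnnt_greedy_decode(encoder_frames, joint, max_symbols_per_frame=5):
    """Greedy decoding: keep emitting for a frame until the joint returns blank.

    joint(frame, emitted_prefix) returns the most likely token or BLANK.
    """
    emitted = []
    for frame in encoder_frames:
        for _ in range(max_symbols_per_frame):
            token = joint(frame, tuple(emitted))
            if token is BLANK:
                break  # blank means: move on to the next audio frame
            emitted.append(token)
    return emitted

def toy_joint(frame, prefix):
    """Hypothetical joint: emit the frame's label unless it was just emitted."""
    if frame and (not prefix or prefix[-1] != frame):
        return frame
    return BLANK

frames = ["h", "h", "", "i", ""]   # toy per-frame acoustic evidence
print(rnnt_greedy_decode(frames, toy_joint))
```

The decision at each frame depends on the tokens already emitted, which is the conditioning that CTC lacks, and nothing in the loop looks at frames that have not yet arrived.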

Conformer: Convolution Meets Attention

Transformers capture long-range dependencies well but miss local acoustic patterns. CNNs capture local structure efficiently but have bounded receptive fields. The Conformer (Gulati et al., Google, 2020) interleaves multi-head self-attention with convolution modules in each block, achieving best-of-both: long-range context from attention and local feature extraction from convolution. Conformer-based encoders achieved new SOTA on LibriSpeech (1.9%/3.9% clean/other WER) and are now widely used in production streaming systems.

The Latency–Accuracy Tradeoff

Every streaming ASR system navigates a fundamental tradeoff: more future context = higher accuracy but higher latency. A word's pronunciation often depends on following context (coarticulation, prosody). "I'll have the scone" — the final word disambiguates earlier phoneme decisions. Using no future context at all (causal streaming) maximizes responsiveness but loses this disambiguation.

Production systems address this via chunk-based streaming with limited lookahead: process 80–160 ms of audio at a time, with an optional 80–160 ms lookahead buffer. This provides partial future context without the full-window wait. AWS Transcribe's real-time API uses ~300 ms processing windows with partial-result streaming. Google's Dialogflow CX uses configurable interim result latency.
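The latency cost of lookahead can be sketched with simple arithmetic and a chunking generator. The function names and the 20 ms compute time are hypothetical; the model is a worst-case bound for audio at the start of a chunk, not a measurement of any service.

```python
def offline_latency_ms(audio_ms, inference_ms):
    """Offline systems wait for the whole clip, then run inference."""
    return audio_ms + inference_ms

def streaming_latency_ms(chunk_ms, lookahead_ms, compute_ms):
    """Worst-case delay for audio at the start of a chunk."""
    return chunk_ms + lookahead_ms + compute_ms

def chunk_with_lookahead(samples, chunk_len, lookahead_len):
    """Yield (chunk, lookahead) pairs; the lookahead shrinks near the end."""
    for start in range(0, len(samples), chunk_len):
        end = start + chunk_len
        yield samples[start:end], samples[end:end + lookahead_len]

print(offline_latency_ms(30_000, 1_000), streaming_latency_ms(80, 80, 20))
print(list(chunk_with_lookahead(list(range(10)), chunk_len=4, lookahead_len=2)))
```

Doubling the lookahead buys more right-hand context for every decision but adds the same number of milliseconds to every partial result.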

Production Architecture: Key Components

Component

Streaming VAD

Detects speech start/end in real time to trigger and terminate the ASR session. Silero VAD runs in ~1 ms per 30 ms frame, enabling sub-100 ms speech detection latency. Critical for barge-in detection in voice assistants.

Component

Endpointing

Determines when the user has finished speaking (vs. taking a breath). Naive energy-threshold endpointing cuts off mid-sentence; learned endpointing uses prosodic and LM cues to recognize sentence completion. Google's endpointer is a separate LSTM trained on human-judged utterance boundaries.
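The naive approach amounts to counting trailing non-speech frames. The sketch below takes per-frame VAD flags and declares an endpoint after a fixed silence duration; the 30 ms frame size and both thresholds are assumed values, and learned endpointers replace this rule with a model.

```python
def find_endpoint(speech_flags, frame_ms=30, min_trailing_silence_ms=600):
    """Index of the frame where the utterance is declared finished, or None."""
    needed = -(-min_trailing_silence_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None

# Speech, a 150 ms breath pause, more speech, then a long silence.
flags = [False] * 3 + [True] * 10 + [False] * 5 + [True] * 4 + [False] * 25
print(find_endpoint(flags), find_endpoint(flags, min_trailing_silence_ms=120))
```

With a 600 ms threshold the breath pause is ignored and the endpoint falls in the final silence; with 120 ms the same rule fires during the pause and cuts the speaker off mid-sentence.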

Component

Inverse Text Normalization

Converts spoken-form output ("three hundred and forty two dollars") to written form ("$342"). Also handles dates, times, acronyms, and punctuation insertion. Often a separate model or finite-state transducer running post-ASR.
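A rule-based sketch of the dollar-amount case is shown below. It handles only whole numbers below one million and a trailing "dollar(s)", and every name in it is hypothetical; production systems use far larger grammars or learned models.

```python
_SMALL = ("zero one two three four five six seven eight nine ten eleven twelve "
          "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
UNITS = {word: value for value, word in enumerate(_SMALL)}
TENS = {word: 10 * (i + 2) for i, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}

def words_to_number(words):
    """Convert spoken number words (below one million) to an integer."""
    total, current = 0, 0
    for word in words:
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"not a number word: {word}")
    return total + current

def normalize_dollars(text):
    """Rewrite '<number words> dollars' as '$<digits>'; otherwise return text."""
    words = text.lower().replace("-", " ").split()
    if len(words) > 1 and words[-1] in ("dollar", "dollars"):
        try:
            return "$" + str(words_to_number(words[:-1]))
        except ValueError:
            return text
    return text

print(normalize_dollars("three hundred and forty two dollars"))
```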

Component

Punctuation & Diarization

ASR models typically output unpunctuated lowercase text. A separate punctuation model restores sentence boundaries. Speaker diarization runs in parallel to label turns — essential for meeting transcripts. Both add latency and are often run post-utterance rather than frame-by-frame.

On-Device ASR: Apple's Path

Apple has shipped on-device ASR since iOS 16 (2022). The on-device model — an encoder-decoder with ~100M parameters — runs entirely on the Neural Engine, never sending audio to Apple servers. In iOS 17, Apple extended on-device ASR to support real-time dictation in all supported languages. The accuracy tradeoff versus server-side Whisper-class models is real but acceptable for most dictation use cases — and the privacy benefit is decisive for enterprise adoption.

WER in Practice: Production Benchmarks

System	Mode	LibriSpeech Clean WER	Conversational WER
Whisper large-v2	Offline	2.7%	~9–12%
Google USM	Offline	~2.5%	~8%
Google Conformer RNN-T	Streaming	4.2%	~11%
AWS Transcribe (2023)	Streaming	~5%	~12–15%
Apple On-Device (iOS 17)	Streaming	~6–8%	~15–18%

RNN-T: Recurrent Neural Network Transducer. An end-to-end sequence transduction model combining an acoustic encoder with an autoregressive prediction network via a joint network. The dominant architecture for production streaming ASR because it emits tokens frame-by-frame without requiring future context.

Endpointing: The process of detecting utterance boundaries in streaming ASR, that is, determining when a speaker has finished speaking. Poor endpointing causes either premature cutoffs (response too soon) or excessive silence tolerance (response too slow).

Conformer: A neural network block that combines multi-head self-attention (for long-range context) with depthwise convolution (for local feature extraction). Proposed by Google in 2020; now the dominant encoder architecture for both streaming and offline production ASR.

Lesson 4 Quiz

Streaming ASR, Latency, and Production Architecture

Why can't Whisper's offline encoder-decoder architecture be used directly for real-time streaming ASR?

Correct. Whisper's encoder attends over the full 30-second input window before producing any output. For real-time use you need a causal or chunk-based architecture that emits tokens while audio is still being received.

The architectural problem is the full-window encoder. Whisper must wait for the complete audio segment before it can run its encoder — so minimum latency is the audio's full duration, which is unacceptable for live interaction.

The RNN-T architecture's key advantage over CTC for streaming ASR is:

Correct. CTC assumes outputs are conditionally independent given the input — there's no internal language model. RNN-T's prediction network explicitly models the probability of the next token given all previous tokens, improving coherence and accuracy at comparable latency.

The distinction is the prediction network. CTC treats each output token as independent; RNN-T autoregressively conditions on prior tokens through a dedicated prediction network, effectively building in a language model that improves context-sensitive transcription.

What is the primary innovation of the Conformer encoder over a standard transformer encoder for ASR?

Correct. The Conformer insight (Gulati et al., 2020) was that speech has structure at multiple scales — long-range prosodic dependencies need attention, while short-range acoustic patterns (formant transitions, fricatives) need convolution. Interleaving both in each block outperforms either alone.

The Conformer's innovation is the combination of attention and convolution. Transformers alone miss fine-grained local acoustic patterns that convolution captures efficiently. The Conformer block applies both in sequence, achieving state-of-the-art on LibriSpeech in 2020.

Google Live Caption at Google I/O 2019 ran entirely on-device with under 80 ms latency. What architecture made this possible?

Correct. RNN-T's causal chunk-based design is what enables on-device streaming at this latency target. It processes each 80 ms frame as it arrives, emitting tokens immediately, with no need to buffer future audio.

The key is RNN-T streaming on-device. Google's Live Caption used a causal RNN-T model compressed to ~80 MB that processes 80 ms chunks as they arrive — no server round-trip, no future-context buffering, fully real-time on the device's neural processor.

Lab 4 — Streaming ASR System Design

Design a real-time ASR pipeline with your AI lab partner

Your Mission

Your AI tutor is a production ASR architect. Work through the design of a real-time ASR pipeline for a specific use case — a live captioning app, a call-center bot, or a voice assistant. Make concrete trade-off decisions about latency, accuracy, and cost.

Try: "I need to build a real-time captioning system for a TV broadcast. What architecture do I use?" · "How do I handle barge-in for a voice assistant?" · "Walk me through the trade-off between an 80 ms and a 300 ms lookahead window."

Production ASR Architect

Lab 4

I'm your production ASR architect. Tell me what you're building — a live captioning system, a call-center bot, a voice assistant, or something else — and we'll work through the architecture decisions together. Latency budget, accuracy targets, on-device vs. cloud, domain requirements — all of it. What's the use case?

Module 2 Test

Automatic Speech Recognition — 15 questions · Pass at 80%

1. What does the Mel filterbank accomplish in the ASR feature extraction stage?

Correct. The Mel filterbank applies triangular filters spaced on the Mel perceptual scale, compressing raw STFT output into features that align with human auditory perception, which makes them far more useful for phoneme classification.

The Mel filterbank's role is frequency-domain compression into perceptually meaningful bands. It doesn't handle ADC, VAD, or phoneme classification — those are other pipeline stages.
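As a small illustration of the perceptual spacing, the sketch below converts between Hz and Mel using one commonly used formula (2595 · log10(1 + f/700)) and computes filter center frequencies that are evenly spaced on the Mel scale. The function names and the 0 to 8000 Hz range are assumptions for the example; toolkits differ in the exact Mel variant they use.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """One common Hz-to-Mel mapping; other variants exist."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(
    n_filters: int, f_min: float = 0.0, f_max: float = 8000.0
) -> List[float]:
    """Center frequencies (Hz) of filters evenly spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]
```

Because the spacing is uniform in Mel rather than Hz, neighboring centers sit close together at low frequencies and spread apart at high frequencies, so the filterbank spends more resolution where human hearing is most discriminating.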

2. Whisper large-v2 achieves approximately what WER on the LibriSpeech clean test set?

Correct. Whisper large-v2 achieves ~2.7% WER on LibriSpeech clean — below the typical human transcriber rate of ~5.8% on this benchmark.

Whisper large-v2's published WER on LibriSpeech clean is ~2.7%, below typical human transcriber performance on this specific benchmark.

3. CTC (Connectionist Temporal Classification) was essential to early end-to-end ASR because it:

Correct. Before CTC, ASR training required a frame-level alignment between audio and phoneme labels, typically produced by forced alignment with a separately trained HMM-GMM system. CTC's blank token lets the model discover alignment implicitly.

CTC removed the need for explicit temporal alignment. Using a "blank" symbol, it marginalizes over all possible alignments between input frames and output tokens, enabling training directly from (audio, text) pairs.
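The marginalization over alignments can be shown by brute force on a tiny example. The sketch below collapses a frame-level path (merge repeats, then drop blanks) and sums the probability of every path that collapses to the target. The names `ctc_collapse` and `ctc_probability` and the `_` blank symbol are assumptions for illustration; real implementations compute the same sum with a dynamic-programming forward algorithm rather than enumeration.

```python
from itertools import product
from typing import Dict, Iterable, List

BLANK = "_"  # assumed blank symbol for this sketch


def ctc_collapse(path: Iterable[str]) -> str:
    """Merge consecutive repeated symbols, then remove blanks."""
    out: List[str] = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def ctc_probability(frame_probs: List[Dict[str, float]], target: str) -> float:
    """Sum the probabilities of all frame-level paths that collapse to target.

    frame_probs[t][s] is the model's probability of symbol s at frame t.
    Enumeration is exponential in the number of frames, so this is only
    usable for toy inputs.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if ctc_collapse(path) == target:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= frame_probs[t][symbol]
            total += p
    return total
```

With two frames that each assign 0.5 to `a` and 0.5 to blank, the target "a" is reached by three paths (`aa`, `a_`, `_a`), giving a total probability of 0.75. No frame-level alignment was ever supplied; the blank symbol lets every valid alignment contribute.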

4. wav2vec 2.0's self-supervised pre-training objective uses:

Correct. wav2vec 2.0 quantizes continuous latent representations into discrete units, masks spans of the encoded audio, and trains the model to identify the correct quantized unit for each masked span among K distractors — a contrastive self-supervised objective requiring no labels.

wav2vec 2.0 uses contrastive learning. It doesn't need labels; instead, it learns by distinguishing correct quantized speech representations from negative samples in masked positions of the encoded audio.
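The contrastive term can be sketched for a single masked position as a softmax over similarities, where the true quantized target competes against the distractors. This is a simplified illustration only: the function names, the use of plain Python lists for vectors, and the temperature of 0.1 are assumptions, and the sketch omits the quantizer and any auxiliary losses used in the full training objective.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(
    context: Sequence[float],
    target: Sequence[float],
    distractors: List[Sequence[float]],
    temperature: float = 0.1,  # assumed value for this sketch
) -> float:
    """Negative log-probability of picking the true target among distractors."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]
```

The loss is near zero when the context vector points at the true target and away from the distractors, and grows when a distractor is the better match. No transcript is involved at any point, which is what makes the objective self-supervised.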

5. Baidu's Deep Speech 2 (2015) was notable for demonstrating superhuman ASR performance on which language in noisy conditions?

Correct. Deep Speech 2 reported superhuman performance on noisy Mandarin — the first credible such claim for any ASR system — demonstrating that end-to-end neural ASR was viable for a major non-English language.

Deep Speech 2's headline result was on noisy Mandarin Chinese, where it outperformed human transcribers — a landmark achievement for non-English ASR.

6. Google's USM (Universal Speech Model) supports 300+ languages primarily through:

Correct. USM's scale — 12 million hours and two-stage training — enables cross-lingual transfer. Representations learned from high-resource languages benefit low-resource languages, achieving viable performance even for languages with minimal labeled data.

USM achieves multilingual coverage through massive-scale training. Its two-stage approach (self-supervised pre-training then supervised fine-tuning) across 12 million hours allows learned representations to transfer across linguistically related languages.

7. The 2023 Stanford study on AAVE and Whisper found WER for AAVE speakers to be approximately:

Correct. The Stanford study found 2–3× higher WER for AAVE speakers, with some speakers exceeding 30% WER on Whisper large-v2 — a dramatic divergence from its published 2.7% benchmark on LibriSpeech clean.

The disparity was 2–3× higher relative WER, with some AAVE speakers seeing WER above 30% on Whisper large-v2 despite its 2.7% LibriSpeech benchmark: a stark illustration of the benchmark-deployment gap.

8. ASR hallucination differs from a substitution error in that:

Correct. Substitution errors arise from acoustic ambiguity — the model heard something but mapped it to the wrong word. Hallucination is categorically different: the decoder generates text with no acoustic grounding, driven by its language model priors.

The distinction is grounding. A substitution error has acoustic grounding — something was spoken but misrecognized. Hallucination has no grounding — the decoder generates text from its priors in the absence of relevant acoustic signal.

9. Hotword boosting in commercial ASR APIs is most effective for addressing which problem?

Correct. Hotword boosting upweights specified terms in the decoder, compensating for their underrepresentation in training data. It's the fastest deployment path for OOV terms but doesn't address accent bias or noise robustness.

Hotword boosting specifically targets OOV terms by increasing their decoder probability weight. It's a lightweight alternative to full fine-tuning for product names, medical terms, or personal names absent from training.
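One way to picture the effect is as a score bonus for hypotheses that contain a boosted term. The sketch below rescores an n-best list after decoding; the function name `boost_hypotheses`, the additive log-score bonus, and the example scores are all assumptions, and this does not reproduce any vendor's API. Commercial systems typically apply the bias inside the decoder during search rather than afterwards.

```python
from typing import Iterable, List, Tuple


def boost_hypotheses(
    nbest: List[Tuple[str, float]],
    hotwords: Iterable[str],
    bonus: float = 2.0,  # assumed log-score bonus per matched word
) -> List[Tuple[str, float]]:
    """Add a bonus for each hotword occurrence, then re-rank best-first."""
    hot = {w.lower() for w in hotwords}
    rescored = []
    for text, log_score in nbest:
        hits = sum(1 for token in text.lower().split() if token in hot)
        rescored.append((text, log_score + bonus * hits))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

Given the hypotheses "take a tall enol" (score -4.0) and "take tylenol" (score -5.0), boosting "tylenol" lifts the second to -3.0 and makes it the top result. Post-hoc rescoring like this only helps if the correct term survived into the n-best list, which is one reason decoder-level biasing is preferred in practice.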

10. The Conformer encoder architecture (Google, 2020) achieves state-of-the-art ASR by:

Correct. The Conformer's insight was that speech understanding needs both global context (from attention) and local feature extraction (from convolution). Applying both in each block captures structure at multiple temporal scales.

The Conformer combines attention for long-range context with depthwise convolution for local acoustic patterns. Speech comprehension benefits from structure at both scales — prosodic patterns spanning seconds and fine-grained acoustic features spanning milliseconds.

11. Why does using a wider beam in beam search improve ASR accuracy but hurt real-time performance?

Correct. Beam search maintains the top-k candidate token sequences at each step. Each additional hypothesis requires computing full decoder scores, so latency grows roughly linearly with beam width, a direct tradeoff against accuracy.

Beam search cost scales with beam width. Each hypothesis requires running the decoder and scoring with the language model, so maintaining 10 hypotheses costs roughly 10× more compute than greedy decoding — directly increasing latency.
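The cost scaling can be seen in a toy beam search that counts decoder calls. In this sketch, `step_fn` stands in for the decoder (a hypothetical callable that maps a prefix to next-token log-probabilities), and `toy_step` is an invented distribution chosen so that greedy decoding misses the best sequence.

```python
import math
from typing import Callable, Dict, Tuple

Prefix = Tuple[str, ...]


def beam_search(
    step_fn: Callable[[Prefix], Dict[str, float]],
    beam_width: int,
    max_len: int,
) -> Tuple[Prefix, int]:
    """Return the best sequence found and the number of step_fn calls made."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    calls = 0
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            calls += 1  # one decoder evaluation per surviving hypothesis
            for token, log_prob in step_fn(prefix).items():
                candidates.append((prefix + (token,), score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0], calls


def toy_step(prefix: Prefix) -> Dict[str, float]:
    """Invented distribution: the likelier first token leads to a worse path."""
    if prefix == ():
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[0] == "a":
        return {"x": math.log(0.5), "y": math.log(0.5)}
    return {"x": math.log(0.9), "y": math.log(0.1)}
```

With `beam_width=1` (greedy), the search commits to `a` and ends with a sequence of probability 0.30 after 2 decoder calls. With `beam_width=2`, it keeps `b` alive and finds `("b", "x")` at probability 0.36, at the price of 3 calls. After the first step, every step costs one decoder call per hypothesis in the beam.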

12. RNN-T's prediction network gives it an advantage over CTC streaming ASR by:

Correct. The prediction network autoregressively conditions on prior outputs, giving RNN-T an implicit language model that CTC's conditionally-independent outputs cannot match — producing more contextually coherent transcripts at the same latency.

The prediction network is RNN-T's key advantage. Unlike CTC, which treats each output as independent, RNN-T's prediction network maintains an autoregressive language model, improving transcription of contextually ambiguous acoustics in real time.

13. Inverse Text Normalization (ITN) in a production ASR pipeline converts:

Correct. ASR models output spoken-form text. ITN post-processing converts spoken forms to written conventions — critical for downstream applications like document creation, named entity extraction, and UI display.

ITN converts spoken forms to written conventions. ASR models output what was said ("three hundred and forty two dollars") but users and downstream systems expect written form ("$342") — ITN handles this conversion.

14. Apple's on-device ASR (iOS 16+) offers what primary advantage over server-side models, despite some accuracy tradeoff?

Correct. For enterprises, healthcare, and users in regulated environments, the guarantee that audio never traverses a network is often more important than marginal WER improvements from server-side models. Privacy enables deployment in contexts where cloud ASR is contractually prohibited.

The decisive advantage is privacy. When audio never leaves the device, ASR becomes viable in medical, legal, and enterprise environments where data sovereignty and confidentiality requirements prohibit sending audio to external servers.

15. Microsoft's 2016 "human parity" claim on the Switchboard benchmark was critiqued because Switchboard:

Correct. Switchboard's distribution — professional transcribers, clean telephone audio, standard American English speakers — represents a tiny slice of real-world ASR deployment conditions. "Human parity" on this benchmark does not generalize, and product teams who assumed it did faced harsh surprises in deployment.

The benchmark-deployment gap is the issue. Switchboard is a narrow, controlled benchmark. Real-world deployment involves accented speech, background noise, domain jargon, and diverse speakers — none of which Switchboard adequately represents.