Module 3 · Lesson 1

From Spoken Word to Machine Understanding

The acoustic journey — how a voice assistant hears, transcribes, and begins to parse what you actually mean.

What happens in the fraction of a second between your voice and the assistant's reply?

In April 2010, Apple acquired Siri Inc. for a reported $200 million. About eighteen months later, on October 4, 2011 (the day before Steve Jobs died), Apple introduced Siri to the world on the iPhone 4S. For the first time, tens of millions of consumers could speak naturally to a pocket device and receive spoken replies. The demonstration was imperfect and often mocked, but it irrevocably shifted consumer expectations about what a phone should do.

The Voice Pipeline — Five Stages

Every voice assistant, regardless of brand, processes speech through a shared conceptual pipeline. Understanding each stage is essential for anyone designing voice-enabled products.

Stage 01

Wake-Word Detection

A small, always-on model listens locally for a trigger phrase ("Hey Siri," "Alexa," "OK Google") without streaming audio to the cloud.

Stage 02

Automatic Speech Recognition

ASR converts the audio waveform to a text transcript. Modern systems use deep neural networks trained on thousands of hours of speech.

Stage 03

Natural Language Understanding

NLU extracts intent ("play music") and entities ("The Beatles," "shuffle") from the raw transcript.

Stage 04

Dialog Management

The dialog manager decides whether to fulfill, clarify, or escalate the request based on context and conversation state.

Stage 05

Text-to-Speech

TTS synthesizes a natural-sounding spoken reply. Neural TTS (WaveNet, Tacotron) replaced robotic concatenative synthesis after 2016.
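The five stages above can be read as a chain of functions, each consuming the previous stage's output. The following is a minimal sketch, not any vendor's implementation: every function name is hypothetical, plain strings stand in for audio, and each stage is a stub that only illustrates where the real model would sit.

```python
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    intent: str
    entities: dict = field(default_factory=dict)


def detect_wake_word(audio: str) -> bool:
    # Stage 1 (stub): a real detector scores audio frames on-device.
    return audio.lower().startswith("alexa")


def transcribe(audio: str) -> str:
    # Stage 2 (stub): the "audio" is already text, so ASR only strips the wake word.
    return audio[len("alexa"):].strip(" ,")


def understand(transcript: str) -> Interpretation:
    # Stage 3 (stub): one hard-coded pattern standing in for an NLU model.
    if transcript.lower().startswith("play "):
        return Interpretation("play_music", {"query": transcript[5:]})
    return Interpretation("unknown")


def decide(interp: Interpretation) -> str:
    # Stage 4 (stub): fulfill when the intent is known, otherwise ask to clarify.
    if interp.intent == "play_music":
        return f"Playing {interp.entities['query']}."
    return "Sorry, what would you like to do?"


def speak(reply: str) -> str:
    # Stage 5 (stub): a real system would synthesize audio; here the text is only tagged.
    return f"<speech>{reply}</speech>"


def handle(audio: str):
    if not detect_wake_word(audio):
        return None  # without the wake word, nothing moves past stage 1
    return speak(decide(understand(transcribe(audio))))


print(handle("Alexa, play The Beatles"))
```

Because each stage only sees the previous stage's output, a mistake made early (for example a wrong word in the transcript) is invisible to the stages after it.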

Wake-Word Detection in Depth

Wake-word engines run entirely on-device using tiny models — often under 1 MB — because they must operate continuously without draining the battery or sending audio to servers before consent is given. Amazon's Alexa team published research in 2017 showing that their on-device detector achieves a false-accept rate of roughly one per day during normal household use, which is considered acceptable for consumer products.

The choice of wake word matters acoustically. Words with distinct phoneme sequences and multiple syllables ("Alexa," "Cortana") perform better than short monosyllables because they are harder to trigger accidentally in normal speech.
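One common way to reduce accidental triggers is to require sustained evidence across several audio frames rather than reacting to a single high score, which is also why longer, multi-syllable wake words help. The sketch below assumes a hypothetical detector that emits one score between 0 and 1 per frame; the window size and threshold are illustrative values, not figures from any product.

```python
from collections import deque


def detect(frame_scores, window=5, threshold=0.8):
    """Return the index of the first frame where the smoothed score crosses the threshold, or None."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for i, score in enumerate(frame_scores):
        recent.append(score)
        # Fire only once a full window of frames agrees on average.
        if len(recent) == window and sum(recent) / window >= threshold:
            return i
    return None


# A single loud frame is ignored; a sustained run of high scores triggers.
print(detect([0.1, 0.1, 0.95, 0.1, 0.1, 0.1]))
print(detect([0.1, 0.9, 0.9, 0.9, 0.9, 0.9]))
```

Raising the threshold or widening the window lowers the false-accept rate at the cost of more missed or delayed detections.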

Privacy Note

In May 2018, an Amazon Echo in Portland, Oregon recorded a private conversation and sent it to a contact in the owner's address book. Amazon attributed the incident to a series of misheard words that mimicked wake-word and send commands. The event prompted renewed industry discussion about false-accept thresholds and on-device processing boundaries.

Automatic Speech Recognition

Early ASR systems in the 1990s used Hidden Markov Models combined with Gaussian Mixture Models. Google's 2012 deployment of deep neural networks for ASR, announced at Interspeech, cut word error rates by roughly 30% compared to the best GMM-HMM systems. In 2016, Microsoft Research reported a 5.9% word error rate on the Switchboard benchmark, which it described as on par with professional human transcribers on that dataset.

Modern systems like OpenAI's Whisper (released September 2022) train a single encoder-decoder transformer on 680,000 hours of weakly supervised web audio spanning 99 languages, achieving strong multilingual performance without language-specific fine-tuning.

Word Error Rate (WER): The standard ASR metric, computed as (substitutions + deletions + insertions) ÷ total reference words. Human transcriptionists typically score 4–5% WER on conversational speech; leading ASR systems now match or beat this on clean audio.

Acoustic Model: The component that maps audio features (typically mel-frequency cepstral coefficients or mel spectrograms) to phoneme or subword probabilities.

Language Model (in ASR): A probability distribution over word sequences used to resolve acoustic ambiguities ("I scream" vs. "ice cream") by preferring statistically likely phrases in context.
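Word error rate as defined above is a word-level edit distance divided by the length of the reference. The sketch below is one straightforward way to compute it with dynamic programming; the function name is ours, and it deliberately skips text normalization (casing, punctuation), which real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion (reference word missing)
                            curr[j - 1] + 1,      # insertion (extra hypothesis word)
                            prev[j - 1] + cost))  # substitution or exact match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("jazz" -> "chess") and one deletion ("the") over 5 reference words.
print(word_error_rate("play jazz in the kitchen", "play chess in kitchen"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.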

Natural Language Understanding

Once the transcript is available, NLU performs two primary tasks: intent classification (what does the user want to do?) and entity extraction (what specific objects or parameters are involved?). Early voice assistants used rule-based slot-filling grammars; modern systems use fine-tuned transformer models that handle the same utterance across thousands of possible intents.
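To make the two NLU tasks concrete, here is a small sketch in the style of the early rule-based slot-filling grammars mentioned above. The intent names, slot names, and patterns are all invented for illustration; a modern system would replace the pattern table with a trained classifier and tagger.

```python
import re

# Hypothetical grammar: each intent maps to one pattern with named slots.
PATTERNS = {
    "play_music": re.compile(
        r"^(?P<mode>shuffle|play)\s+(?P<artist>.+?)(?:\s+in the (?P<room>\w+))?$",
        re.IGNORECASE,
    ),
    "set_timer": re.compile(r"^set a timer for (?P<minutes>\d+) minutes?$", re.IGNORECASE),
}


def parse(transcript: str):
    """Return (intent, entities) for the first pattern that matches the transcript."""
    text = transcript.strip()
    for intent, pattern in PATTERNS.items():
        match = pattern.match(text)
        if match:
            # Drop optional slots that were not present in the utterance.
            entities = {k: v for k, v in match.groupdict().items() if v is not None}
            return intent, entities
    return "unknown", {}


print(parse("play The Beatles in the kitchen"))
print(parse("what's the weather"))
```

The brittleness is easy to see: any phrasing outside the patterns falls through to "unknown", which is the main reason learned models displaced hand-written grammars.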

Google's 2016 publication on the "Smart Reply" NLU pipeline for Inbox described handling over 12,000 intent categories across different language configurations — an early glimpse at the scale required for real-world deployment.

Key Insight

The pipeline is only as strong as its weakest stage. A 95%-accurate ASR feeding a perfect NLU still produces incorrect intents when the transcript error lands on a critical word like a person's name or location. This is why voice UX designers must account for ASR failure modes, not just NLU limitations.
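A rough calculation shows how quickly errors compound. The sketch below reads "95%-accurate" as per-word accuracy and assumes word errors are independent, which real ASR does not strictly satisfy; the function name and the example numbers are ours.

```python
def utterance_success_rate(word_accuracy: float, critical_words: int,
                           nlu_accuracy: float = 1.0) -> float:
    """Chance that every critical word is transcribed correctly and NLU then gets the intent right."""
    return (word_accuracy ** critical_words) * nlu_accuracy


# Three critical words (for example an artist name and a room) with a perfect NLU.
print(utterance_success_rate(0.95, 3))
# The same utterance with a 90%-accurate NLU behind it.
print(utterance_success_rate(0.95, 3, nlu_accuracy=0.9))
```

Under these assumptions, a command that depends on three specific words succeeds only about 86% of the time even with a perfect NLU, so roughly one request in seven goes wrong before intent classification even begins.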

Lesson 1 Quiz

From Spoken Word to Machine Understanding · 5 questions

1. What is the primary purpose of wake-word detection running entirely on-device?

Correct. On-device detection preserves privacy by ensuring no audio leaves the device until the wake word is confirmed.

Not quite. The key driver is privacy and consent — audio should not be streamed before the user triggers the assistant.

2. Which metric is standard for evaluating Automatic Speech Recognition accuracy?

Correct. WER measures substitutions, deletions, and insertions divided by total reference words — the standard ASR evaluation metric.

Incorrect. BLEU is for machine translation, F1 for classification, perplexity for language models. WER is the ASR standard.

3. Apple introduced Siri to consumers on which device in 2011?

Correct. Siri launched on the iPhone 4S in October 2011, the day before Steve Jobs died.

Incorrect. Siri was introduced on the iPhone 4S on October 4, 2011.

4. In the five-stage voice pipeline, which component resolves whether to fulfill a request or ask a clarifying question?

Correct. The dialog manager tracks conversation state and decides whether to fulfill, clarify, or escalate the user's request.

Incorrect. NLU extracts intent and entities; dialog management decides what to do with that information given conversation context.

5. OpenAI's Whisper ASR model (2022) was trained on approximately how many hours of audio?

Correct. Whisper was trained on 680,000 hours of weakly supervised multilingual web audio spanning 99 languages.

Incorrect. Whisper's training set was 680,000 hours — a scale that enabled its strong multilingual performance without language-specific tuning.

Lab 1 — Pipeline Diagnosis

Explore failure modes across the five-stage voice processing pipeline

Your Scenario

You're a voice UX engineer at a smart-home company. Users are complaining that the device sometimes misunderstands commands even though it clearly heard them say something. You need to identify which pipeline stage failed and why.

Discuss with the AI: Describe a specific failure scenario (e.g., "User said 'play jazz in the kitchen' but assistant played jazz in every room"). The AI will help you trace which pipeline stage failed and suggest a fix. Complete at least 3 exchanges to finish the lab.

Pipeline Diagnosis Lab

Voice Pipeline · L1

Welcome to the Pipeline Diagnosis Lab. I'm your voice UX engineering consultant. Describe a voice assistant failure your users have reported — what did they say, what did the assistant do, and what should it have done? We'll trace the fault through the five pipeline stages together.

Module 3 · Lesson 2

The Big Three — Alexa, Google Assistant, and Siri

Architecture decisions, market position, and hard lessons learned from the dominant voice platforms.

Why do three technically capable assistants produce such different user experiences — and different business outcomes?

By 2023, Amazon had sold over 500 million Alexa-enabled devices worldwide. Google Assistant ran on more than 1 billion devices. Apple's Siri processed an estimated 25 billion requests per month. Yet all three companies simultaneously announced major restructurings of their voice assistant divisions, signaling that raw scale had not translated into the profitable, indispensable utility each had envisioned in 2015.

Amazon Alexa — The Skills Ecosystem

Amazon launched Alexa alongside the Echo smart speaker in November 2014. The strategic insight was that a dedicated ambient device in the home — not a smartphone — would normalize voice interaction. Amazon's "Skills" framework, announced in 2015, was modeled on the App Store: any developer could publish a voice skill with defined intents and utterances.

By 2019, the Alexa Skills Store contained over 100,000 skills across categories. However, internal Amazon research leaked in 2022 showed that the vast majority of Alexa interactions were limited to just a few use cases: music, timers, alarms, and smart home control. The elaborate skills ecosystem had low discovery and even lower retention. Amazon reportedly spent over $10 billion on Alexa development from 2014 to 2022 without finding a clear path to profitability.

Architecture Note

Alexa's architecture separates the Automatic Speech Recognition layer (handled by Amazon Transcribe infrastructure) from the intent resolution layer (the Alexa Skills Kit). This separation made third-party integration straightforward but created latency: the average Alexa round-trip in 2016 was approximately 1.5–2 seconds, compared to under 1 second for Google's on-device Assistant processing in 2019.

Google Assistant — Knowledge Graph Advantage

Google announced Assistant at Google I/O in May 2016, as the conversational successor to Google Now. The key differentiator was Google's Knowledge Graph — a database of over 500 billion facts connecting entities, relationships, and attributes, built from two decades of search indexing. This gave Google Assistant an enormous factual retrieval advantage for open-domain questions.

In May 2018, Google CEO Sundar Pichai demonstrated "Google Duplex" at Google I/O: an AI system that called a real hair salon and made an appointment, using natural-sounding speech with "umms" and "ahs" to avoid detection. The demonstration received both acclaim and controversy. Google subsequently added disclosure requirements, announcing in 2019 that Duplex would identify itself as an AI when initiating calls.

Google's 2017 deployment of the Pixel Buds with real-time translation between 40 languages illustrated the integration potential when ASR, NMT (Neural Machine Translation), and TTS are vertically integrated within one company's infrastructure.

Apple Siri — Privacy Architecture

Apple's Siri took a distinctly different architectural path after 2019 under Craig Federighi's engineering direction. Rather than improving factual recall through cloud data, Apple invested in on-device processing. The Neural Engine introduced in the A11 Bionic chip (2017) ran NLU inference locally, and by 2021, Siri's personal requests (messages, reminders, calls) were processed entirely on-device without audio reaching Apple's servers.

This privacy-first approach came with capability trade-offs. Independent benchmarks published by Loup Ventures from 2018 to 2021 consistently ranked Siri behind Google Assistant in answering complex factual questions, a gap commonly attributed to Apple's more limited reliance on cloud knowledge retrieval.

Dimension	Alexa	Google Assistant	Siri
Launched	November 2014	May 2016	October 2011
Key Advantage	Smart home ecosystem; Skills marketplace	Knowledge Graph; search integration	Device privacy; Apple ecosystem depth
Processing Model	Primarily cloud	Hybrid (cloud + on-device since 2019)	On-device first (personal queries)
Developer Access	Alexa Skills Kit (open)	Actions on Google (open)	SiriKit + App Intents (controlled)
Known Weakness	Low skill discovery / retention	Privacy concerns; fragmented hardware	Limited open-domain factual recall

Cross-Platform Pattern

All three platforms discovered by 2022 that the "assistant as platform" model — where the assistant becomes the dominant interface replacing apps — had not materialized. Users continued to switch between apps and voice, using voice primarily for low-friction, time-sensitive tasks rather than deep transactional workflows.

Lesson 2 Quiz

The Big Three — Alexa, Google Assistant, Siri · 5 questions

1. What was Amazon's strategic rationale for launching Alexa on a dedicated smart speaker rather than a smartphone?

Correct. Amazon's insight was that an always-present device in the home would make voice interaction habitual in ways a smartphone could not.

Incorrect. Amazon's key strategic bet was that ambient, always-on presence in the home would normalize voice — not that smartphones lacked hardware.

2. Google Duplex, demonstrated in 2018, raised controversy primarily because of what concern?

Correct. The non-disclosure of AI identity in phone calls was the central ethical objection; Google subsequently added mandatory disclosure.

Incorrect. The controversy centered on deception — Duplex's natural-sounding speech might mislead people into thinking they were speaking with a human.

3. Apple's privacy-first Siri architecture resulted in which documented trade-off?

Correct. Loup Ventures benchmarks 2018–2021 consistently placed Siri behind Google Assistant in factual recall, a gap attributed to on-device processing limitations.

Incorrect. The documented trade-off was in factual question answering quality, not device control or battery life.

4. What gave Google Assistant a major advantage in answering open-domain factual questions?

Correct. Two decades of search indexing built a Knowledge Graph that gave Google Assistant unmatched factual retrieval across open-domain questions.

Incorrect. The Knowledge Graph — not hardware or calendar integration — was the primary factual advantage.

5. What cross-platform pattern did all three major voice assistants discover by approximately 2022?

Correct. Despite massive investment, voice did not become the dominant interface — users continued switching between apps and voice, using voice for quick tasks.

Incorrect. The shared discovery was that voice stayed in a narrow utility lane rather than becoming the primary interface for deep workflows.

Lab 2 — Platform Selection Advisor

Choosing the right voice platform for a real product scenario

Your Scenario

You're a product manager at a healthcare startup building a voice-enabled medication reminder and symptom checker for elderly users at home. You need to decide whether to build on Alexa, Google Assistant, or Siri — and justify the decision to stakeholders.

Ask the AI to help you analyze which platform best fits this use case. Consider: privacy requirements for health data, user demographics, device penetration in the target market, and developer API access. Complete at least 3 exchanges.

Platform Selection Lab

Big Three Platforms · L2

Hello! I'm your voice platform strategy advisor. You're selecting a voice platform for a healthcare app targeting elderly users at home. Tell me your top priority — is it privacy compliance, ease of device setup for non-technical users, developer API flexibility, or something else? We'll work through the platform trade-offs together.

Module 3 · Lesson 3

Text-to-Speech — The Voice That Talks Back

From robotic concatenation to neural synthesis — how TTS went from uncanny to convincing, and why it matters for voice UX.

What makes a synthetic voice feel trustworthy — and what happens when it goes wrong?

In September 2016, Google's DeepMind published a paper introducing WaveNet — a generative model for raw audio waveforms. In a blind listening test, evaluators rated WaveNet's output 4.21 out of 5 on a Mean Opinion Score scale, compared to 3.86 for the best concatenative system. More strikingly, the gap between WaveNet and human speech was smaller than the gap between WaveNet and the previous generation of TTS. The voice synthesis landscape changed permanently that week.

Generations of TTS Technology

Concatenative synthesis (dominant 1990s–2015) works by splicing together recorded audio segments — diphones, triphones, or words — from a large voice database. Quality depends heavily on the size of the recording corpus and the smoothness of join points. AT&T's Natural Voices system (circa 2000) used databases of over 40 hours of recorded speech per voice. The characteristic robotic quality arises from unnatural prosody (rhythm and pitch) at concatenation boundaries.

Parametric synthesis models the vocal tract acoustically using statistical parameters. It requires smaller databases than concatenative approaches but historically produced "buzzy" output. HMM-based parametric TTS dominated research in the 2005–2015 period.

Neural TTS (2016–present) uses deep generative models to produce waveforms directly or via intermediate mel-spectrogram representations. The key systems include:

2016

WaveNet

DeepMind's autoregressive waveform model. Original version required 1,600× real-time compute. Distilled versions deployed in Google products by 2017.

2018

Tacotron 2

Google's sequence-to-sequence model converting text to mel spectrograms, then WaveNet for audio. Became the backbone of Google Assistant's voice.

2019

FastSpeech

Microsoft Research's parallel (non-autoregressive) synthesis — 38× faster than Tacotron 2 with comparable quality, enabling real-time on-device TTS.

2023

VALL-E

Microsoft's zero-shot voice cloning model. Could replicate a speaker's voice from a 3-second sample, preserving emotional tone and acoustic environment.
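Several of the systems above pass through a mel spectrogram, which spaces frequency bands the way human hearing does: finely at low frequencies and coarsely at high ones. The sketch below uses one widely used conversion formula (mel = 2595 · log10(1 + f/700)); other variants exist, and the helper names are ours rather than any library's API.

```python
import math


def hz_to_mel(hz: float) -> float:
    # One common variant of the mel scale; 1000 Hz maps to roughly 1000 mel.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(low_hz: float, high_hz: float, bands: int):
    """Edges of `bands` filters spaced evenly on the mel scale, returned in Hz."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (bands + 1)
    return [mel_to_hz(low + i * step) for i in range(bands + 2)]


edges = mel_band_edges(0.0, 8000.0, 10)
# Bands are narrow at the bottom of the range and wide at the top.
print(round(edges[1] - edges[0]), round(edges[-1] - edges[-2]))
```

Evenly spaced points on the mel axis translate to progressively wider bands in Hz, which is what lets a spectrogram with a few dozen mel bands keep the detail that matters perceptually.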

Prosody, Expressiveness, and the Uncanny Valley

Prosody — the rhythm, stress, and intonation of speech — is what separates natural from robotic output even when individual phonemes are accurate. Early WaveNet deployments sounded humanlike in isolated sentences but became flat in longer paragraphs because the model lacked discourse-level understanding of which words should carry emphasis.

Amazon's team addressed this for Alexa in 2019 by integrating a Neural Text Pre-Processing layer that uses contextual models to predict prosodic boundaries before audio synthesis. Apple similarly introduced "Siri Voice 5" in iOS 16 (2022) using a new neural model trained on "expressive speech" data, achieving a significantly more natural cadence in directions and longer passages.

The Voice Cloning Risk

Microsoft's VALL-E (2023) demonstrated that a 3-second voice sample could be used to generate arbitrary speech in that speaker's voice. The Federal Trade Commission in the United States launched the "Voice Cloning Challenge" in November 2023, soliciting proposals for technologies to detect synthetic voice and protect consumers from audio deepfakes in phone scams. The FBI reported a rise in "grandparent scams" using voice-cloned audio to impersonate family members.

Voice Persona Design

Beyond synthesis quality, voice UX teams must make deliberate choices about voice persona: pitch, speaking rate, accent, and emotional range. Amazon's Alexa voice was recorded by a professional voice actress in Boulder, Colorado in 2013; her identity remained undisclosed until 2021, when journalist Brad Stone revealed that Nina Rolle was the voice behind Alexa.

Research published by Stanford's SPARQ lab found that users attribute personality traits — including competence and warmth — to voice assistants within the first three utterances. Voices perceived as "warm" generated higher task completion rates for service requests; voices perceived as "authoritative" generated better compliance with health or safety instructions.

Mean Opinion Score (MOS): The standard TTS quality metric. Human listeners rate audio naturalness on a 1–5 scale, and the scores are averaged. Human speech typically scores 4.5+. Leading neural TTS systems now score 4.3–4.5 in controlled conditions.

Mel Spectrogram: A visual representation of audio frequency content over time on a perceptual scale matching human hearing. Used as an intermediate representation in Tacotron and other neural TTS pipelines.

Voice Cloning: The synthesis of a specific person's voice characteristics from a sample recording. Zero-shot cloning (VALL-E, 2023) requires only seconds of sample audio.
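Since MOS is an average of listener ratings, it is worth reporting alongside some measure of uncertainty: a difference of 0.1 means little with a handful of raters. The sketch below computes the mean and an approximate 95% confidence interval using a normal approximation, which is a simplifying assumption; the function name and the sample ratings are invented for illustration.

```python
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * stdev(ratings) / len(ratings) ** 0.5
    return mos, half_width


mos, half_width = mean_opinion_score([4, 5, 4, 4, 5, 3, 4, 5])
print(f"MOS {mos:.2f} +/- {half_width:.2f}")
```

With only eight listeners the interval is wide, which is why published evaluations typically collect many ratings per system.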

Design Implication

As TTS quality approaches human parity, voice designers shift focus from "does it sound natural?" to "does it sound right for this brand and context?" A warm, unhurried voice for a meditation app and a crisp, efficient voice for a financial assistant serve different user needs — even if both score equally on MOS tests.

Lesson 3 Quiz

Text-to-Speech Technology · 5 questions

1. What was WaveNet's Mean Opinion Score in DeepMind's 2016 blind listening test?

Correct. WaveNet scored 4.21 MOS vs. 3.86 for the best concatenative system in DeepMind's 2016 evaluation.

Incorrect. WaveNet scored 4.21 MOS. 3.86 was the previous best concatenative system; 5.0 is the theoretical maximum.

2. What was the primary computational challenge with the original WaveNet model?

Correct. WaveNet's autoregressive architecture generated audio sample-by-sample, requiring 1,600× real-time — impractical for live use until distillation improved speed.

Incorrect. The bottleneck was computational: autoregressive sample-by-sample generation required 1,600× real-time compute.

3. Microsoft's FastSpeech (2019) achieved which key improvement over Tacotron 2?

Correct. FastSpeech's parallel architecture synthesized mel spectrograms 38× faster than Tacotron 2, enabling viable on-device real-time TTS.

Incorrect. FastSpeech's advance was speed — 38× faster through parallel generation — not MOS scores or multilingual capability.

4. Stanford SPARQ lab research found that users attribute personality to a voice assistant within how many utterances?

Correct. Users form impressions of competence and warmth within the first three utterances — making opening voice design critical.

Incorrect. According to Stanford SPARQ research, personality attribution happens within just three utterances.

5. Microsoft's VALL-E model demonstrated voice cloning from a sample of what minimum length?

Correct. VALL-E demonstrated zero-shot voice cloning from a 3-second audio sample, preserving emotional tone and acoustic environment.

Incorrect. VALL-E required only 3 seconds of sample audio — a threshold that raised significant concerns about audio deepfake misuse.

Lab 3 — Voice Persona Designer

Define a TTS voice persona for a real product context

Your Scenario

You're the lead voice designer for a new AI-powered financial planning app targeting users aged 35–55. Your team needs to specify the TTS voice persona before working with a voice talent agency and synthesis engineers.

Work with the AI to define your voice persona spec: pitch range, speaking rate, emotional tone, accent considerations, and how the voice should shift between information delivery and empathetic moments. Complete at least 3 exchanges to finish the lab.

Voice Persona Lab

TTS Design · L3

Welcome to the Voice Persona Design lab. I'm your voice UX consultant. Let's spec a TTS voice for your financial planning app. First question: who is speaking — a neutral AI entity, a named persona like "Aria," or a voice that mimics a human advisor archetype? Each choice carries different trust and brand implications. What direction is your team leaning?

Module 3 · Lesson 4

Multimodal Voice — Screens, Sensors, and Ambient Computing

Voice assistants are leaving the speaker and inhabiting cars, earbuds, glasses, and appliances — changing the design rules entirely.

When the screen disappears, what does good design actually look like?

In May 2019, Amazon introduced the Echo Show 5, a compact screen-equipped Echo with a 5.5-inch display. That same year, Google shipped the Nest Hub Max with a 10-inch display and a Face Match feature. Voice-only had given way to voice-plus-screen: a hybrid modality that required designers to think simultaneously about what the assistant says and what it shows and, crucially, how those two channels reinforce or contradict each other.

Multimodal Design Principles

When voice and visual channels coexist, design teams must manage channel complementarity: using each channel for what it does best. Voice excels at conveying immediacy, emotion, and sequential instruction. Screens excel at spatial comparisons, reference information, and selection from multiple options. Combining both without clear orchestration produces cognitive overload.

Amazon's internal design guidelines for the Echo Show (published as part of the Alexa Multimodal Design Best Practices document) state that visual content should never simply repeat what the voice says — it should add information (a map when giving directions) or replace spoken content that is better consumed visually (a list of five items).
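The complementarity rule can be expressed as a small response planner: short answers are spoken, long lists move to the screen, and the screen never merely echoes the speech. This is a sketch of that idea only; the `Response` type, the function name, and the three-item cutoff are assumptions for illustration, not part of Amazon's guidelines.

```python
from dataclasses import dataclass


@dataclass
class Response:
    speech: str
    screen: list


def plan_response(summary: str, items: list, has_screen: bool,
                  max_spoken_items: int = 3) -> Response:
    if len(items) <= max_spoken_items:
        # Short enough to say aloud; the screen stays free for other content.
        return Response(f"{summary} {', '.join(items)}.", [])
    if has_screen:
        # Long list: the screen carries it and the voice only points to it.
        return Response(f"{summary} I found {len(items)} options; they are on the screen.", list(items))
    # Voice-only device: speak the first few and offer the rest on request.
    first = ", ".join(items[:max_spoken_items])
    return Response(f"{summary} The first {max_spoken_items} are {first}. Want to hear more?", [])


recipes = ["pancakes", "waffles", "crepes", "french toast", "omelette"]
print(plan_response("Here are some breakfast ideas.", recipes, has_screen=True))
```

The same request produces different output on a speaker and on a screen device, which is the orchestration decision the design principle asks teams to make explicitly.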

Voice in the Car — Automotive AI

The automotive context imposes the most severe constraints on voice UX: the user cannot look away from the road, latency must be minimal to avoid distraction, and commands must often be issued in noisy environments at high speed. NHTSA (National Highway Traffic Safety Administration) guidelines recommend voice interactions that take no more than 2 seconds of cognitive engagement time while driving.

BMW introduced ConnectedDrive voice control in 2011; by 2023, BMW's "Hey BMW" system processed over 2 million voice commands per month across its fleet. Mercedes-Benz's MBUX (Mercedes-Benz User Experience), launched in the A-Class in 2018, used a transformer-based NLU that could handle indirect commands like "I'm cold" by adjusting the climate control, rather than requiring explicit commands like "increase temperature to 72 degrees." Mercedes reported that MBUX reduced driver touchscreen interaction by 27% compared to the previous generation.

Real Deployment Data

A 2022 study by J.D. Power on automotive voice systems found that 23% of drivers who tried a voice feature in a new vehicle never used it again after the first attempt — primarily due to failed recognitions of music titles, navigation addresses, or contact names. The study identified "first-attempt failure" as the single largest driver of voice system abandonment, suggesting that graceful error recovery UX is more important than recognition accuracy alone.
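Graceful error recovery usually comes down to a policy that reacts to recognition confidence and to how many times the user has already been asked to repeat themselves. The sketch below is one hypothetical policy; the action names and thresholds are illustrative assumptions and do not come from the study.

```python
def next_action(confidence: float, failed_attempts: int,
                execute_at: float = 0.85, confirm_at: float = 0.5,
                max_attempts: int = 2) -> str:
    """Pick the assistant's next move from recognition confidence and retry count."""
    if confidence >= execute_at:
        return "execute"            # confident enough to act immediately
    if failed_attempts >= max_attempts:
        return "offer_alternative"  # stop re-asking; suggest a list or touch input
    if confidence >= confirm_at:
        return "confirm"            # "Did you mean ...?" costs one short turn
    return "reprompt"               # ask again, ideally with a narrower question


print(next_action(0.9, 0), next_action(0.6, 0), next_action(0.2, 0), next_action(0.2, 2))
```

Capping retries matters most: a driver who has to repeat a contact name a third time is likely the one who never tries the feature again.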

Wearables and Ambient Computing

Apple AirPods with Siri activation (hands-free "Hey Siri" support arrived in 2019 with the second-generation AirPods) brought voice into a truly ambient modality: users issue commands while walking, running, or cooking without taking out a phone. The constraint: no screen at all, and the interaction must complete in seconds without the user stopping activity.

Meta's Ray-Ban Stories glasses (2021) and the second-generation Ray-Ban Meta glasses (2023) added a camera and voice AI to a form factor that looks like standard eyewear. The 2023 version integrated Meta AI, allowing users to ask questions about what they are looking at — a multimodal fusion of vision and voice that represents the emerging paradigm of contextually-aware ambient voice.

Google Glass Enterprise Edition 2 (2019) demonstrated a specialized case: warehouse workers at DHL used voice commands paired with AR overlays to receive pick-and-pack instructions without touching a device, reducing picking errors by 11% and training time by 30% according to DHL's published case study.

The Privacy Geometry of Ambient Voice

As voice assistants move into more intimate contexts — earbuds, glasses, car cabins — the privacy surface area grows. A smart speaker in a living room is a known, stationary device. An AI-enabled earpiece accompanies the user everywhere, overhearing conversations the user may not consciously track.

The European Data Protection Board issued guidelines in March 2021 clarifying that always-on voice capture in wearable devices required explicit, informed consent and that the standard "device activation = consent to capture" argument was insufficient under GDPR for devices worn outside the home.

Multimodal Orchestration The real-time coordination of voice, visual, haptic, and other output channels so that each carries information suited to its medium without redundancy or contradiction.

Ambient Computing A paradigm in which computation and AI assistance are woven invisibly into everyday objects and environments, accessible through natural interaction (primarily voice) without deliberate device use.

Cognitive Load (in Voice UX) The mental effort required to interact with a voice system. High-cognitive-load interactions (long menus, multi-step forms) are unsuitable for driving or physical activity contexts.

Forward-Looking Pattern

The integration of large language models into voice pipelines — replacing rigid intent classifiers with generative response systems — is eliminating many legacy failure modes (mismatched intents, unrecognized entity types) while introducing new ones (hallucinated facts in spoken responses, inconsistent persona maintenance). The next generation of voice UX design must account for both the expanded capability and the new failure signature of LLM-backed assistants.

Lesson 4 Quiz

Multimodal Voice and Ambient Computing · 5 questions

1. According to Amazon's Alexa Multimodal Design Best Practices, visual content on a screen-equipped device should do what when paired with a spoken response?

Correct. Amazon's guideline states visual content should add information (a map) or replace it (a list of five items) — never just echo the voice.

Incorrect. Amazon's multimodal design principle is complementarity: screens should add what voice cannot, not repeat it.

2. NHTSA guidelines recommend in-car voice interactions engage no more than how many seconds of cognitive effort?

Correct. NHTSA recommends a maximum of 2 seconds of cognitive engagement time per voice interaction while driving.

Incorrect. NHTSA's guideline is 2 seconds maximum cognitive engagement — anything longer risks distraction-related incidents.

3. DHL's deployment of Google Glass Enterprise Edition 2 in warehouses reported which specific productivity improvement?

Correct. DHL's published case study reported 11% fewer picking errors and 30% faster training for new workers using the Glass system.

Incorrect. DHL reported 11% reduction in errors and 30% reduction in training time. The 27% touchscreen figure was Mercedes-Benz MBUX data.

4. The J.D. Power 2022 automotive voice study identified what as the single largest driver of voice system abandonment?

Correct. J.D. Power found that first-attempt failure drove abandonment more than any other factor — 23% of first-time users never returned after an initial failure.

Incorrect. The study identified first-attempt failure as the primary abandonment driver, underscoring the critical importance of error recovery UX.

5. The European Data Protection Board's 2021 guidelines on wearable voice devices clarified that device activation alone was insufficient for what legal requirement?

Correct. The EDPB stated that device activation ≠ GDPR consent for always-on audio capture, particularly for devices worn outside the home.

Incorrect. The EDPB ruling addressed GDPR informed consent requirements for always-on voice capture in wearable devices.

Lab 4 — Ambient Voice UX Critic

Evaluate and improve a multimodal voice interaction design

Your Scenario

You've been handed a voice UX flow for a smart kitchen assistant that runs on a screen-equipped device (like an Echo Show). The assistant guides users through recipes using voice + screen. Your job is to critique the interaction design against multimodal principles.

Describe a specific interaction moment from the recipe assistant (e.g., "When the user asks how much flour, the screen shows a photo of the dish and the voice reads the entire ingredient list"). The AI will evaluate it against multimodal design principles and suggest improvements. Complete at least 3 exchanges.

Ambient Voice UX Lab

Multimodal Design · L4

Welcome to the Multimodal UX Critic lab. I'm your voice interaction design reviewer. Describe an interaction moment from your smart kitchen recipe assistant — what the user says, what the voice responds, and what the screen shows. I'll evaluate it against multimodal design principles: channel complementarity, cognitive load, and context appropriateness. What's the first interaction you'd like me to review?

Module 3 Test

Voice Assistants Technology · 15 questions · Pass at 80%

1. In the standard voice processing pipeline, which stage runs entirely on-device before any audio is sent to the cloud?

Correct. Wake-word detection uses a tiny on-device model to confirm the trigger phrase before streaming begins.

Incorrect. Wake-word detection is the only stage guaranteed to run on-device before cloud streaming.

2. Word Error Rate is calculated as which formula?

Correct. WER counts substitutions, deletions, and insertions, divided by the total reference word count.

Incorrect. WER = (substitutions + deletions + insertions) ÷ reference words.
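
To make the formula concrete, here is a minimal sketch that computes WER with a word-level edit distance. The function name word_error_rate and the sample sentences are illustrative assumptions, not part of any particular ASR toolkit, and production scoring tools usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("on") plus one substitution ("lights" -> "light") over 5 reference words.
print(word_error_rate("turn on the kitchen lights", "turn the kitchen light"))  # 0.4
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% when the hypothesis contains many extra words.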

3. Google's 2012 deployment of deep neural networks for ASR reduced word error rates by approximately how much compared to GMM-HMM systems?

Correct. Google's 2012 DNN deployment, reported at Interspeech, cut WER by roughly 30% over the best GMM-HMM systems.

Incorrect. The improvement was approximately 30%.

4. Amazon's Alexa Skills Store contained approximately how many skills by 2019?

Correct. The Alexa Skills Store reached over 100,000 skills by 2019, though most saw minimal usage.

Incorrect. The Skills Store had exceeded 100,000 skills by 2019.

5. Google Duplex was demonstrated making a real phone call at which event in 2018?

Correct. Google CEO Sundar Pichai demonstrated Duplex at Google I/O in May 2018.

Incorrect. Duplex was demonstrated by Sundar Pichai at Google I/O in May 2018.

6. Apple's on-device Siri processing strategy was primarily enabled by what hardware component introduced in the A11 Bionic chip?

Correct. The Neural Engine in the A11 Bionic (2017) enabled on-device NLU inference that underpins Siri's privacy-first architecture.

Incorrect. The Neural Engine — Apple's dedicated ML accelerator — enabled Siri's on-device NLU processing.

7. Concatenative TTS produces its characteristic robotic quality primarily because of what?

Correct. Concatenative TTS stitches recorded audio segments; the seam points create prosodic discontinuities that sound robotic.

Incorrect. The robotic quality comes from prosodic discontinuities at the join points between spliced audio segments.

8. Which TTS system introduced parallel (non-autoregressive) spectrogram generation achieving 38× faster synthesis?

Correct. FastSpeech (Microsoft Research and Zhejiang University, 2019) generates mel spectrograms in parallel, making end-to-end synthesis about 38× faster than an autoregressive Transformer TTS baseline.

Incorrect. FastSpeech was the 2019 system that achieved 38× faster synthesis through parallel mel-spectrogram generation.

9. The 2018 Portland Echo incident — where a private conversation was sent to a contact — resulted from what technical issue?

Correct. Amazon attributed the incident to background speech accidentally triggering the wake word, "send message," and a contact name in sequence.

Incorrect. Amazon's explanation was that normal speech accidentally triggered wake-word detection followed by misinterpreted send commands.

10. Mercedes-Benz MBUX reduced driver touchscreen interaction by what percentage compared to the previous generation system?

Correct. Mercedes reported MBUX reduced touchscreen interaction by 27% through natural-language voice command handling.

Incorrect. Mercedes-Benz reported a 27% reduction in touchscreen interaction with MBUX. 11% was DHL's picking error reduction.

11. What does the "language model" component in an ASR system specifically contribute?

Correct. The ASR language model resolves ambiguities like "I scream" vs. "ice cream" by scoring phrase probability in context.

Incorrect. The language model in ASR resolves acoustic ambiguities using the statistical likelihood of word sequences.
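
As a rough illustration of that idea, the sketch below scores two acoustically similar transcripts with a toy bigram model and keeps the more probable one. The counts, the add-one smoothing, and the names log_prob and pick_transcript are all hypothetical; a real recognizer combines a far larger language model with acoustic scores rather than choosing on language-model probability alone.

```python
import math

# Hypothetical counts from an imagined text corpus (illustrative values only).
BIGRAM_COUNTS = {
    ("<s>", "i"): 50,
    ("i", "want"): 40,
    ("want", "ice"): 8,
    ("ice", "cream"): 30,
    ("want", "i"): 1,
    ("i", "scream"): 2,
}
UNIGRAM_COUNTS = {"<s>": 100, "i": 60, "want": 45, "ice": 35, "cream": 30, "scream": 3}
VOCAB_SIZE = len(UNIGRAM_COUNTS)


def log_prob(sentence: str) -> float:
    """Sum of add-one-smoothed bigram log probabilities for the sentence."""
    words = ["<s>"] + sentence.lower().split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        count = BIGRAM_COUNTS.get((prev, word), 0)
        context = UNIGRAM_COUNTS.get(prev, 0)
        total += math.log((count + 1) / (context + VOCAB_SIZE))
    return total


def pick_transcript(candidates: list[str]) -> str:
    """Return the candidate transcript the toy language model finds most likely."""
    return max(candidates, key=log_prob)


# Both candidates sound alike; the model prefers the likelier word sequence.
print(pick_transcript(["i want i scream", "i want ice cream"]))  # i want ice cream
```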

12. Amazon reportedly spent approximately how much on Alexa development from 2014 to 2022?

Correct. Reports citing internal Amazon documents placed total Alexa investment at over $10 billion through 2022 without a clear profitability path.

Incorrect. Reports placed Amazon's total Alexa investment at over $10 billion from 2014 to 2022.

13. What principle defines good multimodal orchestration between voice and screen channels?

Correct. Channel complementarity — using each modality for its strengths — is the core multimodal design principle.

Incorrect. Good multimodal design means each channel carries unique, complementary content — not duplication.

14. Tacotron 2 (Google, 2018) converts text to what intermediate representation before audio synthesis?

Correct. Tacotron 2 generates mel spectrograms from text, which are then converted to audio waveforms by a separate vocoder (WaveNet).

Incorrect. Tacotron 2 produces mel spectrograms as its intermediate representation, feeding them to WaveNet for waveform generation.

15. The EU's GDPR, as clarified by the EDPB in 2021, requires what for always-on voice capture in wearable devices worn outside the home?

Correct. The EDPB 2021 guidelines stated that activating a wearable device does not constitute informed GDPR consent for continuous audio capture.

Incorrect. The EDPB clarified that explicit, informed consent is required — device activation alone does not satisfy GDPR for always-on wearable capture.