Module 7 · Lesson 1

What Is Prosody — and Why Does AI Struggle With It?

Pitch, rhythm, stress, and silence: the hidden architecture of spoken meaning.
How does a sentence's emotional weight survive — or collapse — on its way from human mouth to AI model?

When Amazon launched Alexa's "neural text-to-speech" engine in 2018, internal teams discovered a persistent failure: Alexa would deliver condolence messages — "I'm so sorry for your loss" — in the same crisp, metered cadence she used for weather forecasts. The pitch contour was flat. The pauses were absent. Users in bereavement support skills reported finding the responses "eerie." Amazon's speech science team spent the following year specifically engineering what they called expressive SSML tags to inject pause duration and pitch drop into grief-adjacent utterances. The problem wasn't vocabulary. It was prosody.

Prosody Defined

Prosody is the collective term for the suprasegmental features of speech — the acoustic properties that ride above individual phonemes and words to convey meaning, attitude, and emotion. Linguists conventionally divide it into four primary dimensions:

Pitch (F0): High · Duration: Med-Hi · Loudness: Med · Voice Quality: Lower
F0 (Fundamental Frequency): The rate of vocal cord vibration, perceived as pitch. Rising F0 signals questions, surprise, or heightened emotion in most languages. Falling F0 signals finality or sadness.
Duration / Rhythm: How long phonemes and pauses last. Elongated vowels signal emphasis; shortened syllables signal urgency. Pause placement can reverse sentence meaning entirely.
Intensity (Loudness): Amplitude envelope, perceived as volume. Anger and excitement increase intensity; grief and fear often reduce it or make it irregular.
Voice Quality: Spectral features including breathiness, creakiness (vocal fry), and nasality, which are major carriers of fine-grained emotional state.
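
To make the dimensions concrete, the following minimal sketch measures two of them, F0 and intensity, on a synthetic tone. It is an illustration only: the function names (`synth_tone`, `rms_intensity`, `estimate_f0`), the 8 kHz sample rate, and the 60 to 400 Hz search range are assumptions chosen for this example, and plain autocorrelation is far cruder than a production pitch tracker.

```python
import math


def synth_tone(freq_hz, duration_s, sample_rate=8000, amplitude=0.5):
    """Generate a pure sine tone as a list of float samples."""
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


def rms_intensity(frame):
    """Root-mean-square amplitude: a simple proxy for perceived loudness."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_f0(frame, sample_rate=8000, fmin=60.0, fmax=400.0):
    """Estimate F0 by picking the lag with the strongest autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


if __name__ == "__main__":
    frame = synth_tone(200.0, 0.04)  # 40 ms of a 200 Hz tone
    print("F0 estimate (Hz):", estimate_f0(frame))
    print("RMS intensity:", round(rms_intensity(frame), 3))
```

Real speech is noisier than a sine wave, so a practical system would add voicing detection and smoothing across frames before trusting any single F0 value.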
Why Text-Based AI Loses Prosody

Most large language models are trained on text — a medium that strips prosody entirely. The sentence "That's great" carries zero acoustic information when tokenized. Whether it was spoken with genuine enthusiasm, dripping sarcasm, or quiet resignation is invisible to the model unless the surrounding context makes it inferable.

Even when AI systems receive audio input through automatic speech recognition (ASR), the dominant pipeline converts speech to a text transcript before passing it to the language model. The ASR layer discards the prosodic envelope. Google's 2023 paper on their Universal Speech Model noted that "transcript-first pipelines systematically lose turn-level affect information that would otherwise be recoverable from the acoustic signal."

The gap matters because humans lean heavily on tone of voice when judging emotion. The often-cited figure that vocal tone carries roughly 38% of emotional meaning comes from Albert Mehrabian's 1967 studies. It is frequently misquoted: the studies established it only for emotional attitude words in controlled settings, so it should not be read as a general law of conversation. In voice-only channels such as phone calls, smart speakers, and IVR systems, facial expression is absent, so listeners depend on prosody even more.

Why This Matters Now

As real-time voice AI enters healthcare triage, mental health support, elderly care, and customer service, the cost of prosodic blindness rises sharply. A system that cannot detect vocal distress — the slight catch in pitch, the halting rhythm of someone in crisis — is not emotionally neutral. It is actively unsafe in high-stakes contexts.

The Acoustic-to-Semantic Gap

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory published work in 2021 showing that even state-of-the-art sentiment classifiers operating on text transcripts missed 31% of negative emotional states that were clearly detectable from the audio alone. The missed cases shared a pattern: calm, grammatically neutral language delivered with distressed prosody — classic "masked distress" patterns common in depression and domestic abuse contexts.

The challenge for AI developers is therefore not simply building better sentiment analysis on text. It requires preserving or re-extracting prosodic features from audio before or alongside transcription — a fundamentally different architectural choice.
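
The architectural difference can be sketched as a data-structure choice: does the object handed to the language model carry only words, or words plus a prosodic summary? The example below is a hedged illustration, not a description of any vendor's pipeline. The class names (`ProsodySummary`, `Turn`), the three summary fields, and the prompt format are all hypothetical, and the ASR and feature-extraction steps are replaced by already-computed values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProsodySummary:
    """Hypothetical turn-level prosodic features (assumed precomputed)."""
    mean_f0_hz: float
    speech_rate_syll_per_s: float
    pause_ratio: float  # fraction of the turn spent in silence


@dataclass
class Turn:
    transcript: str
    prosody: Optional[ProsodySummary] = None


def transcript_first(transcript, prosody):
    """Conventional pipeline: the prosodic summary is dropped."""
    return Turn(transcript=transcript)


def prosody_preserving(transcript, prosody):
    """Alternative pipeline: prosody travels alongside the transcript."""
    return Turn(transcript=transcript, prosody=prosody)


def build_prompt(turn):
    """Render the turn as text for a downstream language model."""
    if turn.prosody is None:
        return f'User said: "{turn.transcript}"'
    p = turn.prosody
    return (
        f'User said: "{turn.transcript}" '
        f"[mean F0 {p.mean_f0_hz:.0f} Hz, "
        f"rate {p.speech_rate_syll_per_s:.1f} syll/s, "
        f"pause ratio {p.pause_ratio:.0%}]"
    )


if __name__ == "__main__":
    features = ProsodySummary(mean_f0_hz=142.0, speech_rate_syll_per_s=2.1,
                              pause_ratio=0.45)
    print(build_prompt(transcript_first("I'm fine", features)))
    print(build_prompt(prosody_preserving("I'm fine", features)))
```

With the same words, only the second prompt gives the model any chance of noticing the slow rate and long pauses.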

Core Insight

Prosody is not decoration on top of linguistic content. For emotion, it is often the primary signal. Building voice AI that ignores prosody is like building vision AI that ignores color — the system functions in a degraded, sometimes dangerously misleading way.

Lesson 1 Quiz

Prosody fundamentals and the acoustic-to-semantic gap
1. Which acoustic feature is most directly perceived as pitch in human speech?
Correct. F0 is the rate of vocal cord vibration and is the primary acoustic correlate of perceived pitch.
Not quite. Fundamental frequency (F0) is the acoustic feature directly perceived as pitch.
2. What does a "transcript-first pipeline" lose before passing speech to a language model?
Correct. Converting speech to text before LLM processing discards the acoustic prosodic features that carry emotional information.
Incorrect. The vocabulary and semantic content survive transcription — it's the prosodic envelope that is lost.
3. Amazon's 2018 Alexa expressive SSML work was triggered by which specific failure?
Correct. Bereavement responses with no prosodic adaptation led users to describe Alexa as "eerie," prompting the expressive SSML effort.
Incorrect. The trigger was delivering condolence messages with the same flat, metered cadence as factual weather responses.

Lab 1 — Prosody Analysis

Practice identifying prosodic features and their emotional implications with your AI tutor.

Your Task

You're working with a speech AI research team. Use the AI tutor below to explore how prosodic features map to emotional states — and why losing them in transcription is a problem. Ask about specific features, real-world cases, or design implications.

Starter prompt: "If a caller says 'I'm fine' with a slow, falling pitch and a long pause after 'fine' — what prosodic features should an AI system flag, and what do they suggest?"
Prosody Analysis Tutor Lab 1
Welcome to Lab 1. I'm your prosody analysis tutor. Let's explore how acoustic features carry emotional information — and what happens when AI systems strip them out. What aspect of prosody would you like to investigate first?
Module 7 · Lesson 2

Emotion Detection in Real-Time Voice AI

From acoustic feature extraction to affective state classification — what the engineering actually looks like.
What techniques allow a live voice AI system to infer emotional state from speech — and how reliable are they in 2024?

Cogito, a Boston-based AI company, deployed real-time speech analytics across major US insurance call centers beginning in 2017. Their system processed live audio alongside agents, measuring features including pace variability, energy level, overtalk frequency, and silence ratios. When the model detected caller distress — rapid speech, rising pitch, short silences — it displayed a small icon prompting agents to "slow down" or "show empathy." A 2019 internal study across 75,000 calls showed a 17% improvement in customer satisfaction scores on calls where agents followed the prompts. By 2022, Cogito's system was analyzing over 100 million conversations annually across healthcare, insurance, and financial services clients.

The Emotion Recognition Pipeline

Speech emotion recognition (SER) systems typically operate as a three-stage pipeline. Understanding each stage clarifies both the capabilities and the failure modes of current systems.

Stage 1: Feature Extraction

Raw audio is windowed into frames (typically 25ms with 10ms hop). From each frame, systems extract: mel-frequency cepstral coefficients (MFCCs), pitch estimates (CREPE or RAPT algorithms), energy/RMS values, zero-crossing rate, and spectral features. These form a feature matrix over time.
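
The framing step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration that computes only two of the listed features (RMS energy and zero-crossing rate); the function names and the 16 kHz sample rate are assumptions, and MFCCs and pitch tracking are left out because they need considerably more machinery.

```python
import math


def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Slice a signal into overlapping frames (25 ms window, 10 ms hop)."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def rms_energy(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def feature_matrix(samples, sample_rate=16000):
    """One row per frame: [rms_energy, zero_crossing_rate]."""
    return [[rms_energy(f), zero_crossing_rate(f)]
            for f in frame_signal(samples, sample_rate)]


if __name__ == "__main__":
    sr = 16000
    one_second = [math.sin(2 * math.pi * 440 * i / sr) for i in range(sr)]
    matrix = feature_matrix(one_second, sr)
    print("frames:", len(matrix), "features per frame:", len(matrix[0]))
```

At 16 kHz a 25 ms window is 400 samples and a 10 ms hop is 160 samples, so consecutive frames overlap by 60%.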

Stage 2: Temporal Modeling

Emotional states unfold over seconds, not milliseconds. LSTM networks, Transformers, or convolutional architectures aggregate frame-level features into utterance-level representations. Attention mechanisms identify which frames are most diagnostically relevant.
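
One way to picture how attention aggregates frame-level features into an utterance-level representation is softmax-weighted pooling. The sketch below is a toy version under stated assumptions: the `query` vector stands in for parameters a real model would learn, and the frame features are invented numbers, so it shows only the mechanism.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Weight each frame by its similarity to the query, then average.

    frames: list of equal-length feature vectors, one per frame
    query:  vector of the same length (a learned parameter in a real model)
    """
    scores = [sum(q * x for q, x in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    # Hypothetical [energy, pitch-variability] features for three frames.
    frames = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
    pooled, weights = attention_pool(frames, query=[2.0, 2.0])
    print("weights:", [round(w, 3) for w in weights])
    print("utterance vector:", [round(v, 3) for v in pooled])
```

The frame with the strongest response to the query dominates the pooled vector, which is the sense in which attention picks out the most diagnostic frames.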

Stage 3: Classification or Regression

The model outputs either discrete emotion categories (happy, sad, angry, neutral, fearful) or continuous dimensional values on the valence-arousal plane. Most production systems use dimensional output as it handles ambiguous states better.

The Valence-Arousal Space

The Russell Circumplex Model (1980) maps emotions on two axes: valence (positive ↔ negative) and arousal (high energy ↔ low energy). Anger = high arousal, negative valence. Excitement = high arousal, positive valence. Depression = low arousal, negative valence. Serenity = low arousal, positive valence.
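
The four examples above amount to a quadrant lookup. A minimal sketch, assuming both axes are scaled to the range -1 to 1 and that zero is the dividing line (both are conventions chosen for this example, not part of the model itself):

```python
def quadrant(valence, arousal):
    """Map a (valence, arousal) point to the nearest example from the text."""
    if arousal >= 0:
        return "excitement-like" if valence >= 0 else "anger-like"
    return "serenity-like" if valence >= 0 else "depression-like"


if __name__ == "__main__":
    for point in [(-0.7, 0.8), (0.6, 0.9), (-0.5, -0.6), (0.4, -0.7)]:
        print(point, "->", quadrant(*point))
```

A dimensional model outputs the continuous point itself; collapsing it to a quadrant label like this throws away the distance from the axes, which is often the most useful part for ambiguous states.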

Benchmark Performance — Where We Actually Are

The IEMOCAP corpus (Interactive Emotional Dyadic Motion Capture) from USC is the most widely used benchmark for speech emotion recognition. As of 2024, state-of-the-art systems achieve roughly 75–80% weighted accuracy on 4-class IEMOCAP tasks (angry, happy, sad, neutral). That sounds promising until you examine the failure modes.

The "happy" problem: Happy and excited states are routinely confused because both feature high arousal. In the IEMOCAP benchmark, "happy" is one of the lowest-accuracy classes across nearly all published systems — often below 65% recall.

The "neutral" problem: Natural speech contains far more neutral-affect utterances than labeled datasets reflect. Systems trained on acted emotion corpora dramatically over-predict emotional states in real conversations — a calibration failure with real consequences when deployed in call centers.
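
A headline accuracy figure can hide a weak class, which is easier to see when overall accuracy and per-class recall are computed side by side. The sketch below does that for a made-up 4-class confusion matrix. The counts are hypothetical and are not IEMOCAP results, and the code assumes that "weighted accuracy" means overall accuracy across all utterances, which is a common convention in SER papers.

```python
LABELS = ["angry", "happy", "sad", "neutral"]

# Hypothetical confusion matrix: rows are true labels, columns are predictions.
CONFUSION = [
    [80, 5, 5, 10],     # angry
    [8, 36, 4, 12],     # happy
    [4, 4, 82, 10],     # sad
    [10, 10, 12, 108],  # neutral
]


def overall_accuracy(cm):
    """Correct predictions divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_recall(cm, labels):
    """For each true class, the share of its utterances labeled correctly."""
    return {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}


if __name__ == "__main__":
    print("overall accuracy:", overall_accuracy(CONFUSION))
    for label, recall in per_class_recall(CONFUSION, LABELS).items():
        print(f"{label:>8} recall: {recall:.3f}")
```

In this invented example the overall figure sits in the mid-70s while "happy" recall is only 60%, which is why per-class numbers should be requested alongside any headline accuracy.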

Real Deployment Gap

Microsoft's 2023 Azure Cognitive Services team published internal data showing that their emotion detection model, trained on studio-recorded corpora, dropped from 79% accuracy on benchmark data to 51% on real customer service calls — due to background noise, overlapping speech, and non-acted emotional expression. This "lab-to-field" gap is the dominant unsolved problem in production SER.

End-to-End Emotion Models: wav2vec and Beyond

The dominant research direction since 2021 has been end-to-end models that learn acoustic representations directly from raw audio without hand-engineered features. Facebook AI's wav2vec 2.0 (released 2020) and later models such as Google's w2v-BERT learn powerful speech representations through self-supervised pretraining on thousands of hours of unlabeled audio. When fine-tuned for emotion recognition, these models outperform MFCC-based systems significantly.

Google's 2023 paper on Universal Speech Model (USM) demonstrated that a single large speech model pretrained on 12 million hours of audio could be fine-tuned for emotion detection with as few as 100 labeled examples per class — addressing the data scarcity problem that plagues specialized SER training.

The 2024 Frontier

OpenAI's GPT-4o (May 2024) was the first major commercial system to process raw audio input end-to-end without a transcript intermediary for some tasks. OpenAI's technical blog noted it could detect "tone, multiple speakers, and emotional context" from audio. This marks a genuine architectural shift — but as of late 2024, the emotion detection capabilities remain underdocumented in production settings.

Lesson 2 Quiz

Emotion detection pipelines and real-world SER performance
1. In the Russell Circumplex Model, what two dimensions define emotional space?
Correct. The circumplex model places emotions on a valence (positive/negative) and arousal (high/low energy) plane.
Incorrect. The two dimensions are valence (positive ↔ negative) and arousal (high energy ↔ low energy).
2. What was the key finding of Cogito's 2019 internal call center study?
Correct. The 75,000-call study showed a 17% CSAT improvement when agents acted on the real-time prosodic distress signals.
Incorrect. The 17% improvement was in customer satisfaction scores (CSAT), not call length or accuracy.
3. What is the primary cause of the "lab-to-field" accuracy gap in production SER systems?
Correct. Acted emotion datasets don't capture the subtler, noisier, overlapping emotional expression of real conversations.
Incorrect. The gap is caused by training on acted, studio-quality corpora that don't generalize to real-world conditions.

Lab 2 — SER Pipeline Design

Work through emotion recognition architecture decisions with your AI tutor.

Your Task

You're designing a real-time speech emotion recognition system for a healthcare triage phone line. The system must flag potential caller distress within 3 seconds of an utterance ending. Use the AI tutor to reason through architectural choices, tradeoffs, and failure modes.

Starter prompt: "I need to choose between a traditional MFCC + LSTM pipeline and a fine-tuned wav2vec 2.0 model for a healthcare triage SER system. What are the key tradeoffs for a 3-second real-time constraint?"
SER Architecture Tutor Lab 2
Welcome to Lab 2. I'm your speech emotion recognition architecture tutor. Let's design a real-time SER system for healthcare triage — where getting the latency, accuracy, and failure modes right actually matters. What's your first design question?
Module 7 · Lesson 3

Expressive Speech Synthesis — Making AI Sound Human

How modern TTS systems generate prosodically appropriate speech — and where they still fail.
What does it actually take to make synthesized speech convey the right emotional weight — not just the right words?

When Google demonstrated Duplex at I/O 2018 — its AI that calls restaurants to make reservations — the audience's shock was not about accuracy. It was about the "um" and "mm-hmm." The system inserted filler words and produced micro-pauses in exactly the positions a nervous human caller would. Researchers subsequently analyzed the audio and found Duplex was generating prosodic contours in real-time: rising pitch on booking requests (signaling tentativeness), falling pitch on confirmations (signaling closure), and deliberate 200–400ms hesitation pauses before complex date-negotiation turns. Google later disclosed that the prosody model was trained on millions of hours of real phone conversations — not scripted TTS data. The filler words were not a bug. They were a deliberate prosodic design choice to prevent the call recipient from hanging up on an obviously robotic voice.

The Architecture of Expressive TTS

Modern expressive text-to-speech systems separate the problem into three components: linguistic analysis (what to say and how to phrase it), prosody prediction (what the acoustic parameters should be), and neural vocoder (generating the actual audio waveform). Each component has seen substantial progress since 2016's WaveNet breakthrough by DeepMind.

FastSpeech 2 (Microsoft, 2021)

Predicts duration, pitch, and energy for each phoneme in parallel — enabling non-autoregressive generation at roughly 40× real-time speed. The pitch predictor outputs a continuous F0 contour that can be scaled up or down to change emotional register without re-synthesizing from scratch.
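
The idea of scaling an F0 contour can be illustrated on a plain list of per-frame pitch values. This is not FastSpeech 2 code: the function name, the convention that 0.0 marks an unvoiced frame, and the sample contour are assumptions for this sketch. The only fixed fact used is that a shift of n semitones multiplies frequency by 2^(n/12).

```python
def shift_semitones(f0_contour_hz, semitones):
    """Scale every voiced F0 value by the given number of semitones.

    Frames with F0 == 0.0 are treated as unvoiced and left unchanged.
    """
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0 else 0.0 for f in f0_contour_hz]


if __name__ == "__main__":
    contour = [200.0, 210.0, 0.0, 190.0, 180.0]  # hypothetical contour in Hz
    lowered = shift_semitones(contour, -2)
    print([round(f, 1) for f in lowered])
```

Because only the contour changes, the durations and energy values predicted for each phoneme can be reused as they are.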

VALL-E (Microsoft, 2023)

A codec language model trained on 60,000 hours of English speech. Given a 3-second speaker prompt, VALL-E can synthesize novel text in that speaker's voice — preserving prosodic style, emotional coloring, and accent. Demonstrated zero-shot emotional transfer: feeding an angry speaker sample produced angry synthesis of unrelated text.

ElevenLabs Emotional Styles

ElevenLabs' 2023 v2 model introduced "style exaggeration" — a continuous parameter controlling how far the system pushes pitch variance, speaking rate variation, and energy fluctuation. Setting it to 0 produces flat corporate TTS; setting it high produces exaggerated theatrical speech. The company's internal A/B tests showed listener preference peaked at style values of 0.3–0.5 for conversational AI contexts.

Google's AudioPaLM (2023)

AudioPaLM combines a language model with an audio model, enabling it to respond to spoken input with spoken output while preserving the speaker's emotional tone. If a user speaks with urgency, AudioPaLM tends to respond with a faster speaking rate and elevated energy — a rudimentary form of prosodic mirroring.

SSML and Prosody Control

Speech Synthesis Markup Language (SSML) is the W3C standard for annotating text with prosodic instructions. It allows developers to specify pitch, rate, volume, pauses, and emphasis directly in markup. For example:

SSML Example — Condolence Phrasing

<speak><prosody rate="slow" pitch="-2st"><break time="300ms"/>I'm so sorry for your loss.<break time="500ms"/></prosody></speak>

This instructs the TTS engine to slow the speech rate, lower the pitch by 2 semitones, insert 300 ms of silence before the phrase, and insert 500 ms of silence after it. The result is audibly more appropriate than default rendering.

The limitation of SSML is that it requires explicit authoring — a human must annotate every utterance. For dynamic conversation AI, this is impractical. The research frontier is automatic prosody annotation: LLMs that predict appropriate SSML tags for any generated text based on discourse context, user emotional state, and conversational goal.
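
A very small step toward automatic annotation is a rule table that maps a detected user state to prosody settings and emits the markup. The sketch below is a hedged illustration: the `STYLE_RULES` table and the state names are hypothetical, and a research system would predict these values with a model instead of looking them up. The "grief" entry reproduces the condolence example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a detected user state to prosody settings.
STYLE_RULES = {
    "grief": {"rate": "slow", "pitch": "-2st", "pre_ms": 300, "post_ms": 500},
    "neutral": {"rate": "medium", "pitch": "medium", "pre_ms": 0, "post_ms": 0},
}


def to_ssml(text, user_state):
    """Wrap text in <speak><prosody> markup chosen from STYLE_RULES."""
    style = STYLE_RULES.get(user_state, STYLE_RULES["neutral"])
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody",
                            rate=style["rate"], pitch=style["pitch"])
    if style["pre_ms"]:
        # The spoken text follows the leading pause, so it is the break's tail.
        pre = ET.SubElement(prosody, "break", time=f'{style["pre_ms"]}ms')
        pre.tail = text
    else:
        prosody.text = text
    if style["post_ms"]:
        ET.SubElement(prosody, "break", time=f'{style["post_ms"]}ms')
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(to_ssml("I'm so sorry for your loss.", "grief"))
    print(to_ssml("It is 18 degrees and sunny.", "neutral"))
```

Building the markup with an XML library instead of string concatenation keeps the output well-formed even when the text contains characters such as `&` or `<`.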

The Uncanny Valley of TTS

A persistent finding in TTS listener research is a version of the uncanny valley: synthesized speech that is almost but not quite right triggers more discomfort than obviously robotic speech. Carnegie Mellon's 2022 study found that listeners rated high-quality neural TTS with occasional prosodic errors as "more disturbing" than clearly synthetic speech — because the near-human quality made the errors feel intentional or deceptive.

This has concrete product implications. ElevenLabs and OpenAI both add controlled imperfections (micro-pauses, slight pitch instabilities, breath sounds) to keep their TTS out of the uncanny valley. These are not limitations of the model; they are deliberate design choices informed by perceptual psychology.

Transparency Question

Google Duplex's 2018 demo sparked immediate debate about disclosure: if a synthesized voice uses prosodic features specifically designed to be mistaken for human, is there an ethical obligation to disclose? Google subsequently added a required disclosure at the start of Duplex calls. The broader question — how expressive is too expressive — remains unresolved in AI voice design ethics.

Lesson 3 Quiz

Expressive TTS, prosody control, and synthesis design
1. What was the deliberate prosodic design purpose of filler words ("um," "mm-hmm") in Google Duplex?
Correct. Google's team disclosed the fillers were a deliberate prosodic choice to reduce detection of synthetic origin and prevent premature call termination.
Incorrect. The fillers served a prosodic deception-avoidance goal: making the voice sound human enough to keep the conversation going.
2. Which SSML parameter would you use to make a TTS voice sound more somber by adjusting vocal pitch?
Correct. The prosody tag with a negative semitone pitch value lowers the F0, producing a perceptually more somber tone.
Incorrect. The prosody tag with a pitch attribute (e.g., "-2st" for negative semitones) is the correct SSML control for pitch adjustment.
3. What did CMU's 2022 TTS listener study find about the uncanny valley in speech synthesis?
Correct. The near-human quality made prosodic errors feel intentional or deceptive — triggering more discomfort than obviously robotic speech.
Incorrect. The study found the opposite: near-human TTS with errors was rated more disturbing than clearly robotic speech — the uncanny valley effect.

Lab 3 — Expressive TTS Design

Design SSML and prosody strategies for emotionally appropriate AI voice responses.

Your Task

You're building the voice layer for a mental health support chatbot. The system needs to respond to users who may be in distress with tonally appropriate, non-jarring synthesized speech. Use the AI tutor to work through SSML strategies, prosody parameter choices, and the ethics of expressive synthesis in sensitive contexts.

Starter prompt: "A user just said 'I've been really struggling lately.' My TTS needs to respond with a supportive message. Walk me through what SSML parameters I should consider and why."
Expressive TTS Tutor Lab 3
Welcome to Lab 3. I'm your expressive TTS design tutor. We're working on prosodically appropriate voice responses for a mental health support context — where the wrong tone can be as harmful as the wrong words. What's your first design challenge?
Module 7 · Lesson 4

Ethics, Bias, and the Future of Affective Voice AI

Who benefits, who is harmed, and what obligations come with building systems that read and generate human emotion.
When an AI system claims to detect your emotional state from your voice — what rights do you have, what risks do you face, and what should builders be required to disclose?

HireVue, a video hiring platform used by Unilever, Goldman Sachs, and hundreds of other companies, analyzed job interview videos using AI that assessed facial expressions, word choice, and, critically, vocal prosodic features including speech rate variation, pitch range, and micro-pause patterns. The company claimed these features predicted job performance. In 2019, the Electronic Privacy Information Center (EPIC) filed a complaint with the FTC, arguing the system was scientifically unfounded and discriminatory. In January 2021, HireVue announced it had removed facial analysis from its system under public and regulatory pressure. The company continued using audio analysis, but the episode established a precedent: affective AI deployed in high-stakes decisions faces fundamental legitimacy challenges. A 2019 review in Psychological Science in the Public Interest, co-authored by leading emotion researchers, concluded that the scientific evidence for reliably inferring emotions from faces or voices was "weak to nonexistent" for applications like hiring.

Demographic Bias in Emotion AI

Emotion recognition systems trained predominantly on Western, English-language, acted corpora exhibit systematic bias across multiple demographic axes. The evidence is extensive and consistent:

Accent and Language Bias

MIT Media Lab's 2019 Gender Shades follow-up study found commercial emotion recognition systems performed 10–15% worse on non-native English speakers. Prosodic norms — what "confident" or "distressed" sounds like — vary substantially across languages. Systems trained on American English misclassify tonal language speakers systematically.

Gender and Age Bias

NIST's 2022 speaker recognition evaluations found emotional classifiers showed up to 12% accuracy differential between male and female voices, partly due to F0 range differences. Elderly voices — with higher jitter, shimmer, and reduced pitch range — are systematically under-detected for positive emotional states.

Cultural Emotional Norms

Research by Lisa Feldman Barrett and colleagues (2019) established that emotional expression varies substantially across cultures — not just in display rules but in the acoustic features used to convey specific states. A system calibrated on one cultural context produces systematically biased output in another.

Disability and Neurodivergence

Autistic speakers, individuals with speech impediments, and those with depression-related voice changes are systematically mislabeled by standard SER systems. A 2021 Nature Digital Medicine study found depression screening tools using vocal features showed 23% higher false-negative rates for autistic participants versus neurotypical controls.

Regulatory Landscape (2024)

The EU AI Act (approved by the European Parliament in March 2024) regulates emotion recognition tightly. Emotion recognition systems are listed among high-risk AI uses, which brings conformity assessment, transparency obligations, and human oversight before deployment. In workplace and education settings the Act goes further and prohibits AI systems that infer emotions, except for medical or safety reasons. Biometric categorization based on voice also falls under particularly strict provisions.

In the United States, Illinois' Artificial Intelligence Video Interview Act (2020) requires employers using AI to analyze video interviews to disclose the use of AI, explain how it works, and obtain written consent from applicants. Multiple other states introduced similar legislation in 2023–2024. The FTC's 2023 policy statement on AI and consumer protection specifically cited emotion AI as an area of active concern.

The Consent Architecture Problem

When you call a customer service line and hear "this call may be recorded for quality assurance," you are not being told that a real-time emotion AI is classifying your vocal distress and flagging it for agent prompting, CRM tagging, or churn prediction. The consent architecture of most deployed affective voice AI does not reflect the actual scope of data collection and inference — a gap that regulators are beginning to close.

Responsible Design Principles for Affective Voice AI

Research teams and ethicists working in this space have converged on several principles for responsible deployment. These are not hypothetical — they represent documented positions from organizations including the AI Now Institute, Partnership on AI, and the ACM FAccT community:

Proportionality: The sensitivity of the inference (emotional state) must be proportional to the stakes of the use case. Detecting caller distress for a crisis line is different from detecting boredom in a retail call center.
Transparency: Users must be told when affective AI is active, what is being inferred, and how it affects decisions, and that information must not be buried in terms of service.
Contestability: Users must have a meaningful mechanism to challenge or correct emotional classifications that affect them.
Demographic Auditing: Systems must be tested for differential performance across demographic groups before and during deployment, with results disclosed.
Human Review: High-stakes decisions triggered by emotional AI, particularly in healthcare, employment, and legal contexts, must have a human decision-maker in the loop.
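
The demographic auditing principle translates directly into a per-group accuracy report. The sketch below is a minimal illustration under stated assumptions: the record format `(group, true_label, predicted_label)`, the group names, and the sample data are all hypothetical, and a real audit would also need confidence intervals and adequate sample sizes per group.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute classification accuracy separately for each group.

    records: iterable of (group, true_label, predicted_label) tuples
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, true_label, predicted in records:
        total[group] += 1
        correct[group] += int(true_label == predicted)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Largest accuracy difference between any two groups."""
    return max(per_group.values()) - min(per_group.values())


if __name__ == "__main__":
    # Hypothetical audit sample.
    records = [
        ("native", "sad", "sad"), ("native", "angry", "angry"),
        ("native", "neutral", "neutral"), ("native", "happy", "neutral"),
        ("non_native", "sad", "neutral"), ("non_native", "angry", "angry"),
        ("non_native", "neutral", "neutral"), ("non_native", "happy", "sad"),
    ]
    per_group = accuracy_by_group(records)
    print(per_group)
    print("largest gap:", max_accuracy_gap(per_group))
```

Reporting the gap alongside the overall figure makes differential performance visible instead of averaging it away.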
The Builder's Obligation

Builders of affective voice AI sit at an intersection of rapidly advancing capability and lagging regulation. The systems work well enough to be deployed widely, and not well enough to be deployed safely without careful design. Understanding both the technical limits and the ethical obligations is not optional expertise for practitioners in this field — it is the field.

Lesson 4 Quiz

Ethics, bias, and regulation in affective voice AI
1. What was the outcome of the EPIC complaint against HireVue's emotion AI system?
Correct. Under sustained pressure from EPIC and the research community, HireVue removed facial analysis — while continuing audio analysis.
Incorrect. HireVue removed its facial analysis in January 2021 under pressure — the FTC did not formally rule either way.
2. How does the EU AI Act (2024) classify real-time emotion recognition in employment contexts?
Correct. The EU AI Act treats emotion recognition as at least a high-risk use, and in the workplace it prohibits emotion inference outside medical or safety purposes.
Incorrect. Under the EU AI Act, emotion recognition is at minimum a high-risk use, and emotion inference in the workplace is prohibited outside medical or safety purposes.
3. Which demographic group showed 23% higher false-negative rates in depression screening vocal AI tools, per a 2021 Nature Digital Medicine study?
Correct. Autistic speakers' different vocal prosodic patterns caused standard depression screening tools to miss cases at significantly higher rates.
Incorrect. The 23% higher false-negative rate was specifically found for autistic participants versus neurotypical controls.

Lab 4 — Affective AI Ethics Audit

Apply responsible design principles to real-world affective voice AI deployment scenarios.

Your Task

You're on an AI ethics review board evaluating a proposed deployment of real-time emotion recognition in a major hospital's emergency department phone triage system. The vendor claims 78% accuracy on benchmark data. Use the AI tutor to work through the audit framework — what questions to ask, what evidence to demand, and what safeguards to require.

Starter prompt: "The vendor says their emotion AI has 78% weighted accuracy on IEMOCAP. What questions should I ask before approving deployment in a hospital emergency triage context?"
Ethics Audit Tutor Lab 4
Welcome to Lab 4. I'm your affective AI ethics audit tutor. We're evaluating an emotion recognition system for emergency department triage — a high-stakes, high-vulnerability context where demographic bias and accuracy gaps have life-safety implications. What's your first audit question?

Module 7 Test

Emotional and Prosodic AI — 15 questions · 80% to pass
1. What does "fundamental frequency" (F0) correspond to in human perception?
Correct. F0 is the rate of vocal cord vibration, directly perceived as pitch.
Incorrect. F0 is perceived as pitch.
2. A "transcript-first pipeline" routes speech through which component before the language model?
Correct. ASR converts audio to text first, losing prosodic information before the LLM sees it.
Incorrect. ASR (automatic speech recognition) is the intermediary that creates the transcript.
3. What did Amazon's expressive SSML tags specifically address in Alexa's early TTS?
Correct. The SSML work emerged directly from the failure of flat condolence delivery in bereavement skills.
Incorrect. The SSML tags addressed prosodic tone for emotional contexts — specifically the flat grief-context responses.
4. In the Russell Circumplex, "depression" maps to which quadrant?
Correct. Depression is characterized by low energy (low arousal) and negative affect (negative valence).
Incorrect. Depression = low arousal, negative valence in the circumplex model.
5. What was Cogito's primary product offering in call center deployments?
Correct. Cogito analyzed live audio and surfaced icons prompting agents to slow down or show empathy when distress was detected.
Incorrect. Cogito provided live agent coaching via real-time prosodic distress detection — not post-call analysis.
6. What accuracy drop did Microsoft's Azure Cognitive Services emotion model experience going from benchmark to real call center data?
Correct. The 28-point drop from 79% to 51% illustrates the severity of the lab-to-field generalization gap.
Incorrect. The documented drop was from 79% to 51% — a 28-point collapse from benchmark to real-world conditions.
7. What self-supervised pretraining approach does wav2vec 2.0 use to learn speech representations?
Correct. wav2vec 2.0 learns powerful acoustic representations by predicting masked audio segments from unlabeled data.
Incorrect. wav2vec 2.0 is self-supervised — trained on unlabeled audio without emotion labels.
8. Microsoft's VALL-E (2023) demonstrated which capability regarding emotional voice synthesis?
Correct. VALL-E's codec language model architecture enabled emotional style to transfer zero-shot from a speaker prompt to novel text.
Incorrect. VALL-E's breakthrough was zero-shot emotional transfer from a 3-second speaker sample to new text synthesis.
9. What did CMU's 2022 study find was more disturbing to listeners than clearly synthetic TTS?
Correct. The uncanny valley effect: near-human quality made prosodic errors feel intentional or deceptive — more disturbing than obvious synthesis.
Incorrect. The uncanny valley finding was that near-human TTS with prosodic errors rated worse than clearly robotic speech.
10. What was the 2019 Psychological Science in the Public Interest review's conclusion about inferring emotions from faces and voices?
Correct. This landmark review by leading emotion researchers provided academic backing for HireVue critics' arguments.
Incorrect. The review's conclusion was stark: the evidence for reliably inferring emotions from biometric signals for applications like hiring was "weak to nonexistent."
11. Illinois' Artificial Intelligence Video Interview Act (2020) requires employers using AI hiring tools to do which of the following?
Correct. The Illinois law established the disclosure-plus-consent framework that other states subsequently modeled.
Incorrect. The Illinois law requires disclosure of AI use, explanation of the technology, and written consent — not model submission or score sharing.
12. Which responsible design principle requires that high-stakes AI emotion decisions maintain a human decision-maker in the loop?
Correct. Human Review specifically addresses the requirement for human oversight in high-stakes affective AI decisions.
Incorrect. Human Review is the principle requiring a human decision-maker in the loop for high-stakes AI emotion inferences.
13. What accuracy differential did NIST's 2022 evaluations find in emotional classifiers between male and female voices?
Correct. The up-to-12% accuracy gap between male and female voices in SER systems partly reflects F0 range differences in training data.
Incorrect. NIST found up to 12% accuracy differential between male and female voices in SER systems.
14. Which FastSpeech 2 component allows changing emotional register without full re-synthesis?
Correct. FastSpeech 2's explicit pitch predictor outputs a scalable F0 contour — enabling post-hoc emotional register adjustment.
Incorrect. The pitch predictor's scalable F0 contour output is what allows emotional register changes without full re-synthesis.
15. What was the architectural significance of OpenAI's GPT-4o (May 2024) for voice emotion AI?
Correct. GPT-4o's end-to-end audio processing marks a genuine architectural departure from transcript-first pipelines — keeping prosodic information intact.
Incorrect. GPT-4o's significance was architectural: it processed raw audio end-to-end without mandatory transcription first, preserving prosodic signals.