Voice and Real-Time AI

1. ElevenLabs' Eleven Multilingual v2 supports approximately how many languages?

Correct. Eleven Multilingual v2 (2023) supports 29 languages with cross-lingual voice cloning — the ability to clone a voice in one language and use it to speak in another.

ElevenLabs' Eleven Multilingual v2 covers 29 languages. Google's Chirp (100+ languages) and Microsoft Azure Neural Voice (140 languages) have broader language coverage at the platform level.

2. Google Meet's live speaker caption latency is approximately 1–2 seconds from speech end to display. This latency is primarily caused by:

Correct. Speaker boundary detection is not instantaneous — the system must observe audio after the boundary to compute an embedding for the new speaker and assign it to a cluster. This post-boundary observation window (typically 1–2 seconds) creates an irreducible latency in streaming diarization.

The latency is fundamentally acoustic and statistical: to assign a speaker label to a segment, the system needs enough audio after the speaker change to compute a reliable embedding. This post-boundary lookahead requirement (typically 1–2 seconds) cannot be eliminated without accepting much higher speaker error rates.

3. What accuracy drop did Microsoft's Azure Cognitive Services emotion model experience going from benchmark to real call center data?

Correct. The 28-point drop from 79% to 51% illustrates the severity of the lab-to-field generalization gap.

Incorrect. The documented drop was from 79% to 51% — a 28-point collapse from benchmark to real-world conditions.

4. The "Air Canada precedent" establishes that:

Correct. The ruling established that you cannot disclaim your AI agent's promises — organizational liability attaches to what the agent says and commits to.

The core precedent is organizational liability for AI agent commitments — you own what your agent promises.

5. What did Amazon's expressive SSML tags specifically address in Alexa's early TTS?

Correct. The SSML work emerged directly from the failure of flat condolence delivery in bereavement skills.

Incorrect. The SSML tags addressed prosodic tone for emotional contexts — specifically the flat grief-context responses.

6. WaveNet's core architectural innovation was:

Correct. WaveNet used dilated causal convolutions — skipping over inputs at exponentially growing intervals — to model the probability of each audio sample given all previous ones, enabling a large receptive field without prohibitive computation.

Incorrect. WaveNet's core innovation was dilated causal convolutions that model the probability distribution of each audio sample conditioned on all previous samples in the waveform.
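
To make the receptive-field point concrete, here is a minimal pure-Python sketch of a causal dilated convolution and of how quickly the receptive field grows when dilations double per layer. The kernel size of 2 and the 1, 2, 4, ..., 512 dilation schedule are assumptions chosen for illustration, and the function names are hypothetical; this is not WaveNet's actual implementation, which also uses gated activations and residual connections.

```python
def receptive_field(kernel_size, dilations):
    """Number of current-and-past samples that one output sample can see."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution over a list of samples.

    weights[0] applies to the current sample and weights[k] to the sample
    k * dilation steps in the past. Positions before the start count as zero,
    so no output ever depends on a future sample.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


# Assumed schedule: dilation doubles at every layer.
dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
print(receptive_field(2, dilations))      # 1024 samples from only 10 layers
```

Ten undilated layers with the same kernel size would see only 11 samples, which is why dilation is what makes sample-level autoregressive modeling tractable.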

7. In the Russell Circumplex, "depression" maps to which quadrant?

Correct. Depression is characterized by low energy (low arousal) and negative affect (negative valence).

Incorrect. Depression = low arousal, negative valence in the circumplex model.
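
The quadrant lookup can be sketched as a small function over the two circumplex axes. The [-1, 1] scaling, the zero thresholds, and the example coordinates below are illustrative assumptions, not values from a published dataset.

```python
def circumplex_quadrant(valence, arousal):
    """Map a (valence, arousal) pair, each assumed to lie in [-1, 1], to a quadrant label."""
    a = "high arousal" if arousal >= 0 else "low arousal"
    v = "positive valence" if valence >= 0 else "negative valence"
    return f"{a}, {v}"


# Illustrative (valence, arousal) coordinates only.
examples = {
    "excited": (0.7, 0.8),
    "angry": (-0.7, 0.8),
    "calm": (0.6, -0.6),
    "depressed": (-0.7, -0.6),
}

for name, (valence, arousal) in examples.items():
    print(f"{name}: {circumplex_quadrant(valence, arousal)}")
```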

8. The EU AI Act (in force since August 2024, with most obligations applying from August 2026) classifies real-time remote biometric identification in public spaces as:

Correct. The EU AI Act's prohibited practices article bans real-time remote biometric identification systems in publicly accessible spaces, with narrow exceptions for law enforcement involving serious crimes (terrorism, missing persons) requiring prior judicial authorization.

The EU AI Act places real-time remote biometric identification in public spaces in the prohibited category, not merely high-risk. The prohibitions have applied since February 2025, so speaker identification systems deployed in public environments (transit, retail, stadiums) cannot legally operate in the EU without falling within the narrow law-enforcement exception.

9. What self-supervised pretraining approach does wav2vec 2.0 use to learn speech representations?

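Correct. wav2vec 2.0 learns powerful acoustic representations from unlabeled audio: it masks spans of latent speech features and trains the model to pick the true quantized representation of each masked span from a set of distractors (a contrastive objective).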

Incorrect. wav2vec 2.0 is self-supervised — trained on unlabeled audio without emotion labels.

10. The MemGPT architecture (now Letta) addressed which key limitation of standard LLMs?

Correct. MemGPT created an OS-style memory hierarchy to overcome the context window limit — enabling theoretically unlimited conversational memory.

MemGPT's contribution was solving the context window limitation through a hierarchical memory architecture managed by the model itself.
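
The memory hierarchy idea can be illustrated with a toy two-tier store. This is a simplified sketch under stated assumptions, not MemGPT's or Letta's actual API: the class and method names are hypothetical, a whitespace word count stands in for a real tokenizer, and eviction happens automatically here, whereas in MemGPT the model itself decides what to move between tiers.

```python
from collections import deque


class TieredMemory:
    """Toy two-tier memory: a bounded main context plus unbounded archival storage."""

    def __init__(self, max_context_tokens):
        self.max_context_tokens = max_context_tokens
        self.main_context = deque()  # what would be sent to the model
        self.archive = []            # evicted messages, searchable later

    @staticmethod
    def _count_tokens(text):
        # Assumption: whitespace word count stands in for a real tokenizer.
        return len(text.split())

    def _context_size(self):
        return sum(self._count_tokens(m) for m in self.main_context)

    def add(self, message):
        self.main_context.append(message)
        # Evict the oldest messages until the context fits the budget again,
        # always keeping at least the newest message.
        while (self._context_size() > self.max_context_tokens
               and len(self.main_context) > 1):
            self.archive.append(self.main_context.popleft())

    def recall(self, keyword):
        """Naive case-insensitive keyword search over archival storage."""
        return [m for m in self.archive if keyword.lower() in m.lower()]


memory = TieredMemory(max_context_tokens=6)
memory.add("my dog is named Rex")
memory.add("what is the weather today")
print(list(memory.main_context))  # only the newest message still fits
print(memory.recall("rex"))       # the older message is recoverable from the archive
```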

11. What distinguishes OpenAI's Realtime API architecture from a classic ASR→LLM→TTS pipeline?

Correct. The Realtime API's key architectural innovation is eliminating the text representation entirely for some use cases. Audio in, audio out, through a single model — removing two conversion steps and preserving acoustic nuance (tone, emotion, hesitation) that disappears when audio is transcribed to text.

The defining difference is architectural: audio goes directly in and directly out without a text intermediate layer. This removes two conversion latencies and, importantly, preserves acoustic information that is permanently lost when audio is compressed into a text transcript.

12. Diarization Error Rate (DER) is defined as:

Correct. DER = (missed speech + false alarm + speaker error) / total reference speaker time. It is a time-based metric — errors are measured in seconds of incorrectly attributed audio, not in number of speaker turns or words.

DER measures incorrectly attributed speaker time as a fraction of total reference speaker time. Its three components are missed speech (VAD miss), false alarm (VAD false positive), and speaker error (correct speech detection but wrong speaker label). WER and EER are different metrics.
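
The DER formula above translates directly into code. The durations in the example are made-up numbers for illustration, and the function name is hypothetical.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_reference):
    """Compute DER from error durations. All arguments are in seconds.

    missed          : reference speech the system labeled as non-speech
    false_alarm     : non-speech the system labeled as speech
    speaker_error   : speech detected correctly but attributed to the wrong speaker
    total_reference : total reference speaker time
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + speaker_error) / total_reference


# Illustrative numbers: a 10-minute reference with 36 seconds of total error.
der = diarization_error_rate(
    missed=12.0, false_alarm=6.0, speaker_error=18.0, total_reference=600.0
)
print(f"DER = {der:.1%}")  # DER = 6.0%
```

Because false alarms are counted against reference speaker time only, DER can exceed 100% on recordings with little reference speech.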

13. Data minimization as a privacy principle applied to diarization means:

Correct. If your product's function (real-time diarization labels) does not require storing embeddings after the session, deleting them is better privacy practice and substantially reduces legal exposure. The legal risk of biometric data scales directly with retention duration and volume.

Data minimization in the biometric context means not storing what you don't need. Many diarization use cases (real-time caption attribution) only require embeddings during the session — after which they can be deleted. Retaining embeddings indefinitely for "future use" is the most common source of unnecessary BIPA and GDPR exposure.
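
One way to enforce session-scoped retention is to tie the embedding store's lifetime to the session itself. The sketch below is a hypothetical design, not a specific library's API: the class and function names are assumptions. Clearing an in-memory dictionary also does not by itself guarantee secure erasure or cover copies in logs, caches, or backups.

```python
from contextlib import contextmanager


class SessionEmbeddings:
    """In-memory speaker embeddings that live only for one session."""

    def __init__(self):
        self._by_speaker = {}

    def update(self, speaker_label, embedding):
        self._by_speaker[speaker_label] = list(embedding)

    def labels(self):
        return sorted(self._by_speaker)

    def purge(self):
        self._by_speaker.clear()


@contextmanager
def diarization_session():
    """Yield an embedding store and purge it when the session ends."""
    store = SessionEmbeddings()
    try:
        yield store
    finally:
        store.purge()  # runs even if the session ends with an exception


with diarization_session() as store:
    store.update("speaker_0", [0.12, -0.40, 0.88])
    print(store.labels())  # embeddings are available during the session

print(store.labels())      # empty after the session closes
```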

14. The standard two-stage neural TTS pipeline consists of:

Correct. The standard two-stage pipeline separates acoustic modeling (text or phonemes → mel-spectrogram) from waveform synthesis (mel-spectrogram → audio waveform, the vocoder stage). Tacotron 2 + HiFi-GAN is a canonical example.

Incorrect. The standard two-stage pipeline consists of an acoustic model that converts text to a mel-spectrogram, followed by a vocoder that converts the spectrogram to a waveform.

15. Whisper large-v2 achieves approximately what WER on LibriSpeech clean test set?

Correct. Whisper large-v2 achieves ~2.7% WER on LibriSpeech clean — below the typical human transcriber rate of ~5.8% on this benchmark.

Whisper large-v2's published WER on LibriSpeech clean is ~2.7%, below typical human transcriber performance on this specific benchmark.
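
For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal version with whitespace tokenization and no text normalization; published benchmark figures normalize casing and punctuation before scoring, so this function will not reproduce them exactly.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```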

16. BMW's iDrive 8 used button-press as primary voice activation in automotive contexts primarily because:

Correct.

Driver studies found that requiring users to say a wake phrase created more distraction-related errors than a physical button press in dynamic driving conditions.

17. End-to-End Neural Diarization (EEND) handles overlapping speech better than clustering approaches, but has a key limitation:

Correct. EEND models N speaker activity streams jointly. This means you must commit to a maximum N during training. An EEND model trained for 4 speakers cannot handle 5-speaker conversations without retraining or switching to an extended variant such as EEND-EDA, which is designed for a variable number of speakers.

EEND's architectural limitation is its fixed speaker count ceiling. Because it jointly models all speaker activity streams, the number of output streams is fixed at training time. This is its key trade-off versus clustering approaches, which can handle arbitrary speaker counts (at the cost of failing to model overlap).

18. Microsoft's VALL-E (2023) demonstrated which capability regarding emotional voice synthesis?

Correct. VALL-E's codec language model architecture enabled emotional style to transfer zero-shot from a speaker prompt to novel text.

Incorrect. VALL-E's breakthrough was zero-shot emotional transfer from a 3-second speaker sample to new text synthesis.

19. Apple Intelligence routes Siri queries between on-device and cloud models. What is the main advantage of the on-device tier?

Correct. On-device models eliminate network latency (typically 100–400 ms) and keep user data on-device. The trade-off is lower capability: a 3B-parameter on-device model cannot match a 70B+ cloud model. Apple's system routes based on query complexity to get the best of both.

The on-device advantage is dual: speed (there is no network round-trip, which would otherwise add meaningful latency in voice applications) and privacy (data doesn't leave the device). The cost is capability, since smaller models handle only simpler tasks. This trade-off is the core of Apple's hybrid routing strategy.

20. California's BOT Disclosure Act (SB 1001, effective 2019) requires:

Correct.

SB 1001 requires that bots interacting with California residents in commercial contexts, including voice AI, disclose that they are automated systems, not humans.
