The Hardware Race — Final Exam

1. What hardware feature in the NVIDIA H100 (Hopper, 2022) specifically optimized it for transformer-architecture language models?

Correct. The Transformer Engine pairs Hopper Tensor Core hardware with software that tracks tensor statistics layer by layer and dynamically selects FP8 or FP16 precision. This specialization for the dominant AI architecture of the era gives the H100 its performance characteristics for LLM workloads.

The Transformer Engine dynamically switches between FP8 and FP16 from layer to layer, specifically optimizing for the attention-based layers that define transformer models. This hardware-architecture co-optimization is a key H100 differentiator.

2. AWS Inferentia2's NeuronCore v2 architecture uses 32 MB of on-chip SRAM per core. What inference workload characteristic does this specifically address?

Correct. On-chip SRAM serves as a managed scratchpad for weight tiles and activations in tight inference loops. By staging frequently-used weights in SRAM, the NeuronCore reduces the volume of HBM traffic per token, improving throughput and energy efficiency.

Incorrect. The on-chip SRAM acts as a scratchpad reducing HBM traffic for frequently accessed weight tiles during inference loops — not KV cache or gradient storage. Reducing this DRAM traffic is the key inference optimization.

3. The October 7, 2022 export control rules' "U.S. person" provision required what of American citizens working in China's advanced chip sector?

Correct. The rule required U.S. persons to get a BIS license to continue work in China's advanced chip sector — effectively forcing many to resign.

Incorrect. The U.S. person rule required obtaining a BIS license (presumptively denied) to continue supporting advanced Chinese chip production — causing many American engineers to resign.

4. What specific event triggered the U.S. October 2022 export control rules that affected AI chip exports to China?

Correct. The October 7, 2022 rules were a proactive policy decision by BIS targeting chips — including the A100 and H100 — above specific compute thresholds.

The October 7, 2022 rules were a U.S. Commerce Department / BIS policy decision — proactively restricting chips above defined performance thresholds, not a response to a specific incident.

5. The Foreign Direct Product Rule extends U.S. export control jurisdiction to foreign-manufactured products under what condition?

Correct. The FDP Rule captures foreign products made using U.S. technology or software, giving U.S. controls extraterritorial reach.

Incorrect. The FDP Rule applies when the foreign product is the direct product of U.S. technology or software — regardless of where it's physically made or the component content ratio.

6. In what year did Google begin running production workloads on TPU v1 hardware inside its data centers?

Correct. Google deployed TPU v1 in its data centers in 2015, running production traffic for Search, Street View, and Photos — over a year before any public announcement.

Not correct. TPU v1 was deployed in Google data centers in 2015, before the May 2016 public mention by Sundar Pichai at Google I/O.

7. When was Huawei added to the BIS Entity List, effectively beginning the campaign to cut it off from advanced chip supply?

Correct. Huawei and 68 affiliates were added to the Entity List on May 15, 2019.

Incorrect. Huawei was added to the Entity List in May 2019. SMIC was added in December 2020. The broader chip rules came in October 2022.

8. Continuous batching (Orca) improves hardware utilization by allowing new requests to join mid-generation. What problem with static batching does this solve?

Correct. Static batching gates new work on the slowest batch member. Since generation lengths vary widely, fast-finishing requests create idle GPU slots. Continuous batching fills those slots immediately, keeping arithmetic intensity — and hardware utilization — high.

Incorrect. Static batching's core problem is the idle-wait: fast requests finish but the batch slot can't be refilled until all batch members complete. Continuous batching allows immediate slot reuse, eliminating this waste.
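
The idle-wait described above can be made concrete with a toy simulation. This is a minimal sketch, not Orca's actual scheduler: it assumes every active request emits exactly one token per decode step, ignores prefill cost and memory limits, and uses hypothetical output lengths. The function names are illustrative only.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until its slowest member completes.
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Refill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

# Hypothetical output lengths (in tokens) for eight requests.
lengths = [2, 10, 3, 9, 1, 8, 2, 10]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With widely varying lengths, the continuous version needs fewer total decode steps because short requests no longer hold a slot hostage until the longest one in their batch finishes.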

9. Nvidia's A800 and H800 chips, designed for the Chinese market after October 2022, were engineered specifically to fall below which threshold?

Correct. Nvidia reduced NVLink interconnect bandwidth in the A800 and H800 to stay below the 600 GB/s chip-to-chip transfer-rate threshold set by the October 2022 rules.

Incorrect. The A800/H800 were specifically engineered to stay below the 600 GB/s interconnect bandwidth threshold, one of the parameters BIS used to define controlled AI training chips.

10. What is the key mechanism by which Groq's LPU achieves high inference throughput?

Correct. The LPU's deterministic compiler-scheduled execution eliminates runtime overhead from caching and scheduling, delivering consistent high-throughput inference.

Groq's LPU achieves throughput through deterministic, statically-scheduled execution — the compiler pre-plans every memory access, eliminating cache hierarchies and dynamic scheduling overhead.

11. How many chips does a Google TPU v4 pod contain?

Correct. TPU v4 pods contain 4,096 chips connected via Google's proprietary optical interconnect.

Incorrect. TPU v4 pods contain 4,096 chips; PaLM 540B was trained on 6,144 TPU v4 chips spread across two such pods.

12. Google's TPU v5e was designed specifically for inference rather than training. Which characteristic reflects this inference optimization?

Correct. Google explicitly traded peak FLOPS for improved cost-per-token on TPU v5e — the right optimization for inference where sustained economic efficiency matters more than peak throughput.

Incorrect. TPU v5e trades some compute peak for lower power and better cost-per-token — the characteristic signature of an inference-optimized chip versus a training-optimized one.

13. SemiAnalysis estimated inference hardware spending would exceed training hardware spending by 2025. Which trend is the primary driver of inference spend growth?

Correct. Training is episodic; inference is perpetual and grows with adoption. Each new model deployment adds a sustained inference load. As the installed base of deployed models grows, aggregate inference demand accumulates and eventually dominates the market.

Incorrect. The driver is accumulated continuous inference demand: each deployed model generates ongoing serving load that compounds as more models deploy and user adoption grows — eventually overwhelming the episodic nature of training runs.

14. The 2017 Google paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" reported what performance advantage for TPU v1 on production inference workloads?

Correct. The Jouppi et al. 2017 paper reported that TPU v1 ran Google's six production inference workloads about 15–30× faster than contemporary Haswell CPUs and NVIDIA K80 GPUs, with roughly 30–80× better performance-per-watt.

Not correct. The reported figure was 15–30× — a result that surprised the academic hardware community and helped drive the broader industry's interest in purpose-built AI inference accelerators.

15. What did AlexNet achieve at the 2012 ImageNet competition?

Correct — AlexNet's margin of victory was what made the result definitive.

AlexNet reduced ImageNet top-5 error from about 26% (the runner-up's result) to 15.3%, a decisive margin that ended skepticism about deep learning.

16. The "memory wall" concept, introduced by Wulf and McKee in 1994, predicted that:

Correct. Wulf and McKee's key argument was the compounding performance gap between CPU speed (~54%/yr) and DRAM speed (~7%/yr), eventually making memory access time dominant.

The memory wall paper argued that DRAM speed improved much more slowly than processor performance, not that cost was the primary issue. The compounding gap meant memory access would eventually dominate total execution time.
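
To see how quickly such a gap compounds, the short sketch below plugs in the two growth rates quoted above (~54%/yr and ~7%/yr). It assumes both rates stay constant and that the two start at parity, which is a simplification for illustration rather than the paper's own model.

```python
import math

CPU_GROWTH = 0.54   # per-year processor improvement quoted above
DRAM_GROWTH = 0.07  # per-year DRAM improvement quoted above

def gap_after(years):
    """Ratio of processor speed to DRAM speed after `years`, starting at 1.0."""
    return ((1 + CPU_GROWTH) / (1 + DRAM_GROWTH)) ** years

def years_until_gap(target):
    """Years of compounding needed for the ratio to reach `target`."""
    yearly_ratio = (1 + CPU_GROWTH) / (1 + DRAM_GROWTH)
    return math.log(target) / math.log(yearly_ratio)

for y in (5, 10, 20):
    print(f"after {y:2d} years: gap = {gap_after(y):.1f}x")
print(f"years to reach a 100x gap: {years_until_gap(100):.1f}")
```

At these rates the ratio grows by about 44% each year, so it is roughly sixfold after five years and passes 100× in under thirteen.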

17. Meta reported that INT8 quantization on Llama 2 70B caused under 1% perplexity increase on MMLU. What is the practical significance of this finding?

Correct. Under 1% perplexity degradation at INT8 means the accuracy cost is acceptable for production serving. Combined with the 2× memory bandwidth efficiency of weights half the width of FP16, this makes INT8 a compelling inference optimization at scale.

Incorrect. Under 1% perplexity increase is well within production tolerance, meaning INT8 is viable for real serving. Relative to FP16 it enables 2× better memory bandwidth utilization (half the bytes to load) and, for bandwidth-bound decoding, proportional throughput gains.
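
The "half the bytes to load" argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes batch-1 decoding that reads every weight once per generated token and ignores KV cache and activation traffic. The 2 TB/s bandwidth figure is a hypothetical placeholder, not a specification of any particular accelerator.

```python
PARAMS = 70e9  # Llama 2 70B parameter count

def weight_bytes(params, bits_per_weight):
    """Total bytes needed to store the weights at a given precision."""
    return params * bits_per_weight / 8

def max_tokens_per_second(params, bits_per_weight, hbm_bytes_per_second):
    """Upper bound on decode rate if every weight is read once per token."""
    return hbm_bytes_per_second / weight_bytes(params, bits_per_weight)

HBM_BW = 2e12  # hypothetical 2 TB/s aggregate memory bandwidth (assumption)

fp16_rate = max_tokens_per_second(PARAMS, 16, HBM_BW)
int8_rate = max_tokens_per_second(PARAMS, 8, HBM_BW)

print(f"FP16 weights: {weight_bytes(PARAMS, 16) / 1e9:.0f} GB, <= {fp16_rate:.1f} tokens/s")
print(f"INT8 weights: {weight_bytes(PARAMS, 8) / 1e9:.0f} GB, <= {int8_rate:.1f} tokens/s")
```

Whatever bandwidth figure is substituted, the ratio between the two bounds stays at 2×, which is the point of the quiz answer.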

18. Which two countries, besides the U.S., agreed in January 2023 to align semiconductor equipment export controls with the U.S. framework?

Correct. Japan (home to Tokyo Electron) and the Netherlands (home to ASML) were the critical allies whose cooperation was needed to make equipment controls effective.

Incorrect. Japan and the Netherlands were the key partners — Tokyo Electron and ASML are the dominant non-U.S. chipmaking equipment suppliers whose cooperation was essential.

19. What fabrication process node was used for both Azure Maia 100 and Azure Cobalt 100?

Correct. Both Maia 100 and Cobalt 100 were fabricated by TSMC on their 5nm process node.

Incorrect. Both Azure Maia 100 and Cobalt 100 were built on TSMC's 5nm process.

20. FlashAttention's approach to reducing HBM bandwidth consumption is best described as:

Correct. FlashAttention uses tiling and kernel fusion: the N×N attention matrix is computed in SRAM-sized blocks, and the softmax normalization is handled with an online algorithm that never writes the full matrix to HBM. The output is identical to standard attention.

FlashAttention is an exact algorithm — not approximate. It works by tiling the computation into SRAM-resident blocks and fusing operations (Q@K^T, softmax, @V) into a single kernel, avoiding HBM writes of the full N×N intermediate matrix. No quantization or prefetch hardware is required.
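
The online-softmax idea can be illustrated in a few lines of plain Python. This is a minimal sketch for a single query vector, not the fused GPU kernel: Python lists stand in for SRAM-resident tiles, the usual 1/sqrt(d) scaling is omitted, and the function names and block size are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """Standard attention for one query: materializes every score first."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attention_tiled(q, keys, values, block=2):
    """Same result, visiting keys/values one block at a time."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new running maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values, block=2))
```

Both functions print the same vector up to floating-point rounding, which is what "exact" means here: the tiled version only ever holds one block of scores at a time, yet never approximates the softmax.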