GPT vs. Claude vs. Gemini

1. GPT-4o was released in May 2024. What does the "o" in its name stand for?

✓ Correct. "Omni" — GPT-4o was OpenAI's first model to natively process text, images, and audio in a single unified model without routing through separate systems.

✗ The "o" stands for omni. GPT-4o was OpenAI's first native omni model, handling text, image, and audio in one set of weights rather than through a pipeline of separate models.

2. What is benchmark gaming via test set contamination, and why is it difficult to prevent in frontier model training?

✓ Correct. Because benchmarks are published on the public internet and training data spans hundreds of billions of tokens scraped from the web, benchmark questions naturally appear in training data. Models may then partially memorize answers rather than generalize the underlying reasoning — which is why performance drops when questions are rephrased.

✗ Contamination occurs because benchmark datasets are published publicly online, and frontier model training data is scraped from the public internet at massive scale. This means benchmark questions naturally appear in training data, and models may memorize answers rather than learn the underlying reasoning — the key evidence being the 10–15 point drop when questions are rephrased.
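
One common way to screen for this kind of contamination is to look for long word n-grams that a benchmark item shares with training text. The sketch below illustrates that idea only; the function names, the 8-word window, and the sample strings are assumptions made for illustration, not details from the lesson.

```python
import re

def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item, training_doc, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

# Hypothetical benchmark question, a scraped page that quotes it, and a rephrasing.
question = "Which planet in the solar system has the largest number of confirmed moons?"
scraped_page = ("Quiz dump: Which planet in the solar system has the largest "
                "number of confirmed moons? Answer: Saturn.")
rephrased = "Among the planets orbiting the Sun, which one has the most confirmed moons?"

print(overlap_ratio(question, scraped_page))   # 1.0, the verbatim question is in the page
print(overlap_ratio(rephrased, scraped_page))  # 0.0, the rephrasing shares no 8-gram
```

The rephrased question asks for the same fact but no longer matches the scraped text, which is why rephrasing is a useful probe for memorization.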

3. What is the primary difference in how Gemini and Claude approached multimodal design?

Correct!

Incorrect. Review the relevant lesson for more information.

4. ChatGPT reached 1 million users in five days after its November 2022 launch. How long did it take to reach 100 million users?

✓ Correct. ChatGPT reached 100 million users in approximately two months — setting a consumer adoption record that beat TikTok's previous benchmark and forced every major technology company to publicly reveal its AI plans.

✗ Incorrect. ChatGPT reached 100 million users in roughly two months. The 1 million milestone came in five days; the two-month climb to 100 million made it the fastest-adopted consumer application in history at the time.

5. What is a limitation of GPT-4o's audio processing?

Correct!

Incorrect. Review the relevant lesson for more information.

6. What is self-attention's key advantage over recurrent neural networks (RNNs) for training large language models?

✓ Correct. Parallel processing is the core advantage: RNNs process tokens one at a time, so their training cannot be parallelized across sequence positions on GPUs.

✗ Self-attention processes all tokens simultaneously, unlike RNNs, which process tokens sequentially. This parallelism is what made training on trillions of tokens feasible.
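
The contrast can be seen in a toy example. The sketch below is a simplification under stated assumptions: it uses a single attention head with identity projections (queries, keys, and values are all the raw inputs) and a made-up recurrence, and it runs in plain Python, so it shows the data dependencies rather than any real speedup.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with identity projections (Q = K = V = x), an assumption."""
    d = len(x[0])
    # Each score depends only on the inputs, never on another position's output,
    # so every row of this matrix could be computed at the same time.
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x]
    weights = [softmax(row) for row in scores]
    return [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)] for row in weights]

def rnn(x, decay=0.5):
    """Toy recurrence: each hidden state needs the previous one, so steps run in order."""
    h = [0.0] * len(x[0])
    states = []
    for token in x:
        h = [math.tanh(decay * hi + ti) for hi, ti in zip(h, token)]
        states.append(h)
    return states

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
print(len(self_attention(tokens)), len(rnn(tokens)))  # 3 3
```

In `rnn`, step t cannot start until step t-1 has finished; in `self_attention`, no output row waits on another.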

7. According to the lesson's guidance on latency benchmarking, what is the recommended practice for measuring latency before a production deployment decision?

✓ Correct. Real production latency must be measured under realistic concurrency (not sequential single requests), and you need both time-to-first-token (critical for streaming interfaces) and total response time, plus P95 tail latency — because all three labs' APIs show significant variance under load that single-request tests do not reveal.

✗ The lesson recommends benchmarking under realistic concurrency, measuring time-to-first-token separately from total response time, and checking P95 tail latency in addition to median. Single sequential requests dramatically underestimate real latency under load, and all three frontier API providers show significant variance that only concurrent testing reveals.
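
A minimal sketch of that measurement pattern follows. It assumes a streaming client, replaced here by a hypothetical `fake_stream` generator with random delays so the example runs without network access; the request count and concurrency level are illustrative values, not recommendations from the lesson.

```python
import asyncio
import random
import statistics
import time

async def fake_stream(prompt):
    """Hypothetical stand-in for a streaming API call; yields tokens after random delays."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # wait before the first token
    for token in ("alpha", "beta", "gamma"):
        yield token
        await asyncio.sleep(random.uniform(0.001, 0.005))

async def timed_request(prompt, limiter):
    """Return (time_to_first_token, total_time) in seconds for one request."""
    async with limiter:  # caps how many requests are in flight at once
        start = time.perf_counter()
        first = None
        async for _ in fake_stream(prompt):
            if first is None:
                first = time.perf_counter() - start
        return first, time.perf_counter() - start

def p95(values):
    """95th percentile via the statistics module's 'inclusive' method."""
    return statistics.quantiles(values, n=100, method="inclusive")[94]

async def benchmark(num_requests=40, concurrency=10):
    """Fire requests concurrently and summarize median and P95 for both metrics."""
    limiter = asyncio.Semaphore(concurrency)
    tasks = [timed_request(f"prompt {i}", limiter) for i in range(num_requests)]
    results = await asyncio.gather(*tasks)
    ttft = [r[0] for r in results]
    total = [r[1] for r in results]
    return {
        "ttft_median": statistics.median(ttft),
        "ttft_p95": p95(ttft),
        "total_median": statistics.median(total),
        "total_p95": p95(total),
    }

if __name__ == "__main__":
    for name, seconds in asyncio.run(benchmark()).items():
        print(f"{name}: {seconds * 1000:.1f} ms")
```

To benchmark a real provider, `fake_stream` would be swapped for that provider's streaming call, with `concurrency` set to the load expected in production.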

8. A developer wants to build a workflow that generates Python code, executes it against sample data, and iterates based on output. Which GPT-4o feature most directly enables this?

✓ Correct. Code Interpreter runs code in a sandboxed Python environment and returns results (including errors, plots, and data outputs) directly into the conversation context. This closed-loop generate → execute → iterate pattern is exactly what Code Interpreter was designed for.

✗ Code Interpreter (now called Advanced Data Analysis in ChatGPT) executes Python in a sandbox and feeds the output back into context — enabling the generate → run → iterate loop. DALL-E is for image generation; context window size affects memory, not execution.
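
The loop itself can be sketched in a few lines. Everything here is an illustrative assumption: `propose_code` is a hypothetical stub standing in for the model, and `run_code` uses an in-process `exec`, which is not a sandbox and is not how Code Interpreter is implemented.

```python
import contextlib
import io

def propose_code(task, feedback):
    """Hypothetical stand-in for a model call: a buggy first draft, then a fix."""
    if feedback is None:
        return "data = [3, 1, 2]\nprint(sum(data) / len(dat))"  # NameError on purpose
    return "data = [3, 1, 2]\nprint(sum(data) / len(data))"

def run_code(code):
    """Execute code and return (succeeded, captured_output_or_error_text).

    In-process exec is for illustration only; real use needs an isolated sandbox.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, buffer.getvalue().strip()

def generate_execute_iterate(task, max_rounds=3):
    """Generate code, run it, and feed any error back until it succeeds."""
    feedback = None
    for round_number in range(1, max_rounds + 1):
        code = propose_code(task, feedback)
        ok, output = run_code(code)
        if ok:
            return round_number, output
        feedback = output  # the error text goes into the next generation step
    raise RuntimeError("no working code within the round limit")

print(generate_execute_iterate("compute the mean of the sample data"))  # (2, '2.0')
```

The first draft fails with a `NameError`, the error text is passed back, and the second draft succeeds, which is the generate, execute, iterate pattern in miniature.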

9. OpenAI's Batch API, launched April 2024, offers what primary benefit for asynchronous workloads?

✓ Correct. The Batch API provides a 50% price reduction on all models for requests submitted as asynchronous batch jobs completing within 24 hours. Quality is identical to synchronous API calls — only the response time changes, making it ideal for classification, enrichment, and generation pipelines with no real-time requirement.

✗ The Batch API's primary benefit is a 50% cost discount. It trades real-time response for this discount — jobs complete within 24 hours rather than immediately. Context windows and model quality are unchanged; it's purely a cost optimization for workloads that don't require instant responses.
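
As a worked example of the discount, the sketch below compares synchronous and batch cost for a classification job. The request volume, token count, and per-token price are hypothetical, and the single blended price is a simplification, since real pricing distinguishes input from output tokens.

```python
def batch_savings(num_requests, tokens_per_request, price_per_million_tokens, discount=0.5):
    """Return (synchronous_cost, batch_cost) in the same currency as the price."""
    total_tokens = num_requests * tokens_per_request
    sync_cost = total_tokens / 1_000_000 * price_per_million_tokens
    return sync_cost, sync_cost * (1 - discount)

# Hypothetical job: 200,000 requests of 1,500 tokens each at $5.00 per million tokens.
sync_cost, batch_cost = batch_savings(200_000, 1_500, price_per_million_tokens=5.00)
print(f"synchronous: ${sync_cost:,.2f}  batch: ${batch_cost:,.2f}")
# synchronous: $1,500.00  batch: $750.00
```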

10. When applying the five-axis decision framework to a new deployment, which statement best reflects how the axes should be weighted?

✓ Correct. The module's core lesson is that one axis usually dominates: context length dominated the legal document case, ecosystem dominated the SaaS chatbot case, recency dominated the research assistant case, and cost dominated the classification pipeline. The skill is quickly identifying which axis is the binding constraint for your specific task, then using the others as secondary filters.

✗ The framework's practical value comes from identifying which single axis is the binding constraint for your task. Equal weighting leads to paralysis; always-cost or always-safety rules lead to wrong answers in contexts where another axis dominates. The case studies consistently showed one axis driving the decision while the others served as filters.

11. Anthropic was co-founded by Dario and Daniela Amodei along with approximately how many colleagues who left OpenAI with them?

✓ Correct. The Amodei siblings left OpenAI in 2021 with nine colleagues, citing safety concerns about the pace of capability development — a total founding team of eleven people.

✗ Incorrect. Dario and Daniela Amodei left with nine colleagues, forming a founding team of eleven. The departure was specifically motivated by concerns that OpenAI was prioritizing capability over safety research.

12. What does "vision capability" for GPT-4o primarily depend on?

Correct!

Incorrect. Review the relevant lesson for more information.

13. In what month and year did ChatGPT launch publicly, and how quickly did it reach one million users?

✓ Correct. ChatGPT launched in November 2022 and reached one million users in approximately five days — then 100 million users in roughly two months, the fastest consumer-app adoption ever recorded at that time.

✗ ChatGPT launched in November 2022 and hit one million users in about five days. It then reached 100 million users in approximately two months, setting a record for consumer app growth.

14. All three major model families (OpenAI, Anthropic, Google) now support function calling / tool use. What distinguishes OpenAI's position in this space?

✓ Correct. OpenAI introduced the plugin ecosystem in 2023 and the function-calling API shortly after, giving third-party developers a head start building connectors. By mid-2024, the breadth of pre-built integrations (Zapier, Zendesk, GitHub, etc.) for GPT-4o significantly exceeds what's available for Claude and Gemini.

✗ All three providers support function calling. OpenAI's advantage is ecosystem maturity — they were first, which means more third-party developers have built and published integrations. When a team needs to wire up an existing enterprise tool, the GPT-4o connector often already exists.

15. When Google announced Gemini Ultra achieved 90.0% on MMLU — surpassing the "human expert" baseline of 89.8% — what methodological issue did critics identify?

✓ Correct. Gemini used 32-shot chain-of-thought prompting while most prior model comparisons on MMLU used 5-shot prompting. OpenAI noted that GPT-4, evaluated under the same 32-shot CoT conditions, also exceeded 90% — a fact that received far less coverage than the initial announcement.

✗ The key methodological gap was prompting format: Gemini used 32-shot chain-of-thought prompting vs. the standard 5-shot format used in prior comparisons. Under identical 32-shot CoT conditions, GPT-4 also exceeded 90% on MMLU — which significantly undermined the "first to surpass human expert performance" framing.
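
To make the shot count concrete, the sketch below builds a k-shot prompt from worked examples. The `Q:`/`A:` template and the sample items are assumptions for illustration, not the format used in the MMLU evaluations discussed here; a chain-of-thought variant would also include the reasoning in each example answer.

```python
def build_k_shot_prompt(examples, question, k):
    """Prefix the question with the first k worked examples."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples[:k]]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])

# Hypothetical worked examples; a real evaluation would draw these from a held-out split.
examples = [(f"What is {i} + {i}?", str(2 * i)) for i in range(1, 33)]

five_shot = build_k_shot_prompt(examples, "What is 40 + 40?", k=5)
thirty_two_shot = build_k_shot_prompt(examples, "What is 40 + 40?", k=32)
print(five_shot.count("Q:"), thirty_two_shot.count("Q:"))  # 6 33
```

The same model sees very different prompts in the two settings, so scores obtained with different values of k are not directly comparable.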

16. A real-time customer-facing chatbot must respond within 400ms at the 95th percentile. Which model tier is most appropriate?

✓ Correct. Flash and Haiku are designed for latency-sensitive workloads. Their ~300ms median latency leaves headroom under a 400ms SLA, though the 95th percentile still has to be confirmed under realistic load. Opus and full GPT-4o are significantly slower and would frequently miss the 95th-percentile threshold.

✗ Latency-sensitive applications require the Flash/Haiku tier. These models are optimized for speed (~300ms median) at the cost of some reasoning depth. Opus and standard GPT-4o are much slower and would breach the 400ms SLA at scale.

17. Gemini 1.5 Pro launched in February 2024 with a capability that represented a genuine category expansion. What was it?

✓ Correct. Gemini 1.5 Pro's one-million-token context window was not an incremental update — it enabled qualitatively new use cases like uploading an entire large codebase or a feature-length film for analysis in a single prompt.

✗ Gemini 1.5 Pro's defining innovation was its one-million-token context window, enabling analysis of entire codebases or feature-length videos in a single prompt — a qualitative capability leap, not just a bigger number.

18. What is the key difference between RLAIF (used in CAI) and standard RLHF?

✓ Correct. RLAIF replaces human raters with an AI feedback model guided by written principles (the constitution). This is why CAI required roughly 90% fewer human preference labels on the harmlessness dimension compared to standard RLHF, as documented in the 2022 CAI paper.

✗ The core distinction is the feedback source: RLHF uses human raters comparing output pairs; RLAIF uses an AI model guided by written principles to score outputs. CAI uses RLAIF, dramatically reducing human labeling requirements.

19. What is the training objective during GPT's pre-training phase?

✓ Correct. Next-token prediction on a massive internet corpus is the self-supervised objective that causes the model to implicitly learn language, facts, and reasoning.

✗ Pre-training uses next-token prediction — a self-supervised objective where the model creates its own training signal from text structure, requiring no external labels.
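
The way text supplies its own labels can be shown directly. In this sketch, whitespace splitting stands in for a real tokenizer and the `context_size` of 3 is an arbitrary illustrative value; both are assumptions.

```python
def next_token_pairs(tokens, context_size):
    """Turn a token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))  # the target is simply the next token
    return pairs

tokens = "the cat sat on the mat".split()
for context, target in next_token_pairs(tokens, context_size=3):
    print(context, "->", target)
```

Every target comes from the text itself, which is why no external labels are needed.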

20. On the MMLU benchmark, what score did Gemini Ultra 1.0 achieve — the first model reported to surpass human-expert-level performance?

✓ Correct. Gemini Ultra 1.0 scored 90.0% on MMLU, compared to GPT-4's 86.4%, making it the first model reported to exceed human-expert-level performance on that benchmark.

✗ Gemini Ultra scored 90.0% on MMLU. GPT-4 scored 86.4%. The human expert threshold is approximately 89.8%.

Final Exam