Module 7 · Lesson 1

Why AI-Generated Code Needs Special Test Strategies

The nature of LLM outputs creates failure modes that traditional testing instincts miss.
What makes testing AI-generated code different from testing code you wrote yourself?

In February 2023, Amazon's internal CodeWhisperer rollout surfaced a pattern that engineers initially found counterintuitive: the AI produced code that passed every existing unit test perfectly, yet failed in production within hours. The tests themselves had been generated by the same model that wrote the code — each optimised to match the other, creating a closed loop of plausible-looking but untested assumptions.

The Closed-Loop Problem

When a human writes code, they carry implicit mental models of edge cases, business rules, and failure modes. Tests written by the same human, even imperfectly, tend to probe those same mental models. When an LLM writes both code and tests from the same prompt, neither artefact independently validates the other — they share the same blind spots.

This is not a hypothetical. A 2023 Stanford/DeepMind study on GitHub Copilot output found that LLM-generated unit tests covered happy paths at roughly the same rate as human-authored tests, but covered boundary and error conditions at 40% lower rates when the test suite itself was generated by the model.

Real Incident — Cursor AI, March 2024

A widely cited GitHub thread documented a developer who used Cursor to generate both an authentication middleware and its tests. All 14 tests passed. A manual security review found the middleware silently accepted JWTs whose header declared alg: none, the unsigned-token attack in which signature verification is skipped entirely. The generated tests never exercised that path, plausibly because nothing in the prompt or in the generated code drew attention to it.
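
The missing negative test is short to write. The sketch below is not the middleware from the thread; it is a minimal HS256 verifier built from the standard library, with a hypothetical SECRET and hypothetical helper names (make_token, verify_token), so that the alg: none check has something to run against.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"test-secret"  # hypothetical key, for illustration only


def b64url(data: bytes) -> str:
    """URL-safe base64 without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode URL-safe base64, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_token(header: dict, payload: dict, secret: bytes = b"") -> str:
    """Build a compact token; it is signed with HS256 only if a secret is given."""
    head = b64url(json.dumps(header).encode())
    body = b64url(json.dumps(payload).encode())
    signature = b""
    if secret:
        signature = hmac.new(secret, f"{head}.{body}".encode(), hashlib.sha256).digest()
    return f"{head}.{body}.{b64url(signature)}"


def verify_token(token: str, secret: bytes) -> dict:
    """Return the payload, or raise ValueError for anything but a valid HS256 token."""
    head, body, signature = token.split(".")
    header = json.loads(b64url_decode(head))
    # Pin the algorithm on the server side; never trust the header's choice.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = hmac.new(secret, f"{head}.{body}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(signature)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(body))


def test_rejects_alg_none():
    """The negative test the generated suite lacked."""
    forged = make_token({"alg": "none", "typ": "JWT"}, {"sub": "admin"})
    try:
        verify_token(forged, SECRET)
    except ValueError:
        return
    raise AssertionError("alg: none token was accepted")
```

The design point is that the verifier fixes the accepted algorithm itself, and the test feeds it a token that a happy-path suite would never construct.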

Three Structural Differences

Human-authored code

Mental model exists before code is written. Tests probe known risks. Bugs cluster around deliberate shortcuts or complexity.

AI-generated code

Code is statistically plausible, not logically derived. Tests share the model's blind spots. Bugs cluster around edge cases the training corpus under-represented.

The second structural difference is confident wrongness. LLMs produce syntactically correct, stylistically clean code even when the logic is subtly wrong. Human developers usually signal uncertainty through comments, TODOs, or asking for review. Models produce the same confident output whether the solution is correct or not.

The third difference is version-surface mismatch. Models are trained on code from across many library versions. Generated code may call APIs that existed in version 2.x but were removed in 3.x, passing tests locally if the developer happens to have an older version installed.

Documented Example — Pillow Library, 2022–2023

Multiple GitHub issues in 2022–2023 reported that Copilot generated Pillow image-manipulation code calling Image.ANTIALIAS, a constant removed in Pillow 10.0.0 (released July 2023). Tests written against environments with Pillow 9.x passed; CI environments using 10.x failed immediately. The pattern appeared in hundreds of public repositories.
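
One inexpensive guard against this kind of drift is a test that compares what is installed with what is pinned. The sketch below assumes a requirements file that uses exact name==version pins; parse_pins and find_mismatches are hypothetical helper names, and environment markers or version ranges are deliberately not handled.

```python
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Extract exact `name==version` pins, ignoring comments and blank lines."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: dict, installed_version=metadata.version) -> dict:
    """Return {package: (pinned, installed)} for every pin that is not met.

    `installed_version` is injectable so the check can be tested without
    touching the real environment.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # pinned but not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

In a test suite this might be used as `assert find_mismatches(parse_pins(Path("requirements.txt").read_text())) == {}`, so that a developer machine still running an older library fails loudly instead of passing tests that CI will reject.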

What You Must Test That You Normally Skip

  • Null and empty inputs — models often omit None-checks because most training examples work with valid data.
  • Off-by-one boundaries — LLMs frequently produce fencepost errors in loop bounds and slice indices.
  • Error propagation — generated exception handlers often silently swallow errors rather than re-raising or logging.
  • Library version assumptions — any imported dependency should be tested against the pinned version in your requirements file.
  • Security inputs — SQL injection, path traversal, and oversized payloads are systematically under-tested by AI-generated suites.
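
As an illustration of the first two items, the sketch below uses a hypothetical helper, last_n, together with the plausible one-liner a model might produce for it. The happy-path call agrees for both versions; only the boundary and null cases separate them.

```python
def naive_last_n(items, n):
    """The plausible one-liner that a happy-path test would accept."""
    return items[-n:]


def last_n(items, n):
    """Return the final n items of a sequence (hypothetical helper)."""
    if items is None:
        raise TypeError("items must be a sequence, not None")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return []  # items[-0:] is items[0:], i.e. the whole list
    return list(items[-n:])


def test_boundaries():
    """The cases an example-driven suite tends to skip."""
    assert last_n([1, 2, 3], 2) == [2, 3]   # happy path
    assert last_n([1, 2, 3], 0) == []       # zero boundary
    assert last_n([1, 2, 3], 3) == [1, 2, 3]  # exact length
    assert last_n([1, 2, 3], 5) == [1, 2, 3]  # longer than the input
    assert last_n([], 3) == []              # empty input
    for bad_call in (lambda: last_n(None, 1), lambda: last_n([1], -1)):
        try:
            bad_call()
        except (TypeError, ValueError):
            continue
        raise AssertionError("invalid input was accepted")
```

Running the same zero-boundary assertion against naive_last_n fails, because `naive_last_n([1, 2, 3], 0)` returns the whole list.
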
Closed-loop testing: When the same system (or same model prompt) produces both the code under test and the test suite, so shared blind spots go undetected.
Version-surface mismatch: Generated code referencing API signatures or constants from library versions different from those installed in the deployment environment.
Confident wrongness: Syntactically and stylistically correct code that contains logical errors, produced without any signal that the output is uncertain.

Lesson 1 Quiz

Why AI-Generated Code Needs Special Test Strategies
1. What is the primary danger of using the same AI model to generate both code and its tests from the same prompt?
Correct. The 2023 Stanford/DeepMind study confirmed that AI-generated test suites cover error conditions at 40% lower rates precisely because they share blind spots with the code they test.
Not quite. The key problem is shared blind spots — both artefacts are optimised to match each other rather than independently validate correctness.
2. Which vulnerability did Cursor-generated authentication middleware fail to block, as documented in a 2024 GitHub thread?
Correct. The middleware silently accepted JWTs with the null algorithm, bypassing signature verification entirely. All 14 generated tests passed because none exercised this path.
The documented incident involved JWT null-algorithm (alg: none) attacks — a well-known vulnerability that the model's generated tests simply never exercised.
3. The Pillow ANTIALIAS incident illustrates which structural testing risk in AI-generated code?
Correct. Copilot generated code using Image.ANTIALIAS, which was removed in Pillow 10.0.0. The model, trained on older examples, emitted a constant that no longer existed.
This is a version-surface mismatch — the model's training data included the older API constant, but the deployment environment used a newer library version that removed it.
4. According to the lesson, which category of test inputs is most systematically under-represented in AI-generated test suites?
Correct. The lesson lists security inputs as systematically under-tested, along with null inputs, off-by-one boundaries, and error propagation paths.
Happy-path coverage is actually where AI-generated tests perform comparably to human tests. The 40% gap cited in the lesson is in boundary and error conditions, and the lesson singles out security inputs as systematically under-tested.

Lab 1 — Identifying Blind Spots

Practice recognising what AI-generated test suites systematically miss.

Your Task

Below is a Python function and its AI-generated test suite. Analyse the tests, identify what is missing, and discuss your findings with the AI assistant. What boundary cases, security inputs, and error conditions should have been tested but were not?

# AI-generated function
def get_user_record(db, user_id):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    result = db.execute(query)
    return result.fetchone()


# AI-generated tests
def test_get_user_valid():
    db = MockDB(users=[{'id': 1, 'name': 'Alice'}])
    result = get_user_record(db, 1)
    assert result['name'] == 'Alice'


def test_get_user_not_found():
    db = MockDB(users=[])
    result = get_user_record(db, 99)
    assert result is None

Start by telling the assistant what critical test cases you think are missing. Then ask follow-up questions about how to structure them properly. Complete at least 3 exchanges to finish the lab.
AI Lab Assistant Testing Blind Spots
I'm looking at this function and test suite with you. What's the first gap you notice? Think about what kinds of inputs a real user — or an attacker — might pass for user_id that these two tests never exercise.
Module 7 · Lesson 2

Property-Based and Mutation Testing

Two techniques that systematically find what example-based tests miss in AI outputs.
How do you test code when you don't know in advance what inputs it will receive?

When Stripe's engineering team published their internal evaluation of Copilot in late 2022, one finding stood out: the AI reliably produced plausible-looking parsing and serialization functions that failed on inputs outside the narrow range exemplified in the prompt. Their mitigation — which they published as a public engineering blog post in January 2023 — was to require property-based tests for all AI-generated parsing code before merge.

Property-Based Testing

Traditional example-based tests check that add(2, 3) == 5. Property-based tests check that for any two integers a and b, add(a, b) == add(b, a). The testing framework (Hypothesis in Python, fast-check in JavaScript, QuickCheck in Haskell) generates hundreds of random inputs automatically, including the edge cases your examples never thought to include.
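
To make the mechanics concrete, here is a hand-rolled sketch of the idea using only the standard library. It is not how Hypothesis works internally (there is no shrinking and no strategy system), and check_property and int_pair are hypothetical names. It only shows the loop of generating inputs, checking the invariant, and reporting the first counterexample.

```python
import random


def add(a, b):
    """Stand-in for the function under test."""
    return a + b


def int_pair(rng):
    """Generate two integers, biased towards boundary values."""
    edges = [0, 1, -1, 2**31 - 1, -(2**31)]

    def pick():
        if rng.random() < 0.3:
            return rng.choice(edges)
        return rng.randint(-10**6, 10**6)

    return pick(), pick()


def check_property(prop, generate, runs=200, seed=0):
    """Call prop on `runs` generated inputs.

    Returns the first input tuple that violates the property, or None if
    every generated input satisfied it.
    """
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(runs):
        args = generate(rng)
        if not prop(*args):
            return args
    return None


def test_add_is_commutative():
    counterexample = check_property(lambda a, b: add(a, b) == add(b, a), int_pair)
    assert counterexample is None, f"property failed for {counterexample}"
```

A real framework adds a great deal on top of this loop, most usefully the reduction of a failing input to a minimal counterexample.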

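This is particularly powerful for AI-generated code because it helps escape the closed-loop problem. The test inputs come from a random generator rather than from the model, so the model's training distribution has no influence on which inputs get probed. The property itself still has to be chosen with care, ideally by a person working from the specification rather than from the generated code.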

# Example: Hypothesis property test for an AI-generated JSON parser
from hypothesis import given, strategies as st

# parse_user_json is the AI-generated function under test; import it from
# wherever it lives in your project.


@given(st.text())
def test_parse_never_crashes(s):
    # The parser may reject input with ValueError, but must never raise
    # anything else.
    try:
        parse_user_json(s)
    except ValueError:
        pass  # expected for malformed input
    # Any other exception propagates and fails the test.

Real Adoption — Anthropic Internal Tooling, 2023

Anthropic's 2023 model-evaluation documentation (released as part of their responsible scaling policy) noted that property-based testing was used to evaluate Claude's own code-generation outputs during red-teaming, specifically because it was the only technique that reliably found edge-case failures that curated example tests missed.

Mutation Testing

Mutation testing answers a different question: are your tests actually capable of detecting bugs? The framework (mutmut for Python, Stryker for JavaScript/TypeScript) makes small changes to the code under test — flipping a > to >=, deleting a return statement, negating a condition — and checks whether any test fails. If no test fails on a mutation, your test suite has a gap.

For AI-generated code this is especially valuable. The model frequently produces correct-looking but subtly wrong comparisons. Mutation testing forces you to have at least one test that would distinguish x > 0 from x >= 0.
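
The x > 0 versus x >= 0 case can be shown by hand, without a mutation framework. In this sketch the mutant is written out manually and each suite is a function that returns True when all of its checks pass; the names are illustrative only.

```python
def is_positive(x):
    """Original implementation."""
    return x > 0


def is_positive_mutant(x):
    """The mutation a tool would generate: > became >=."""
    return x >= 0


def example_suite(fn):
    """Typical generated tests: one clearly positive and one clearly negative value."""
    return fn(5) is True and fn(-5) is False


def boundary_suite(fn):
    """The same tests plus the one input on which the two versions differ."""
    return example_suite(fn) and fn(0) is False
```

Both suites pass against is_positive. The mutant also passes example_suite, so it survives; boundary_suite fails against it, so the mutant is killed. The surviving mutant is the signal that the first suite cannot tell the two comparisons apart.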

Documented Gap — GitHub Copilot Off-by-One Study, 2023

A peer-reviewed study published at FSE 2023 ("An Empirical Study of Deep Learning Models for Bug Detection") found that AI-generated functions contained off-by-one errors in loop bounds at 2.3× the rate of human-authored equivalents. Mutation testing that flipped < to <= in loop conditions caught 91% of these in the study's test corpus.

Combining Both Approaches

Property-based testing finds inputs that break invariants. Mutation testing finds tests that wouldn't catch breaks if they existed. Together they provide orthogonal coverage: one probes the input space, the other probes the test suite's detection power.

The practical workflow for AI-generated code is: (1) run mutation testing on any existing AI-generated tests to identify which are vacuous; (2) add property-based tests targeting the invariants the code claims to uphold; (3) re-run mutation testing to confirm the new tests improve kill rate.
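
The three steps can be walked through on a toy example. The sketch below assumes a hypothetical in_range function and three hand-written mutants standing in for what a tool such as mutmut would generate; the "property" is checked exhaustively over a small integer domain rather than with random inputs, to keep the example dependency-free.

```python
def in_range(x, low, high):
    """Original: membership in the half-open interval [low, high)."""
    return low <= x < high


# Hand-written stand-ins for tool-generated mutants.
MUTANTS = [
    lambda x, low, high: low < x < high,         # <= became <
    lambda x, low, high: low <= x <= high,       # < became <=
    lambda x, low, high: not (low <= x < high),  # condition negated
]


def example_suite(fn):
    """Step 1 input: the existing example-based tests."""
    return fn(5, 0, 10) is True and fn(20, 0, 10) is False


def property_suite(fn):
    """Step 2: for every integer near the interval, fn must agree with
    membership in range(low, high), an independent formulation."""
    return all(fn(x, 0, 10) == (x in range(0, 10)) for x in range(-3, 14))


def kill_rate(suite, mutants):
    """Fraction of mutants for which the suite fails (that is, detects them)."""
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)
```

Here `kill_rate(example_suite, MUTANTS)` is one in three, because both boundary mutants survive, while `kill_rate(property_suite, MUTANTS)` is 1.0, which is the improvement that step 3 is meant to confirm.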

Property-based testing: Generating hundreds of random inputs to verify that a logical invariant holds universally, rather than checking specific examples.
Mutation testing: Automatically introducing small bugs (mutations) into code to verify that at least one test fails, which measures the test suite's detection power.
Kill rate: The percentage of mutations that at least one test detects. A high kill rate indicates a test suite that genuinely validates correctness.

Lesson 2 Quiz

Property-Based and Mutation Testing
1. Why does property-based testing escape the closed-loop problem that affects AI-generated test suites?
Correct. The random input generator is independent of the model's training data, so blind spots in the model's knowledge don't translate into gaps in the test inputs.
The key insight is independence of inputs. Since a random generator, not the AI model, produces inputs, the model's training distribution has no influence on what edge cases get exercised.
2. What specific question does mutation testing answer?
Correct. Mutation testing measures detection power — whether your test suite would actually fail if the code contained a bug. High mutation kill rate means strong detection capability.
Mutation testing is about test quality, not code coverage or input coverage. It asks: if I introduce a small bug, does at least one test fail?
3. According to the FSE 2023 study cited in the lesson, at what rate do AI-generated functions contain off-by-one errors in loop bounds compared to human-authored code?
Correct. The FSE 2023 study found AI-generated functions contained off-by-one errors in loop bounds at 2.3× the rate of human code, with mutation testing (flipping < to <=) catching 91% of these.
The FSE 2023 study found a 2.3× higher rate. This is why mutation testing on loop conditions is particularly valuable when reviewing AI-generated code.
4. In the recommended workflow for AI-generated code, what is the correct order of operations?
Correct. First identify which existing tests are vacuous via mutation testing, then add property-based tests targeting invariants, then confirm the new tests improve kill rate.
The lesson recommends: (1) mutation testing to identify vacuous tests, (2) add property-based tests, (3) re-run mutation testing to confirm improvement — in that sequence.

Lab 2 — Designing Property-Based Tests

Practice identifying testable invariants in AI-generated functions.

Your Task

The AI generated this sorting utility. The only test it included was assert sort_records([3,1,2]) == [1,2,3]. Work with the assistant to identify at least three mathematical invariants that should hold for any valid input, and draft Hypothesis-style property tests for each.

def sort_records(records, key='id', reverse=False):
    # AI-generated: sorts a list of dicts or ints by key
    if not records:
        return []
    if isinstance(records[0], dict):
        return sorted(records, key=lambda x: x[key], reverse=reverse)
    return sorted(records, reverse=reverse)

Think about what must always be true after sorting: length, element membership, ordering relationships. Describe a property in plain English, then we'll turn it into a test together. Aim for at least 3 exchanges.
AI Lab Assistant Property-Based Testing
Let's find the invariants. Start simple: what must always be true about the length of the output compared to the input? Once we have that, we'll work toward more interesting properties.
Module 7 · Lesson 3

Integration and Contract Testing for AI Code

Unit tests pass but the system breaks — why AI-generated code needs boundary contracts enforced at the integration level.
If every unit test passes, why might the system still fail when AI-generated components are assembled together?

In June 2023, Shopify's developer relations team posted a retrospective on their engineering blog about a production incident involving AI-assisted API integration code. Three microservices had been partially written with Copilot assistance. Unit tests for each service passed completely. Yet when assembled, the services silently disagreed on whether a nullable customer_id field should be represented as null, 0, or an absent key — a contract ambiguity the model had resolved differently in each context without flagging the inconsistency.

Why Unit Tests Are Not Enough

AI models generate code from local context. When writing a function that calls an API, the model infers the shape of the API response from the prompt and from its training data. When writing the service that provides that API, it may infer a slightly different shape. Unit tests for each component use mocks or fixtures that each developer wrote — and those fixtures silently encode the inconsistency.

The Shopify incident is representative of a broader pattern. A 2023 survey by Harness of 500 engineering teams using AI coding assistants found that 38% had experienced integration-layer failures attributable to AI-generated components that individually passed all tests. Contract testing was identified as the most effective mitigation in 71% of those cases.

Consumer-Driven Contract Testing

Consumer-driven contract testing (popularised by the Pact framework) requires each consumer of an API to publish a formal contract describing exactly what response shapes it expects. The provider then verifies its real responses against those contracts as part of its own build. If the AI-generated provider returns null where the contract says the field should be absent, the contract test fails, even if all unit tests pass.
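
The core check is small enough to sketch without any framework. The code below is not the Pact API; it is a hand-rolled illustration in which the contract is a plain dictionary, ABSENT is a hypothetical sentinel meaning "this key must not appear", and the field names are invented for the example.

```python
ABSENT = object()  # sentinel: the key must not be present at all

# What a hypothetical consumer expects for a guest checkout response.
CONSUMER_CONTRACT = {
    "cart_id": int,
    "status": str,
    "customer_id": ABSENT,  # guests omit the key; null and 0 are both wrong
}


def contract_violations(response: dict, contract: dict) -> list:
    """Return human-readable violations; an empty list means the contract holds."""
    problems = []
    for field, expected in contract.items():
        if expected is ABSENT:
            if field in response:
                problems.append(f"{field}: expected absent, got {response[field]!r}")
        elif field not in response:
            problems.append(f"{field}: missing")
        elif not isinstance(response[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return problems
```

A provider build would run this against real handler output for each published contract. A response of `{"cart_id": 7, "status": "open", "customer_id": None}` yields one violation, even though a unit test with a local mock would never notice the difference between null and absent.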

Without contract tests

Each service's unit tests use hand-written mocks. Discrepancies between what a provider returns and what a consumer expects only surface in staging or production.

With contract tests

The consumer publishes expectations. The provider verifies them on every build. AI-generated field-type mismatches, nullable inconsistencies, and removed fields are caught at CI time.

Real Adoption — Amazon AWS, 2023

AWS's 2023 re:Invent session "Responsible AI in CI/CD Pipelines" (session DEV308) documented that teams using CodeWhisperer were required to run Pact contract tests on any AI-generated API integration code before merging. The mandate reduced integration-layer incidents in the documented cohort by 61% over six months.

Schema Validation as a Test Layer

For REST APIs, JSON Schema validation is a lightweight contract test that AI-generated code frequently fails. The model may generate code that returns an object when the schema says it should return an array of objects, or include extra fields that a strict schema would reject.
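
Both failure shapes named above, an object where an array is required and extra fields under a strict schema, can be illustrated with a hand-written check. This is a stand-in for a real JSON Schema validator, and ALLOWED_FIELDS is a hypothetical item schema.

```python
ALLOWED_FIELDS = {"id": int, "name": str}  # hypothetical strict item schema


def validate_items(payload):
    """Return a list of error strings; an empty list means the payload conforms."""
    if not isinstance(payload, list):
        return [f"expected array, got {type(payload).__name__}"]
    errors = []
    for index, item in enumerate(payload):
        if not isinstance(item, dict):
            errors.append(f"[{index}]: expected object, got {type(item).__name__}")
            continue
        # Strict mode: any key outside the schema is an error.
        for key in sorted(item.keys() - ALLOWED_FIELDS.keys()):
            errors.append(f"[{index}]: unexpected field {key!r}")
        for key, expected in ALLOWED_FIELDS.items():
            if not isinstance(item.get(key), expected):
                errors.append(f"[{index}].{key}: expected {expected.__name__}")
    return errors
```

Running the validator before the ordinary assertions means a structurally wrong response fails with a message about its shape, rather than with a confusing KeyError further down the test.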

For database interactions, AI-generated ORM code often makes silent assumptions about nullable columns that differ from the actual schema. Running generated code against a test database with the real schema (not mocks) catches these immediately.
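
A round-trip test against a real schema needs very little setup with an in-memory SQLite database. The table, columns, and helper functions below are hypothetical; the point is that the NOT NULL constraint is enforced by the database itself, which a mock would not do.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL,
    phone TEXT            -- nullable by design
)
"""


def save_customer(conn, email, phone=None):
    """Stand-in for generated persistence code."""
    cursor = conn.execute(
        "INSERT INTO customers (email, phone) VALUES (?, ?)", (email, phone)
    )
    return cursor.lastrowid


def load_customer(conn, customer_id):
    """Read the row back as an (email, phone) tuple, or None if it is missing."""
    return conn.execute(
        "SELECT email, phone FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()


def test_round_trip_respects_schema():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # A nullable column round-trips as None.
    customer_id = save_customer(conn, "a@example.com")
    assert load_customer(conn, customer_id) == ("a@example.com", None)
    # A required column rejects None at the database layer.
    try:
        save_customer(conn, None)
    except sqlite3.IntegrityError:
        pass
    else:
        raise AssertionError("NULL email was accepted")
    conn.close()
```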

Documented Pattern — OpenAI Evals, 2023

OpenAI's public evals framework (released March 2023) includes a class of evaluations specifically testing whether GPT-4-generated code correctly handles optional/nullable fields in JSON APIs. Across 1,200 evaluated prompts, the model generated incorrect nullability handling in 29% of cases — nearly always silent (no runtime error, just wrong data).

Practical Integration Test Checklist for AI Code

  • Schema test: Validate every response against a JSON Schema or Pydantic model before your test assertions.
  • Null contract test: Explicitly test that optional fields are absent (not null, not zero) when not provided, matching the contract.
  • Database round-trip test: Insert an AI-generated record and read it back through the real schema, not mocks.
  • Consumer contract test: If multiple services interact, publish and verify Pact contracts for every AI-generated interface.
  • Backwards-compatibility test: Run the new AI-generated component against the previous version's contracts to catch regressions.
Consumer-driven contract: A formal specification published by a service consumer describing the exact response shapes it expects, which the provider must verify in CI.
Nullability contract: An explicit agreement about whether optional fields are represented as absent, null, or a default value. AI-generated services frequently disagree on this.

Lesson 3 Quiz

Integration and Contract Testing for AI Code
1. In the Shopify incident, what caused the production failure despite all unit tests passing?
Correct. The model inferred the nullability of customer_id differently when writing each service — null, 0, or absent — because it only had local context and no shared contract.
The failure was a contract inconsistency: the same nullable field was represented as null in one service, 0 in another, and absent in the third. Unit tests using local mocks masked this completely.
2. According to the 2023 Harness survey, what percentage of engineering teams using AI coding assistants experienced integration-layer failures from AI-generated components that individually passed all tests?
Correct. 38% of 500 surveyed teams experienced integration-layer failures. Contract testing was identified as the most effective mitigation in 71% of those cases.
The Harness survey found 38% of teams experienced this problem — a substantial minority, reinforcing the need for integration-layer testing beyond unit tests.
3. What did AWS's re:Invent 2023 session document as the outcome of requiring Pact contract tests on AI-generated integration code?
Correct. The AWS re:Invent DEV308 session documented a 61% reduction in integration-layer incidents in teams that adopted Pact contract testing for AI-generated API code.
The documented outcome was a 61% reduction in integration-layer incidents over six months — a significant but not total improvement, consistent with contract tests catching structural issues while other bugs remain.
4. According to OpenAI's public evals framework data, how frequently did GPT-4-generated code incorrectly handle optional/nullable fields in JSON APIs?
Correct. 29% of 1,200 evaluated prompts resulted in incorrect nullability handling — nearly always silent, producing wrong data without a runtime error.
OpenAI's evals found incorrect nullability handling in 29% of cases. The silent nature of these failures — no runtime error, just wrong data — makes schema and contract testing essential.

Lab 3 — Writing Contract Tests

Practice catching AI-generated interface inconsistencies before integration.

Your Task

Two AI-generated services interact via a shared /orders/{id} endpoint. The consumer service expects the response; the provider service generates it. Analyse the inconsistencies below and work with the assistant to write a Pact-style consumer contract that would catch them at CI time.

# Consumer service expects (from its mock):
{
  "order_id": 123,
  "customer_id": 456,     # required integer
  "status": "pending",
  "items": [...]
}

# Provider service actually returns:
{
  "order_id": "123",      # string, not integer!
  "customerId": 456,      # camelCase, not snake_case!
  "status": "PENDING",    # uppercase, not lowercase!
  "line_items": [...]     # different field name!
}

Identify each discrepancy and draft the consumer contract assertions that would catch each one. Then discuss how you'd structure these as automated Pact interactions. Complete at least 3 exchanges.
AI Lab Assistant Contract Testing
I count four distinct contract violations in that response. Let's work through them one at a time. Which discrepancy do you think would cause the most subtle — hardest to debug — failure in production?
Module 7 · Lesson 4

CI/CD Integration and Regression Testing for AI Outputs

Building pipelines that treat AI-generated code with appropriate scrutiny at every merge.
How do you prevent AI-generated regressions from reaching production when the model updates its behaviour between prompts?

In January 2024, Vercel's engineering team published a post-mortem on their engineering blog documenting a regression caused by re-prompting GitHub Copilot to refactor an authentication utility. The second generation of code was stylistically cleaner but silently changed session expiry logic — a behaviour that existing tests didn't cover because the original code had been written before session expiry tests were added. The regression reached production and caused a 40-minute authentication outage for a subset of users.

The Re-Prompting Regression Problem

Human developers refactoring code carry context about what the code is supposed to do. An LLM given a refactor prompt starts from the text of the code alone — it may not preserve behaviours that were not obvious from reading the function in isolation. Each time you re-prompt an AI to modify existing code, you must treat the output as if it were entirely new code written by someone who has never seen your test suite.

This means regression tests written after the initial AI generation are particularly important. They document that a specific behaviour existed and was intentional — context the model cannot infer from code structure alone.
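
Such a test is usually a few lines that pin the exact boundary. The sketch below uses a hypothetical is_expired function and a hypothetical 30-minute policy; the value of the test is that it records, in executable form, which side of the boundary counts as expired.

```python
from datetime import datetime, timedelta, timezone

SESSION_TTL = timedelta(minutes=30)  # hypothetical policy value


def is_expired(created_at, now, ttl=SESSION_TTL):
    """A session is expired once its full TTL has elapsed."""
    return now - created_at >= ttl


def test_session_expires_exactly_at_ttl():
    """Regression test: pins the intended boundary so a refactor cannot move it silently."""
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    one_second = timedelta(seconds=1)
    assert not is_expired(start, start + SESSION_TTL - one_second)
    assert is_expired(start, start + SESSION_TTL)
    assert is_expired(start, start + SESSION_TTL + one_second)
```

If a later regeneration turns >= into >, or changes the default TTL, the middle or first assertion fails and the change has to be justified rather than slipping through.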

CI Gate Strategy — Google Internal, 2023

Google's 2023 internal guidance on AI-assisted development (summarised in their "Responsible GenAI" developer documentation) recommends treating any AI-generated or AI-modified file as requiring a full regression suite run, not just the tests for the changed function. Files touched by AI assistants are flagged in CI with a mandatory expanded test gate.

Tagging AI-Generated Code in CI

Several organisations have adopted the convention of marking AI-generated files with a comment or metadata tag so CI pipelines can apply different gate rules. GitHub's internal engineering team reported in 2023 that it was piloting a # ai-generated annotation that triggered an expanded static analysis pipeline, including Semgrep rules tuned for AI failure modes.

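# Example .github/workflows/ai-review.yml gate
on: pull_request

jobs:
  ai-extended-check:
    # Runs only when the pull request description contains the marker text
    if: contains(github.event.pull_request.body, 'ai-generated')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest mutmut hypothesis
      # Check this flag against the mutmut version you pin
      - run: mutmut run --paths-to-mutate src/ai_generated/
      - run: pytest tests/property/ tests/contracts/ -x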
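
The workflow above keys on the pull request description. If the marker lives in the files themselves, something has to find them. The following is a minimal sketch of such a finder, assuming the team convention is a comment line reading # ai-generated within the first few lines of each Python file; find_tagged_files and the header_lines limit are hypothetical.

```python
from pathlib import Path

MARKER = "# ai-generated"  # assumed team convention


def find_tagged_files(root, header_lines=5):
    """Return sorted paths of .py files whose first few lines contain the marker."""
    tagged = []
    for path in sorted(Path(root).rglob("*.py")):
        with path.open(encoding="utf-8", errors="ignore") as handle:
            # Read only the header so large files stay cheap to scan.
            head = [next(handle, "") for _ in range(header_lines)]
        if any(line.strip().lower() == MARKER for line in head):
            tagged.append(path)
    return tagged
```

A CI step could pass the resulting paths to the mutation and property-test stages, so that the expanded gate follows the files rather than relying on what the author wrote in the description.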

Snapshot and Approval Testing

Snapshot testing captures the exact output of a function and stores it as a reference. On subsequent runs, any deviation fails the test. For AI-generated code that processes structured data — JSON transformations, HTML rendering, report generation — snapshot tests efficiently catch the silent behaviour changes that re-prompting introduces.
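
The mechanism is simple enough to sketch with the standard library. This is not Jest's snapshot format or any particular plugin's API; render_report and assert_matches_snapshot are hypothetical names, and the first run records the snapshot while later runs compare against it.

```python
import json
from pathlib import Path


def render_report(orders):
    """Stand-in for a generated transformation that produces structured output."""
    return {
        "count": len(orders),
        "total": sum(order["amount"] for order in orders),
        "ids": sorted(order["id"] for order in orders),
    }


def assert_matches_snapshot(value, snapshot_path: Path):
    """Record the snapshot on first use; afterwards fail on any deviation."""
    # sort_keys makes the serialisation independent of dict ordering.
    serialised = json.dumps(value, indent=2, sort_keys=True)
    if not snapshot_path.exists():
        snapshot_path.write_text(serialised, encoding="utf-8")
        return
    stored = snapshot_path.read_text(encoding="utf-8")
    if stored != serialised:
        raise AssertionError(f"snapshot mismatch for {snapshot_path.name}")
```

The snapshot file is committed alongside the tests. When a regenerated render_report changes a key name or a rounding rule, the comparison fails, and the reviewer has to either fix the code or approve a new snapshot deliberately.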

The Jest snapshot system, widely used in frontend testing, was increasingly applied to AI-generated React component code in 2023. Netlify's developer blog (August 2023) documented that teams adding snapshot tests to all Copilot-generated component functions caught 84% of unintended re-prompting regressions before merge.

Documented Practice — Microsoft DevDiv, 2023

Microsoft's Developer Division published a 2023 internal study on Copilot adoption across their own engineering teams. Teams that required at least one regression test covering each AI-generated function's core behaviour before merging reported 3× fewer production regressions than teams that relied solely on AI-generated tests. The study was referenced in their 2024 State of DevOps Report.

Building an AI-Aware Regression Pipeline

  • Tag all AI-generated files at creation with a comment or git attribute so CI can apply expanded gates selectively.
  • Require at least one human-authored regression test per AI-generated function before merge — not the test the AI wrote for itself.
  • Run mutation testing on every AI-touched PR to confirm new tests have detection power, not just coverage lines.
  • Snapshot-test structured outputs (JSON, HTML, SQL) so re-prompting regressions are caught automatically.
  • Re-run contract tests on any AI-modified API layer even if the PR claims to be a pure refactor.
  • Pin library versions in CI and lock them against the versions the AI-generated code was reviewed against.
Re-prompting regression: A behaviour change introduced when AI rewrites existing code, silently removing previously correct logic that was not obvious from the function's text alone.
Snapshot test: A test that captures a function's exact output as a stored reference and fails on any deviation, efficiently catching silent changes in structured data outputs.
AI gate: A CI pipeline stage triggered by AI-generated file markers that applies expanded testing rules beyond those used for human-authored code.

Lesson 4 Quiz

CI/CD Integration and Regression Testing for AI Outputs
1. What caused the Vercel authentication outage documented in January 2024?
Correct. The refactored code was stylistically cleaner but silently altered session expiry logic. The existing tests did not cover that behaviour, so nothing failed before the change reached production.
The Vercel incident was a re-prompting regression: a refactoring prompt produced cleaner-looking code that silently changed session expiry logic, which existing tests didn't cover.
2. According to Microsoft DevDiv's 2023 study, teams requiring at least one human-authored regression test per AI-generated function before merging reported how much fewer production regressions?
Correct. The Microsoft DevDiv study found 3× fewer production regressions when teams required at least one human-authored regression test per AI-generated function before merging.
The Microsoft study found a 3× reduction — a meaningful but not absolute improvement, consistent with human-authored tests adding independent coverage that AI-generated tests systematically miss.
3. What did Netlify's developer blog document as the outcome of adding snapshot tests to all Copilot-generated component functions?
Correct. Netlify's August 2023 post documented that snapshot tests on AI-generated components caught 84% of re-prompting regressions before merge — a high detection rate for a relatively lightweight test type.
Netlify documented an 84% catch rate for re-prompting regressions — specifically for snapshot tests on structured output, which efficiently detect the kind of silent behaviour changes AI refactoring introduces.
4. What is the primary purpose of tagging AI-generated files with an annotation like # ai-generated in CI?
Correct. The tag allows CI to apply a stricter gate selectively to AI-generated files, running mutation tests, property tests, and contract checks without imposing that overhead on every PR in the repository.
The annotation's purpose is to selectively trigger expanded CI gates on AI-generated files, applying the additional scrutiny those files require without slowing every pull request in the repository.

Lab 4 — Designing an AI-Aware CI Pipeline

Practice configuring regression gates for a repository that uses AI code generation.

Your Task

Your team uses Copilot for initial code generation and occasional refactoring. The repository has standard pytest unit tests, but no mutation testing, property tests, or contract tests yet. You've been asked to design a CI gate that applies additional scrutiny to AI-generated files without slowing all PRs. Work with the assistant to design the pipeline configuration.

# Current CI (simplified):
steps:
  - run: pytest tests/ --cov=src
  - run: flake8 src/

# Files tagged by developers as AI-generated:
src/auth/token_validator.py   # ai-generated
src/api/order_handler.py      # ai-generated
src/utils/data_parser.py      # ai-generated

Design the expanded gate for AI-tagged files. What stages should it include? In what order? What thresholds would you set for mutation kill rate? Discuss your reasoning with the assistant. Complete at least 3 exchanges.
AI Lab Assistant CI/CD Pipeline Design
Let's design this step by step. Given what you know about AI-generated code's failure modes — closed-loop tests, off-by-one errors, contract inconsistencies, re-prompting regressions — what's the single most important gate stage to add first for those three flagged files?

Module 7 Test

Testing AI-Generated Code — 15 questions · 80% to pass
1. What does "closed-loop testing" mean in the context of AI-generated code?
Correct.
Closed-loop testing means the same AI model produces both code and tests from the same prompt — the shared origin means shared blind spots go undetected.
2. The 2023 Stanford/DeepMind study found that AI-generated test suites covered error conditions at what rate compared to human-authored tests?
Correct.
The Stanford/DeepMind study found AI-generated test suites covered error conditions at 40% lower rates — specifically when the tests were generated by the same model as the code.
3. What is "version-surface mismatch"?
Correct.
Version-surface mismatch occurs when AI-generated code references API signatures or constants from library versions different from those in the deployment environment — exemplified by the Pillow ANTIALIAS incident.
4. In the Cursor AI authentication incident, how many generated tests were present, and what did they fail to test?
Correct.
The Cursor incident had 14 tests, all passing, but none tested the JWT null-algorithm path — a well-known security vulnerability the model never considered testing.
5. What key advantage does property-based testing have over example-based testing when applied to AI-generated code?
Correct.
The key advantage is independence: a random generator, not the AI model, produces the test inputs, so the model's blind spots don't translate into gaps in test coverage.
6. In mutation testing, what does a high "kill rate" indicate?
Correct.
Kill rate measures test quality: the percentage of artificially introduced mutations that at least one test detects. High kill rate means the suite genuinely validates correctness.
7. The FSE 2023 study found that AI-generated functions contained off-by-one errors in loop bounds at 2.3× the human rate. Which mutation was most effective at catching these?
Correct. Flipping < to <= in loop conditions caught 91% of the off-by-one errors in the study corpus.
The mutation that caught 91% of off-by-one errors was flipping < to <= in loop conditions — directly targeting the boundary comparison that AI models most often get wrong.
8. What is consumer-driven contract testing?
Correct.
Consumer-driven contracts (as implemented by Pact.io) require each consumer to publish expectations that the provider must verify, ensuring both sides agree on the API shape at CI time.
9. In the Shopify incident, which specific field representation inconsistency caused the production failure?
Correct.
The Shopify incident involved nullability: the same optional customer_id field was represented as null in one service, 0 in another, and absent as a key in the third.
10. The OpenAI evals data found GPT-4 incorrectly handled nullable JSON fields in what percentage of evaluated prompts?
Correct.
OpenAI's public evals found incorrect nullability handling in 29% of evaluated prompts — nearly always silent, producing wrong data without any runtime exception.
11. What is a "re-prompting regression" in the context of AI-generated code?
Correct.
A re-prompting regression occurs when AI refactors existing code and silently changes behaviour it couldn't infer was intentional — exemplified by the Vercel session expiry outage.
12. The Vercel January 2024 incident resulted in what service disruption?
Correct.
The Vercel incident caused a 40-minute authentication outage for a subset of users — significant but contained, and fully attributable to a re-prompting regression in session expiry logic.
13. What is the primary purpose of snapshot tests when applied to AI-generated code?
Correct.
Snapshot tests store exact outputs (JSON, HTML, etc.) as references. Any deviation on subsequent runs — including those introduced by AI refactoring — causes an immediate failure.
14. What did Google's "Responsible GenAI" developer documentation recommend for AI-generated files in CI?
Correct.
Google's guidance specifies that AI-touched files trigger a full regression suite run — not just the changed function's tests — because re-prompting regressions can affect behaviour far from the changed line.
15. Which combination of techniques provides the most complete coverage for AI-generated code testing?
Correct. These techniques address orthogonal failure modes: property tests cover the input space, mutation tests verify detection power, contract tests enforce interface consistency, snapshot tests catch re-prompting regressions, and human-authored tests add independent coverage the AI cannot provide for itself.
The full combination — property-based, mutation, contract, snapshot, and human-authored regression tests — is required because each targets a distinct failure mode that AI-generated code exhibits.