Module 2 · Lesson 1

What Quality Gates Actually Are

Automated checkpoints that block bad code before it reaches production — and the organizational logic behind them.

Why do teams ship bugs they already detected?

On August 1, 2012, Knight Capital Group deployed new trading software to production. One of eight servers did not receive the updated code. Within 45 minutes, the firm accumulated an unintended position of roughly $7 billion across 154 stocks. The loss was $440 million, roughly four times Knight's 2011 net income. By December 2012 the firm had agreed to be acquired by Getco.

The root cause was not a complex algorithm. It was the absence of a deployment gate that verified identical binary versions across all servers before enabling live traffic. The defect was known to exist in principle; no gate existed to catch it in practice.
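A minimal Python sketch of such a parity gate, assuming each server can report a checksum of the artifact it is running. The function names, host names, and byte strings are illustrative and are not taken from Knight Capital's systems.

```python
import hashlib


def artifact_digest(artifact: bytes) -> str:
    """SHA-256 hex digest of a deployed artifact."""
    return hashlib.sha256(artifact).hexdigest()


def mismatched_hosts(expected: str, reported: dict[str, str]) -> list[str]:
    """Return the hosts whose reported digest differs from the expected one."""
    return sorted(host for host, digest in reported.items() if digest != expected)


def parity_gate(expected: str, reported: dict[str, str]) -> bool:
    """Pass only if every host runs the expected artifact."""
    return not mismatched_hosts(expected, reported)


new_build = artifact_digest(b"release-2")
old_build = artifact_digest(b"release-1")
fleet = {f"server-{i}": new_build for i in range(1, 8)}
fleet["server-8"] = old_build  # one host missed the rollout

print(mismatched_hosts(new_build, fleet))  # ['server-8']
print(parity_gate(new_build, fleet))       # False
```

A gate of this shape would run after the rollout and before live traffic is enabled, so a single stale host blocks go-live instead of trading.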

Defining Quality Gates

A quality gate is a conditional checkpoint — automated or manual — that must pass before code moves to the next stage of a pipeline. Gates encode team policy into executable enforcement. They transform guidelines ("we should have 80% test coverage") into structural requirements ("this pipeline will not proceed until coverage reaches 80%").

The distinction between a guideline and a gate is enforcement. Guidelines can be skipped under pressure. Gates cannot be bypassed without a deliberate administrative override — which creates an audit trail.

Hard Gate: Blocks the pipeline completely. Build fails, merge is rejected, or deployment halts. No bypass without explicit override with documented justification.

Soft Gate: Produces a warning and allows progression, but logs the violation for review. Used for non-critical standards or newly introduced metrics during ramp-up.

Advisory: Informational only. Metrics are collected and reported but do not affect pipeline flow. Useful for baselining before converting to a hard gate.
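The three gate types differ only in what happens when the condition is not met. A small Python sketch of that decision, where `GateMode` and `evaluate_gate` are hypothetical names chosen for illustration:

```python
from enum import Enum


class GateMode(Enum):
    HARD = "hard"
    SOFT = "soft"
    ADVISORY = "advisory"


def evaluate_gate(mode: GateMode, condition_met: bool) -> dict:
    """Decide whether the pipeline proceeds and whether a violation is logged."""
    if condition_met:
        return {"proceed": True, "log_violation": False}
    if mode is GateMode.HARD:
        return {"proceed": False, "log_violation": True}
    if mode is GateMode.SOFT:
        return {"proceed": True, "log_violation": True}
    # Advisory: the metric is reported elsewhere, nothing is flagged here.
    return {"proceed": True, "log_violation": False}


print(evaluate_gate(GateMode.HARD, condition_met=False))
print(evaluate_gate(GateMode.SOFT, condition_met=False))
```

Promoting a metric from advisory to soft to hard then becomes a one-line configuration change rather than a new pipeline step.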

Where Gates Live in a Pipeline

Modern CI/CD pipelines typically place quality gates at five canonical positions. Each position has a different cost of failure — the earlier a defect is caught, the cheaper it is to fix. This is the economic argument for shifting quality left.

Pre-commit: Local hooks (Git hooks via Husky, pre-commit framework) run linters and formatters before a commit is recorded. Zero network cost. Catches style and syntax errors in seconds.

Pre-merge (PR gate): CI runs on the feature branch before merging. Includes unit tests, static analysis, and coverage checks. The most impactful gate for most teams.

Post-merge (trunk gate): Full test suite runs on the main branch after merge. Catches integration regressions not visible on isolated branches.

Pre-release: Security scans, dependency audits, performance benchmarks. Often triggers a manual approval step.

Post-deployment: Smoke tests, canary analysis, error-rate monitoring. Triggers automatic rollback if thresholds are breached.
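As an illustration of the post-deployment position, the rollback decision can be reduced to a threshold check on observed traffic. This is a sketch under assumed values: the 1% error-rate limit and the 100-request minimum are placeholders, not recommendations from any specific tool.

```python
def should_roll_back(errors: int, requests: int,
                     max_error_rate: float = 0.01,
                     min_requests: int = 100) -> bool:
    """Return True when the observed error rate breaches the threshold."""
    if requests < min_requests:
        return False  # not enough traffic yet to judge the release
    return errors / requests > max_error_rate


print(should_roll_back(errors=5, requests=200))  # True (2.5% error rate)
print(should_roll_back(errors=1, requests=200))  # False (0.5% error rate)
```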

IBM Systems Sciences Institute — Cost of Defect Data

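These figures are best read as relative costs. A defect caught at the requirements/design phase costs approximately $1 to fix. The same defect caught in testing costs $10–$25. Caught in production: $100–$1,000. The original data behind these ratios is not publicly available, so treat them as an order-of-magnitude illustration rather than a measurement. Quality gates operationalize this cost curve by making it structurally impossible to skip early-stage detection.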

Gate Configuration: The SonarQube Model

SonarQube, one of the most widely deployed static analysis platforms, popularized the term "Quality Gate" as a product feature. A SonarQube Quality Gate is a named set of conditions applied to analysis results. The built-in "Sonar way" gate evaluates new code only. Its exact conditions vary by SonarQube version, but they amount to: no new issues (older versions instead require an A rating for reliability, security, and maintainability), all new security hotspots reviewed, coverage on new code ≥ 80%, and duplicated lines on new code ≤ 3%.

This new code focus is deliberate. Applying gates to the entire legacy codebase at once typically fails immediately and discourages adoption. Gating only new code allows incremental improvement without requiring teams to fix years of technical debt before shipping anything.

# sonar-project.properties: scanner settings
# Gate conditions themselves are defined server-side in the named Quality Gate
# Make the scanner wait for the gate result and exit non-zero if the gate fails
sonar.qualitygate.wait=true
# Exclude test and generated sources from coverage calculations
sonar.coverage.exclusions=**/*Test*,**/generated/**
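The "new code" idea can be sketched independently of SonarQube: compute coverage only over lines changed in the current diff. The function below is a simplified illustration, not SonarQube's algorithm, and it assumes the changed-line sets contain executable lines only.

```python
def new_code_coverage(changed: dict[str, set[int]],
                      covered: dict[str, set[int]]) -> float:
    """Percentage of changed lines that were executed by the test suite."""
    total = sum(len(lines) for lines in changed.values())
    if total == 0:
        return 100.0  # nothing changed, nothing to gate
    hit = sum(len(lines & covered.get(path, set()))
              for path, lines in changed.items())
    return 100.0 * hit / total


# Five lines changed in billing.py; four of them are exercised by tests.
changed = {"billing.py": {10, 11, 12, 13, 14}}
covered = {"billing.py": {1, 2, 3, 10, 11, 12, 13}, "legacy.py": set()}

print(new_code_coverage(changed, covered))  # 80.0
```

Note that `legacy.py` has no covered lines at all, yet it does not affect the result because it was not touched by the change.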

The Bypass Problem

Gates fail organizationally when teams normalize bypasses or reduce the gate to a formality. Heartbleed, the OpenSSL vulnerability disclosed in 2014 and estimated to affect 17% of all SSL-secured web servers, was a memory safety defect (a missing bounds check) in a codebase with a code review process on paper. The review gate existed; the change was approved by a single reviewer, and the missing check was not caught.

The lesson is not that gates are insufficient. It is that gates must be audited for effectiveness, not just existence. A 100% PR approval rate with 2-minute average review times is a signal that the gate is ceremonial, not substantive.
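One way to audit a review gate for effectiveness is to compute the two signals named above from pull request history. The sketch below uses assumed cut-offs (a rejection rate under 2% and an average review under five minutes); real thresholds would need calibrating per team.

```python
from statistics import mean


def looks_ceremonial(reviews: list[tuple[bool, float]],
                     min_rejection_rate: float = 0.02,
                     min_avg_seconds: float = 300.0) -> bool:
    """reviews holds one (approved, seconds_spent) pair per pull request."""
    if not reviews:
        return False
    rejected = sum(1 for approved, _ in reviews if not approved)
    rejection_rate = rejected / len(reviews)
    avg_seconds = mean(seconds for _, seconds in reviews)
    return rejection_rate < min_rejection_rate and avg_seconds < min_avg_seconds


rubber_stamp = [(True, 90.0)] * 40
substantive = [(True, 600.0)] * 9 + [(False, 900.0)]

print(looks_ceremonial(rubber_stamp))  # True
print(looks_ceremonial(substantive))   # False
```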

Core Principle

A quality gate is only as strong as the enforcement mechanism behind it. Documenting a gate is not the same as implementing one. Implementing one is not the same as monitoring whether it is being respected in practice.

Lesson 1 Quiz

What Quality Gates Actually Are · 4 questions

1. What is the primary distinction between a quality guideline and a quality gate?

Correct. The enforcement mechanism is the defining difference. A gate that can be silently ignored is functionally a guideline.

Not quite. The distinction is purely about enforcement: guidelines are advisory, gates block pipeline progression unless explicitly overridden with an audit trail.

2. In the Knight Capital incident (2012), what specific gate failure caused the $440M loss?

Correct. One of eight servers retained old code. A pre-deployment consistency check would have caught the mismatch.

The Knight Capital failure was a deployment consistency issue — no gate verified binary parity across all servers before go-live.

3. SonarQube's default "Sonar way" Quality Gate focuses conditions on new code rather than the full codebase. What is the primary reason for this design choice?

Correct. The "new code" focus is a pragmatic adoption strategy — it avoids the situation where a team can never ship because the legacy debt is too large to clear.

The reason is organizational pragmatism: teams with large legacy codebases would fail every gate immediately, killing adoption. New code focus enables gradual improvement.

4. A team notices their PR approval rate is 100% with an average review time of 90 seconds. What does this signal about their quality gate?

Correct. Extremely fast universal approval is a red flag analogous to the Heartbleed review failure — the gate exists on paper but provides no real protection.

Speed and universality of approvals is a warning sign. As the Heartbleed case illustrates, a gate that is never rejected is likely not being genuinely exercised.

Lab 1: Designing a Gate Architecture

Conversational practice — discuss quality gate placement and classification with your AI instructor.

Scenario

You are a senior engineer joining a fintech startup that has zero formal quality gates. They deploy directly from feature branches to production roughly twice a week. You have been asked to design a phased gate rollout plan. Your instructor will guide you through the tradeoffs.

Start by describing which gate position you would implement first and why. Then discuss how you would classify it (hard, soft, or advisory) and what specific conditions it would enforce.

AI Instructor

Quality Gates

Welcome to Lab 1. You're stepping into a fintech startup with no quality gates and direct-to-production deploys. Walk me through your first priority: which pipeline position would you gate first, would it be hard or soft, and what conditions would you enforce? There's no single right answer — I want to hear your reasoning.

Module 2 · Lesson 2

Writing Acceptance Criteria That Actually Work

From vague user stories to testable, unambiguous conditions of satisfaction — and the formats teams use in production.

What is the difference between a requirement and a criterion?

Between 1985 and 1987, the Therac-25 radiation therapy machine delivered massive overdoses to at least six patients, killing three. Its software had been reused from earlier models, but the hardware safety interlocks those models relied on had been removed, leaving software as the only safeguard. The acceptance criteria for the software never specified what behavior was acceptable when the operator entered commands faster than the system could process them. No criterion existed for the race condition. No gate tested it. The machine passed acceptance testing because the tests did not cover the failure mode.

The Therac-25 is the canonical case for why acceptance criteria must be derived from failure modes, not from happy-path workflows.

What Acceptance Criteria Must Be

Acceptance criteria (AC) define the conditions under which a feature or story is considered complete. They translate business intent into verifiable technical outcomes. Good AC have five properties, often summarized as SMART-AC: Specific, Measurable, Achievable, Relevant, and Testable.

The "testable" property is the hardest to satisfy. A criterion is testable only if a failing test can be written against it before the implementation exists. If you cannot write a red test from the criterion, the criterion is underspecified.

Specific: Names exact inputs, outputs, state, or actors. "The system handles errors" is not specific. "The API returns HTTP 422 with a JSON error body when the email field is missing" is.

Measurable: Quantifies where quantity matters. "Fast response" is not measurable. "P99 response time under 200ms with 500 concurrent users" is.

Testable: A failing automated or manual test can be written directly from the criterion. If it cannot, the criterion is a guideline, not an acceptance condition.
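To show what "testable" means in practice, here is the HTTP 422 criterion turned into a test. The test function can be written first and fails (red) until a handler exists; `validate_signup` is a hypothetical stand-in for the real endpoint, included only so the example runs.

```python
def validate_signup(payload: dict) -> tuple[int, dict]:
    """Hypothetical handler: returns (HTTP status, JSON body)."""
    if not payload.get("email"):
        return 422, {"error": "email is required"}
    return 201, {"email": payload["email"]}


def test_missing_email_returns_422():
    # Criterion: the API returns HTTP 422 with a JSON error body
    # when the email field is missing.
    status, body = validate_signup({"name": "Ada"})
    assert status == 422
    assert "error" in body


test_missing_email_returns_422()
```

By contrast, no test of this kind can be derived from "the system handles errors", which is the practical sign that the criterion is underspecified.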

The Gherkin Format

Gherkin — the syntax used by Cucumber, Behave, and SpecFlow — is the most widely adopted structured format for acceptance criteria. It enforces a Given / When / Then grammar that maps directly to test structure: precondition, action, expected outcome.

Feature: Payment authorization

  Scenario: Decline transaction when card is expired
    Given a customer has a card with expiry date 01/2023
    And today's date is 15/03/2024
    When the customer submits a $50.00 payment
    Then the transaction is declined
    And the response code is "expired_card"
    And no charge is applied to the customer's account

  Scenario: Allow transaction within credit limit
    Given a customer has an available credit limit of $200.00
    When the customer submits a $150.00 payment
    Then the transaction is approved
    And the available credit limit becomes $50.00

Each Then clause is a direct acceptance criterion. The scenario as a whole is an executable specification — it can be run as an automated test with a BDD framework. This is the concept of Specification by Example, formalized by Gojko Adzic and widely adopted in teams using behavior-driven development.
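The mapping from Given / When / Then to test structure can be shown without a BDD framework. The sketch below hand-translates the expired-card scenario into a plain Python test; `authorize` is a hypothetical authorizer written for this example, and a framework such as Behave would bind each Gherkin step to code instead of using comments.

```python
from datetime import date


def authorize(expiry_year: int, expiry_month: int,
              today: date, amount_cents: int) -> dict:
    """Hypothetical authorizer: a card is valid through the end of its expiry month."""
    expired = (today.year, today.month) > (expiry_year, expiry_month)
    if expired:
        return {"approved": False, "code": "expired_card", "charged_cents": 0}
    return {"approved": True, "code": "approved", "charged_cents": amount_cents}


def test_decline_when_card_is_expired():
    # Given a customer has a card with expiry date 01/2023
    # And today's date is 15/03/2024
    today = date(2024, 3, 15)
    # When the customer submits a $50.00 payment
    result = authorize(2023, 1, today, amount_cents=5000)
    # Then the transaction is declined
    assert result["approved"] is False
    # And the response code is "expired_card"
    assert result["code"] == "expired_card"
    # And no charge is applied to the customer's account
    assert result["charged_cents"] == 0


test_decline_when_card_is_expired()
```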

Non-Functional Acceptance Criteria

Most AC failures in production involve non-functional requirements — performance, security, reliability — that were never specified as criteria at all. Teams write story AC for features but omit system-level criteria that every story must satisfy.

Common Non-Functional AC

API endpoint responds in <200ms at P95 under load
All database queries use parameterized statements
No secrets appear in application logs
Dependency audit shows zero critical CVEs
Accessibility: WCAG 2.1 AA compliance
Mobile: renders correctly at 320px viewport
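Non-functional criteria become gates once they are computed from measurements. As one example, the P95 latency criterion can be checked against a list of response times. This sketch uses the nearest-rank percentile definition, which is one of several in common use, and the sample data is invented.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_criterion(samples_ms: list[float], limit_ms: float = 200.0) -> bool:
    """True when the P95 response time is under the limit."""
    return percentile(samples_ms, 95) < limit_ms


mostly_fast = [100.0] * 95 + [500.0] * 5
too_many_slow = [100.0] * 94 + [500.0] * 6

print(meets_latency_criterion(mostly_fast))    # True
print(meets_latency_criterion(too_many_slow))  # False
```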

Definition of Done vs. Acceptance Criteria

DoD: applies to every story on the team
AC: specific to a particular story or feature
DoD includes: tests pass, code reviewed, docs updated
AC includes: specific behavior in specific conditions
Both must be met before a story is "done"
DoD is a gate; AC is the specification

The Definition of Done as a Cross-Story Gate

The Scrum Guide defines the Definition of Done as a formal description of the quality standard required for any Increment. It is team-level acceptance criteria applied universally. When a team's DoD includes "unit test coverage does not decrease," that is a quality gate encoded as a policy. When it includes "all accessibility requirements verified," that is a non-functional criterion applied at the story level.

Effective DoDs are versioned in the repository alongside the code they govern. Teams at Spotify, Atlassian, and ThoughtWorks have published approaches where the DoD is stored as a checklist in pull request templates, which makes the gate visible at the point of review.

Specification by Example

Gojko Adzic's 2011 book documented practices at teams including Microsoft, Google, and various financial institutions where acceptance criteria were written as executable examples before implementation. The key finding: teams that wrote AC before code had 40–70% fewer defects in production-facing releases compared to teams that wrote AC after implementation or not at all.

Lesson 2 Quiz

Writing Acceptance Criteria That Actually Work · 4 questions

1. Which property of acceptance criteria is considered the hardest to satisfy, and why?

Correct. Testability is the litmus test: if you cannot write a failing test directly from the criterion, the criterion is still a guideline rather than a verifiable condition of acceptance.

Testability is the hardest to satisfy. The test: can you write a red (failing) test directly from the criterion before writing any implementation code? If not, the criterion is underspecified.

2. In the Gherkin Given/When/Then format, which clause directly corresponds to an acceptance criterion?

Correct. The Then clause is the assertion — the observable outcome the system must produce. This maps directly to the acceptance criterion being verified.

The Then clause is the acceptance criterion expressed as an expected outcome. Given sets up state; When triggers the action; Then verifies the result.

3. How does a Definition of Done differ from story-specific acceptance criteria?

Correct. The DoD is a team-wide gate (applies to all stories). AC is a story-specific specification. Both must be satisfied for a story to be complete.

The key distinction is scope: DoD is universal to all stories (e.g., "tests pass, code reviewed"), while AC is specific to one story's particular behaviors and edge cases.

4. The Therac-25 case illustrates which fundamental problem with acceptance criteria design?

Correct. The machine passed testing because tests only covered normal operation. AC derived only from intended workflows will always miss defects in edge cases and concurrent operations.

The Therac-25 lesson: AC must be derived from failure modes, not just happy paths. The race condition was never in scope because no criterion required testing for it.

Lab 2: Writing Acceptance Criteria

Practice converting vague user stories into testable Gherkin scenarios with your AI instructor.

Scenario

You have received the following user story from a product manager: "As a user, I want to reset my password so that I can regain access to my account." This story currently has no acceptance criteria. Your job is to write them in Gherkin format.

Start by writing at least two Given/When/Then scenarios for this story — one happy path and one failure mode. Your instructor will review them, suggest edge cases you may have missed, and discuss how to make each criterion testable.

AI Instructor

Acceptance Criteria

Let's work through the password reset story. Write your first two Gherkin scenarios — one for the successful reset flow, one for a failure case. Don't worry about perfection; I'll give you specific feedback on specificity, measurability, and testability, and then we'll explore the edge cases most teams miss on this type of story.

Module 2 · Lesson 3

Coverage Thresholds and Metrics That Matter

Why 80% coverage can be meaningless — and how to design metrics that actually predict production reliability.

What does a coverage number actually guarantee?

Google's internal engineering documentation, portions of which were published through the Google Testing Blog and the book Software Engineering at Google (2020), describes the company's approach to test coverage. Google does not mandate a universal coverage percentage. Instead, engineers are expected to cover all non-trivial logic paths, with coverage used as a diagnostic tool rather than a pass/fail gate. The rationale: a function can be 100% line-covered by a single test that never checks correctness — covering lines without asserting behavior.

The Coverage Measurement Problem

Code coverage measures which lines, branches, or paths were executed during a test run. It does not measure whether those executions produced correct results. A test suite can achieve 100% line coverage with zero assertions — every line runs, nothing is verified.

This is not a hypothetical. In 2018, a study of open-source Python projects by Ahmed et al. found that roughly 30% of covered lines in high-coverage projects were covered by tests with no assertions — the tests existed only to exercise setup and teardown code.
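The gap between execution and verification is easy to reproduce. In the sketch below, `apply_discount` contains a deliberate bug; both checks execute every line of it, but only the one with an assertion notices. All names are invented for illustration.

```python
def apply_discount(price: float, percent: float) -> float:
    # Deliberate bug: adds the discount instead of subtracting it.
    return price + price * percent / 100


def exercise_only():
    apply_discount(100.0, 10.0)  # executes every line, verifies nothing


def verify_result():
    assert apply_discount(100.0, 10.0) == 90.0


def passes(check) -> bool:
    """Run a check and report whether it completed without an assertion failure."""
    try:
        check()
        return True
    except AssertionError:
        return False


print(passes(exercise_only))  # True: full line coverage, bug undetected
print(passes(verify_result))  # False: the assertion exposes the bug
```

A coverage tool would report 100% line coverage for `apply_discount` in both cases.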

Line Coverage: Percentage of executable lines executed by the test suite. The weakest and most commonly reported metric. Easy to game.

Branch Coverage: Percentage of conditional branches (if/else, switch cases) exercised. Stronger than line coverage because it requires both true and false paths to be tested.

Path Coverage: Percentage of distinct execution paths through a function. Exponentially expensive: impractical for most real-world code but ideal for critical functions.

Mutation Score: Percentage of artificially introduced bugs (mutations) that the test suite detects. The strongest coverage proxy, because it directly measures whether tests verify correctness.
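Mutation score can be illustrated by hand on a one-line function. The mutants below are written manually to stand in for what a tool such as PIT or Stryker would generate automatically; this is a teaching sketch, not how those tools are implemented.

```python
from typing import Callable


def is_adult(age: int) -> bool:
    return age >= 18


# Hand-written mutants of is_adult.
MUTANTS: list[Callable[[int], bool]] = [
    lambda age: age > 18,   # boundary changed
    lambda age: age < 18,   # comparison negated
    lambda age: True,       # result replaced with a constant
]


def suite_passes(fn: Callable[[int], bool], cases: list[tuple[int, bool]]) -> bool:
    return all(fn(arg) == expected for arg, expected in cases)


def mutation_score(mutants, cases) -> float:
    """Percentage of mutants that make at least one test case fail."""
    killed = sum(1 for mutant in mutants if not suite_passes(mutant, cases))
    return 100.0 * killed / len(mutants)


weak_cases = [(30, True)]
strong_cases = [(30, True), (18, True), (17, False)]

print(round(mutation_score(MUTANTS, weak_cases), 1))  # 33.3
print(mutation_score(MUTANTS, strong_cases))          # 100.0
```

Both suites pass against the real `is_adult` and both give it 100% line coverage; only the mutation score distinguishes them.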

Why Teams Default to 80%

The "80% coverage threshold" is pervasive in industry standards, static analysis tools, and CI configurations. Its origin is partly empirical — several studies in the 1990s and 2000s found diminishing returns on defect detection above approximately 80% line coverage — and partly conventional. When a tool needs a default, it uses the number the industry already uses.

The problem is that 80% line coverage in a 10,000-line codebase means 2,000 lines are never executed in tests. If those lines are concentrated in error-handling paths, edge cases, or security-critical functions, the number is actively misleading — it implies coverage that does not exist where it matters.

The Mutation Testing Evidence

Benchmarks run at major companies with PIT (the most widely used Java mutation testing framework, developed by Henry Coles) show that projects with 80% line coverage often achieve mutation scores of only 40–55%, meaning about half of injected defects go undetected. Projects targeting an 80% mutation score typically need significantly higher assertion density in their tests, but demonstrate dramatically better defect detection before production.

Coverage Metrics That Predict Production Quality

Research from Microsoft (Nagappan, Ball, Zeller — 2006) and empirical studies published in IEEE TSE consistently identify that the following metrics, used in combination, are stronger predictors of post-release defects than line coverage alone:

Branch coverage on critical paths: Identify the 20% of code that handles authentication, payment, data persistence, and error recovery. Require branch coverage ≥ 90% on those paths specifically.

Assertion density: Average number of assertions per test method. Below 1.5 suggests tests are executing code without verifying behavior. Above 5 may indicate tests that are too broad.

Defect escape rate by module: Production bugs traced back to modules with recent coverage decreases are a leading indicator. Coverage delta on changed files matters more than absolute coverage.

Test failure rate stability: A test suite with high flakiness (intermittent failures) provides unreliable gate enforcement even at high coverage percentages — teams learn to ignore red builds.

Mutation score on business logic: Run mutation testing (PIT for JVM, Stryker for JavaScript/TypeScript) specifically on business logic modules. A score below 60% suggests test assertions are insufficient.

Configuring Coverage Gates in Practice

Effective coverage gates distinguish between code tiers. Not all code warrants identical gate strictness. A common tiered approach adopted by teams at companies including Netflix and ThoughtWorks uses three tiers defined in configuration:

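// jest.config.js: tiered coverage thresholds
// Thresholds are only enforced when coverage is collected (for example, jest --coverage).
module.exports = {
  coverageThreshold: {
    global: {
      branches: 70,
      lines: 75,
      functions: 80,
    },
    // Critical business logic: stricter thresholds.
    // Glob keys are checked per matching file; use a directory path
    // such as './src/payments/' to check the aggregate instead.
    './src/payments/**': {
      branches: 90,
      lines: 95,
      functions: 95,
    },
    // Auth and security modules
    './src/auth/**': {
      branches: 90,
      lines: 90,
      functions: 90,
    },
  },
};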
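The tiering logic itself is tool-independent. As a Python sketch, assuming branch counts per file are already available from a coverage report, a tiered check is a set of thresholds keyed by path prefix. The file names, counts, and thresholds below are invented.

```python
def branch_coverage(files: dict[str, tuple[int, int]], prefix: str = "") -> float:
    """files maps path -> (covered_branches, total_branches)."""
    selected = [counts for path, counts in files.items() if path.startswith(prefix)]
    total = sum(total_branches for _, total_branches in selected)
    if total == 0:
        return 100.0
    return 100.0 * sum(covered for covered, _ in selected) / total


def failing_tiers(files: dict[str, tuple[int, int]],
                  tiers: dict[str, float]) -> list[str]:
    """tiers maps a path prefix ('' means global) -> minimum branch coverage."""
    return [prefix or "global" for prefix, minimum in tiers.items()
            if branch_coverage(files, prefix) < minimum]


files = {
    "src/payments/charge.py": (11, 20),  # 55% branch coverage
    "src/reports/export.py": (69, 80),
}
tiers = {"": 70.0, "src/payments/": 90.0}

print(branch_coverage(files))        # 80.0
print(failing_tiers(files, tiers))   # ['src/payments/']
```

The global figure passes comfortably while the payments tier fails, which is exactly the gap a single global threshold cannot see.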

Goodhart's Law Applied to Coverage

"When a measure becomes a target, it ceases to be a good measure." — Charles Goodhart (1975). This applies precisely to coverage thresholds. Teams optimize for the number, not for the quality the number was meant to represent. The antidote is periodic mutation testing audits and coverage delta analysis on changed files, not just global percentages.

Lesson 3 Quiz

Coverage Thresholds and Metrics That Matter · 4 questions

1. What does 80% line coverage actually guarantee about a codebase?

Correct. Coverage measures execution, not correctness. A line can be covered by a test that makes no assertion about the result of executing that line.

Line coverage only measures execution — whether a line was run during a test, not whether the test verified the line produced the correct result. A test with zero assertions can cover 100% of lines.

2. Which coverage metric most directly measures whether tests verify correctness rather than merely execute code?

Correct. Mutation testing introduces artificial defects and checks whether tests fail. If they don't fail, the tests were not verifying that behavior — regardless of coverage percentage.

Mutation score is the strongest proxy because it directly tests whether your tests would catch a real bug. Tools like PIT (Java) and Stryker (JavaScript) automate this.

3. A team has 80% global line coverage but their payments module has 55% branch coverage. According to the tiered approach, what action is appropriate?

Correct. Critical code paths like payments warrant tiered, stricter thresholds. Global averages can mask dangerous gaps in the highest-risk modules.

The tiered approach defines stricter thresholds per-directory for critical code. A global average passing while a payments module sits at 55% branch coverage is exactly the scenario tiering is designed to prevent.

4. What does Goodhart's Law predict will happen when a team makes "80% line coverage" a hard CI gate?

Correct. When coverage becomes a target, teams find the shortest path to the number — which often means assertion-free tests, tests on trivial getters, and coverage-padding. The metric decouples from the quality it was meant to measure.

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Teams optimize for the coverage number itself rather than for what the number was supposed to represent.

Lab 3: Designing a Coverage Gate Strategy

Discuss tiered coverage thresholds and mutation testing trade-offs with your AI instructor.

Scenario

Your engineering manager has asked you to propose coverage gate thresholds for a healthcare data platform. The codebase has three tiers: core medical record logic, API integration layer, and internal tooling/utilities. Currently there are no coverage gates at all. The CTO has heard "80% is industry standard" and wants to set that globally.

Explain to your instructor why a single global 80% line coverage threshold is insufficient for this context, and propose a tiered alternative. Discuss which metrics beyond line coverage you would add for the medical records tier.

AI Instructor

Coverage Strategy

Your CTO wants a single global 80% line coverage gate for a healthcare data platform. Make the case — to me — for why that's insufficient, and propose a concrete tiered alternative. Include at least one metric beyond line coverage that you'd require for the medical records module specifically, and be ready to defend your choices.

Module 2 · Lesson 4

Integrating Gates into CI/CD Pipelines

Practical patterns for encoding quality gates in GitHub Actions, GitLab CI, and Jenkins — with failure handling and bypass governance.

How do you make a gate impossible to ignore without making it impossible to ship?

Etsy became famous in 2012 for deploying to production more than 50 times per day with a small engineering team. Their approach, documented publicly by engineer John Allspaw and others, relied on what they called a "safety net" pipeline — automated gates on every commit that made it structurally impossible to deploy known-broken code without a deliberate human override. Critically, Etsy tracked every bypass: who requested it, why, and what the production outcome was. The bypass data fed back into gate improvements quarterly.

The lesson Etsy published: gates and speed are not opposites. Weak gates produce slow shipping because production incidents consume engineering time. Strong gates with well-designed override governance produce faster shipping because the pipeline is trusted.

Pipeline Gate Implementation Patterns

Modern CI/CD platforms implement gates through different primitives, but the logical structure is identical: a step that can fail the pipeline based on a condition. The implementation patterns below cover the three dominant platforms in enterprise environments.

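# .github/workflows/quality-gate.yml: quality gate job with SonarQube
name: quality-gate
on: [pull_request]

jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so new code can be identified
      - uses: actions/setup-node@v4
        with:
          node-version: 20  # assumption: adjust to the project's Node version
      - name: Install dependencies
        run: npm ci
      - name: Run tests with coverage
        run: npm run test:coverage
      - name: SonarQube Scan
        # Consider pinning to a release tag or commit SHA instead of master
        uses: SonarSource/sonarqube-scan-action@master
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}
      - name: Quality Gate Check
        # This step fails the job if the SonarQube gate fails
        uses: SonarSource/sonarqube-quality-gate-action@master
        timeout-minutes: 5
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
# Branch protection rules must require this job for a failure to block the merge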

Branch Protection as Gate Enforcement

A quality gate job is only meaningful if merging is blocked when the job fails. GitHub and GitLab enforce this through required status checks — branch protection rules that prevent merging a pull request until named CI jobs pass. Without branch protection, a failing gate produces a warning that engineers can and will ignore under deadline pressure.

GitHub — Required Status Checks

Settings → Branches → Branch protection rules
Enable "Require status checks to pass before merging"
Add specific job names (e.g., "quality-gate")
Enable "Require branches to be up to date"
Enable "Do not allow bypassing the above settings" for critical branches

GitLab — Protected Branches + Rules

Settings → Repository → Protected branches
Set merge allowed for: Maintainers only
Add required approval rules
Pipeline must succeed before merge (project-level setting)
Use needs: in .gitlab-ci.yml to enforce job ordering

Bypass Governance: The Override Protocol

No gate system can be fully zero-bypass in practice. Production incidents, regulatory deadlines, and critical hotfixes will occasionally require bypassing a failing gate. The critical question is not whether bypasses are allowed, but how they are governed. Three elements make bypass governance effective:

Explicit override mechanism: A documented, named path for bypass — such as a GitHub admin merge, a JIRA bypass ticket, or a signed approval from a named authority. The mechanism should require more effort than the standard path.

Mandatory justification: The bypasser must record why the gate was bypassed — in a PR comment, ticket, or automated form. This creates the audit trail and prevents silent bypasses that accumulate undetected.

Bypass review cadence: Monthly or quarterly review of all bypasses. Patterns in bypass justifications reveal gate calibration problems — gates being bypassed regularly are either misconfigured or protecting the wrong things.
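The last two elements lend themselves to a small data model: a bypass record that cannot be created without a justification, and a review helper that surfaces gates bypassed repeatedly. This is a sketch with hypothetical names and an arbitrary threshold of three bypasses per review window.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Bypass:
    gate: str
    requested_by: str
    justification: str

    def __post_init__(self):
        # Mandatory justification: refuse to record a silent bypass.
        if not self.justification.strip():
            raise ValueError("a bypass requires a written justification")


def bypass_hotspots(log: list[Bypass], threshold: int = 3) -> list[str]:
    """Gates bypassed at least `threshold` times in the review window."""
    counts = Counter(entry.gate for entry in log)
    return sorted(gate for gate, n in counts.items() if n >= threshold)


log = [
    Bypass("coverage-gate", "alice", "hotfix for production outage"),
    Bypass("coverage-gate", "bob", "threshold fails on generated code"),
    Bypass("coverage-gate", "carol", "threshold fails on generated code"),
    Bypass("dependency-audit", "alice", "advisory has no patched release yet"),
]

print(bypass_hotspots(log))  # ['coverage-gate']
```

Repeated justifications like "fails on generated code" point to a calibration problem in the gate rather than a discipline problem in the team.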

# .gitlab-ci.yml: staged pipeline with gate dependencies
stages:
  - test
  - quality
  - security
  - deploy

unit-tests:
  stage: test
  # --cov-branch records branch data; the terminal report feeds the coverage regex below
  script: pytest --cov=src --cov-branch --cov-report=term --cov-report=xml
  coverage: '/TOTAL.*\s+(\d+%)$/'
  artifacts:
    paths:
      - coverage.xml  # make the report available to the coverage-gate job
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml

coverage-gate:
  stage: quality
  needs: [unit-tests]
  script:
    # check_coverage.py is a project-specific script, not a standard tool
    - python scripts/check_coverage.py --min-branch=80 --critical-paths=src/payments,src/auth --min-branch-critical=90
  allow_failure: false  # Hard gate: pipeline fails here if thresholds not met

dependency-audit:
  stage: security
  needs: [coverage-gate]
  # pip-audit exits non-zero when it finds any known vulnerability
  script: pip-audit --requirement requirements.txt
  allow_failure: false

Flaky Tests and Gate Reliability

A gate that fails intermittently for reasons unrelated to code quality erodes trust in the entire pipeline. Engineers learn to re-run failures until they pass — defeating the gate's purpose. Google's internal research on test flakiness, published in their CACM paper "Flaky Tests at Scale" (2020), found that at Google's test volume, a single 1-in-100 flaky test would fail dozens of builds per day.

Effective gate systems require quarantine mechanisms for flaky tests: automated detection of intermittent failures (using plugins such as pytest-repeat or pytest-rerunfailures), quarantine labels that remove flaky tests from gate computation until they are fixed, and SLAs for resolving quarantined tests, typically 5 business days before the test is deleted.
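Detection and quarantine can both be sketched in a few lines. The heuristic below flags a test as flaky when the same commit produced both a pass and a failure, then excludes quarantined tests from the gate decision. The data shapes and names are assumptions for illustration, not the behavior of any particular CI product.

```python
from collections import defaultdict


def find_flaky_tests(runs: list[tuple[str, str, bool]]) -> set[str]:
    """runs holds (test_name, commit_sha, passed) records.

    A test is flagged when one commit produced both a pass and a failure.
    """
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    return {test for (test, _), seen in outcomes.items() if seen == {True, False}}


def gate_passes(results: dict[str, bool], quarantined: set[str]) -> bool:
    """Quarantined tests still run, but do not count toward the gate."""
    return all(passed for test, passed in results.items() if test not in quarantined)


runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),  # same commit, different outcome
    ("test_login", "abc123", True),
    ("test_login", "def456", False),     # different commit: a real regression, not flakiness
]
quarantined = find_flaky_tests(runs)

print(quarantined)  # {'test_checkout'}
print(gate_passes({"test_checkout": False, "test_login": True}, quarantined))  # True
```

The SLA then applies to the quarantine set: each entry needs an owner and a deadline, otherwise quarantine becomes a permanent bypass.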

The Shift-Left Economics

The Accelerate research (Forsgren, Humble, Kim — 2018), based on the DORA State of DevOps surveys across thousands of organizations, found that elite-performing engineering teams — defined by deployment frequency, lead time, change failure rate, and recovery time — were significantly more likely to have comprehensive automated quality gates in their pipelines. The correlation is not that gates make teams slow: elite teams ship faster and have fewer production failures precisely because gates catch defects before they reach production.

Gate Design Checklist

A well-designed quality gate system satisfies: (1) every gate is implemented as code in version control; (2) every gate has a documented owner and calibration date; (3) bypass history is retained and reviewed; (4) gates are tiered by code criticality; (5) flaky test quarantine is operational; (6) gate metrics (pass rate, bypass rate, false positive rate) are visible to the team.

Lesson 4 Quiz

Integrating Gates into CI/CD Pipelines · 4 questions

1. What is the minimum requirement to make a CI quality gate job an actual hard gate rather than an advisory?

Correct. A failing CI job that does not block merging is just a warning. Branch protection rules — requiring named status checks to pass before merge is allowed — enforce the gate structurally.

allow_failure: false stops the pipeline stage from proceeding, but engineers can still merge without that job passing if branch protection is not configured. The gate must be tied to merge eligibility via protected branch rules.

2. What three elements does Etsy's documented bypass governance model require to make overrides acceptable?

Correct. The three elements are: (1) an explicit, friction-ful override path; (2) mandatory documented justification; and (3) periodic review of bypass patterns to improve gate calibration.

Etsy's model emphasizes: explicit override mechanism (more effort than standard merge), mandatory justification for audit trails, and regular review of bypass patterns to recalibrate gates — not specific approval chains.

3. Google's research on flaky tests found they should be handled with a quarantine mechanism. What is the appropriate SLA for resolving a quarantined test?

Correct. A 5-business-day SLA is short enough to maintain pipeline trust and long enough to allow investigation. The key principle: an unreliable test that trains engineers to ignore red builds is worse than no test at all.

The recommended SLA is 5 business days, after which deletion is preferred over keeping the test. A flaky test that persists teaches the team to re-run failures until they pass — which defeats the entire gate mechanism.
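
The SLA rule is simple enough to automate. The sketch below assumes a quarantine date is stored per test; the function names are hypothetical, and public holidays are ignored for brevity.

```python
from datetime import date, timedelta

SLA_BUSINESS_DAYS = 5  # the SLA recommended in this lesson


def business_days_between(start: date, end: date) -> int:
    """Count weekdays after `start`, up to and including `end`."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            days += 1
    return days


def quarantine_action(quarantined_on: date, today: date) -> str:
    """Return 'fix' while inside the SLA and 'delete' once it has expired."""
    elapsed = business_days_between(quarantined_on, today)
    return "delete" if elapsed > SLA_BUSINESS_DAYS else "fix"
```

For a test quarantined on Monday 2024-01-01, `quarantine_action` returns `"fix"` through Monday 2024-01-08 (five business days elapsed) and `"delete"` from Tuesday 2024-01-09 onward.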

4. The DORA/Accelerate research found which relationship between quality gates and shipping speed?

Correct. This is one of the central empirical findings in Accelerate — the fastest-shipping teams also have the fewest production failures, and both outcomes correlate with strong automated quality gate practices.

The DORA research (Forsgren, Humble, Kim) found that gates and speed are complementary: elite teams ship more frequently AND fail less, and both correlate with comprehensive gate adoption. Production incidents from weak gates are what actually slow teams down.

Lab 4: Auditing a Pipeline Gate Configuration

Practice identifying gate weaknesses in a real CI configuration with your AI instructor.

Scenario

Your team's current GitHub Actions workflow runs unit tests, a linter, and a SonarQube scan. Branch protection on main requires only "linter" as a required status check — tests and SonarQube are optional. The SonarQube scan uses sonar.qualitygate.wait=false. There is no documented bypass protocol and merges have been happening with failing tests roughly once per sprint.

Identify the specific gaps in this pipeline's gate architecture, explain why each gap is a problem using concepts from this module, and propose three concrete configuration changes that would address the most critical weaknesses.

AI Instructor

Pipeline Audit

Walk me through your audit of this pipeline. I'll tell you what you got right, what you missed, and push back on your proposed fixes to make sure they're implementation-ready rather than theoretical. Start by naming the most critical gap — the one that, if left unfixed, creates the greatest risk — and tell me why it's critical.

Module 2 Test

Quality Gates and Acceptance Criteria · 15 questions · Pass at 80%

1. What is the defining characteristic that separates a quality gate from a quality guideline?

Correct.

The distinction is enforcement: guidelines can be silently ignored; gates structurally block progression and require auditable override.

2. A "soft gate" in a CI pipeline is best described as:

Correct.

A soft gate logs a violation but allows the pipeline to continue — useful during ramp-up before converting to a hard gate.
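
The three enforcement modes differ only in what happens on a violation. A minimal sketch, assuming a single numeric metric checked against a minimum threshold; `Mode` and `evaluate_gate` are hypothetical names, not part of any CI tool.

```python
from enum import Enum


class Mode(Enum):
    HARD = "hard"
    SOFT = "soft"
    ADVISORY = "advisory"


def evaluate_gate(value: float, threshold: float, mode: Mode) -> tuple[bool, str]:
    """Return (pipeline_may_continue, message) for one gate check."""
    if value >= threshold:
        return True, "pass"
    if mode is Mode.HARD:
        return False, f"blocked: {value} is below {threshold}"
    if mode is Mode.SOFT:
        return True, f"warning logged: {value} is below {threshold}"
    return True, "metric recorded"  # advisory: collected, never enforced
```

With coverage at 75.0 against a threshold of 80.0, only `Mode.HARD` returns `False` as the first element; the other two modes let the pipeline continue.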

3. In the IBM Systems Sciences Institute cost-of-defect data, a defect caught in production costs approximately how much more than one caught at design time?

Correct.

Design-time: ~$1. Production: $100–$1,000. This 100–1,000x cost differential is the economic foundation of shift-left quality practices.

4. SonarQube's "new code" focus in its default Quality Gate is designed to:

Correct.

Gating only new code is a pragmatic adoption strategy — teams with large legacy debts would fail immediately on a whole-codebase gate, killing adoption before quality improves.

5. The Therac-25 case demonstrates that acceptance criteria must be derived from:

Correct.

The Therac-25 passed all its acceptance tests because tests only covered normal operation. The fatal race condition was never a criterion, so it was never tested.

6. In Gherkin syntax, which element directly maps to a verifiable acceptance criterion?

Correct.

The Then clause states what the system must produce — this is the assertion that maps directly to the acceptance criterion.
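
One way to see the mapping is to write the scenario's steps as comments in a test. The `withdraw` function and the scenario itself are hypothetical examples, not taken from the lesson.

```python
def withdraw(balance: int, amount: int) -> int:
    """Hypothetical system under test."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


def test_withdrawal_reduces_balance() -> None:
    # Given an account with a balance of 100
    balance = 100
    # When the user withdraws 30
    new_balance = withdraw(balance, 30)
    # Then the balance is 70
    assert new_balance == 70
```

Given and When only set up and trigger behavior. The Then step is the single line that can fail, which is why it carries the acceptance criterion.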

7. What is the key difference between a team's Definition of Done and story-specific acceptance criteria?

Correct.

DoD is team-wide (applies to every story). AC is story-specific (defines this story's particular conditions of satisfaction). Both must be met before a story is complete.

8. A test with 100% line coverage but zero assertions:

Correct.

Tests can execute code without asserting anything about the result. 100% coverage with zero assertions tells you every line runs — not that any line produces a correct result.

9. Mutation testing tools like PIT (Java) and Stryker (JavaScript) measure quality by:

Correct.

Mutation testing works by changing one thing in the source (e.g., flipping a > to <) and checking whether any test fails. Tests that don't detect the mutation are not verifying that behavior.
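
The idea can be shown by hand with a single mutant. This sketch does not use PIT or Stryker, which generate and run mutants automatically; every name below is hypothetical.

```python
from typing import Callable

Predicate = Callable[[int, int], bool]


def original(value: int, limit: int) -> bool:
    return value > limit


def mutant(value: int, limit: int) -> bool:
    return value < limit  # the single mutation: > flipped to <


def weak_suite(fn: Predicate) -> bool:
    """Only checks the equal case, where > and < give the same answer."""
    return fn(5, 5) is False


def strong_suite(fn: Predicate) -> bool:
    """Also checks a case where > and < disagree."""
    return fn(5, 5) is False and fn(10, 5) is True


def mutant_killed(suite: Callable[[Predicate], bool]) -> bool:
    """A mutant is killed when the suite passes on the original but fails on the mutant."""
    return suite(original) and not suite(mutant)


print(mutant_killed(weak_suite))    # False: the mutant survives
print(mutant_killed(strong_suite))  # True: the mutant is killed
```

The weak suite runs the comparison line, so it would count toward coverage, yet it cannot tell the original from the mutant. That surviving mutant is the signal mutation testing reports.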

10. Goodhart's Law applied to coverage thresholds predicts that:

Correct.

Goodhart: "When a measure becomes a target, it ceases to be a good measure." Teams write coverage-padding tests, and the number stops predicting quality.

11. In a GitHub Actions pipeline, what configuration makes a quality gate job an enforced hard gate on the main branch?

Correct.

Without branch protection requiring the job as a status check, a failing job produces a red indicator that engineers can ignore. The merge block is what makes it a hard gate.

12. Etsy's published approach to CI/CD at 50+ daily deploys revealed that quality gates and deployment frequency are:

Correct.

Etsy's documented lesson: gates and speed are not opposites. Weak gates create production incidents that consume more engineering time than the gates would have. Trust in the pipeline enables speed.

13. The recommended approach for handling flaky (intermittently failing) tests in a gated pipeline is:

Correct.

A flaky test that trains engineers to ignore red builds is worse than no test. Quarantine with a 5-business-day SLA, followed by deletion if the test is still unfixed, ensures that unreliable tests are repaired quickly or removed.

14. The DORA/Accelerate research found that among elite-performing engineering organizations:

Correct.

The Accelerate finding is clear: elite teams deploy more often AND have fewer failures, and both outcomes correlate with strong automated quality gate practices. Speed and quality are not opposed.

15. An effective bypass governance protocol for a quality gate must include which three elements?

Correct.

The three elements: (1) an explicit override path that takes deliberate effort, (2) mandatory documented justification, (3) periodic review of bypass patterns. Together they allow necessary bypasses while preventing normalization of gate violations.