Session 1 of 8

AI in the Web Pentest Workflow

Where AI augments classic Burp/ZAP-driven testing and where it still hurts — speed, false positives, judgment
● ~60 minutes

Learning Objectives

  • Identify the stages of a web application pentest where AI tooling provides the greatest time savings without sacrificing finding quality
  • Explain the false-positive problem in AI-assisted scanning and why human verification is non-negotiable before reporting any AI-surfaced finding
  • Describe the current capability boundaries of LLM-assisted tooling — what it can do reliably, what it gets wrong, and where human judgment remains irreplaceable
  • Position AI as an augmentation layer within existing Burp Suite and OWASP ZAP-based workflows rather than a replacement for structured methodology

Session Overview

AI tooling is arriving in web application pentesting at speed, but the hype significantly outpaces the reality of what current tools can reliably deliver. This opening session sets realistic expectations: AI accelerates specific, well-defined subtasks — content enumeration, payload suggestion, response interpretation, and report drafting — while introducing new risks around false positives, over-reliance, and reproducibility. The goal is not skepticism about AI, but calibrated, disciplined adoption.

Use this session to establish the mental model that runs through the rest of the course: AI is a force multiplier, not a replacement for expertise. Every AI-surfaced candidate finding requires manual verification. Every AI-generated payload requires the tester to understand what it does and why. The practitioners who will get the most from these tools are those with the strongest traditional web security foundations — because they can evaluate AI output critically rather than accepting it uncritically.

Key Teaching Points

  • Map the pentest lifecycle to AI fit. Reconnaissance and content discovery, payload generation, response analysis, and report writing are high AI-fit activities. Authorization testing, logic flaw investigation, and final finding verification are low AI-fit — they require contextual judgment that current models do not reliably provide. Teach participants to route tasks appropriately rather than reaching for AI by default.
  • False positives are the primary risk of AI-assisted scanning. LLMs are pattern-matchers that can confidently identify "vulnerabilities" based on superficial response characteristics rather than confirmed exploitability. A finding is not a finding until a human has reproduced it with a clean test case and understood the root cause. Establish this standard as non-negotiable from the first session.
  • Speed gains are real but uneven. Experienced practitioners report 30–50% time savings on content discovery and report drafting, near-zero gains on complex authorization testing, and occasional negative outcomes when AI-generated payloads are used without understanding them. Help participants build a personal mental model of where their own time savings are highest.
  • AI-assisted tooling integrates with Burp at multiple points. Introduce the current integration landscape: AI-enhanced extensions for passive analysis of Burp history, LLM-powered payload generation for Intruder, and natural-language query interfaces for Repeater analysis. Participants should understand where in the proxy workflow these touchpoints exist.
  • Engagement scope and client expectations may need updating. Clients whose pentest contracts were scoped for traditional testing may not have anticipated AI-assisted tooling. Discuss briefly whether and how to disclose AI tool use to clients — a topic covered in depth in Session 8.

Discussion Prompts

  • Think about the last complex web application pentest you ran. At which stages do you think AI assistance would have saved the most time, and at which stages would it have been counterproductive or misleading?
  • If an AI-assisted tool surfaces 200 potential findings in a two-week engagement, how do you prioritize which ones to verify manually? What criteria do you use?
  • A junior practitioner on your team is using an AI tool to generate payloads for SQL injection testing. They are sending payloads they cannot explain. What is the risk, and how do you address it?
  • A client asks whether you used AI tools in their pentest. What is the right answer, and does it depend on what the AI did?

Instructor Notes

The room will often contain both AI enthusiasts and AI skeptics. Avoid taking sides — the goal of this session is calibration, not advocacy. The most useful framing is: "AI tools are already being used in engagements. Our job is to use them in ways that make findings better, not worse." That framing disarms both camps.

If the group has mixed experience levels, the question about junior practitioners generating payloads they cannot explain tends to surface important values about professional responsibility that set the tone well for the rest of the course.

Avoid demo-ing specific AI tools in detail in this session — the landscape changes fast and tool-specific content ages poorly. Keep the teaching at the level of capability categories and workflow integration points rather than specific product names or features.

Timing Guide

Introduction: 10 min
Core Content: 28 min
Discussion: 17 min
Wrap-up: 5 min

Session 2 of 8

LLM-Assisted Spidering and Discovery

Driving content discovery, parameter mining, and JS analysis with model-aware tooling
● ~60 minutes

Learning Objectives

  • Apply LLM-assisted techniques to expand content discovery beyond what wordlist-based tools surface, using application-aware path and parameter prediction
  • Use AI tooling to analyze minified or bundled JavaScript to identify endpoints, parameter names, and authentication patterns more efficiently than manual review
  • Explain the limitations of AI-driven discovery — hallucinated endpoints, rate-limit sensitivity, and coverage gaps — and how to compensate for them
  • Integrate AI-assisted discovery output into a Burp Suite sitemap without introducing unverified noise into the working session

Session Overview

Content discovery has always been a combination of wordlist brute-force and application-aware inference. AI tools accelerate the inference side of that equation: given a partial sitemap, an LLM can predict likely endpoint patterns, parameter names, and hidden functionality based on the application's apparent purpose and the conventions of its technology stack. This session examines how to use that capability effectively while avoiding the traps of hallucinated endpoints and false confidence.

JavaScript analysis is a second major discovery accelerator. Modern single-page applications bundle enormous volumes of routing logic, API endpoint definitions, and authentication flow code into client-side JavaScript. Manually reading a 2 MB minified bundle is impractical; passing it through an LLM with targeted extraction prompts is increasingly effective. Cover both use cases (endpoint prediction and JS analysis) and walk participants through the verification step that must follow each.
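One way to keep extraction prompts small is to run a cheap deterministic pass over the bundle first and hand the model only the candidate strings. The following is a minimal sketch of that idea, not a complete extractor: the function name `extract_candidates`, both regular expressions, and the sample bundle are illustrative assumptions, and a real minified bundle will contain call patterns these expressions miss.

```python
import re

# Quoted string literals that look like root-relative paths (heuristic, assumed).
PATH_RE = re.compile(r"""["'`](/[A-Za-z0-9_\-./{}:]+)["'`]""")
# fetch("...") and axios.<method>("...") calls with a literal first argument.
CALL_RE = re.compile(
    r"""\b(?:fetch|axios(?:\.(get|post|put|patch|delete))?)\(\s*["'`]([^"'`]+)["'`]"""
)


def extract_candidates(js_source: str) -> dict:
    """Return de-duplicated path literals and HTTP call sites found in JS text."""
    paths = sorted(set(PATH_RE.findall(js_source)))
    calls = sorted(
        {(method or "unknown", url) for method, url in CALL_RE.findall(js_source)}
    )
    return {"paths": paths, "calls": calls}


# Hypothetical one-line bundle used only to demonstrate the output shape.
SAMPLE = 'fetch("/api/v1/users");axios.post("/api/v1/login",{u:a});var r="/admin/export";'
result = extract_candidates(SAMPLE)
print(result)
```

Everything this returns is still a hypothesis list: dynamically built URLs are invisible to it, and every extracted path needs the same live verification as a model-suggested one.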

Key Teaching Points

  • Application-aware path prediction outperforms generic wordlists for specific targets. Provide the model with a partial sitemap, the application's apparent domain (e-commerce, healthcare portal, banking), and the technology stack hints visible from headers and responses. Ask it to suggest likely endpoint patterns. Compare the output against a standard wordlist — the AI suggestions will typically include domain-specific paths the wordlist misses.
  • Hallucinated endpoints must be explicitly verified. LLMs generate plausible-sounding paths that may not exist. Every AI-suggested path must be probed against the live target before being recorded as a finding candidate. Build the habit of treating AI output as a hypothesis list, not a confirmed sitemap.
  • JS analysis with LLMs is most effective with targeted prompts. Rather than dumping an entire bundle and asking "what are the endpoints?", break the task into targeted extractions: "Extract all URL strings," "Identify all fetch() or axios() call patterns," "Find all references to authentication tokens." Smaller, focused prompts produce higher-quality output than broad requests against large inputs.
  • Parameter mining benefits from semantic analysis. AI can infer likely parameter names and types from field naming conventions, form structures, and API response shapes. Ask the model to hypothesize hidden parameters based on visible patterns; this frequently surfaces parameters that wordlist-based parameter brute-forcing misses.
  • Rate limiting and scope compliance require human oversight. AI-driven discovery tools may issue large volumes of requests without awareness of engagement rate limits or scope boundaries. Ensure participants configure appropriate throttling and scope filters before using any automated AI-assisted discovery tool. The model has no concept of rules of engagement.
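The "hypothesis list, not a confirmed sitemap" habit can be made concrete with a small pre-processing step that runs before any request is sent. The sketch below normalizes model-suggested paths, drops anything outside the agreed hosts, removes paths already in the sitemap, and labels what remains as unverified. The function name, field names, and example hosts are assumptions for illustration; probing and throttling are deliberately left out.

```python
from urllib.parse import urljoin, urlsplit


def build_probe_queue(base_url, suggested, known_paths, in_scope_hosts):
    """Turn raw AI suggestions into a de-duplicated, in-scope, unverified queue."""
    queue = []
    seen = set(known_paths)
    for raw in suggested:
        url = urljoin(base_url, raw.strip())
        parts = urlsplit(url)
        if parts.hostname not in in_scope_hosts:
            continue  # outside the rules of engagement: never probe
        path = parts.path or "/"
        if path in seen:
            continue  # already in the sitemap or already queued
        seen.add(path)
        queue.append(
            {"url": url, "path": path, "source": "ai-suggested", "status": "unverified"}
        )
    return queue


# Hypothetical engagement data.
queue = build_probe_queue(
    base_url="https://app.example.test/",
    suggested=["/api/orders", "api/orders", "https://cdn.other.test/x", "/login", " /api/refunds "],
    known_paths={"/login"},
    in_scope_hosts={"app.example.test"},
)
for item in queue:
    print(item["status"], item["url"])
```

Keeping the `source` and `status` fields on each entry also gives participants one possible answer to the documentation question in the discussion prompts.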

Discussion Prompts

  • You are assessing a large single-page application with a heavily minified React bundle. Walk through how you would approach JavaScript analysis with AI assistance — what prompts would you use and in what order?
  • An AI discovery tool suggests 50 endpoint paths. You have time to manually test 15 of them. How do you prioritize which 15 to verify?
  • What are the signs in an AI-generated endpoint list that suggest the model is hallucinating rather than inferring from real patterns in the target application?
  • How should AI-discovered endpoints be documented in your Burp session to distinguish them from manually confirmed endpoints?

Instructor Notes

The JavaScript analysis use case tends to generate the most enthusiasm in rooms with experienced testers, because it directly addresses one of the most tedious aspects of modern web assessment. If you have access to a sample (non-sensitive) minified JavaScript file, a live demonstration of AI extraction prompts is very effective here.

Stress the verification discipline heavily. The danger of AI discovery is that it produces output that looks authoritative and complete. Participants who have been burned by unverified scanner output in the past will understand immediately — invoke that experience if the group has it.

For any participants who primarily use ZAP rather than Burp, the concepts are identical — the specific integration points differ, but the workflow logic (discover, verify, import) is the same. Acknowledge both tools explicitly to avoid alienating ZAP-primary practitioners.

Timing Guide

Introduction: 10 min
Core Content: 28 min
Discussion: 17 min
Wrap-up: 5 min

Session 3 of 8

Auth and Session Attacks with AI

Identifying flawed auth, broken access control, and session-handling bugs faster using LLM analysis
● ~60 minutes

Learning Objectives

  • Use LLM analysis to rapidly identify structural weaknesses in JWTs, OAuth flows, and session cookie implementations from captured traffic
  • Apply AI-assisted access control testing to systematically probe object-level authorization (IDOR) across large APIs more efficiently than manual ID enumeration
  • Explain where AI assistance improves throughput in auth testing (pattern analysis, request modification) versus where it cannot substitute for manual reasoning (multi-role authorization matrices, business logic)
  • Document auth and session findings with AI-assisted evidence that meets the evidentiary standards required for a defensible report

Session Overview

Broken access control sits at the top of the OWASP Top 10 (2021), and authentication failures remain on the list. These categories persist not because the attacks are technically sophisticated, but because they are hard to test comprehensively at scale. A typical REST API may expose hundreds of endpoints, each with multiple HTTP methods and role-specific access rules. Manually testing every combination is impractical; AI assistance can meaningfully reduce the time required to identify the highest-risk candidates for focused manual testing.

This session covers two major use cases. The first is AI-assisted traffic analysis: feeding captured auth flows to an LLM and asking it to identify structural weaknesses — weak JWT algorithms, predictable session tokens, OAuth redirect misconfigurations. The second is AI-accelerated access control testing: using model-generated request variations to probe IDOR conditions across object identifiers more efficiently than pure manual enumeration. Both use cases require careful verification workflows, which this session covers in detail.

Key Teaching Points

  • JWT and OAuth analysis is a strong AI use case. Paste a JWT or an OAuth authorization code flow into an LLM and ask it to identify: algorithm weaknesses (acceptance of alg "none", RS256-to-HS256 key confusion), claim validation gaps (missing exp check, absent audience restriction), and redirect URI patterns that could enable token theft. AI analysis of this kind is fast and catches common structural issues that human review might miss under time pressure. Several of these are server-side behaviors: the model can flag them as candidates from the token and traffic, but only a crafted request against the target confirms them.
  • Session token analysis benefits from AI-assisted pattern detection. Given a sample of captured session tokens, an LLM can identify suspicious patterns (predictable components, embedded timestamps, base64-encoded user IDs) that warrant closer investigation. This is faster than manual inspection, but an LLM does not measure randomness; confirm with a dedicated entropy analysis tool such as Burp Sequencer before reporting.
  • IDOR testing at scale is a strong AI-assist candidate. Ask the model to generate a matrix of test cases for a given API: for each endpoint, for each HTTP method, for each plausible role, what IDs should be tested? Use that matrix as a structured testing guide rather than ad-hoc enumeration. The model's output reduces the cognitive load of tracking coverage across a large API surface.
  • Multi-role authorization matrices require human construction. AI can help document and organize authorization requirements, but the practitioner must define the role taxonomy, understand the business rules, and manually verify each boundary. Do not rely on AI to infer what should be restricted — it does not understand the client's business context.
  • Forced browsing and privilege escalation attempts must stay within scope. AI tools that generate IDOR test cases may produce requests targeting IDs that belong to real users. Ensure participants understand the rules of engagement around accessing other users' data, even in a test context, and configure ID ranges accordingly.
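The test-case matrix described above is simple enough to generate deterministically once the practitioner has supplied the endpoints, roles, and permitted IDs. The sketch below shows one possible shape for it. The function name, the `{id}` placeholder convention, and the example endpoints are assumptions; the `expected` column is left blank on purpose because the expected outcome has to come from the tester's understanding of the business rules.

```python
from itertools import product


def build_idor_matrix(endpoints, roles, test_ids):
    """Expand (method, path) x role x id into rows a tester can work through."""
    rows = []
    for (method, path), role, obj_id in product(endpoints, roles, test_ids):
        if "{id}" not in path:
            continue  # no object identifier, so not an IDOR candidate
        rows.append(
            {
                "method": method,
                "url": path.replace("{id}", str(obj_id)),
                "role": role,
                "expected": None,  # filled in by the tester, not by a model
                "observed": None,  # filled in after manual verification
            }
        )
    return rows


# Hypothetical API surface; IDs are limited to objects owned by test accounts.
matrix = build_idor_matrix(
    endpoints=[("GET", "/api/orders/{id}"), ("DELETE", "/api/orders/{id}"), ("GET", "/api/health")],
    roles=["anonymous", "user_a", "admin"],
    test_ids=[1001, 1002],
)
print(len(matrix), "rows; first:", matrix[0])
```

Restricting `test_ids` to tester-owned objects is what keeps the matrix inside the rules of engagement.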

Discussion Prompts

  • You capture a JWT that uses the HS256 algorithm. You paste it into an LLM and ask for a security assessment. What output would you expect, and how would you verify the findings it identifies?
  • A client's API has approximately 400 endpoints. You have three days to test authorization controls. How would you use AI assistance to allocate your manual testing time most effectively?
  • An LLM analysis of session tokens tells you they appear to contain a base64-encoded user ID. What are the next steps, and what evidence do you need before including this in your report?
  • Where do you draw the line between AI-assisted test case generation for IDOR and automated scanning? Does that distinction matter legally or professionally?

Instructor Notes

JWT analysis tends to be the most immediately practical teaching moment in this session — nearly every participant has encountered JWTs in production applications, and many have experienced the frustration of manually reviewing JWT configurations under time pressure. The AI-assisted approach feels like a genuine productivity unlock for practitioners who have done this manually.

The IDOR testing at scale discussion sometimes raises the question of whether AI-generated test case matrices cross the line into automated scanning, which may be outside the engagement scope. This is a legitimate question — address it directly and teach participants to review their rules of engagement before using matrix-driven testing approaches.

Remind the group that the session is about AI augmenting methodology, not replacing expertise. A practitioner who cannot manually verify a JWT weakness should not be relying on AI to find it — they should be building the underlying skill first.

Timing Guide

Introduction: 10 min
Core Content: 28 min
Discussion: 17 min
Wrap-up: 5 min

Session 4 of 8

Injection Classes Revisited

SQLi, SSRF, XSS, and template injection — using AI to generate targeted payloads while keeping humans in the loop
● ~60 minutes

Learning Objectives

  • Use LLM-assisted payload generation to create context-specific injection payloads for SQLi, XSS, SSRF, and template injection faster than wordlist or pattern-book approaches
  • Explain the "human in the loop" requirement — why every AI-generated payload must be understood and reviewed before sending, and what risks arise when it is not
  • Apply AI analysis to injection response data to distinguish false positives, partial successes, and confirmed exploitation more efficiently than manual pattern matching
  • Identify the injection testing scenarios where AI payload generation provides the greatest productivity gain versus those where traditional tooling remains more reliable

Session Overview

Injection vulnerabilities — SQL, XSS, SSRF, template injection, command injection — are among the most well-understood vulnerability classes in web security, with decades of accumulated tooling, payloads, and evasion techniques. What AI brings to this landscape is not new categories of injection but rather the ability to rapidly generate context-specific payload variations tailored to the target environment: its technology stack, its WAF signatures, its specific parameter shapes and encoding requirements.

This session covers AI-assisted payload generation as a complement to established tooling like sqlmap, Burp's active scanner, and Intruder. The key emphasis is that AI-generated payloads must be understood before they are sent. A practitioner who sends a payload they cannot explain cannot defend the finding in a report, cannot assess whether the payload could cause unintended damage to the target, and cannot help the client understand the root cause of what was found.

Key Teaching Points

  • Context-specific payload generation is the primary AI advantage. Ask the model for SQLi payloads tailored to a specific database (PostgreSQL vs MySQL vs MSSQL), a specific encoding context (URL-encoded, JSON-embedded, XML-escaped), and a specific WAF signature pattern you have observed blocking generic payloads. This produces higher-quality starting candidates than generic wordlists for well-understood target environments.
  • AI is particularly effective for XSS filter evasion brainstorming. Describe the CSP, the HTML context, and the character restrictions in place, and ask the model to suggest bypass approaches. Use the output as a brainstorming artifact: a set of hypotheses to test, not a list of confirmed bypasses. The human still evaluates and selects which hypotheses are worth testing.
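  • SSRF payload generation benefits from cloud metadata awareness. LLMs have broad knowledge of cloud provider metadata endpoint conventions (AWS IMDSv1/v2, GCP, Azure) and internal service discovery patterns. When assessing cloud-deployed applications, ask the model to generate SSRF payloads targeting the most impactful internal resources for the inferred cloud environment. Have it account for provider-specific request requirements: AWS IMDSv2 requires a session token obtained with a PUT request, and GCP and Azure require specific request headers, so a plain GET-only SSRF may not reach them.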
  • Template injection identification requires understanding template engine syntax. AI can rapidly identify the likely template engine from error messages and response patterns, then generate syntax probes appropriate to that engine — Jinja2, Twig, FreeMarker, Thymeleaf. This is faster than manually consulting engine-specific documentation for each engagement.
  • Response analysis is a high-value AI use case. After running Intruder or Repeater tests, paste a set of responses and ask the LLM to classify them — "which of these look like successful injection, partial execution, or WAF block?" This can dramatically speed up triage of large result sets. Verify every "successful" classification manually before reporting.
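For the response-analysis step, a cheap rule-based pass can bucket responses before any are pasted into a model, which reduces the volume the model sees and gives a sanity check on its classifications. The sketch below is one assumed heuristic, not a detection engine: the status codes treated as blocks, the error signatures, the 20% length threshold, and the bucket names are all illustrative choices.

```python
import re

# A few well-known database error fragments; deliberately incomplete.
SQL_ERROR_RE = re.compile(
    r"(SQL syntax|syntax error at or near|unterminated quoted string|ORA-\d{5})",
    re.IGNORECASE,
)
# Status codes assumed here to indicate a WAF or rate-limit block.
BLOCK_STATUSES = {403, 406, 429}


def pre_triage(baseline: dict, response: dict) -> str:
    """Assign a triage bucket. No bucket is a confirmed finding."""
    if response["status"] in BLOCK_STATUSES:
        return "likely-blocked"
    if SQL_ERROR_RE.search(response["body"]):
        return "error-signature"
    if response["status"] >= 500:
        return "server-error"
    delta = abs(len(response["body"]) - len(baseline["body"]))
    if response["status"] != baseline["status"] or delta > 0.2 * max(len(baseline["body"]), 1):
        return "differs-from-baseline"
    return "no-signal"


# Hypothetical captured responses.
baseline = {"status": 200, "body": "x" * 100}
responses = [
    {"status": 403, "body": "Request blocked"},
    {"status": 200, "body": "ERROR: syntax error at or near \"'\""},
    {"status": 200, "body": "x" * 100},
    {"status": 200, "body": "x" * 150},
]
buckets = [pre_triage(baseline, r) for r in responses]
print(buckets)
```

Only the "error-signature" and "differs-from-baseline" buckets are worth sending on for model-assisted classification, and each still needs manual reproduction before it is reported.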

Discussion Prompts

  • You use an AI tool to generate 30 SQL injection payloads for a PostgreSQL target. Before sending any of them through Burp Intruder, what do you need to review or understand about each payload, and why?
  • The target application returns a WAF block response to your standard XSS payloads. How would you structure an AI-assisted brainstorming session to generate evasion approaches, and how would you evaluate the suggestions?
  • An AI analysis of your Intruder results flags 8 responses as "potentially successful SQL injection." You have time to manually verify 3 of them before the end of the engagement. How do you choose which 3?
  • A client's application is deployed on AWS and you suspect SSRF in an image-processing endpoint. What AI-generated payloads would you prioritize first, and why?

Instructor Notes

The "understand before you send" principle is the load-bearing point of this session. Push on it explicitly: ask participants to describe, out loud, what a given AI-generated payload is designed to do and what a successful response would look like. If they cannot do that, they should not send the payload — not for safety reasons alone, but because they cannot document the finding if they do not understand the mechanism.

Template injection tends to be the least familiar topic in this session for participants who come primarily from SQLi and XSS backgrounds. Spend a few extra minutes grounding it in a concrete technology (Jinja2 in a Python web framework is the most common encounter) before moving into AI-assisted identification.

Acknowledge that established tools like sqlmap remain the right choice for deep SQLi exploitation. AI payload generation complements them for initial identification and WAF evasion brainstorming, but does not replace their capability for blind injection, data extraction, and database enumeration.

Timing Guide

Introduction: 10 min
Core Content: 30 min
Discussion: 15 min
Wrap-up: 5 min

Session 5 of 8

Logic Flaws and Business-Workflow Abuse

How LLMs help reason about multi-step workflows and find bugs scanners miss
● ~60 minutes

Learning Objectives

  • Explain why business logic flaws are systematically missed by automated scanners and how AI-assisted reasoning partially compensates for that gap
  • Use LLM analysis to model multi-step application workflows and generate abuse-case hypotheses that scanners cannot produce
  • Apply AI-assisted state-machine reasoning to identify sequence manipulation attacks — skipping steps, replaying steps, or interleaving steps from different workflows
  • Recognize the limits of AI logic-flaw analysis and where practitioner domain knowledge and creativity remain irreplaceable

Session Overview

Business logic flaws are the category of web vulnerability most resistant to automation. Scanners find what they are programmed to look for — known vulnerability signatures, dangerous function calls, injection patterns. Logic flaws arise from the gap between what the application permits and what the business intends to permit. Identifying that gap requires understanding business context that no scanner possesses. This is where AI-assisted reasoning is genuinely promising — not because LLMs understand business logic, but because they can rapidly generate hypotheses about where such gaps might exist based on patterns in the application's behavior and common abuse scenarios for similar application domains.

This session covers how to use AI as a brainstorming partner for logic flaw discovery: describing the application's workflow to the model and asking it to generate abuse cases, sequence manipulation hypotheses, and state-transition attacks. The approach is explicitly hypothesis-driven — the AI generates candidates, the practitioner evaluates and tests them. This is qualitatively different from automated scanning, and participants should understand that distinction.

Key Teaching Points

  • Describe the workflow, ask for the abuse cases. Narrate the application's multi-step process — checkout flow, account verification, password reset, subscription upgrade — to the LLM in natural language, and ask it to generate a list of abuse-case hypotheses: "What could an attacker try to do with this flow that the designers did not intend?" The model's output is a structured starting point for manual investigation, not a confirmed finding list.
  • Sequence manipulation is a high-yield hypothesis category. Many logic flaws involve step-skipping (completing step 3 without completing step 2), step replaying (applying a one-time discount multiple times), or cross-workflow contamination (applying state from one user's workflow to another's). Ask the AI to generate hypotheses in each of these categories for the specific workflow you have described.
  • Race conditions in multi-step workflows are often overlooked. AI can help identify points in a workflow where a race condition could allow multiple concurrent executions to each "win" a condition that should only be satisfiable once — the classic "double spend" in payment flows, the simultaneous redemption of a single-use coupon, or the parallel approval of a request that requires sequential gate-keeping. Identify these candidate points, then test with concurrent request tooling in Burp.
  • AI can help map state transitions from traffic alone. Feed a sequence of captured requests from a workflow and ask the model to infer the application's state machine — what states are possible, what transitions are permitted, and what would happen if a transition were attempted out of order. This provides a structured map for subsequent manual testing.
  • Domain knowledge is still the differentiator. An AI model does not know that a specific financial regulation prohibits a particular transaction sequence, or that a particular healthcare workflow has compliance requirements that define "normal." The practitioner who understands the client's industry will generate better logic-flaw hypotheses than one who relies on AI alone. Emphasize this as a motivation for building domain expertise alongside technical skills.
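The sequence-manipulation categories above can be enumerated mechanically once the intended order of steps is known, which is useful as a coverage checklist alongside whatever the model proposes. The sketch below generates step-skipping, step-replay, and adjacent-reorder variants for a linear workflow. The function name and the checkout step names are hypothetical; cross-workflow contamination and race conditions are not covered and still need to be reasoned about by hand.

```python
def sequence_hypotheses(steps):
    """Generate out-of-order variants of a linear workflow to test manually."""
    steps = list(steps)
    n = len(steps)
    cases = []
    # Skip each step except the last one, which is the goal being reached.
    for i in range(n - 1):
        cases.append(("skip", steps[i], steps[:i] + steps[i + 1:]))
    # Replay each step once, immediately after its first execution.
    for i in range(n):
        cases.append(("replay", steps[i], steps[:i + 1] + steps[i:]))
    # Swap each adjacent pair of steps.
    for i in range(n - 1):
        swapped = steps[:i] + [steps[i + 1], steps[i]] + steps[i + 2:]
        cases.append(("reorder", f"{steps[i]}<->{steps[i + 1]}", swapped))
    return cases


# Hypothetical checkout flow.
flow = ["add_to_cart", "apply_coupon", "pay", "confirm"]
cases = sequence_hypotheses(flow)
for kind, target, sequence in cases:
    print(f"{kind:8} {target:28} {' -> '.join(sequence)}")
```

Each generated sequence is a question to put to the application, and the practitioner still decides which ones matter for the client's business.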

Discussion Prompts

  • You are testing a subscription upgrade flow: free tier to paid, with a 7-day free trial. Describe the workflow to the group and brainstorm — with or without AI — what abuse cases you would investigate first.
  • An AI-generated abuse-case list for a password reset flow includes the hypothesis "the reset token is not invalidated after use." How would you test that hypothesis efficiently, and what constitutes proof?
  • Where do you think AI will make the biggest impact on logic flaw discovery in the next two to three years? What capability would need to improve most significantly for that to happen?
  • A client says their application has "no logic flaws because we have unit tests for all our business rules." How do you respond?

Instructor Notes

Logic flaws are where experienced practitioners feel the most secure in their judgment and the most skeptical about AI contributions. Acknowledge that skepticism directly — it is largely correct. The AI is genuinely limited here compared to a practitioner with deep domain knowledge and creative instincts. The value is in structured hypothesis generation for practitioners who are newer to a domain, not in replacing the creative reasoning of experienced testers.

The race condition angle tends to be the most technically surprising for participants who have not encountered it as a category — many are aware of race conditions in theory but have not applied them systematically to business-workflow testing. Spend extra time here, because it is a genuinely high-yield testing area that is underused.

The discussion question about unit tests for business rules is a classic client objection. Role-play the response with the group if energy allows — it is a useful professional skills exercise as well as a content reinforcement.

Timing Guide

Introduction: 10 min
Core Content: 28 min
Discussion: 17 min
Wrap-up: 5 min

Session 6 of 8

API and GraphQL Surface Coverage

Schema-aware testing of REST and GraphQL with AI-assisted enumeration and abuse-case generation
● ~60 minutes

Learning Objectives

  • Use AI-assisted analysis of OpenAPI/Swagger specifications and GraphQL schemas to rapidly generate a prioritized test plan for API surface coverage
  • Apply LLM-generated abuse cases to REST API endpoints, focusing on authorization failures, mass assignment, and excessive data exposure
  • Explain the specific attack surface that GraphQL introspection and nested queries create, and how AI assists in exploiting it
  • Integrate AI-assisted API testing into a structured methodology that produces defensible, reproducible findings

Session Overview

APIs are the primary attack surface of modern web applications, and their size and complexity often outpace the time available for manual testing. An OpenAPI specification may document hundreds of endpoints; a GraphQL schema may expose a deeply nested type graph. AI tools can process these specifications rapidly and produce structured test plans, abuse-case hypotheses, and authorization test matrices far faster than a human reading the same documentation. This session covers how to extract maximum testing value from API specifications using LLM-assisted analysis.

GraphQL deserves particular attention because introspection, which many server implementations enable by default, makes the schema queryable, and because its flexible query structure (nested queries, aliases, and directives) creates testing challenges that simple HTTP-based scanning tools do not handle well. AI assistance is genuinely useful here for generating nested query variations, batched mutation attacks, and introspection-based enumeration strategies.
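For participants who have not written GraphQL by hand, it helps to see what a nested query and an aliased batch look like as plain strings. The sketch below builds both. The field names (`user`, `friends`), the mutation `redeemCoupon`, and its `code` argument are hypothetical and would come from the target's actual schema; string arguments are quoted with `json.dumps`, which is a convenience assumption that holds for ordinary strings.

```python
import json


def nested_query(field_cycle, depth, leaf="id"):
    """Build a query that walks a cyclic relationship `depth` levels deep."""
    body = leaf
    for level in range(depth, 0, -1):
        field = field_cycle[(level - 1) % len(field_cycle)]
        body = f"{field} {{ {body} }}"
    return f"query {{ {body} }}"


def aliased_batch(mutation_name, arg_name, values, selection="success"):
    """Build one request that runs the same mutation once per value, via aliases."""
    parts = [
        f"a{i}: {mutation_name}({arg_name}: {json.dumps(value)}) {{ {selection} }}"
        for i, value in enumerate(values)
    ]
    return "mutation { " + " ".join(parts) + " }"


deep = nested_query(["user", "friends"], depth=3)
batch = aliased_batch("redeemCoupon", "code", ["A1", "B2"])
print(deep)
print(batch)
```

Raising `depth` gradually, rather than starting with a very deep query, keeps denial-of-service testing controlled and within whatever limits the engagement allows.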

Key Teaching Points

  • Feed the API spec to the model before testing begins. Upload an OpenAPI specification and ask the LLM to: identify the highest-risk endpoints by data sensitivity and privilege level, generate a prioritized test plan organized by vulnerability category (auth, injection, excessive data exposure, mass assignment), and flag parameters that warrant particular attention based on naming conventions and data types.
  • Mass assignment testing is a strong AI use case. Ask the model to identify, from the API specification, which request body schemas include fields that are likely to be writable but should not be — role, isAdmin, accountBalance, verified, subscriptionTier. Generate test cases that include those fields in requests and observe application behavior.
  • GraphQL introspection should be the first test. If introspection is enabled, the full schema is available. Ask an LLM to analyze a GraphQL introspection response and produce: a list of all queries and mutations, identification of the most sensitive types and fields by name, and suggested test cases for authorization failures and excessive data exposure across the type graph.
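  • Nested query depth attacks are a GraphQL-specific DoS vector. GraphQL allows arbitrarily nested queries unless the server enforces depth or complexity limits. Ask the model to generate progressively deeper nested query structures and test whether the application rejects them with an error, times out, or responds successfully. Rejection suggests a limit is in place; timeouts, or successful responses with steeply rising latency, indicate potential denial-of-service exposure. Confirm that resource-exhaustion testing is permitted under the rules of engagement before sending deep queries.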
  • Batch mutation attacks exploit GraphQL's aliasing feature. Multiple mutations can be aliased and sent in a single request, potentially bypassing rate limits that operate per-request rather than per-operation. Ask the model to construct batched authentication attempts or coupon-redemption mutations and test whether the application rate-limits them correctly.
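The mass assignment check described above has a deterministic counterpart that participants can run against the same specification to cross-check the model's list. The sketch below scans the `components.schemas` section of an already-parsed OpenAPI 3 document for sensitive-looking property names that are not marked `readOnly`. The function name, the name list, and the sample specification are assumptions; `$ref` resolution and inline request-body schemas are not handled.

```python
# Property names taken from the examples above, lowercased for comparison.
SENSITIVE_NAMES = {"role", "isadmin", "accountbalance", "verified", "subscriptiontier"}


def mass_assignment_candidates(spec: dict):
    """List (schema, property) pairs that look sensitive and are not readOnly."""
    hits = []
    schemas = spec.get("components", {}).get("schemas", {})
    for schema_name, schema in schemas.items():
        for prop, details in schema.get("properties", {}).items():
            if prop.lower() in SENSITIVE_NAMES and not details.get("readOnly", False):
                hits.append((schema_name, prop))
    return sorted(hits)


# Hypothetical, already-parsed OpenAPI 3 fragment.
spec = {
    "components": {
        "schemas": {
            "UserUpdate": {
                "properties": {
                    "email": {"type": "string"},
                    "role": {"type": "string"},
                    "verified": {"type": "boolean", "readOnly": True},
                }
            },
            "Account": {"properties": {"accountBalance": {"type": "number"}}},
        }
    }
}
candidates = mass_assignment_candidates(spec)
print(candidates)
```

A `readOnly` flag in the specification is only a documentation claim, so fields excluded here may still be worth a manual test if time allows.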

Discussion Prompts

  • You receive a 400-endpoint OpenAPI specification at the start of a five-day engagement. Walk through how you would use AI assistance to prioritize your testing effort in the first hour before any active testing begins.
  • A GraphQL API has introspection disabled. What alternative enumeration approaches would you use, and where does AI assistance remain valuable in the absence of the schema?
  • What is the risk of relying on AI-generated API abuse cases without independently understanding the application's data model? Give an example of how that could lead to a missed finding or a false positive.
  • How do you handle a situation where an AI-generated mass assignment test case appears to succeed (the request is accepted without error), but the application's response does not clearly confirm whether the change was persisted?

Instructor Notes

Participants who work heavily with APIs will get significant value from the OpenAPI specification analysis workflow. Consider preparing a sample (anonymized) OpenAPI spec in advance to walk through the analysis prompt structure live — the demonstration effect of going from "here is the spec" to "here is a prioritized test plan" in two minutes is compelling.
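For the live walkthrough, it can help to show that part of the spec analysis is mechanical and checkable without a model. The sketch below scans an already-parsed OpenAPI document (a Python dict) for request-body properties whose names resemble the sensitive fields mentioned earlier (role, isAdmin, accountBalance, verified, subscriptionTier). It makes several assumptions: only application/json bodies are considered, only local "#/..." references are resolved, nested objects are not walked, and the substring matching will over-flag. Its output is a list of candidates to test, not findings.

```python
SENSITIVE_HINTS = ("role", "isadmin", "accountbalance", "verified", "subscriptiontier")
WRITE_METHODS = {"post", "put", "patch"}


def resolve(spec, schema):
    """Follow local '#/...' $ref pointers; return {} if unresolvable."""
    seen = set()
    while isinstance(schema, dict) and "$ref" in schema:
        ref = schema["$ref"]
        if not ref.startswith("#/") or ref in seen:
            return {}
        seen.add(ref)
        node = spec
        for part in ref[2:].split("/"):
            node = node.get(part, {}) if isinstance(node, dict) else {}
        schema = node
    return schema if isinstance(schema, dict) else {}


def mass_assignment_candidates(spec):
    """Return sorted (METHOD, path, field, declared_read_only) tuples."""
    hits = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method.lower() not in WRITE_METHODS or not isinstance(op, dict):
                continue
            content = op.get("requestBody", {}).get("content", {})
            schema = resolve(spec, content.get("application/json", {}).get("schema", {}))
            for name, prop in schema.get("properties", {}).items():
                if any(hint in name.lower() for hint in SENSITIVE_HINTS):
                    # Fields declared readOnly are still worth testing: the
                    # server may bind them even though the spec says otherwise.
                    read_only = bool(resolve(spec, prop).get("readOnly", False))
                    hits.append((method.upper(), path, name, read_only))
    return sorted(hits)
```

A list like this gives participants something concrete to compare against the model's prioritized test plan, which is a useful way to demonstrate checking AI output against an independent source.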

GraphQL is unfamiliar territory for some participants. Spend a few minutes establishing the fundamentals — introspection, types, queries vs mutations, aliases — before moving into AI-assisted testing approaches. Do not assume GraphQL fluency even in a room of experienced practitioners.

The batch mutation / rate limit bypass is a high-value finding in production GraphQL APIs that is frequently missed because it requires understanding both the GraphQL protocol and the application's rate-limiting architecture. Emphasize it as a category that benefits meaningfully from structured AI-generated test cases.

Timing Guide

Introduction: 10 min
Core Content: 28 min
Discussion: 17 min
Wrap-up: 5 min

Session 7 of 8

Triage, False Positives, and Re-Verification

Disciplined use of AI in finding triage so reports reflect actual exploitable risk
● ~60 minutes

Learning Objectives

  • Apply a structured triage workflow to AI-surfaced findings that produces a reliable distinction between confirmed vulnerabilities, candidates requiring verification, and false positives
  • Identify the characteristics of AI-generated false positives across common vulnerability classes — what makes a scanner flag something incorrectly, and what manual verification step resolves it
  • Explain the professional and reputational consequences of reporting unverified AI-surfaced findings and articulate the standard of evidence required before a finding enters a report
  • Use AI assistance in the triage and verification phase itself — to analyze response patterns, compare behavior, and draft verification steps — without creating a circular dependency on AI to verify AI-generated findings

Session Overview

The productivity gains from AI-assisted testing are real — and so is the risk that those gains are partially illusory. An engagement that surfaces 300 AI-flagged candidate findings but cannot verify most of them within the engagement window is not more valuable than one that surfaces 40 manually confirmed findings. In fact, it is less valuable: the unverified findings create report noise, consume review time, and — worst case — lead to false findings reaching the final report, damaging the practitioner's professional credibility.

This session focuses on the triage phase of an engagement: the process of evaluating AI-surfaced candidates to determine which are confirmed vulnerabilities, which require more investigation, and which are false positives that should be discarded. The session also covers how AI can assist in the verification process itself — analyzing response patterns, comparing control vs. test behavior, and drafting precise verification steps — while maintaining the principle that a human must ultimately confirm every finding before it goes in a report.

Key Teaching Points

  • Build a three-tier triage system. Classify every AI-surfaced candidate as: Confirmed (manually verified, reproducible, root cause understood), Under Investigation (promising but not yet confirmed), or Closed (verified false positive or out of scope). Only Confirmed findings enter the report. Never let time pressure collapse the Under Investigation bucket directly into Confirmed.
  • Learn the false-positive fingerprints for your common tool set. Specific AI and scanner tools have characteristic false-positive patterns — certain response codes they misinterpret, certain error messages they confuse with successful exploitation, certain differential response patterns they incorrectly classify. Teach participants to build personal knowledge of their tools' known false-positive behaviors and to check for them first in triage.
  • Verification must be independent of the original test. Re-run the finding from a clean session state, with a different payload where applicable, and confirm that the observed behavior is reproducible and consistent with the claimed vulnerability. A single occurrence that cannot be reproduced is a candidate, not a confirmed finding.
  • Use AI to analyze verification evidence, not to generate it. Paste the request/response pair from your verification attempt into an LLM and ask it to confirm whether the response is consistent with successful exploitation, identify any alternative explanations for the observed behavior, and suggest additional verification steps if the evidence is ambiguous. This is using AI as an analytical assistant, not as an oracle.
  • Time-box the verification phase explicitly. At the start of an engagement, allocate specific time for verification that is protected from scope creep by additional discovery. If the verification backlog grows faster than it can be cleared, prioritize by potential impact — verify the critical and high candidates before medium and low.
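The three-tier system and the impact-first verification order can be sketched as a small data model. This is one possible encoding under stated assumptions, not a prescribed tool: the class and field names are hypothetical, and the rule enforced by confirm() is a simplified reading of "manually verified, reproducible, root cause understood".

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CONFIRMED = "Confirmed"
    UNDER_INVESTIGATION = "Under Investigation"
    CLOSED = "Closed"


SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Candidate:
    title: str
    severity: str
    tier: Tier = Tier.UNDER_INVESTIGATION
    manually_reproduced: bool = False
    root_cause: str = ""
    closed_reason: str = ""

    def confirm(self):
        """Promote to Confirmed only when the evidence requirements are met."""
        if not self.manually_reproduced or not self.root_cause.strip():
            raise ValueError(f"{self.title!r} lacks manual reproduction or root cause")
        self.tier = Tier.CONFIRMED

    def close(self, reason):
        """Mark as a verified false positive or out of scope."""
        self.tier = Tier.CLOSED
        self.closed_reason = reason


def verification_queue(candidates):
    """Open candidates, highest potential impact first."""
    pending = [c for c in candidates if c.tier is Tier.UNDER_INVESTIGATION]
    return sorted(pending, key=lambda c: SEVERITY_RANK.get(c.severity.lower(), len(SEVERITY_RANK)))


def report_findings(candidates):
    """Only Confirmed candidates are eligible for the report."""
    return [c for c in candidates if c.tier is Tier.CONFIRMED]
```

The design choice worth pointing out is that there is no code path from Under Investigation to the report except through confirm(), which mirrors the rule that time pressure must not collapse one bucket into the other.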

Discussion Prompts

  • An AI tool flags 12 findings as "high severity SQL injection" in your Burp history. On quick review, 8 of them look like database error messages that triggered on edge-case input, not actual injection. Walk through your triage and verification process for the remaining 4.
  • You are two days from the end of an engagement with 30 unverified medium-severity findings in your backlog. How do you decide what gets verified versus what gets dropped or downgraded?
  • A finding from the previous engagement was reported based on AI classification and later turned out to be a false positive. The client has already shared the report with their board. What are the professional obligations, and how could this have been prevented?
  • How do you distinguish between an AI-assisted triage process and using AI as a crutch to avoid the manual verification work? Where is the line?

Instructor Notes

This session covers professional accountability and can generate strong opinions. Create space for that conversation — the false-positive reporting scenario in the discussion prompts sometimes surfaces real experiences from participants that are instructive for the whole group. Handle such disclosures carefully and non-judgmentally; the goal is learning, not blame.

The three-tier triage system is intentionally simple. Resist participants' suggestions to add more tiers — complexity in triage systems tends to reduce adherence. Simple and consistently applied beats comprehensive and ignored.

The "verification must be independent of the original test" principle is load-bearing. If the model hallucinated a vulnerability based on a response pattern, asking the same model to verify the same response is likely to reproduce the same hallucination. Emphasize that verification must involve a different test, a different tool, or a different perspective, not just asking the AI again.
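If participants want a concrete rule of thumb, the independence principle can be written down as a simple check on how a verification attempt relates to the original observation. The sketch below is one hypothetical encoding: the Attempt fields and the "human"/"model" labels are assumptions made for illustration, and the rule itself (a human, a clean session, and a different payload or tool) is a simplification of the principle rather than a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    performed_by: str  # "human" or "model" (labels assumed for this sketch)
    tool: str
    session_id: str
    payload: str


def is_independent(original, attempt):
    """True if the attempt counts as independent verification of the original."""
    fresh_session = attempt.session_id != original.session_id
    different_test = attempt.payload != original.payload or attempt.tool != original.tool
    return attempt.performed_by == "human" and fresh_session and different_test
```

Re-submitting the same request/response pair to the same model fails every clause of this check, which is the point of the principle.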

Timing Guide

Introduction: 10 min
Core Content: 28 min
Discussion: 17 min
Wrap-up: 5 min

Session 8 of 8

Web Pentest Reporting in the AI Era

Producing clean, defensible reports — what you used AI for, what you confirmed manually, evidence trail
● ~60 minutes

Learning Objectives

  • Produce a pentest finding that clearly distinguishes AI-assisted discovery from manually confirmed exploitation, with an evidence trail that supports each claim
  • Apply AI assistance to report writing — executive summary drafting, finding write-up refinement, remediation recommendation generation — while maintaining human responsibility for accuracy
  • Navigate the emerging professional question of AI disclosure in pentest reports: when to disclose, what to disclose, and how to frame it
  • Build a personal reporting workflow that leverages AI efficiency gains without sacrificing the accuracy and defensibility that clients depend on

Session Overview

Pentest reports are professional documents with real consequences: they inform remediation decisions, drive security investment, and sometimes enter regulatory and legal proceedings. The standards for accuracy and defensibility in a pentest report must not be lowered because AI tooling has been introduced; if anything, they must be applied more rigorously, because AI-assisted engagements introduce new failure modes that a carefully structured report must account for.

This session covers two complementary topics: how to use AI to improve report quality and speed (executive summary drafting, finding language refinement, remediation research), and how to ensure report accuracy when AI has been part of the testing workflow (evidence trail requirements, AI tool disclosure, the distinction between AI-surfaced and manually confirmed findings). End with a practical discussion of where practitioners currently draw their personal lines and how professional standards in this area are likely to evolve.

Key Teaching Points

  • AI is most valuable for report drafting, not for finding accuracy. Use AI to draft finding write-ups from your notes, generate executive summary language from the confirmed finding list, research remediation best practices for specific vulnerability classes, and refine technical prose for non-technical audiences. None of these uses changes the underlying facts, which come from your verified findings; AI helps communicate them clearly, and the drafted text is still reviewed for accuracy before submission.
  • Every finding must have a human-confirmed evidence trail. Each finding in the report should cite: the specific request(s) that produced the vulnerability, the specific response(s) that confirmed exploitation, the test conditions (authentication state, parameter values, any required sequence), and the date and time of confirmation. AI-surfaced candidates that do not have this trail are not findings — they are hypotheses.
  • Establish a clear language convention for AI-assisted testing in reports. Develop a standard phrasing for methodological notes that distinguishes AI-assisted enumeration from manual discovery and confirms that all reported findings were manually verified. This protects the practitioner professionally and gives technically sophisticated clients appropriate transparency.
  • AI disclosure is an emerging professional standard, not a solved question. There is no industry consensus yet on how to disclose AI tool use in pentest reports. Present the range of current practices — no disclosure required (AI as a tool, like a scanner), brief methodology note, detailed appendix — and help participants develop a principled position rather than following one approach by default.
  • AI-drafted report language must be verified for accuracy. LLMs can generate plausible-sounding but technically incorrect remediation advice, inaccurate vulnerability descriptions, or overstated impact claims. Every AI-drafted sentence in a report must be reviewed by the practitioner for technical accuracy before submission. The practitioner is accountable for the report's content, regardless of what tool produced a draft.
  • The quality floor for AI-assisted reports must match the quality floor for manual reports. Clients cannot assess whether an AI-assisted engagement delivered the same rigor as a traditional one — they depend on the practitioner's professional standards. If AI tools are lowering the quality floor (fewer verified findings, more unresolved candidates, thinner evidence trails), that is a professional failure regardless of the efficiency gains.
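The evidence trail requirement above lends itself to a simple completeness check before a finding is allowed into the report. The sketch below is illustrative only: the Evidence structure and its field names are hypothetical, and it checks only that each element of the trail is present, not that the evidence actually demonstrates the vulnerability, which remains a human judgment.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Evidence:
    requests: list = field(default_factory=list)    # request(s) that produced the vulnerability
    responses: list = field(default_factory=list)   # response(s) that confirmed exploitation
    test_conditions: str = ""                       # auth state, parameter values, sequence
    confirmed_at: Optional[datetime] = None         # date and time of manual confirmation


def missing_evidence(ev):
    """Return the names of evidence-trail elements that are absent."""
    missing = []
    if not ev.requests:
        missing.append("requests")
    if not ev.responses:
        missing.append("responses")
    if not ev.test_conditions.strip():
        missing.append("test_conditions")
    if ev.confirmed_at is None:
        missing.append("confirmed_at")
    return missing


def is_reportable(ev):
    """A candidate without a complete trail is a hypothesis, not a finding."""
    return not missing_evidence(ev)


if __name__ == "__main__":
    partial = Evidence(requests=["GET /orders/1002 HTTP/1.1"])
    print(missing_evidence(partial))
    complete = Evidence(
        requests=["GET /orders/1002 HTTP/1.1"],
        responses=["HTTP/1.1 200 OK ..."],
        test_conditions="Authenticated as low-privilege user A; order 1002 belongs to user B",
        confirmed_at=datetime.now(timezone.utc),
    )
    print(is_reportable(complete))
```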

Discussion Prompts

  • Draft a one-paragraph methodology note for a pentest report that accurately describes an engagement where you used AI tools for content discovery, payload generation, and report drafting. How would you phrase the distinction between AI-assisted and manually confirmed?
  • A client asks, after receiving your report, "Did you use ChatGPT for this?" What is the most accurate and professional answer, and does your answer change depending on what you used it for?
  • You find an AI-drafted remediation recommendation in your report draft that is technically correct but does not address the root cause of the finding — it addresses only the symptom. How do you identify that problem and fix it before submission?
  • Looking back at this entire course, which AI-assisted technique do you expect to integrate into your workflow first, and what personal standard will you hold yourself to when using it?

Instructor Notes

The AI disclosure question generates genuine professional tension and often the most energetic discussion of the course. There are no wrong answers in the discussion — the goal is for participants to think through their position deliberately rather than adopting a default. Model a reasoned, non-dogmatic stance and encourage the same from participants.

The final discussion question — "which technique will you integrate first and what standard will you hold yourself to?" — is a deliberate closing ritual that prompts participants to commit to a specific, actionable change. Even if the answers are brief, the act of articulating a commitment publicly tends to increase follow-through. Give every participant a moment to answer if the group size permits.

Close by reaffirming that AI tools in pentesting are in a rapid development phase, and workflows that seem optimal today may be outdated in 18 months. The most durable thing participants can take from this course is not specific tool knowledge but the habit of critical evaluation: AI augments my judgment; my judgment does not defer to AI. That mindset will serve them regardless of how the tooling evolves.

Timing Guide

Introduction: 10 min
Core Content: 28 min
Discussion: 17 min
Wrap-up: 5 min