Module 3 · Lesson 1

What Is Attack-Surface Mapping?

From exposed ports to forgotten subdomains — understanding the full perimeter adversaries see before you do.

How do defenders and attackers define an organisation's "surface" — and why does AI change what's discoverable?

In May 2017, an Apache Struts vulnerability (CVE-2017-5638) sat unpatched on a single internet-facing Equifax web application. The organisation had over 35 business units, each managing its own certificate and hostname inventory. The security team tasked with patching the flaw scanned the wrong segment of the network — they simply did not have a reliable, complete map of their own attack surface. Attackers exploited the gap for 78 days, exfiltrating records on 147 million people before detection. The U.S. Government Accountability Office's 2018 post-mortem cited the failure to fully enumerate internet-accessible systems as a root cause.

Defining the Attack Surface

An organisation's attack surface is the complete set of points where an adversary can attempt to enter or extract data. It spans three overlapping domains: the digital surface (IP ranges, domains, APIs, cloud storage, code repositories), the human surface (employee credentials, social profiles, phishing vectors), and the physical surface (physical locations, RFID, badge-access systems).

This module focuses on the digital surface — specifically, the challenge of enumerating it comprehensively when modern organisations span hundreds of cloud accounts, SaaS tenants, acquired subsidiaries, and developer sandboxes that were never formally catalogued.

Attack Surface: The sum of all digital entry points, exposed services, and data-exposure vectors that an adversary could use to compromise an organisation.

Shadow IT: Systems, services, and cloud resources deployed by business units without formal IT or security approval, and routinely absent from official asset inventories.

External Attack Surface Management (EASM): The continuous process of discovering, classifying, and monitoring all internet-facing assets an organisation owns or is associated with.

Why Enumeration Fails at Scale

Traditional asset inventory relied on manual spreadsheets, CMDB entries, and periodic vulnerability scans. Three structural problems make this unworkable at modern scale:

Acquisition drift: Mergers bring in legacy infrastructure with unknown hostnames. When Marriott acquired Starwood Hotels in 2016, it inherited a compromised reservation system that had been breached since 2014 — a system that did not appear in Marriott's own asset inventory until years into the incident investigation.

Cloud sprawl: AWS alone reports that enterprise customers average over 1,000 active cloud accounts. Every S3 bucket, Lambda function endpoint, and API Gateway stage is a potential attack-surface entry point. The 2019 Capital One breach originated in a misconfigured web application firewall running on AWS, an asset that post-incident analysis confirmed was not in the company's formal security review queue.

Developer velocity: Continuous deployment pipelines create and destroy hosts, containers, and API endpoints faster than weekly scan cycles can track. PortSwigger's 2022 research found that the average Fortune 500 company has over 500 active subdomains at any given time, with 15–20% unknown to the security team.

Why This Matters for AI-Augmented OSINT

AI tools change attack-surface mapping in two directions simultaneously. Defenders use AI to correlate passive DNS records, certificate transparency logs, job postings, and GitHub commits to discover assets faster than manual analysis allows. Adversaries use the same data sources and similar tooling. The asymmetry that once favoured patient attackers with time to manually enumerate is narrowing — but only for defenders who actually deploy these methods.

The Three Mapping Objectives

Effective attack-surface mapping pursues three ordered objectives:

Discovery: Find all assets associated with an organisation — including those not in any official registry. This uses passive DNS, certificate transparency, autonomous system (AS) data, WHOIS records, and code repository scanning.
Classification: Label each asset by type, owner, sensitivity, and exposure level. An S3 bucket with public-read ACL is categorically more urgent than an internal wiki on a VPN-gated host.
Continuous monitoring: Detect when the surface changes — new subdomains registered, certificates issued, S3 buckets created, code pushed to public repos containing credentials. Change is when risk spikes; static inventories miss it entirely.
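The monitoring objective comes down to comparing successive inventory snapshots. The following is a minimal sketch, assuming each snapshot is simply a set of hostnames; the function name diff_snapshots and the sample hosts are illustrative and not taken from any particular EASM product.

```python
def diff_snapshots(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Return assets that appeared or disappeared between two inventory snapshots."""
    return {
        "added": sorted(current - previous),    # new exposure to review
        "removed": sorted(previous - current),  # decommissioned or renamed assets
    }


# Hypothetical snapshots taken a day apart.
yesterday = {"www.example.com", "api.example.com"}
today = {"www.example.com", "api.example.com", "staging-api.example.com"}
changes = diff_snapshots(yesterday, today)
```

In practice the "added" list is what feeds alerting, since newly appeared assets are the ones no one has classified yet.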

Real Metric — Certificate Transparency

Google's Certificate Transparency (CT) project logs every TLS certificate issued by participating Certificate Authorities — approximately 10 billion certificates as of 2024. Tools like crt.sh and Censys index this log in near real-time. A subdomain created at 9 AM and issued a Let's Encrypt certificate is searchable by an adversary — or a defender — within minutes. AI-assisted monitoring pipelines can alert on new certificate issuances for an organisation's domains within seconds of CT log publication.

Scope and Legal Boundaries

Attack-surface mapping, even using purely passive OSINT techniques, operates within legal and ethical constraints that differ by jurisdiction and context. Computer-misuse laws such as the U.S. Computer Fraud and Abuse Act (CFAA), the UK Computer Misuse Act, and the EU's Directive 2013/40/EU on attacks against information systems turn on unauthorised access, which is why the line between passively observing publicly available data and actively probing systems matters. This module focuses exclusively on passive enumeration: techniques that observe data already published to the internet without sending crafted packets to target systems.

For authorised penetration testing or red-team engagements, active techniques (port scanning, banner grabbing, vulnerability probing) require explicit written scope agreements. The passive baseline covered here is the prerequisite phase that precedes any active testing.

Lesson 1 Quiz

Attack-Surface Mapping Foundations · 4 questions

In the Equifax 2017 breach, what specific failure in attack-surface management allowed the vulnerability to go unpatched for 78 days?

Correct. The GAO's post-mortem explicitly cited failure to fully enumerate internet-accessible systems. The patch existed; the problem was not knowing which hosts needed it.

Not quite. The patch was available. The root cause was an incomplete asset inventory that led scanners to miss the vulnerable host entirely.

Which data source allows an adversary to discover a new subdomain within minutes of its TLS certificate being issued?

Correct. Certificate Transparency logs are public, near-real-time, and indexed by tools like crt.sh within minutes of certificate issuance — making new subdomains immediately discoverable.

Not quite. While Shodan and passive DNS are useful, CT logs are specifically the source that captures new subdomains within minutes of certificate issuance.

The Marriott-Starwood breach illustrates which structural attack-surface mapping problem?

Correct. Starwood's compromised reservation system predated the Marriott acquisition and was not incorporated into Marriott's security monitoring, exemplifying acquisition-drift risk.

Not quite. The Marriott case is the canonical example of acquisition drift — systems inherited through M&A that don't appear in the acquiring company's security inventory.

Which of the three mapping objectives specifically addresses the risk that a new S3 bucket with public-read ACL is created at 3 AM without anyone noticing?

Correct. Continuous monitoring detects surface changes — new assets, new exposures — as they happen. Static discovery and classification only capture the state at a point in time.

Not quite. Discovery finds assets; classification labels them. But detecting a new exposure the moment it appears requires continuous monitoring — the third objective.

Lab 1: Scoping an Attack Surface

AI-assisted exercise · Define the enumeration scope for a target organisation

Objective

You are preparing an authorised external attack-surface assessment for a mid-size financial services firm, Meridian Capital Group (fictional stand-in for practice). Before running any tools, you need to define the scope: what domains, ASNs, IP ranges, and subsidiary brands should be included.

Use the AI assistant to work through the scoping methodology. Ask about what data sources to consult, how to find subsidiaries, what legal and ethical boundaries apply, and how to structure the final scope document.

Starter prompt: "I need to scope an external attack-surface assessment for a financial services firm. What are the first three data sources I should consult to discover all domains associated with the organisation, and why?"

AESOP Lab Assistant

Attack Surface Scoping

Welcome to Lab 1. We're working on scoping an external attack-surface assessment — the critical prerequisite before any tool is run. Ask me about data sources for domain discovery, subsidiary identification, AS number lookups, legal scope constraints, or how to structure a scope document. What would you like to start with?

Module 3 · Lesson 2

Passive DNS, Certificates & Subdomain Enumeration

Reading the internet's public memory to reconstruct an organisation's hidden digital footprint.

What does an organisation's DNS history reveal — and how do AI-assisted tools turn billions of passive records into actionable asset lists?

In 2016, researcher Frans Rosén discovered that Uber's attack surface extended far beyond their primary domain. By querying Certificate Transparency logs and passive DNS databases, he found that Uber operated over 300 active subdomains, many pointing to staging environments, internal tools, and acquired startup infrastructure. Several of these hosted vulnerable versions of web frameworks. Rosén disclosed the findings through HackerOne. Uber's own internal asset inventory had catalogued fewer than half of these hosts. His methodology — CT logs first, passive DNS correlation second — became a template adopted by the bug-bounty community and later formalised into tools like Subfinder and Amass.

Passive DNS: What It Is and Why It Matters

Passive DNS is the historical record of DNS queries and responses collected by sensors placed at resolvers, IXPs, and DNS infrastructure operators. Unlike active DNS querying (which asks "what does this domain resolve to right now?"), passive DNS answers "what has this domain resolved to, when, and what other domains resolved to the same IP?"

Commercial providers — Farsight DNSDB, RiskIQ PassiveTotal (now Microsoft Defender Threat Intelligence), VirusTotal, and SecurityTrails — collect and index billions of passive DNS records. Free-tier access is available for researchers. The data is legally collected from consenting resolvers; it does not require querying target infrastructure.
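The "what else resolved to the same IP?" question can be sketched as a simple pivot over passive DNS records. This is only an illustration: the PdnsRecord fields below are a hypothetical, simplified record shape and not the schema of any provider named above, and the addresses come from documentation ranges.

```python
from collections import defaultdict
from typing import NamedTuple


class PdnsRecord(NamedTuple):
    rrname: str      # hostname that was queried
    rdata: str       # IP address it resolved to
    first_seen: str  # ISO date of first observation
    last_seen: str   # ISO date of last observation


def cohosted_domains(records: list[PdnsRecord], hostname: str) -> set[str]:
    """Find other hostnames that ever resolved to an IP the given hostname used."""
    by_ip: dict[str, set[str]] = defaultdict(set)
    for rec in records:
        by_ip[rec.rdata].add(rec.rrname)
    ips = {rec.rdata for rec in records if rec.rrname == hostname}
    related: set[str] = set().union(*(by_ip[ip] for ip in ips))
    related.discard(hostname)
    return related


records = [
    PdnsRecord("www.example.com", "192.0.2.10", "2023-01-01", "2024-01-01"),
    PdnsRecord("old-portal.example.com", "192.0.2.10", "2022-03-01", "2022-09-01"),
    PdnsRecord("mail.example.com", "198.51.100.7", "2021-06-01", "2024-01-01"),
]
```

Shared IPs are weak evidence on their own (CDNs and shared hosting put unrelated domains on one address), so results like these are leads to verify rather than confirmed assets.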

What You Can Find

Historical A/AAAA records, MX records revealing mail providers, NS records showing DNS hosting history, CNAME chains pointing to third-party services, SPF/TXT records disclosing cloud service usage.

Why History Matters

A subdomain whose backing service was decommissioned six months ago may still have a DNS record pointing at a resource that anyone can now claim (subdomain takeover). Historical IPs reveal hosting providers. Old MX records expose deprecated email infrastructure.

AI Application

LLM-assisted analysis can cluster subdomain naming patterns (dev-, stg-, api-, v2-) to predict undiscovered hosts, and flag anomalies in CNAME chains that indicate third-party services eligible for takeover.

Certificate Transparency Deep Dive

Every certificate logged in CT contains the Subject Alternative Names (SANs) field, which lists all hostnames the certificate covers. Wildcard certificates (*.example.com) are less informative, but most organisations mix wildcards with specific SAN entries that reveal exact subdomain names. The open-source crt.sh project (operated by Sectigo) provides free querying of the full CT log corpus.
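To make the SAN idea concrete, here is a hedged sketch of turning a crt.sh JSON response into a hostname list. It assumes each JSON entry carries a name_value field holding newline-separated names, which matches what crt.sh has been observed to return but is not a formally documented contract; the helper names are illustrative. The code only builds the URL and parses a response body, so fetching is left to the reader.

```python
import json
from urllib.parse import urlencode


def crtsh_url(domain: str) -> str:
    """Build a crt.sh JSON query URL for all names under a domain."""
    return "https://crt.sh/?" + urlencode({"q": f"%.{domain}", "output": "json"})


def hostnames_from_crtsh(payload: str, domain: str) -> set[str]:
    """Extract unique in-scope hostnames from a crt.sh JSON response body."""
    names: set[str] = set()
    for entry in json.loads(payload):
        # Assumption: name_value holds one or more names separated by newlines.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # wildcard entry: keep only the parent name
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return names


# Hypothetical response body with two certificate entries.
sample = json.dumps([
    {"name_value": "api.example.com\n*.dev.example.com"},
    {"name_value": "API.example.com\nother.test"},
])
```

Filtering on the domain suffix matters because a single certificate can list names belonging to unrelated organisations.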

A 2023 study by Censys found that 62% of Fortune 1000 companies had at least one subdomain discoverable exclusively through CT logs — meaning passive DNS records had not yet propagated to commercial aggregators, but CT had captured the certificate within 60 seconds of issuance.

Subdomain Takeover — A Real Consequence

When an organisation deletes a service (e.g., removes a Heroku app or an Azure Static Web App) but leaves the CNAME DNS record pointing to the now-deleted endpoint, an adversary can register the same endpoint name on that platform and serve content under the organisation's subdomain. In 2019, researcher Patrik Hudak documented over 2,000 Fortune 500 subdomains vulnerable to takeover using this technique — discovered entirely through CT logs and passive DNS, with no active probing. Microsoft, Shopify, and Airbnb all had affected subdomains disclosed through bug-bounty programmes.
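A first-pass filter for dangling CNAMEs might look like the sketch below. It assumes you already hold a mapping of subdomain to CNAME target and a set of targets known to resolve, both gathered elsewhere; the suffix list is a small illustrative sample. It is a simplification: some platforms keep resolving even when a name is unclaimed, so a flagged entry is a candidate for manual verification, not a confirmed takeover.

```python
# Illustrative sample of third-party platform suffixes, not an exhaustive list.
TAKEOVER_SUFFIXES = (".azurewebsites.net", ".herokuapp.com", ".github.io")


def takeover_candidates(cnames: dict[str, str], resolvable: set[str]) -> list[str]:
    """Return subdomains whose CNAME targets a third-party platform name that no longer resolves."""
    flagged = []
    for subdomain, target in cnames.items():
        target = target.rstrip(".").lower()  # normalise trailing dot and case
        if target.endswith(TAKEOVER_SUFFIXES) and target not in resolvable:
            flagged.append(subdomain)
    return sorted(flagged)


# Hypothetical inputs.
cnames = {
    "legacy.example.com": "legacy-xyz.azurewebsites.net.",
    "shop.example.com": "shop.herokuapp.com",
    "www.example.com": "cdn.example.net",
}
resolvable = {"shop.herokuapp.com", "cdn.example.net"}
```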

Tool Ecosystem for Subdomain Enumeration

The open-source tooling ecosystem has matured considerably. Modern pipelines typically combine multiple sources to maximise coverage:

Amass (OWASP): Multi-source subdomain enumeration: CT logs, passive DNS APIs, web archives, DNS brute-force (active mode). Produces asset graphs. Industry standard for authorised assessments.

Subfinder: Passive-only subdomain discovery using 40+ data sources including SecurityTrails, Shodan, VirusTotal, and crt.sh. Fast, minimal footprint, no active DNS queries.

crt.sh: Free web and API interface to the full Certificate Transparency log corpus. Query: https://crt.sh/?q=%25.example.com&output=json

SecurityTrails: Commercial passive DNS + WHOIS history. Free tier allows limited queries. Particularly strong on historical IP-to-domain mappings.

dnsx: Fast DNS resolver for validating discovered subdomains. Resolves lists at high throughput to confirm which hosts are live vs. stale records.

AI-Assisted Pattern Expansion

Raw subdomain lists from CT logs and passive DNS contain noise: expired certificates, honeypot entries, and staging hosts that were live only briefly. AI-assisted analysis adds value at two points:

Pattern prediction: Given a known set of subdomains (api.example.com, api-v2.example.com, api-staging.example.com), an LLM can generate a structured prediction of likely undiscovered hosts (api-dev, api-int, api-uat, api-prod) that can then be validated with targeted DNS resolution. A 2022 paper from NCC Group described using GPT-3 to generate subdomain wordlists from existing enumeration output, improving discovery coverage by 23% on a test corpus of 50 organisations.

Anomaly flagging: CNAME chains pointing to cloud services (e.g., .azurewebsites.net, .herokuapp.com, .github.io) are automatically flagged as potential takeover candidates. An AI pipeline can cross-reference the target of each CNAME against a database of known "dangling" service patterns to prioritise which require immediate validation.
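The pattern-prediction step can be approximated without a model. The sketch below is a deterministic stand-in for the LLM described above, not a reproduction of any published method: it takes the leading token of each known label and combines it with an illustrative list of environment names. Every candidate it proposes is a guess that still has to be validated by DNS resolution, which is an active step requiring authorisation.

```python
import re

# Illustrative environment tokens; an LLM or a larger wordlist would supply more.
ENV_TOKENS = ["dev", "int", "uat", "staging", "prod"]


def expand_candidates(known: set[str], domain: str) -> set[str]:
    """Propose unseen '<base>-<env>' hostnames from bases observed in known subdomains."""
    suffix = "." + domain
    labels = {h[: -len(suffix)] for h in known if h.endswith(suffix)}
    bases = {re.split(r"[-.]", label)[0] for label in labels}
    candidates = {f"{base}-{env}{suffix}" for base in bases for env in ENV_TOKENS}
    return candidates - known  # only names not already discovered


known = {"api.example.com", "api-v2.example.com", "api-staging.example.com"}
guesses = expand_candidates(known, "example.com")
```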

Practical Pipeline — Passive-Only Subdomain Enumeration

1. Query crt.sh for all certificates issued to *.target.com and target.com (JSON API).
2. Query Subfinder against the same domain using passive sources only.
3. Merge and deduplicate the two lists.
4. Run dnsx to resolve all entries and discard NXDOMAIN results.
5. Feed live results into an LLM to identify CNAME takeover candidates and generate pattern-based expansion wordlists.
6. Validate expansion wordlists with another dnsx pass.

Active footprint: steps 1 to 3 query only third-party datasets and never touch the target. Steps 4 and 6 issue ordinary DNS lookups, which can reach the target's authoritative name servers, and step 6 resolves guessed names, so confirm that DNS resolution is within your authorised scope before running them.
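The merge-and-deduplicate step is easy to get subtly wrong, because different sources report the same host with different casing, trailing dots, or wildcard prefixes. A minimal sketch, with the function name merge_hostnames chosen for illustration:

```python
def merge_hostnames(*sources: list[str]) -> list[str]:
    """Merge hostname lists, normalising case, trailing dots, and wildcard prefixes."""
    merged: set[str] = set()
    for source in sources:
        for raw in source:
            name = raw.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]  # treat a wildcard as its parent name
            if name:
                merged.add(name)
    return sorted(merged)


# Hypothetical outputs from two enumeration sources.
from_ct = ["API.example.com", "*.dev.example.com"]
from_passive = ["api.example.com.", " www.example.com "]
combined = merge_hostnames(from_ct, from_passive)
```

Normalising before deduplicating keeps the later resolution pass from querying the same host several times under different spellings.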

Lesson 2 Quiz

Passive DNS, CT Logs & Subdomain Enumeration · 4 questions

What specific field in a TLS certificate makes Certificate Transparency logs particularly useful for subdomain enumeration?

Correct. SANs list every hostname the certificate is valid for. Even a single certificate for a staging environment reveals subdomain names that may not appear in passive DNS records yet.

Not quite. The SAN field is what makes CT logs so valuable for enumeration — it explicitly lists all hostnames, including internal and staging subdomains, the certificate covers.

A subdomain "legacy-portal.example.com" has a CNAME record pointing to "legacy-portal-xyz.azurewebsites.net", but that Azure App Service app was deleted 6 months ago. What vulnerability does this create?

Correct. Dangling CNAME records pointing to deleted cloud service endpoints allow attackers to register that endpoint name and serve arbitrary content under the victim organisation's subdomain — including credential-harvesting pages.

Not quite. This is a subdomain takeover scenario. The CNAME record still exists, but it points to an endpoint name that anyone can claim, so an attacker can register it and control what's served at that subdomain.

What distinguishes passive DNS from active DNS querying for OSINT purposes?

Correct. Passive DNS is collected from consenting resolvers and retrospectively indexed — you query the passive DNS database, not the target. This leaves no footprint on target infrastructure and reveals historical records that active querying cannot.

Not quite. The key distinction is that passive DNS queries a third-party database of historically observed records — zero interaction with the target's own DNS infrastructure.

According to a 2022 NCC Group paper, what was the approximate improvement in subdomain discovery coverage achieved by using GPT-3 to generate pattern-based wordlists from existing enumeration output?

Correct. The NCC Group paper reported ~23% improvement in discovery coverage on a 50-organisation test corpus when LLM-generated pattern expansion was added to the standard passive enumeration pipeline.

Not quite. The figure cited was approximately 23% — meaningful but not dramatic. AI-assisted pattern expansion is a supplement to, not a replacement for, comprehensive passive enumeration.

Lab 2: Subdomain Enumeration Pipeline

AI-assisted exercise · Design a passive subdomain discovery workflow

Objective

You have been given a target domain: meridianbank.example (fictional). You need to design and justify a complete passive subdomain enumeration pipeline — from CT log queries through AI-assisted pattern expansion — and identify which discovered assets should be prioritised for takeover-vulnerability checks.

Ask the AI assistant to walk you through tool selection, source prioritisation, CNAME takeover identification, and how to structure your findings. Challenge the AI with edge cases like wildcard certificates and historical DNS anomalies.

Starter prompt: "I've run Subfinder and crt.sh against meridianbank.example and found 47 subdomains. Six of them have CNAME records pointing to Heroku, Azure, and GitHub Pages. Walk me through how to determine which are vulnerable to takeover."

AESOP Lab Assistant

Subdomain Enumeration

Lab 2 active. We're working through a passive subdomain enumeration pipeline and CNAME takeover analysis. I can walk you through the takeover identification methodology, tool command syntax, how to prioritise your 47 results, or how to use AI-assisted pattern expansion to find additional hosts. What would you like to tackle first?

Module 3 · Lesson 3

Internet-Wide Scanning: Shodan, Censys & BinaryEdge

When the internet scans itself — indexed exposure data and what AI makes visible in it.

What does a Shodan query reveal that a traditional vulnerability scanner misses — and how do AI-assisted facet searches change what defenders can see at scale?

Prior to the December 2020 public disclosure of the SolarWinds SUNBURST attack, independent researchers examining Shodan and Censys data found that over 18,000 SolarWinds Orion instances were directly internet-accessible — many exposing administrative interfaces on default ports. Post-breach analysis by the Atlantic Council found that organisations with internet-exposed Orion management interfaces had significantly higher lateral movement risk once the trojanised update was installed. The internet-wide scan data was publicly available; what was missing was a systematic, AI-assisted mechanism to correlate Shodan results with an organisation's known asset inventory and flag the exposure proactively. Several vendors subsequently built automated Shodan-correlation features into their EASM platforms as a direct response.

How Internet-Wide Scanners Work

Services like Shodan, Censys, and BinaryEdge operate fleets of scanning nodes that continuously probe the entire IPv4 address space (and significant portions of IPv6) on common and uncommon ports. They collect banners — the raw response data from each service — and index it in searchable databases. The resulting corpus is a snapshot of what every publicly reachable host on the internet was serving, queryable without touching target infrastructure.

Censys was launched in 2015 as an academic project at the University of Michigan, built on the ZMap internet-wide scanner developed there. Shodan has been running since 2009 and indexes over 1.5 billion devices. BinaryEdge focuses on SSL/TLS certificate data and is particularly useful for tracking certificate chains. All three offer free-tier API access for security researchers.

Banner Grabbing: Collecting the initial response string from a network service, which typically reveals software name, version, and sometimes configuration details.

Facet Search: A Shodan/Censys query technique that aggregates results by field (product, OS, ASN, organisation) to produce statistical breakdowns of exposure at scale.

ASN Pivoting: Using an organisation's Autonomous System Number to retrieve all IP ranges they advertise, then querying internet-wide scan data for all hosts within those ranges.
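Once the prefixes announced by an ASN are known, the pivot itself is a containment check. The sketch below uses Python's standard ipaddress module; the prefixes and host addresses are documentation ranges standing in for real data, and the function name is illustrative.

```python
import ipaddress


def hosts_in_prefixes(hosts: list[str], prefixes: list[str]) -> list[str]:
    """Keep only the IPs that fall inside any of the organisation's announced prefixes."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [
        h for h in hosts
        if any(ipaddress.ip_address(h) in net for net in networks)
    ]


# Hypothetical prefixes announced by the organisation's ASN.
prefixes = ["192.0.2.0/24", "2001:db8::/32"]
hosts = ["192.0.2.15", "198.51.100.4", "2001:db8::1"]
in_scope = hosts_in_prefixes(hosts, prefixes)
```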

Shodan Query Fundamentals

Shodan's query syntax uses field filters applied to its banner index. Effective attack-surface mapping uses several categories of queries:

org: / ssl.cert.subject.cn:
Retrieve all hosts associated with an organisation name or a certificate common name (CN). Finds cloud instances, CDN origins, and third-party-hosted assets.

asn:ASN_NUMBER
All hosts within a specific Autonomous System. Essential for organisations with legacy on-premises IP space registered under their name.

product:"SolarWinds" port:8787
Find specific software versions exposed on specific ports. Combines product-name banner data with port filters to isolate vulnerable software at scale.

http.title:"AdminLTE"
Match web application titles to find default-credential interfaces, admin panels, and unprotected dashboards.

vuln:CVE-XXXX-YYYY
Shodan Vuln filter (paid tier) maps banner data against CVE databases to surface hosts likely running vulnerable software versions.
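When queries are generated programmatically, a small helper avoids quoting and spacing mistakes. This is a sketch of plain string composition only, written on the understanding that Shodan filters take the form filter:value with no space after the colon; it does not call the Shodan API, and the helper name is hypothetical.

```python
def shodan_query(filters: dict[str, object]) -> str:
    """Compose a Shodan query string from filter/value pairs."""
    parts = []
    for name, value in filters.items():
        text = str(value)
        if " " in text:
            text = f'"{text}"'  # values containing spaces must be quoted
        parts.append(f"{name}:{text}")
    return " ".join(parts)


# Illustrative queries; the organisation name is fictional.
q_product = shodan_query({"product": "SolarWinds", "port": 8787})
q_org = shodan_query({"org": "Meridian Capital Group"})
q_title = shodan_query({"http.title": "AdminLTE"})
```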

Censys and the Certificate Graph

Censys's data model is structured differently from Shodan's. Rather than a banner-centric index, Censys organises data around hosts, certificates, and domains as first-class entities, with explicit relationships between them. This makes Censys particularly powerful for attack-surface mapping tasks that require cross-referencing: find all certificates issued to *.example.com, then find all IP addresses currently serving those certificates, then pivot to find other domains hosted on those IPs that might belong to the same organisation.

In 2021, Censys published a case study showing that their certificate-graph approach discovered an average of 40% more internet-facing assets per organisation compared to using subdomain enumeration alone — particularly for organisations with complex cloud footprints where different subsidiaries used different domain names but shared TLS certificates or infrastructure.
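The certificate-to-host-to-domain pivot can be sketched with plain dictionaries standing in for the three entity types. This is not the Censys API or its schema; the mappings, fingerprints, and hostnames below are hypothetical, and the point is only to show the shape of the traversal.

```python
def pivot_from_certificate(
    cert_names: dict[str, set[str]],  # certificate fingerprint -> names on the certificate
    cert_hosts: dict[str, set[str]],  # certificate fingerprint -> IPs serving it
    host_names: dict[str, set[str]],  # IP -> hostnames observed on that IP
    domain: str,
) -> set[str]:
    """Certificates for the domain -> IPs serving them -> other hostnames on those IPs."""
    suffix = "." + domain
    certs = {
        fp for fp, names in cert_names.items()
        if any(n == domain or n.endswith(suffix) for n in names)
    }
    ips = set().union(*(cert_hosts.get(fp, set()) for fp in certs))
    found = set().union(*(host_names.get(ip, set()) for ip in ips))
    # Report only names outside the starting domain: these are the new leads.
    return {n for n in found if n != domain and not n.endswith(suffix)}


cert_names = {"fp1": {"*.example.com"}, "fp2": {"unrelated.test"}}
cert_hosts = {"fp1": {"192.0.2.10"}, "fp2": {"203.0.113.5"}}
host_names = {
    "192.0.2.10": {"www.example.com", "portal.subsidiary.example.net"},
    "203.0.113.5": {"unrelated.test"},
}
leads = pivot_from_certificate(cert_names, cert_hosts, host_names, "example.com")
```

Names surfaced this way may belong to another tenant on shared infrastructure, so ownership still needs to be confirmed before they are added to scope.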

Real Exposure — Default Credentials on Internet-Facing Panels

In 2020, security researcher Bob Diachenko used Shodan queries to discover 23,000 MongoDB instances openly accessible on the internet with no authentication, containing a combined estimated 4TB of data. The methodology was a single Shodan query filtering for MongoDB on port 27017 with no authentication required. AI-assisted triage of results — classifying databases by likely industry based on database and collection names visible in banner data — took minutes where manual review would have taken days. Diachenko's responsible-disclosure workflow relied on automated organisation-ownership attribution to notify affected parties.

AI-Assisted Triage of Scan Results

Internet-wide scan queries against a large organisation return hundreds or thousands of results. Manual triage at this scale is impractical. AI integration adds value at three points:

Ownership attribution: Banner data, WHOIS records, and SSL certificate organisation fields can be ambiguous for acquired subsidiaries or white-label services. An LLM can cross-reference multiple signals to probabilistically assign each host to a business unit or subsidiary, flagging low-confidence attributions for manual review.

Severity ranking: Given a list of exposed services, an LLM can apply contextual knowledge (CVE severity, default-credential likelihood, data sensitivity indicators visible in banner data) to produce a prioritised list — focusing analyst attention on the highest-risk exposures first.

Context enrichment: For each identified host, AI can automatically enrich findings with context from job postings (e.g., the organisation advertised for "SolarWinds Orion administrators"), LinkedIn technology indicators, and press releases announcing new systems — corroborating what the scan data shows.
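Before any model is involved, a rule-based baseline makes the ranking idea concrete and gives the LLM output something to be compared against. The weights and field names below are illustrative assumptions, not a recognised scoring standard, and the facet helper only imitates the spirit of a facet breakdown over results you have already exported.

```python
from collections import Counter

# Illustrative weights; a real programme would derive these from its own risk model.
SERVICE_RISK = {"modbus": 5, "rdp": 4, "elasticsearch": 4, "jenkins": 3, "http": 1}


def triage(findings: list[dict]) -> list[dict]:
    """Sort findings so the highest-scoring exposures come first."""
    def score(finding: dict) -> int:
        base = SERVICE_RISK.get(finding["service"], 2)  # unknown services get a middle score
        return base + (2 if finding.get("unauthenticated") else 0)
    return sorted(findings, key=score, reverse=True)


def facet(findings: list[dict], field: str) -> Counter:
    """Count findings per value of one field."""
    return Counter(f[field] for f in findings)


# Hypothetical exported findings.
findings = [
    {"ip": "192.0.2.1", "service": "http"},
    {"ip": "192.0.2.2", "service": "elasticsearch", "unauthenticated": True},
    {"ip": "192.0.2.3", "service": "rdp"},
]
ranked = triage(findings)
```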

BinaryEdge and Historical Exposure Tracking

BinaryEdge stores historical scan data going back several years, allowing analysts to ask "when did this port first appear open on this host?" and "what services were running on this IP range before the organisation migrated to the cloud?" This timeline data is particularly valuable in breach investigations and post-merger assessments. The platform's API is accessible to researchers at free tier and is integrated into several EASM platforms including Detectify and CyCognito.
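The "when did this port first appear open?" question is a minimum over dated observations. The sketch below assumes historical scan data has already been reduced to (date, ip, port) tuples; that shape is hypothetical and is not BinaryEdge's API response format.

```python
from datetime import date


def first_seen_open(
    observations: list[tuple[date, str, int]],
) -> dict[tuple[str, int], date]:
    """Map each (ip, port) pair to the earliest date it was observed open."""
    first: dict[tuple[str, int], date] = {}
    for seen_on, ip, port in observations:
        key = (ip, port)
        if key not in first or seen_on < first[key]:
            first[key] = seen_on
    return first


# Hypothetical historical observations, in no particular order.
observations = [
    (date(2022, 5, 1), "192.0.2.9", 3389),
    (date(2021, 11, 3), "192.0.2.9", 3389),
    (date(2023, 1, 1), "192.0.2.9", 443),
]
timeline = first_seen_open(observations)
```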

Lesson 3 Quiz

Internet-Wide Scanning & AI Triage · 4 questions

In the SolarWinds SUNBURST case, what pre-existing exposure condition — visible in Shodan/Censys data — amplified the risk of the trojanised update?

Correct. Atlantic Council's analysis found that organisations with internet-exposed Orion management interfaces faced significantly higher lateral movement risk — a condition that was visible in public scan data before the breach was disclosed.

Not quite. The specific risk amplifier was direct internet exposure of Orion administrative interfaces on over 18,000 instances — visible in Shodan/Censys, but not systematically correlated with asset inventories.

What makes Censys's data model particularly suited for complex multi-subsidiary attack-surface mapping compared to Shodan?

Correct. Censys's entity-relationship model allows analysts to find all IPs serving a given certificate, then pivot to discover other domains on those IPs — critical for organisations where subsidiaries share infrastructure but use different domain names.

Not quite. The key differentiator is Censys's data model — treating hosts, certificates, and domains as related entities enables graph-style pivoting that Shodan's banner-centric model doesn't natively support.

Which Shodan query technique would be most appropriate for mapping ALL internet-facing assets owned by an organisation that has a registered Autonomous System Number?

Correct. ASN-based queries retrieve all hosts within the organisation's own IP space. This is the most comprehensive method for on-premises and co-located infrastructure that the organisation directly controls and advertises via BGP.

Not quite. For assets within an organisation's own IP space, ASN pivoting is the most comprehensive approach. Other query types are complementary but may miss IPs registered to the AS that don't carry the org's name in banners.

Bob Diachenko's 2020 discovery of 23,000 open MongoDB instances demonstrates AI's specific value in what part of the attack-surface mapping workflow?

Correct. The Shodan query itself was simple; the scale challenge was triaging 23,000 results. AI-assisted classification — inferring likely industry from database/collection names — enabled rapid prioritisation for responsible disclosure.

Not quite. AI's contribution here was triage — classifying thousands of results by sensitivity to prioritise responsible disclosure. The discovery query was straightforward; scale made manual triage impractical.

Lab 3: Shodan Query Design & AI Triage

AI-assisted exercise · Build effective queries and triage scan results at scale

Objective

You're mapping the external attack surface of Meridian Capital Group. You have their ASN (AS64501, fictional) and primary domain. Your Shodan export has returned 340 hosts across 12 service types. You need to build additional targeted queries, triage the results, and produce a prioritised exposure report.

Work with the AI to design queries for specific exposure categories (default admin panels, legacy management protocols, cloud storage), understand how to structure a triage methodology, and identify which of your hypothetical 340 results to report first.

Starter prompt: "I have 340 Shodan results for AS64501. I can see Elasticsearch on port 9200, several RDP exposures, what looks like a Jenkins CI server, and some devices responding on port 502 (Modbus). Help me triage — which category do I look at first and why?"

AESOP Lab Assistant

Shodan Triage & Query Design

Lab 3 ready. You've got a classic mix of exposure categories in those 340 results — some very high priority, some context-dependent. I can help you triage by risk tier, explain why certain services (Modbus on 502 is particularly interesting for a financial firm) warrant immediate attention, design additional Shodan queries to dig deeper, or structure a prioritised findings report. Where do you want to start?

Module 3 · Lesson 4

Cloud Exposure, Code Repository Leaks & Continuous Monitoring

Where secrets live in public — S3 buckets, GitHub commits, and the infrastructure that watches the watchers.

How do organisations build AI-assisted pipelines that detect new attack-surface exposure in real time — before adversaries find it first?

On July 29, 2019, Capital One disclosed a breach affecting 106 million customers in the US and Canada, which it had confirmed internally on July 19. The attacker, Paige Thompson (a former AWS engineer), exploited a misconfigured Web Application Firewall deployed on an EC2 instance to perform a Server-Side Request Forgery (SSRF) attack against the AWS Instance Metadata Service. The SSRF returned an IAM role credential with S3 read permissions, which Thompson used to exfiltrate data from over 700 S3 buckets. Post-incident analysis by investigators and the U.S. Senate's report noted that the misconfigured WAF was identifiable through cloud configuration scanning (the instance metadata endpoint was reachable from untrusted networks), but Capital One's monitoring systems had not flagged it. Thompson had posted about the breach on GitHub and a Slack channel before Capital One knew it had been breached; a tip from a security researcher who saw the GitHub post triggered the disclosure.

Cloud Storage Exposure: The S3 Problem

AWS S3 buckets and their equivalents (Azure Blob Storage, Google Cloud Storage) have been one of the most consistently exploited categories of cloud misconfiguration for the past seven years. The 2019 Capital One breach and the 2017 Verizon data exposure (14 million customer records in a public S3 bucket operated by a third-party vendor, Nice Systems) both ended with data leaving cloud object storage because access controls were misconfigured. The 2021 Twitch source code leak, which Twitch attributed to a server configuration error, followed the same pattern of a configuration mistake exposing internal data.

Tools for discovering exposed buckets have matured significantly. GrayhatWarfare indexes public S3 buckets searchable by keyword. Bucket Finder and S3Scanner generate bucket name guesses based on organisation names and common naming patterns. For authorised assessments, Prowler and ScoutSuite scan cloud environments directly with appropriate credentials.

Bucket Discovery (Passive)

GrayhatWarfare (public bucket index), Certificate Transparency (certificates issued for hostnames that front buckets), passive DNS (subdomains like assets.example.com whose records point to S3 endpoints), Shodan S3 banner indexing.

Bucket Discovery (Active — Authorised Only)

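Pattern-based bucket name enumeration: companyname-backup, companyname-dev, companyname-logs. An unauthenticated request to a candidate bucket's endpoint returns 403 when the bucket exists but access is denied and 404 when it does not exist, a side channel that reveals bucket existence. (ListBuckets, by contrast, only lists buckets owned by the calling account, so it is not the call that leaks this.)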

AI Application

LLMs generate candidate bucket name lists from organisation names, acronyms, product names, and common suffixes. Pattern learning from job postings and GitHub org names improves hit rate significantly.
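
The pattern-based enumeration and the 403/404 side channel described above can be sketched in a few lines of Python. This is a minimal illustration, not a tool: the suffix and separator lists, the function names, and the finding labels are all assumptions made for the example, and the sketch deliberately makes no network requests. Sending the requests themselves belongs only in an authorised assessment.

```python
from itertools import product

# Hypothetical lists for illustration; the lesson names backup, dev and logs.
SUFFIXES = ["backup", "dev", "logs"]
SEPARATORS = ["-", ".", ""]


def candidate_bucket_names(org_tokens, suffixes=SUFFIXES, separators=SEPARATORS):
    """Combine organisation tokens with common suffixes into candidate names."""
    names = set()
    for token, sep, suffix in product(org_tokens, separators, suffixes):
        name = f"{token}{sep}{suffix}".lower()
        # S3 bucket names must be between 3 and 63 characters long.
        if 3 <= len(name) <= 63:
            names.add(name)
    return sorted(names)


def interpret_status(status_code):
    """Map the HTTP status of an unauthenticated bucket request to a finding."""
    if status_code == 200:
        return "exists-public"
    if status_code == 403:
        return "exists-private"
    if status_code == 404:
        return "not-found"
    # Anything else is left for manual review in this sketch.
    return "inconclusive"


if __name__ == "__main__":
    for name in candidate_bucket_names(["meridian", "mcg"]):
        print(name)
```

In an AI-assisted workflow, the `org_tokens` list is where LLM-generated acronyms and product names would be fed in; the combination and status-interpretation steps stay deterministic so results are reproducible.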

Code Repository Leaks: GitHub as an Attack Surface

Public code repositories have become one of the highest-yield OSINT sources for credential and secret exposure. The 2022 Uber breach (separate from the 2016 incident) was initiated by a contractor's credentials for Uber's internal systems being discoverable in code committed to a private — but accessible — GitHub repository. The attacker used those credentials to pivot into Uber's Slack, HackerOne, and AWS environments.

TruffleHog, Gitleaks, and GitHub's own Secret Scanning (available to public repositories and GitHub Advanced Security customers) detect high-entropy strings, API key patterns, and common secret formats in commit history. Critically, commit history persists even after a secret is removed from the current file: a credential deleted from the codebase today is still in the git log and still valid until rotated.
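
To make the two detection ideas concrete (known key patterns and high-entropy strings), here is a minimal Python sketch. It is not how any of the named tools is implemented; the entropy threshold, the token pattern, and the function names are assumptions chosen for illustration. The only format-specific rule is the `AKIA` prefix plus 16 characters used by long-term AWS access key IDs.

```python
import math
import re
from collections import Counter

# Long-term AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
AWS_KEY_RE = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")
# Candidate tokens: runs of 20+ characters typical of keys and base64 blobs.
TOKEN_RE = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")
# Assumed cut-off in bits per character; tune it against your own repositories.
ENTROPY_THRESHOLD = 4.0


def shannon_entropy(s):
    """Return the Shannon entropy of a string in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def scan_text(text):
    """Return (line_number, kind, match) tuples for suspected secrets."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in AWS_KEY_RE.finditer(line):
            findings.append((lineno, "aws-access-key-id", m.group(1)))
        for m in TOKEN_RE.finditer(line):
            token = m.group(0)
            if AWS_KEY_RE.fullmatch(token):
                continue  # already reported by the pattern rule
            if shannon_entropy(token) >= ENTROPY_THRESHOLD:
                findings.append((lineno, "high-entropy", token))
    return findings
```

Running `scan_text` over the output of `git log -p` rather than over the working tree is what catches secrets that were later deleted, since the removed lines still appear in the historical diffs.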

Documented Scale — GitGuardian 2023 Report

GitGuardian's 2023 State of Secrets Sprawl report found that 10 million new secrets were exposed in public GitHub commits in 2022, a 67% increase from 2021. The most commonly exposed types were generic high-entropy tokens (31%), Google API keys (21%), database connection strings (15%), and AWS access keys (12%). The median time from secret exposure to first external access (when monitored) was 4 seconds: automated credential-harvesting bots continuously monitor the GitHub public event stream via the API.

Building Continuous Monitoring Pipelines

The shift from periodic assessment to continuous monitoring is the defining architectural change in modern EASM. The components of an effective pipeline are:

Certificate Transparency monitoring: Subscribe to a CT log stream (Certstream) or poll crt.sh for your domains to learn, in near real time, when new certificates are issued for names matching your patterns. New subdomain = new attack surface.
Passive DNS change detection: Monitor SecurityTrails or Farsight for changes in DNS records for known domains. A CNAME suddenly pointing to a new third-party service warrants immediate review.
GitHub organisation monitoring: Monitor the GitHub API for new public repositories created by your organisation's accounts, new commits to existing repos, and new members added to your org. Tools like GitHub Monitor and commercial platforms (GitGuardian, Nightfall) automate this.
Shodan Monitor: Shodan's commercial Monitor product sends alerts when new hosts matching saved queries (org, ASN, SSL CN) appear in the index or when existing hosts add new ports/services.
AI-assisted change triage: Feed all alerts into an LLM-based triage layer that classifies new findings by risk tier, deduplicates known-good changes (e.g., a planned new CDN endpoint), and escalates anomalies for human review.
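
As a sketch of the first component, the function below filters one CT stream message for hostnames that are in scope but not yet in the inventory. It assumes the message is a dict shaped like Certstream's published `certificate_update` format, with hostnames under `data.leaf_cert.all_domains`; the watched suffix, the inventory set, and all function names are hypothetical. Connecting to the stream is left out so the logic can be read and tested on its own.

```python
def normalise(domain):
    """Lower-case a certificate name and drop a leading wildcard label."""
    name = domain.strip().lower()
    if name.startswith("*."):
        name = name[2:]
    return name


def in_scope(name, watched_suffixes):
    """True if name equals a watched domain or is a subdomain of one."""
    return any(name == s or name.endswith("." + s) for s in watched_suffixes)


def new_hosts(message, watched_suffixes, known_hosts):
    """Return in-scope hostnames from one CT message that are not yet inventoried."""
    if message.get("message_type") != "certificate_update":
        return []  # heartbeats and other message types carry no certificate
    leaf = message.get("data", {}).get("leaf_cert", {})
    names = {normalise(d) for d in leaf.get("all_domains", [])}
    return sorted(
        n for n in names
        if in_scope(n, watched_suffixes) and n not in known_hosts
    )
```

Matching on `"." + suffix` rather than a bare `endswith(suffix)` is a deliberate choice: it keeps lookalike names such as `evilmeridian.example` from being treated as part of `meridian.example`. Each hostname returned would become one alert passed on to the triage layer.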

The 2022 Uber Breach: A Monitoring Failure Anatomy

The September 2022 Uber breach is instructive because each failure point corresponded to a gap in continuous monitoring. A contractor's credentials for Uber's internal VPN were stored in a script in a private GitHub repository accessible to the attacker. The attacker used MFA fatigue to gain VPN access, then found an internal network share containing a PowerShell script with hardcoded credentials for Uber's Privileged Access Management (PAM) system. Each of these artefacts — the GitHub secret, the network-accessible share, the hardcoded credentials — would have been flagged by properly configured continuous monitoring. The Senate Commerce Committee's 2023 review cited the incident as a case study for the type of systemic monitoring failures that EASM platforms are designed to prevent.

AI in the Monitoring Loop — Practical Considerations

LLM-assisted triage in continuous monitoring pipelines must account for false-positive fatigue. If the AI flags every new certificate or DNS change as critical, analysts stop responding. Effective implementations use the LLM to classify changes into tiers (P1 immediate review, P2 within 24h, P3 weekly batch) rather than as a binary alerting system. The classification criteria — novelty of service type, proximity to sensitive data systems, consistency with known-good patterns — can be encoded in the system prompt and refined through feedback on analyst decisions over time.
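
One way to keep an LLM triage layer honest is to pair it with a small deterministic baseline built from the same criteria. The sketch below is an assumption-laden illustration, not a prescribed design: the `Finding` fields, the rule ordering, and the `reconcile` policy (keep the more urgent of the two tiers, and fall back to the rule tier when the model's output is not a valid tier) are all hypothetical choices.

```python
from dataclasses import dataclass

TIERS = ("P1", "P2", "P3")  # ordered from most to least urgent


@dataclass
class Finding:
    source: str                # e.g. "ct-log", "passive-dns", "github", "shodan"
    novel_service_type: bool   # service type not seen before in the inventory
    near_sensitive_data: bool  # close to systems that hold sensitive data
    matches_known_good: bool   # consistent with an approved, planned change


def rule_tier(finding):
    """Deterministic baseline tier derived from the three lesson criteria."""
    if finding.matches_known_good:
        return "P3"  # planned change: weekly batch
    if finding.near_sensitive_data:
        return "P1"  # immediate review
    if finding.novel_service_type:
        return "P2"  # review within 24h
    return "P3"


def reconcile(llm_tier, baseline_tier):
    """Combine the LLM's tier with the baseline, preferring the more urgent."""
    if llm_tier not in TIERS:
        return baseline_tier  # unparseable model output: trust the rules
    return min(llm_tier, baseline_tier, key=TIERS.index)
```

For example, `reconcile("P3", rule_tier(finding))` still yields P1 for an unplanned change near sensitive data, so a model that under-rates a finding cannot silently demote it. Disagreements between the two tiers are also useful feedback when refining the system prompt.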

Lesson 4 Quiz

Cloud Exposure, Repository Leaks & Continuous Monitoring · 4 questions

In the Capital One 2019 breach, what was the specific technical chain that allowed an outsider to read 700+ S3 buckets?

Correct. The attack chain is canonical: misconfigured WAF allowed SSRF to reach the EC2 instance metadata endpoint, which returned an IAM role credential. That credential had S3 read permissions scoped too broadly.

Not quite. The Capital One breach used SSRF against the AWS Instance Metadata Service to retrieve an IAM role credential — not a public bucket ACL or SQL injection path.

According to GitGuardian's 2023 report, what was the median time from a secret being exposed in a public GitHub commit to first external access by an automated harvesting bot?

Correct. 4 seconds. Automated bots monitor the GitHub public event stream via the API and attempt to use newly discovered credentials almost instantaneously — making credential rotation, not deletion, the only effective remediation once exposure has occurred.

Not quite. The figure is 4 seconds — not minutes, hours, or days. Automated bots watch the GitHub event stream in real time. A secret committed and pushed is accessible to adversarial automation almost immediately.

Why does deleting a secret from a code file and pushing a new commit NOT fully remediate the exposure risk?

Correct. Git history persists by default. The secret remains in every commit from the one that introduced it up to the one that removed it. Remediation requires rotating the credential AND rewriting/filtering the git history and force-pushing; forks and clones made before the rewrite still hold the old history, which is why rotation cannot be skipped.

Not quite. The correct answer relates to git's persistent history: removing a secret changes only the new commit. The secret is still in every earlier commit that contained it and remains accessible unless the repository history is rewritten.

What problem does AI-assisted tiered classification (P1/P2/P3) solve in continuous attack-surface monitoring that binary alerting does not?

Correct. Alert fatigue is the operational failure mode that degrades otherwise sound monitoring programs. Tiered AI triage preserves analyst attention for genuine P1 findings by batching and contextualising lower-priority changes.

Not quite. The key operational problem is alert fatigue — analysts desensitised by too many low-priority alerts stop investigating all of them. Tiered classification routes urgent findings to immediate attention and batches lower-priority ones.

Lab 4: Continuous Monitoring Pipeline Design

AI-assisted exercise · Architect an EASM monitoring system with AI-assisted triage

Objective

Meridian Capital Group has approved a continuous external attack-surface monitoring programme. You need to design the full pipeline: data sources, alerting logic, AI triage layer, escalation thresholds, and a response playbook for the three most likely exposure types (new subdomain, exposed credential, open cloud storage).

Use the AI to work through the architecture, challenge assumptions (what if the CT monitoring misses a subdomain? what if the GitHub secret was in a private repo?), and draft a concise monitoring programme design document.

Starter prompt: "Help me design the AI triage layer for Meridian's monitoring pipeline. We'll receive alerts from CT logs, Shodan Monitor, GitGuardian, and passive DNS. What criteria should the LLM use to classify each alert as P1, P2, or P3?"

Module 3 Test

Attack-Surface Mapping at Scale · 15 questions · Pass mark 80%

1. The U.S. GAO's post-mortem on the Equifax 2017 breach cited which specific failure as a root cause?

Correct.

The GAO cited incomplete enumeration of internet-accessible systems as the root cause that allowed the patch to be applied to the wrong network segment.

2. Which of the following best defines "Shadow IT" in the context of attack-surface mapping?

Correct.

Shadow IT refers specifically to unapproved deployments — business units using cloud services, SaaS tools, or running servers without security team knowledge or approval.

3. The Marriott-Starwood breach illustrates acquisition drift. What made the compromised Starwood system invisible to Marriott's security team?

Correct.

The system was invisible because it was never added to Marriott's inventory or monitoring scope post-acquisition — the canonical acquisition-drift failure.

4. Certificate Transparency logs are particularly valuable for attack-surface mapping because:

Correct.

CT logs are fully public, indexed near-instantly, and expose SANs — making new subdomains discoverable within minutes of certificate issuance.

5. What is a "dangling CNAME" and why does it create a security risk?

Correct.

A dangling CNAME points to a cloud service endpoint (Heroku, Azure, GitHub Pages) that has been deleted — an attacker can register that endpoint name and serve malicious content under the victim's subdomain.

6. In passive DNS, what capability does historical record data provide that active DNS querying cannot?

Correct.

Passive DNS historical records reveal hosting history, IP pivots, and infrastructure relationships — none of which are visible in a current active DNS query.

7. Which OSINT tool is described as using 40+ passive data sources including SecurityTrails, Shodan, and VirusTotal — making zero active DNS queries?

Correct. Subfinder is designed for passive-only subdomain discovery, aggregating results from multiple intelligence sources without querying target DNS servers.

Subfinder is the passive-only tool. Amass can do both active and passive. dnsx is a resolver. Nmap is an active scanner.

8. A Shodan query using "asn:AS12345" is most appropriate for which attack-surface mapping objective?

Correct. ASN queries scope to the organisation's own registered IP space — the most comprehensive method for on-premises and co-located infrastructure.

ASN queries retrieve all Shodan-indexed hosts within the IP ranges the organisation advertises via BGP — their own IP space, not just what their domain name resolves to.

9. What made Censys particularly useful for the post-SolarWinds attack-surface analysis described in Lesson 3?

Correct. The entity-relationship model — combining host, service, and certificate data — allowed identification and organisational attribution of exposed Orion instances at scale.

Censys's value was its structured data model allowing correlation of product banners with organisation data to identify and attribute exposed Orion instances across the internet.

10. How did Capital One learn of the 2019 breach?

Correct. Capital One did not detect the breach internally. A researcher saw Thompson's GitHub post and notified them — they had been breached for an unknown period before that tip.

Capital One's own monitoring did not detect the breach. A researcher saw the attacker's public GitHub post describing the breach and sent a responsible-disclosure email — that tip triggered Capital One's response.

11. GitGuardian's 2023 report found that the most commonly exposed secret type in public GitHub commits was:

Correct. Generic high-entropy tokens (31%) topped the list, followed by Google API keys (21%), database connection strings (15%), and AWS keys (12%).

Per the GitGuardian 2023 report: generic high-entropy tokens (31%), Google API keys (21%), database connection strings (15%), AWS access keys (12%).

12. Why must remediating an exposed GitHub secret include credential rotation — not just deleting the file?

Correct. Both factors apply: git history is immutable without rewriting, and the 4-second automated harvesting window means the credential may already be in adversary hands regardless of deletion.

Two reasons make deletion insufficient: git history persists (the secret is still in every earlier commit that contained it), and automated bots may have harvested it within 4 seconds of the original push.

13. Which component of a continuous monitoring pipeline is specifically designed to notify analysts within seconds of a new TLS certificate being issued for a monitored domain?

Correct. Real-time CT log streaming (for example via Certstream) raises an alert within seconds of a new certificate appearing, so defenders can react as fast as adversaries could discover the new subdomain.

Real-time CT log monitoring (for example via Certstream) provides near-instant notification of new certificate issuances. Shodan Monitor operates on a scan-cycle delay; passive DNS on propagation delay.

14. In the 2022 Uber breach, the initial credential access was obtained through which technique against which target?

Correct. The attacker obtained contractor credentials (stored in a repository) then used MFA fatigue to get the contractor to approve a push notification, granting VPN access.

The 2022 Uber breach began with contractor credentials from a code repository, followed by MFA fatigue to bypass multi-factor authentication on the VPN — a combination of credential exposure and social engineering.

15. AI-assisted tiered triage (P1/P2/P3) in EASM monitoring primarily addresses which operational failure mode?

Correct. Alert fatigue degrades otherwise sound monitoring programmes. Tiered AI triage preserves analyst attention for genuine emergencies by classifying and batching lower-priority findings.

The primary problem tiered triage solves is alert fatigue — the operational reality that undifferentiated high-volume alerting causes analysts to deprioritise all alerts, including genuine critical ones.