In May 2017, an Apache Struts vulnerability (CVE-2017-5638) sat unpatched on a single internet-facing Equifax web application. The organisation had over 35 business units, each managing its own certificate and hostname inventory. The security team tasked with patching the flaw scanned the wrong segment of the network because it did not have a reliable, complete map of its own attack surface. Attackers exploited the gap for roughly 76 days, exfiltrating records on 147 million people before detection. The U.S. Government Accountability Office's 2018 post-mortem cited the failure to fully enumerate internet-accessible systems as a root cause.
An organisation's attack surface is the complete set of points where an adversary can attempt to enter or extract data. It spans three overlapping domains: the digital surface (IP ranges, domains, APIs, cloud storage, code repositories), the human surface (employee credentials, social profiles, phishing vectors), and the physical surface (physical locations, RFID, badge-access systems).
This module focuses on the digital surface — specifically, the challenge of enumerating it comprehensively when modern organisations span hundreds of cloud accounts, SaaS tenants, acquired subsidiaries, and developer sandboxes that were never formally catalogued.
Traditional asset inventory relied on manual spreadsheets, CMDB entries, and periodic vulnerability scans. Three structural problems make this unworkable at modern scale:
Acquisition drift: Mergers bring in legacy infrastructure with unknown hostnames. When Marriott acquired Starwood Hotels in 2016, it inherited a compromised reservation system that had been breached since 2014 — a system that did not appear in Marriott's own asset inventory until years into the incident investigation.
Cloud sprawl: AWS alone reports that enterprise customers average over 1,000 active cloud accounts. Every S3 bucket, Lambda function endpoint, and API Gateway stage is a potential attack-surface entry point. The 2019 Capital One breach originated in a misconfigured web application firewall running on an AWS EC2 instance, an asset that post-incident analysis confirmed was not in the company's formal security review queue.
Developer velocity: Continuous deployment pipelines create and destroy hosts, containers, and API endpoints faster than weekly scan cycles can track. PortSwigger's 2022 research found that the average Fortune 500 company has over 500 active subdomains at any given time, with 15–20% unknown to the security team.
AI tools change attack-surface mapping in two directions simultaneously. Defenders use AI to correlate passive DNS records, certificate transparency logs, job postings, and GitHub commits to discover assets faster than manual analysis allows. Adversaries use the same data sources and similar tooling. The asymmetry that once favoured patient attackers with time to manually enumerate is narrowing — but only for defenders who actually deploy these methods.
Effective attack-surface mapping pursues three ordered objectives:
Google's Certificate Transparency (CT) project logs every TLS certificate issued by participating Certificate Authorities — approximately 10 billion certificates as of 2024. Tools like crt.sh and Censys index this log in near real-time. A subdomain created at 9 AM and issued a Let's Encrypt certificate is searchable by an adversary — or a defender — within minutes. AI-assisted monitoring pipelines can alert on new certificate issuances for an organisation's domains within seconds of CT log publication.
Attack-surface mapping, even using purely passive OSINT techniques, operates within legal and ethical constraints that differ by jurisdiction and context. The U.S. Computer Fraud and Abuse Act (CFAA), the UK Computer Misuse Act, and the EU's NIS2 Directive all distinguish between passive observation of publicly available data and active probing of systems. This module focuses exclusively on passive enumeration — techniques that observe data already published to the internet without sending crafted packets to target systems.
For authorised penetration testing or red-team engagements, active techniques (port scanning, banner grabbing, vulnerability probing) require explicit written scope agreements. The passive baseline covered here is the prerequisite phase that precedes any active testing.
You are preparing an authorised external attack-surface assessment for a mid-size financial services firm, Meridian Capital Group (fictional stand-in for practice). Before running any tools, you need to define the scope: what domains, ASNs, IP ranges, and subsidiary brands should be included.
Use the AI assistant to work through the scoping methodology. Ask about what data sources to consult, how to find subsidiaries, what legal and ethical boundaries apply, and how to structure the final scope document.
In 2016, researcher Frans Rosén discovered that Uber's attack surface extended far beyond their primary domain. By querying Certificate Transparency logs and passive DNS databases, he found that Uber operated over 300 active subdomains, many pointing to staging environments, internal tools, and acquired startup infrastructure. Several of these hosted vulnerable versions of web frameworks. Rosén disclosed the findings through HackerOne. Uber's own internal asset inventory had catalogued fewer than half of these hosts. His methodology — CT logs first, passive DNS correlation second — became a template adopted by the bug-bounty community and later formalised into tools like Subfinder and Amass.
Passive DNS is the historical record of DNS queries and responses collected by sensors placed at resolvers, IXPs, and DNS infrastructure operators. Unlike active DNS querying (which asks "what does this domain resolve to right now?"), passive DNS answers "what has this domain resolved to, when, and what other domains resolved to the same IP?"
Commercial providers — Farsight DNSDB, RiskIQ PassiveTotal (now Microsoft Defender Threat Intelligence), VirusTotal, and SecurityTrails — collect and index billions of passive DNS records. Free-tier access is available for researchers. The data is legally collected from consenting resolvers; it does not require querying target infrastructure.
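The co-hosting question that passive DNS answers ("what other domains resolved to the same IP?") can be sketched in a few lines. This is a minimal illustration only: the `PdnsRecord` shape, the `cohosted_names` helper, and the sample records are hypothetical and do not reflect any provider's actual schema.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class PdnsRecord(NamedTuple):
    """Hypothetical passive DNS observation (not a real provider schema)."""
    rrname: str       # hostname that was queried
    rdata: str        # IP address that was returned
    first_seen: date
    last_seen: date


def cohosted_names(records, seed_host):
    """For each IP the seed host has used, list other names seen there in an overlapping window."""
    seed_windows = defaultdict(list)
    for r in records:
        if r.rrname == seed_host:
            seed_windows[r.rdata].append((r.first_seen, r.last_seen))

    neighbours = defaultdict(set)
    for r in records:
        if r.rrname == seed_host:
            continue
        for start, end in seed_windows.get(r.rdata, []):
            # Two date ranges overlap when each starts before the other ends.
            if r.first_seen <= end and r.last_seen >= start:
                neighbours[r.rdata].add(r.rrname)
    return {ip: sorted(names) for ip, names in neighbours.items()}


RECORDS = [
    PdnsRecord("www.example.com", "203.0.113.10", date(2023, 1, 1), date(2023, 6, 30)),
    PdnsRecord("legacy.example.net", "203.0.113.10", date(2023, 3, 1), date(2023, 4, 1)),
    PdnsRecord("unrelated.example.org", "203.0.113.10", date(2021, 1, 1), date(2021, 2, 1)),
    PdnsRecord("www.example.com", "198.51.100.7", date(2023, 7, 1), date(2024, 1, 1)),
    PdnsRecord("shop.example.net", "198.51.100.7", date(2023, 12, 1), date(2024, 3, 1)),
]

result = cohosted_names(RECORDS, "www.example.com")
print(result)
```

Requiring the observation windows to overlap filters out names that merely reused the same address years earlier, which is a common source of false attribution on shared or recycled cloud IPs.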
Every certificate logged in CT contains the Subject Alternative Names (SANs) field, which lists all hostnames the certificate covers. Wildcard certificates (*.example.com) are less informative, but most organisations mix wildcards with specific SAN entries that reveal exact subdomain names. The open-source crt.sh project (operated by Sectigo) provides free querying of the full CT log corpus.
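A minimal sketch of reducing CT query results to a hostname list follows. It assumes the JSON shape crt.sh is understood to return, where each entry carries a `name_value` field holding newline-separated names; the `extract_hostnames` helper and the sample records are hypothetical, and no network request is made here.

```python
from __future__ import annotations

import json


def extract_hostnames(crtsh_json: str, apex: str) -> tuple[set[str], set[str]]:
    """Return (specific_hosts, wildcard_patterns) that fall under `apex`."""
    apex = apex.lower().rstrip(".")
    hosts: set[str] = set()
    wildcards: set[str] = set()
    for entry in json.loads(crtsh_json):
        # Assumed field: `name_value` holds one or more SAN names, one per line.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            if not name:
                continue
            bare = name[2:] if name.startswith("*.") else name
            # Keep only names inside the authorised scope.
            if bare != apex and not bare.endswith("." + apex):
                continue
            if name.startswith("*."):
                wildcards.add(name)
            else:
                hosts.add(name)
    return hosts, wildcards


SAMPLE = json.dumps([
    {"name_value": "example.com\nwww.example.com"},
    {"name_value": "*.example.com"},
    {"name_value": "API-Staging.example.com\nvpn.example.com"},
    {"name_value": "notexample.com"},
])

hosts, wildcards = extract_hostnames(SAMPLE, "example.com")
print(sorted(hosts))
print(sorted(wildcards))
```

Wildcards are kept in a separate set because they confirm that a zone exists without naming a concrete host, so they are better treated as hints for later expansion than as findings.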
A 2023 study by Censys found that 62% of Fortune 1000 companies had at least one subdomain discoverable exclusively through CT logs — meaning passive DNS records had not yet propagated to commercial aggregators, but CT had captured the certificate within 60 seconds of issuance.
When an organisation deletes a service (e.g., removes a Heroku app or an Azure Static Web App) but leaves the CNAME DNS record pointing to the now-deleted endpoint, an adversary can register the same endpoint name on that platform and serve content under the organisation's subdomain. In 2019, researcher Patrik Hudak documented over 2,000 Fortune 500 subdomains vulnerable to takeover using this technique — discovered entirely through CT logs and passive DNS, with no active probing. Microsoft, Shopify, and Airbnb all had affected subdomains disclosed through bug-bounty programmes.
The open-source tooling ecosystem has matured considerably. Modern pipelines typically combine multiple sources to maximise coverage:
https://crt.sh/?q=%25.example.com&output=json
Raw subdomain lists from CT logs and passive DNS contain noise: expired certificates, honeypot entries, and staging hosts that were live only briefly. AI-assisted analysis adds value at two points:
Pattern prediction: Given a known set of subdomains (api.example.com, api-v2.example.com, api-staging.example.com), an LLM can generate a structured prediction of likely undiscovered hosts (api-dev, api-int, api-uat, api-prod) that can then be validated with targeted DNS resolution. A 2022 paper from NCC Group described using GPT-3 to generate subdomain wordlists from existing enumeration output, improving discovery coverage by 23% on a test corpus of 50 organisations.
Anomaly flagging: CNAME chains pointing to cloud services (e.g., .azurewebsites.net, .herokuapp.com, .github.io) are automatically flagged as potential takeover candidates. An AI pipeline can cross-reference the target of each CNAME against a database of known "dangling" service patterns to prioritise which require immediate validation.
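The CNAME flagging step can be illustrated with a simple suffix match. This sketch uses only the three service suffixes named above; the `flag_takeover_candidates` helper, the input tuple layout, and the two priority labels are hypothetical. A target that fails to resolve is only one signal of a dangling record, so flagged entries still need manual validation.

```python
# Suffixes taken from the text above; a real pipeline would keep a longer, curated list.
TAKEOVER_PRONE_SUFFIXES = (".azurewebsites.net", ".herokuapp.com", ".github.io")


def flag_takeover_candidates(cname_records):
    """cname_records: iterable of (host, cname_target, target_resolves).

    Returns (host, target, priority) tuples with non-resolving targets first.
    """
    flagged = []
    for host, target, target_resolves in cname_records:
        target = target.lower().rstrip(".")
        if not target.endswith(TAKEOVER_PRONE_SUFFIXES):
            continue
        # A third-party target that no longer resolves is the stronger signal.
        priority = "review" if target_resolves else "urgent"
        flagged.append((host, target, priority))
    flagged.sort(key=lambda row: (row[2] != "urgent", row[0]))
    return flagged


RECORDS = [
    ("blog.example.com", "example-blog.herokuapp.com.", False),
    ("docs.example.com", "example.github.io", True),
    ("www.example.com", "cdn.example.net", True),
    ("app.example.com", "example-app.azurewebsites.net", False),
]

candidates = flag_takeover_candidates(RECORDS)
for row in candidates:
    print(row)
```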
1. Query crt.sh for all certificates issued to *.target.com and target.com (JSON API).
2. Query Subfinder against the same domain using passive sources only.
3. Merge and deduplicate the two lists.
4. Run dnsx to resolve all entries and discard NXDOMAIN results.
5. Feed live results into an LLM to identify CNAME takeover candidates and generate pattern-based expansion wordlists.
6. Validate expansion wordlists with another dnsx pass.
Total active footprint: zero. All steps use publicly available data and resolve only existing DNS records.
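Steps 3 and 5 of this pipeline can be sketched offline. The code below is an assumption-laden illustration: `merge_sources` and `expand_patterns` are hypothetical helpers, the environment tokens are drawn from the examples used earlier in this lesson, and a deterministic rule stands in for the LLM expansion step. The resolution steps (4 and 6) are omitted because they require network access.

```python
import re


def merge_sources(*sources):
    """Step 3: merge and deduplicate, normalising case, trailing dots and wildcard prefixes."""
    merged = set()
    for source in sources:
        for name in source:
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]
            if name:
                merged.add(name)
    return merged


# Hypothetical wordlist based on the environment names mentioned in this lesson.
ENV_TOKENS = ("dev", "int", "uat", "staging", "prod")


def expand_patterns(live_hosts, apex):
    """Step 5 stand-in: propose `<base>-<env>` siblings where that pattern already appears."""
    suffix = "." + apex
    labels = {h[: -len(suffix)] for h in live_hosts if h.endswith(suffix)}
    env_re = re.compile(r"^(?P<base>[a-z0-9]+)-(?:%s)$" % "|".join(ENV_TOKENS))
    bases = {m.group("base") for label in labels if (m := env_re.match(label))}
    candidates = {f"{base}-{env}{suffix}" for base in bases for env in ENV_TOKENS}
    # Only return names that are not already known.
    return candidates - set(live_hosts)


crt_names = ["*.example.com", "API.example.com", "api-staging.example.com."]
subfinder_names = ["api.example.com", "api-dev.example.com", "mail.example.com"]

merged = merge_sources(crt_names, subfinder_names)
expansion = expand_patterns(merged, "example.com")
print(sorted(merged))
print(sorted(expansion))
```

Restricting expansion to bases that already exhibit the naming pattern keeps the candidate list short, which matters because every candidate costs a DNS lookup in the validation pass.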
You have been given a target domain: meridianbank.example (fictional). You need to design and justify a complete passive subdomain enumeration pipeline — from CT log queries through AI-assisted pattern expansion — and identify which discovered assets should be prioritised for takeover-vulnerability checks.
Ask the AI assistant to walk you through tool selection, source prioritisation, CNAME takeover identification, and how to structure your findings. Challenge the AI with edge cases like wildcard certificates and historical DNS anomalies.
Prior to the December 2020 public disclosure of the SolarWinds SUNBURST attack, independent researchers examining Shodan and Censys data found that over 18,000 SolarWinds Orion instances were directly internet-accessible — many exposing administrative interfaces on default ports. Post-breach analysis by the Atlantic Council found that organisations with internet-exposed Orion management interfaces had significantly higher lateral movement risk once the trojanised update was installed. The internet-wide scan data was publicly available; what was missing was a systematic, AI-assisted mechanism to correlate Shodan results with an organisation's known asset inventory and flag the exposure proactively. Several vendors subsequently built automated Shodan-correlation features into their EASM platforms as a direct response.
Services like Shodan, Censys, and BinaryEdge operate fleets of scanning nodes that continuously probe the entire IPv4 address space (and significant portions of IPv6) on common and uncommon ports. They collect banners — the raw response data from each service — and index it in searchable databases. The resulting corpus is a snapshot of what every publicly reachable host on the internet was serving, queryable without touching target infrastructure.
Censys was launched in 2015 as an academic project at the University of Michigan, growing out of the research group behind the ZMap scanner. Shodan has been running since 2009 and indexes over 1.5 billion devices. BinaryEdge focuses on SSL/TLS certificate data and is particularly useful for tracking certificate chains. All three offer free-tier API access for security researchers.
Shodan's query syntax uses field filters applied to its banner index. Effective attack-surface mapping uses several categories of queries:
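As an illustration of how field filters compose, the sketch below builds query strings without calling any API. The `shodan_query` helper is hypothetical, and the filter names used (`asn`, `org`, `port`, `ssl.cert.subject.cn`) are assumed from Shodan's public filter reference and should be checked against the current documentation. The ASN is the fictional one used in this module's exercise.

```python
def shodan_query(**filters):
    """Compose a Shodan-style query string from field filters.

    Double underscores in keyword names become dots, so
    ssl__cert__subject__cn maps to ssl.cert.subject.cn.
    Values containing spaces are wrapped in double quotes.
    """
    parts = []
    for field, value in filters.items():
        field = field.replace("__", ".")
        value = str(value)
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{field}:{value}")
    return " ".join(parts)


# Remote desktop exposed inside the organisation's (fictional) ASN.
print(shodan_query(asn="AS64501", port=3389))
# Hosts attributed to the organisation by name.
print(shodan_query(org="Meridian Capital Group"))
# Hosts presenting a certificate whose subject CN matches the domain.
print(shodan_query(ssl__cert__subject__cn="example.com"))
```

Building queries from structured parameters rather than hand-typed strings makes it easier to generate one query per exposure category and to record exactly which filters produced each result set.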
Censys's data model is structured differently from Shodan's. Rather than a banner-centric index, Censys organises data around hosts, certificates, and domains as first-class entities, with explicit relationships between them. This makes Censys particularly powerful for attack-surface mapping tasks that require cross-referencing: find all certificates issued to *.example.com, then find all IP addresses currently serving those certificates, then pivot to find other domains hosted on those IPs that might belong to the same organisation.
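The three-hop pivot described above (certificates, then serving IPs, then other names on those IPs) can be sketched over in-memory data. This is not the Censys API: the two dictionaries, the fingerprint labels, and the `pivot_from_domain` helper are hypothetical stand-ins for data an analyst would have exported.

```python
def in_scope(name, apex):
    """True when `name` (optionally a wildcard) is the apex or a subdomain of it."""
    bare = name[2:] if name.startswith("*.") else name
    return bare == apex or bare.endswith("." + apex)


def pivot_from_domain(apex, cert_names, cert_hosts):
    """cert_names: {fingerprint: set of SAN names}
    cert_hosts: {fingerprint: set of IPs currently serving that certificate}
    """
    # Hop 1: certificates with at least one SAN under the apex domain.
    seed_certs = {
        fp for fp, names in cert_names.items()
        if any(in_scope(n, apex) for n in names)
    }
    # Hop 2: IPs currently serving those certificates.
    ips = set().union(*(cert_hosts.get(fp, set()) for fp in seed_certs))
    # Hop 3: out-of-scope names on any certificate served from those IPs.
    related = set()
    for fp, hosts in cert_hosts.items():
        if hosts & ips:
            related |= {n for n in cert_names.get(fp, set()) if not in_scope(n, apex)}
    return seed_certs, ips, related


CERT_NAMES = {
    "fp1": {"*.example.com", "example.com"},
    "fp2": {"portal.example-subsidiary.net"},
    "fp3": {"unrelated.org"},
}
CERT_HOSTS = {
    "fp1": {"203.0.113.10", "203.0.113.11"},
    "fp2": {"203.0.113.11"},
    "fp3": {"198.51.100.5"},
}

seed_certs, ips, related = pivot_from_domain("example.com", CERT_NAMES, CERT_HOSTS)
print(seed_certs, sorted(ips), related)
```

Names returned by the third hop are candidates only. Shared hosting and CDNs put unrelated organisations on the same IP, so each one still needs ownership attribution before it enters the scope document.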
In 2021, Censys published a case study showing that their certificate-graph approach discovered an average of 40% more internet-facing assets per organisation compared to using subdomain enumeration alone — particularly for organisations with complex cloud footprints where different subsidiaries used different domain names but shared TLS certificates or infrastructure.
In 2020, security researcher Bob Diachenko used Shodan queries to discover 23,000 MongoDB instances openly accessible on the internet with no authentication, containing a combined estimated 4TB of data. The methodology was a single Shodan query filtering for MongoDB on port 27017 with no authentication required. AI-assisted triage of results — classifying databases by likely industry based on database and collection names visible in banner data — took minutes where manual review would have taken days. Diachenko's responsible-disclosure workflow relied on automated organisation-ownership attribution to notify affected parties.
Internet-wide scan queries against a large organisation return hundreds or thousands of results. Manual triage at this scale is impractical. AI integration adds value at three points:
Ownership attribution: Banner data, WHOIS records, and SSL certificate organisation fields can be ambiguous for acquired subsidiaries or white-label services. An LLM can cross-reference multiple signals to probabilistically assign each host to a business unit or subsidiary, flagging low-confidence attributions for manual review.
Severity ranking: Given a list of exposed services, an LLM can apply contextual knowledge (CVE severity, default-credential likelihood, data sensitivity indicators visible in banner data) to produce a prioritised list — focusing analyst attention on the highest-risk exposures first.
Context enrichment: For each identified host, AI can automatically enrich findings with context from job postings (e.g., the organisation advertised for "SolarWinds Orion administrators"), LinkedIn technology indicators, and press releases announcing new systems — corroborating what the scan data shows.
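The ranking and attribution points above can be illustrated with a deterministic stand-in for the LLM step. Everything here is an assumption made for the sketch: the `Exposure` fields, the additive weights, and the 0.6 review threshold are hypothetical and would need tuning against real analyst decisions.

```python
from dataclasses import dataclass


@dataclass
class Exposure:
    host: str
    service: str
    cvss: float = 0.0                    # highest known CVSS for the detected version
    default_creds_likely: bool = False
    sensitive_data_hint: bool = False    # e.g. revealing database names in the banner
    attribution_confidence: float = 1.0  # 0..1, how sure we are the host is in scope


def score(e):
    """Hypothetical additive score; higher means review sooner."""
    s = e.cvss
    if e.default_creds_likely:
        s += 3
    if e.sensitive_data_hint:
        s += 2
    return s


def prioritise(exposures, review_threshold=0.6):
    """Rank by score and separately list hosts whose ownership needs manual review."""
    ranked = sorted(exposures, key=score, reverse=True)
    needs_review = [e.host for e in exposures if e.attribution_confidence < review_threshold]
    return ranked, needs_review


FINDINGS = [
    Exposure("203.0.113.20", "mongodb", default_creds_likely=True, sensitive_data_hint=True),
    Exposure("203.0.113.21", "http-admin", cvss=9.8),
    Exposure("203.0.113.22", "ssh", cvss=5.3, attribution_confidence=0.4),
]

ranked, needs_review = prioritise(FINDINGS)
print([e.host for e in ranked])
print(needs_review)
```

Attribution confidence is deliberately kept out of the score so that a serious exposure on an uncertain host is not buried. It is surfaced as a separate review queue instead.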
BinaryEdge stores historical scan data going back several years, allowing analysts to ask "when did this port first appear open on this host?" and "what services were running on this IP range before the organisation migrated to the cloud?" This timeline data is particularly valuable in breach investigations and post-merger assessments. The platform's API is accessible to researchers at free tier and is integrated into several EASM platforms including Detectify and CyCognito.
You're mapping the external attack surface of Meridian Capital Group. You have their ASN (AS64501, fictional) and primary domain. Your Shodan export has returned 340 hosts across 12 service types. You need to build additional targeted queries, triage the results, and produce a prioritised exposure report.
Work with the AI to design queries for specific exposure categories (default admin panels, legacy management protocols, cloud storage), understand how to structure a triage methodology, and identify which of your hypothetical 340 results to report first.
On July 29, 2019, Capital One disclosed a breach affecting 106 million customers in the US and Canada. The attacker, Paige Thompson (a former AWS engineer), exploited a misconfigured Web Application Firewall deployed on an EC2 instance to perform a Server-Side Request Forgery (SSRF) attack against the AWS Instance Metadata Service. The SSRF returned an IAM role credential with S3 read permissions, which Thompson used to exfiltrate data from over 700 S3 buckets. Post-incident analysis by investigators and the U.S. Senate's report noted that the misconfigured WAF was identifiable through cloud configuration scanning, because the instance metadata endpoint was reachable from untrusted networks, but Capital One's monitoring systems had not flagged it. Thompson had posted about the breach on GitHub and a Slack channel before Capital One knew it had been breached; a tip from a security researcher who saw the GitHub post triggered the disclosure.
AWS S3 buckets and their equivalents (Azure Blob Storage, Google Cloud Storage) have been the most consistently exploited category of cloud misconfiguration for the past seven years. The 2019 Capital One breach, the 2017 Verizon data exposure (14 million customer records in a public S3 bucket operated by a third-party vendor, Nice Systems), and the 2021 Twitch source code leak (a misconfigured internal S3 bucket) all share the same root cause: access control misconfiguration on cloud object storage.
Tools for discovering exposed buckets have matured significantly. GrayhatWarfare indexes public S3 buckets searchable by keyword. Bucket Finder and S3Scanner generate bucket name guesses based on organisation names and common naming patterns. For authorised assessments, Prowler and ScoutSuite scan cloud environments directly with appropriate credentials.
Public code repositories have become one of the highest-yield OSINT sources for credential and secret exposure. The 2022 Uber breach (separate from the 2016 incident) was initiated by a contractor's credentials for Uber's internal systems being discoverable in code committed to a private — but accessible — GitHub repository. The attacker used those credentials to pivot into Uber's Slack, HackerOne, and AWS environments.
TruffleHog, Gitleaks, and GitHub's own secret scanning (available to public repositories and GitHub Advanced Security customers) detect high-entropy strings, API key patterns, and common secret formats in commit history. Critically, commit history persists even after a secret is removed from the current file: a credential deleted from the codebase today is still in the git log and still valid until rotated.
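The two detection ideas mentioned here, known key patterns and high-entropy strings, can be sketched as follows. This is a simplified illustration and not how any of the named tools is implemented: the `scan_line` helper, the token regex, and the 4.0 bits-per-character threshold are assumptions. The sample values are the placeholder credentials AWS uses in its own documentation.

```python
import math
import re
from collections import Counter

# Widely used heuristic for AWS access key IDs: a known prefix plus 16 uppercase alphanumerics.
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")
# Candidate tokens: long runs of characters typical of keys and base64 blobs.
TOKEN = re.compile(r"[A-Za-z0-9+/=_\-]{20,}")


def shannon_entropy(s):
    """Entropy in bits per character of the string's character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def scan_line(line, entropy_threshold=4.0):
    """Return (kind, value) findings for one line of text."""
    findings = [("aws-access-key-id", m.group()) for m in AWS_KEY_ID.finditer(line)]
    for m in TOKEN.finditer(line):
        token = m.group()
        if AWS_KEY_ID.fullmatch(token):
            continue  # already reported by the pattern rule
        if shannon_entropy(token) >= entropy_threshold:
            findings.append(("high-entropy", token))
    return findings


print(scan_line('aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"'))
print(scan_line('secret = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"'))
print(scan_line('padding = "aaaaaaaaaaaaaaaaaaaaaaaa"'))
```

To cover removed secrets, a scanner would apply a function like this to every historical diff rather than only the current working tree, since the deleted lines remain in the commit history.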
GitGuardian's 2023 State of Secrets Sprawl report found that 10 million new secrets were exposed in public GitHub commits in 2022 — a 67% increase from 2021. The most commonly exposed types were: Google API keys (21%), database connection strings (15%), AWS access keys (12%), and generic high-entropy tokens (31%). The median time from secret exposure to first external access (when monitored) was 4 seconds — automated credential-harvesting bots continuously monitor the GitHub public event stream via the API.
The shift from periodic assessment to continuous monitoring is the defining architectural change in modern EASM. The components of an effective pipeline are:
The September 2022 Uber breach is instructive because each failure point corresponded to a gap in continuous monitoring. A contractor's credentials for Uber's internal VPN were stored in a script in a private GitHub repository accessible to the attacker. The attacker used MFA fatigue to gain VPN access, then found an internal network share containing a PowerShell script with hardcoded credentials for Uber's Privileged Access Management (PAM) system. Each of these artefacts — the GitHub secret, the network-accessible share, the hardcoded credentials — would have been flagged by properly configured continuous monitoring. The Senate Commerce Committee's 2023 review cited the incident as a case study for the type of systemic monitoring failures that EASM platforms are designed to prevent.
LLM-assisted triage in continuous monitoring pipelines must account for false-positive fatigue. If the AI flags every new certificate or DNS change as critical, analysts stop responding. Effective implementations use the LLM to classify changes into tiers (P1 immediate review, P2 within 24h, P3 weekly batch) rather than as a binary alerting system. The classification criteria — novelty of service type, proximity to sensitive data systems, consistency with known-good patterns — can be encoded in the system prompt and refined through feedback on analyst decisions over time.
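The tiering logic can be prototyped with explicit rules before any of it is delegated to an LLM. The sketch below encodes the three criteria named above; the `ChangeEvent` fields, the `classify` helper, and the specific rule combinations are hypothetical and represent one possible policy rather than a recommended one.

```python
from dataclasses import dataclass


@dataclass
class ChangeEvent:
    asset: str
    service: str
    near_sensitive_data: bool     # proximity to sensitive data systems
    matches_known_pattern: bool   # consistent with known-good naming or deployment patterns


def classify(event, known_services):
    """Return "P1" (immediate), "P2" (within 24h) or "P3" (weekly batch)."""
    novel = event.service not in known_services  # novelty of service type
    unusual = novel or not event.matches_known_pattern
    if event.near_sensitive_data and unusual:
        return "P1"
    if unusual:
        return "P2"
    return "P3"


KNOWN_SERVICES = {"https", "smtp"}

EVENTS = [
    ChangeEvent("db-admin.example.com", "mongodb", True, False),
    ChangeEvent("promo.example.com", "https", False, False),
    ChangeEvent("www2.example.com", "https", False, True),
]

tiers = [classify(e, KNOWN_SERVICES) for e in EVENTS]
print(tiers)
```

An explicit rule set like this also gives a baseline against which the LLM's tier assignments can be compared, so that disagreements become the feedback signal rather than every analyst decision.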
Meridian Capital Group has approved a continuous external attack-surface monitoring programme. You need to design the full pipeline: data sources, alerting logic, AI triage layer, escalation thresholds, and a response playbook for the three most likely exposure types (new subdomain, exposed credential, open cloud storage).
Use the AI to work through the architecture, challenge assumptions (what if the CT monitoring misses a subdomain? what if the GitHub secret was in a private repo?), and draft a concise monitoring programme design document.