In late 2009, Google's security team discovered that attackers who had compromised the company's networks had not simply stumbled in. Subsequent investigation of the campaign, which became known as Operation Aurora, indicated that the threat actors (later attributed to groups linked to China's People's Liberation Army) had spent weeks aggregating open-source data about Google's supply chain partners, employee LinkedIn profiles, and publicly filed patent documents before selecting specific individuals in the recruiting team as their initial entry vector. The target list was not random; it was the output of a structured aggregation process.
Aurora demonstrated a principle that defenders now encode into threat models: the quality of a target list determines the quality of an intrusion. Aggregation precedes exploitation.
Reconnaissance generates volume. A single passive sweep of a mid-sized organization can produce thousands of subdomain records, hundreds of employee names, dozens of technology fingerprints, and scores of leaked credential fragments, all stored in different formats, collected at different times, and carrying different confidence levels. Raw volume is not intelligence.
The aggregation problem is threefold: deduplication (the same asset appears under different names in different tools), normalization (data arrives in incompatible schemas), and provenance tracking (knowing which source produced which finding and how fresh it is). Skipping any of these produces target lists that waste time, miss critical assets, or, worse, generate false positives that lead operators to probe the wrong systems.
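As a minimal sketch of normalization with provenance, the example below maps two differently shaped records onto one schema. The `Finding` class, the field names (`hostname`, `timestamp`, `Data`, `Updated`), and the source labels are illustrative assumptions, not the actual export formats of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Finding:
    entity_type: str       # e.g. "host" or "person"
    value: str             # canonical value (lowercased hostname, no trailing dot)
    source: str            # which tool produced the finding (provenance)
    observed_at: datetime  # when the source saw it (freshness)


def from_scanner_json(record: dict) -> Finding:
    # Hypothetical JSON shape: {"hostname": ..., "timestamp": ISO-8601 string}
    return Finding(
        entity_type="host",
        value=record["hostname"].strip().lower().rstrip("."),
        source="scanner_json",
        observed_at=datetime.fromisoformat(record["timestamp"]),
    )


def from_csv_row(row: dict) -> Finding:
    # Hypothetical CSV columns: "Data" (hostname) and "Updated" (epoch seconds)
    return Finding(
        entity_type="host",
        value=row["Data"].strip().lower().rstrip("."),
        source="csv_export",
        observed_at=datetime.fromtimestamp(int(row["Updated"]), tz=timezone.utc),
    )


findings = [
    from_scanner_json({"hostname": "VPN.Example.com.", "timestamp": "2024-03-01T12:00:00+00:00"}),
    from_csv_row({"Data": " vpn.example.com", "Updated": "1709300000"}),
]
```

Both records end up with the same canonical `value` while keeping their own `source` and `observed_at`, which is what later deduplication and freshness checks depend on.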
AI assists at each stage, but it requires clean inputs. Feeding a language model a raw dump of Shodan JSON, SpiderFoot output, and Maltego exports without first normalizing them produces hallucinated synthesis. The workflow matters as much as the tooling.
Professional red teams and threat intelligence units converge on a four-stage model regardless of which specific tools they use:
LLMs handle steps 2 and 4 most effectively. Normalization prompts can instruct a model to parse heterogeneous tool output and emit a structured JSON entity list. Ranking prompts can score entities given a mission brief. Steps 1 and 3 are better handled by deterministic code (hashing, set operations) because LLMs are unreliable deduplicators at scale.
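The deterministic side of this split might look like the following sketch, which deduplicates findings by hashing a canonical key and merges their sources with set operations. The record layout (`type`, `value`, `source`) and the tool names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def canonical_key(entity_type: str, value: str) -> str:
    """Stable key: the same asset hashes identically regardless of source formatting."""
    normalized = f"{entity_type}|{value.strip().lower().rstrip('.')}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(findings: list[dict]) -> list[dict]:
    """Merge findings that share a canonical key, keeping every source for provenance."""
    merged: dict[str, dict] = {}
    sources: dict[str, set] = defaultdict(set)
    for f in findings:
        key = canonical_key(f["type"], f["value"])
        sources[key].add(f["source"])
        merged.setdefault(
            key, {"type": f["type"], "value": f["value"].strip().lower().rstrip(".")}
        )
    return [{**entity, "sources": sorted(sources[key])} for key, entity in merged.items()]


# Hypothetical raw findings from two tools
raw = [
    {"type": "host", "value": "Mail.Example.com", "source": "tool_a"},
    {"type": "host", "value": "mail.example.com.", "source": "tool_b"},
    {"type": "host", "value": "vpn.example.com", "source": "tool_a"},
]
unique = deduplicate(raw)
```

Because the key is a pure function of the normalized value, the result is reproducible across runs, which a model-based deduplicator cannot guarantee.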
Not all reconnaissance findings carry equal weight. The table below reflects prioritization conventions informed by MITRE ATT&CK (whose Reconnaissance and Resource Development tactics replaced the former PRE-ATT&CK matrix), PTES, and the Verizon DBIR analysis of initial access vectors:
| Entity Type | Why It Matters | Default Priority |
|---|---|---|
| Credentials (leaked/reused) | Direct authentication bypass; fastest path to access | Critical |
| Internet-exposed admin interfaces | High-value, often misconfigured; frequent CVE targets | Critical |
| VPN / remote access endpoints | Perimeter entry; targeted in 2020-2023 ransomware waves | High |
| Senior/privileged employees (OSINT) | Spear-phishing and BEC targeting | High |
| Third-party / supply chain vendors | Indirect access; SolarWinds pattern | High |
| Unpatched public-facing services | Exploitable if CVE exists; moderate exploitation complexity | Medium |
| Internal subdomains (discovered) | Attack surface mapping; may reveal internal architecture | Medium |
| Technology stack details | Narrows exploit selection; useful for payload crafting | Low-Medium |
A subdomain confirmed via active DNS resolution in a live engagement is a different kind of finding from one inferred from a certificate transparency log eighteen months ago. Intelligence without timestamps is intelligence without confidence. The industry term is confidence decay: data certainty decreases as a function of age, especially for volatile entities like IP addresses, cloud instances, and employee roles.
When constructing target lists, findings older than 90 days should be flagged for re-verification before operational use. Findings older than 180 days in cloud-heavy environments (where infrastructure turns over frequently) should be treated as speculative until confirmed. These thresholds are documented in CISA's Threat Intelligence Sharing Framework and reflected in commercial TIP (Threat Intelligence Platform) products like Anomali and Recorded Future.
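The age thresholds above can be applied mechanically. The sketch below is one possible encoding; the tier names (`current`, `reverify`, `speculative`) and the `cloud_heavy` flag are assumptions introduced for the example.

```python
from datetime import datetime, timedelta, timezone


def freshness_tier(observed_at: datetime, now: datetime, cloud_heavy: bool = False) -> str:
    """Classify a finding by age using the 90- and 180-day thresholds from the text."""
    age = now - observed_at
    if cloud_heavy and age > timedelta(days=180):
        return "speculative"  # treat as unconfirmed until re-checked
    if age > timedelta(days=90):
        return "reverify"     # flag for re-verification before operational use
    return "current"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
recent = freshness_tier(now - timedelta(days=10), now)
aging = freshness_tier(now - timedelta(days=120), now)
stale_cloud = freshness_tier(now - timedelta(days=200), now, cloud_heavy=True)
```

Passing `now` explicitly rather than reading the clock inside the function keeps the classification reproducible when a list is re-scored later.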
Aggregation is not data collection; it is data refinement. The goal is to reduce thousands of raw findings to dozens of high-confidence, scored entities that an operator can act on without wasting effort on noise. AI accelerates normalization and scoring; human judgment validates provenance and mission relevance.
You have completed passive reconnaissance against a hypothetical target organization and collected output from three tools: Shodan (JSON), SpiderFoot (CSV), and a manual LinkedIn scrape (plain text notes). The data is inconsistent, partially duplicated, and unscored.
Your AI assistant is trained to help you normalize this data, resolve duplicates, assign confidence tiers, and produce a ranked entity list. Work through at least three exchanges to complete the lab.
When FireEye disclosed the SolarWinds Orion compromise in December 2020, subsequent analysis by Microsoft, CrowdStrike, and CISA revealed that the threat actors (attributed to Russia's SVR, designated Cozy Bear / APT29) had built a remarkably precise target list. Of the roughly 18,000 organizations that received the backdoored Orion update, the attackers actively exploited only a small fraction, on the order of 100, including government agencies, defense contractors, and cybersecurity vendors. The rest received the implant, but it was never activated.
This selectivity reflected a sophisticated prioritization model: the actors had pre-identified which compromised hosts belonged to high-value organizations and allocated their limited operational bandwidth accordingly. Defenders use the same logic in reverse: a well-scored recon output tells a red team which systems to prioritize before the blue team detects the operation.
Priority is a composite score, not a single metric. Red teams and threat intelligence analysts score targets across at least four independent dimensions (for example access likelihood, mission alignment, data freshness, and detection risk), then weight them by mission objective.
Language models can apply weighted scoring rubrics consistently across large entity lists, a task that is tedious and error-prone for humans working manually. The key is specificity in the prompt. Vague instructions ("rank these by importance") produce inconsistent output. Explicit rubrics with defined weights produce reproducible scores.
A well-structured scoring prompt includes: the mission objective, the entity list with all available attributes, an explicit scoring rubric with numerical weights, a tie-breaking rule, and a required output format. The model's role is to apply the rubric mechanically, not to invent criteria the prompt does not specify.
Mission: Gain access to financial data systems. Score each host 0-100.
Access Likelihood (40%): 0=no known vulns, 40=confirmed RCE CVE with PoC.
Mission Alignment (35%): 0=unrelated infrastructure, 35=direct path to finance VLAN.
Data Freshness (15%): 0=data older than 180 days, 15=confirmed live today.
Detection Risk penalty (10%): subtract 0-10 based on known EDR/WAF presence.
Output: JSON array sorted descending by score with rationale per entity.
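A deterministic version of the rubric above is useful for cross-checking a model's arithmetic. The sketch below assumes the component values have already been assigned by an analyst or a model; the host names are hypothetical. Note that, as the rubric is written, the three additive components total at most 90, so the highest attainable score is 90, and the sketch clamps the result at 0 so a heavy penalty cannot produce a negative score.

```python
def score_entity(access: int, alignment: int, freshness: int, detection_penalty: int) -> int:
    """Apply the rubric above: three additive components minus a detection penalty."""
    if not (0 <= access <= 40 and 0 <= alignment <= 35 and 0 <= freshness <= 15):
        raise ValueError("component outside its rubric range")
    if not 0 <= detection_penalty <= 10:
        raise ValueError("detection penalty must be between 0 and 10")
    return max(0, access + alignment + freshness - detection_penalty)


# Hypothetical hosts with hand-assigned component values
hosts = {
    "vpn-gw": score_entity(access=40, alignment=20, freshness=15, detection_penalty=5),
    "intranet-wiki": score_entity(access=10, alignment=5, freshness=0, detection_penalty=8),
}
ranked = sorted(hosts.items(), key=lambda item: item[1], reverse=True)
```

Comparing `ranked` against the model's JSON output is a cheap way to catch cases where the model drifted from the stated weights.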
Scoring models are more defensible, and more accurate, when calibrated against empirical attack data. The MITRE ATT&CK dataset, Verizon DBIR annual reports, and CISA's Known Exploited Vulnerabilities (KEV) catalog all provide frequency data on which asset types are most commonly exploited in real incidents. Incorporating this data into scoring weights prevents analysts from over-weighting exotic attack paths at the expense of the mundane paths attackers actually use.
For example, CISA's 2023 advisory on routinely exploited vulnerabilities showed that internet-facing VPN appliances (Fortinet, Pulse Secure, Citrix) and unpatched Exchange servers accounted for a disproportionate share of initial access events. An AI scoring model that had not been calibrated against this data might rate an exotic web application vulnerability higher than a known-exploited VPN CVE, inverting real-world risk.
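One simple way to fold KEV data into scoring is a fixed bonus for entities whose CVEs appear in the catalog. The sketch below uses a small inline sample shaped like the catalog's JSON feed (a `vulnerabilities` list with `cveID` entries); the bonus of 15 points and the 100-point cap are arbitrary assumptions, and a real workflow would load the downloaded catalog file instead of the sample.

```python
import json

# Inline sample mimicking the catalog's JSON layout; not the full catalog
kev_sample = json.loads("""
{"vulnerabilities": [
  {"cveID": "CVE-2024-21762", "vendorProject": "Fortinet"},
  {"cveID": "CVE-2021-34473", "vendorProject": "Microsoft"}
]}
""")
kev_ids = {v["cveID"] for v in kev_sample["vulnerabilities"]}


def calibrated_score(base_score: int, cves: list[str], kev_bonus: int = 15) -> int:
    """Raise the score when any associated CVE is known to be exploited in the wild."""
    bonus = kev_bonus if any(cve in kev_ids for cve in cves) else 0
    return min(100, base_score + bonus)


vpn = calibrated_score(60, ["CVE-2024-21762"])
webapp = calibrated_score(65, ["CVE-0000-00000"])  # placeholder ID, not in the sample
```

With the bonus applied, the known-exploited VPN finding outranks the web application finding even though its base score was lower, which is the correction the paragraph above describes.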
AI-generated scores require human review before operational use. Three failure modes appear repeatedly in practice:
AI scoring models are force multipliers for applying consistent rubrics across large entity lists. They are not autonomous decision-makers. Every AI-generated target list requires a human review pass focused on context, calibration against real-world attack frequency data, and explicit checks for the failure modes that models systematically miss.
You have a normalized entity list from a red team engagement targeting a mid-sized financial institution. The list includes 12 hosts: two internet-facing VPN endpoints (Fortinet, one with a known KEV CVE), three Exchange servers (one unpatched), four internal web apps (behind WAF), and three admin interfaces (RDP, SSH, Webmin). Your engagement objective is to reach the core banking system.
Work with the AI to design a scoring rubric, apply it to this entity list, and critically evaluate the output for the three documented failure modes. Complete at least three exchanges.
The 2013 Target Corporation breach, in which 40 million payment cards were compromised, entered security history not because of a sophisticated zero-day but because of a relationship that nobody had mapped: the HVAC vendor Fazio Mechanical Services had network access to Target's systems for electronic billing, contract submission, and project management. When attackers compromised Fazio's credentials via a phishing email, that relationship became a traversal path into Target's payment processing network.
The relationship existed in Target's vendor management system. It was discoverable through OSINT: Fazio's website listed Target as a client. What was missing was a graph that connected "third-party with network access" to "payment card environment" and flagged the traversal risk. Graph analysis makes those connections explicit before attackers find them.
A flat target list answers the question "what assets exist?" A relationship graph answers "how do these assets connect, and what paths exist between them?" The second question is operationally more important because attackers do not move in straight lines from the internet to the crown jewels; they traverse relationships: vendor → internal network → database server → backup system → offsite storage.
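The "what paths exist" question can be answered with a plain breadth-first search over an adjacency map. The sketch below is a minimal illustration; the graph contents and the meaning of an edge ("has access to") are assumptions for the example.

```python
from collections import deque


def find_paths(graph: dict[str, list[str]], start: str, goal: str) -> list[list[str]]:
    """Breadth-first enumeration of simple paths from start to goal (shortest first)."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in path:  # skip nodes already on this path to avoid cycles
                queue.append(path + [neighbor])
    return paths


# Hypothetical directed "has access to" edges
access_graph = {
    "vendor": ["internal network"],
    "internal network": ["database server", "file share"],
    "database server": ["backup system"],
    "backup system": ["offsite storage"],
}
routes = find_paths(access_graph, "vendor", "offsite storage")
```

Enumerating every simple path is only practical on small graphs; for larger ones a graph database or a dedicated library would be the more realistic choice.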
Relationship mapping in OSINT typically builds graphs across several entity types: organizations (subsidiaries, vendors, partners), people (employees, contractors, and executives, mapped by role, access level, and reporting relationships), infrastructure (shared hosting, shared ASNs, certificate reuse, DNS relationships), and technologies (shared software stacks that imply shared vulnerabilities).
| Relationship Type | Discovery Method | Attack Relevance |
|---|---|---|
| Vendor with network access | Procurement filings, vendor portal subdomains, job postings | Supply chain pivot (Target 2013, SolarWinds 2020) |
| Shared TLS certificate | Censys, crt.sh (SANs reveal related domains) | Infrastructure mapping; C2 domain correlation |
| Subsidiary with weaker security | Corporate filings, Crunchbase, LinkedIn org charts | Indirect entry; subsidiary may share AD domain |
| Employee with privileged role and reused credential | LinkedIn + HaveIBeenPwned or leaked DB correlation | Spear-phish or credential stuffing entry |
| Shared ASN / hosting provider | BGP routing data, WHOIS, RIPEstat | Co-located assets may share vulnerabilities |
| Technology dependency | Job postings, GitHub repos, BuiltWith/Wappalyzer | Shared software = shared CVEs across business units |
Graph construction is still largely a human-directed or algorithmic task; tools like Maltego, BloodHound (for Active Directory graphs), and Neo4j handle the structural work. AI's contribution comes at the interpretation layer: given a graph structure, which paths represent the highest-value traversal routes? Which relationships are surprising given the organization's stated architecture? Which nodes are chokepoints whose compromise would affect the most downstream assets?
LLMs can also assist with relationship inference: identifying likely connections that have not yet been confirmed through direct OSINT. For example, a job posting for a "Fortinet-certified network engineer" at a target organization implies the presence of a Fortinet VPN or firewall, even if the device has not been directly fingerprinted. This kind of inference, applied systematically across dozens of job postings and public filings, builds a partial graph of inferred relationships that directs subsequent active reconnaissance more efficiently.
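A rule-based baseline for this kind of inference helps make clear what the model is adding. The sketch below turns keyword matches in posting text into edges that are explicitly marked as inferred and low-confidence. The keyword table, the 0.4 confidence value, and the edge fields are all assumptions chosen for illustration.

```python
# Hypothetical keyword-to-technology hints; a real list would be curated by the team
TECH_HINTS = {
    "fortinet": "Fortinet VPN or firewall",
    "citrix": "Citrix remote access",
    "palo alto": "Palo Alto firewall",
}


def infer_edges(org: str, postings: list[str], confidence: float = 0.4) -> list[dict]:
    """Turn keyword matches in job postings into low-confidence 'uses' edges."""
    edges = []
    seen = set()
    for text in postings:
        lowered = text.lower()
        for keyword, technology in TECH_HINTS.items():
            if keyword in lowered and technology not in seen:
                seen.add(technology)  # one edge per technology, however often it appears
                edges.append({
                    "from": org, "to": technology, "relation": "uses",
                    "confidence": confidence, "evidence": "job posting",
                    "inferred": True,
                })
    return edges


edges = infer_edges("example-org", [
    "Seeking Fortinet-certified network engineer",
    "Citrix administrator (contract), Fortinet experience a plus",
])
```

Keeping the `inferred` flag and a separate confidence value on each edge lets later stages distinguish these guesses from relationships confirmed by direct observation.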
BloodHound, released publicly in 2016 and now maintained by SpecterOps, popularized graph-based analysis for Active Directory environments. It maps trust relationships, group memberships, and delegation paths to find attack paths from any compromised account to Domain Admin. Its underlying principle, that privilege escalation follows relationship paths rather than direct access, applies equally to OSINT-derived external graphs.
In graph theory, a chokepoint (or articulation point) is a node whose removal disconnects the graph. In attack planning, chokepoints are nodes whose compromise provides access to the most downstream assets. Identifying chokepoints in a recon graph (the single SSO provider that authenticates 12 internal apps, the shared bastion host, the central AD domain controller) reveals which targets yield disproportionate operational value.
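The graph-theory definition can be checked directly on a small graph. The sketch below uses a brute-force test (delete each node in turn and see whether the remainder is still connected) rather than the faster depth-first-search algorithm usually used for articulation points. It assumes an undirected, initially connected graph; the node names are hypothetical.

```python
def reachable(graph: dict[str, set[str]], start: str, removed: str) -> set[str]:
    """Nodes reachable from start when `removed` is deleted from the graph."""
    seen, stack = {start}, [start]
    while stack:
        for neighbor in graph[stack.pop()]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def articulation_points(graph: dict[str, set[str]]) -> set[str]:
    """Brute force: a node is an articulation point if deleting it disconnects the rest."""
    points = set()
    for node in graph:
        others = [n for n in graph if n != node]
        if others and len(reachable(graph, others[0], node)) < len(others):
            points.add(node)
    return points


# Hypothetical undirected graph, stored as symmetric adjacency sets
g = {
    "internet": {"sso"},
    "sso": {"internet", "app1", "app2"},
    "app1": {"sso"},
    "app2": {"sso", "db"},
    "db": {"app2"},
}
chokepoints = articulation_points(g)
```

Here the SSO node and `app2` are the articulation points: removing either one cuts some other node off from the rest of the graph.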
AI can assist with chokepoint identification by analyzing text descriptions of architecture (from job postings, blog posts, conference talks, and open GitHub repositories) and flagging which described components have the highest "fan-out", that is, the largest number of dependent systems. This inference-based chokepoint analysis often surfaces targets that simple asset enumeration misses entirely.
Relationship graphs transform a target list from a collection of independent assets into a network where paths, chokepoints, and indirect access vectors become visible. AI accelerates the interpretation of graph structure and the inference of likely relationships from indirect evidence, but the graph itself must be built on verified OSINT before AI-generated inferences are added as lower-confidence edges.
You are mapping the attack surface of a hypothetical mid-sized healthcare organization. Your OSINT has revealed the following: three subsidiary hospitals (each with their own IT but sharing a central EHR system), two IT vendors with documented remote access (one listed on the vendor's own website), a central SSO provider used by all subsidiaries, and an Azure AD tenant shared across the group. Job postings reveal Palo Alto firewalls and a Citrix remote access deployment.
Work with the AI to build a text-based relationship graph, identify chokepoints, and infer additional relationships from the indirect evidence. Complete at least three exchanges.
In FireEye's 2020 public disclosure of its own breach, attributed to APT29, the company released an unusually candid account of how the attackers operated. FireEye's subsequent red team methodology documentation (published as part of its transparency response) described how effective red teams produce target lists with explicit attack phase assignments: each target is not just ranked but assigned to a specific phase (initial access, persistence, lateral movement, objective) so that operators know at which point in the engagement to act on each asset. A flat ranked list without phase assignments creates operational confusion during time-pressured engagements.
This insight, that a target list is an operational document and not just an intelligence output, shapes how modern red teams structure their deliverables from the recon phase forward.
A scored target list is intelligence. An operational target list is a planning document that tells operators what to do, in what order, under what conditions. The transition from intelligence to operations requires adding four elements that are absent from a purely analytical list: phase assignments, dependency ordering, alternative paths, and verification checks.
Given a scored entity list and a relationship graph, an LLM can generate a draft operational target list that includes phase assignments and dependency ordering. This task suits AI well because it involves applying structured rules (ATT&CK phase definitions, dependency logic) to a well-defined input, not open-ended inference.
The most effective prompt pattern provides: the scored entity list with all attributes, the engagement objectives mapped to ATT&CK tactics, the relationship graph as a dependency map, and a required output template. The model fills the template; the human validates phase assignments for operational plausibility and adds alternative paths from their own knowledge of the environment.
Target ID: TGT-007
Asset: vpn.targetcorp.com (Fortinet FortiGate, CVE-2024-21762 confirmed)
Priority Score: 87/100
ATT&CK Phase: Initial Access (T1190: Exploit Public-Facing Application)
Dependencies: None (first-hop target)
Primary Vector: CVE-2024-21762 PoC available, no auth required
Backup Vector: Credential stuffing using leaked creds from HaveIBeenPwned set (3 matches)
Verification Check: Confirm service live on port 443, confirm firmware version pre-7.4.3
Confidence: High (confirmed live, CVE validated against banner, creds fresh <30 days)
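The `Dependencies` field in a record like the one above is what makes dependency ordering computable. As a sketch, Python's standard `graphlib` module (3.9+) can turn a dependency map into a valid working order and will raise an error if the dependencies are circular. Only TGT-007 comes from the example record; the other target IDs and their relationships are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each target lists the targets that must come first
dependencies = {
    "TGT-007 vpn": set(),
    "TGT-012 mail": {"TGT-007 vpn"},
    "TGT-015 sharepoint": {"TGT-012 mail"},
    "TGT-020 file server": {"TGT-015 sharepoint"},
}

# static_order() yields every target after all of its prerequisites
order = list(TopologicalSorter(dependencies).static_order())
```

A model-generated ordering can be validated the same way: feed its claimed dependencies into the sorter and confirm that the proposed sequence never places a target before something it depends on.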
Operational target lists require a deconfliction pass before use. Deconfliction in red team operations means verifying that every target is within the authorized scope of the engagement: the IP ranges, domains, and personnel explicitly included in the rules of engagement. AI-generated lists must be cross-referenced against the scope document because models have no inherent awareness of engagement boundaries.
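The mechanical part of this cross-reference is straightforward to automate, leaving humans to resolve the exceptions. The sketch below checks candidate targets against an allow-list using the standard `ipaddress` module; the network range and domain are documentation placeholders standing in for a real scope document.

```python
import ipaddress

# Hypothetical scope taken from a rules-of-engagement document
IN_SCOPE_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
IN_SCOPE_DOMAINS = ["example.com"]


def in_scope(target: str) -> bool:
    """True if target is an in-scope IP address or a hostname under an in-scope domain."""
    try:
        address = ipaddress.ip_address(target)
    except ValueError:
        # Not an IP address: treat it as a hostname and match on domain boundaries
        host = target.lower().rstrip(".")
        return any(host == d or host.endswith("." + d) for d in IN_SCOPE_DOMAINS)
    return any(address in network for network in IN_SCOPE_NETWORKS)


candidates = ["203.0.113.10", "198.51.100.7", "vpn.example.com", "notexample.com"]
out_of_scope = [t for t in candidates if not in_scope(t)]
```

Matching on `"." + domain` rather than a bare suffix matters: it keeps a lookalike such as `notexample.com` from passing as part of `example.com`. A hostname that passes this check may still resolve to an out-of-scope address, so the result is a first filter, not an authorization.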
A second form of deconfliction applies in threat intelligence contexts: verifying that a target does not belong to a government or critical infrastructure entity where unauthorized access would create legal exposure beyond the engagement's authorization. This check is always human-performed; it cannot be delegated to an AI system.
A target list produced at the end of the recon phase is not static. As the engagement proceeds and new information surfaces (a host proves unreachable, a credential works and opens unexpected access, a new subdomain is discovered during lateral movement), the list must be updated. Effective red teams maintain a living target list with a change log, treating it as a document under version control rather than a static report appendix.
AI can assist with list maintenance by processing operator notes in natural language and updating structured records accordingly, a workflow that reduces the documentation burden during high-tempo operations. The key requirement is that every update is timestamped, attributed to a specific operator observation, and flagged for lead review before it affects operational planning.
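The three requirements in that last sentence map naturally onto a record with an append-only change log. The sketch below is one possible shape; the class, field names, and status values are assumptions, and a real team would more likely keep this in a tracked file or database than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TargetRecord:
    target_id: str
    status: str = "planned"
    changelog: list[dict] = field(default_factory=list)

    def propose_update(self, new_status: str, operator: str, note: str) -> None:
        """Record a proposed change; it does not take effect until a lead approves it."""
        self.changelog.append({
            "at": datetime.now(timezone.utc).isoformat(),  # timestamped
            "operator": operator,                          # attributed
            "note": note,
            "proposed_status": new_status,
            "approved": False,                             # pending lead review
        })

    def approve_latest(self) -> None:
        """Lead review: apply the most recent proposed status."""
        entry = self.changelog[-1]
        entry["approved"] = True
        self.status = entry["proposed_status"]


record = TargetRecord("TGT-007")
record.propose_update("unreachable", operator="op1", note="host did not respond during check")
status_before_review = record.status
record.approve_latest()
```

Separating `propose_update` from `approve_latest` means that a model converting operator notes into structured updates can only ever create proposals, never change the operational status on its own.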
The final target list is where intelligence becomes action. Adding phase assignments, dependency ordering, alternative paths, and verification checks transforms a scored ranking into an engagement plan. AI generates the draft structure efficiently; human operators validate operational plausibility, enforce scope boundaries, and maintain the list as a living document throughout the engagement.
You have completed scoring and graph analysis for a red team engagement targeting a logistics company. Your scored entity list contains: a Fortinet VPN (score 87, KEV CVE confirmed), an unpatched Exchange server (score 72, CVE-2021-34473), a CFO with reused credential found in 2023 breach dump (score 68), an internal SharePoint server (score 61, accessible from Exchange), and a file server on the finance VLAN (score 55, objective target). Engagement objective: access finance VLAN file server.
Work with the AI to produce a complete operational target list with phase assignments, dependency ordering, alternative paths, and verification checks. Then ask it to identify what deconfliction questions you need to resolve before operations begin. Complete at least three exchanges.