AI-Augmented Reconnaissance & OSINT · Introduction

The Intelligence Flood: When Machines Learn to Read the World

Every age gets the surveillance it deserves — this one built it from public data and open APIs.

In 1874, Allan Pinkerton, founder of the Pinkerton National Detective Agency, published The Expressman and the Detective, an early popular account of systematic information gathering against a target. His agents compiled dossiers from newspaper clippings, postmaster interviews, and railroad manifests: entirely public records, assembled with discipline. The practice worked because most people assumed public information was effectively invisible through sheer volume. That assumption held for roughly a century.

The assumption cracked in 2003 when Jeff Jonas, then at Systems Research & Development, demonstrated NORA — Non-Obvious Relationship Awareness — to the Department of Homeland Security. NORA could cross-reference casino employment records, hotel registrations, and watch lists in near real time, surfacing connections invisible to human analysts. What changed was not the availability of the data; casino records had always existed. What changed was the cost of correlation, which dropped from weeks of manual labor to milliseconds. By 2013, Edward Snowden's leaked NSA slides showed that nation-states had extended exactly this logic to internet-scale data collection. By 2023, commercial AI tools had placed equivalent analytic power — pattern recognition, entity resolution, natural-language summarization — in the hands of anyone with a browser.

This course teaches that capability honestly: what AI-augmented OSINT can actually do, where it fails, how defenders detect it, and where legal and ethical lines sit. Each lesson pairs documented real-world cases with hands-on practice. You will leave with concrete technique, calibrated skepticism, and a clear-eyed view of what the tools cannot do. No hype, no fear — just tradecraft in the current environment.

If you finish every module, here's who you become:

  • You'll understand why AI collapsed the cost of correlation from weeks of manual labor to milliseconds — and what that means for every engagement you run.
  • You will run passive OSINT campaigns using LLM-assisted querying that aggregate public data on people and organizations without tipping off the target.
  • You'll map attack surfaces at scale — enumerating subdomains, exposed services, cloud assets, and shadow IT across large estates using AI-powered enumeration tools.
  • You will apply entity resolution, tech-stack fingerprinting, and identity harvesting techniques with calibrated skepticism about where AI inference breaks down.
  • You'll turn raw recon output into a prioritized, scoped engagement plan using AI-assisted triage — the skill that separates signal from noise in real operations.
  • You will operate with tradecraft discipline: managing attribution risk, rate limits, and detection surface so AI-assisted recon doesn't burn the engagement.
  • You're becoming an analyst who reads the intelligence environment clearly — knowing what the tools can do, where they fail, and where the legal and ethical lines sit.
Lesson 1 · Reconnaissance in the AI Era

What Changed and What Did Not

The fundamentals of intelligence collection are unchanged. The economics of scale are not.
What specifically does AI change about reconnaissance — and what parts of the discipline remain entirely human?

On 4 March 2018, Sergei Skripal and his daughter Yulia collapsed on a bench in Salisbury, England, poisoned with the nerve agent Novichok. Within days, Bellingcat, a volunteer open-source intelligence collective founded by Eliot Higgins in 2014, began correlating flight records, hotel registrations, and passport metadata drawn entirely from publicly accessible Russian databases and social media. By early October 2018, they had published the real names, military unit affiliations, and travel histories of the two GRU officers who carried out the attack: Colonel Anatoliy Chepiga (named in late September) and Dr. Alexander Mishkin (named in early October). The Russian state denied the identification for six days before the evidence became untenable. No classified intelligence was involved. The entire case was built from open sources, cross-referenced manually, which makes it a preview of what the same work looks like when AI handles the cross-referencing step at scale.

1.1 — The Pre-AI Baseline

Open-source intelligence has existed as a formal discipline since at least World War II, when the Foreign Broadcast Information Service monitored Axis radio transmissions. The intelligence cycle — direction, collection, processing, analysis, dissemination — was codified by the CIA in the 1950s and has not fundamentally changed. What practitioners did manually was time-bounded by human attention: a skilled analyst could read perhaps 300 documents a day, cross-reference perhaps 50 entities across sources, and maintain working memory of perhaps a few hundred relationships.

The surface area of public information was, for most of that history, manageable. The internet changed that beginning around 1995. By 2010, a single major news event generated more public text, images, and metadata in 24 hours than a Cold War analyst would encounter in a career. The bottleneck shifted from data access to data processing.

1.2 — What AI Actually Changes

The honest answer is: three things specifically, and nothing else.

Scale of correlation. Large language models and embedding-based retrieval systems can identify conceptual relationships across millions of documents in seconds. The Bellingcat-style work on Skripal took weeks with a team. Equivalent entity-resolution across a comparable document set now takes minutes with tools like Palantir AIP, Babel Street Illuminate, or even general-purpose LLMs given structured prompts and document context.

Language barrier elimination. Prior to 2020, multilingual OSINT required either native speakers or expensive translation services. By 2022, OpenAI's GPT-3.5 and Google's PaLM could translate, summarize, and analyze text in over 90 languages with accuracy sufficient for intelligence purposes. A single analyst can now work across Mandarin, Arabic, Russian, and Farsi sources simultaneously.

Synthesis under uncertainty. Human analysts excel at structured analytic techniques but struggle with holding many uncertain hypotheses simultaneously. LLMs do not solve this problem — they introduce hallucination risks — but they do enable rapid hypothesis generation across large evidence sets, surfacing leads a human might not think to pursue.

What AI Does Not Change

Source evaluation remains entirely human. An LLM cannot determine whether a social media account represents a real person, a state-sponsored persona, or a bot farm. It cannot assess the credibility of a source it has not been trained to recognize. Analysis of Competing Hypotheses (ACH), the structured technique developed by Richards Heuer at the CIA in the 1970s, remains the gold standard for bias reduction, and no current AI system reliably performs it without human oversight.

1.3 — The Intelligence Cycle in an AI-Augmented Workflow

The classic five-phase cycle maps onto AI augmentation unevenly. Direction — defining requirements — remains human. You must know what question you are trying to answer before a tool can help you answer it. Analysts who skip this step with AI assistance typically generate fluent, confident, and wrong conclusions.

Collection is where AI has the largest near-term impact. Automated scrapers, API harvesters, and scheduled monitoring tools can sustain continuous collection across hundreds of sources with minimal human intervention. Tools like SpiderFoot, Maltego, and custom Python pipelines using the Twitter/X API (prior to its 2023 access restrictions), LinkedIn API, Shodan, and VirusTotal are standard in contemporary OSINT practice.

Processing — converting raw data into structured form — is where LLMs have become genuinely transformative. Named-entity recognition, relationship extraction, geolocation inference from image metadata, and sentiment analysis can all be partially automated. The word "partially" is load-bearing: automated extraction has error rates, and those errors compound downstream.
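
To make the compounding concrete, here is a minimal Python sketch. It assumes the errors at each stage are independent, and the accuracy figures are illustrative placeholders rather than measurements of any real tool.

```python
import math


def chained_accuracy(stage_accuracies: list[float]) -> float:
    """Probability a record survives every stage intact, assuming independent errors."""
    return math.prod(stage_accuracies)


# Illustrative figures only: entity extraction, relationship extraction, geolocation.
pipeline = [0.95, 0.90, 0.90]
print(f"End-to-end accuracy: {chained_accuracy(pipeline):.4f}")
```

Three stages that each look acceptable on their own leave roughly 77% of records fully correct, which is why every automated stage needs its own spot checks.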

Analysis benefits from AI as a brainstorming partner and synthesis engine, but structured analytic techniques still require a human to maintain logical hygiene. Dissemination — writing and presenting findings — is where AI drafting assistance is most mature and least risky.

1.4 — Key Terminology

OSINT: Open-Source Intelligence. Intelligence derived from publicly available information, defined by Executive Order 12333 and codified in the Intelligence Authorization Act for FY 2023 as a formal intelligence discipline.
SOCMINT: Social Media Intelligence. A sub-discipline of OSINT focused specifically on social platforms; first formally defined by Sir David Omand, Jamie Bartlett, and Carl Miller in a 2012 DEMOS paper.
Entity Resolution: The process of determining whether two references in different data sources refer to the same real-world entity (a person, organization, location, or event). Central to all OSINT correlation work.
Attack Surface: In reconnaissance context, the total set of publicly accessible information about a target. Includes digital footprint, physical presence indicators, organizational affiliations, and temporal patterns.
Passive Reconnaissance: Information gathering that involves no direct contact with target systems or individuals: no queries to target servers, no social engineering. The entire Bellingcat Skripal investigation was passive.
Active Reconnaissance: Gathering that involves direct interaction with target systems, such as port scanning, DNS queries, or credential stuffing. Legally distinct from passive OSINT and governed by the Computer Fraud and Abuse Act (US) and equivalent statutes internationally.

1.5 — The Legal and Ethical Frame

The most important thing to understand before touching any tool in this course is that legality and ethics are not the same axis. Something can be legal and still cause real harm; something can be technically legal in one jurisdiction and criminal in another.

In the United States, passive OSINT against publicly available data has no federal prohibition, but aggregation of individually public data points can create actionable privacy violations under state law — particularly California's CCPA (2018) and the emerging frameworks in Virginia, Colorado, and Texas. The EU's GDPR (2018) explicitly covers personal data in public sources when that data is processed to profile an individual.

For corporate and competitive intelligence, the Economic Espionage Act of 1996 and the Defend Trade Secrets Act of 2016 set boundaries around what information about competitors can be legally collected and used. For security practitioners, the CFAA's "exceeds authorized access" provision (18 U.S.C. § 1030) means that even passive-seeming actions like creating a fake social media profile to access a restricted group may be prosecutable.

The practical rule for this course: every lab uses either synthetic targets, your own systems, or explicitly designated test environments. Real-target reconnaissance, even passive, should only occur with written authorization.

Module 1 Arc

This module covers the foundational shift: what the pre-AI discipline looked like (L1), how passive collection tools work at scale (L2), how AI handles synthesis and entity resolution (L3), and how to structure an OSINT workflow that produces defensible, auditable findings (L4). Each lesson builds on the previous one. The module test at the end covers all four.

Lesson 1 Quiz

Four questions · Select the best answer for each
1. The Bellingcat investigation identifying the GRU officers who poisoned Sergei Skripal relied primarily on which type of intelligence?
Correct. The entire Skripal investigation used flight records, hotel registrations, and passport metadata from publicly accessible Russian databases and social media — no classified sources were involved.
Not quite. Bellingcat is a civilian open-source collective with no access to classified signals intelligence. Their work was entirely passive OSINT from public sources.
2. Which three specific capabilities does AI most meaningfully change in OSINT practice, according to Lesson 1?
Correct. These three are the precise areas identified: scale of correlation (millions of documents in seconds), language barrier elimination (90+ languages), and synthesis under uncertainty (hypothesis generation across large evidence sets).
Incorrect. AI does not change source evaluation (remains fully human), and it does not enable classified data access. The three specific changes are: scale of correlation, language barrier elimination, and synthesis under uncertainty.
3. In the intelligence cycle, at which phase does AI augmentation have the LEAST impact, and why?
Correct. Direction — defining what question to answer — remains entirely human. Analysts who skip this step and let AI drive the collection question typically produce fluent, confident, and wrong conclusions.
Incorrect. The lesson identifies Direction as the phase least affected by AI augmentation, precisely because you must know what you are trying to answer before any tool can help.
4. Under which US statute might creating a fake social media profile to access a restricted group constitute a federal crime, even if the group's content is semi-public?
Correct. 18 U.S.C. § 1030's "exceeds authorized access" clause has been applied to cases involving fake profiles used to access systems or communities the actor was not genuinely authorized to join.
Incorrect. The CFAA's "exceeds authorized access" provision (18 U.S.C. § 1030) is the relevant statute here. The EEA and DTSA govern trade secrets specifically; GDPR is EU law.

Lab 1 — Mapping the Intelligence Cycle

AI-assisted conceptual exercise · Complete 3 exchanges to finish

Exercise: Applying the AI-Augmented Intelligence Cycle

In this lab you will work with an AI assistant to map a real documented OSINT case, the Bellingcat MH17 investigation, onto the intelligence cycle, identifying exactly where AI augmentation would have changed the workflow and where it would not have.

The MH17 investigation, which Bellingcat published as a series of reports beginning in 2014, identified the Russian 53rd Anti-Aircraft Missile Brigade as the source of the Buk missile system that shot down Malaysia Airlines Flight 17 over Ukraine in July 2014. The entire investigation used open sources: social media posts, satellite imagery from DigitalGlobe, geolocated photographs, and Russian transport records.

Start by describing one phase of the intelligence cycle and asking the assistant how AI tools available in 2024 would have changed — or not changed — that phase in the MH17 investigation context.
OSINT Analyst Assistant
Lab 1
Ready to work through the MH17 investigation with you. This case is a landmark in OSINT history — Bellingcat's team spent months doing manually what modern AI tools could accelerate significantly. Which phase of the intelligence cycle do you want to start with: Direction, Collection, Processing, Analysis, or Dissemination? Tell me what you know about that phase, and we'll examine where AI would and wouldn't change the work.
Lesson 2 · Passive Collection at Scale

Harvesting the Open Web Without Touching the Target

The most powerful reconnaissance leaves no trace on any system it investigates.
How do automated passive collection tools work, and what are the practical limits of each approach?

On 7 May 2021, the DarkSide ransomware group encrypted Colonial Pipeline's billing and business systems, triggering a six-day shutdown of the largest fuel pipeline on the US East Coast. Post-incident analysis by Mandiant (now Google Cloud Security) and CISA found that DarkSide operators had spent at least three months conducting passive reconnaissance before deploying their payload. Using nothing more targeted than Shodan queries, LinkedIn scraping, certificate transparency log analysis, and leaked credential databases, they identified a single VPN account — without multifactor authentication — as their entry point. The username and password were found in a batch of credentials leaked from a prior unrelated breach. No active scanning of Colonial's systems was required before the compromise.

2.1 — The Passive Collection Toolkit

Passive reconnaissance tools fall into four functional categories. Understanding what each category can and cannot see is the foundation of both offensive collection and defensive exposure assessment.

Internet infrastructure databases index publicly reachable services without operator permission. Shodan, launched by John Matherly in 2009, continuously crawls IPv4 and IPv6 address space and stores banner information from the services that respond on the ports it probes. As of 2024, Shodan indexes over 1.5 billion internet-connected devices. A single Shodan query for a company's ASN or IP range reveals exposed services, software versions, SSL certificate details, and geographic distribution, all without the analyst making any connection to the target organization's systems (Shodan's crawlers did the probing; you only query Shodan). Censys, developed at the University of Michigan in 2015, offers similar coverage with stronger certificate transparency integration. FOFA, operated by Beijing Huashun Xin'an Technology, provides equivalent coverage with stronger Asian IP range depth.

DNS and certificate transparency data is often the most information-rich passive category. Certificate Transparency logs, mandated by Google's Chrome Root Store policy since April 2018, record every publicly trusted TLS certificate in append-only public logs. Tools like crt.sh and Certstream make these logs searchable in near real time, revealing the subdomains an organization has obtained publicly trusted certificates for, including internal staging servers, development environments, and acquisitions not yet publicly announced. Host names covered only by a wildcard certificate or by a private certificate authority do not appear individually.
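
As a minimal sketch of how CT data becomes a subdomain list, the Python below parses the JSON that crt.sh returns when `output=json` is added to a query. The `name_value` field holds newline-separated host names. The domain `example.com`, the sample records, and the helper names are illustrative assumptions, not an official client.

```python
import json
import urllib.parse
import urllib.request


def crtsh_url(domain: str) -> str:
    """Build a crt.sh JSON query for every name under `domain`."""
    query = urllib.parse.quote(f"%.{domain}")
    return f"https://crt.sh/?q={query}&output=json"


def fetch_ct_records(domain: str, timeout: float = 30.0) -> list[dict]:
    """Fetch raw records from crt.sh (a third-party service, not the target)."""
    with urllib.request.urlopen(crtsh_url(domain), timeout=timeout) as resp:
        return json.load(resp)


def extract_subdomains(records: list[dict], domain: str) -> list[str]:
    """Collect unique host names from the `name_value` field of each record."""
    apex = domain.lower()
    names = set()
    for record in records:
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # a wildcard entry hides the individual hosts
            if name == apex or name.endswith("." + apex):
                names.add(name)
    return sorted(names)


# Hypothetical sample shaped like crt.sh output, so the sketch runs offline.
sample = [
    {"name_value": "www.example.com\nstaging.example.com"},
    {"name_value": "*.dev.example.com"},
    {"name_value": "unrelated.example.org"},
]
print(extract_subdomains(sample, "example.com"))
```

Only `fetch_ct_records` touches the network, and it talks to crt.sh rather than to the organization being assessed, which is what keeps this step passive.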

2.2 — Search Engine Dorking and Cached Data

Google dorking — using advanced search operators to surface specific types of exposed content — predates the term OSINT but remains highly effective. The Exploit Database's Google Hacking Database (GHDB), maintained since 2004 and currently listing over 7,000 dorks, catalogs queries that reliably surface misconfigured login panels, exposed configuration files, unsecured cameras, and database files indexed inadvertently. A query like site:target.com filetype:xls "password" costs nothing and leaves no trace on target infrastructure.

The Wayback Machine — operated by the Internet Archive, which has been crawling the web since 1996 — preserves historical versions of websites including pages that have since been taken down, credentials that were briefly exposed, and organizational structures that have changed. It is a standard first step in any corporate OSINT engagement.

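Cached SERP data is distinct from Wayback Machine content: search engines such as Google, Bing, and Yandex have historically cached recent versions of indexed pages. Where a cache is exposed, copies can persist roughly 7–90 days after the live page is modified, meaning that a rapidly removed sensitive post may still be readable for days or weeks. Availability varies by engine and changes over time: Google retired its public cached-page links in early 2024, so confirm what each engine currently offers before relying on this technique.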

2.3 — Leaked Credential Databases

Troy Hunt launched Have I Been Pwned (HIBP) in December 2013 after the Adobe breach exposed 153 million accounts. As of 2024, HIBP indexes over 13 billion accounts from more than 800 breaches. The service is explicitly designed for defensive use — individuals and organizations can check their exposure — but the underlying breach data is widely available through darknet markets and Telegram channels. HIBP's Pwned Passwords API, which allows checking whether a specific password hash appears in known breach data, is now integrated into the default credential checking of Firefox, 1Password, and multiple enterprise identity platforms.
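
The Pwned Passwords check works by k-anonymity: the client hashes the password with SHA-1, sends only the first five hex characters to the range endpoint, and compares the returned suffixes locally. The sketch below shows that flow in Python. The function names and the User-Agent string are assumptions for illustration, and error handling is omitted.

```python
import hashlib
import urllib.request


def split_hash(password: str) -> tuple[str, str]:
    """Return (5-char prefix, 35-char suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(range_body: str, suffix: str) -> int:
    """Look up `suffix` in a range response made of `SUFFIX:COUNT` lines."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


def pwned_count(password: str, timeout: float = 10.0) -> int:
    """Query the range endpoint; only the 5-character prefix leaves this machine."""
    prefix, suffix = split_hash(password)
    url = f"https://api.pwnedpasswords.com/range/{prefix}"
    request = urllib.request.Request(url, headers={"User-Agent": "osint-course-sketch"})
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return count_in_range(resp.read().decode("utf-8"), suffix)
```

Calling `pwned_count` on a password you own returns how many times it appears in the indexed breach data, or 0 if it does not. Use it only on credentials you are authorized to test.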

For OSINT practitioners, leaked credential databases serve two passive purposes: establishing that a specific email address is real (it appears in a breach), and identifying password patterns that may predict current credential choices when combined with behavioral analysis. The Colonial Pipeline compromise in 2021 is the canonical example of this vector reaching catastrophic scale from a single credential lookup.

AI Augmentation at This Layer

LLMs cannot query Shodan or crt.sh directly without tool-use integrations. What they add to passive collection is downstream synthesis: given 200 Shodan results for a target organization, an LLM with code execution capability (like OpenAI's Advanced Data Analysis, introduced in 2023) can identify the five most anomalous exposed services in seconds, cross-reference CVE databases, and draft a prioritized exposure summary. The collection itself is unchanged; the triage is transformed.
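
One simple way to approximate the "most anomalous services" step without a model is to rank results by rarity within the target's own footprint. The Python sketch below does that. The record fields and sample data are hypothetical, and rarity is only one possible anomaly signal, not the method any particular tool uses.

```python
from collections import Counter


def rank_anomalies(services: list[dict], top_n: int = 5) -> list[dict]:
    """Rank services so that the rarest (port, product) pairs come first."""
    counts = Counter((s["port"], s["product"]) for s in services)
    # sorted() is stable, so services with equal counts keep their input order.
    return sorted(services, key=lambda s: counts[(s["port"], s["product"])])[:top_n]


# Hypothetical records; the field names are assumed for illustration.
results = (
    [{"ip": f"203.0.113.{i}", "port": 443, "product": "nginx"} for i in range(1, 9)]
    + [{"ip": "203.0.113.20", "port": 3389, "product": "Remote Desktop"}]
    + [{"ip": "203.0.113.21", "port": 9200, "product": "Elasticsearch"}]
)

for service in rank_anomalies(results, top_n=2):
    print(service["ip"], service["port"], service["product"])
```

Against eight ordinary web servers, the lone remote desktop and Elasticsearch hosts rise to the top. An analyst still has to decide whether a rare service is actually a risk.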

2.4 — Social Media as Passive Infrastructure Intelligence

LinkedIn, in particular, is a consistently underestimated passive intelligence source for infrastructure mapping. A company's LinkedIn page reveals: org chart depth, technology stack (job postings for "Kubernetes administrator" or "Splunk engineer" imply running deployments), facility locations, vendor relationships (job postings mentioning specific partner tools), and names with photos for spearphishing baseline construction. This was not a theoretical concern: the 2021 SolarWinds post-incident analysis by CISA found that threat actors had used LinkedIn job posting patterns to identify the company's monitoring infrastructure before the supply chain compromise.

Twitter/X, prior to the February 2023 free-tier API shutdown, was a primary source for tracking organizational communications in near real time. The current restriction to paid API tiers at $100/month (Basic) or $5,000/month (Pro) has pushed most automated social OSINT to alternative platforms: Mastodon's open ActivityPub API, Reddit's Pushshift historical archive (partially restored in late 2023), and platform-specific scrapers that operate in a persistent legal gray zone. The Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible data likely does not violate the CFAA, but it left contract, privacy, and terms-of-service claims open.

Lesson 2 Quiz

Four questions · Select the best answer for each
1. What single access vector did DarkSide use to initiate the Colonial Pipeline compromise in May 2021, and how was it identified?
Correct. Post-incident analysis by Mandiant and CISA identified a single VPN account — username and password found in a prior unrelated breach's leaked credential batch — as the entry point. No active scanning was required.
Incorrect. The entry point was a single VPN account without multifactor authentication, whose credentials appeared in a batch of leaked data from an entirely separate prior breach.
2. Certificate Transparency logs became mandatory for publicly trusted TLS certificates following a policy change by which organization, effective April 2018?
Correct. Google's Chrome Root Store policy required Certificate Transparency logging for all publicly trusted certificates from April 2018, making every subdomain registered for a TLS certificate permanently searchable in public logs.
Incorrect. While RFC 6962 defined the CT standard, the mandate came from Google's Chrome Root Store policy effective April 2018: Chrome stopped trusting newly issued certificates that were not logged in public CT logs.
3. The Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn is relevant to passive OSINT because it found that:
Correct. hiQ Labs v. LinkedIn established that scraping data that is publicly accessible without login does not constitute unauthorized access under the CFAA — though the ruling is narrow and civil liability under other statutes (GDPR, state privacy law) may still apply.
Incorrect. The ruling found the opposite: that scraping publicly accessible data does not constitute a CFAA violation under the "unauthorized access" provision. Other legal frameworks may still apply.
4. What is the primary intelligence value of LinkedIn job postings for passive infrastructure reconnaissance, beyond identifying personnel?
Correct. A posting for a "Splunk enterprise security engineer" implies an active Splunk deployment; "Kubernetes administrator with EKS experience" implies AWS infrastructure. The SolarWinds post-incident analysis confirmed threat actors used this technique to map monitoring infrastructure.
Incorrect. The primary infrastructure intelligence value is that required skill sets in job postings imply specific running technology deployments — a technique documented in the SolarWinds breach post-incident analysis.

Lab 2 — Passive Collection Tool Selection

AI-assisted planning exercise · Complete 3 exchanges to finish

Exercise: Building a Passive Collection Plan

You are a security consultant who has been hired to assess the external exposure of a mid-sized financial services company (fictional: "Meridian Capital Partners") before a red team engagement. Your written authorization covers passive reconnaissance only — no active scanning, no interaction with target systems.

Work with the AI assistant to design a passive collection plan. Ask about specific tools, query strategies, and data sources. The assistant will help you think through coverage gaps and prioritize sources by signal quality.

Start by describing what you know about the target (public company, financial sector, ~400 employees) and ask the assistant to help you prioritize the passive collection sources from Lesson 2.
OSINT Collection Planner
Lab 2
Ready to help you build a passive collection plan for Meridian Capital Partners. For a financial services firm of that size, the passive surface area is typically rich — regulatory filings, certificate transparency logs, Shodan exposure, and LinkedIn are all high-signal starting points. Tell me what you know about the company and what your primary intelligence requirements are, and we'll structure a prioritized collection approach together.
Lesson 3 · AI-Powered Synthesis and Entity Resolution

From Raw Data to Structured Intelligence

An LLM that cannot tell you whether a source is credible can still find relationships you would never think to look for.
How do you use AI synthesis effectively while controlling for the hallucination and bias risks that make raw LLM output dangerous in intelligence contexts?

In February 2021, following the military coup in Myanmar, the Reuters Investigative unit used a combination of satellite imagery from Planet Labs, Facebook post geolocation, and corporate registry data from Myanmar's Directorate of Investment and Company Administration to trace the military junta's financial holdings across more than 120 shell companies. The key breakthrough came not from any single source but from cross-referencing the names of directors across company registries in Singapore, Hong Kong, and Myanmar. A human analyst had identified the pattern manually; a model such as GPT-4, which was not released until March 2023, could plausibly perform the same name-matching across the structured registry data in minutes. The investigation won the 2022 Shorty Award for Best Investigative Journalism. The technique it demonstrated, multi-jurisdiction entity resolution from corporate registry data, is now a standard AI-augmented workflow.

3.1 — What Entity Resolution Actually Means

Entity resolution is the process of determining whether two or more references in different data sources point to the same real-world entity. It sounds simple and is deeply hard. "John Smith" appearing in a LinkedIn profile, a court filing, and a domain registration record may be three different people, or one person using different email addresses, or a fictitious identity used across multiple registrations. The problem scales quadratically: ten entities form 45 distinct pairs, and if each of one entity's ten attributes is compared against each of the other's ten, that is 45 × 100 = 4,500 attribute comparisons to evaluate.

Pre-AI entity resolution tools relied on deterministic matching (exact string match on email or SSN) and probabilistic matching (Fellegi-Sunter statistical models developed in the 1960s). Both approaches require clean, structured data. The real world produces dirty, inconsistent, transliterated, and deliberately obfuscated data. LLMs handle this environment differently: rather than matching on fixed fields, they can assess semantic equivalence across noisy representations. "Anatoliy Chepiga," "A.V. Chepiga," and "Anatoliy V. Chepiga" are readily resolved by a language model even without a phone number or birthdate match.
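
For contrast with the LLM approach, here is a minimal rule-based baseline in Python for the name-variant case. It assumes the surname is the last token and treats an initial as compatible with any given name that starts with it. Both rules are simplifying assumptions that break on many real naming conventions.

```python
import re


def tokens(name: str) -> list[str]:
    """Lowercase a name and split it into alphabetic tokens ("A.V." -> a, v)."""
    return re.findall(r"[^\W\d_]+", name.lower())


def same_person_candidate(a: str, b: str) -> bool:
    """True when surnames match and given names agree up to initials."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb or ta[-1] != tb[-1]:
        return False  # assumes the surname is the last token
    given_a, given_b = ta[:-1], tb[:-1]
    # Compare given names position by position; a missing token is not a conflict.
    return all(x.startswith(y) or y.startswith(x) for x, y in zip(given_a, given_b))


variants = ["Anatoliy Chepiga", "A.V. Chepiga", "Anatoliy V. Chepiga"]
for other in variants[1:]:
    print(variants[0], "~", other, same_person_candidate(variants[0], other))
```

A True result marks a candidate match only. Two different people can share a surname and initials, so the pair still needs corroborating attributes before it is merged.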

3.2 — LLM Synthesis in Practice

The practical workflow for AI-assisted synthesis in OSINT has three components that must be kept conceptually separate to avoid compounding errors.

Extraction. The first step is converting unstructured text — articles, social posts, PDFs, court filings — into structured data. This is now handled with high reliability by prompting an LLM to extract named entities and relationships into a defined schema. For example: "Extract all named individuals, their stated roles, and their organizational affiliations from the following text. Output as JSON." GPT-4 and Claude 3 Opus perform this task with accuracy rates above 90% on English text of moderate complexity, dropping to 70–80% on non-English text with domain-specific terminology.
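
Because extraction output feeds every later step, it helps to reject malformed responses before they enter the pipeline. The Python sketch below pairs the prompt with a strict parser. The three-key schema and the sample response are assumptions for illustration, and the model call itself is deliberately left out.

```python
import json

PROMPT_TEMPLATE = (
    "Extract all named individuals, their stated roles, and their organizational "
    "affiliations from the following text. Output as JSON: a list of objects with "
    'the keys "name", "role", and "organization".\n\nText:\n{text}'
)

REQUIRED_KEYS = {"name", "role", "organization"}


def parse_extraction(raw_output: str) -> list[dict]:
    """Parse model output and reject anything that does not fit the schema."""
    records = json.loads(raw_output)  # raises ValueError on non-JSON output
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of entity records")
    for record in records:
        if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
            raise ValueError(f"record does not match schema: {record!r}")
    return records


# Hypothetical model response, hard-coded so the sketch runs offline.
raw_output = '[{"name": "Jane Doe", "role": "Director", "organization": "Example Holdings Ltd"}]'
print(parse_extraction(raw_output))
```

Schema validation catches structural failures only. A well-formed record can still name the wrong person, so sampled manual checks against the source text remain necessary.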

Resolution. Once entities are extracted from multiple documents, the resolution step clusters entities that likely refer to the same real-world actor. Embedding-based similarity (using models like text-embedding-3-large from OpenAI) places entities in a semantic vector space where clusters of near-synonymous references can be identified geometrically. This approach identifies non-obvious connections — a shell company name that contains a modified version of its beneficial owner's surname, for example — that deterministic matching would miss entirely.
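
The geometry of this step can be sketched without calling an embedding API. In the Python below, character-trigram counts stand in for a real embedding model, which is a loud simplification: trigrams capture spelling overlap, not meaning. The threshold of 0.5 and the sample names are arbitrary assumptions.

```python
import math
from collections import Counter


def trigram_vector(text: str) -> Counter:
    """Stand-in embedding: counts of character trigrams (NOT a neural embedding)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(names: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters: list[tuple[Counter, list[str]]] = []
    for name in names:
        vec = trigram_vector(name)
        for seed_vec, members in clusters:
            if cosine(vec, seed_vec) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


names = ["Example Holdings Ltd", "Example Holdings Limited", "Northwind Logistics"]
print(cluster(names))
```

Swapping `trigram_vector` for vectors from a real embedding model keeps the rest of the logic unchanged. The threshold then needs tuning on labeled pairs, because a value that is too low silently merges unrelated entities.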

Synthesis. The final step asks the model to describe the entity network, identify anomalies, and generate hypotheses about relationships not yet confirmed. This is where hallucination risk is highest. The model will generate plausible-sounding statements about entities it has not actually seen evidence for. The mitigation is citation-grounded prompting: explicitly instructing the model to cite the source document for every claim it makes, and treating any claim without a citation as unverified hypothesis rather than established fact.
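
Citation-grounded prompting only helps if uncited claims are actually separated out afterwards. The Python sketch below assumes the model was told to tag each claim with a marker like `[doc:ID]`, one claim per line. That marker format, the document IDs, and the sample output are all hypothetical conventions.

```python
import re

CITATION = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")


def triage_claims(model_output: str, known_docs: set[str]) -> dict[str, list[str]]:
    """Sort each non-empty line into 'grounded' or 'unverified'."""
    buckets: dict[str, list[str]] = {"grounded": [], "unverified": []}
    for line in model_output.splitlines():
        claim = line.strip()
        if not claim:
            continue
        cited = CITATION.findall(claim)
        # Grounded only if a citation exists and every cited ID is in the corpus.
        ok = bool(cited) and all(doc_id in known_docs for doc_id in cited)
        buckets["grounded" if ok else "unverified"].append(claim)
    return buckets


known = {"reg-sg-01", "reg-mm-07"}
output = """\
Jane Doe is a director of Example Holdings Ltd. [doc:reg-sg-01]
Example Holdings Ltd is controlled by a relative of Jane Doe.
Jane Doe also appears in a second filing. [doc:reg-hk-99]
"""
result = triage_claims(output, known)
print(len(result["grounded"]), "grounded,", len(result["unverified"]), "unverified")
```

The third claim cites a document that is not in the corpus, so it lands in the unverified bucket alongside the uncited one. "Grounded" here means only that a real document was cited. Whether that document supports the claim still has to be read and checked by the analyst.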

The Hallucination Problem Is Structural

LLMs generate tokens based on learned probability distributions — they do not "know" whether a statement is true. In a 2023 study published in Nature, researchers found that GPT-4 hallucinated citations to non-existent legal cases at a rate of approximately 35% when asked to support legal arguments. The rate drops to under 5% when the model is given a document corpus to cite from — but does not reach zero. Every AI-generated claim in an intelligence product requires source verification before operational use.

3.3 — Graph-Based Intelligence Mapping

Maltego, first released by Paterva in 2008 and now owned by Maltego Technologies, remains the standard visual graph analysis tool for OSINT entity networks. It connects to over 50 data sources via Transform plugins — Shodan, VirusTotal, HIBP, social platforms — and builds visual relationship graphs that can surface non-obvious connections between entities. The 2024 release of Maltego AI Assist integrates LLM-based natural language querying directly into the graph interface, allowing analysts to describe a relationship pattern in plain English and have the system highlight matching nodes.

SpiderFoot, an open-source OSINT automation tool created by Steve Micallef in 2012, takes a different approach: it automates the collection step entirely, spawning parallel queries across 200+ data sources for a given seed entity (IP address, email, domain, or name) and returning structured JSON results. It does not perform entity resolution or synthesis — it is a collection engine, not an analysis platform — but its output feeds cleanly into LLM synthesis workflows.

3.4 — Structured Analytic Techniques in an AI Context

Richards Heuer's Psychology of Intelligence Analysis (1999, CIA Center for the Study of Intelligence) introduced structured analytic techniques (SATs) specifically to counter cognitive biases such as confirmation bias, anchoring, and the availability heuristic. The two most relevant to AI-augmented OSINT are Analysis of Competing Hypotheses (ACH) and Key Assumptions Check (KAC).

ACH requires listing all plausible hypotheses and systematically evaluating each piece of evidence for its consistency with each hypothesis. An LLM can assist by generating the initial hypothesis list and populating the evidence matrix, but the analyst must validate the evidence categorizations: the model will confidently mark evidence as "consistent with H2" when a careful reading shows it is neutral. Treat this less as a defect to eliminate than as a constraint to design around: use the LLM for speed and the analyst for logical validation.
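An ACH evidence matrix is small enough to represent directly. The sketch below assumes a three-value rating scheme (C for consistent, I for inconsistent, N for neutral) and ranks hypotheses by how much evidence contradicts them, following Heuer's emphasis on disconfirmation. The hypotheses, evidence items, and function names are invented for illustration.

```python
# Ratings: "C" consistent, "I" inconsistent, "N" neutral.
# Evidence items and hypotheses H1-H3 are invented for illustration.
matrix = {
    "E1: shared registrant email": {"H1": "C", "H2": "C", "H3": "I"},
    "E2: different registered addresses": {"H1": "N", "H2": "C", "H3": "C"},
    "E3: same director name in two registries": {"H1": "C", "H2": "I", "H3": "I"},
}

def inconsistency_counts(matrix: dict[str, dict[str, str]]) -> dict[str, int]:
    """Count, per hypothesis, how many evidence items are rated inconsistent."""
    counts: dict[str, int] = {}
    for ratings in matrix.values():
        for hypothesis, rating in ratings.items():
            counts[hypothesis] = counts.get(hypothesis, 0) + (rating == "I")
    return counts

def rank_hypotheses(matrix: dict[str, dict[str, str]]) -> list[str]:
    """Order hypotheses from least to most contradicted (ties broken by name)."""
    counts = inconsistency_counts(matrix)
    return sorted(counts, key=lambda h: (counts[h], h))

counts = inconsistency_counts(matrix)
ranking = rank_hypotheses(matrix)
```

If an LLM fills in the ratings, the ranking is only as good as those ratings. Changing a single cell from "N" to "I" can reorder the hypotheses, which is why each cell needs analyst validation before the ranking is used.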

KAC asks: what are we assuming that we have not explicitly stated? LLMs are surprisingly useful here — prompt them with "What assumptions is the following analysis making that are not stated in the text?" and they will often surface implicit assumptions the author overlooked. This is one of the few synthesis tasks where LLM assistance has a low hallucination risk, because the model is identifying gaps in reasoning rather than generating positive claims about the world.
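The KAC prompt above can be wrapped in a small helper so that the same wording is used, and logged, every time. This is a sketch only: the function name and the extra formatting instruction are assumptions, not part of any established KAC procedure.

```python
# The question text is taken from the lesson; everything else is illustrative.
KAC_QUESTION = (
    "What assumptions is the following analysis making "
    "that are not stated in the text?"
)

def build_kac_prompt(analysis_text: str) -> str:
    """Wrap an analysis in the Key Assumptions Check question."""
    return (
        f"{KAC_QUESTION}\n"
        # Assumed extra instruction: asking for a quote ties each flagged
        # assumption back to the provided text.
        "List each assumption on its own line and quote the sentence it underlies.\n\n"
        f"ANALYSIS:\n{analysis_text.strip()}"
    )

prompt = build_kac_prompt(
    "Fragments A and C refer to the same person because the surnames match."
)
```

Keeping the prompt in one place also makes it easy to preserve the exact text alongside the model output.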

Lesson 3 Quiz

Four questions · Select the best answer for each
1. In the Reuters Myanmar investigation, what was the specific data source combination that enabled identification of the military junta's shell company network?
Correct. Planet Labs satellite imagery, Facebook post geolocation, and multi-jurisdiction corporate registry data (Myanmar, Singapore, Hong Kong) were the three source categories. The breakthrough was cross-referencing director names across registries.
Incorrect. The Reuters investigation used entirely open sources: Planet Labs satellite imagery, Facebook post geolocation, and corporate registry data from Myanmar, Singapore, and Hong Kong. No classified material was involved.
2. What is "citation-grounded prompting" and why is it specifically recommended for the synthesis step of AI-assisted OSINT?
Correct. Citation-grounded prompting instructs the model to reference the specific document supporting each claim. Any claim the model generates without a citation is treated as hypothesis rather than fact, dramatically reducing the operational risk of hallucinated intelligence.
Incorrect. Citation-grounded prompting means instructing the model to cite its source document for every factual claim it makes — claims without citations are treated as unverified hypotheses, not established facts. This mitigates hallucination risk in synthesis.
3. According to the 2023 Nature study cited in Lesson 3, what was GPT-4's approximate hallucination rate for citations when given a document corpus to work from (versus generating citations freely)?
Correct. The study found hallucination rates drop from ~35% (free generation) to under 5% (document-grounded) — a dramatic improvement, but not zero. Every AI-generated claim in an intelligence product still requires source verification.
Incorrect. The study found the rate drops from approximately 35% (free generation) to under 5% when a document corpus is provided — a dramatic improvement, but explicitly noted as not reaching zero, which is why human verification remains mandatory.
4. For which structured analytic technique does Lesson 3 identify AI assistance as having relatively LOW hallucination risk, and why?
Correct. KAC asks the model to find gaps in reasoning — what is assumed but not stated. This task does not require the model to make positive claims about the world, only to identify logical gaps in text already provided, which significantly reduces hallucination risk.
Incorrect. The Key Assumptions Check has the lowest hallucination risk in this context because the model identifies unstated assumptions within provided text, rather than generating positive factual claims about the world — the root cause of most hallucinations.

Lab 3 — Entity Resolution Practice

AI-assisted analysis exercise · Complete 3 exchanges to finish

Exercise: Citation-Grounded Synthesis and Assumptions Checking

You have collected the following synthetic intelligence fragments about a fictional target. Your task is to use the AI assistant to perform entity resolution and then run a Key Assumptions Check on the resulting analysis.

Fragment set (fictional / synthetic — for training only):
(A) Domain registration for "meridian-cap.io" lists admin contact "A. Verikov, admin@meridian-capital.net"
(B) LinkedIn shows "Alexei Verykov" as CFO of Meridian Capital Partners, joined 2019
(C) Singapore ACRA registry lists "Meridian Capital Pte Ltd" director as "Alexander Verikov" since 2020
(D) A 2022 SEC comment letter references "Meridian Capital Partners LLC" with signatory "A.V."

Ask the assistant to perform entity resolution across fragments A–D, then request a Key Assumptions Check on the resulting entity network. Note how the assistant handles name variant disambiguation and what assumptions it flags.
Entity Resolution Assistant
Lab 3
I can see the four synthetic fragments. Before we begin, let's be precise about what we're doing: entity resolution across these fragments, followed by a Key Assumptions Check on whatever network we construct. The name variants alone ("A. Verikov", "Alexei Verykov", "Alexander Verikov", and the bare initials "A.V.") are an interesting test case for how to handle transliteration ambiguity and partial identifiers. Walk me through how you want to approach it: should I attempt resolution first, or do you want to set up the resolution criteria before I apply them?
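The name-variant problem in fragments A to D can be explored with a rough string-similarity pass before involving a model. The sketch below uses Python's standard difflib module; the surname helper and the choice to compare surnames only are simplifying assumptions, and a similarity score is a prompt for analyst judgment, not a resolution decision.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Name strings copied from the synthetic lab fragments.
fragments = {
    "A": "A. Verikov",
    "B": "Alexei Verykov",
    "C": "Alexander Verikov",
    "D": "A.V.",
}

def surname(name: str) -> str:
    """Last whitespace-separated token, lowercased, outer periods stripped."""
    return name.split()[-1].strip(".").lower()

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

# Pairwise surname similarity for every pair of fragments.
scores = {
    f"{k1}-{k2}": round(similarity(surname(n1), surname(n2)), 2)
    for (k1, n1), (k2, n2) in combinations(fragments.items(), 2)
}
```

Fragments A and C share an identical surname spelling, B differs by one letter, and the initials in D score low against everything. None of this addresses whether "Alexei" and "Alexander" are the same person, or whether "A.V." refers to this individual at all; those remain assumptions for the Key Assumptions Check.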
Lesson 4 · Structuring a Defensible OSINT Workflow

From Collection to Auditable Findings

Intelligence without provenance is rumor. Provenance without structure is noise.
What does a professionally structured OSINT workflow look like, and how do you document it so that your findings can survive scrutiny?

In December 2018, the Oxford Internet Institute's Computational Propaganda Project and Graphika, a network analysis firm, published a joint report for the US Senate Select Committee on Intelligence cataloguing the Internet Research Agency's social media influence operation ahead of the 2016 US election. The report documented 3,814 Twitter accounts, 76,000 Facebook posts, and activity across YouTube, Instagram, Reddit, and Google+, representing the work of a St. Petersburg troll farm operating from at least 2013. The methodological value of the report lay less in the findings themselves (the New York Times had reported on the IRA since 2015) than in the documentation chain. Every account was linked to a platform-provided dataset; every cluster was reproducible from the raw data. When the Senate published the underlying data publicly in December 2019, independent researchers were able to replicate and extend the findings. The workflow survived because it was built to be audited.

4.1 — The Documentation Imperative

OSINT findings are only as durable as their documentation chain. This is true in three distinct contexts: legal proceedings, where chain-of-custody requirements demand timestamped collection logs; corporate security investigations, where findings may be challenged by opposing counsel or disclosed to regulators; and journalistic investigations, where the publication's legal team must be able to verify every factual claim before print.

The minimum viable documentation standard for any OSINT collection includes: timestamp of collection (not just "today" but UTC timestamp to the minute), exact query or method used (the precise Shodan query, the exact Google dork, the URL and date of the cached page), raw output preserved (screenshot plus source HTML or JSON), and chain of inference (the explicit logical steps connecting raw evidence to the stated finding).
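The four elements of the documentation standard map naturally onto a single log record. The following is a minimal sketch, assuming one JSON line per collected artifact; the class name, field names, and example query are illustrative, and a real workflow would also store the screenshot and the raw file alongside the log.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CollectionRecord:
    query: str                   # exact query, dork, or URL used
    raw_output: bytes            # source HTML or JSON exactly as received
    inference_chain: list[str]   # explicit steps from raw evidence to finding
    # UTC timestamp truncated to the minute, set at creation time.
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat(timespec="minutes")
    )

    @property
    def sha256(self) -> str:
        """Hash of the raw output, so later tampering or drift is detectable."""
        return hashlib.sha256(self.raw_output).hexdigest()

    def to_log_line(self) -> str:
        """Serialize everything except the raw bytes, which are stored separately."""
        return json.dumps({
            "collected_at": self.collected_at,
            "query": self.query,
            "raw_sha256": self.sha256,
            "inference_chain": self.inference_chain,
        })

# Illustrative values only; the organization and result are invented.
record = CollectionRecord(
    query='org:"Example Corp" port:3389',
    raw_output=b'{"matches": []}',
    inference_chain=["Query returned zero matches", "No exposed RDP observed at collection time"],
)
log_line = record.to_log_line()
```

Storing a hash rather than the raw bytes in the log keeps the log small while still letting a reviewer confirm that the preserved artifact is the one that was collected.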

AI-generated synthesis must be documented separately from human-analyzed findings, with the model version, exact prompt, and model output preserved. A finding that reads "GPT-4 identified X entity as related to Y organization" is not a finding — it is an AI hypothesis. The finding is the human verification of that hypothesis against cited sources.

4.2 — The OSINT Workflow Architecture

A defensible AI-augmented OSINT workflow has seven sequential stages. These are not theoretical — they represent the documented practice of Bellingcat, Graphika, and major OSINT consulting firms as of 2024.

Stage 1 — Requirements Definition. Write down the specific intelligence question before touching any tool. "What is the full extent of Company X's external internet exposure?" is a defined requirement. "Find everything about Company X" is not.

Stage 2 — Source Mapping. List the data sources you will query and justify each one. This prevents scope creep and documents why you chose particular sources over alternatives.

Stage 3 — Collection with Logging. Execute collection with automated logging. Screenshotting tools like GoFullPage, archive tools like archive.today, and command-line tools that output timestamped logs are all standard. If a piece of evidence cannot be re-collected (because a post was deleted, for example), its original collection artifact is the only record — preserve it immediately.

Stage 4 — Processing and Extraction. Convert raw collection into structured data. This is the LLM-assist step: named entity extraction, relationship tagging, metadata normalization.

Stage 5 — Analysis and Hypothesis Generation. Apply structured analytic techniques. Use ACH for competing explanations; use KAC to surface assumptions. Document which hypotheses were considered and rejected, not just the ones that survived.

Stage 6 — Verification. Every claim in the final report must be traceable to a specific artifact from Stage 3. AI-generated hypotheses from Stage 5 that are included in the report must be marked as verified or unverified, with the verification method stated.

Stage 7 — Dissemination with Confidence Levels. Use explicit confidence levels (High, Moderate, Low, or numeric probabilities if your client requires them) based on source reliability and evidence quality. The US Intelligence Community's ICD 203 standard for analytic products provides the reference framework most corporate and governmental clients expect.
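The Stage 6 rule, that every claim traces to a Stage 3 artifact and every AI-generated hypothesis carries a verification marking, can be checked automatically before a report goes out. The sketch below is one possible shape for that check; the Claim fields, artifact identifiers, and marking strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    artifact_ids: list[str]                    # Stage 3 artifacts supporting the claim
    ai_generated: bool = False
    verification_method: Optional[str] = None  # filled in once a human verifies it

def untraceable(claims: list[Claim], artifacts: set[str]) -> list[Claim]:
    """Claims with no artifact, or citing an artifact that was never collected."""
    return [
        c for c in claims
        if not c.artifact_ids or any(a not in artifacts for a in c.artifact_ids)
    ]

def marking(claim: Claim) -> str:
    """Label to print next to the claim in the final report."""
    if not claim.ai_generated:
        return "analyst finding"
    if claim.verification_method:
        return f"AI hypothesis, verified ({claim.verification_method})"
    return "AI hypothesis, unverified"

# Invented claims and artifact IDs for illustration.
collected = {"ART-001", "ART-002"}
claims = [
    Claim("Two subdomains have expired certificates", ["ART-001"]),
    Claim("The subdomains may host a dev environment", ["ART-001"], ai_generated=True),
    Claim("The firewall vendor is known", []),
]
problems = untraceable(claims, collected)
labels = [marking(c) for c in claims]
```

An unverified AI hypothesis is allowed into the report under this scheme, but only with its label attached; a claim with no artifact behind it is flagged regardless of who or what produced it.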

ICD 203 Confidence Levels

The Intelligence Community Directive 203 (Analytic Standards), revised in 2015, defines three confidence tiers for intelligence assessments. High confidence means the assessment is based on high-quality information with no significant reason to doubt it. Moderate confidence means the information is credible and plausible but not sufficiently corroborated to warrant high confidence. Low confidence means the information is fragmentary, questionable, or from sources with unknown reliability. These designations should appear explicitly in any intelligence product intended for decision-makers.
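One simple way to make sure a confidence designation always appears is to make it a required argument when a finding is formatted. This is a sketch under the assumption that findings are rendered as single lines of text; the enum, function name, and output layout are illustrative and not drawn from ICD 203 itself.

```python
from enum import Enum

class Confidence(Enum):
    """The three tiers described above."""
    HIGH = "High confidence"
    MODERATE = "Moderate confidence"
    LOW = "Low confidence"

def format_finding(statement: str, confidence: Confidence, sources: list[str]) -> str:
    """Render a finding with its confidence tier and documented sources."""
    if not sources:
        # A finding with no documented source should not reach the report.
        raise ValueError("a finding needs at least one documented source")
    return f"[{confidence.value}] {statement} (sources: {', '.join(sources)})"

# Invented finding and source labels for illustration.
line = format_finding(
    "Two certificate-transparency entries reference expired subdomains",
    Confidence.MODERATE,
    ["ART-001", "ART-002"],
)
```

Choosing the tier remains an analyst judgment about source reliability and corroboration; the code only guarantees that the judgment is stated.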

4.3 — Counter-OSINT Awareness

Understanding how your own collection can be detected is both an operational security concern and a professional competency. Passive reconnaissance, by definition, leaves no traces on target systems — but it may leave traces elsewhere.

LinkedIn notifies profile owners when their profile has been viewed by non-anonymous accounts. Shodan queries are logged by Shodan and can be subpoenaed. Google dorking queries from a specific IP address are logged in Google's infrastructure. Archive.today submissions are publicly visible. If operational security is a requirement, passive collection should route through VPN infrastructure or Tor, and platform-identifying actions (like viewing a LinkedIn profile while logged in) should be avoided.

More broadly, organizations that have deployed digital exhaust monitoring — services like ZeroFOX, Recorded Future, or Digital Shadows — receive alerts when their domain names, IP ranges, or executive names appear in unusual query patterns. A coordinated reconnaissance campaign against a security-aware target may trigger defensive awareness before the collection phase is complete.

4.4 — The Professional and Ethical Standard

OSINT as a formal profession does not yet have a universally recognized certification body, but two frameworks provide ethical reference points. The SANS FOR589 course (Cybercrime Intelligence) covers professional standards for law enforcement and corporate investigators. The OSINT Curious Project, founded in 2018 by a group of practitioners including Micah Hoffman, publishes community standards for ethical OSINT practice including guidance on anonymization, harm reduction, and responsible disclosure.

The core ethical obligation is proportionality: the depth of collection should match the legitimacy and scope of the requirement. Researching a prospective business partner's public corporate history is proportionate. Mapping a private individual's daily physical movements using cell tower data and social media geolocation is not — regardless of whether each individual data source is technically public. The aggregation of individually innocuous data points into a surveillance profile crosses an ethical threshold even where no legal line is drawn.

AI amplifies this concern precisely because it lowers the cost of aggregation. A workflow that would have taken a dedicated team three weeks in 2018 can now be partially automated by a single analyst in an afternoon. The professional obligation is to apply the proportionality standard more carefully as the technical barrier falls, not to treat the falling barrier as permission to lower the ethical bar alongside it.

Lesson 4 Quiz

Four questions · Select the best answer for each
1. What made the Graphika / Oxford Internet Institute Senate report on the Internet Research Agency methodologically valuable beyond its findings, and why did that value matter when the data was published publicly in December 2019?
Correct. The documentation chain — platform-provided datasets, reproducible cluster methodology — meant that when the raw data was published, independent researchers could replicate the findings from scratch. The workflow survived scrutiny because it was built to be audited.
Incorrect. The New York Times had reported on the IRA since 2015, so the findings were not novel. The value was the documentation chain: every account tied to platform data, every cluster reproducible, enabling independent verification when the underlying data was published.
2. In the seven-stage defensible OSINT workflow, what is the specific rule for AI-generated hypotheses that are included in a final intelligence report?
Correct. AI-generated synthesis is documented separately: model version, exact prompt, and raw output preserved. If included in the report, each AI-generated claim must be marked verified or unverified with the verification method stated.
Incorrect. AI hypotheses can appear in reports, but must be explicitly marked as verified or unverified, with model version, exact prompt, and verification method documented. A claim labeled only as "GPT-4 identified X" is a hypothesis, not a finding.
3. Under ICD 203 standards, what does "Moderate Confidence" specifically mean when applied to an analytical judgment?
Correct. ICD 203 defines Moderate Confidence as: credible and plausible information that is not sufficiently corroborated to meet the High Confidence threshold. High is solid sourcing with no significant doubt; Low is fragmentary or unreliable sourcing.
Incorrect. ICD 203's Moderate Confidence means the information is credible and plausible but lacks sufficient corroboration for High Confidence. High Confidence = quality sourcing with no significant doubt. Low = fragmentary or questionable sourcing.
4. Lesson 4 argues that AI amplifies a specific ethical concern more than any other. What is it?
Correct. AI lowers the cost of data aggregation so dramatically that a single analyst in an afternoon can now do what a team needed three weeks to accomplish in 2018. The professional obligation is to apply the proportionality standard more rigorously as the technical barrier falls — not to lower the ethical bar alongside it.
Incorrect. The specific concern identified is aggregation: AI's ability to compile individually innocuous public data points into detailed surveillance profiles in hours rather than weeks. The professional response is to apply the proportionality standard more carefully, not less.

Lab 4 — Building a Defensible OSINT Report

AI-assisted workflow exercise · Complete 3 exchanges to finish

Exercise: Structuring Findings with ICD 203 Confidence Levels

You have completed a passive OSINT engagement for Meridian Capital Partners (fictional). You have the following raw findings that need to be structured into a defensible report section with proper confidence levels, documented sources, and explicit marking of AI-assisted hypotheses.

Raw finding set (synthetic / fictional):
(1) Shodan shows RDP (port 3389) open on 3 IP addresses in the company's ASN, dated yesterday — high signal but not confirmed active
(2) crt.sh shows 4 subdomains including "staging.meridian-capital.net" and "old-admin.meridian-capital.net" with certificates issued in 2021, currently expired
(3) HIBP API confirms the admin@meridian-capital.net email appears in 2 breach datasets from 2020 and 2022
(4) LinkedIn job postings from Q1 2024 require "Cisco ASA firewall administration" and "Splunk ES" — implying specific running deployments
(5) GPT-4 synthesis (unverified) suggested the expired certificate subdomains may indicate an undecommissioned development environment

Ask the assistant to help you structure these five findings into a report section with ICD 203 confidence levels, proper source documentation, and correct handling of the AI-generated hypothesis in finding 5.
OSINT Report Structuring Assistant
Lab 4
Good set of raw findings to work with — you've got a clean mix of directly observed data (Shodan, crt.sh, HIBP), inferred data (LinkedIn job posting stack implications), and an explicitly flagged AI hypothesis. Structuring these with ICD 203 confidence levels requires us to think carefully about two things: the source reliability for each finding, and the inferential distance between the raw evidence and the stated claim. Let's start with whichever finding you're least sure how to categorize, or I can propose a draft structure for all five if you'd prefer to react to something concrete.

Module 1 Test

15 questions · Score 80% or higher to pass · Covers all four lessons
1. Eliot Higgins founded Bellingcat in which year?
Correct. Bellingcat was founded by Eliot Higgins in 2014, initially as a blog covering the Syrian civil war using open-source imagery analysis.
Incorrect. Bellingcat was founded by Eliot Higgins in 2014.
2. What is the term for the process of determining whether two references in different data sources point to the same real-world entity?
Correct. Entity resolution is the formal term for this process — central to all multi-source OSINT correlation work.
Incorrect. The correct term is entity resolution — the process of determining whether references across different data sources point to the same real-world actor or object.
3. SOCMINT was first formally defined as a sub-discipline in a 2012 paper by Sir David Omand and colleagues published by which organization?
Correct. The 2012 Demos paper by Sir David Omand, Jamie Bartlett, and Carl Miller was the first formal academic definition of SOCMINT as an intelligence sub-discipline.
Incorrect. The 2012 paper defining SOCMINT was published by Demos, authored by Sir David Omand, Jamie Bartlett, and Carl Miller.
4. Which tool, created by Steve Micallef in 2012, automates OSINT collection across 200+ data sources and outputs structured JSON — but does not perform entity resolution or synthesis?
Correct. SpiderFoot, created by Steve Micallef in 2012, is a collection engine — not an analysis platform. Its structured JSON output feeds well into LLM synthesis workflows.
Incorrect. SpiderFoot is the tool described — created by Steve Micallef in 2012, it automates collection across 200+ sources but is a collection engine rather than an analysis or synthesis platform.
5. The Fellegi-Sunter model, relevant to the history of entity resolution, was developed in which decade?
Correct. The Fellegi-Sunter probabilistic record linkage model was developed in the 1960s and remains foundational to deterministic and probabilistic entity matching approaches.
Incorrect. The Fellegi-Sunter model was developed in the 1960s — it predates the modern internet by decades and was originally designed for census record linkage.
6. What specific legal ruling in 2022 established that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act?
Correct. hiQ Labs v. LinkedIn (2022) found that scraping publicly accessible data does not constitute unauthorized access under the CFAA — though civil liability under other statutes may still apply.
Incorrect. hiQ Labs v. LinkedIn (2022) is the relevant case, finding that scraping publicly accessible data does not violate the CFAA's unauthorized access provision.
7. Richards Heuer's Psychology of Intelligence Analysis, which introduced structured analytic techniques, was published by which institution and in what year?
Correct. Heuer's Psychology of Intelligence Analysis was published by the CIA Center for the Study of Intelligence in 1999 and remains the foundational text for bias-reduction techniques in intelligence analysis.
Incorrect. Heuer's text was published by the CIA Center for the Study of Intelligence in 1999. It introduced ACH and other structured analytic techniques to counter cognitive bias.
8. Have I Been Pwned was launched by Troy Hunt in December 2013 following which major breach that exposed 153 million accounts?
Correct. HIBP was launched in December 2013 following the Adobe breach that exposed 153 million accounts. As of 2024, it indexes over 13 billion accounts from 800+ breaches.
Incorrect. Troy Hunt launched HIBP in December 2013 following the Adobe breach specifically, which exposed approximately 153 million accounts.
9. In the Colonial Pipeline breach investigation, DarkSide operators conducted passive reconnaissance for approximately how long before deploying their payload?
Correct. Mandiant and CISA post-incident analysis found that DarkSide operators spent at least three months conducting passive reconnaissance before deploying ransomware on May 7, 2021.
Incorrect. Post-incident analysis found DarkSide spent at least three months in passive reconnaissance before deploying their payload — using nothing more active than Shodan queries, LinkedIn scraping, and leaked credential database lookups.
10. What is the minimum viable documentation standard for OSINT collection artifacts? Select the answer that includes ALL required elements from Lesson 4.
Correct. All four elements are required: UTC timestamp (to the minute), exact query or method used, raw output preserved (screenshot plus source HTML/JSON), and explicit chain of inference linking evidence to finding.
Incorrect. The four required elements are: UTC timestamp to the minute, exact query or method used, raw output preserved, and explicit chain of inference connecting raw evidence to the stated finding.
11. The Censys internet-scanning platform was developed at which university in what year?
Correct. Censys was developed at the University of Michigan in 2015, offering internet-wide scanning coverage with particularly strong certificate transparency integration.
Incorrect. Censys was developed at the University of Michigan in 2015. It offers coverage comparable to Shodan with stronger certificate transparency data integration.
12. What did the Graphika and Oxford Internet Institute Senate report document about the Internet Research Agency's social media operation?
Correct. The December 2018 report documented 3,814 Twitter accounts, 76,000 Facebook posts, and activity across YouTube, Instagram, Reddit, and Google+, with operations beginning at least as early as 2013.
Incorrect. The report documented 3,814 Twitter accounts, 76,000 Facebook posts, and multi-platform activity (YouTube, Instagram, Reddit, Google+) operating from at least 2013.
13. The OSINT Curious Project, which publishes community ethical standards for OSINT practitioners, was founded in what year?
Correct. The OSINT Curious Project was founded in 2018 and publishes community standards covering anonymization, harm reduction, and responsible disclosure for open-source investigators.
Incorrect. The OSINT Curious Project was founded in 2018. Its community standards on anonymization, harm reduction, and responsible disclosure are a key reference for ethical OSINT practice.
14. Which AI embedding model from OpenAI is specifically mentioned in Lesson 3 as suitable for semantic vector space clustering in entity resolution workflows?
Correct. text-embedding-3-large is identified in Lesson 3 as the OpenAI embedding model used for placing entities in semantic vector space, enabling geometric clustering of near-synonymous entity references.
Incorrect. text-embedding-3-large is the model identified in Lesson 3 for semantic vector space clustering in entity resolution workflows.
15. What is the core ethical principle that Lesson 4 argues must be applied MORE rigorously — not less — as AI lowers the cost of OSINT collection and aggregation?
Correct. Proportionality — matching collection depth to the legitimacy and scope of the requirement — is the principle Lesson 4 identifies as requiring stricter application as AI makes deep aggregation faster and cheaper. The falling technical barrier is not permission to lower the ethical bar alongside it.
Incorrect. The principle is proportionality: the depth of collection must match the legitimacy and scope of the requirement. AI making aggregation faster demands stricter application of this standard, not looser.