In 1874, Allan Pinkerton, founder of the Pinkerton National Detective Agency, published The Expressman and the Detective, an early popular account of systematic information gathering against a target. His agents compiled dossiers from newspaper clippings, postmaster interviews, and railroad manifests: entirely public records, assembled with discipline. The practice worked because most people assumed public information was effectively invisible through sheer volume. That assumption held for roughly a century.
The assumption cracked in 2003 when Jeff Jonas, then at Systems Research & Development, demonstrated NORA — Non-Obvious Relationship Awareness — to the Department of Homeland Security. NORA could cross-reference casino employment records, hotel registrations, and watch lists in near real time, surfacing connections invisible to human analysts. What changed was not the availability of the data; casino records had always existed. What changed was the cost of correlation, which dropped from weeks of manual labor to milliseconds. By 2013, Edward Snowden's leaked NSA slides showed that nation-states had extended exactly this logic to internet-scale data collection. By 2023, commercial AI tools had placed equivalent analytic power — pattern recognition, entity resolution, natural-language summarization — in the hands of anyone with a browser.
This course teaches that capability honestly: what AI-augmented OSINT can actually do, where it fails, how defenders detect it, and where legal and ethical lines sit. Each lesson pairs documented real-world cases with hands-on practice. You will leave with concrete technique, calibrated skepticism, and a clear-eyed view of what the tools cannot do. No hype, no fear — just tradecraft in the current environment.
If you finish every module, here's who you become:
On 4 March 2018, Sergei Skripal and his daughter Yulia collapsed on a bench in Salisbury, England, poisoned with the nerve agent Novichok. Within days, Bellingcat (an open-source investigative collective founded by Eliot Higgins in 2014) began correlating flight records, hotel registrations, and passport metadata drawn from Russian databases and social media. In September and October 2018, they published the real names, military unit affiliations, and travel histories of the two GRU officers accused of carrying out the attack: Colonel Anatoliy Chepiga and Dr. Alexander Mishkin. Russian officials rejected the identifications. No classified intelligence was involved. The entire case was built from open sources, cross-referenced manually, and it previews what the same work looks like when AI handles the cross-referencing step at scale.
Open-source intelligence has existed as a formal discipline since at least World War II, when the Foreign Broadcast Monitoring Service (later renamed the Foreign Broadcast Information Service) monitored Axis radio transmissions. The intelligence cycle (direction, collection, processing, analysis, dissemination) was codified by the CIA in the 1950s and has not fundamentally changed. What practitioners did manually was time-bounded by human attention: a skilled analyst could read perhaps 300 documents a day, cross-reference perhaps 50 entities across sources, and maintain working memory of perhaps a few hundred relationships.
The surface area of public information was, for most of that history, manageable. The internet changed that beginning around 1995. By 2010, a single major news event generated more public text, images, and metadata in 24 hours than a Cold War analyst would encounter in a career. The bottleneck shifted from data access to data processing.
What did AI actually change? The honest answer is three things specifically, and nothing else.
Scale of correlation. Large language models and embedding-based retrieval systems can identify conceptual relationships across millions of documents in seconds. The Bellingcat-style work on Skripal took weeks with a team. Equivalent entity-resolution across a comparable document set now takes minutes with tools like Palantir AIP, Babel Street Illuminate, or even general-purpose LLMs given structured prompts and document context.
Language barrier elimination. Prior to 2020, multilingual OSINT required either native speakers or expensive translation services. By 2022, OpenAI's GPT-3.5 and Google's PaLM could translate, summarize, and analyze text in over 90 languages with accuracy sufficient for intelligence purposes. A single analyst can now work across Mandarin, Arabic, Russian, and Farsi sources simultaneously.
Synthesis under uncertainty. Human analysts excel at structured analytic techniques but struggle with holding many uncertain hypotheses simultaneously. LLMs do not solve this problem — they introduce hallucination risks — but they do enable rapid hypothesis generation across large evidence sets, surfacing leads a human might not think to pursue.
Source evaluation remains entirely human. An LLM cannot determine whether a social media account represents a real person, a state-sponsored persona, or a bot farm. It cannot assess the credibility of a source it has not been trained to recognize. Analysis of Competing Hypotheses (ACH), the structured technique developed by Richards Heuer at the CIA in the 1970s, remains the gold standard for bias reduction, and no current AI system reliably performs it without human oversight.
The classic five-phase cycle maps onto AI augmentation unevenly. Direction — defining requirements — remains human. You must know what question you are trying to answer before a tool can help you answer it. Analysts who skip this step with AI assistance typically generate fluent, confident, and wrong conclusions.
Collection is where AI has the largest near-term impact. Automated scrapers, API harvesters, and scheduled monitoring tools can sustain continuous collection across hundreds of sources with minimal human intervention. Tools like SpiderFoot, Maltego, and custom Python pipelines using the Twitter/X API (prior to its 2023 access restrictions), LinkedIn API, Shodan, and VirusTotal are standard in contemporary OSINT practice.
Processing — converting raw data into structured form — is where LLMs have become genuinely transformative. Named-entity recognition, relationship extraction, geolocation inference from image metadata, and sentiment analysis can all be partially automated. The word "partially" is load-bearing: automated extraction has error rates, and those errors compound downstream.
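To make the compounding concrete, here is a minimal arithmetic sketch. It assumes each processing stage is correct independently with a fixed probability; the stage names and the 0.90 figures are illustrative assumptions, not measured rates for any tool.

```python
from math import prod

def end_to_end_accuracy(stage_accuracies):
    """Probability that a record passes every stage uncorrupted,
    assuming stage errors are independent (a simplifying assumption)."""
    return prod(stage_accuracies)

# Hypothetical per-stage accuracies, chosen only for illustration.
stages = {
    "entity_extraction": 0.90,
    "relationship_extraction": 0.90,
    "geolocation_inference": 0.90,
}

overall = end_to_end_accuracy(stages.values())
print(f"End-to-end accuracy: {overall:.3f}")  # 0.9 * 0.9 * 0.9 = 0.729
```

Three stages that each look acceptable in isolation leave roughly one record in four carrying an error, which is why a human check belongs between processing and analysis.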
Analysis benefits from AI as a brainstorming partner and synthesis engine, but structured analytic techniques still require a human to maintain logical hygiene. Dissemination — writing and presenting findings — is where AI drafting assistance is most mature and least risky.
The most important thing to understand before touching any tool in this course is that legality and ethics are not the same axis. Something can be legal and still cause real harm; something can be technically legal in one jurisdiction and criminal in another.
In the United States, passive OSINT against publicly available data has no federal prohibition, but aggregation of individually public data points can create actionable privacy violations under state law — particularly California's CCPA (2018) and the emerging frameworks in Virginia, Colorado, and Texas. The EU's GDPR (2018) explicitly covers personal data in public sources when that data is processed to profile an individual.
For corporate and competitive intelligence, the Economic Espionage Act of 1996 and the Defend Trade Secrets Act of 2016 set boundaries around what information about competitors can be legally collected and used. For security practitioners, the CFAA's "exceeds authorized access" provision (18 U.S.C. § 1030) means that even passive-seeming actions like creating a fake social media profile to access a restricted group may be prosecutable.
The practical rule for this course: every lab uses either synthetic targets, your own systems, or explicitly designated test environments. Real-target reconnaissance, even passive, should only occur with written authorization.
This module covers the foundational shift: what the pre-AI discipline looked like (L1), how passive collection tools work at scale (L2), how AI handles synthesis and entity resolution (L3), and how to structure an OSINT workflow that produces defensible, auditable findings (L4). Each lesson builds on the previous one. The module test at the end covers all four.
In this lab you will work with an AI assistant to map a real documented OSINT case, the Bellingcat MH17 investigation, onto the intelligence cycle, identifying exactly where AI augmentation would have changed the workflow and where it would not have.
The MH17 investigation, published by Bellingcat in a series of reports beginning in November 2014, identified the Russian 53rd Anti-Aircraft Missile Brigade as the source of the Buk missile system that shot down Malaysia Airlines Flight 17 over Ukraine on 17 July 2014. The entire investigation used open sources: social media posts, satellite imagery from DigitalGlobe, geolocated photographs, and Russian transport records.
On 7 May 2021, the DarkSide ransomware group encrypted Colonial Pipeline's billing and business systems, triggering a six-day shutdown of the largest fuel pipeline on the US East Coast. Post-incident analysis by Mandiant (now part of Google Cloud) and CISA found that DarkSide operators had spent at least three months conducting passive reconnaissance before deploying their payload. Using nothing more targeted than Shodan queries, LinkedIn scraping, certificate transparency log analysis, and leaked credential databases, they identified a single VPN account, one without multifactor authentication, as their entry point. The account's password was found in a batch of credentials leaked from a prior unrelated breach. No active scanning of Colonial's systems was required before the compromise.
Passive reconnaissance tools fall into four functional categories. Understanding what each category can and cannot see is the foundation of both offensive collection and defensive exposure assessment.
Internet infrastructure databases index publicly reachable services without operator permission. Shodan, launched by John Matherly in 2009, continuously crawls IPv4 and IPv6 address space and stores banner information from every port that responds. As of 2024, Shodan indexes over 1.5 billion internet-connected devices. A single Shodan query for a company's ASN or IP range reveals exposed services, software versions, SSL certificate details, and geographic distribution — all without any connection to the target organization's systems. Censys, developed at the University of Michigan in 2015, offers similar coverage with stronger certificate transparency integration. FOFA, operated by Beijing Huashun Xin'an Technology, provides equivalent coverage with stronger Asian IP range depth.
DNS and certificate transparency data is often the most information-rich passive category. Chrome's Certificate Transparency policy has required every publicly trusted TLS certificate issued since April 2018 to be recorded in append-only public logs. Tools like crt.sh and Certstream make these logs searchable, revealing every hostname for which an organization has obtained a publicly trusted certificate, including internal staging servers, development environments, and acquisitions not yet publicly announced. Hostnames covered only by a wildcard certificate do not appear individually.
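As a rough sketch of what working with certificate transparency output looks like, the function below extracts unique hostnames from records shaped like crt.sh's JSON output. It assumes the `name_value` field holds one or more newline-separated names, which matches crt.sh output as far as I know but should be checked against a live response. The fetch step is deliberately omitted and the sample data is synthetic.

```python
import json

def subdomains_from_crtsh(records, apex):
    """Collect unique hostnames under `apex` from crt.sh-style records.

    Each record's `name_value` may hold several newline-separated names;
    wildcard entries such as `*.example.com` are reduced to their base name.
    """
    apex = apex.lower().rstrip(".")
    found = set()
    for record in records:
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower().rstrip(".")
            if name.startswith("*."):
                name = name[2:]
            if name == apex or name.endswith("." + apex):
                found.add(name)
    return sorted(found)

# Synthetic sample shaped like crt.sh JSON output; not real log data.
sample = json.loads("""
[
  {"name_value": "www.example.com\\nstaging.example.com", "not_after": "2022-03-01T00:00:00"},
  {"name_value": "*.example.com", "not_after": "2025-01-01T00:00:00"},
  {"name_value": "mail.example.org", "not_after": "2025-01-01T00:00:00"}
]
""")

print(subdomains_from_crtsh(sample, "example.com"))
```

Filtering on the apex domain matters because a single certificate can list names belonging to unrelated organizations.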
Google dorking — using advanced search operators to surface specific types of exposed content — predates the term OSINT but remains highly effective. The Exploit Database's Google Hacking Database (GHDB), maintained since 2004 and currently listing over 7,000 dorks, catalogs queries that reliably surface misconfigured login panels, exposed configuration files, unsecured cameras, and database files indexed inadvertently. A query like site:target.com filetype:xls "password" costs nothing and leaves no trace on target infrastructure.
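Dork queries are just strings assembled from search operators, so they are easy to generate and log programmatically. The helper below is a hypothetical convenience function (`build_dork` is not part of any library); it only builds the query text and sends nothing to a search engine.

```python
def build_dork(site, filetype=None, phrases=(), intitle=None):
    """Compose a search query from common operators.

    Only the query string is built; no request is made.
    """
    parts = [f"site:{site}"]
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intitle:
        parts.append(f'intitle:"{intitle}"')
    parts.extend(f'"{p}"' for p in phrases)
    return " ".join(parts)

# Reproduces the example query from the text.
print(build_dork("target.com", filetype="xls", phrases=["password"]))
```

Generating queries this way makes it straightforward to record the exact string used, which the documentation standard later in this module depends on.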
The Wayback Machine — operated by the Internet Archive, which has been crawling the web since 1996 — preserves historical versions of websites including pages that have since been taken down, credentials that were briefly exposed, and organizational structures that have changed. It is a standard first step in any corporate OSINT engagement.
Cached SERP data is distinct from Wayback Machine content: Google, Bing, and Yandex have historically cached recent versions of indexed pages, although Google retired its public cache during 2024 and availability elsewhere varies. Where a cache exists, it can persist 7–90 days after the live page is modified, meaning that a rapidly removed sensitive post may still be readable through a cache operator query for days or weeks.
Troy Hunt launched Have I Been Pwned (HIBP) in December 2013 after the Adobe breach exposed 153 million accounts. As of 2024, HIBP indexes over 13 billion accounts from more than 800 breaches. The service is explicitly designed for defensive use — individuals and organizations can check their exposure — but the underlying breach data is widely available through darknet markets and Telegram channels. HIBP's Pwned Passwords API, which allows checking whether a specific password hash appears in known breach data, is now integrated into the default credential checking of Firefox, 1Password, and multiple enterprise identity platforms.
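The Pwned Passwords lookup is designed so that the full password hash never leaves the client: only the first five characters of the SHA-1 digest are sent, and the client matches the remainder locally. The sketch below shows that client-side logic. The response body is a made-up stand-in, and no network request is made.

```python
import hashlib

def split_for_range_query(password):
    """Return (prefix, suffix) of the uppercase SHA-1 hex digest.

    Only the 5-character prefix would be sent to the range endpoint.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def count_in_range_response(suffix, response_text):
    """Look up `suffix` in a 'SUFFIX:COUNT' response body; 0 if absent."""
    for line in response_text.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0

prefix, suffix = split_for_range_query("password")
# A real client would GET https://api.pwnedpasswords.com/range/<prefix>.
# The body below is fabricated for illustration, including the counts.
fake_body = f"0018A45C4D1DEF81644B54AB7F969B88D65:1\r\n{suffix}:12345\r\n"
print(prefix, count_in_range_response(suffix, fake_body))
```

Because the server only ever sees a five-character prefix shared by many hashes, it cannot tell which password was checked.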
For OSINT practitioners, leaked credential databases serve two passive purposes: establishing that a specific email address is real (it appears in a breach), and identifying password patterns that may predict current credential choices when combined with behavioral analysis. The Colonial Pipeline compromise in 2021 is the canonical example of this vector reaching catastrophic scale from a single credential lookup.
LLMs cannot query Shodan or crt.sh directly without tool-use integrations. What they add to passive collection is downstream synthesis: given 200 Shodan results for a target organization, an LLM with code execution capability (like OpenAI's Advanced Data Analysis, introduced in 2023) can identify the five most anomalous exposed services in seconds, cross-reference CVE databases, and draft a prioritized exposure summary. The collection itself is unchanged; the triage is transformed.
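One simple way to approximate "most anomalous" during triage is rarity: a service that appears once across an estate is more likely to be forgotten than one that appears everywhere. The sketch below ranks records that way. The record fields (`ip`, `port`, `product`) are a simplified, hypothetical shape rather than Shodan's actual schema, and the data is synthetic.

```python
from collections import Counter

def rank_by_rarity(results, top_n=5):
    """Rank hosts so the least common (port, product) pairs come first.

    Rarity is only a crude proxy for anomaly; ties are broken by IP string.
    """
    counts = Counter((r["port"], r["product"]) for r in results)
    ranked = sorted(
        results,
        key=lambda r: (counts[(r["port"], r["product"])], r["ip"]),
    )
    return ranked[:top_n]

# Synthetic records using documentation-reserved IP ranges.
results = (
    [{"ip": f"192.0.2.{i}", "port": 443, "product": "nginx"} for i in range(1, 9)]
    + [{"ip": "198.51.100.7", "port": 3389, "product": "Remote Desktop"}]
    + [{"ip": "198.51.100.9", "port": 9200, "product": "Elasticsearch"}]
)

for r in rank_by_rarity(results, top_n=2):
    print(r["ip"], r["port"], r["product"])
```

A deterministic first pass like this gives the analyst a baseline against which to check whatever an LLM flags as unusual.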
LinkedIn, in particular, is a consistently underestimated passive intelligence source for infrastructure mapping. A company's LinkedIn page reveals: org chart depth, technology stack (job postings for "Kubernetes administrator" or "Splunk engineer" imply running deployments), facility locations, vendor relationships (job postings mentioning specific partner tools), and names with photos for spearphishing baseline construction. This is not a theoretical concern: a posting that names a specific monitoring or firewall product tells an attacker which defenses to expect before any contact with the target.
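The inference from job postings to technology stack can be approximated with plain keyword matching. The sketch below is a minimal version; the keyword map and the postings are invented for illustration, and a mention in a posting is a lead to verify rather than proof of a running deployment.

```python
import re

# Hypothetical keyword map; a real one would be longer and curated.
TECH_KEYWORDS = {
    "Kubernetes": r"\bkubernetes\b|\bk8s\b",
    "Splunk": r"\bsplunk\b",
    "Cisco ASA": r"\bcisco\s+asa\b",
}

def infer_stack(postings):
    """Map each technology to the titles of postings that mention it."""
    hits = {}
    for posting in postings:
        for tech, pattern in TECH_KEYWORDS.items():
            if re.search(pattern, posting["text"], flags=re.IGNORECASE):
                hits.setdefault(tech, []).append(posting["title"])
    return hits

# Synthetic postings for a fictional company.
postings = [
    {"title": "Platform Engineer", "text": "Operate our Kubernetes clusters."},
    {"title": "SOC Analyst", "text": "Write detections in Splunk ES."},
    {"title": "Network Engineer", "text": "Cisco ASA firewall administration."},
]

print(infer_stack(postings))
```

Keeping the posting title alongside each hit preserves the link back to the evidence that supports the inference.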
Twitter/X, prior to the February 2023 free-tier API shutdown, was a primary source for tracking organizational communications in near real time. The current restriction to paid API tiers at $100/month (Basic) or $5,000/month (Pro) has pushed most automated social OSINT to alternative platforms: Mastodon's open ActivityPub API, the Pushshift archive of Reddit data (partially restored in late 2023), and platform-specific scrapers that operate in a persistent legal gray zone. The Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible data likely does not violate the CFAA, but it did not settle other claims, such as breach of a platform's terms of service.
You are a security consultant who has been hired to assess the external exposure of a mid-sized financial services company (fictional: "Meridian Capital Partners") before a red team engagement. Your written authorization covers passive reconnaissance only — no active scanning, no interaction with target systems.
Work with the AI assistant to design a passive collection plan. Ask about specific tools, query strategies, and data sources. The assistant will help you think through coverage gaps and prioritize sources by signal quality.
In February 2021, following the military coup in Myanmar, the Reuters Investigative unit used a combination of satellite imagery from Planet Labs, Facebook post geolocation, and corporate registry data from Myanmar's Directorate of Investment and Company Administration to trace the military junta's financial holdings across more than 120 shell companies. The key breakthrough came not from any single source but from cross-referencing the names of directors across company registries in Singapore, Hong Kong, and Myanmar. Human analysts identified the pattern manually; a model such as GPT-4, released in March 2023, could plausibly perform the same name-matching across the structured registry data in minutes. The investigation won the 2022 Shorty Award for Best Investigative Journalism. The technique it demonstrated, multi-jurisdiction entity resolution from corporate registry data, is now a standard AI-augmented workflow.
Entity resolution is the process of determining whether two or more references in different data sources point to the same real-world entity. It sounds simple and is deeply hard. "John Smith" appearing in a LinkedIn profile, a court filing, and a domain registration record may be three different people, or one person using different email addresses, or a fictitious identity used across multiple registrations. The problem scales quadratically: ten entities form 45 unordered pairs, and if every one of an entity's ten attributes must be compared against every attribute of the other, that is 45 × 100 = 4,500 potential comparisons to evaluate.
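The 4,500 figure follows from counting unordered entity pairs and then comparing every attribute of one entity against every attribute of the other. A short check, with a larger case added to show how quickly the count grows (the function name is illustrative):

```python
from math import comb

def comparison_count(n_entities, n_attributes):
    """Cross-entity attribute comparisons: every attribute of one entity
    against every attribute of the other, for each unordered pair."""
    entity_pairs = comb(n_entities, 2)
    return entity_pairs * n_attributes * n_attributes

print(comb(10, 2))                # 45 entity pairs
print(comparison_count(10, 10))   # 45 * 100 = 4500
print(comparison_count(100, 10))  # 4950 * 100 = 495000
```

Multiplying the number of entities by ten multiplies the work by roughly a hundred, which is why manual resolution stops being practical well before a dataset looks large.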
Pre-AI entity resolution tools relied on deterministic matching (exact string match on email or SSN) and probabilistic matching (Fellegi-Sunter statistical models developed in the 1960s). Both approaches require clean, structured data. The real world produces dirty, inconsistent, transliterated, and deliberately obfuscated data. LLMs handle this environment differently: rather than matching on fixed fields, they can assess semantic equivalence across noisy representations. "Anatoly Chepiga," "A.V. Chepiga," and "Anatoly V. Chepiga" are trivially resolved by a language model even without a phone number or birthdate match.
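To see why exact matching breaks on name variants, the sketch below compares it with a classical character-level similarity score from Python's standard library. This is a string-similarity baseline, not an LLM and not a Fellegi-Sunter model; it only illustrates the gap that fuzzier methods are meant to close.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and drop periods so 'A.V.' and 'A V' compare equal."""
    return " ".join(name.lower().replace(".", " ").split())

def name_similarity(a, b):
    """Character-level similarity in [0, 1] after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

variants = ["Anatoly Chepiga", "A.V. Chepiga", "Anatoly V. Chepiga"]
reference = variants[0]

for v in variants:
    exact = v == reference
    score = name_similarity(reference, v)
    print(f"{v!r}: exact={exact}, similarity={score:.2f}")
```

Exact matching accepts only the first variant. The similarity score ranks the others as close, but a threshold still has to be chosen, and a high score between two different people with similar names is a false positive that only corroborating attributes can rule out.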
The practical workflow for AI-assisted synthesis in OSINT has three components that must be kept conceptually separate to avoid compounding errors.
Extraction. The first step is converting unstructured text — articles, social posts, PDFs, court filings — into structured data. This is now handled with high reliability by prompting an LLM to extract named entities and relationships into a defined schema. For example: "Extract all named individuals, their stated roles, and their organizational affiliations from the following text. Output as JSON." GPT-4 and Claude 3 Opus perform this task with accuracy rates above 90% on English text of moderate complexity, dropping to 70–80% on non-English text with domain-specific terminology.
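Whatever model performs the extraction, its output should be validated against the schema before anything downstream consumes it. The sketch below assumes a hypothetical three-field schema (`name`, `role`, `organization`) and uses a hard-coded string in place of a model response; no API is called.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    name: str
    role: str
    organization: str

REQUIRED = ("name", "role", "organization")

def parse_extraction(raw_json):
    """Validate model output against the expected schema.

    Raises ValueError instead of silently accepting malformed records,
    so extraction errors surface here rather than downstream.
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of person objects")
    people = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"record {i} is not an object")
        bad = [k for k in REQUIRED if not isinstance(item.get(k), str)]
        if bad:
            raise ValueError(f"record {i} has missing or non-string fields: {bad}")
        people.append(Person(*(item[k].strip() for k in REQUIRED)))
    return people

# Stand-in for a model response, using fictional names.
raw = '[{"name": "Jane Doe", "role": "CFO", "organization": "Example Holdings"}]'
print(parse_extraction(raw))
```

Validation catches structural failures only. Whether "Jane Doe" really is described as CFO in the source text still has to be checked against the document.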
Resolution. Once entities are extracted from multiple documents, the resolution step clusters entities that likely refer to the same real-world actor. Embedding-based similarity (using models like text-embedding-3-large from OpenAI) places entities in a semantic vector space where clusters of near-synonymous references can be identified geometrically. This approach identifies non-obvious connections — a shell company name that contains a modified version of its beneficial owner's surname, for example — that deterministic matching would miss entirely.
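The geometric idea can be shown with cosine similarity and a simple greedy grouping. The vectors below are three-dimensional toys invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions. The 0.95 threshold is likewise an arbitrary assumption.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.95):
    """Greedy single-pass clustering: join the first cluster whose seed
    vector is at least `threshold` similar, else start a new cluster."""
    clusters = []  # list of (seed_vector, [labels])
    for label, vec in vectors.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(label)
                break
        else:
            clusters.append((vec, [label]))
    return [members for _, members in clusters]

# Toy vectors and fictional entity names, for illustration only.
vectors = {
    "Verikov Holdings Ltd": (0.90, 0.10, 0.05),
    "Verykov Holding Limited": (0.88, 0.12, 0.07),
    "Northwind Logistics": (0.05, 0.20, 0.95),
}
print(cluster(vectors))
```

Greedy clustering depends on input order and on the threshold, so its groupings are candidates for review rather than resolved identities.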
Synthesis. The final step asks the model to describe the entity network, identify anomalies, and generate hypotheses about relationships not yet confirmed. This is where hallucination risk is highest. The model will generate plausible-sounding statements about entities it has not actually seen evidence for. The mitigation is citation-grounded prompting: explicitly instructing the model to cite the source document for every claim it makes, and treating any claim without a citation as unverified hypothesis rather than established fact.
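The "no citation means hypothesis" rule can be enforced mechanically before a human ever reads the output. The sketch below assumes the model was instructed to cite sources in a `[DOC-n]` format, which is a hypothetical convention chosen for this example.

```python
import re

CITATION = re.compile(r"\[(DOC-\d+)\]")

def classify_claims(claims, corpus_ids):
    """Split claims into grounded and unverified.

    A claim counts as grounded only if it cites at least one document
    and every cited ID exists in the corpus; the rest are hypotheses.
    """
    grounded, unverified = [], []
    for claim in claims:
        cited = CITATION.findall(claim)
        if cited and all(doc_id in corpus_ids for doc_id in cited):
            grounded.append(claim)
        else:
            unverified.append(claim)
    return grounded, unverified

corpus_ids = {"DOC-1", "DOC-2"}
claims = [
    "Entity A is a director of Entity B [DOC-1].",
    "Entity A controls Entity C.",                 # no citation
    "Entity C shares an address with B [DOC-9].",  # cites a missing document
]
grounded, unverified = classify_claims(claims, corpus_ids)
print(len(grounded), len(unverified))
```

This check confirms only that a citation exists and points at a real document. Whether the cited document actually supports the claim is still a human verification step.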
LLMs generate tokens based on learned probability distributions; they do not "know" whether a statement is true. Published evaluations have found that general-purpose models asked to support legal arguments frequently cite cases that do not exist. The rate drops substantially when the model is given a document corpus to cite from, but it does not reach zero. Every AI-generated claim in an intelligence product requires source verification before operational use.
Maltego, first released by Paterva in 2008 and now owned by Maltego Technologies, remains the standard visual graph analysis tool for OSINT entity networks. It connects to over 50 data sources via Transform plugins — Shodan, VirusTotal, HIBP, social platforms — and builds visual relationship graphs that can surface non-obvious connections between entities. The 2024 release of Maltego AI Assist integrates LLM-based natural language querying directly into the graph interface, allowing analysts to describe a relationship pattern in plain English and have the system highlight matching nodes.
SpiderFoot, an open-source OSINT automation tool created by Steve Micallef in 2012, takes a different approach: it automates the collection step entirely, spawning parallel queries across 200+ data sources for a given seed entity (IP address, email, domain, or name) and returning structured JSON results. It does not perform entity resolution or synthesis — it is a collection engine, not an analysis platform — but its output feeds cleanly into LLM synthesis workflows.
Richards Heuer's Psychology of Intelligence Analysis (1999, CIA Center for the Study of Intelligence) introduced structured analytic techniques (SATs) specifically to counter cognitive biases: confirmation bias, anchoring, availability heuristic. The two most relevant to AI-augmented OSINT are Analysis of Competing Hypotheses (ACH) and Key Assumptions Check (KAC).
ACH requires listing all plausible hypotheses and systematically evaluating each piece of evidence for its consistency with each hypothesis. An LLM can assist with ACH by generating the initial hypothesis list and populating the evidence matrix — but the analyst must validate the evidence categorizations. The model will confidently mark evidence as "consistent with H2" when a careful reading reveals the evidence is neutral. This is not a failure mode to avoid; it is a workflow to design around: use the LLM for speed, use the analyst for logical validation.
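The evidence matrix can be represented directly as data, which makes the analyst's validation step explicit. The sketch below uses a synthetic matrix and a deliberately simple scoring rule: count inconsistencies per hypothesis, since ACH favors the hypothesis with the least evidence against it rather than the most evidence for it. Weighting evidence by reliability is omitted.

```python
# Ratings: "C" consistent, "I" inconsistent, "N" neutral.

def rank_hypotheses(matrix):
    """Order hypotheses by how little evidence contradicts them.

    Returns (hypothesis, inconsistency_count) pairs, fewest first.
    """
    inconsistencies = {
        h: sum(1 for rating in ratings.values() if rating == "I")
        for h, ratings in matrix.items()
    }
    return sorted(inconsistencies.items(), key=lambda kv: (kv[1], kv[0]))

# Synthetic matrix: hypothesis -> {evidence id -> rating}.
matrix = {
    "H1": {"E1": "C", "E2": "I", "E3": "I"},
    "H2": {"E1": "C", "E2": "C", "E3": "N"},
    "H3": {"E1": "C", "E2": "N", "E3": "I"},
}
print(rank_hypotheses(matrix))
```

Note that E1 is consistent with every hypothesis and therefore has no diagnostic value. If a model had marked E3 as "C" for H2 when it is really neutral, the ranking would not change here, but in a closer contest that kind of miscategorization is exactly what the analyst's review is meant to catch.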
KAC asks: what are we assuming that we have not explicitly stated? LLMs are surprisingly useful here — prompt them with "What assumptions is the following analysis making that are not stated in the text?" and they will often surface implicit assumptions the author overlooked. This is one of the few synthesis tasks where LLM assistance has a low hallucination risk, because the model is identifying gaps in reasoning rather than generating positive claims about the world.
You have collected the following synthetic intelligence fragments about a fictional target. Your task is to use the AI assistant to perform entity resolution and then run a Key Assumptions Check on the resulting analysis.
Fragment set (fictional / synthetic — for training only):
(A) Domain registration for "meridian-cap.io" lists admin contact "A. Verikov, admin@meridian-capital.net"
(B) LinkedIn shows "Alexei Verykov" as CFO of Meridian Capital Partners, joined 2019
(C) Singapore ACRA registry lists "Meridian Capital Pte Ltd" director as "Alexander Verikov" since 2020
(D) A 2022 SEC comment letter references "Meridian Capital Partners LLC" with signatory "A.V."
In December 2018, the Oxford Internet Institute's Computational Propaganda Project and Graphika, a network analysis firm, published a joint report for the US Senate Select Committee on Intelligence cataloguing the Internet Research Agency's social media influence operation ahead of the 2016 US election. The report documented 3,814 Twitter accounts, 76,000 Facebook posts, and activity across YouTube, Instagram, Reddit, and Google+, representing the work of a St. Petersburg troll farm operating from at least 2013. The methodological value of the report was not the findings alone (the New York Times had reported on the IRA since 2015) but the documentation chain. Every account was linked to a platform-provided dataset; every cluster was reproducible from the raw data. When the Senate published the underlying data publicly in December 2019, independent researchers were able to replicate and extend the findings. The workflow survived because it was built to be audited.
OSINT findings are only as durable as their documentation chain. This is true in three distinct contexts: legal proceedings, where chain-of-custody requirements demand timestamped collection logs; corporate security investigations, where findings may be challenged by opposing counsel or disclosed to regulators; and journalistic investigations, where the publication's legal team must be able to verify every factual claim before print.
The minimum viable documentation standard for any OSINT collection includes: timestamp of collection (not just "today" but UTC timestamp to the minute), exact query or method used (the precise Shodan query, the exact Google dork, the URL and date of the cached page), raw output preserved (screenshot plus source HTML or JSON), and chain of inference (the explicit logical steps connecting raw evidence to the stated finding).
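A collection log entry covering the first three elements of that standard might look like the sketch below. The field names are assumptions made for this example, and the chain of inference is left out because it is written by the analyst rather than generated. Hashing the raw output lets a later reviewer confirm the preserved artifact has not changed.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_collection(source, query, raw_output, collected_at=None):
    """Build one log entry for a collected artifact.

    `raw_output` is the preserved bytes (HTML, JSON, or an image file).
    The timestamp is recorded in UTC to the minute.
    """
    when = (collected_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "collected_at_utc": when.strftime("%Y-%m-%dT%H:%MZ"),
        "source": source,
        "query": query,
        "sha256": hashlib.sha256(raw_output).hexdigest(),
        "size_bytes": len(raw_output),
    }

# Synthetic example; the raw output is a placeholder, not real log data.
entry = log_collection(
    source="crt.sh",
    query="%.example.com",
    raw_output=b'[{"name_value": "staging.example.com"}]',
)
print(json.dumps(entry, indent=2))
```

Writing the entry at the moment of collection, rather than reconstructing it afterward, is what makes the timestamp and digest credible.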
AI-generated synthesis must be documented separately from human-analyzed findings, with the model version, exact prompt, and model output preserved. A finding that reads "GPT-4 identified X entity as related to Y organization" is not a finding — it is an AI hypothesis. The finding is the human verification of that hypothesis against cited sources.
A defensible AI-augmented OSINT workflow has seven sequential stages. These are not theoretical — they represent the documented practice of Bellingcat, Graphika, and major OSINT consulting firms as of 2024.
Stage 1 — Requirements Definition. Write down the specific intelligence question before touching any tool. "What is the full extent of Company X's external internet exposure?" is a defined requirement. "Find everything about Company X" is not.
Stage 2 — Source Mapping. List the data sources you will query and justify each one. This prevents scope creep and documents why you chose particular sources over alternatives.
Stage 3 — Collection with Logging. Execute collection with automated logging. Screenshotting tools like GoFullPage, archive tools like archive.today, and command-line tools that output timestamped logs are all standard. If a piece of evidence cannot be re-collected (because a post was deleted, for example), its original collection artifact is the only record — preserve it immediately.
Stage 4 — Processing and Extraction. Convert raw collection into structured data. This is the LLM-assist step: named entity extraction, relationship tagging, metadata normalization.
Stage 5 — Analysis and Hypothesis Generation. Apply structured analytic techniques. Use ACH for competing explanations; use KAC to surface assumptions. Document which hypotheses were considered and rejected, not just the ones that survived.
Stage 6 — Verification. Every claim in the final report must be traceable to a specific artifact from Stage 3. AI-generated hypotheses from Stage 5 that are included in the report must be marked as verified or unverified, with the verification method stated.
Stage 7 — Dissemination with Confidence Levels. Use explicit confidence levels (High, Moderate, Low, or numeric probabilities if your client requires them) based on source reliability and evidence quality. The US Intelligence Community's ICD 203 standard for analytic products provides the reference framework most corporate and governmental clients expect.
Intelligence Community Directive 203 (Analytic Standards), revised in 2015, requires analytic products to express the confidence behind their judgments, and Intelligence Community assessments conventionally use three tiers. High confidence means the assessment is based on high-quality information with no significant reason to doubt it. Moderate confidence means the information is credible and plausible but not sufficiently corroborated to warrant high confidence. Low confidence means the information is fragmentary, questionable, or from sources with unknown reliability. These designations should appear explicitly in any intelligence product intended for decision-makers.
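The Stage 6 traceability requirement and the three confidence tiers can be combined in a small data model. The sketch below is one possible shape, not a standard format: the `Finding` fields, the `ART-` identifiers, and the report-line wording are all assumptions made for illustration, and the findings are synthetic.

```python
from dataclasses import dataclass
from enum import Enum

class Confidence(Enum):
    HIGH = "High"
    MODERATE = "Moderate"
    LOW = "Low"

@dataclass
class Finding:
    statement: str
    confidence: Confidence
    artifact_ids: tuple = ()
    ai_generated: bool = False
    verified: bool = False

def untraceable(findings, known_artifacts):
    """Findings that cite no artifact, or cite one that was never logged."""
    return [
        f for f in findings
        if not f.artifact_ids
        or any(a not in known_artifacts for a in f.artifact_ids)
    ]

def render(f):
    """One report line with confidence tier and AI-verification marking."""
    tag = ""
    if f.ai_generated:
        state = "verified" if f.verified else "unverified"
        tag = f" [AI hypothesis, {state}]"
    return f"({f.confidence.value} confidence){tag} {f.statement}"

known = {"ART-001", "ART-002"}
findings = [
    Finding("Port 3389 was reachable on three addresses.",
            Confidence.MODERATE, ("ART-001",)),
    Finding("Expired subdomains may indicate a forgotten environment.",
            Confidence.LOW, (), ai_generated=True),
]
for f in findings:
    print(render(f))
print([f.statement for f in untraceable(findings, known)])
```

The second finding is flagged twice: it is marked as an unverified AI hypothesis in the rendered line, and it appears in the untraceable list because it cites no collection artifact.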
Understanding how your own collection can be detected is both an operational security concern and a professional competency. Passive reconnaissance, by definition, leaves no traces on target systems — but it may leave traces elsewhere.
LinkedIn notifies profile owners when their profile has been viewed by non-anonymous accounts. Shodan queries are logged by Shodan and can be subpoenaed. Google dorking queries from a specific IP address are logged in Google's infrastructure. Archive.today submissions are publicly visible. If operational security is a requirement, passive collection should route through VPN infrastructure or Tor, and platform-identifying actions (like viewing a LinkedIn profile while logged in) should be avoided.
More broadly, organizations that have deployed digital exhaust monitoring — services like ZeroFOX, Recorded Future, or Digital Shadows — receive alerts when their domain names, IP ranges, or executive names appear in unusual query patterns. A coordinated reconnaissance campaign against a security-aware target may trigger defensive awareness before the collection phase is complete.
OSINT as a formal profession does not yet have a universally recognized certification body, but two frameworks provide ethical reference points. The SANS FOR589 course (Cybercrime Intelligence) covers professional standards for law enforcement and corporate investigators. The OSINT Curious Project, founded in 2018 by a group of practitioners including Micah Hoffman, publishes community standards for ethical OSINT practice including guidance on anonymization, harm reduction, and responsible disclosure.
The core ethical obligation is proportionality: the depth of collection should match the legitimacy and scope of the requirement. Researching a prospective business partner's public corporate history is proportionate. Mapping a private individual's daily physical movements using cell tower data and social media geolocation is not — regardless of whether each individual data source is technically public. The aggregation of individually innocuous data points into a surveillance profile crosses an ethical threshold even where no legal line is drawn.
AI amplifies this concern precisely because it lowers the cost of aggregation. A workflow that would have taken a dedicated team three weeks in 2018 can now be partially automated by a single analyst in an afternoon. The professional obligation is to apply the proportionality standard more carefully as the technical barrier falls, not to treat the falling barrier as permission to lower the ethical bar alongside it.
You have completed a passive OSINT engagement for Meridian Capital Partners (fictional). You have the following raw findings that need to be structured into a defensible report section with proper confidence levels, documented sources, and explicit marking of AI-assisted hypotheses.
Raw finding set (synthetic / fictional):
(1) Shodan shows RDP (port 3389) open on 3 IP addresses in the company's ASN, dated yesterday — high signal but not confirmed active
(2) crt.sh shows 4 subdomains including "staging.meridian-capital.net" and "old-admin.meridian-capital.net" with certificates issued in 2021, currently expired
(3) HIBP API confirms the admin@meridian-capital.net email appears in 2 breach datasets from 2020 and 2022
(4) LinkedIn job postings from Q1 2024 require "Cisco ASA firewall administration" and "Splunk ES" — implying specific running deployments
(5) GPT-4 synthesis (unverified) suggested the expired certificate subdomains may indicate an undecommissioned development environment