When the New York Times ran its investigation into data broker industry practices in late 2022, reporters purchased a full dossier on a sitting U.S. senator for under $100 from a service called USInfoSearch. The package included her home address, six previous addresses, the names of her adult children, estimated household income, and her registered vehicle. None of it required a breach. It was assembled from public records — voter rolls, property filings, court documents — that had been scraped, normalized, and sold. The senator had no idea the product existed.
Before roughly 2020, a name search on Google typically surfaced news articles, LinkedIn profiles, and social media accounts — information you had deliberately published. The distinction between what you published and what existed elsewhere about you was real, even if imperfect.
That gap has largely closed. Modern AI-assisted search aggregators — tools like Bing's AI overview, Google's AI Overviews, and third-party people-search platforms that incorporate large language models — now synthesize across categories of data simultaneously: direct web content, court records, property registries, professional license databases, and aggregated data broker profiles. The result is a coherent summary rather than a list of blue links.
This matters because coherence multiplies impact. A list of ten disconnected facts about you is manageable. A synthesized paragraph that says "X lives at [address], works at [employer], drives a [vehicle], and has a court record from [year]" is a different kind of exposure.
In 2023, researchers at The Markup tested ten major AI-enhanced people-search engines and found that seven of them returned home addresses for private individuals within thirty seconds of a name-plus-city query, with no account required. Three returned phone numbers. Two returned estimated income ranges sourced from credit-adjacent data aggregators.
Understanding your exposure starts with knowing which data categories routinely appear in surface-level search results:
| Category | Source | Typical Risk Level |
|---|---|---|
| Home address | Voter rolls, property records, USPS change-of-address | High |
| Employment history | LinkedIn, professional license databases, press releases | Medium |
| Court & arrest records | State court portals, county sheriff logs | High |
| Political affiliation | Voter registration rolls (public in 40 states) | Medium |
| Social media history | Cached pages, archive.org, cross-platform scrapers | Medium |
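The table above can double as a lookup when you triage what a self-search turns up. The following is a minimal Python sketch, assuming a hypothetical `Finding` record; the risk levels are copied from the table, while the short category keys and the example findings are illustrative and not drawn from any real search.

```python
from dataclasses import dataclass

# Risk levels copied from the table above; the short keys are assumptions.
RISK_BY_CATEGORY = {
    "home_address": "High",
    "employment_history": "Medium",
    "court_records": "High",
    "political_affiliation": "Medium",
    "social_media_history": "Medium",
}

# Lower number sorts first, so High-risk findings lead the list.
RISK_ORDER = {"High": 0, "Medium": 1}


@dataclass
class Finding:
    """One item surfaced during a self-search (hypothetical structure)."""
    category: str
    where_found: str


def triage(findings):
    """Return findings sorted so the highest-risk categories come first."""
    return sorted(findings, key=lambda f: RISK_ORDER[RISK_BY_CATEGORY[f.category]])


# Hypothetical findings for a made-up person.
findings = [
    Finding("employment_history", "professional license database"),
    Finding("home_address", "people-search site"),
    Finding("social_media_history", "archive.org"),
]

for f in triage(findings):
    print(RISK_BY_CATEGORY[f.category], f.category, f.where_found)
```

Because Python's sort is stable, findings with the same risk level keep the order in which you recorded them.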
The European Union's right to request removal of personal information from search results predates the GDPR. Google began honoring such requests in 2014, after the Court of Justice of the European Union ruled in Google Spain SL v. AEPD under the 1995 Data Protection Directive. The GDPR, which took effect in 2018, codified the right as the "right to erasure" in Article 17. As of 2023, Google has received over 5.5 million individual removal requests under this framework and honored approximately 47% of them.
The United States has no equivalent federal right. The California Consumer Privacy Act (CCPA) includes a deletion right that applies to data brokers, but enforcement is slow and a separate request must be submitted to each broker. There are over 4,000 registered data brokers in the United States.
The practical implication: for most Americans, open-web exposure is a managed risk, not an erasable one. The audit skills in this module are designed for that reality.
Before you can reduce your exposure, you must map it accurately. Lesson 1 establishes the framework. The lab for this lesson walks you through a structured self-search audit using publicly available tools.
In this lab, your AI assistant will guide you through a systematic self-search audit. You'll learn exactly what queries to run, what to look for in results, and how to categorize what you find by risk level. No personal information should be entered into this chat — work conceptually or use a hypothetical name.
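To make "what queries to run" concrete, here is a small Python sketch that builds a starting set of query strings for a hypothetical name. The function name, the query patterns, and the placeholder name and city are all assumptions for illustration; they follow the name-plus-city pattern and the data categories discussed in this lesson, not an official checklist.

```python
def build_self_search_queries(full_name, city):
    """Build a small set of name-search query strings to run by hand."""
    quoted = f'"{full_name}"'
    return [
        full_name,                   # unquoted: loose match
        quoted,                      # quoted: exact phrase
        f"{quoted} {city}",          # name plus city, as in people-search lookups
        f"{quoted} address",         # probes for home-address exposure
        f"{quoted} court records",   # probes for court and arrest records
    ]


# Hypothetical name and city; do not enter real personal details into the chat.
queries = build_self_search_queries("Jordan Example", "Springfield")
for q in queries:
    print(q)
```

Run each query in a private browsing window and record what appears, noting the data category and where it was found.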
In 2013, while reporting her book Dragnet Nation (published in 2014), journalist Julia Angwin ran an experiment: she requested her own file from Acxiom, at the time one of the three largest data brokers in the United States. The company had recently launched a consumer portal called AboutTheData.com, which gave individuals a partial view of what was held about them.
Angwin's profile contained over 1,500 distinct data points, including estimated household income, her categorization as a "power shopper," her political engagement score, and a predicted "health interest" index that included inferences about conditions she had never disclosed to any data broker. The company confirmed the file was accurate. It had been assembled without her knowledge or consent from hundreds of third-party sources over roughly a decade.
Data brokers operate at the intersection of three data streams:
Public records: Court filings, property ownership, business registrations, voter rolls, bankruptcy filings, marriage and divorce records. These are legally public but were practically inaccessible before mass digitization.
Commercial transaction data: Loyalty card programs, warranty registrations, retail purchase histories, and financial transaction data licensed from banks and payment processors. When you filled out a Walgreens Balance Rewards form in 2015, that information entered a resale pipeline.
Behavioral inference: App location tracking, browser cookie aggregation, and social media activity signals that are sold by publishers, apps, and advertising networks. These are used to build predictive scores — propensity to buy, likelihood to vote, estimated health status.
Acxiom's own 2022 annual report states that the company holds data on approximately 2.5 billion people globally, with an average of 1,500 attributes per person in its core US database. LexisNexis Risk Solutions, another major broker, markets profiles covering 99.98% of US adults. Neither company is a household name.
The California Consumer Privacy Act (CCPA), in effect since 2020, requires data brokers registered in California to honor deletion requests. California's "Delete Act" (SB 362), signed in 2023, goes further: it directs the state's privacy regulator to build a single deletion mechanism, scheduled to become available in 2026, so that one request reaches every registered broker instead of requiring a separate request to each company.
Outside California, the opt-out process remains fragmented. A 2023 study by Consumer Reports found that completing opt-outs across 50 major data brokers required an average of 34 hours of effort and involved 46 separate online forms. Many brokers re-populate deleted profiles within 90 days from new source data.
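The figures above imply a recurring cost, not a one-time one. The short Python calculation below works through the arithmetic under two loudly stated assumptions that are not in the study: that every broker re-populates on a fixed 90-day cycle, and that each repeat round costs as much as the first.

```python
HOURS_TOTAL = 34         # average effort reported in the study cited above
BROKERS = 50             # number of brokers covered by that figure
REPOPULATION_DAYS = 90   # "within 90 days", treated as a fixed cycle (assumption)

# Average time spent per broker in a single opt-out round.
minutes_per_broker = HOURS_TOTAL * 60 / BROKERS

# Worst case: every profile returns each cycle and must be removed again.
rounds_per_year = 365 // REPOPULATION_DAYS
worst_case_hours_per_year = HOURS_TOTAL * rounds_per_year

print(f"{minutes_per_broker:.1f} minutes per broker")
print(f"{rounds_per_year} rounds per year, up to {worst_case_hours_per_year} hours")
```

Under these assumptions the work comes to about 41 minutes per broker and, in the worst case, roughly 136 hours a year, which is why prioritizing a short list of brokers matters more than attempting full coverage.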
Services like DeleteMe and Privacy Bee charge subscription fees to manage this process on your behalf, with mixed effectiveness reviews. The underlying structural problem — that public records continuously re-seed broker databases — is not solved by opt-out.
Auditing your data broker exposure is distinct from auditing your open-web footprint. The two overlap but require different tools and different remediation strategies. Lesson 2's lab focuses specifically on broker lookup and opt-out mechanics.
This lab guides you through identifying which data brokers are most likely to hold your information, understanding the opt-out mechanisms available, and building a realistic prioritized removal strategy. Ask the assistant to help you work through specific brokers or develop a general plan.
The Cambridge Analytica scandal, fully documented through the UK Parliament's Digital, Culture, Media and Sport Committee hearings in 2018, established a specific and important fact about social media inference: 87 million Facebook users had their psychographic profiles constructed from data they never directly provided. The profiles — organized around the OCEAN model of personality traits — were built by analyzing the Facebook likes of 270,000 users who had consented to a personality quiz, then extended to their social networks without consent.
The key finding from testimony by Cambridge Analytica whistleblower Christopher Wylie: a person's Facebook likes alone — not posts, not private messages, just public reactions to content — predicted personality traits with higher accuracy than their own friends' assessments. Data that felt passive and meaningless was deeply revelatory.
Most users think of their social media footprint as the content they actively posted. The actual retained data is much broader. Meta's Data Policy, as updated in 2023, describes retaining:
Content you deleted: posts and photos you removed are retained in Meta's systems for varying periods, and activity around deleted content (comments, reactions from others) may be retained indefinitely. Meta's data handling has also drawn regulatory action: in 2023, Ireland's Data Protection Commission fined Meta €1.2 billion under the GDPR for transferring EU users' data to the United States without adequate safeguards.
Inferences never shown to you: Meta's ad system assigns hundreds of interest and behavioral categories to each user, most of which are never displayed in the "Your Ad Preferences" transparency tool. A 2022 study by Northeastern University researchers found that Meta's internal inference set was two to three times larger than the set of categories visible to users through transparency settings.
Network and behavioral signals: who you message, how long you dwell on specific content, whether you screenshot something, and your scroll velocity on certain content types. These are used as training signals for recommendation and ad targeting systems.
A 2013 study published in PNAS (Kosinski, Stillwell, Graepel) showed that Facebook likes alone could predict race with 95% accuracy, sexual orientation with 88% accuracy, political affiliation with 85% accuracy, and whether parents were divorced during childhood with 60% accuracy. None of these attributes were ever directly disclosed.
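To see why likes can reveal an attribute nobody disclosed, consider a deliberately tiny Python sketch. It is a toy log-odds scorer over synthetic data, not the method used in the study; the page names, the labels "A" and "B", and the smoothing constants are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Entirely synthetic training data: (set of liked pages, undisclosed label).
TRAINING = [
    ({"hiking_club", "trail_maps"}, "A"),
    ({"hiking_club", "camp_stoves"}, "A"),
    ({"trail_maps", "camp_stoves"}, "A"),
    ({"chess_daily", "opening_theory"}, "B"),
    ({"chess_daily", "endgame_tips"}, "B"),
    ({"opening_theory", "endgame_tips"}, "B"),
]


def train(examples):
    """Count how often each page is liked within each label group."""
    like_counts = {"A": Counter(), "B": Counter()}
    group_sizes = Counter()
    for likes, label in examples:
        group_sizes[label] += 1
        like_counts[label].update(likes)
    return like_counts, group_sizes


def log_odds_a(likes, like_counts, group_sizes):
    """Sum smoothed log-odds (A versus B) over the pages a user liked."""
    score = math.log(group_sizes["A"] / group_sizes["B"])
    for page in likes:
        # Add-one smoothing keeps unseen pages from producing log(0).
        p_a = (like_counts["A"][page] + 1) / (group_sizes["A"] + 2)
        p_b = (like_counts["B"][page] + 1) / (group_sizes["B"] + 2)
        score += math.log(p_a / p_b)
    return score


like_counts, group_sizes = train(TRAINING)

# A new user who never stated a label, only liked three pages.
new_user = {"hiking_club", "endgame_tips", "trail_maps"}
score = log_odds_a(new_user, like_counts, group_sizes)
print("predicted label:", "A" if score > 0 else "B")
```

The new user never states a label, yet two of three likes lean toward group A, so the score comes out positive. Real systems apply the same idea to millions of users and far more pages.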
Individual platform data is concerning. Cross-platform aggregation is more so. When data from Twitter/X, Instagram, LinkedIn, Reddit, and TikTok is combined — as it is in large training datasets and by third-party social analytics platforms — the resulting profile is substantially more detailed than any single platform's record.
In 2021, a data scraping incident exposed 533 million Facebook profiles, and separately, 500 million LinkedIn profiles were scraped and listed for sale. Security researcher Troy Hunt documented both incidents in detail on HaveIBeenPwned. Neither incident involved a traditional "hack" — both exploited legitimate API access. The data was publicly available profile information, just collected at industrial scale.
The audit for social media exposure involves three distinct layers:
Deleting a social media account does not delete the data already shared with advertising partners, already scraped by third parties, or already incorporated into AI training datasets. The audit framework in this module focuses on what can be found and managed, not on reversing history.
Work with the AI assistant to understand what a social media data archive actually contains, how to interpret ad preference categories as indicators of inference, and what practical steps reduce ongoing data collection without requiring full account deletion.
Eva Galperin, Director of Cybersecurity at the Electronic Frontier Foundation, has spent years documenting what she calls "the stalkerware problem" — cases where domestic abusers used commercially available spyware to surveil partners. In a 2019 interview with Wired, she outlined the specific sequence she uses when helping at-risk individuals reduce digital exposure quickly.
The sequence is not about maximum privacy. It is about prioritized risk reduction. The first step is always to secure the accounts an attacker is most likely to access (email, then iCloud or Google account). The second is to identify and remove the highest-risk publicly visible information, with a home address on a people-search site being the canonical example. Only after those two layers does the work expand to broader data broker opt-outs and social footprint reduction. The logic: limited time and resources require triage.
Privacy researchers and consumer advocacy organizations have converged on a roughly similar three-tier framework for personal exposure reduction. The tiers reflect a tradeoff between effort, impact, and permanence:
| Tier | Actions | Impact | Effort |
|---|---|---|---|
| Tier 1 — Immediate | Lock down account recovery (email, primary Google/Apple), opt out of top 10 people-search sites, review and tighten social media privacy settings | High | Low–Med |
| Tier 2 — 30-Day | Submit CCPA deletion requests to major data brokers, review third-party app permissions, set up a Google Alert on your name, download and review platform data archives | Medium | Medium |
| Tier 3 — Ongoing | Quarterly data broker re-checks, dark web monitoring (HaveIBeenPwned alerts), annual social media audit, consider a PO Box for public-record submissions | Medium | Low (routine) |
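One way to keep the tiers actionable is to treat the table as a checklist that always points at the earliest unfinished tier. The Python sketch below is one possible encoding: the `Tier` class, its field names, and the `next_actions` helper are assumptions, while the actions and ratings are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One row of the tier table (field names are assumptions)."""
    name: str
    impact: str
    effort: str
    actions: tuple


# Contents copied from the table above.
PLAN = (
    Tier("Tier 1 (Immediate)", "High", "Low-Med", (
        "Lock down account recovery (email, primary Google/Apple)",
        "Opt out of top 10 people-search sites",
        "Review and tighten social media privacy settings",
    )),
    Tier("Tier 2 (30-Day)", "Medium", "Medium", (
        "Submit CCPA deletion requests to major data brokers",
        "Review third-party app permissions",
        "Set up a Google Alert on your name",
        "Download and review platform data archives",
    )),
    Tier("Tier 3 (Ongoing)", "Medium", "Low (routine)", (
        "Quarterly data broker re-checks",
        "Dark web monitoring (HaveIBeenPwned alerts)",
        "Annual social media audit",
        "Consider a PO Box for public-record submissions",
    )),
)


def next_actions(plan, done):
    """Return (tier name, open actions) for the earliest unfinished tier."""
    for tier in plan:
        remaining = [a for a in tier.actions if a not in done]
        if remaining:
            return tier.name, remaining
    return None, []


# Hypothetical progress: only the first Tier 1 action is complete.
done = {"Lock down account recovery (email, primary Google/Apple)"}
tier_name, todo = next_actions(PLAN, done)
print(tier_name)
for action in todo:
    print(" -", action)
```

The helper never surfaces Tier 2 work while a Tier 1 item is still open, which mirrors the triage logic described earlier in this lesson.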
Consumer advocacy organizations including the Privacy Rights Clearinghouse and the World Privacy Forum consistently identify these ten data broker / people-search sites as the highest-priority opt-outs for US individuals, based on data breadth and the frequency with which their results appear in AI-assisted searches:
A sustainable long-term strategy combines active reduction (the opt-outs above) with passive monitoring (alerts and re-checks). Key monitoring tools:
HaveIBeenPwned.com — maintained by Troy Hunt, indexes known breach datasets and alerts registered email addresses when they appear in new breaches. Free and widely used by security professionals.
Google Alerts — create an alert for your full name (with and without quotes), your email address, and your phone number. Free. Will catch new appearances in indexed web content, though not in data broker databases directly.
Cover Your Tracks (EFF) — coveryourtracks.eff.org tests your browser's fingerprint uniqueness and the effectiveness of tracker-blocking settings. Free diagnostic tool from the Electronic Frontier Foundation.
The core principle: exposure reduction is a practice, not an event. The data ecosystem continuously regenerates information from public sources. Quarterly maintenance is more effective than a single intensive effort followed by inaction.
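Quarterly maintenance is easier to keep up when the dates are fixed in advance. The following Python sketch generates re-check dates three months apart; the function name, the start date, and the choice to clamp the day to the 28th (so the date exists in every month) are assumptions, not part of any prescribed schedule.

```python
from datetime import date


def quarterly_checks(start, count=4):
    """Return `count` re-check dates, one every three months after `start`."""
    dates = []
    for i in range(1, count + 1):
        months = start.month - 1 + 3 * i   # zero-based month offset from January
        year = start.year + months // 12
        month = months % 12 + 1
        day = min(start.day, 28)           # clamp so the date exists in every month
        dates.append(date(year, month, day))
    return dates


# Hypothetical start date for the first full audit.
schedule = quarterly_checks(date(2025, 1, 15))
for d in schedule:
    print(d.isoformat())
```

The resulting dates can be copied into a calendar as recurring reminders for data broker re-checks.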
You now have the framework to audit your own digital exposure systematically: open-web name searches, data broker identification, social media archive review, and a tiered reduction plan. The lab for Lesson 4 helps you build a personalized action plan using all four layers. No complete solution exists — but an informed, maintained strategy substantially reduces the risk of harm from AI-enhanced aggregation of your personal information.
In this capstone lab, work with the AI assistant to build a personalized, realistic exposure reduction plan using the three-tier framework. Describe your situation (level of public presence, specific concerns, available time) and let the assistant help you prioritize and sequence your actions.