The regulator asked for an audit. The company provided documentation of its testing methodology, demographic performance metrics, and governance processes. The regulator reviewed it and found everything in order.
Six months later, a journalist found that the AI was systematically disadvantaging one demographic group in production. The audit had missed it. The question was why — and what a better audit would have looked like.
AI auditing encompasses a range of activities that assess whether AI systems are performing as intended, complying with applicable requirements, and producing outcomes that meet defined standards. The term covers substantially different practices — from automated technical testing to human review of decision processes to third-party institutional evaluation.
Technical auditing involves evaluating the AI system itself — testing accuracy, bias, robustness, and behavior across demographic subgroups and edge cases. Technical audits may include adversarial testing (attempting to cause failures), differential analysis (comparing outcomes across demographic groups), and red-teaming (simulating misuse scenarios).
Process auditing evaluates the procedures surrounding AI development and deployment — documentation practices, training data management, testing protocols, incident response, and governance structures. Process audits ask whether the organization is doing the right things, even if they do not directly evaluate whether the AI is producing good outcomes.
Outcome auditing evaluates the real-world effects of AI deployments. Are the decisions the AI makes or influences producing the outcomes the organization intends? Are there disparate impacts? Are affected individuals receiving accurate information? Outcome auditing requires data beyond the AI system itself, including post-deployment monitoring of actual consequences.
Technical audits can miss harms that only manifest in deployment. Process audits can miss systems that follow good processes but still produce harmful outcomes. Outcome audits are the most relevant but also the most resource-intensive and the hardest to connect causally to AI system behavior. Meaningful AI auditing typically requires all three.
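The differential analysis mentioned above can be sketched in a few lines. This is a minimal illustration, not a complete fairness test: the group labels and decision counts are invented, and the 0.8 cutoff borrows the "four-fifths" rule of thumb from US employment selection guidelines as an assumed screening threshold, not as a legal standard for any particular system. Running the same check on pre-deployment test data is a technical audit; running it on logged production decisions is closer to an outcome audit, and the two can disagree.

```python
from collections import defaultdict


def selection_rates(records):
    """Return the favorable-outcome rate for each group.

    `records` is an iterable of (group, favorable) pairs, where `favorable`
    is True when the decision benefited the person (e.g. loan approved).
    """
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, is_favorable in records:
        totals[group] += 1
        if is_favorable:
            favorable[group] += 1
    return {group: favorable[group] / totals[group] for group in totals}


def disparate_impact_ratios(rates, reference_group):
    """Divide each group's rate by the reference group's rate."""
    reference_rate = rates[reference_group]
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return {group: rate / reference_rate for group, rate in rates.items()}


def flag_groups(ratios, threshold=0.8):
    """List groups whose ratio falls below the (assumed) screening threshold."""
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical decision log: group "A" approved 60 of 100, group "B" 42 of 100.
decisions = (
    [("A", True)] * 60 + [("A", False)] * 40
    + [("B", True)] * 42 + [("B", False)] * 58
)

rates = selection_rates(decisions)
ratios = disparate_impact_ratios(rates, reference_group="A")
flagged = flag_groups(ratios)
print(rates, ratios, flagged)
```

A flagged group is a prompt for investigation, not a verdict. A ratio below the threshold says nothing about cause, and a ratio above it does not rule out harm within subgroups the analysis did not break out.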
Choose an AI system: a credit scoring model at a bank, facial recognition in a retail store, or a content moderation AI at a social media company.
Design a meaningful audit for that system. Specify the technical, process, and outcome audit components. For each component: what would you test, how would you test it, and what would constitute a passing vs. failing result?
The company announced it had received an independent third-party audit of its AI system and passed with flying colors. The audit firm was paid by the company being audited. The auditors had API access but not model weights.
This is the current state of much third-party AI auditing. The limitation is not incompetence — it is structure. The auditing ecosystem has not yet developed the independence, access, and standards that financial auditing built over decades.
A market for third-party AI auditing has emerged: companies offering to assess AI systems independently of the organizations that built or deploy them. Making sense of this ecosystem requires weighing both its promise and its substantial limitations.
Technical AI audit firms: Companies that conduct technical assessments of AI systems — bias testing, performance evaluation, security assessment. Examples include Credo AI, Parity AI, Arthur AI, and others. These firms can run technical tests more rigorously than most internal teams, using standardized methodologies and independent judgment.
Consulting firm AI audit practices: Big Four accounting firms (Deloitte, PwC, KPMG, EY) and large consulting firms have developed AI audit practices, often extending existing risk assurance services. These firms bring existing client relationships and credibility but often have less technical AI depth than specialized firms.
Academic and civil society audits: Researchers and advocacy organizations sometimes conduct independent audits, particularly for AI systems with significant public interest implications. These audits can draw on evidence that paid auditors often cannot (for example, testing with affected community members) and are independent of commercial relationships. But they lack formal authority or enforcement backing.
Third-party AI auditing faces several structural challenges that limit its effectiveness:
Model access: Auditors typically get limited access to the systems they audit: API access or testing environments, not full model weights or training data. This makes comprehensive technical assessment difficult.
Conflict of interest: Third-party auditors are paid by the organizations they audit, creating the same structural capture problem as financial auditing.
Standards gap: Unlike financial auditing, AI auditing lacks agreed-upon standards for what constitutes a compliant system. Auditor judgment substitutes for objective criteria, with significant variation.
Sandbagging risk: Organizations may optimize their systems for test conditions that auditors evaluate, without improving their general-use behavior.
Financial auditing faced similar limitations before standardization: auditors paid by the audited, inconsistent standards, inadequate access. The resolution involved mandatory standards (accounting standards such as GAAP and IFRS, and auditing standards such as GAAS and ISA), auditor independence requirements, and regulator-backed oversight. Whether AI auditing can follow a similar path is an open question. The complexity and pace of AI development make standardization significantly harder than for financial statements.
Find or research a real third-party AI audit (several are publicly available — Algorithmic Justice League has conducted some; academic researchers have audited facial recognition systems; some companies have published audit results).
Critique the audit: (1) What type of audit was it (technical, process, outcome)? (2) What was the auditor's access? (3) What structural limitations affected its validity? (4) What would a more rigorous audit have done differently?
The company had violated the rule. The question was whether anyone would find out, whether there was a regulator with clear authority to act, whether there was evidence sufficient for enforcement, and whether the penalty would be worth the enforcement effort.
This calculation plays out continuously in AI governance. Enforcement is not automatic. It is resource-constrained, jurisdictionally bounded, and largely reactive.
How do AI governance rules get enforced in practice? Several notable cases illustrate different enforcement mechanisms and their effectiveness.
The FTC has pursued AI-related enforcement actions in several categories: algorithmic bias in pricing, including a settlement over a rental housing algorithm that charged different prices based on protected characteristics; deceptive AI claims, including action against companies that claimed their AI could accurately detect cancer, deception, or other conditions when it could not; and data collection practices underlying AI systems, extending existing privacy and consumer protection authority into AI-specific contexts.
FTC enforcement is reactive — it responds to harms after they occur, often based on complaints or journalism. It lacks pre-deployment authority for most AI categories, meaning harms must occur before the enforcement mechanism activates.
Financial regulators have more developed AI enforcement mechanisms than regulators in most other sectors, reflecting existing model risk management requirements. Bank regulators (OCC, Federal Reserve, FDIC) have issued guidance requiring banks to document, validate, and monitor AI models used in credit and risk decisions. The CFPB has pursued enforcement against discriminatory credit algorithms. These agencies have examination authority: they can require documentation and review systems as part of regular bank examinations, not only in response to complaints.
The EU AI Office — newly established under the EU AI Act — represents a different enforcement model: a dedicated authority with primary responsibility for GPAI model oversight, investigation authority, and direct fine-levying capability. Early cases will shape precedent for how the Act is interpreted. As of 2024–2025, the Office was still in early operational stages, with enforcement cases limited but anticipated to increase.
Most AI enforcement operates reactively and sectorally. Consumer AI applications that cause diffuse harms, AI systems used in small business operations, and AI operating across jurisdictions often lack effective enforcement coverage. The enforcement systems that exist were designed for other contexts and adapted to AI — creating gaps that only dedicated AI regulation begins to fill.
Choose an AI application with significant potential harms that you believe operates in an enforcement gap — where no existing regulator has clear authority and effective enforcement mechanisms.
(1) Map which regulators have potential jurisdiction and why each has limitations. (2) Describe the harm that the gap enables. (3) Propose a specific enforcement mechanism that would address the gap — who would have authority, what would trigger enforcement, what remedies would be available.
The new Chief AI Officer asked for a list of all AI systems in production. No one could produce one. Different systems had been built by different teams, acquired from different vendors, and deployed across different business units.
That was the first compliance gap — not a policy gap, not a technical gap. A visibility gap. And it is where most AI compliance programs should start.
A compliance program for AI governance is not simply a collection of policies — it is a system of processes, controls, documentation, and accountability that enables an organization to consistently meet its governance obligations and demonstrate that it has done so. Drawing on analogies from established compliance fields (privacy, financial regulation, environmental compliance), AI compliance programs share common structural elements.
AI Inventory: A systematic registry of AI systems in use within the organization — what they do, where they are deployed, what data they use, and what risk classification they carry. Without knowing what AI is deployed, governance is impossible.
Risk Classification: A process for assessing each AI system against risk criteria — probability of harm, severity of potential harm, affected population, reversibility of decisions. Risk classification determines what governance requirements apply to each system.
Pre-deployment Review: A structured process for evaluating new AI systems before deployment — covering technical testing, data governance review, documentation review, ethics assessment, and governance sign-off. The review threshold and rigor should scale with risk classification.
Ongoing Monitoring: Post-deployment performance monitoring with defined metrics, alert thresholds, and escalation processes. Compliance is not a one-time certification — AI systems drift, contexts change, and new failure modes emerge.
Incident Management: A defined process for handling AI system failures — who is responsible for identification, escalation, investigation, remediation, and reporting (internal and, where required, regulatory).
Documentation and Records: Systematic maintenance of governance documentation — design decisions, testing results, risk assessments, incident records, and governance approvals — sufficient to demonstrate compliance to regulators or internal auditors.
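The Ongoing Monitoring element above (defined metrics, alert thresholds, escalation) can be made concrete with a small sketch. Everything specific here is an assumption for illustration: the metric names, baselines, and deltas are invented, and a real program would set them per system during risk classification. The one deliberate design choice is that a metric with no reported value escalates, since silent gaps in monitoring are themselves a compliance finding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    """A monitored metric with hypothetical warn and escalate thresholds."""
    name: str
    baseline: float        # value recorded at pre-deployment review
    warn_delta: float      # absolute deviation that triggers a warning
    escalate_delta: float  # absolute deviation that triggers escalation


def evaluate(rule, observed):
    """Return 'ok', 'warn', or 'escalate' from the deviation against baseline."""
    deviation = abs(observed - rule.baseline)
    if deviation >= rule.escalate_delta:
        return "escalate"
    if deviation >= rule.warn_delta:
        return "warn"
    return "ok"


def review(rules, observations):
    """Evaluate every rule; a metric with no observation is escalated."""
    results = {}
    for rule in rules:
        if rule.name not in observations:
            results[rule.name] = "escalate"
        else:
            results[rule.name] = evaluate(rule, observations[rule.name])
    return results


# Hypothetical rules for a credit decision model.
rules = [
    MetricRule("approval_rate", baseline=0.55, warn_delta=0.03, escalate_delta=0.08),
    MetricRule("false_positive_rate", baseline=0.04, warn_delta=0.01, escalate_delta=0.03),
    MetricRule("group_rate_ratio", baseline=0.95, warn_delta=0.05, escalate_delta=0.15),
]

# This period's observations; false_positive_rate was never reported.
observations = {"approval_rate": 0.56, "group_rate_ratio": 0.78}

status = review(rules, observations)
print(status)
```

The output of a check like this feeds Incident Management: an "escalate" status needs a named owner and a documented response, or the threshold is decorative.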
Organizations without mature AI compliance programs often struggle to know where to begin. The most effective starting point is typically the AI inventory — you cannot govern what you cannot see. A complete inventory, even without sophisticated governance attached, reveals where the highest-risk systems are and where to invest first.
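A first-pass inventory with a rough risk tier attached might look like the sketch below. The fields follow the inventory and risk classification elements described above; the scoring formula, the doubling for irreversible decisions, and the tier cutoffs are assumptions made up for this example, not drawn from any regulation or standard. The three systems listed are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class AISystem:
    """One inventory entry: what the system is, plus its risk criteria."""
    name: str
    owner: str
    purpose: str
    data_used: list[str]
    likelihood: Level  # probability of harm
    severity: Level    # severity of potential harm
    reach: Level       # size of the affected population
    reversible: bool   # can an affected person get the decision undone?


def risk_score(system):
    """Illustrative score: product of the three levels, doubled if irreversible."""
    score = system.likelihood * system.severity * system.reach
    return score if system.reversible else score * 2


def risk_tier(system):
    """Map the score to a tier using assumed cutoffs (18 and 6)."""
    score = risk_score(system)
    if score >= 18:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


# Hypothetical inventory for a mid-sized bank.
inventory = [
    AISystem("support_faq_bot", "Customer Service", "answer routine questions",
             ["public FAQ content"], Level.LOW, Level.LOW, Level.HIGH, True),
    AISystem("credit_scoring", "Retail Lending", "score loan applications",
             ["credit bureau data", "application data"],
             Level.MEDIUM, Level.HIGH, Level.HIGH, False),
    AISystem("resume_screening", "HR", "rank job applicants",
             ["resumes"], Level.MEDIUM, Level.HIGH, Level.MEDIUM, True),
]

# Highest-risk systems first: this is where governance effort goes first.
prioritized = sorted(inventory, key=risk_score, reverse=True)
for system in prioritized:
    print(f"{system.name}: {risk_tier(system)} (owner: {system.owner})")
```

Even a crude tiering like this is useful mainly because it forces every system to have an entry and a named owner. The scoring can be refined later; a system missing from the list cannot be.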
Choose an organization type: a mid-sized bank, a healthcare system, a large social media company, or a government agency using AI in benefits administration.
Design the core elements of an AI compliance program for that organization: inventory approach, risk classification criteria, pre-deployment review process, monitoring approach, and incident management. Identify the three biggest implementation challenges for your design.