Module 4 · Lesson 1

Why Explanation Builds Trust

The psychology and evidence behind transparent AI systems — from COMPAS to credit scoring

What makes a person trust a machine — and what breaks that trust irreparably?

In 2016, ProPublica published an analysis of COMPAS — a recidivism prediction algorithm used in U.S. courts to inform sentencing and parole. The investigation found that Black defendants were nearly twice as likely as white defendants to be falsely flagged as high risk. The algorithm's creator, Northpointe, refused to disclose its methodology, calling it proprietary. Judges had been relying on scores they could not explain. Defendants had no mechanism to challenge them. The episode remains the canonical illustration of what happens when consequential AI operates without explainability.

Trust Is Not Given — It Is Earned Through Transparency

Psychologists distinguish two varieties of trust relevant to AI systems. Cognitive trust is rational: it forms when we have evidence of competence, consistency, and accountability. Affective trust is emotional: it develops through repeated positive experience and perceived goodwill. Explanation mechanisms serve both channels. A model that shows its reasoning satisfies cognitive trust. A model designed to make its logic accessible signals goodwill, supporting affective trust.

Research at Google Brain published in 2018 (the TCAV project: Testing with Concept Activation Vectors) demonstrated that users who received concept-level explanations for image classifier decisions showed statistically higher appropriate reliance: they were more likely to follow correct AI recommendations and more likely to override wrong ones. Explanation improved calibration, not just satisfaction.
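Appropriate reliance can be measured by splitting user behaviour into two rates: how often users follow the AI when it is right, and how often they override it when it is wrong. The sketch below assumes a simple trial log of (ai_was_correct, user_followed_ai) pairs; the function name and the example data are hypothetical and are not taken from the TCAV study.

```python
from typing import Iterable, Tuple


def reliance_rates(trials: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (follow_when_right, override_when_wrong) for a log of trials.

    Each trial is (ai_was_correct, user_followed_ai). A well-calibrated user
    scores high on both rates; automation bias shows up as a low override rate.
    """
    trials = list(trials)
    followed_when_right = [followed for correct, followed in trials if correct]
    followed_when_wrong = [followed for correct, followed in trials if not correct]
    if not followed_when_right or not followed_when_wrong:
        raise ValueError("need at least one correct and one incorrect AI recommendation")
    follow_rate = sum(followed_when_right) / len(followed_when_right)
    override_rate = sum(not f for f in followed_when_wrong) / len(followed_when_wrong)
    return follow_rate, override_rate


# Hypothetical logs: the AI is right in 8 trials and wrong in 4 for both users.
no_explanation = [(True, True)] * 8 + [(False, True)] * 4
with_explanation = (
    [(True, True)] * 7 + [(True, False)] + [(False, False)] * 3 + [(False, True)]
)

print(reliance_rates(no_explanation))    # (1.0, 0.0): follows everything, including errors
print(reliance_rates(with_explanation))  # (0.875, 0.75): follows less blindly, catches most errors
```

Reporting the two rates separately matters because a single agreement or satisfaction score can hide over-reliance: the first user agrees with the AI in every trial, including every trial where it is wrong.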

Historical Anchor — COMPAS, 2016

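Northpointe's COMPAS tool was used by courts in several U.S. states to inform sentencing and other decisions. When ProPublica examined risk scores for more than 7,000 people arrested in Broward County, Florida, it found that among defendants who did not go on to reoffend, 44.9% of Black defendants had been labelled higher risk, compared with 23.5% of white defendants. After controlling for criminal history, recidivism, age, and gender, Black defendants were 77% more likely than white defendants to be scored as higher risk of violent recidivism. The core problem: no defendant, lawyer, or judge could inspect the model's reasoning. Opacity made accountability impossible.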
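A false-positive rate of this kind is computed only over people who did not reoffend: of those, what share were labelled higher risk? The sketch below computes it per group from a list of (group, flagged_high_risk, reoffended) records. The group labels and counts are hypothetical round numbers chosen for illustration, not ProPublica's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, bool, bool]  # (group, flagged_high_risk, reoffended)


def false_positive_rates(records: Iterable[Record]) -> Dict[str, float]:
    """FPR per group = flagged as high risk / all who did NOT reoffend."""
    flagged = defaultdict(int)
    did_not_reoffend = defaultdict(int)
    for group, flagged_high_risk, reoffended in records:
        if reoffended:
            continue  # people who reoffended cannot be false positives
        did_not_reoffend[group] += 1
        if flagged_high_risk:
            flagged[group] += 1
    return {g: flagged[g] / n for g, n in did_not_reoffend.items()}


# Hypothetical cohort: 100 non-reoffenders per group, plus some reoffenders.
records = (
    [("A", True, False)] * 45 + [("A", False, False)] * 55 + [("A", True, True)] * 60
    + [("B", True, False)] * 24 + [("B", False, False)] * 76 + [("B", True, True)] * 50
)
rates = false_positive_rates(records)
print(rates)                              # {'A': 0.45, 'B': 0.24}
print(round(rates["A"] / rates["B"], 3))  # 1.875
```

The reoffender records do not change either rate, which is the point: a model can look equally "accurate" for both groups while distributing its mistakes very unevenly among people who did nothing further.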

The Automation Bias Problem

A consistent finding across aviation, medicine, and finance is automation bias: humans over-rely on automated recommendations and fail to apply independent judgment. A landmark 1999 study by Skitka, Mosier, and Burdick showed that participants in a simulated flight task who used an automated monitoring aid made more errors than those working without it, missing events the aid did not flag and following its incorrect prompts, because they stopped thinking critically when a machine spoke.

Explainability is the primary countermeasure. When a system exposes why it reached a conclusion, users regain the cognitive foothold needed to evaluate it. A 2021 study published in Nature Machine Intelligence found that clinicians who received feature-attribution explanations alongside AI diagnostic outputs showed 15% better decision accuracy than those receiving only AI confidence scores.

Three Pillars of Explanation-Based Trust

Competence Verification

Explanation lets users verify that the model is attending to sensible features — not spurious correlations like image watermarks or dataset artifacts.

Accountability Enablement

When reasoning is visible, errors can be traced to specific causes. This makes correction possible and creates legal and organizational accountability pathways.

Agency Preservation

Explanations return decision-making authority to the human. The person retains the ability to disagree with the model on principled grounds.

The EU AI Act and the Right to Explanation

The EU's AI Act (adopted 2024) requires that people affected by decisions based on high-risk AI systems, including those used in employment, credit, education, and law enforcement, can obtain clear and meaningful explanations of the system's role in those decisions. This builds on earlier GDPR provisions (Articles 13–15 and 22) that give individuals a right to meaningful information about the logic involved in automated decisions that significantly affect them. These are not aspirational standards: breaches of most obligations for high-risk systems, and of the Act's transparency duties, can draw fines of up to €15 million or 3% of worldwide annual turnover, whichever is higher.

The practical effect is that explainability has moved from academic ideal to legal obligation. Organisations that built opaque pipelines face genuine regulatory exposure — and the reputational consequences of being the next COMPAS.

Key Insight

Trust calibration — neither over-trusting nor under-trusting AI — is the measurable goal of explainability. Explanations that are too vague fail cognitive trust. Explanations that are too complex undermine affective trust. The engineering challenge is precision: enough detail to support judgment, not so much as to overwhelm it.

Automation Bias —The tendency to favour automated recommendations over contradictory human judgment, often without critical evaluation.

Trust Calibration —Aligning a user's confidence in an AI system with that system's actual reliability — neither over-relying nor under-trusting.

Right to Explanation —Legal principle (grounded in GDPR Articles 13–15 and 22, and stated explicitly in Article 86 of the EU AI Act) entitling individuals to meaningful information about automated decisions affecting them.

Lesson 1 Quiz

Why Explanation Builds Trust — check your understanding

What was the primary transparency failure identified in ProPublica's 2016 COMPAS investigation?

Correct. Northpointe refused to disclose COMPAS's methodology, citing it as proprietary, which made accountability impossible. Defendants could not challenge scores whose basis they had no way to examine.

Not quite. The core failure was opacity — a proprietary black-box model that judges and defendants could not examine or challenge.

Google Brain's TCAV research found that concept-level explanations improved what specific outcome?

Correct. TCAV explanations improved trust calibration — users relied on the model when it was right and rejected it when it was wrong, rather than blindly accepting outputs.

Incorrect. TCAV improved calibrated reliance, not satisfaction or raw accuracy. The key finding was better human decision-making with explanations.

Automation bias, as demonstrated by Skitka, Mosier, and Burdick's 1999 study, refers to:

Correct. Automation bias is the cognitive tendency to defer to machine recommendations, suppressing independent human evaluation — even when the machine is wrong.

Incorrect. Automation bias describes human behaviour — specifically the tendency to accept automated outputs uncritically — not model behaviour.

Under the EU AI Act (2024), which categories of AI deployment require meaningful explanations of outputs to affected individuals?

Correct. The EU AI Act designates specific high-risk domains — employment, credit, education, law enforcement, migration, and critical infrastructure — where explanation mandates apply.

Not correct. The EU AI Act uses a risk-tiered approach. Explanation requirements attach to high-risk applications in specific domains, not all AI systems.

Lab 1 — Trust Calibration Analyst

Practice identifying automation bias and explanation failures in real-world AI deployments

Your Task

You are reviewing AI deployment cases for a public-sector oversight body. For each scenario the assistant presents, identify: (1) whether automation bias is likely, (2) what explanation failures exist, and (3) what transparency mechanisms would restore appropriate trust calibration.

Start by asking the assistant to present your first case — a real-world AI deployment where trust and explainability intersect. Engage with at least three scenarios to complete this lab.

Trust Calibration Lab

AI Explainability · Module 4

Welcome to the Trust Calibration Lab. I'll present you with real AI deployment scenarios — drawn from documented cases in healthcare, criminal justice, finance, and hiring. For each one, your job is to diagnose the trust and explainability failures and propose what transparency mechanisms could have prevented them. Ready to begin? Just say "give me my first case" and we'll start.

Module 4 · Lesson 2

Explanation Formats and Their Effects

Feature attributions, counterfactuals, natural language — what works for whom, and when

Does the format of an explanation change whether people can actually use it — or does it just change whether they feel they understand?

In 2018, internal IBM documents reported by STAT News revealed that IBM's Watson for Oncology had produced treatment recommendations that company specialists and customers described as "unsafe and incorrect." The system had been trained largely on a small number of synthetic, hypothetical cancer cases rather than on real patient outcomes. But the deeper problem was explanatory: Watson presented its recommendations with confidence scores and brief justifications, formats that looked authoritative without surfacing the model's actual reasoning or the limitations of its training data. Clinicians who trusted the interface had no mechanism to detect the underlying problems.

The Format Is Not Neutral

Explanation format profoundly affects whether users can actually evaluate AI outputs — or merely feel they can. Research distinguishes three failure modes. Illusion of explanatory depth: users believe they understand a system after seeing an explanation, even when the explanation is insufficient to support that understanding. Selective attention: users focus on salient explanation elements while ignoring equally important ones. Format mismatch: explanations designed for data scientists (SHAP bar charts) fail non-technical users who need plain-language summaries.

A 2020 study published in CHI (ACM Conference on Human Factors in Computing Systems) found that users shown feature-importance bar charts were significantly more likely to believe they understood a loan-rejection model than users shown nothing — but their ability to correctly predict model behaviour in new cases was not better than chance. The chart created confidence without understanding.

Major Explanation Formats Compared

Format	Best For	Limitations	Example Use
Feature Attribution (SHAP/LIME)	Technical auditors; model developers	Abstract for lay users; can create illusion of understanding	Credit model audit by data science team
Counterfactual Explanations	Affected individuals seeking recourse	May suggest unfeasible changes; can obscure systemic bias	"Your loan would be approved if your income were £5,000 higher"
Natural Language Explanations	Non-technical end users; regulated contexts	Risk of over-simplification; may not reflect true model reasoning	EU AI Act compliance notices to job applicants
Example-Based (Case-Based)	Domain experts who reason by analogy	Computationally expensive; privacy risks from training data exposure	Medical diagnosis: "Similar patients were diagnosed with…"
Visual Saliency Maps	Image/vision AI tasks; radiologists	Often highlight artifacts, not true causal features	Chest X-ray AI highlighting pneumonia regions
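To make the first row of the table concrete: a SHAP-style attribution assigns each feature its Shapley value, that is, its average marginal contribution across all coalitions of the other features. For a tiny model the values can be computed exactly by enumeration. The sketch below is a minimal illustration, assuming a single baseline input whose values stand in for "absent" features (one of several value functions used in practice); the credit-score model and all numbers are hypothetical. Exact enumeration needs on the order of 2^n model calls, which is why libraries approximate.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, Set

Model = Callable[[Dict[str, float]], float]


def exact_shapley(f: Model, x: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    features = list(x)
    n = len(features)

    def value(coalition: Set[str]) -> float:
        # Features in the coalition keep their real values; the rest use the baseline.
        z = {k: (x[k] if k in coalition else baseline[k]) for k in features}
        return f(z)

    phi: Dict[str, float] = {}
    for i in features:
        others = [k for k in features if k != i]
        total = 0.0
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions S not containing i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def credit_score(z: Dict[str, float]) -> float:
    # Hypothetical linear score: income and tenure help, debt hurts.
    return 0.5 * z["income"] - 0.8 * z["debt"] + 2.0 * z["years_employed"]


applicant = {"income": 40.0, "debt": 10.0, "years_employed": 3.0}
baseline = {"income": 30.0, "debt": 5.0, "years_employed": 5.0}  # e.g. a population average

phi = exact_shapley(credit_score, applicant, baseline)
print({k: round(v, 6) for k, v in phi.items()})
# {'income': 5.0, 'debt': -4.0, 'years_employed': -4.0}
print(round(sum(phi.values()), 6), round(credit_score(applicant) - credit_score(baseline), 6))
# -3.0 -3.0  (attributions sum to the gap between prediction and baseline)
```

The output illustrates why such charts suit auditors more than applicants: each number is relative to a chosen baseline, so "debt: -4.0" means little to someone who does not know what the baseline was.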

Research Finding — Saliency Map Artifacts

A 2018 study by Adebayo et al. ("Sanity Checks for Saliency Maps") showed that some popular saliency methods, notably Guided Backprop and Guided GradCAM, produced visually similar heat-maps whether they were computed from a trained model or from one whose weights had been randomly re-initialized. For those methods, the maps largely reflected input structure such as edges rather than learned model reasoning, a critical finding for radiology AI deployments.
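The model-randomization test behind this finding can be sketched in a few lines: compute a saliency map from the trained model, recompute it after re-initializing the weights at random, and measure how similar the two maps are. A method whose maps barely change is not explaining what the model learned. This is a toy sketch under stated assumptions: a single tanh unit stands in for a network, Spearman rank correlation stands in for the paper's similarity metrics, and `input_only_saliency` is a hypothetical, deliberately flawed explainer included to show what failing the check looks like.

```python
import math
import random
from typing import Callable, List, Sequence

Saliency = Callable[[Sequence[float], Sequence[float]], List[float]]


def toy_model(w: Sequence[float], x: Sequence[float]) -> float:
    """A one-unit stand-in for a network: tanh(w . x)."""
    return math.tanh(sum(wj * xj for wj, xj in zip(w, x)))


def gradient_saliency(w: Sequence[float], x: Sequence[float]) -> List[float]:
    # |d tanh(w . x) / d x_j| = |(1 - tanh(w . x)^2) * w_j|
    t = toy_model(w, x)
    return [abs((1.0 - t * t) * wj) for wj in w]


def input_only_saliency(w: Sequence[float], x: Sequence[float]) -> List[float]:
    # Hypothetical flawed explainer: ignores the model and echoes input magnitudes.
    return [abs(xj) for xj in x]


def ranks(values: Sequence[float]) -> List[int]:
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation, assuming no tied values."""
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def randomization_check(saliency: Saliency, trained_w: Sequence[float],
                        x: Sequence[float], seed: int = 0) -> float:
    """Similarity of maps from trained vs re-initialized weights.
    Values near 1.0 mean the map barely depends on what the model learned."""
    rng = random.Random(seed)
    random_w = [rng.gauss(0.0, 1.0) for _ in trained_w]
    return spearman(saliency(trained_w, x), saliency(random_w, x))


trained_w = [2.0, -1.5, 0.1, 0.05, 0.8, -0.3]  # hypothetical "learned" weights
x = [0.3, 0.2, 0.9, 0.7, 0.1, 0.5]            # one hypothetical input

print(randomization_check(input_only_saliency, trained_w, x))  # 1.0: fails the check
print(randomization_check(gradient_saliency, trained_w, x))    # depends on the random draw
```

Passing this check is necessary but not sufficient: it shows a map depends on the weights, not that it identifies the features a radiologist should care about.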

Counterfactual Explanations and the Recourse Problem

Counterfactuals are particularly important in regulated domains because they provide actionable recourse: they tell the affected person what would have produced a different outcome. The EU's GDPR Article 22 implementation guidelines specifically reference counterfactual-style explanations as satisfying the "meaningful information" standard for automated decisions.

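However, counterfactuals carry a documented failure mode: they can suggest changes that are individually actionable but systemically discriminatory. If a credit model uses postcode as a proxy variable, a counterfactual might advise a person to move to a different area, which is simultaneously technically accurate and fundamentally unjust. Wachter, Mittelstadt & Russell (2017) proposed counterfactual explanations as a way to help people understand, contest, and respond to automated decisions; later work on actionable recourse (e.g. Ustun, Spangher & Liu, 2019) examined when the changes a counterfactual suggests are actually feasible and appropriate. This lesson calls the failure mode the "counterfactual recourse trap."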
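A counterfactual search over a simple linear model shows how the trap arises. The sketch below is a simplification that only considers changing one feature at a time (published methods, such as Wachter et al.'s, optimise over all features with a distance penalty). It returns each single change that would flip a rejection. The model, weights, threshold, and applicant are hypothetical; `postcode_band` plays the role of the proxy variable described above.

```python
from typing import Dict, Iterable

# Hypothetical linear credit score in points; not any real lender's model.
WEIGHTS = {"income_k": 2.0, "debt_k": -4.0, "postcode_band": -60.0}
THRESHOLD = 40.0                      # approve when score >= THRESHOLD
BOUNDS = {"income_k": (0.0, float("inf")), "debt_k": (0.0, float("inf"))}
BINARY = {"postcode_band"}            # 1 = area the model penalises (a proxy variable)


def score(x: Dict[str, float]) -> float:
    return sum(w * x[k] for k, w in WEIGHTS.items())


def single_feature_counterfactuals(x: Dict[str, float],
                                   immutable: Iterable[str] = ()) -> Dict[str, float]:
    """For each feature, the new value that alone would reach THRESHOLD (if feasible)."""
    gap = THRESHOLD - score(x)
    if gap <= 0:
        return {}  # already approved
    skip = set(immutable)
    options: Dict[str, float] = {}
    for k, w in WEIGHTS.items():
        if k in skip:
            continue
        if k in BINARY:
            flipped = {**x, k: 1.0 - x[k]}
            if score(flipped) >= THRESHOLD:
                options[k] = flipped[k]
            continue
        target = x[k] + gap / w  # linear model: solve w * (target - x_k) = gap
        lo, hi = BOUNDS[k]
        if lo <= target <= hi:   # e.g. debt cannot fall below zero
            options[k] = target
    return options


applicant = {"income_k": 30.0, "debt_k": 5.0, "postcode_band": 1.0}
print(score(applicant))                                  # -20.0
print(single_feature_counterfactuals(applicant))         # {'income_k': 60.0, 'postcode_band': 0.0}
print(single_feature_counterfactuals(applicant, immutable=["postcode_band"]))  # {'income_k': 60.0}
```

With every feature treated as changeable, "move to a different postcode" is returned alongside a £30,000 pay rise as valid recourse. Marking the proxy as immutable, in the spirit of actionable-recourse work, removes it from the advice, but the model still penalises the postcode: restricting the explanation does not fix the underlying bias.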

Matching Format to Audience

Research by Miller (2019), a survey drawing on social science literature on human explanation, argued that people naturally explain decisions through contrastive reasoning ("Why X rather than Y?") and select a few salient causes rather than presenting complete causal chains. This suggests counterfactuals and comparative examples are cognitively more natural than feature-importance lists, yet most deployed XAI systems lead with feature attributions because they are easier to compute.

The practical implication: organisations should design explanation interfaces with the intended audience in mind, not the technical pipeline. A fraud analyst needs different information than a customer disputing a declined transaction, who needs different information than a regulator conducting a systemic audit.

Design Principle

Explanation format should be chosen for the decision the recipient must make, not for the convenience of the system producing it. When in doubt, provide multiple formats layered by level of detail — a plain-language summary first, with technical attribution accessible on request.

Illusion of Explanatory Depth —The phenomenon where exposure to an explanation creates subjective confidence in understanding without enabling accurate prediction of model behaviour.

Counterfactual Recourse —An explanation stating what change to inputs would have produced a different AI output — providing actionable information to affected individuals.

Format Mismatch —When an explanation is designed for a different audience than the one receiving it, undermining its practical utility regardless of technical correctness.

Lesson 2 Quiz

Explanation Formats and Their Effects — check your understanding

What did the 2020 CHI study on feature-importance bar charts find about users' understanding of loan rejection models?

Correct. This is the illusion of explanatory depth: the format created subjective confidence without enabling genuine understanding or improved decision-making.

Incorrect. The key finding was that confidence and actual understanding diverged — users felt more confident but couldn't actually predict model behaviour better than chance.

The "counterfactual recourse trap" described in this lesson refers to:

Correct. When a model uses proxy variables for protected characteristics (e.g. postcode for race), counterfactuals may accurately describe what would change the output while pointing to an unjust remedy.

Incorrect. The recourse trap is about justice, not computation or privacy. It occurs when technically accurate counterfactuals imply systemically unjust advice.

Adebayo et al.'s 2018 "Sanity Checks for Saliency Maps" found that some popular saliency methods:

Correct. This was a landmark finding exposing a fundamental limitation of several gradient-based saliency methods: their maps could not be reliably interpreted as representing what the model had learned.

Incorrect. The key finding was that some of these methods failed a basic sanity check: their outputs looked similar even when the network's weights had been randomly re-initialized, questioning their validity as explanations.

According to Miller's 2019 research on human explanation, what format of explanation is most cognitively natural for people?

Correct. Miller's synthesis of social science research found that humans naturally explain by contrast — "why this rather than that" — which is why counterfactual and comparative explanations tend to resonate more than feature lists.

Incorrect. Miller found that human explanation is naturally contrastive — focused on what distinguishes the actual outcome from an alternative — not on comprehensive causal chains or ranked feature scores.

Lab 2 — Explanation Format Designer

Practice selecting and critiquing explanation formats for different audiences and contexts

Your Task

You are a responsible AI designer at a financial services firm. The assistant will give you deployment scenarios involving AI decisions — loan approvals, fraud flags, investment recommendations — and different audience types. For each, determine which explanation format is most appropriate and why. Critique the weaknesses of your chosen format and suggest mitigation strategies.

Begin by telling the assistant which type of financial AI decision you'd like to explore first, or ask it to assign you a scenario. Complete at least three format-design exercises to finish this lab.

Explanation Format Design Lab

AI Explainability · Module 4

Welcome to the Explanation Format Design Lab. I'll present you with financial AI scenarios — specifying the AI decision type, the audience receiving the explanation, and any regulatory context. Your job is to choose the best explanation format, defend your choice, and identify its risks. You can either pick a scenario type (loan rejection, fraud flag, investment advice, credit scoring) or say "assign me one" and I'll choose. What would you like to do?

Module 4 · Lesson 3

Organisational Trust Architecture

Model cards, datasheets, governance boards — how institutions build systematic explainability

If individual explanations can be gamed or misread, what structural mechanisms can make AI trustworthiness durable?

In 2019, Google researchers published the Model Card framework: a structured documentation standard proposing that AI developers record intended use cases, performance metrics across demographic groups, ethical considerations, and known limitations for deployed models. The initiative was led by researchers including Margaret Mitchell and Timnit Gebru, both of whom would later be dismissed from Google in circumstances that themselves became a case study in AI governance failure. The model card standard was subsequently adopted by Hugging Face and major cloud providers, and its approach is echoed in the EU AI Act's technical documentation requirements, demonstrating how one institution's transparency mechanism can propagate into industry-wide infrastructure.

From Individual Explanations to Systemic Trust

Individual explanations — a SHAP chart for a specific loan decision — are necessary but insufficient for organisational trust. They address single outputs without establishing systematic accountability for how the AI behaves across the full population of decisions. Systemic trust requires institutional architecture: documented processes, defined responsibilities, auditable records, and governance structures that persist across personnel changes.

The distinction matters especially in regulated sectors. The 2022 Discussion Paper on AI in financial services, published jointly by the Bank of England, the Prudential Regulation Authority, and the Financial Conduct Authority, identified "model governance", not individual explainability, as the primary trust mechanism for institutional use. A bank's credit AI must be accountable as a policy, not just as a decision.

Model Cards and Datasheets: Documentation as Trust Infrastructure

Model Cards (Mitchell et al., 2019) are structured documents accompanying ML models that disclose: intended use, out-of-scope use cases, training data characteristics, evaluation results broken down by relevant subgroups, ethical considerations, and caveats. The case for them is straightforward: when organisations know in advance that they must document demographic performance disparities, they have a reason to test for them, and what they find can change deployment decisions.

Datasheets for Datasets (Gebru et al., 2018) apply the same principle to training data: motivation for creation, composition, collection process, preprocessing decisions, recommended uses, and known limitations. Together, model cards and datasheets create a chain of documentation connecting training data to deployment outcome — the evidentiary foundation for any post-hoc accountability investigation.

Model Card Components

Model details · Intended use · Factors · Metrics · Evaluation data · Training data · Quantitative analyses · Ethical considerations · Caveats and recommendations

Datasheet Components

Motivation · Composition · Collection process · Preprocessing · Uses · Distribution · Maintenance · Legal and ethical considerations

Governance Board Functions

Pre-deployment review · Ongoing monitoring mandates · Incident response protocols · Escalation pathways · Third-party audit commissioning
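The model card components above can be sketched as a simple record whose quantitative section is filled from evaluation results broken down by subgroup. This is an illustrative assumption, not an official schema (Hugging Face, for example, stores model cards as Markdown files with YAML metadata); the classifier, groups, evaluation rows, and the 0.05 tolerance are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ModelCard:
    # Sections follow the component list above; values are free text here.
    model_details: str
    intended_use: str
    factors: List[str]                    # groups the evaluation is broken down by
    metrics: List[str]
    evaluation_data: str
    training_data: str                    # ideally points to the dataset's datasheet
    quantitative_analyses: Dict[str, float] = field(default_factory=dict)
    ethical_considerations: List[str] = field(default_factory=list)
    caveats_and_recommendations: List[str] = field(default_factory=list)


def accuracy_by_group(rows: List[Tuple[str, int, int]]) -> Dict[str, float]:
    """rows are (group, true_label, predicted_label); returns accuracy per group."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for group, y_true, y_pred in rows:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(y_true == y_pred)
    return {g: hits[g] / totals[g] for g in totals}


def add_disaggregated_accuracy(card: ModelCard, rows: List[Tuple[str, int, int]],
                               max_gap: float = 0.05) -> None:
    """Record per-group accuracy and add a caveat if groups differ by more than max_gap."""
    per_group = accuracy_by_group(rows)
    for group, acc in per_group.items():
        card.quantitative_analyses[f"accuracy[{group}]"] = acc
    gap = max(per_group.values()) - min(per_group.values())
    if gap > max_gap:
        card.caveats_and_recommendations.append(
            f"Accuracy differs by {gap:.2f} across {', '.join(card.factors)}; review before deployment."
        )


card = ModelCard(
    model_details="Hypothetical benefits-eligibility classifier, v0.1",
    intended_use="Prioritise applications for human review; not for automatic denial",
    factors=["age band"],
    metrics=["accuracy"],
    evaluation_data="Held-out applications (hypothetical)",
    training_data="See datasheet for the application dataset (hypothetical)",
)
rows = [("under_30", 1, 1), ("under_30", 0, 0), ("under_30", 1, 1), ("under_30", 0, 0),
        ("over_60", 1, 0), ("over_60", 0, 0), ("over_60", 1, 1), ("over_60", 0, 1)]
add_disaggregated_accuracy(card, rows)
print(card.quantitative_analyses)  # {'accuracy[under_30]': 1.0, 'accuracy[over_60]': 0.5}
print(card.caveats_and_recommendations)
```

Making the disparity check part of how the card is filled in, rather than an optional analysis, is one way to encode the idea that documentation requirements drive testing.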

Case Study — Dutch SyRI Ruling, 2020

In February 2020, the District Court of The Hague ruled that the Dutch government's SyRI system, an algorithm that scored citizens for welfare fraud risk, violated human rights law. The court found that the government had provided insufficient information about SyRI's functioning, that citizens had no meaningful way to understand or challenge their scores, and that the system's opacity was incompatible with the right to private life under ECHR Article 8. The ruling was significant because it treated organisational opacity, not just failures to explain individual decisions, as part of a human rights violation.

AI Governance Boards: Structure and Authority

Effective AI governance boards require three properties that most internal committees lack: real authority (the ability to halt deployments), diverse composition (domain experts, ethicists, affected community representatives, technical staff), and independence from product timelines. Microsoft's Aether Committee and Salesforce's Office of Ethical and Humane Use are examples of institutionalised structures with formal review mandates — though critics note that internal boards face structural conflicts of interest and may not provide the independence of external audit.

The NIST AI Risk Management Framework (AI RMF, 2023) provides a practical governance architecture organised around four functions: Govern, Map, Measure, and Manage. The Govern function specifically addresses organisational culture, accountability structures, and transparency policies — recognising that trust cannot be built at the model level if it is undermined at the institutional level.

Red-Teaming as Trust Infrastructure

Red-teaming, meaning systematic adversarial testing of AI systems before deployment, has emerged as a critical trust-building mechanism, particularly in generative AI. In July 2023, the White House secured voluntary commitments from seven AI companies (Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI) to conduct internal and external security testing, including red-teaming, before releasing new models, and to share information on AI risks with governments, civil society, and academia. Though voluntary, the commitments marked a shift toward publicly pledged adversarial transparency, acknowledging that self-reported safety claims without structured challenge are insufficient for public trust.

Core Principle

Trust infrastructure must outlast the individuals who built the system. Documentation, governance structures, and audit trails are the mechanisms by which institutional trust persists through personnel changes, corporate acquisitions, and model updates. An AI system's trustworthiness cannot depend on the continued presence of its original developers.

Model Card —A structured documentation standard (Mitchell et al., 2019) that discloses AI model characteristics, intended use, performance metrics by subgroup, and ethical considerations.

Datasheet for Datasets —A structured documentation framework (Gebru et al., 2018) recording the motivation, composition, collection, and limitations of ML training datasets.

NIST AI RMF —The US National Institute of Standards and Technology AI Risk Management Framework (2023), providing structured guidance for AI governance across Govern, Map, Measure, and Manage functions.

Lesson 3 Quiz

Organisational Trust Architecture — check your understanding

What was the significance of the Dutch SyRI ruling in February 2020?

Correct. The court found that SyRI's systemic opacity violated ECHR Article 8 — a landmark finding that extended explainability obligations beyond individual decisions to the institutional level.

Incorrect. The SyRI ruling addressed the systemic opacity of an AI welfare fraud scoring system, finding that citizens' inability to understand or challenge their scores violated their right to private life.

The NIST AI Risk Management Framework (2023) organises governance around four functions. Which set is correct?

Correct. The NIST AI RMF's four core functions — Govern, Map, Measure, Manage — provide a structured framework for AI risk management at the organisational level.

Incorrect. The NIST AI RMF uses Govern, Map, Measure, and Manage as its four core functions. The first set (Identify, Protect, Detect, Respond) belongs to the separate NIST Cybersecurity Framework.

What distinguishes a Datasheet for Datasets (Gebru et al., 2018) from a Model Card (Mitchell et al., 2019)?

Correct. Together they create a documentation chain: Datasheets capture where the model learned from, Model Cards capture what it learned to do and how it performs. Both are needed for full accountability.

Incorrect. These are distinct frameworks. Datasheets apply to training datasets (provenance, composition, limitations). Model Cards apply to deployed models (performance, intended use, ethical considerations).

What three properties does the lesson identify as essential for an effective AI governance board — properties most internal committees lack?

Correct. Without genuine authority, diverse perspectives (including affected communities), and independence from commercial timelines, governance boards risk becoming rubber stamps rather than accountability mechanisms.

Incorrect. The three essential properties identified are: real authority (ability to halt deployments), diverse composition including affected community representatives, and independence from product timelines.

Lab 3 — AI Governance Architect

Design documentation and governance structures for a real AI deployment scenario

Your Task

You are an AI governance consultant advising a public-sector organisation planning to deploy AI in a high-stakes context. The assistant will brief you on the scenario — the deployment domain, the affected population, the regulatory environment. Your job is to design the governance architecture: what Model Card sections are critical, what datasheet disclosures are required, and what governance board structure is appropriate.

Ask the assistant to brief you on your client scenario, or describe a public-sector AI deployment you'd like to work through. Design governance structures for at least two different scenarios to complete this lab.

AI Governance Architecture Lab

AI Explainability · Module 4

Welcome to the AI Governance Architecture Lab. I'll brief you on public-sector AI deployment scenarios — in areas like benefits assessment, policing, education, or healthcare — and you'll design the governance infrastructure: Model Card priorities, Datasheet requirements, and governance board structure. You can ask me to assign a scenario or tell me which domain you'd like to focus on. What would you like to do?

Module 4 · Lesson 4

When Trust Breaks — and How to Rebuild It

Amazon hiring AI, healthcare algorithms, and the documented anatomy of AI trust failures

Once an AI system has betrayed trust — through bias, opacity, or error — what makes recovery possible, and what makes it impossible?

Reuters reported in October 2018 that Amazon had scrapped an experimental AI recruiting tool it had been building since 2014, after discovering that it downgraded CVs that included the word "women's" (as in "women's chess club captain") and penalised graduates of two all-women's colleges. The model had been trained on CVs submitted to Amazon over a ten-year period, most of them from men, absorbing the historical male dominance of the company's technical roles. Amazon's engineers edited the system to neutralise those particular terms but could not guarantee it would not find other discriminatory ways to sort candidates, and the team was disbanded by early 2017. The episode was significant not only for the bias it revealed but for the lack of disclosure: according to Reuters' sources, recruiters had looked at the tool's recommendations without relying on them alone (Amazon said the tool was never used by its recruiters to evaluate candidates), and the project became public only through press reporting.

The Anatomy of AI Trust Failure

Trust failures in AI systems follow identifiable patterns. Research by Dietvorst, Logg, and colleagues identifies three distinct phases: discovery (when the failure is identified), attribution (when causes are established), and response (when corrective action is taken or refused). Trust recovery depends critically on what happens in the attribution phase — specifically, whether the organisation accepts responsibility or deflects it.

Amazon's response to its hiring AI was to quietly shut the tool down and say nothing publicly until Reuters discovered it. This non-response — in contrast to transparent acknowledgment and remediation — represents the highest-risk recovery strategy. Once exposed, the company had lost both the opportunity to control the narrative and the credibility that comes with voluntary disclosure.

Case Study — Optum Healthcare Algorithm, 2019

A study published in Science (Obermeyer et al., 2019) found that a healthcare algorithm used by Optum — and deployed across US health systems to identify patients needing additional care — systematically underestimated the medical needs of Black patients. The algorithm used healthcare cost as a proxy for health need, but Black patients had historically received less care at equivalent levels of illness due to systemic barriers. The model learned and amplified this disparity. Optum initially disputed the findings, then acknowledged the bias and committed to redesigning the algorithm — but critics noted the company had used the tool for years without demographic auditing that would have detected the problem.
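The mechanism Obermeyer et al. identified is a label-choice problem: the model was asked to predict cost, and cost understates need for patients on whom less is spent. The toy simulation below reproduces that mechanism with hypothetical numbers (identical need in both groups, 35% lower spending per unit of need for group B); it is not the study's data or Optum's model.

```python
from typing import Dict, List, Tuple

Patient = Tuple[str, int, float]  # (group, true_need, observed_cost)


def make_patients(spend_per_unit_need: Dict[str, float]) -> List[Patient]:
    # Hypothetical cohort: every group has the same needs (1..10);
    # only the amount historically spent per unit of need differs.
    return [
        (group, need, need * spend)
        for group, spend in spend_per_unit_need.items()
        for need in range(1, 11)
    ]


def share_of_group_selected(patients: List[Patient], by: str, k: int, group: str) -> float:
    """Select the top-k patients ranked by 'need' or 'cost'; return the group's share."""
    index = {"need": 1, "cost": 2}[by]
    selected = sorted(patients, key=lambda p: p[index], reverse=True)[:k]
    return sum(p[0] == group for p in selected) / k


patients = make_patients({"A": 10.0, "B": 6.5})  # B receives 35% less spending per unit of need
print(share_of_group_selected(patients, "need", 6, "B"))  # 0.5
print(share_of_group_selected(patients, "cost", 6, "B"))  # 0.16666666666666666
```

Ranking by the proxy rather than by need cuts group B's share of the programme from one half to one in six, even though group membership is never an input: the disparity enters entirely through the choice of prediction target.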

Three Conditions for Trust Recovery

Research on organisational trust repair, particularly work by Kim, Ferrin, Cooper, and Dirks (2004) and Kim, Dirks, and Cooper (2009), identifies conditions that determine whether trust can be rebuilt after violation. Applied to AI systems, three conditions consistently emerge:

1. Acknowledgment

The organisation explicitly acknowledges the failure, its causes, and its impact on affected individuals. Minimisation or deflection of blame makes recovery far less likely. The Optum response — initial dispute, then late acknowledgment — illustrates partial recovery under duress.

2. Structural Change

Trust is rebuilt through demonstrated systemic change, not reassurance. This means new processes — demographic auditing, explainability requirements, external oversight — not just promises. Affected stakeholders must be able to observe the change.

3. Accountability

Someone or something is held accountable. This does not require punishment; it requires visible consequence. When no one is accountable for a failure, there is no incentive structure preventing recurrence — and stakeholders know it.

Proactive Transparency as Trust Insurance

The most durable trust strategy is one that makes failures less damaging by establishing transparency before they occur. When an organisation has publicly documented its model's known limitations, tested for demographic disparities, and established clear remediation processes, a discovered error is understood as a known risk — not evidence of concealment.

This is the logic behind the UK's Algorithmic Transparency Recording Standard (ATRS), first published in 2021 and made a requirement for central government departments in 2024, under which they proactively publish records of algorithmic tools used in decision-making, including purpose, data sources, oversight mechanisms, and known risks. The standard treats disclosure as default rather than last resort, shifting the burden from affected individuals (who must seek information) to deploying organisations (who must provide it).

The Limits of Trust Repair

Not all trust failures are recoverable. Research on algorithm aversion (Dietvorst, Simmons & Massey, 2015) found that people who saw an algorithm err became less likely to choose it over a human forecaster, even after seeing it outperform the human. In those experiments, the algorithm's better overall record did not dispel the aversion. The implication for AI deployment is severe: a single high-profile failure in a domain can trigger systemic rejection of AI assistance, including in cases where AI would save lives.

The 2020 controversy over the UK's A-Level algorithm shows how quickly this can happen. Ofqual used a model to moderate grades during COVID-19 lockdowns. It produced systematic downgrades for students at state schools compared with those at private schools. After widespread protests, the model was withdrawn within days of results being released. The episode effectively ended political support for algorithmic grading in UK education for years. No amount of technical explanation rebuilt trust; the contextual legitimacy of algorithmic decision-making in education was destroyed, at least temporarily.

Final Principle

Trust is asymmetric: it takes years of consistent, transparent behaviour to build and can be destroyed by a single opaque failure. The most reliable trust strategy is to make explainability and accountability structural — embedded in documentation, governance, and oversight before any failure occurs — so that when errors arise (and they will), they are understood as known risks managed by responsible institutions, not as evidence of systemic deception.

Algorithmic Aversion: The persistent human tendency to reject algorithmic tools after seeing them make errors, even when the algorithm outperforms human alternatives over time.

Algorithmic Transparency Recording Standard: UK government standard (piloted in 2022, later mandatory for central government departments) for proactive public disclosure of algorithmic tools used in public-sector decision-making, including their risks and oversight mechanisms.

Proxy Discrimination: When an AI model relies on a variable that correlates with a protected characteristic (e.g. postcode, healthcare cost), producing discriminatory outcomes without explicitly encoding the protected characteristic.
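The cost-as-proxy mechanism behind the Optum case can be illustrated with a small, deterministic Python sketch. It assumes two groups with identical health need. Group B receives 30% less care per unit of need, so it also generates 30% less cost. A hypothetical enrolment rule then picks the six highest-scoring patients for an extra-care programme. The 30% gap, the patient values, and the top-k rule are made-up assumptions chosen for clarity. Group membership is never used for ranking. Even so, ranking by perfectly known cost selects fewer group B patients than ranking by true need.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    group: str   # protected characteristic; never used for ranking below
    need: int    # true health need (hypothetical severity score)
    cost: float  # recorded healthcare spending


# Assumption for illustration: both groups have the same needs, but group B
# receives 30% less care (and so incurs 30% less cost) per unit of need.
NEEDS = range(1, 9)
patients = [Patient("A", n, n * 1000.0) for n in NEEDS] + [
    Patient("B", n, n * 700.0) for n in NEEDS
]


def enrol_top_k(patients, k, score):
    """Hypothetical programme rule: enrol the k highest-scoring patients."""
    return sorted(patients, key=score, reverse=True)[:k]


def count_in_group(selected, group):
    """Number of selected patients belonging to the given group."""
    return sum(p.group == group for p in selected)


by_cost = enrol_top_k(patients, 6, score=lambda p: p.cost)  # proxy target
by_need = enrol_top_k(patients, 6, score=lambda p: p.need)  # true target

print(count_in_group(by_cost, "B"))  # 2
print(count_in_group(by_need, "B"))  # 3
```

At every level of need, group B's cost is lower, so the proxy score ranks B patients below A patients with the same need. Comparing selections by group against an independent measure of need is one way to surface this kind of proxy effect.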

Lesson 4 Quiz

When Trust Breaks — and How to Rebuild It — check your understanding

Amazon's experimental hiring AI was abandoned by 2017, and Reuters reported its existence in 2018. The tool exhibited bias primarily because:

Correct. The model absorbed the historical male dominance in Amazon's technical hiring, treating maleness as a proxy for suitability. This is a textbook case of training data encoding historical discrimination.

Incorrect. The bias emerged from training data — a decade of Amazon's own hiring decisions that reflected historical gender disparities in technical roles. The model learned to replicate those patterns.

The Optum healthcare algorithm (Obermeyer et al., 2019) underestimated Black patients' medical needs because it used what problematic proxy variable?

Correct. Using cost as a proxy for need embedded existing disparities — Black patients had received less care at equivalent illness levels, so their costs were lower, causing the model to systematically underestimate their needs.

Incorrect. The algorithm used healthcare cost as a proxy for health need. Because Black patients historically received less care for equivalent conditions, their costs were lower — causing the model to underestimate their medical needs.

What does research on "algorithmic aversion" (Dietvorst et al., 2015) suggest about human responses to AI errors?

Correct. Algorithmic aversion is asymmetric and persistent — a single observed failure can trigger lasting rejection of AI assistance, even in contexts where the AI remains the better performer statistically.

Incorrect. Dietvorst et al. found the opposite: even observing a single AI error created persistent, information-resistant aversion to algorithm use — even when participants were shown the algorithm's superior overall performance.

The UK's Algorithmic Transparency Recording Standard (ATRS, 2022) represents what approach to AI trust?

Correct. ATRS treats disclosure as the default — institutions must proactively publish AI tool details rather than waiting for affected individuals to discover and challenge their use.

Incorrect. ATRS is specifically a proactive disclosure standard, requiring government bodies to publish AI tool information as a matter of course — not in response to complaints or challenges.

Lab 4 — Trust Recovery Strategist

Design recovery plans for documented AI trust failures — applying the three conditions for rebuilding trust

Your Task

You are an AI ethics consultant brought in after a trust failure has been publicly exposed. The assistant will brief you on a documented case — specifying the failure type, the affected population, the organisation's initial response, and the regulatory context. Apply the three conditions for trust recovery (acknowledgment, structural change, accountability) to design a recovery plan. Then assess whether full trust recovery is realistic or whether algorithmic aversion has foreclosed it.

Ask the assistant to present your first trust-failure case, or choose a domain (healthcare AI, hiring AI, criminal justice, education AI). Work through at least three cases to complete this lab.

Trust Recovery Strategy Lab

AI Explainability · Module 4

Welcome to the Trust Recovery Strategy Lab. I'll present documented AI trust failure cases with full context — failure type, affected population, organisational response, and regulatory setting. Your job is to apply the three recovery conditions (acknowledgment, structural change, accountability), design a concrete recovery plan, and honestly assess whether full recovery is possible or whether algorithmic aversion makes it unrealistic. Ready to start? Say "give me a case" or pick a domain: healthcare, hiring, criminal justice, or education.

Module 4 Test

Building Trust Through Explanation — 15 questions · Pass mark: 80%

1. ProPublica's 2016 COMPAS investigation found that Black defendants were flagged as high risk for violent recidivism at what false-positive rate compared to white defendants?

Correct. ProPublica's analysis of 7,000 Florida defendants found this stark disparity in violent recidivism false-positive rates.

Incorrect. ProPublica found 77.4% false-positive rate for Black defendants vs. 41.4% for white defendants in violent recidivism predictions — nearly double.

2. Cognitive trust in AI systems is best described as:

Correct. Cognitive trust is rational — built on evidence of the system's reliability, competence, and accountability mechanisms.

Incorrect. Cognitive trust is rational and evidence-based. Affective trust is emotional, built through positive experience and perceived goodwill.

3. The 2021 Nature Machine Intelligence study on clinical AI found that feature-attribution explanations improved clinicians' decision accuracy by approximately:

Correct. The study found 15% better decision accuracy for clinicians who received feature-attribution explanations alongside AI diagnostic outputs.

Incorrect. The Nature Machine Intelligence study found a 15% improvement in decision accuracy for clinicians receiving feature-attribution explanations vs. confidence scores alone.

4. IBM Watson for Oncology's 2018 failure was primarily an issue of:

Correct. Watson was trained largely on synthetic cases and used explanation formats — confidence scores with brief justifications — that created an illusion of reliability without exposing underlying limitations.

Incorrect. The core problem was training data quality combined with explanation formats that looked authoritative but did not surface the model's actual reasoning or the synthetic nature of its training data.

5. The "illusion of explanatory depth" in AI contexts refers to:

Correct. This phenomenon — where confidence in understanding exceeds actual understanding — was demonstrated in the 2020 CHI study on loan-rejection feature importance displays.

Incorrect. The illusion of explanatory depth is about users' subjective confidence in understanding exceeding their actual ability to understand or predict model behaviour after seeing an explanation.

6. Which explanation format does Miller's 2019 research suggest is most cognitively natural for humans, based on social science literature?

Correct. Miller found that human explanation is naturally contrastive — people ask "why this rather than that?" not "what are all the causes of this?" This supports counterfactual explanation formats.

Incorrect. Miller's synthesis found contrastive explanations most natural — humans explain by contrast, asking "why this rather than that?", not by enumerating causal chains or feature rankings.

7. Adebayo et al.'s 2018 "Sanity Checks for Saliency Maps" found a critical limitation of popular gradient-based saliency methods. What was it?

Correct. This "sanity check" failure revealed that popular saliency methods may not actually show what the model learned — undermining their use as explanations in high-stakes domains like radiology.

Incorrect. The key finding was that these methods failed a basic sanity check — producing similar visual outputs regardless of whether the network was trained or randomly initialized — questioning their validity as explanations.

8. Model Cards, introduced by Mitchell et al. (2019), were first developed at:

Correct. Model Cards were developed at Google Brain/Google AI and are now widely adopted across the AI industry as a documentation standard.

Incorrect. Model Cards were developed at Google, with key researchers including Margaret Mitchell and Timnit Gebru, both of whom were later dismissed from the company in contentious circumstances.

9. The Dutch SyRI court ruling (2020) was significant because it established that:

Correct. The SyRI ruling extended explainability obligations to the systemic level — institutional opacity, not just individual explanation failures, constitutes a rights violation.

Incorrect. The SyRI ruling established that systemic organisational opacity in AI deployment violates ECHR Article 8 rights — an extension of explainability obligations to the institutional level.

10. The three conditions for AI trust recovery identified from organisational trust repair research are:

Correct. Kim, Dirks, and Cooper's research on trust repair, applied to AI systems, identifies acknowledgment, structural change, and accountability as the three essential recovery conditions.

Incorrect. The three conditions are: explicit acknowledgment of the failure and its impacts, structural change in processes (observable by affected parties), and visible accountability for what occurred.

11. The UK's Algorithmic Transparency Recording Standard (ATRS) differs from previous transparency approaches by:

Correct. ATRS inverts the traditional burden: institutions must proactively disclose AI use rather than waiting for individuals to discover and challenge it.

Incorrect. ATRS requires proactive disclosure as a default obligation — government bodies publish AI tool information as standard practice, not in response to individual challenges.

12. Amazon's hiring AI was ultimately discontinued because:

Correct. Amazon's engineers edited the model to neutralise the gender-biased terms they had found. However, they concluded there was no guarantee it would not learn other discriminatory ways of sorting candidates, so the bias could not be reliably remediated.

Incorrect. Amazon discontinued the tool because its engineers could not reliably fix the gender bias they had discovered. Neutralising specific biased terms offered no guarantee that the model would not find other discriminatory proxies.

13. The NIST AI Risk Management Framework's "Govern" function specifically addresses:

Correct. The Govern function addresses the institutional foundations — culture, accountability, and transparency policies — that underpin all other risk management activities.

Incorrect. The Govern function in the NIST AI RMF addresses organisational culture, accountability structures, and transparency policies — the institutional foundations of AI risk management.

14. Proxy discrimination occurs when:

Correct. Proxy discrimination is particularly dangerous because it produces discriminatory outcomes without technically encoding protected characteristics — making it harder to detect and challenge.

Incorrect. Proxy discrimination occurs through correlated variables — using postcode as a proxy for race, or healthcare cost as a proxy for health need — producing discriminatory outcomes without directly encoding the protected characteristic.

15. The UK's 2020 A-Level algorithm controversy demonstrated what aspect of algorithmic trust failure?

Correct. The A-Level episode illustrates the limits of trust repair — the algorithm was withdrawn within days and effectively ended political support for algorithmic grading in UK education for years, regardless of technical fixes.

Incorrect. The A-Level controversy illustrates a failure of contextual legitimacy — the withdrawal destroyed public acceptance of algorithmic grading in education in the UK, a collapse that technical explanations could not reverse.