Governing AI Systems You Didn't Build · Module 9 · Lesson 1

Inside the Black Box

How to understand and govern AI systems created by vendors — understanding the technical reality behind the promises

In 2018, a major healthcare system licensed an IBM Watson for Oncology system. It cost $62 million. The marketing promised AI-powered treatment recommendations comparable to those of expert oncologists. The reality was different. The system had been trained on a small dataset from a single institution. It had not been validated in real clinical settings. When deployed, it regularly recommended treatments for patients with conditions it had never encountered. The healthcare system depended on the vendor's assurances that the system worked and had no way to verify them.

The hospital had licensed an AI system without actually knowing what it was — or being able to audit whether it did what the vendor claimed.

What "Black Box" Actually Means

An AI system is not a "black box" because the underlying mathematics is complex. A trained machine learning model is mathematically opaque to humans, but many non-AI systems are equally opaque to non-experts. An AI system is a "black box" as a governance problem because organizations cannot observe its decision-making process. You can test it with inputs and observe outputs. But you cannot see what features the model is using, why it made a specific decision, or whether it is making decisions differently than it did when you licensed it. This creates a fundamental governance asymmetry: the vendor knows how the system works; you do not.

The governance challenge of black-box systems has three components.

Training data: What data was the system trained on? What does that data represent, what biases does it encode, what populations does it exclude? Most vendors will not disclose this. You typically see "trained on X million examples" without knowing what those examples are.

Model weights and parameters: These are the numerical values the training process learned. They are proprietary and will not be disclosed. But they determine what the system actually does.

Decision logic: Why did the system make a specific decision? Some AI systems can provide feature importance (this decision was 60% based on factor A, 30% based on factor B). Many cannot. And feature importance is often misleading: a feature the model is using heavily may or may not be a legitimate basis for the decision.

What Vendors Will and Won't Disclose

Vendors will disclose: overall accuracy metrics, benchmark comparisons, high-level system architecture. Vendors will almost never disclose: training data composition, model weights, decision logic for individual cases, performance breakdowns by demographic group (often called "disaggregated performance"). Understanding what you cannot know about a system you have licensed is the starting point of governance.

Technical Auditing Without Access

If you cannot access the model weights or training data, how do you audit whether a system does what the vendor claims? Technical audit without access relies on behavioral testing and outcome analysis.

Behavioral testing: You can test the system with your own data, observing input-output patterns. Does it perform as advertised? Does it perform equally well across demographic groups? Does it behave consistently, or do similar inputs sometimes produce different outputs?

Performance disaggregation: Overall accuracy can mask serious problems. A system that is 95% accurate overall might be 70% accurate for a minority subgroup, and the vendor may never tell you this. Request disaggregated performance metrics for any demographic groups relevant to your use case.

Outcome monitoring: After deployment, does the system's performance degrade over time? Is there outcome drift, meaning the target variable the system predicts has changed without anyone noticing? This can happen when the real-world distribution of what you are predicting shifts but the model was trained on historical data.
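The arithmetic behind performance disaggregation can be sketched in a few lines. This is a minimal illustration, not a vendor tool: the function name accuracy_by_group, the group labels, and the test-set counts are all hypothetical, chosen so that the overall figure comes out at 95% while one group sits at 70%.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}).

    records: iterable of (group, predicted, actual) tuples from your own
    behavioral test set. The tuple layout is an assumption for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Hypothetical test set: 900 majority-group cases (880 correct) and
# 100 minority-group cases (70 correct).
records = (
    [("majority", 1, 1)] * 880 + [("majority", 1, 0)] * 20
    + [("minority", 1, 1)] * 70 + [("minority", 1, 0)] * 30
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.1%}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.1%}")
```

Because the minority group is only a tenth of the test set, its 30 errors barely move the headline number. That is why the overall metric alone cannot be used to accept or reject a system.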

The Auditing Asymmetry

Vendor-created AI systems create a structural asymmetry in governance: the vendor knows what the system does; you do not. You can test behavior, but the vendor knows whether the system is performing as designed or whether something is wrong. Governance of vendor AI systems requires accepting this asymmetry and building in verification mechanisms — asking vendors for disaggregated performance, monitoring outcomes, creating fallback plans for when vendor systems fail. You cannot govern what you cannot observe.

Lesson 1 Quiz

Understanding black-box AI systems

An AI system is "black box" as a governance problem primarily because:

✓ Correct. The black-box problem is not about mathematical complexity; it is about observability. You cannot see inside the system, which creates a governance asymmetry: the vendor knows how it works; you do not.

Black-box is a governance term, not a mathematics term. The issue is that you cannot observe decision-making — what features matter, why specific decisions were made, how the system changes over time.

Which of the following would a vendor typically NOT disclose about an AI system you are considering licensing?

✓ Correct. Disaggregated performance, meaning how a system performs for different groups, is precisely the information you need for governance and equity assessment. Vendors almost never disclose this voluntarily because it often reveals serious disparities.

Vendors disclose high-level information and overall metrics. They avoid disclosing disaggregated performance — how the system performs for specific demographic groups — because this often reveals serious problems the overall metric hides.

When you cannot access a vendor AI system's model weights or training data, auditing relies on:

✓ Correct. Without access to internals, audit must be behavioral: testing how the system behaves with your data, comparing performance across groups, and monitoring whether performance degrades after deployment.

Auditing black-box systems requires working within constraints: behavioral testing, disaggregated performance analysis, and outcome monitoring. You cannot force the vendor to disclose proprietary details.

Module 9 · Lab 1

Audit a Vendor AI System's Claims

Develop a testing plan to verify what a vendor-provided AI system actually does

You are evaluating an AI system a vendor has proposed your organization license. The vendor claims the system is 94% accurate on their benchmark and is "ready for production." You have access to the system for 30 days of testing. You cannot access the training data or model weights. Your organization will use this system to make decisions affecting real people.

Design a behavioral testing plan. What data would you test with? What performance metrics would you ask the vendor for? What demographics would you test for? What would trigger rejection of the system? Start by describing your organization's use case and the populations affected.

Describe your organization (healthcare provider, financial institution, government agency, etc.), the AI system you are evaluating, and the specific use case. Then outline your testing strategy — what would you test, what metrics matter, and what would be red flags that the system isn't trustworthy?

Governing AI Systems You Didn't Build · Module 9 · Lesson 2

Procurement as Governance

How licensing agreements create (or fail to create) accountability for vendor AI systems

A major US bank licensed a credit-scoring AI system from a vendor in 2019. The contract specified accuracy levels. It did not specify that the vendor would notify the bank of changes to the underlying model. For three years, the vendor pushed model updates to production without notification. When the bank discovered that recent performance degradation was due to an unannounced model change, the contract had no mechanism to address this. The vendor had the legal right to change the system — the contract did not forbid it.

The bank had signed a licensing agreement. It had not established governance over what the vendor could do with the system.

What Procurement Contracts Should Specify

A licensing contract for an AI system should function as a governance document — specifying obligations, constraints, and consequences that make the vendor accountable. Standard software licensing contracts are often inadequate for AI systems because they do not address the specific governance challenges of machine learning. Key provisions that AI procurement contracts must include:

Performance specifications: Accuracy metrics alone are insufficient. The contract should specify: disaggregated performance (how the system performs for each demographic group your organization cares about), performance thresholds (what accuracy level is required for ongoing use), and performance monitoring requirements (the vendor must provide regular performance reporting).
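One way to make a performance specification testable is to write the thresholds down as numbers and check each vendor report against them. The sketch below assumes a hypothetical report layout and hypothetical threshold values (min_overall, min_per_group, max_gap); the real figures are whatever your contract states.

```python
def check_contract_thresholds(reported, min_overall, min_per_group, max_gap):
    """Return a list of breach descriptions; an empty list means compliant.

    reported: {"overall": float, "groups": {name: accuracy}}. This layout is
    an assumption for the sketch, not a standard reporting format.
    """
    breaches = []
    if reported["overall"] < min_overall:
        breaches.append(
            f"overall accuracy {reported['overall']:.2f} is below {min_overall:.2f}"
        )
    groups = reported["groups"]
    for name, acc in sorted(groups.items()):
        if acc < min_per_group:
            breaches.append(f"{name} accuracy {acc:.2f} is below {min_per_group:.2f}")
    gap = max(groups.values()) - min(groups.values())
    if gap > max_gap:
        breaches.append(f"best-to-worst group gap {gap:.2f} exceeds {max_gap:.2f}")
    return breaches

# Hypothetical quarterly report from the vendor.
report = {"overall": 0.95, "groups": {"group_a": 0.97, "group_b": 0.70}}
breaches = check_contract_thresholds(
    report, min_overall=0.90, min_per_group=0.85, max_gap=0.05
)
for breach in breaches:
    print("BREACH:", breach)
```

Here the overall figure passes while the per-group floor and the gap limit both fail, which is the situation a contract based on overall accuracy alone would never surface.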

Training data transparency: The vendor should disclose: what data the system was trained on, what date ranges it covers, what biases are known to exist in the training data, and what populations are underrepresented. This information may be provided under NDA if the data itself is proprietary, but you cannot govern what you do not know.

Model change notification: The vendor should be required to notify your organization of any model updates, retraining, or changes to the system's behavior before they go to production. This allows your organization to test and validate the changes before they affect your operations.
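A notification clause can be backed by a simple behavioral check on your side. The sketch below is one possible approach, not something the lesson prescribes: keep a fixed probe set of inputs, record a fingerprint of the system's outputs, and compare it on a schedule. The functions vendor_v1 and vendor_v2 are hypothetical stand-ins for the vendor's system before and after a silent update.

```python
import hashlib
import json

def behavior_fingerprint(predict, probe_inputs):
    """Hash the system's outputs on a fixed probe set into one fingerprint."""
    outputs = [predict(x) for x in probe_inputs]
    payload = json.dumps(outputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical stand-ins for the vendor system before and after an update.
def vendor_v1(applicant):
    return "approve" if applicant["income"] >= 40000 else "deny"

def vendor_v2(applicant):
    return "approve" if applicant["income"] >= 45000 else "deny"

# The probe set never changes, so any change in outputs points at the system.
probe_set = [{"income": 30000}, {"income": 42000}, {"income": 60000}]

baseline = behavior_fingerprint(vendor_v1, probe_set)
today = behavior_fingerprint(vendor_v2, probe_set)
if today != baseline:
    print("Probe outputs changed: ask the vendor whether the model was updated.")
```

An exact-match fingerprint only suits deterministic outputs such as labels. For numeric scores you would likely compare with a tolerance instead, and a probe set can only reveal changes that affect the inputs it contains.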

Incident response: What happens if the system causes serious harm? The contract should specify: how quickly the vendor will respond to incident reports, what investigation and remediation the vendor will undertake, what liability the vendor accepts, and what happens if the vendor cannot address the incident.

Audit rights: Your organization should have the right to audit the system's behavior — both through the testing and behavioral analysis covered in Lesson 1, and potentially through third-party technical audit. Some vendors will accept independent audit; many will not.

The Negotiation Reality

Large AI vendors have significant market power and standard contracts they are unwilling to modify substantially. Smaller organizations often face a choice: accept the vendor's standard terms or go without the system. Negotiating better terms requires one of three things: scale (your organization is large enough that the vendor wants your business), alternatives (a competitor offers better terms), or coalition (multiple organizations negotiate jointly). Understanding which provisions matter most to your organization, and which you might accept standard terms on, is essential for effective negotiation.

Governance Through Procurement

A well-designed procurement contract creates ongoing governance obligations that persist through the contract term. The most important contracts are not those that get the best pricing; they are those that create accountability. This means specifying what you can measure and audit, requiring the vendor to be transparent about whatever it is able to disclose, and building in mechanisms for addressing the inevitable cases where the system does not perform as promised.

Procurement as a Governance Lever

Many organizations think of procurement as a purchasing function — getting the best system at the best price. Procurement is also a governance lever: the contract you sign determines what accountability the vendor will accept. A well-structured contract transfers risk and responsibility to the vendor in proportion to what they control. A weak contract leaves you liable for the vendor's failures.

Lesson 2 Quiz

Procurement as governance

A vendor AI system contract should specify disaggregated performance because:

✓ Correct. An overall accuracy figure (say, 95%) can hide disparities (say, 70% for a minority group). Disaggregated performance is a governance necessity whether or not it is legally required.

Disaggregated performance is essential for governance — you need to know how the system performs for groups your organization cares about, not just overall averages.

A "model change notification" provision is important in procurement contracts because:

✓ Correct. Without notification requirements, vendors can change systems without your knowledge. Notification provisions give you the opportunity to test and validate before the changes affect your operations.

Model change notification is not about preventing updates — it is about making sure you know about them before they affect your operations.

Procurement contracts create governance because they:

✓ Correct. Contracts create accountability by specifying what the vendor must do, what they must disclose, and what happens when performance fails. This is governance through contractual obligation rather than through policy.

Procurement contracts create governance by assigning obligations and responsibility. A well-designed contract specifies what the vendor will do, what metrics matter, and what happens if they fail.

Module 9 · Lab 2

Draft Procurement Requirements for an AI System

Write the contract provisions that would create real accountability for vendor AI systems

You are the governance lead for an organization evaluating an AI system for a high-stakes use case. You need to draft the AI-specific provisions that your legal team should insist on in any vendor contract. These are not standard software licensing terms — they are governance requirements.

For your organization and use case, specify: (1) What performance metrics the vendor must guarantee. (2) What training data information the vendor must disclose. (3) What happens if the system fails. (4) What audit rights your organization requires. Include at least one metric for demographic performance disparity.

Describe your organization, the AI system you are evaluating, and who will be affected by its decisions. Then outline the top 3 governance provisions you would insist on in the contract — why each one matters and what it obligates the vendor to do.

Governing AI Systems You Didn't Build · Module 9 · Lesson 3

Specifying What You Want

How to write governance requirements before building or licensing an AI system — the design layer of governance

A government agency requested proposals for an AI system to process benefit applications. The RFP was vague: "The system should accurately determine eligibility for benefits." Vendors responded with systems optimized for what accuracy meant to them — and accuracy alone. One vendor's system was 96% accurate but rejected 30% more applicants from certain zip codes. Another achieved high accuracy by applying existing biases in the training data more efficiently. The agency had asked for accuracy. The vendors had delivered it. But the agency had not specified the constraints that would make accuracy actually serve the agency's mission.

The agency had specified what it wanted. It had not specified the constraints that would determine whether what it got would actually serve the intended purpose.

Requirements as Governance Design

How you write AI system requirements — whether you are building the system internally or licensing it from a vendor — shapes what governance will actually be possible. Poor requirements produce systems that meet the stated requirements but fail the underlying purpose. Strong requirements specify not just what the system should do, but the constraints that govern how it should do it.

Functional requirements: What is the system supposed to accomplish? Traditional functional requirements are necessary but insufficient. "Classify applicants as eligible or ineligible" is a functional requirement. "Classify with 90% accuracy" adds a performance metric. But neither specifies whether it is acceptable to achieve this accuracy by systematically underestimating the eligibility of certain demographic groups.

Fairness and equity requirements: How should the system perform across demographic groups? This requires specifying: Which demographic groups matter to your organization? What is acceptable disparity? What metrics will you use to measure fairness (demographic parity, equalized odds, calibration, other)? What will you do if the system does not meet fairness requirements? Many organizations skip this entirely. This is a governance failure.

Explainability requirements: How much can your organization tolerate not knowing why the system made a specific decision? For high-stakes decisions affecting individuals, explainability requirements should mandate that the system provide human-understandable justification. For lower-stakes decisions, a feature importance score may be sufficient. For some applications, complete black-box operation is acceptable.

Human oversight requirements: At what points should human review be required? For high-stakes decisions, human review of all decisions might be required. For medium-stakes decisions, human review of borderline cases (decisions the model is uncertain about) might be sufficient. For low-stakes decisions, no human review might be acceptable. Specifying where humans stay in the loop is specifying where governance remains.

The Fairness Metrics Decision

Organizations often struggle with fairness requirements because there is no single agreed-upon definition of fairness. Demographic parity (equal outcomes across groups) may conflict with equalized odds (equal false positive and false negative rates across groups). Specifying requirements forces the organization to decide which fairness definition matters for its use case. This decision reflects values — which groups the organization prioritizes, what type of error it finds more unacceptable — so it should not be delegated to vendors or technical teams.
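The conflict between the two definitions is easy to see with numbers. The sketch below uses hypothetical groups A and B with different base rates: both are selected at the same rate, so demographic parity holds, yet the false positive rates differ, so equalized odds does not. The function names and the data are illustrative assumptions.

```python
def group_rates(records):
    """Per-group selection rate, true positive rate, and false positive rate.

    records: iterable of (group, predicted, actual) with 0/1 values.
    Assumes every group has at least one positive and one negative case.
    """
    counts = {}
    for group, predicted, actual in records:
        c = counts.setdefault(
            group, {"n": 0, "selected": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0}
        )
        c["n"] += 1
        c["selected"] += predicted
        if actual == 1:
            c["pos"] += 1
            c["tp"] += predicted
        else:
            c["neg"] += 1
            c["fp"] += predicted
    return {
        g: {
            "selection_rate": c["selected"] / c["n"],
            "tpr": c["tp"] / c["pos"],
            "fpr": c["fp"] / c["neg"],
        }
        for g, c in counts.items()
    }

def gap(rates, key):
    """Largest difference between any two groups on one metric."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

# Hypothetical data: group A has 5 of 10 truly eligible, group B has 2 of 10.
records = (
    [("A", 1, 1)] * 5 + [("A", 0, 0)] * 5
    + [("B", 1, 1)] * 2 + [("B", 1, 0)] * 3 + [("B", 0, 0)] * 5
)
rates = group_rates(records)
print("demographic parity gap:", gap(rates, "selection_rate"))
print("true positive rate gap:", gap(rates, "tpr"))
print("false positive rate gap:", gap(rates, "fpr"))
```

Equal selection rates were achieved here only by approving three ineligible people in group B. Which gap your organization is willing to tolerate is the values decision the paragraph above describes.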

Requirements as Constraints on Automation

The strongest requirements explicitly constrain where automation can and cannot be applied. A government agency processing benefits applications might specify: "Automated approval is permitted for straightforward cases where stated income can be verified and applicant information matches prior records. Automated denial is never permitted — all denials must receive human review by a trained adjudicator." This is a governance requirement disguised as a functional specification. It creates accountability — humans are responsible for the decisions that harm people — while permitting automation where harm is unlikely.
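The benefits example above can be expressed as a routing rule in which automated denial is simply not a possible outcome. This is a minimal sketch: the function name route_decision and its parameters are hypothetical, and a real system would need auditable definitions of "verified" and "matches prior records".

```python
def route_decision(model_recommendation, income_verified, matches_prior_records):
    """Return "auto_approve" or "human_review".

    There is deliberately no "auto_deny" outcome: every recommended denial
    goes to a trained adjudicator.
    """
    straightforward = income_verified and matches_prior_records
    if model_recommendation == "approve" and straightforward:
        return "auto_approve"
    # Recommended denials, and approvals that are not straightforward,
    # both require human review.
    return "human_review"

print(route_decision("approve", True, True))
print(route_decision("approve", False, True))
print(route_decision("deny", True, True))
```

Encoding the constraint in the set of possible return values, rather than in a policy memo, means a later code change that adds automated denial is visible in review.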

Requirements as Your Governance Document

Well-written AI system requirements function as a governance document. They specify what the system must and must not do, which groups' interests the system is designed to serve, what metrics demonstrate whether the system is serving those interests, and where humans remain in control. If your requirements do not address these points, your system will be designed without governance constraints — and governance after deployment will be nearly impossible.

Lesson 3 Quiz

Specifying governance requirements

Fairness requirements in AI system specifications are necessary because:

✓ Correct. High overall accuracy can hide serious disparities. Fairness requirements force the organization to specify what it actually cares about, and what it will accept or reject.

Accuracy and fairness are different metrics. High accuracy does not guarantee fairness. Specifying fairness requirements is how organizations ensure systems do not produce biased outcomes.

Explainability requirements in AI system specifications matter because:

✓ Correct. Whether explainability matters depends on the use case and stakes. For high-stakes decisions affecting individuals, humans need to understand why. For other decisions, different levels of explainability may be acceptable.

Explainability requirements should match the stakes and consequences of decisions. What matters is specifying clearly where humans need to understand the system and where they do not.

Specifying "where humans stay in the loop" is a governance requirement because:

✓ Correct. Specifying human oversight is specifying governance structure: where decisions are too important to fully automate, accountability must remain with humans.

Human oversight requirements are governance requirements. They specify where the organization chooses to keep humans in control, maintaining accountability at points where full automation would be irresponsible.

Module 9 · Lab 3

Write Governance Requirements for an AI System

Design an AI system that serves your organization's values, not just a narrow definition of accuracy

Your organization is commissioning an AI system. This is your opportunity to specify governance into the system from the start. You will write the governance-focused requirements that should constrain how the system is built or selected.

For a specific use case and organization: (1) Write the fairness requirement — specify which demographic groups matter, what disparity is acceptable, what metric will measure it. (2) Write the explainability requirement — specify what level of explainability is required and why. (3) Write the human oversight requirement — specify at what points humans must review decisions and why. (4) Identify one constraint on automation — is there a type of decision the system should never make fully automatically?

Name your organization, the AI system, and its use case. Then describe the fairness requirement you would specify — which groups should be treated equally, and how you would measure that? Push yourself to be specific about what "fairness" means in this context.

Governing AI Systems You Didn't Build · Module 9 · Lesson 4

Ongoing System Governance

How to monitor, audit, and maintain control of AI systems after deployment — when governance actually matters

A major financial institution deployed a credit-scoring AI system in 2020. The system had passed extensive testing. The first year of deployment showed strong performance. Then monitoring was deprioritized: the system was working, or so it seemed. By 2022, the system had drifted. The market had changed; the population applying for credit had changed; the definition of what constituted good credit risk had shifted. The model was still making predictions, but it was making them based on increasingly irrelevant patterns. Nobody noticed until an audit (triggered by external pressure, not internal monitoring) revealed the problem. By then, the system had been making biased decisions for years.

The system had been deployed. But it had never been governed.

Monitoring and Performance Tracking

AI systems decay over time. This decay has a name: model drift. The relationships the model learned during training may no longer hold in production. The population the system makes decisions about may have changed. The target variable (what the system is trying to predict) may have changed. Monitoring detects when these changes are happening and triggers investigation.

Performance monitoring: Track whether the system's accuracy is changing over time. Monitor overall accuracy and disaggregated performance by demographic group separately, because overall accuracy can be stable while performance for specific groups deteriorates.

Distribution monitoring: Does the distribution of inputs to the system match the distribution it was trained on? If the population being classified has changed, the model's performance will differ from what it was in training, even if nothing is wrong with the model itself.

Outcome monitoring: What is actually happening as a result of the system's decisions? If credit scores are leading to loans that default more frequently, or hiring recommendations are leading to hires that underperform, these outcome measures can surface problems that standard performance metrics miss.
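Distribution monitoring needs a number that can be compared against a threshold. The lesson does not name a metric; the sketch below uses the Population Stability Index (PSI), one common choice in credit-risk practice, computed over binned counts of a single input feature. The bin counts are hypothetical.

```python
import math

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between two binned distributions given as per-bin counts.

    expected_counts: counts per bin in the reference (training-era) data.
    actual_counts: counts per bin in recent production data, same bins.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)  # eps avoids log(0) for empty bins
        a_share = max(a / a_total, eps)
        psi += (a_share - e_share) * math.log(a_share / e_share)
    return psi

# Hypothetical applicant-income bins: reference period vs. last month.
training_bins = [200, 300, 300, 200]
production_bins = [100, 200, 300, 400]

psi = population_stability_index(training_bins, production_bins)
print(f"PSI = {psi:.3f}")
```

A PSI of zero means the two distributions match bin for bin, and larger values mean a larger shift. A commonly quoted rule of thumb treats values above roughly 0.25 as a large shift, but the threshold that triggers investigation is a governance decision for your organization to set and document.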

The Monitoring Discipline

Monitoring requires discipline and ownership. Without a named role responsible for regular monitoring, with regular reporting to decision-makers, and with defined thresholds that trigger investigation, monitoring falls away. The system works until someone realizes it doesn't — and by then, harm may have been done. The strongest governance systems treat monitoring as non-negotiable.

Retraining, Updates, and Model Governance

When a system's performance degrades, the response is usually retraining: updating the model with new data to restore performance. This creates new governance challenges.

Retraining procedures: What triggers retraining? Automated trigger thresholds (if accuracy drops below 90%, automatically retrain) can be dangerous, because a poorly retrained model may perform even worse. Manual trigger thresholds are safer but require monitoring discipline.

Validation before deployment: A retrained model must be tested before replacing the production model, using recent data and the full test suite that validated the original model.

Rollback procedures: What happens if a retrained model performs worse than the current production model? A governance system specifies that the previous model remains in production until the new model demonstrates superior performance.
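The validation and rollback rules above amount to a promotion gate. In the sketch below, the candidate replaces the production model only if it is better overall; the additional rule that it must be no worse for any group is an assumption added for illustration, as are the function name should_promote and all the figures.

```python
def should_promote(candidate, production, min_improvement=0.0):
    """Decide whether a retrained model may replace the production model.

    Both arguments: {"overall": float, "groups": {name: accuracy}}, measured
    on the same recent validation data. The layout is hypothetical.
    """
    # Rule from the lesson: the candidate must demonstrate superior performance.
    if candidate["overall"] <= production["overall"] + min_improvement:
        return False
    # Assumed extra rule: no group may end up worse off than it is today.
    for group, prod_acc in production["groups"].items():
        if candidate["groups"].get(group, 0.0) < prod_acc:
            return False
    return True

production = {"overall": 0.91, "groups": {"group_a": 0.92, "group_b": 0.88}}
candidate = {"overall": 0.93, "groups": {"group_a": 0.95, "group_b": 0.84}}

# Default to keeping the current model; promotion is the exception.
live_model = "candidate" if should_promote(candidate, production) else "production"
print("model in production:", live_model)
```

The candidate here is better overall but worse for group_b, so the production model stays in place. Keeping the previous model as the default outcome is what makes rollback a non-event rather than an emergency.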

Incident Management

When an AI system causes significant harm, what is the organizational response? A governance system specifies this. An incident is any case in which the system's decisions cause demonstrable harm to individuals: for example, a hiring system disproportionately rejecting qualified candidates from a protected class, or a credit system making decisions that harm a subgroup. The incident response procedure specifies who is notified, how quickly, what investigation is undertaken, what remediation is offered, and what changes are made to prevent recurrence. Without incident management procedures, harm often remains hidden until external pressure forces a response.

Governance is Ongoing, Not Finished

The governance challenges described in Lessons 1-3 (understanding black-box systems, negotiating contracts, writing requirements) all set the stage. But governance actually happens in the work this lesson describes: the daily, unglamorous work of monitoring whether systems do what they should, updating them when they drift, and responding when they fail. An organization that commits to pre-deployment governance but deprioritizes post-deployment monitoring has not actually established governance. It has performed governance theater.

Lesson 4 Quiz

Ongoing system governance

Model drift is dangerous because it can lead to:

✓ Correct. Drift is dangerous precisely because it is invisible: the system keeps making predictions, and monitoring may show that overall accuracy is stable, but the predictions are becoming less reliable or more biased.

Model drift is the slow, invisible decay of system performance. The system keeps working, but not in the ways it was designed to. Monitoring detects drift before it causes harm.

Why is disaggregated performance monitoring important for ongoing governance?

✓ Correct. Disaggregated monitoring is how you detect when a system's bias is increasing, because the overall metric hides what is happening to specific groups.

Disaggregated performance monitoring reveals disparities that overall accuracy hides. A system with 95% overall accuracy but 70% accuracy for one group is not monitored adequately without disaggregated metrics.

In a well-governed system, when should a retrained model be deployed to production?

✓ Correct. Retraining can make things worse, not better. Governance requires testing before deployment, and rollback procedures if the new model underperforms the current one.

A retrained model is not automatically better. It must be tested thoroughly before deployment, and the organization must be able to roll back to the previous model if the new one performs worse.

Module 9 · Lab 4

Design a Monitoring and Governance Plan

Create the operational structure that keeps a deployed AI system under governance

An organization has deployed an AI system. Your job is to design the post-deployment governance system that will keep it accountable. You will specify: what will be monitored, how often, what thresholds will trigger action, who is responsible, how incidents are handled, and how the system will be updated.

For a deployed AI system: (1) Write your monitoring plan — what metrics will you track, how frequently, with what disaggregation? (2) Specify trigger thresholds — what performance level would cause you to investigate or pull the system? (3) Describe your incident response — what happens when the system causes significant harm? (4) Outline retraining governance — what triggers retraining, how is the retrained model validated, and how do you ensure it is actually better before deploying?

Describe the deployed system and its use case. Then outline your monitoring plan — what three metrics would you track, and why each one matters for catching problems early?

Module 9 Test

Governing AI Systems You Didn't Build — covering all 4 lessons

1. The primary governance challenge of "black-box" AI systems is:

2. Disaggregated performance metrics are essential in vendor procurement because:

3. A strong vendor procurement contract for an AI system should include all of the following EXCEPT:

4. Fairness requirements in system specifications matter because:

5. Human oversight requirements in AI system specifications are governance requirements because:

6. Model drift is particularly dangerous because:

7. When a system's performance is stable overall but declining for a specific demographic group, this is best detected through:

8. A retrained AI model should be deployed to production only after:

9. Behavioral testing of a vendor AI system can determine all of the following EXCEPT:

10. The most important reason to specify governance requirements before building or licensing an AI system is:

11. A vendor's refusal to disclose training data should lead your organization to:

12. In the IBM Watson for Oncology case discussed in the module, the fundamental governance failure was:

13. Which of the following is NOT a legitimate reason to skip explainability requirements in an AI system specification?

14. An organization that commits to pre-deployment governance but deprioritizes post-deployment monitoring has:

15. The through-line connecting lessons 1-4 is that governing AI systems you didn't build requires: