How to understand and govern AI systems created by vendors: the technical reality behind the promises
In 2018, a major healthcare system licensed an IBM Watson for Oncology system. It cost $62 million. The marketing promised AI-powered treatment recommendations comparable to expert oncologists. The reality was different. The system had been trained on a small dataset from a single institution. It had not been validated in real clinical settings. When deployed, it regularly recommended treatments for patients with conditions it had never encountered. The healthcare organization depended on the vendor's assurances that the product worked, and it had no way to verify them.
The hospital had licensed an AI system without actually knowing what it was — or being able to audit whether it did what the vendor claimed.
An AI system is not a "black box" simply because the underlying mathematics is complex. A trained machine learning model is mathematically opaque to humans, but many non-AI systems are equally opaque to non-experts. An AI system is a "black box" as a governance problem because organizations cannot observe its decision-making process. You can test it with inputs and observe outputs. But you cannot see what features the model is using, why it made a specific decision, or whether it is making decisions differently than it did when you licensed it. This creates a fundamental governance asymmetry: the vendor knows how the system works; you do not.
The governance challenge of black-box systems has three components.
Training data: What data was the system trained on? What does that data represent, what biases does it encode, what populations does it exclude? Most vendors will not disclose this. You typically see "trained on X million examples" without knowing what those examples are.
Model weights and parameters: These are the numerical values the training process learned. They are proprietary and will not be disclosed. But they determine what the system actually does.
Decision logic: Why did the system make a specific decision? Some AI systems can provide feature importance (this decision was 60% based on factor A, 30% based on factor B). Many cannot. And feature importance is often misleading: a feature the model is using heavily may or may not be a legitimate basis for the decision.
Vendors will disclose: overall accuracy metrics, benchmark comparisons, high-level system architecture. Vendors will almost never disclose: training data composition, model weights, decision logic for individual cases, performance breakdowns by demographic group (often called "disaggregated performance"). Understanding what you cannot know about a system you have licensed is the starting point of governance.
If you cannot access the model weights or training data, how do you audit whether a system does what the vendor claims? Technical audit without access relies on behavioral testing and outcome analysis.
Behavioral testing: You can test the system with your own data, observing input-output patterns. Does it perform as advertised? Does it perform equally well across demographic groups? Does it behave consistently, or do similar inputs sometimes produce different outputs?
Performance disaggregation: Overall accuracy can mask serious problems. A system that is 95% accurate overall might be 70% accurate for a minority subgroup, and the vendor may never tell you this. Request disaggregated performance metrics for any demographic groups relevant to your use case.
Outcome monitoring: After deployment, does the system's performance degrade over time? Is there outcome drift, where the target variable the system predicts changes without anyone noticing? This can happen when the real-world distribution of what you are predicting shifts but the model was trained on historical data.
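Performance disaggregation needs nothing more than the system's outputs on a labeled test set that you control. The following is a minimal Python sketch, not a complete audit: the function name `accuracy_by_group`, the record layout, and the test data are all hypothetical, with the numbers invented to mirror the 95% overall and 70% subgroup example.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return (overall accuracy, accuracy per group).

    Each record is a (group, true_label, predicted_label) tuple, where
    predicted_label is what the vendor system returned for that case.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, label, prediction in records:
        total[group] += 1
        if label == prediction:
            correct[group] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Invented test set: 900 majority-group cases and 100 minority-group cases.
records = (
    [("majority", 1, 1)] * 880 + [("majority", 1, 0)] * 20
    + [("minority", 1, 1)] * 70 + [("minority", 1, 0)] * 30
)
overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.2f}")
for group, accuracy in per_group.items():
    print(f"{group}: {accuracy:.2f}")
```

In this invented data the overall figure looks healthy only because the minority group is a small share of the test set, which is why the per-group numbers need to be requested or measured separately.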
Vendor-created AI systems create a structural asymmetry in governance: the vendor knows what the system does; you do not. You can test behavior, but the vendor knows whether the system is performing as designed or whether something is wrong. Governance of vendor AI systems requires accepting this asymmetry and building in verification mechanisms — asking vendors for disaggregated performance, monitoring outcomes, creating fallback plans for when vendor systems fail. You cannot govern what you cannot observe.
Develop a testing plan to verify what a vendor-provided AI system actually does
You are evaluating an AI system a vendor has proposed your organization license. The vendor claims the system is 94% accurate on their benchmark and is "ready for production." You have access to the system for 30 days of testing. You cannot access the training data or model weights. Your organization will use this system to make decisions affecting real people.
Design a behavioral testing plan. What data would you test with? What performance metrics would you ask the vendor for? What demographics would you test for? What would trigger rejection of the system? Start by describing your organization's use case and the populations affected.
How licensing agreements create (or fail to create) accountability for vendor AI systems
A major US bank licensed a credit-scoring AI system from a vendor in 2019. The contract specified accuracy levels. It did not specify that the vendor would notify the bank of changes to the underlying model. For three years, the vendor pushed model updates to production without notification. When the bank discovered that recent performance degradation was due to an unannounced model change, the contract had no mechanism to address this. The vendor had the legal right to change the system — the contract did not forbid it.
The bank had signed a licensing agreement. It had not established governance over what the vendor could do with the system.
A licensing contract for an AI system should function as a governance document — specifying obligations, constraints, and consequences that make the vendor accountable. Standard software licensing contracts are often inadequate for AI systems because they do not address the specific governance challenges of machine learning. Key provisions that AI procurement contracts must include:
Performance specifications: Accuracy metrics alone are insufficient. The contract should specify: disaggregated performance (how the system performs for each demographic group your organization cares about), performance thresholds (what accuracy level is required for ongoing use), and performance monitoring requirements (the vendor must provide regular performance reporting).
Training data transparency: The vendor should disclose: what data the system was trained on, what date ranges it covers, what biases are known to exist in the training data, and what populations are underrepresented. This information may be provided under NDA if the data itself is proprietary, but you cannot govern what you do not know.
Model change notification: The vendor should be required to notify your organization of any model updates, retraining, or changes to the system's behavior before they go to production. This allows your organization to test and validate the changes before they affect your operations.
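A notification clause can be backed by a behavioral check on your side: replay a fixed set of inputs on a schedule and compare the outputs with a stored baseline. The sketch below assumes the vendor system can be called as a function returning a numeric score. `detect_behavior_change`, `model_v1`, and `model_v2` are hypothetical stand-ins, not any vendor's API.

```python
def detect_behavior_change(score, canary_inputs, baseline_outputs, tolerance=1e-6):
    """Replay fixed inputs; return the ones whose output moved beyond tolerance."""
    changed = []
    for item, expected in zip(canary_inputs, baseline_outputs):
        actual = score(item)
        if abs(actual - expected) > tolerance:
            changed.append((item, expected, actual))
    return changed

# Hypothetical stand-ins for the vendor system before and after a silent update.
def model_v1(applicant):
    return 0.5 * applicant["income"] / 100_000 + 0.5 * applicant["history"]

def model_v2(applicant):
    return 0.3 * applicant["income"] / 100_000 + 0.7 * applicant["history"]

canaries = [
    {"income": 40_000, "history": 0.9},
    {"income": 100_000, "history": 1.0},
    {"income": 80_000, "history": 0.2},
]
# Record the baseline once, at the time the system is accepted.
baseline = [model_v1(c) for c in canaries]

unchanged = detect_behavior_change(model_v1, canaries, baseline)
drifted = detect_behavior_change(model_v2, canaries, baseline)
print(len(unchanged), "changed outputs against the original model")
print(len(drifted), "changed outputs after the silent update")
```

Note that the second canary scores the same under both stand-in models, so a single test input can miss a change. A broader canary set gives better coverage, and a check like this supplements the contractual obligation rather than replacing it.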
Incident response: What happens if the system causes serious harm? The contract should specify: how quickly the vendor will respond to incident reports, what investigation and remediation the vendor will undertake, what liability the vendor accepts, and what happens if the vendor cannot address the incident.
Audit rights: Your organization should have the right to audit the system's behavior — both through the testing and behavioral analysis covered in Lesson 1, and potentially through third-party technical audit. Some vendors will accept independent audit; many will not.
Large AI vendors have significant market power and standard contracts they are unwilling to modify substantially. Smaller organizations often face a choice: accept the vendor's standard terms or go without the system. Negotiating better terms requires either: scale (your organization is large enough that the vendor wants your business), alternatives (a competitor offers better terms), or coalition (multiple organizations negotiate jointly). Understanding what provisions matter most to your organization — and which you might accept standard terms on — is essential for effective negotiation.
A well-designed procurement contract creates ongoing governance obligations that persist through the contract term. The most valuable contracts are not those that secure the best pricing; they are those that create accountability. This means specifying what you can measure and audit, requiring the vendor to be transparent about everything it is able to disclose, and building in mechanisms for addressing the inevitable cases where the system does not perform as promised.
Many organizations think of procurement as a purchasing function — getting the best system at the best price. Procurement is also a governance lever: the contract you sign determines what accountability the vendor will accept. A well-structured contract transfers risk and responsibility to the vendor in proportion to what they control. A weak contract leaves you liable for the vendor's failures.
Write the contract provisions that would create real accountability for vendor AI systems
You are the governance lead for an organization evaluating an AI system for a high-stakes use case. You need to draft the AI-specific provisions that your legal team should insist on in any vendor contract. These are not standard software licensing terms — they are governance requirements.
For your organization and use case, specify: (1) What performance metrics the vendor must guarantee. (2) What training data information the vendor must disclose. (3) What happens if the system fails. (4) What audit rights your organization requires. Include at least one metric for demographic performance disparity.
How to write governance requirements before building or licensing an AI system — the design layer of governance
A government agency requested proposals for an AI system to process benefit applications. The RFP was vague: "The system should accurately determine eligibility for benefits." Vendors responded with systems optimized for what accuracy meant to them — and accuracy alone. One vendor's system was 96% accurate but rejected 30% more applicants from certain zip codes. Another achieved high accuracy by applying existing biases in the training data more efficiently. The agency had asked for accuracy. The vendors had delivered it. But the agency had not specified the constraints that would make accuracy actually serve the agency's mission.
The agency had specified what it wanted. It had not specified the constraints that would determine whether what it got would actually serve the intended purpose.
How you write AI system requirements — whether you are building the system internally or licensing it from a vendor — shapes what governance will actually be possible. Poor requirements produce systems that meet the stated requirements but fail the underlying purpose. Strong requirements specify not just what the system should do, but the constraints that govern how it should do it.
Functional requirements: What is the system supposed to accomplish? Traditional functional requirements are necessary but insufficient. "Classify applicants as eligible or ineligible" is a functional requirement. "Classify with 90% accuracy" adds a performance metric. But neither specifies whether it is acceptable to achieve this accuracy by systematically underestimating the eligibility of certain demographic groups.
Fairness and equity requirements: How should the system perform across demographic groups? This requires specifying: Which demographic groups matter to your organization? What is acceptable disparity? What metrics will you use to measure fairness (demographic parity, equalized odds, calibration, or another measure)? What will you do if the system does not meet fairness requirements? Many organizations skip this entirely. This is a governance failure.
Explainability requirements: How much can your organization tolerate not knowing why the system made a specific decision? For high-stakes decisions affecting individuals, explainability requirements should mandate that the system provide human-understandable justification. For lower-stakes decisions, a feature importance score may be sufficient. For some applications, complete black-box operation is acceptable.
Human oversight requirements: At what points should human review be required? For high-stakes decisions, human review of all decisions might be required. For medium-stakes decisions, human review of borderline cases (decisions the model is uncertain about) might be sufficient. For low-stakes decisions, no human review might be acceptable. Specifying where humans stay in the loop is specifying where governance remains.
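A tiered oversight requirement of this kind can be written down precisely enough to test. The sketch below is one hypothetical encoding in Python: the stakes labels, the function name `requires_human_review`, and the 0.15 uncertainty band around a 0.5 decision threshold are illustrative assumptions that an organization would need to set for itself.

```python
def requires_human_review(stakes, probability, band=0.15):
    """Decide whether a model decision must be reviewed by a person.

    stakes:      "high", "medium", or "low" (assumed tiers)
    probability: the model's score for the positive class, between 0 and 1
    band:        how close to 0.5 a score must be to count as borderline
    """
    if stakes == "high":
        return True                      # every decision is reviewed
    if stakes == "medium":
        return abs(probability - 0.5) <= band   # only borderline cases
    if stakes == "low":
        return False                     # fully automated
    raise ValueError(f"unknown stakes level: {stakes!r}")

print(requires_human_review("high", 0.99))    # True
print(requires_human_review("medium", 0.55))  # True
print(requires_human_review("medium", 0.95))  # False
print(requires_human_review("low", 0.51))     # False
```

Treating distance from 0.5 as "uncertainty" assumes the model's scores are reasonably calibrated, which is itself something to verify through testing.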
Organizations often struggle with fairness requirements because there is no single agreed-upon definition of fairness. Demographic parity (equal rates of favorable outcomes across groups) may conflict with equalized odds (equal false positive and false negative rates across groups). Specifying requirements forces the organization to decide which fairness definition matters for its use case. This decision reflects values (which groups the organization prioritizes, which type of error it finds less acceptable), so it should not be delegated to vendors or technical teams.
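The conflict between the two definitions shows up whenever groups differ in their underlying rates of eligibility. The Python sketch below uses invented data and a hypothetical helper, `group_rates`, to show a classifier that is perfectly accurate for both groups: it satisfies equalized odds exactly and still violates demographic parity.

```python
def group_rates(records):
    """Selection rate, true positive rate, and false positive rate.

    Each record is a (true_label, predicted_label) pair with labels 0 or 1.
    Assumes the group contains at least one positive and one negative case.
    """
    tp = sum(1 for y, p in records if y == 1 and p == 1)
    fp = sum(1 for y, p in records if y == 0 and p == 1)
    fn = sum(1 for y, p in records if y == 1 and p == 0)
    tn = sum(1 for y, p in records if y == 0 and p == 0)
    return {
        "selection_rate": (tp + fp) / len(records),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Invented data: 60 of 100 people in group A are truly eligible, 30 of 100 in
# group B. The classifier gets every single case right.
group_a = [(1, 1)] * 60 + [(0, 0)] * 40
group_b = [(1, 1)] * 30 + [(0, 0)] * 70

rates_a = group_rates(group_a)
rates_b = group_rates(group_b)

parity_gap = abs(rates_a["selection_rate"] - rates_b["selection_rate"])
tpr_gap = abs(rates_a["tpr"] - rates_b["tpr"])
fpr_gap = abs(rates_a["fpr"] - rates_b["fpr"])

print(f"demographic parity gap: {parity_gap:.2f}")
print(f"equalized odds gaps: TPR {tpr_gap:.2f}, FPR {fpr_gap:.2f}")
```

Closing the parity gap here would require the system to make errors for at least one group, so the organization has to choose which property it is willing to give up.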
The strongest requirements explicitly constrain where automation can and cannot be applied. A government agency processing benefits applications might specify: "Automated approval is permitted for straightforward cases where stated income can be verified and applicant information matches prior records. Automated denial is never permitted — all denials must receive human review by a trained adjudicator." This is a governance requirement disguised as a functional specification. It creates accountability — humans are responsible for the decisions that harm people — while permitting automation where harm is unlikely.
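Because this requirement is stated as a rule, it can be encoded directly and checked in testing. The following Python sketch is one possible encoding of the quoted rule; the `Application` fields and the `route` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Application:
    model_decision: str          # "approve" or "deny", as returned by the model
    income_verified: bool        # stated income could be verified
    matches_prior_records: bool  # applicant information matches prior records

def route(application):
    """Return "auto_approve" or "human_review". There is no automated denial path."""
    if (
        application.model_decision == "approve"
        and application.income_verified
        and application.matches_prior_records
    ):
        return "auto_approve"
    # Denials and anything not straightforward go to a trained adjudicator.
    return "human_review"

print(route(Application("approve", True, True)))   # auto_approve
print(route(Application("approve", False, True)))  # human_review
print(route(Application("deny", True, True)))      # human_review
```

The design choice is that the function has no return value meaning "automatically denied", so the prohibited outcome cannot occur by configuration error.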
Well-written AI system requirements function as a governance document. They specify what the system must and must not do, which groups' interests the system is designed to serve, what metrics demonstrate whether the system is serving those interests, and where humans remain in control. If your requirements do not address these points, your system will be designed without governance constraints — and governance after deployment will be nearly impossible.
Design an AI system that serves your organization's values, not just a narrow definition of accuracy
Your organization is commissioning an AI system. This is your opportunity to specify governance into the system from the start. You will write the governance-focused requirements that should constrain how the system is built or selected.
For a specific use case and organization: (1) Write the fairness requirement — specify which demographic groups matter, what disparity is acceptable, what metric will measure it. (2) Write the explainability requirement — specify what level of explainability is required and why. (3) Write the human oversight requirement — specify at what points humans must review decisions and why. (4) Identify one constraint on automation — is there a type of decision the system should never make fully automatically?
How to monitor, audit, and maintain control of AI systems after deployment — when governance actually matters
A major financial institution deployed a credit-scoring AI system in 2020. The system had passed extensive testing. The first year of deployment showed strong performance. Then monitoring was deprioritized: the system was working, or so it seemed. By 2022, the system had drifted. The market had changed; the population applying for credit had changed; the definition of what constituted good credit risk had shifted. The model was still making predictions, but it was making them based on increasingly irrelevant patterns. Nobody noticed until an audit (triggered by external pressure, not internal monitoring) revealed the problem. By then, the system had been making biased decisions for years.
The system had been deployed. But it had never been governed.
AI systems decay over time. This decay has a name: model drift. The relationships the model learned during training may no longer hold in production. The population the system makes decisions about may have changed. The target variable (what the system is trying to predict) may have changed. Monitoring detects when these changes are happening and triggers investigation.
Performance monitoring: Track whether the system's accuracy is changing over time. Monitor overall accuracy and disaggregated performance by demographic group separately, because overall accuracy can be stable while performance for specific groups deteriorates.
Distribution monitoring: Does the distribution of inputs to the system match the distribution it was trained on? If the population being classified has changed, the model's performance will differ from what it was in training, even if nothing is wrong with the model itself.
Outcome monitoring: What is actually happening as a result of the system's decisions? If credit scores are leading to loans that default more frequently, or hiring recommendations are leading to hires that underperform, these outcome measures can surface problems that standard performance metrics miss.
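Distribution monitoring is often reduced to a single number per input feature so that it can be tracked against a threshold. One common choice, offered here as an illustration rather than something this lesson prescribes, is the Population Stability Index (PSI). In the Python sketch below, the bin edges, the sample data, and the 0.25 alert level (a widely cited rule of thumb, not a standard) are assumptions.

```python
import math

def population_stability_index(expected, actual, bin_edges, floor=1e-6):
    """PSI between a reference sample and a production sample of one feature.

    bin_edges are ascending cut points. A value falls in bin i when it is
    greater than or equal to exactly i of the edges.
    """
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            index = sum(1 for edge in bin_edges if v >= edge)
            counts[index] += 1
        # Floor empty bins so the logarithm below is always defined.
        return [max(c / len(values), floor) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

# Invented incomes (in thousands) at validation time and in production.
reference = [25] * 30 + [55] * 40 + [85] * 30
production_same = [25] * 30 + [55] * 40 + [85] * 30
production_shifted = [25] * 60 + [55] * 30 + [85] * 10

edges = [40, 70]
psi_same = population_stability_index(reference, production_same, edges)
psi_shifted = population_stability_index(reference, production_shifted, edges)

ALERT_LEVEL = 0.25  # assumed threshold; set this as part of your monitoring plan
print(f"unchanged population: PSI = {psi_same:.3f}")
print(f"shifted population:   PSI = {psi_shifted:.3f}, alert = {psi_shifted > ALERT_LEVEL}")
```

A high PSI does not by itself mean the model is wrong. It means the population has moved away from what the model was validated on, which is the trigger for investigation.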
Monitoring requires discipline and ownership. Without a named role responsible for regular monitoring, with regular reporting to decision-makers, and with defined thresholds that trigger investigation, monitoring falls away. The system works until someone realizes it doesn't — and by then, harm may have been done. The strongest governance systems treat monitoring as non-negotiable.
When a system's performance degrades, the response is usually retraining: updating the model with new data to restore performance. This creates new governance challenges.
Retraining procedures: What triggers retraining? Automated trigger thresholds (if accuracy drops below 90%, automatically retrain) can be dangerous, because a poorly retrained model may perform even worse. Manual trigger thresholds are safer but require monitoring discipline.
Validation before deployment: A retrained model must be tested before replacing the production model, using recent data and the full test suite that validated the original model.
Rollback procedures: What happens if a retrained model performs worse than the current production model? A governance system specifies that the previous model remains in production until the new model demonstrates superior performance.
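The rule that the previous model stays in production until the new one demonstrates superior performance can be expressed as an explicit promotion gate. The Python sketch below is one hypothetical version. The function name `should_promote` and the accuracy figures are invented, and the additional check that no demographic group regresses is an assumption layered on top of the rule, in keeping with the emphasis on disaggregated performance.

```python
def should_promote(current, candidate, min_gain=0.0, max_group_drop=0.0):
    """Decide whether a retrained model may replace the production model.

    current and candidate map a slice name ("overall" or a group name) to
    accuracy measured on the same recent validation data.
    Returns (decision, reason).
    """
    if candidate["overall"] - current["overall"] <= min_gain:
        return False, "no overall improvement"
    for group, current_accuracy in current.items():
        if group == "overall":
            continue
        if current_accuracy - candidate[group] > max_group_drop:
            return False, f"regression for {group}"
    return True, "better overall with no group regression"

current = {"overall": 0.88, "group_a": 0.90, "group_b": 0.80}
better = {"overall": 0.92, "group_a": 0.93, "group_b": 0.85}
hidden_regression = {"overall": 0.90, "group_a": 0.95, "group_b": 0.72}
worse = {"overall": 0.85, "group_a": 0.86, "group_b": 0.80}

for name, candidate in [("better", better),
                        ("hidden_regression", hidden_regression),
                        ("worse", worse)]:
    decision, reason = should_promote(current, candidate)
    print(f"{name}: promote={decision} ({reason})")
```

When the gate returns False, the production model is simply left in place, which is the rollback behavior described above.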
When an AI system causes significant harm, what is the organizational response? A governance system specifies this. An incident might be the system's decisions causing demonstrable harm to individuals: a hiring system disproportionately rejecting qualified candidates from a protected class, or a credit system making decisions that harm a subgroup. The incident response procedure specifies who is notified, how quickly, what investigation is undertaken, what remediation is offered, and what changes are made to prevent recurrence. Without incident management procedures, harm often remains hidden until external pressure forces a response.
The governance challenges described in Lessons 1-3 (understanding black-box systems, negotiating contracts, writing requirements) all set the stage. But governance actually happens in Lesson 4, in the daily, unglamorous work of monitoring whether systems do what they should, updating them when they drift, and responding when they fail. An organization that commits to pre-deployment governance but deprioritizes post-deployment monitoring has not actually established governance; it has performed governance theater.
Create the operational structure that keeps a deployed AI system under governance
An organization has deployed an AI system. Your job is to design the post-deployment governance system that will keep it accountable. You will specify: what will be monitored, how often, what thresholds will trigger action, who is responsible, how incidents are handled, and how the system will be updated.
For a deployed AI system: (1) Write your monitoring plan — what metrics will you track, how frequently, with what disaggregation? (2) Specify trigger thresholds — what performance level would cause you to investigate or pull the system? (3) Describe your incident response — what happens when the system causes significant harm? (4) Outline retraining governance — what triggers retraining, how is the retrained model validated, and how do you ensure it is actually better before deploying?