Module 4 · Lesson 1

The Report That Moves Money

A recommendation isn't just an opinion. It's a decision someone else will act on.
What separates a real recommendation from a guess dressed up in confident language?

In the autumn of 2022, the Danish government's Agency for Digital Government published a report evaluating which AI tools public sector offices should adopt. The document, translated and studied across Europe, was careful, specific, and methodical. It named tools, described use cases, assessed risks, and gave clear recommendations with reasoning. Within three months, fourteen European nations had cited it when making their own purchasing decisions.

The report did not say "AI is great" or "AI is dangerous." It said: for this specific task, in this specific context, with these specific constraints, this tool is appropriate, and here is why. That precision is what made it useful. Vague enthusiasm helps no one. Structured reasoning moves real resources.

Why "It Depends" Is the Beginning, Not the End

If someone asks you which AI tool they should use, the honest first answer is always "it depends." But stopping there is not helpful. "It depends" is only the start of a thinking process, not the conclusion. The whole point of a Tool Recommendation Report is to do the work of figuring out what, exactly, it depends on, and then answering that.

Professional analysts, technology consultants, and policy writers all produce documents like this. The format varies. The underlying logic does not. You identify the task, the user, the constraints, and the options. You evaluate each option against those factors. You make a recommendation and explain it. That sequence is what turns an opinion into something actionable.

In this module, you are going to build that skill from scratch. By the end, you will be able to produce a structured AI tool recommendation that a real decision-maker could actually use.

The Four Parts of Every Good Recommendation

The Danish report, like almost every credible technology evaluation document, shares four structural elements. Learning to recognize them makes you a much sharper reader of any claim about AI tools.

1. Task Definition

What, precisely, needs to happen? Not "write stuff" but "draft a 200-word summary of a legal document for a non-expert reader, in under 30 seconds."

2. Constraint Inventory

What are the limits? Budget, data privacy rules, required accuracy level, who will use it, what happens if it's wrong.

3. Option Comparison

At least two real tools evaluated against the same criteria. No cherry-picking. Each tool gets a fair look at the same set of questions.

4. Justified Conclusion

A clear choice, with the reasoning visible. Not "Tool A is better" but "Tool A is better for this task because of X, even though Tool B wins on Y."

Notice what is not in that list: hype, brand loyalty, or what your friend uses. A recommendation is not a preference. It is a reasoned judgment about fit between a tool and a purpose.

The Constraint Most People Forget

At the end of March 2023, the Italian data protection authority, known as the Garante, temporarily banned ChatGPT for Italian users, citing concerns about how the tool handled personal data. The ban lasted about a month before OpenAI made changes, but in that window, Italian businesses that had built workflows around ChatGPT without accounting for local data law were suddenly stuck.

They had picked the right tool for the task. They had completely missed a constraint. That is a failure of the recommendation process, not of the tool itself.

The constraint inventory is the part most beginners skip. It feels like bureaucracy. It is actually the part that determines whether your recommendation survives contact with reality. Before you evaluate any tool, you have to know what the tool has to work within.

Ethical Question: No Clean Answer

When you write a recommendation that a company or government will follow, you share responsibility for what happens when they act on it. If the tool you recommended causes harm, even harm you didn't predict, do you bear any of that responsibility? At what point does a recommendation become a decision?

You Now See What Most People Miss

Most people evaluate AI tools by asking "Is this good?" You now know that question is almost meaningless. The real question is always: Is this the right fit for this specific task, within these specific constraints, for this specific user? Every headline that says "AI Tool X is better than AI Tool Y" is missing all of the context that would make that statement useful.

Lesson 1 Quiz

Five questions · Apply the concepts, don't just recall them
1. The Danish Agency for Digital Government's 2022 AI report was influential because it was vague enough for many countries to interpret it differently.
Correct. The report's value came from its specificity: named tools, named use cases, explicit reasoning. That precision is what made it usable by other governments.
Not quite. The report's influence came from being specific, not vague. Fourteen other nations cited it precisely because the reasoning was clear enough to apply to their own decisions.
2. A school wants to use an AI writing assistant. The assistant must not store student data on external servers. Which part of the recommendation framework does this belong to?
Correct. Data storage rules are a constraint: a limit the tool must operate within regardless of how good it is at the core task.
Data privacy rules are constraints: conditions the tool must satisfy before it even gets considered for the task. That belongs in the Constraint Inventory.
3. Italy's 2023 temporary ban on ChatGPT is best described as an example of what recommendation failure?
Exactly right. The tool may have been a good fit for the task. But no one had mapped the legal constraint landscape, so the recommendation couldn't survive real-world conditions.
The Italy example is specifically about a missing constraint: data protection law. Businesses had picked a good tool for the task but skipped the step of asking what legal rules it had to operate within.
4. Your friend says "ChatGPT is obviously the best AI tool." Based on Lesson 1, what is the most precise criticism of this statement?
Exactly. "Best" without specifying best-for-what, best-for-whom, and best-within-what-constraints is the kind of claim that sounds confident but carries no usable information.
The problem isn't the tool choice itself; it's the missing context. "Best" requires a task, a user, and constraints before it becomes a meaningful claim.
5. A recommendation's justified conclusion should include which of the following?
Right. A justified conclusion is not just a verdict; it shows its work. The reader should be able to see why Tool A was chosen even though Tool B had advantages in some areas.
A justified conclusion requires visible reasoning and acknowledged trade-offs. Popularity and feature counts alone don't justify a recommendation.

Lab 1: The Constraint Auditor

Your role: Constraint Investigator. Find the holes before the recommendation fails.

The Scenario

A hospital administrator has asked you to recommend an AI tool that can help nurses write shift-handover notes, the brief summaries nurses write when one shift ends and another begins. She says she wants something fast and easy to use. She hasn't mentioned any constraints.

Your job isn't to recommend a tool yet. Your job is to surface the constraints she's forgotten to mention, before a tool gets chosen and something goes wrong.

Start by telling the AI what constraints you think matter most in this scenario. Then push the conversation: ask what you might be missing, argue for a constraint you think is underrated, or challenge whether a constraint the AI names is actually real.
Constraint Advisor: VERA
Lab Partner
Hospital AI tools. High stakes. Let's be real with each other about what could go wrong here. What constraints do you think the administrator has overlooked? Don't just list them; tell me which one you think matters most and why.
Module 4 · Lesson 2

Structured Comparison: The Grid That Prevents Bias

When you don't have a comparison framework, you don't compare tools; you justify a preference you already had.
How do you evaluate two tools fairly when one of them is more famous?

In March 2023, the U.S. General Services Administration, the agency that handles procurement for the federal government, quietly began a process to evaluate AI writing and summarization tools for use across federal agencies. The process was documented and later partially released under the Freedom of Information Act. What made it unusual was a single design choice: every tool was evaluated against the same ten criteria, in the same order, by the same reviewers, without knowing the tool's brand name during the initial scoring.

This is called a blinded structured evaluation. The GSA used it because they had seen what happened without it: reviewers unconsciously scored recognizable brand names higher, even when the outputs were identical. By the time the brand names were revealed, scores were locked in. The resulting recommendation was credible precisely because the process had been designed to resist the evaluators' own biases.
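
One way to picture the blinding step is as a small script that hides brand names behind neutral labels before reviewers see any output. The sketch below is an illustration only, not the GSA's actual procedure: the function name blind_outputs, the "Option N" labels, and the sample tool names are all hypothetical.

```python
import random


def blind_outputs(outputs_by_tool, seed=None):
    """Return (blinded, key).

    blinded maps a neutral label to a tool's output; key maps the same
    label back to the real tool name and stays sealed until scoring ends.
    """
    rng = random.Random(seed)
    tools = list(outputs_by_tool)
    rng.shuffle(tools)  # label order must not reveal the original order
    key = {f"Option {i + 1}": tool for i, tool in enumerate(tools)}
    blinded = {label: outputs_by_tool[tool] for label, tool in key.items()}
    return blinded, key


# Hypothetical outputs from two tools for the same test prompt.
outputs = {
    "Famous Tool": "Summary text one.",
    "Unknown Tool": "Summary text two.",
}

blinded, key = blind_outputs(outputs)
for label, text in blinded.items():
    print(label, "->", text)  # reviewers see only this view
```

In practice the key would be held by someone who is not scoring, and opened only after every score is locked in.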

The Problem With "I Tried Both and Liked One Better"

Informal testing is not evaluation. When you try two AI tools back-to-back without a fixed set of criteria, several things happen that corrupt your judgment without you noticing.

First, you compare what you happen to test, not what the tools are actually for. If you test both tools on writing a poem and Tool A writes a better poem, that tells you almost nothing if both tools were being considered for customer service responses.

Second, the order matters. The tool you test second always seems more impressive or more disappointing relative to the first, because your brain is still calibrated to the previous one. This is called a contrast effect, and it is very hard to eliminate without structure.

Third, familiarity creates a halo. If you have heard more about Tool A, you will notice its good outputs more readily than its bad ones, and notice Tool B's bad outputs more readily than its good ones. This happens automatically, even when you are trying to be fair.

A comparison grid, sometimes called an evaluation matrix, solves all three problems. You fix the criteria before you start. You test the same inputs on both tools. You score each criterion independently. The conclusion emerges from the scores rather than from your general impression.

Building a Comparison Grid

A comparison grid has three parts: the criteria, the weights, and the scores. Here is how each one works.

Criteria are the specific dimensions you care about for the task at hand. They should come directly from the constraint inventory you built in Lesson 1. If one of your constraints was "must not produce hallucinated medical information," then accuracy-under-pressure is a criterion. If one was "must work for users with no technical training," then ease of use is a criterion. You do not use generic criteria; you use criteria derived from the actual task and constraints.

Weights reflect how much each criterion matters relative to the others. Not all criteria are equal. If the task involves patient safety, accuracy may be worth five times as much as interface design. The weights force you to commit to your priorities before you score anything, which prevents you from conveniently weighting things toward the tool you already prefer after seeing the results.

Scores are your judgment on each criterion for each tool, applied consistently. The same input, the same test, the same standard. You can use a simple 1–5 scale or a pass/fail on binary criteria. The key is that the standard doesn't change between tools.
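
To make the arithmetic concrete, here is a minimal sketch of how a grid's totals could be computed. The criteria, weights, and scores are invented for illustration, and weighted_total is a hypothetical helper rather than part of any standard tool.

```python
# Hypothetical weights, declared before any scoring happens.
# Accuracy is worth five times as much as interface design.
WEIGHTS = {"accuracy": 5, "ease_of_use": 2, "interface_design": 1}

# Hypothetical 1-5 scores, one per criterion per tool.
SCORES = {
    "Tool A": {"accuracy": 4, "ease_of_use": 3, "interface_design": 5},
    "Tool B": {"accuracy": 2, "ease_of_use": 5, "interface_design": 4},
}


def weighted_total(scores, weights):
    """Sum of score * weight over every declared criterion."""
    if set(scores) != set(weights):
        raise ValueError("each tool must be scored on exactly the declared criteria")
    return sum(scores[name] * weights[name] for name in weights)


totals = {tool: weighted_total(s, WEIGHTS) for tool, s in SCORES.items()}
for tool, total in totals.items():
    print(tool, total)
# Tool A: 4*5 + 3*2 + 5*1 = 31
# Tool B: 2*5 + 5*2 + 4*1 = 24
```

The check inside weighted_total is the code version of "the standard doesn't change between tools": a tool scored on a different set of criteria is rejected rather than silently compared.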

Real Example: 2023 UK National Health Service

In 2023, NHS England published guidance on evaluating AI diagnostic tools. Their framework required that any comparison include at least three datasets from different patient populations, that criteria weights be declared before scoring began, and that the evaluating team include someone with no prior exposure to either tool's marketing. The document explicitly warned against "experienced user bias," where the person doing the evaluation is better at using one tool because they've practiced with it longer.

When the Grid Gives You a Surprising Answer

In 2021, researchers at Stanford's Human-Centered AI Institute published a study examining how teams at 12 technology companies chose AI tools for internal use. In 70% of cases, the team had informally decided which tool they wanted before any evaluation started. The evaluation process, when it existed at all, was used to confirm the existing preference rather than to genuinely test it.

This phenomenon is called confirmation bias in evaluation, and it is devastatingly common. The fix is almost embarrassingly simple: commit to following the grid's result before you do the scoring. If the grid says Tool B wins, Tool B gets recommended, even if Tool A is the one you have been using for a year and Tool B feels unfamiliar. If you are not willing to change your recommendation based on the results, you do not actually have an evaluation process. You have a performance of one.

This is harder than it sounds. It requires a kind of intellectual honesty that most institutions, and most people, find genuinely uncomfortable. But knowing this about yourself is already an advantage. You can now catch yourself doing it.

Ethical Question: No Clean Answer

Suppose your structured evaluation grid concludes that Tool B is the better choice, but Tool B is made by a company with a controversial record on worker pay or data ethics. The grid measures task performance, not company values. Should non-performance factors be allowed to override a structured evaluation? If so, who decides which values count?

You Now See What Most People Miss

Most AI tool debates are actually debates between two different sets of criteria with different weights, not disagreements about which tool is objectively better. When someone argues passionately that Tool A beats Tool B, they are usually revealing which criteria they weight most heavily, not discovering an objective truth. Knowing this changes how you listen to every AI product comparison you will ever encounter.
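
A small worked example shows how this plays out. The sketch below uses made-up scores for two hypothetical tools and two hypothetical sets of weights. The scores never change; only the weights do, and the winner flips.

```python
# Hypothetical 1-5 scores that both reviewers agree on.
SCORES = {
    "Tool A": {"speed": 5, "accuracy": 3},
    "Tool B": {"speed": 2, "accuracy": 5},
}


def pick_winner(scores, weights):
    """Return (winner, totals) for one set of weights."""
    totals = {
        tool: sum(tool_scores[c] * w for c, w in weights.items())
        for tool, tool_scores in scores.items()
    }
    return max(totals, key=totals.get), totals


speed_first = {"speed": 3, "accuracy": 1}     # reviewer who cares most about speed
accuracy_first = {"speed": 1, "accuracy": 3}  # reviewer who cares most about accuracy

print(pick_winner(SCORES, speed_first))
# Tool A: 5*3 + 3*1 = 18, Tool B: 2*3 + 5*1 = 11, so Tool A wins
print(pick_winner(SCORES, accuracy_first))
# Tool A: 5*1 + 3*3 = 14, Tool B: 2*1 + 5*3 = 17, so Tool B wins
```

Neither reviewer is wrong about the tools. They disagree about the weights, and that is the conversation worth having.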

Lesson 2 Quiz

Five questions · Evaluation framework and bias
1. The U.S. GSA used a blinded structured evaluation in 2023 primarily to:
Correct. The blinding was specifically designed to counter the unconscious bias reviewers showed toward recognizable brand names, even when outputs were identical.
The blinding was about removing brand recognition as a variable: reviewers were scoring outputs without knowing which company made the tool, because brand familiarity was inflating scores.
2. You test AI Tool A first, then Tool B. Tool B seems less impressive. A classmate points out this might be a contrast effect. What does that mean?
Exactly. The contrast effect means your judgment of Tool B is distorted by the fact that you just experienced Tool A. Sequential testing without structure is unreliable for this reason.
The contrast effect is about brain calibration: after experiencing Tool A, your baseline shifts, which makes the next tool seem better or worse than it would if evaluated in isolation.
3. In a comparison grid, weights should be assigned:
Correct. Pre-scoring weights are essential. If you set weights after seeing results, you can unconsciously weight toward the tool you already prefer, which defeats the whole purpose.
Weights must be set before scoring. Assigning weights after you see the results allows you to make any tool "win" by boosting the criteria it happened to score well on.
4. The Stanford 2021 study found that in 70% of cases, teams had chosen their preferred tool before evaluation began. This is called:
Correct. Confirmation bias in evaluation means using the process to justify a preference already formed, rather than genuinely testing it. The evaluation looks rigorous but serves a predetermined conclusion.
This is confirmation bias in evaluation: conducting what looks like an evaluation but using it to confirm a preference that was already set. The grid is there, but it's theater.
5. Two people are arguing about which AI image tool is better. Person A emphasizes speed; Person B emphasizes accuracy. They keep talking past each other. The real source of their disagreement is most likely:
Exactly. Different weights produce different conclusions from the same data. The apparent disagreement about tools is actually a disagreement about priorities, which is a much more productive conversation once you see it clearly.
They probably have the same data but different weights. Person A is weighting speed highest; Person B is weighting accuracy. The same tool performance leads to different verdicts based on what you care about most.

Lab 2: The Grid Builder

Your role: Evaluation Designer. Build the comparison grid before the bias sets in.

The Scenario

A small nonprofit that helps refugees navigate immigration paperwork wants to add an AI tool to help case workers draft letters and summaries. They're considering two tools: a large general-purpose language model (like GPT-4) and a specialized legal document assistant built specifically for immigration cases. Your job is to design the comparison grid, not pick the winner yet.

Work with VERA to identify the right criteria and weights for this specific context. Push back if VERA suggests criteria that don't fit. Argue for criteria you think are underweighted.

Start by naming three criteria you think belong in this grid and explain your reasoning for at least one of them. Be specific about why it matters for this particular context, not AI tools in general.
Evaluation Advisor: VERA
Lab Partner
Immigration case work. Real stakes: a bad letter could hurt someone's case. What criteria do you think absolutely have to be in this grid? Give me three and justify at least one. I'll push back if I think you're missing something critical.
Module 4 · Lesson 3

Writing the Recommendation: Claim, Evidence, Caveat

The hardest part isn't picking the tool. It's writing a conclusion that earns trust.
What makes a written recommendation trustworthy, and what makes it just look trustworthy?

In August 2023, Singapore's Ministry of Digital Development and Innovation published a 47-page evaluation of AI tools for use in public services. What made the document extraordinary wasn't its length. It was a single section on page 12: "Limitations of This Evaluation." The section listed, in plain language, four specific things the evaluation could not determine, two conditions under which the recommendation might not hold, and one scenario where the second-place tool should be used instead of the first-place tool.

Technology journalists who covered the document noted that the section on limitations was the reason they trusted the rest of it. When a document tells you what it cannot tell you, it earns credibility for what it claims it can.

The Three-Part Structure: Claim, Evidence, Caveat

Every strong written recommendation has three components. They do not have to appear in separate labeled sections, but they all have to be present, or the recommendation fails to do its job.

The Claim is your actual recommendation, stated clearly and without hedging. "We recommend Tool A for this task." Not "Tool A might be worth considering." Not "both tools have their merits." A claim commits to something. If you are not willing to commit, you have not finished your thinking yet.

The Evidence is the data from your evaluation: the criteria scores, the specific test results, the constraint checks. Evidence is not "I felt like Tool A was smoother." Evidence is "Tool A scored 4/5 on accuracy under the nurse handover test prompt, compared to Tool B's 2/5, and Tool A passed the data residency check while Tool B did not." Specificity is what separates evidence from impression.

The Caveat is where you tell the reader what your recommendation does not cover: conditions under which it might not hold, things you could not test, factors that could change the answer. Caveats do not weaken a recommendation. As Singapore's report demonstrated, they make it more trustworthy, because they show the reader that you understand the limits of your own analysis.

The Difference Between Confident and Honest

In 2022, a consulting firm called Gartner, which advises corporations on technology decisions and is one of the most influential names in the industry, published an internal analysis of its own AI tool recommendations from the previous three years. The finding, reported by Bloomberg, was troubling: recommendations that included explicit uncertainty or caveats were rated lower by clients in initial satisfaction surveys, but those same recommendations were rated significantly higher in retrospective accuracy surveys conducted 18 months later.

In other words: clients initially preferred confident-sounding recommendations, even when those recommendations were less accurate. Honest caveats felt like weakness in the moment but proved to be indicators of quality over time.

This creates a genuine tension in the real world. If you write an honest recommendation with appropriate caveats, your audience might trust it less immediately, even though it deserves more trust. The temptation, for consultants, for analysts, and for you, is to sand away the uncertainty and write something that sounds more confident than the evidence supports. Most professional recommendation documents contain more certainty than the underlying analysis actually justifies, for exactly this reason.

Ethical Question: No Clean Answer

If you know that adding caveats to your recommendation will cause your audience to trust it less, even though the caveats make it more accurate, is it ethical to remove them? You're not lying. You're just presenting your findings in the way most likely to be acted on. Where is the line between strategic communication and misleading your audience?

What Institutional Recommendations Look Like at Scale

When governments or large organizations produce AI tool recommendations, the stakes are different from an individual's choice. A government recommendation can affect thousands of workers, millions of citizens, and billions of dollars in procurement over years. This changes what "good enough" means.

The European Union's AI Act, which passed in 2024, created specific requirements for how high-risk AI tools must be evaluated before deployment in areas like law enforcement, healthcare, and critical infrastructure. One requirement is mandatory documentation of evaluation methodology: you can't just say "we tested it," you have to show your criteria, your weights, your test conditions, and your limitations section. The EU essentially legislated the structure you are learning here because informal evaluation had produced too many expensive failures.

This is the institutional version of the same skill. The underlying logic is identical. The stakes are higher. The documentation requirements are stricter. But the thinking (task, constraints, comparison, justified conclusion, honest caveats) is the same process you are learning right now.

You Now See What Most People Miss

Most recommendation documents are written to sound persuasive, not to be honest. The professionals who write recommendations that hold up over time (the ones actually cited by governments, the ones used to make billion-dollar decisions) are the ones who included caveats, listed limitations, and refused to overclaim. Knowing how to read a limitations section is one of the most underrated analytical skills in existence. You now have it.

Lesson 3 Quiz

Five questions · Writing recommendations that earn trust
1. Singapore's 2023 Ministry report included a section on limitations of the evaluation. According to journalists who covered it, this section:
Correct. When a document tells you what it cannot tell you, it earns trust for what it claims it can. The limitations section was cited by journalists as the reason they trusted the rest of the document.
The limitations section had the opposite effect from what you might expect: it made the document more credible, not less. Honesty about what you don't know makes your claims about what you do know more trustworthy.
2. "Both tools have real strengths and it's worth considering your specific needs carefully." This sentence is an example of:
Exactly. This sentence has zero commitment. It's the written equivalent of "it depends": true but useless. A claim must commit to something. This doesn't.
This isn't a claim at all. It contains no recommendation and makes no commitment. A good claim is clear and specific: "We recommend Tool A for this reason." That's what this sentence avoids doing.
3. The Gartner/Bloomberg finding from 2022 showed that clients initially rated caveated recommendations lower. What did the 18-month follow-up surveys find?
Correct. The caveats that felt like weakness initially turned out to be indicators of quality. The recommendations that overclaimed sounded better in the moment but aged worse.
The retrospective finding was the opposite of the initial reaction. Caveated recommendations scored significantly higher for accuracy 18 months later; the honest uncertainty was a quality signal, not a weakness signal.
4. The EU AI Act (2024) required documentation of evaluation methodology for high-risk AI tools. This is best described as:
Exactly. The EU essentially legislated the structure you're learning here (criteria, weights, test conditions, limitations) because informal evaluation had produced too many expensive failures in high-stakes settings.
The EU AI Act wrote the structured evaluation approach into law because informal "we tested it" claims had produced too many failures. The methodology you're learning here has real legal standing in some of the world's most important AI deployments.
5. You are writing an AI tool recommendation. Your evaluation found Tool A is better for 80% of the use cases, but Tool B should be used when the data involves minors. The honest recommendation would:
Correct. The Singapore model. A clear primary recommendation with a specific, named condition under which a different choice applies. This is more useful and more honest than simplifying away the complexity.
The honest answer here looks like Singapore's report: a clear primary recommendation with an explicit caveat naming the condition where it doesn't apply. Hiding that exception to appear decisive would be a form of misleading the reader.

Lab 3: The Recommendation Drafter

Your role: Report Author. Write the claim, evidence, and caveat, then defend each one.

The Scenario

You have completed a structured evaluation for a city government deciding between two AI tools for answering citizen questions about local services (bus schedules, permit applications, park closures). Tool A is a large general-purpose chatbot. Tool B is a specialized civic information assistant trained specifically on city data. Based on your fictional evaluation: Tool A scores higher on natural conversation quality (4/5 vs 3/5) but Tool B scores higher on factual accuracy about local services (5/5 vs 2/5).

Write your recommendation to VERA. It must include a clear claim, at least one piece of evidence, and at least one honest caveat. VERA will challenge your claim and test whether your caveats are genuine or decorative.

Draft your recommendation now. One clear sentence of claim, then your evidence, then your caveat. VERA will question every part of it.
Evaluation Reviewer: VERA
Lab Partner
City government, real citizens, real consequences. Give me your recommendation: claim first, then evidence, then caveat. I'm going to push back hard on whichever part sounds weakest. Don't hedge your claim trying to protect yourself. Make the call.
Module 4 · Lesson 4

Putting It Together: Your Complete Report

A full recommendation that moves from task all the way to honest conclusion.
What does it actually look like when all four parts work together in a real document?

In September 2023, the New Zealand Ministry of Education published a 32-page document titled "AI Tools in Schools: An Evaluation Framework and Preliminary Recommendations." It had been written by a team of four: two curriculum specialists, a data privacy lawyer, and a classroom teacher who had been testing AI tools with her Year 9 students in Wellington since February of that year.

The document was unusual for what it included: a one-page executive summary written at a Year 8 reading level, a full constraint inventory that listed privacy law, equity of access, and teacher workload as top constraints, a comparison grid with weights declared upfront, tool-specific findings, and a final recommendations section that named one tool for general classroom use, a different tool for administrative tasks, and explicitly stated that three use cases should not use any AI tool until better options existed.

It became a model document. Several Australian states and the Scottish government requested permission to adapt it. The teacher from Wellington was asked to present her findings at a UNESCO conference in Paris the following spring. Not because the tools she evaluated were remarkable. Because the way she evaluated them was.

The Full Report: What Goes Where

A complete Tool Recommendation Report does not need to be long. The New Zealand document was 32 pages because the context was complex. Your report might be five. What matters is that every element is present and in a logical order. Here is the sequence that produces a document someone can actually use.

1. Executive Summary (1 paragraph). The whole thing condensed: what task, what constraint environment, which tool was recommended, and one sentence on the most important caveat. Someone should be able to read this and know your conclusion immediately. If they want the reasoning, they read the rest.

2. Task Definition (1–2 paragraphs). Exactly what the tool needs to do, for whom, how often, and what a successful output looks like. Be specific. "Summarize meeting notes" is a task. "Produce a 100-word summary of a 45-minute recorded meeting that a non-attendee could use to decide whether they need to watch the recording" is a task definition.

3. Constraint Inventory (bulleted list with notes). Every constraint you identified, with a note on why it matters. Data privacy, accuracy requirements, budget, user technical skill, regulatory environment, time to deployment. Nothing that's actually a constraint should be left out of this section, even if it doesn't change the final recommendation.

4. Comparison Grid (table). Criteria in rows, tools in columns, weights visible, scores filled in, totals calculated. One additional row: "passes all non-negotiable constraints." Any tool that fails a non-negotiable constraint is eliminated before weighted scoring begins.

5. Justified Conclusion (1–2 paragraphs). Claim, evidence, caveat, as learned in Lesson 3. Name the winning tool, cite the evidence, name the trade-offs you accepted, and state the conditions under which this recommendation should be revisited.
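
The logic of step 4 can be sketched as a two-stage check: eliminate anything that fails a non-negotiable constraint, then rank what is left by weighted total. This is a minimal illustration under stated assumptions. The function recommend, the constraint names, and every number below are hypothetical.

```python
def recommend(tools, weights):
    """Gate on non-negotiable constraints first, then rank survivors.

    Returns (winner, totals). winner is None when no tool passes the gate.
    """
    totals = {}
    for name, tool in tools.items():
        if not all(tool["non_negotiable"].values()):
            continue  # eliminated before any weighted scoring happens
        totals[name] = sum(tool["scores"][c] * w for c, w in weights.items())
    winner = max(totals, key=totals.get) if totals else None
    return winner, totals


# Hypothetical grid, invented for illustration.
WEIGHTS = {"accuracy": 5, "ease_of_use": 2}
TOOLS = {
    "Tool A": {
        "non_negotiable": {"data_residency": False, "within_budget": True},
        "scores": {"accuracy": 5, "ease_of_use": 5},
    },
    "Tool B": {
        "non_negotiable": {"data_residency": True, "within_budget": True},
        "scores": {"accuracy": 4, "ease_of_use": 3},
    },
}

winner, totals = recommend(TOOLS, WEIGHTS)
print(winner, totals)
# Tool A has the higher raw scores but fails data residency, so it never
# reaches scoring. Tool B: 4*5 + 3*2 = 26.
```

Putting the gate before the arithmetic matters: a high score cannot buy back a failed non-negotiable constraint. When every tool fails the gate, the function returns None, which is the code version of "no tool currently meets these requirements."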

The Hardest Part: Knowing When Not to Recommend

The Wellington teacher's report explicitly named three use cases where no current tool should be used. This is the hardest part of the whole process β€” and the part most professionals skip.

In April 2022, a widely cited report from McKinsey Global Institute estimated that AI tools were being deployed in workplace contexts where they were not yet adequate for the task in roughly 40% of cases studied. Not badly deployed β€” just deployed where the existing tools did not actually meet the task requirements when evaluated rigorously. The organizations doing it had not done the constraint inventory step. Or they had done it but not acted on what they found, because the business pressure to deploy something was stronger than the analytical conclusion to wait.

Saying "no tool currently meets these requirements" is a valid, complete recommendation. It is also the recommendation that requires the most confidence to deliver, because it disappoints the person who was hoping for a solution. But it is more useful than recommending a tool that will fail, and more honest than recommending one you know is inadequate while hoping no one notices.

The Wellington teacher was asked to present at UNESCO not in spite of the three negative recommendations. Because of them. Rigorous honesty is rare enough to be remarkable.

Ethical Question: No Clean Answer

Suppose you complete a rigorous evaluation and your honest conclusion is "no current tool is adequate for this task." Your client has already publicly committed to deploying an AI tool and has political pressure to show progress. Do you soften your conclusion to help them save face? Do you refuse to change a word? Is there a middle path, and if so, who decides where it is?

What You Can Do Now That Most People Can't

Most people who encounter AI tools (colleagues, journalists, politicians, parents, teachers) evaluate them informally, inconsistently, and with confirmation bias running unchecked. They read marketing materials. They ask their friends. They try a tool once and form a permanent impression. They make decisions worth thousands or millions of dollars based on vibes.

You now have a complete framework that real government agencies, international organizations, and policy bodies use when the stakes are high enough to demand rigor. You know how to define a task precisely. You know how to build a constraint inventory before bias sets in. You know how to design a comparison grid with pre-declared weights. You know how to write a recommendation with a real claim, real evidence, and honest caveats. You know that saying "no tool qualifies" is a legitimate and sometimes correct conclusion.

This is not a theoretical skill. Fourteen governments cited a document built on these same principles. The EU put this framework into law. A classroom teacher in Wellington used it to get invited to a UNESCO conference. The methodology is real, it is in use right now, and you understand it as well as most professionals who are paid to apply it.

You Now See What Most People Miss

Every AI product announcement, every "best AI tool" list, every corporate deployment decision you will encounter from this point forward can be read through this framework. You will immediately see what task definition is missing, which constraints weren't mentioned, what the comparison grid would need to look like, and whether the conclusion has been honestly justified. This is not a small thing to know. It is the difference between being a consumer of AI hype and being someone who can evaluate it.

Lesson 4 Quiz

Five questions · Assembling the complete report
1. The New Zealand Ministry of Education report was adapted by other governments primarily because:
Correct. The Wellington teacher's report was valued for its methodology, not its author's seniority or the tools it covered. The rigor, including the willingness to say "no tool qualifies" for some cases, was what made it worth adapting.
It was the methodology that made the document valuable: a complete, structured evaluation that included the uncommon step of recommending against any tool for three specific use cases. Other governments wanted the framework, not just the conclusions.
2. In a comparison grid, a tool that fails a non-negotiable constraint should be:
Correct. Non-negotiable constraints are pass/fail gates, not scoring criteria. If a tool fails one, it doesn't participate in the weighted comparison, no matter how well it performs on everything else.
Non-negotiable constraints are absolute gates. A tool that fails one is out; you don't score it into the grid and then note the problem. The constraint check happens before scoring begins.
3. The McKinsey 2022 finding (that AI tools were deployed where they were not adequate for the task in roughly 40% of studied cases) was most directly caused by:
Correct. The lesson connects the McKinsey finding explicitly to the constraint inventory step: either it wasn't done, or its conclusions were overridden by business pressure to deploy something quickly.
The McKinsey finding pointed to organizations that hadn't rigorously evaluated whether a tool met their requirements, specifically by skipping or overriding the constraint inventory because of pressure to show deployment progress.
4. Which element of the full report structure is specifically designed to help a busy decision-maker who doesn't have time to read the whole document?
Correct. The Executive Summary is explicitly described as the condensed version (task, constraint environment, recommendation, key caveat) that gives someone the essential information in one paragraph.
The Executive Summary is the one-paragraph condensed version designed for exactly this purpose: someone should be able to read it and know your conclusion without reading further. The other sections contain the reasoning for those who need it.
5. A colleague completes an evaluation and concludes "no current tool meets the requirements." Their manager says this isn't a real recommendation. Based on Lesson 4, how would you respond?
Correct. "No tool qualifies" is explicitly validated in Lesson 4: the Wellington teacher used it for three use cases and it was cited as one of the strongest parts of her report. Saying no is harder than saying yes, but it's often the most honest and most useful answer.
Lesson 4 explicitly states that "no tool currently meets these requirements" is a valid, complete recommendation. It requires the most confidence to deliver but is more useful than recommending an inadequate tool. The Wellington teacher's credibility came partly from being willing to say it.

Lab 4: The Full Report Critic

Your role: Peer Reviewer. Read a draft recommendation and find its weaknesses.

The Scenario

A classmate has submitted the following draft AI tool recommendation for a school district wanting to use AI for grading short-answer quiz questions. Here is the full draft:

"We recommend Tool A for AI-assisted quiz grading. We tested it on ten questions and it got most of them right. It's easy to use and the interface is clean. Tool B was also tested but Tool A seemed better overall. Tool A is widely used in schools and has good reviews online. We are confident this will save teachers time and improve grading consistency."

Your job: tell VERA which of the four report elements (task definition, constraint inventory, comparison grid, justified conclusion) this draft is missing or doing badly, and explain what would need to be added. Be specific. VERA will ask you to defend your critique and may point out things you missed.

Report Reviewer: VERA
Lab Partner
I've read the draft. Before I say anything, I want to know what you think. What are the two biggest weaknesses in this recommendation? Don't just say "it's vague"; point to a specific sentence and explain why it fails as a recommendation element.

Module 4: Module Test

15 questions · Pass at 80% or higher · Covers all four lessons
1. The Danish Agency for Digital Government's 2022 AI report influenced fourteen other nations because it was:
Correct. Specificity and explicit reasoning made the document usable across contexts. Vague documents don't travel.
The document's influence came from its precision: specific tools, specific use cases, specific reasoning. That structure is what other governments could actually apply to their own decisions.
2. "It depends" is described as the beginning, not the end of a recommendation because:
Correct. "It depends" is accurate but incomplete. The recommendation is the work of figuring out what it depends on and then answering it.
"It depends" is honest but incomplete. The value of a recommendation comes from completing the sentence β€” identifying the factors and then making a judgment about them.
3. Italy's 2023 Garante ban on ChatGPT best illustrates which failure in the tool recommendation process?
Correct. The tool may have been right for the task. The constraint, Italian data law, was simply never mapped before deployment.
The Italy case is specifically about a missed legal constraint. Businesses had picked a reasonable tool but hadn't asked what legal environment it had to operate within.
4. A comparison grid's weights must be assigned before scoring because:
Correct. Assigning weights after seeing scores allows confirmation bias to operate invisibly: you boost the criteria your preferred tool happened to excel at.
Pre-scoring weights lock in your priorities before you know results. Without this, you can unconsciously weight whichever criteria favor the tool you already prefer.
5. The U.S. GSA's 2023 blinded evaluation found that without blinding, reviewers scored recognizable brand names:
Correct. Even trained reviewers gave higher scores to familiar brand names when outputs were identical. The halo effect from brand recognition is powerful and hard to eliminate without structural blinding.
Brand familiarity inflated scores, even when outputs were identical to those of unknown brands. This is why blinding is a structural solution, not just a best practice.
6. The Stanford 2021 study finding (that 70% of teams had chosen their preferred tool before evaluation) describes which bias?
Correct. Confirmation bias in evaluation means the process looks like an evaluation but functions as justification for a preference already formed.
This is confirmation bias: using an evaluation process to confirm what you already believe, rather than to genuinely discover which tool is better for the task.
7. NHS England's 2023 guidance required evaluators to include someone with no prior exposure to either tool's marketing. This addresses which specific bias?
Correct. The NHS document called out "experienced user bias": someone who has practiced longer with one tool is better at using it, which gives that tool an unfair advantage in evaluation.
This specifically addressed experienced user bias: the advantage that comes from simply having practiced more with one tool. Requiring a naive evaluator controls for this.
8. Singapore's 2023 Ministry report listed four things the evaluation could not determine. This made the document:
Correct. When a document tells you what it cannot tell you, the reader trusts it more for what it claims it can tell you. Honest limitations are credibility investments.
Journalists cited the limitations section as the reason they trusted the rest of the document. Acknowledging uncertainty in one area signals rigor in others.
9. The Gartner/Bloomberg 2022 finding is most important for which practical reason?
Correct. The practical lesson is that the incentive to overclaim is real and structural: clients reward confident-sounding documents even when those documents are less accurate. This is pressure you will encounter.
The finding's importance is that it names a real structural pressure: honest documents with caveats face immediate trust penalties even when they're more accurate over time. Knowing this helps you resist the pressure to sand away uncertainty.
10. The EU AI Act's 2024 requirement for mandatory evaluation methodology documentation was introduced because:
Correct. The EU legislated the structured evaluation framework because informal "we tested it" claims were not sufficient to prevent failures in healthcare, law enforcement, and critical infrastructure.
The EU AI Act was a response to documented failures that came from informal evaluation processes. The structured framework you've been learning was codified into law because its absence had demonstrable costs.
11. A task definition says "use AI to help with writing." According to Lesson 4's standard, this is:
Correct. A real task definition specifies exactly what needs to happen, for whom, how often, and what success looks like. "Help with writing" tells you none of those things.
Lesson 4 distinguishes a vague description from a real task definition. A proper task definition names the output type, length, audience, time constraint, and what "done well" means, not just the general activity.
12. A nonprofit recommends Tool A for refugee case work. Their evaluation did not include a lawyer on the evaluation team. Which recommendation framework element is most likely to be weak?
Correct. Without legal expertise on the team, the constraint inventory almost certainly misses data protection, confidentiality, and immigration law constraints, which are exactly what a lawyer would surface.
Legal constraints (data privacy, confidentiality rules, immigration-specific regulations) are exactly what a non-legal team is likely to miss. The constraint inventory is the weak point here.
13. You run a comparison grid. Tool B wins clearly on all weighted criteria. But your team has been using Tool A for two years. The right action is:
Correct. Lesson 2 is explicit: if you are not willing to change your recommendation based on results, you have a performance of evaluation, not an actual one. Tool B gets recommended.
Lesson 2 draws the line clearly here: committing to following the grid's result is what separates a real evaluation from confirmation bias theater. The two-year familiarity with Tool A is exactly the kind of bias the grid was designed to overcome.
14. The New Zealand teacher from Wellington was invited to present at a UNESCO conference specifically because:
Correct. The three cases where she recommended against any current tool were cited as the strongest part of her credibility. The methodology, not the conclusions, was what earned her the invitation.
It was the quality of the methodology, and specifically the willingness to give three negative recommendations, that distinguished her report. Rigorous honesty at that level is rare enough to be remarkable.
15. A school district asks you for an AI grading tool recommendation. After thorough evaluation, your honest conclusion is that no current tool is adequate for their use case. The best course of action is:
Correct. Lesson 4 is unambiguous: "no tool currently meets these requirements" is a valid, complete, and sometimes correct recommendation. Recommending an inadequate tool to satisfy the expectation of an answer would be the real failure here.
Lesson 4 explicitly validates this conclusion. Delivering a "no tool qualifies" finding takes more confidence than naming a winner, but it is more useful and more honest than recommending a tool you know will fail.