In the autumn of 2022, the Danish government's Agency for Digital Government published a report evaluating which AI tools public sector offices should adopt. The document, translated and studied across Europe, was careful, specific, and methodical. It named tools, described use cases, assessed risks, and gave clear recommendations with reasoning. Within three months, fourteen European nations had cited it when making their own purchasing decisions.
The report did not say "AI is great" or "AI is dangerous." It said: for this specific task, in this specific context, with these specific constraints, this tool is appropriate, and here is why. That precision is what made it useful. Vague enthusiasm helps no one. Structured reasoning moves real resources.
If someone asks you which AI tool they should use, the honest first answer is always "it depends." But stopping there is not helpful. "It depends" is only the start of a thinking process, not the conclusion. The whole point of a Tool Recommendation Report is to do the work of figuring out what, exactly, it depends on, and then answering that.
Professional analysts, technology consultants, and policy writers all produce documents like this. The format varies. The underlying logic does not. You identify the task, the user, the constraints, and the options. You evaluate each option against those factors. You make a recommendation and explain it. That sequence is what turns an opinion into something actionable.
In this module, you are going to build that skill from scratch. By the end, you will be able to produce a structured AI tool recommendation that a real decision-maker could actually use.
The Danish report, like almost every credible technology evaluation document, shares four structural elements. Learning to recognize them makes you a much sharper reader of any claim about AI tools.
Task definition. What, precisely, needs to happen? Not "write stuff" but "draft a 200-word summary of a legal document for a non-expert reader, in under 30 seconds."
Constraint inventory. What are the limits? Budget, data privacy rules, required accuracy level, who will use it, what happens if it's wrong.
Comparison grid. At least two real tools evaluated against the same criteria. No cherry-picking. Each tool gets a fair look at the same set of questions.
Justified conclusion. A clear choice, with the reasoning visible. Not "Tool A is better" but "Tool A is better for this task because of X, even though Tool B wins on Y."
Notice what is not in that list: hype, brand loyalty, or what your friend uses. A recommendation is not a preference. It is a reasoned judgment about fit between a tool and a purpose.
At the end of March 2023, the Italian data protection authority, known as the Garante, temporarily banned ChatGPT for Italian users, citing concerns about how the tool handled personal data. The ban lasted about a month before OpenAI made changes, but in that window, Italian businesses that had built workflows around ChatGPT without accounting for local data law were suddenly stuck.
They had picked the right tool for the task. They had completely missed a constraint. That is a failure of the recommendation process, not of the tool itself.
The constraint inventory is the part most beginners skip. It feels like bureaucracy. It is actually the part that determines whether your recommendation survives contact with reality. Before you evaluate any tool, you have to know what the tool has to work within.
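One way to keep the constraint inventory from being skipped is to write it down as data before any tool is tested. The sketch below is a minimal illustration, not a standard format: the `Constraint` fields, the `unanswered` helper, and the sample entries are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    name: str             # short label, e.g. "data residency"
    why_it_matters: str   # one-line note that can go straight into the report
    non_negotiable: bool  # True means a tool that fails this is eliminated


def unanswered(inventory, answers):
    """Return the names of constraints that have not been checked yet for a tool."""
    return [c.name for c in inventory if c.name not in answers]


# Hypothetical inventory, for illustration only; not taken from any real report.
inventory = [
    Constraint("data residency", "personal data must stay in an approved region", True),
    Constraint("budget", "licence cost has to fit the annual allocation", False),
    Constraint("user skill", "staff have no technical training", False),
]

# What has been verified so far for one candidate tool (True means it passes).
answers = {"budget": True}

print(unanswered(inventory, answers))  # ['data residency', 'user skill']
```

Anything still listed by `unanswered` is a constraint the recommendation has not yet accounted for, which is exactly the gap the Italian businesses fell into.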
When you write a recommendation that a company or government will follow, you share responsibility for what happens when they act on it. If the tool you recommended causes harm, even harm you didn't predict, do you bear any of that responsibility? At what point does a recommendation become a decision?
Most people evaluate AI tools by asking "Is this good?" You now know that question is almost meaningless. The real question is always: Is this the right fit for this specific task, within these specific constraints, for this specific user? Every headline that says "AI Tool X is better than AI Tool Y" is missing all of the context that would make that statement useful.
A hospital administrator has asked you to recommend an AI tool that can help nurses write shift-handover notes: the brief summaries nurses write when one shift ends and another begins. She says she wants something fast and easy to use. She hasn't mentioned any constraints.
Your job isn't to recommend a tool yet. Your job is to surface the constraints she's forgotten to mention, before a tool gets chosen and something goes wrong.
In March 2023, the U.S. General Services Administration (the agency that handles procurement for the federal government) quietly began a process to evaluate AI writing and summarization tools for use across federal agencies. The process was documented and later partially released under the Freedom of Information Act. What made it unusual was a single design choice: every tool was evaluated against the same ten criteria, in the same order, by the same reviewers, who did not know the tool's brand name during the initial scoring.
This is called a blinded structured evaluation. The GSA used it because they had seen what happened without it: reviewers unconsciously scored recognizable brand names higher, even when the outputs were identical. By the time the brand names were revealed, scores were locked in. The resulting recommendation was credible precisely because the process had been designed to resist the evaluators' own biases.
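The blinding idea can be sketched in a few lines. This is an illustration of the general technique only, not a description of how the GSA implemented it; the function names, the tool names, and the sample outputs are hypothetical.

```python
import random


def blind_labels(tool_names, seed=None):
    """Map each real tool name to an anonymous label such as 'Tool 1'."""
    rng = random.Random(seed)
    shuffled = list(tool_names)
    rng.shuffle(shuffled)  # labels should not follow the order the tools were listed in
    return {name: f"Tool {i}" for i, name in enumerate(shuffled, start=1)}


def blind_outputs(outputs_by_tool, key):
    """Re-key outputs by anonymous label so reviewers never see brand names."""
    return {key[name]: text for name, text in outputs_by_tool.items()}


# Hypothetical tool names and outputs, for illustration only.
key = blind_labels(["AlphaWriter", "BetaSummarize"], seed=7)
outputs = {"AlphaWriter": "Summary text A", "BetaSummarize": "Summary text B"}
blinded = blind_outputs(outputs, key)

print(sorted(blinded))  # ['Tool 1', 'Tool 2']
```

The `key` mapping stays with someone who is not scoring. It is consulted only after the scores are locked in.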
Informal testing is not evaluation. When you try two AI tools back-to-back without a fixed set of criteria, several things happen that corrupt your judgment without you noticing.
First, you compare what you happen to test, not what the tools are actually for. If you test both tools on writing a poem and Tool A writes a better poem, that tells you almost nothing if both tools were being considered for customer service responses.
Second, the order matters. The tool you test second always seems more impressive or more disappointing relative to the first, because your brain is still calibrated to the previous one. This is called a contrast effect, and it is very hard to eliminate without structure.
Third, familiarity creates a halo. If you have heard more about Tool A, you will notice its good outputs more readily than its bad ones, and notice Tool B's bad outputs more readily than its good ones. This happens automatically, even when you are trying to be fair.
A comparison grid, sometimes called an evaluation matrix, solves all three problems. You fix the criteria before you start. You test the same inputs on both tools. You score each criterion independently. The conclusion emerges from the scores rather than from your general impression.
A comparison grid has three parts: the criteria, the weights, and the scores. Here is how each one works.
Criteria are the specific dimensions you care about for the task at hand. They should come directly from the constraint inventory you built in Lesson 1. If one of your constraints was "must not produce hallucinated medical information," then accuracy-under-pressure is a criterion. If one was "must work for users with no technical training," then ease of use is a criterion. You do not use generic criteria; you use criteria derived from the actual task and constraints.
Weights reflect how much each criterion matters relative to the others. Not all criteria are equal. If the task involves patient safety, accuracy may be worth five times as much as interface design. The weights force you to commit to your priorities before you score anything, which prevents you from conveniently weighting things toward the tool you already prefer after seeing the results.
Scores are your judgment on each criterion for each tool, applied consistently. The same input, the same test, the same standard. You can use a simple 1-to-5 scale or a pass/fail on binary criteria. The key is that the standard doesn't change between tools.
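The three parts combine into a single weighted total per tool. The sketch below shows one common way to do that arithmetic: multiply each score by its weight and add the results. The criteria names, weights, and scores are invented for illustration, with accuracy weighted five times as heavily as interface design.

```python
def weighted_total(weights, scores):
    """Sum of weight * score over every criterion; both dicts must share the same keys."""
    if set(weights) != set(scores):
        raise ValueError("every criterion needs both a weight and a score")
    return sum(weights[c] * scores[c] for c in weights)


# Weights are declared first, before any tool is scored (illustrative values).
weights = {"accuracy": 5, "ease_of_use": 2, "interface_design": 1}

# Scores on a 1-to-5 scale, from the same test inputs for both tools (illustrative values).
tool_a = {"accuracy": 4, "ease_of_use": 3, "interface_design": 5}
tool_b = {"accuracy": 2, "ease_of_use": 5, "interface_design": 5}

total_a = weighted_total(weights, tool_a)  # 5*4 + 2*3 + 1*5 = 31
total_b = weighted_total(weights, tool_b)  # 5*2 + 2*5 + 1*5 = 25
print(total_a, total_b)  # 31 25
```

Without weights, both tools would tie at 12 points. The pre-declared weights are what separate them, which is why they have to be fixed before anyone sees the scores.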
In 2023, NHS England published guidance on evaluating AI diagnostic tools. Their framework required that any comparison include at least three datasets from different patient populations, that criteria weights be declared before scoring began, and that the evaluating team include someone with no prior exposure to either tool's marketing. The document explicitly warned against "experienced user bias," where the person doing the evaluation is better at using one tool because they've practiced with it longer.
In 2021, researchers at Stanford's Human-Centered AI Institute published a study examining how teams at 12 technology companies chose AI tools for internal use. In 70% of cases, the team had informally decided which tool they wanted before any evaluation started. The evaluation process, when it existed at all, was used to confirm the existing preference rather than to genuinely test it.
This phenomenon is called confirmation bias in evaluation, and it is devastatingly common. The fix is almost embarrassingly simple: commit to following the grid's result before you do the scoring. If the grid says Tool B wins, Tool B gets recommended, even if Tool A is the one you have been using for a year and Tool B feels unfamiliar. If you are not willing to change your recommendation based on the results, you do not actually have an evaluation process. You have a performance of one.
This is harder than it sounds. It requires a kind of intellectual honesty that most institutions, and most people, find genuinely uncomfortable. But knowing this about yourself is already an advantage. You can now catch yourself doing it.
Suppose your structured evaluation grid concludes that Tool B is the better choice, but Tool B is made by a company with a controversial record on worker pay or data ethics. The grid measures task performance, not company values. Should non-performance factors be allowed to override a structured evaluation? If so, who decides which values count?
Most AI tool debates are actually debates between two different sets of criteria with different weights, not disagreements about which tool is objectively better. When someone argues passionately that Tool A beats Tool B, they are usually revealing which criteria they weight most heavily, not discovering an objective truth. Knowing this changes how you listen to every AI product comparison you will ever encounter.
A small nonprofit that helps refugees navigate immigration paperwork wants to add an AI tool to help case workers draft letters and summaries. They're considering two tools: a large general-purpose language model (like GPT-4) and a specialized legal document assistant built specifically for immigration cases. Your job is to design the comparison grid, not to pick the winner yet.
Work with VERA to identify the right criteria and weights for this specific context. Push back if VERA suggests criteria that don't fit. Argue for criteria you think are underweighted.
In August 2023, Singapore's Ministry of Digital Development and Innovation published a 47-page evaluation of AI tools for use in public services. What made the document extraordinary wasn't its length. It was a single section on page 12: "Limitations of This Evaluation." The section listed, in plain language, four specific things the evaluation could not determine, two conditions under which the recommendation might not hold, and one scenario where the second-place tool should be used instead of the first-place tool.
Technology journalists who covered the document noted that the section on limitations was the reason they trusted the rest of it. When a document tells you what it cannot tell you, it earns credibility for what it claims it can.
Every strong written recommendation has three components. They do not have to appear in separate labeled sections, but they have to all be present, or the recommendation fails to do its job.
The Claim is your actual recommendation, stated clearly and without hedging. "We recommend Tool A for this task." Not "Tool A might be worth considering." Not "both tools have their merits." A claim commits to something. If you are not willing to commit, you have not finished your thinking yet.
The Evidence is the data from your evaluation: the criteria scores, the specific test results, the constraint checks. Evidence is not "I felt like Tool A was smoother." Evidence is "Tool A scored 4/5 on accuracy under the nurse handover test prompt, compared to Tool B's 2/5, and Tool A passed the data residency check while Tool B did not." Specificity is what separates evidence from impression.
The Caveat is where you tell the reader what your recommendation does not cover: conditions under which it might not hold, things you could not test, factors that could change the answer. Caveats do not weaken a recommendation. As Singapore's report demonstrated, they make it more trustworthy, because they show the reader that you understand the limits of your own analysis.
In 2022, Gartner, a research and advisory firm that counsels corporations on technology decisions and is one of the most influential names in the industry, published an internal analysis of its own AI tool recommendations from the previous three years. The finding, reported by Bloomberg, was troubling: recommendations that included explicit uncertainty or caveats were rated lower by clients in initial satisfaction surveys, but those same recommendations were rated significantly higher in retrospective accuracy surveys conducted 18 months later.
In other words: clients initially preferred confident-sounding recommendations, even when those recommendations were less accurate. Honest caveats felt like weakness in the moment but proved to be indicators of quality over time.
This creates a genuine tension in the real world. If you write an honest recommendation with appropriate caveats, your audience might trust it less immediately, even though it deserves more trust. The temptation (for consultants, for analysts, for you) is to sand away the uncertainty and write something that sounds more confident than the evidence supports. Most professional recommendation documents contain more certainty than the underlying analysis actually justifies, for exactly this reason.
If you know that adding caveats to your recommendation will cause your audience to trust it less, even though the caveats make it more accurate, is it ethical to remove them? You're not lying. You're just presenting your findings in the way most likely to be acted on. Where is the line between strategic communication and misleading your audience?
When governments or large organizations produce AI tool recommendations, the stakes are different from an individual's choice. A government recommendation can affect thousands of workers, millions of citizens, and billions of dollars in procurement over years. This changes what "good enough" means.
The European Union's AI Act, which passed in 2024, created specific requirements for how high-risk AI tools must be evaluated before deployment in areas like law enforcement, healthcare, and critical infrastructure. One requirement is mandatory documentation of evaluation methodology: you can't just say "we tested it"; you have to show your criteria, your weights, your test conditions, and your limitations section. The EU essentially legislated the structure you are learning here because informal evaluation had produced too many expensive failures.
This is the institutional version of the same skill. The underlying logic is identical. The stakes are higher. The documentation requirements are stricter. But the thinking (task, constraints, comparison, justified conclusion, honest caveats) is the same process you are learning right now.
Most recommendation documents are written to sound persuasive, not to be honest. The professionals who write recommendations that hold up over time (the ones actually cited by governments, the ones used to make billion-dollar decisions) are the ones who included caveats, listed limitations, and refused to overclaim. Knowing how to read a limitations section is one of the most underrated analytical skills in existence. You now have it.
You have completed a structured evaluation for a city government deciding between two AI tools for answering citizen questions about local services (bus schedules, permit applications, park closures). Tool A is a large general-purpose chatbot. Tool B is a specialized civic information assistant trained specifically on city data. Based on your fictional evaluation: Tool A scores higher on natural conversation quality (4/5 vs 3/5) but Tool B scores higher on factual accuracy about local services (5/5 vs 2/5).
Write your recommendation to VERA. It must include a clear claim, at least one piece of evidence, and at least one honest caveat. VERA will challenge your claim and test whether your caveats are genuine or decorative.
In September 2023, the New Zealand Ministry of Education published a 32-page document titled "AI Tools in Schools: An Evaluation Framework and Preliminary Recommendations." It had been written by a team of four: two curriculum specialists, a data privacy lawyer, and a classroom teacher who had been testing AI tools with her Year 9 students in Wellington since February of that year.
The document was unusual for what it included: a one-page executive summary written at a Year 8 reading level; a full constraint inventory that listed privacy law, equity of access, and teacher workload as top constraints; a comparison grid with weights declared upfront; tool-specific findings; and a final recommendations section that named one tool for general classroom use, named a different tool for administrative tasks, and explicitly stated that three use cases should not use any AI tool until better options existed.
It became a model document. Several Australian states and the Scottish government requested permission to adapt it. The teacher from Wellington was asked to present her findings at a UNESCO conference in Paris the following spring. Not because the tools she evaluated were remarkable. Because the way she evaluated them was.
A complete Tool Recommendation Report does not need to be long. The New Zealand document was 32 pages because the context was complex. Your report might be five. What matters is that every element is present and in a logical order. Here is the sequence that produces a document someone can actually use.
1. Executive Summary (1 paragraph). The whole thing condensed: what task, what constraint environment, which tool was recommended, and one sentence on the most important caveat. Someone should be able to read this and know your conclusion immediately. If they want the reasoning, they read the rest.
2. Task Definition (1-2 paragraphs). Exactly what the tool needs to do, for whom, how often, and what a successful output looks like. Be specific. "Summarize meeting notes" is a task. "Produce a 100-word summary of a 45-minute recorded meeting that a non-attendee could use to decide whether they need to watch the recording" is a task definition.
3. Constraint Inventory (bulleted list with notes). Every constraint you identified, with a note on why it matters. Data privacy, accuracy requirements, budget, user technical skill, regulatory environment, time to deployment. Nothing that's actually a constraint should be left out of this section, even if it doesn't change the final recommendation.
4. Comparison Grid (table). Criteria in rows, tools in columns, weights visible, scores filled in, totals calculated. One additional row: "passes all non-negotiable constraints." Any tool that fails a non-negotiable constraint is eliminated before weighted scoring begins.
5. Justified Conclusion (1-2 paragraphs). Claim, evidence, caveat, as learned in Lesson 3. Name the winning tool, cite the evidence, name the trade-offs you accepted, and state the conditions under which this recommendation should be revisited.
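The elimination rule in step 4 (non-negotiable constraints first, weighted scoring second) can be sketched as a two-stage function. This is a minimal illustration under assumed names and made-up numbers: `evaluate`, the dictionary layout, and the two sample tools are hypothetical, not a prescribed format.

```python
def evaluate(tools, weights):
    """Drop tools that fail any non-negotiable constraint, then rank the rest by weighted total."""
    eligible = {
        name: t for name, t in tools.items()
        if all(t["non_negotiable"].values())
    }
    totals = {
        name: sum(weights[c] * t["scores"][c] for c in weights)
        for name, t in eligible.items()
    }
    # An empty list means "no tool currently meets these requirements".
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative weights and scores; Tool B scores higher but fails a non-negotiable check.
weights = {"accuracy": 5, "ease_of_use": 2}
tools = {
    "Tool A": {"non_negotiable": {"data_residency": True},
               "scores": {"accuracy": 4, "ease_of_use": 3}},
    "Tool B": {"non_negotiable": {"data_residency": False},
               "scores": {"accuracy": 5, "ease_of_use": 5}},
}

print(evaluate(tools, weights))  # [('Tool A', 26)]
```

Tool B would have totalled 35, but it never reaches the scoring stage. Running the gate first keeps a high score from quietly outweighing a failed constraint.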
The Wellington teacher's report explicitly named three use cases where no current tool should be used. This is the hardest part of the whole process, and the part most professionals skip.
In April 2022, a widely cited report from McKinsey Global Institute estimated that AI tools were being deployed in workplace contexts where they were not yet adequate for the task in roughly 40% of cases studied. Not badly deployed, just deployed where the existing tools did not actually meet the task requirements when evaluated rigorously. The organizations doing it had not done the constraint inventory step. Or they had done it but not acted on what they found, because the business pressure to deploy something was stronger than the analytical conclusion to wait.
Saying "no tool currently meets these requirements" is a valid, complete recommendation. It is also the recommendation that requires the most confidence to deliver, because it disappoints the person who was hoping for a solution. But it is more useful than recommending a tool that will fail, and more honest than recommending one you know is inadequate while hoping no one notices.
The Wellington teacher was asked to present at UNESCO not in spite of the three negative recommendations. Because of them. Rigorous honesty is rare enough to be remarkable.
Suppose you complete a rigorous evaluation and your honest conclusion is "no current tool is adequate for this task." Your client has already publicly committed to deploying an AI tool and has political pressure to show progress. Do you soften your conclusion to help them save face? Do you refuse to change a word? Is there a middle path, and if so, who decides where it is?
Most people who encounter AI tools (colleagues, journalists, politicians, parents, teachers) evaluate them informally, inconsistently, and with confirmation bias running unchecked. They read marketing materials. They ask their friends. They try a tool once and form a permanent impression. They make decisions worth thousands or millions of dollars based on vibes.
You now have a complete framework that real government agencies, international organizations, and policy bodies use when the stakes are high enough to demand rigor. You know how to define a task precisely. You know how to build a constraint inventory before bias sets in. You know how to design a comparison grid with pre-declared weights. You know how to write a recommendation with a real claim, real evidence, and honest caveats. You know that saying "no tool qualifies" is a legitimate and sometimes correct conclusion.
This is not a theoretical skill. Fourteen governments cited a document built on these same principles. The EU put this framework into law. A classroom teacher in Wellington used it to get invited to a UNESCO conference. The methodology is real, it is in use right now, and you understand it as well as most professionals who are paid to apply it.
Every AI product announcement, every "best AI tool" list, every corporate deployment decision you will encounter from this point forward can be read through this framework. You will immediately see what task definition is missing, which constraints weren't mentioned, what the comparison grid would need to look like, and whether the conclusion has been honestly justified. This is not a small thing to know. It is the difference between being a consumer of AI hype and being someone who can evaluate it.
A classmate has submitted the following draft AI tool recommendation for a school district wanting to use AI for grading short-answer quiz questions. Here is the full draft:
Your job: tell VERA which of the four report elements (task definition, constraint inventory, comparison grid, justified conclusion) this draft is missing or doing badly, and explain what would need to be added. Be specific. VERA will ask you to defend your critique and may point out things you missed.