Lesson 1 · Module 2

The Anatomy of a Well-Delegated Task

Why Klarna fired 700 customer-service agents — and what the brief actually said.

What separates a task Claude can own end-to-end from one that will silently fail?

In February 2024, Swedish fintech Klarna announced that its Claude-powered customer-service agent had handled 2.3 million conversations in its first month — work that previously required 700 full-time employees. CEO Sebastian Siemiatkowski told investors the agent resolved issues in under two minutes, versus eleven minutes for human agents, with the same customer-satisfaction score.

What the press release glossed over was the six months of internal delegation design that preceded the launch. Klarna's team spent weeks specifying exactly which tasks the agent could own — refund eligibility under $250, order-status lookups, payment-plan adjustments — and which required human handoff. The efficiency gains came not from Claude's raw capability but from the precision of the task boundaries Klarna drew around it.

What Makes a Task Delegable

A task is safely delegable to Claude when it has four properties: a clear success criterion (you can judge the output without ambiguity), bounded context (all necessary information fits inside one conversation or a structured tool call), a recoverable failure mode (a wrong answer can be corrected before it causes harm), and no irreversible side effects unless you have explicitly reviewed the action plan.

Klarna's refund task had all four. The policy rules were explicit. The data came from a live API. A wrong refund could be reversed within 24 hours. And the financial ceiling of $250 kept each mistake small. Strip away any one of those properties and the same task becomes a liability rather than a time-saver.

The Delegation Quadrant

Plot tasks on two axes: Context Completeness (do you have everything Claude needs?) vs. Reversibility (can a mistake be undone?). High on both axes = delegate freely. Low reversibility regardless of context = always require human sign-off. Low context completeness = fix the brief before delegating, not after.

The Task Brief: Five Required Fields

A useful mental model is to treat every Claude delegation as a mini-project brief. Five fields must be present before you hand anything over.

ObjectiveOne sentence stating the desired end-state in measurable terms. "Draft a 400-word executive summary of Q2 results highlighting EBITDA and churn" — not "summarize the report."

InputsEvery file, data point, or context document Claude needs. If you find yourself saying "Claude should know this," it probably doesn't — include it explicitly.

ConstraintsTone, length, format, off-limits topics, regulatory guardrails. Constraints are not optional polish — they are the fence that keeps Claude in the right field.

Output FormatJSON, markdown table, numbered list, prose paragraph. Specify this even when it feels obvious; Claude will default to flowing prose unless told otherwise.

Success CriteriaHow you will judge the output. A checklist you can run against the result before accepting it. If you can't write this, the task is not yet ready to delegate.

Where Delegation Breaks Down

The most common failure is what researchers at Stanford's Human-AI Interaction group call specification debt — the user leaves implicit knowledge out of the brief and then blames Claude when the output reflects that gap. In a 2023 study of 1,400 GPT-4 and Claude task transcripts, 61% of outputs rated "poor" by users contained no information in the original prompt that was absent from the final output; the model produced exactly what was specified, just not what was wanted.

The second failure mode is scope creep: delegating a task with fuzzy edges and getting a result that is technically correct but strategically wrong. When HubSpot began using AI agents to draft blog posts in 2023, early versions produced SEO-optimized text that contradicted the company's brand voice. The brief had specified keyword density but omitted a style guide. The lesson: every omission in a brief is an invitation for Claude to fill the gap with its defaults, not yours.

Rule of Thumb

Before sending any non-trivial task to Claude, ask: "If a new freelancer received only this prompt and nothing else, would they produce exactly what I need?" If the answer is no, the brief needs more work — not the model.

Lesson 1 Quiz

Five questions · select the best answer

1. Klarna's February 2024 AI customer-service deployment achieved its efficiency gains primarily because:

Correct. The efficiency came from precise task-boundary design over six months before launch, not from Claude's raw capability alone.

Not quite. The key insight is that Klarna's team spent six months designing precise task boundaries — specifying what the agent could own and what required a human.

2. According to the Delegation Quadrant framework, which task profile is safest to delegate freely to Claude?

Correct. When all needed context is present and mistakes can be undone, the risk of delegation is minimized on both dimensions.

Not quite. The safest delegation profile is high on both axes: complete context and reversible outcomes.

3. "Specification debt" as identified in Stanford's 2023 transcript study refers to:

Correct. Specification debt is the accumulated cost of things the user assumed Claude would know but never actually stated in the brief.

Not quite. Specification debt means the user left something out of the prompt; 61% of "poor" outputs matched exactly what was specified — the spec was just incomplete.

4. Which of the five task-brief fields is most commonly omitted by new Claude users according to the lesson?

Correct. Success criteria are the hardest to write and therefore most often skipped — yet they are the only way to objectively evaluate Claude's output.

Not quite. Success criteria are the field most commonly omitted. If you can't write a checklist to evaluate the result, the task isn't ready to delegate.

5. HubSpot's early AI blog-post experiment produced SEO-optimized content that contradicted brand voice. The root cause was:

Correct. Every omission in a brief is an invitation for Claude to fill the gap with its defaults — in this case, generic SEO writing conventions over HubSpot's specific voice.

Not quite. The issue was a missing constraints field (the style guide). Claude filled the gap with its own defaults, not HubSpot's brand preferences.

Lab 1 · Building the Perfect Task Brief

Practice constructing delegation-ready briefs with Claude as your reviewer.

Your Mission

You are a product manager at a mid-size SaaS company. You need to delegate a task to a Claude agent: producing a one-page competitive analysis comparing your product to two rivals. Your job is to write a complete five-field task brief (Objective, Inputs, Constraints, Output Format, Success Criteria) and have Claude evaluate it. Then refine based on feedback.

Start by drafting your brief for the competitive analysis task, then ask Claude to identify any missing or weak fields. Aim for at least three exchanges to earn lab completion.

Task Brief Coach

Lab 1

Welcome to Lab 1. I'm your Task Brief Coach. Your challenge: write a complete five-field brief (Objective, Inputs, Constraints, Output Format, Success Criteria) to delegate a competitive analysis task to a Claude agent. Share your draft brief and I'll give you detailed, field-by-field feedback. Let's build something airtight.

Lesson 2 · Module 2

Context Windows and Memory: What Claude Actually Remembers

How Notion's engineering team learned that Claude forgets everything between API calls — and what they built to fix it.

If Claude has no persistent memory, how do real teams build agents that feel continuous?

When Notion shipped its AI writing assistant in 2023, the product team quickly discovered a gap between user expectations and technical reality. Users assumed the AI "knew" their previous pages, their writing history, their preferences. In practice, each Claude API call was stateless — a blank slate with no awareness of what came before.

Notion's engineers, led by Akshay Kothari and the AI team, built what they called a "context injection" layer: a retrieval system that fetched the three most relevant prior documents and prepended them to every prompt. Within two months of deploying this architecture, user satisfaction scores for AI responses rose 34%. Claude hadn't changed; the context it received had.

The Context Window: Limits and Leverage

Claude's context window (as of Claude 3 and Claude 3.5 models) spans up to 200,000 tokens — roughly 150,000 words, or about two full novels. This sounds enormous, but real delegation pipelines fill it fast. A single large PDF, a conversation history, a system prompt, and a tool-response can together exceed 50,000 tokens before the user's first message.

The practical implication: context is a resource, not a given. Teams that treat it as unlimited fill prompts with everything they have and then wonder why Claude's responses become generic or contradictory near the end of a long task. Teams that treat context as scarce budget it deliberately — putting the most critical information closest to the instruction, and summarizing or removing material that does not directly inform the current step.

Recency BiasClaude, like most transformer-based models, weighs content near the end of the context window more heavily than content near the beginning. Critical instructions placed at the top of a long system prompt may have less influence than the same instructions repeated near the final user message.

Context InjectionThe practice of dynamically retrieving and prepending relevant documents, summaries, or prior conversation turns to each new API call — creating an illusion of persistent memory.

Summarization CompressionPeriodically asking Claude to summarize prior conversation history into a compact paragraph, then replacing the full history with that summary to free up token budget for new context.

Three Memory Patterns in Production

Pattern 1 — Full Context Window (Stateless): Every call is independent. Works for isolated tasks like "summarize this document" or "translate this paragraph." Fails for anything requiring cross-turn coherence.

Pattern 2 — Windowed History: The system keeps the last N turns (typically 10–20) in the context. Simple to implement. Good enough for chatbots and short workflows. Breaks down on long research tasks where early decisions affect later steps.

Pattern 3 — RAG + Compression (Notion's approach): Retrieval-Augmented Generation fetches semantically relevant material from an external store; a compression step keeps the injected context short. Best for knowledge-heavy agents like document assistants, legal review tools, and customer-support systems with large policy databases.

What This Means for Delegation

When you delegate a multi-step task to Claude — say, researching a market and then drafting a strategy memo — you must either (a) keep the entire research output within a single conversation so Claude can reference it, (b) copy-paste the research into the memo prompt explicitly, or (c) use a tool-calling agent that can query stored results. There is no magic "Claude remembers" option in production environments.

Practical Context Budgeting

The engineers at Anthropic have published guidance suggesting that for tasks requiring sustained attention, the most important instructions should appear both at the start of the system prompt and be briefly restated just before the user turn. This "bookending" technique compensates for the recency bias that causes early instructions to lose influence in long conversations.

For document-heavy tasks, a practical rule of thumb used by teams at Sourcegraph (which powers its Cody coding assistant on Claude) is to never put more than 40% of the context budget on background material — leaving the majority for the current task, tool outputs, and Claude's working space. When background material exceeds that threshold, they summarize it first.

The Delegation Implication

Before delegating any multi-step task, sketch a rough "context budget": how many tokens will the system prompt, input data, and conversation history consume? If the sum approaches 60% of the context window before the task even begins, the task design needs restructuring — either break it into smaller sequential calls or compress the inputs.

Lesson 2 Quiz

Five questions · select the best answer

1. Notion's "context injection" layer improved AI satisfaction scores by 34% because it:

Correct. Claude didn't change — the context it received did. Retrieval of the three most relevant prior documents per call was the key architectural decision.

Not quite. Notion's solution was context injection: dynamically fetching and prepending relevant prior documents to each prompt, not model fine-tuning.

2. Claude's recency bias means that:

Correct. Transformer architecture gives more weight to recent tokens. Critical instructions placed only at the top of a long system prompt lose influence over time.

Not quite. Recency bias means later content in the context window typically has more influence — which is why bookending critical instructions is recommended.

3. Summarization compression is used in long Claude agent workflows primarily to:

Correct. Summarization compression preserves the semantic content of prior turns while dramatically reducing the token count, freeing budget for new context.

Not quite. The primary purpose is token budget management — replacing a long conversation history with a short summary to make room for new inputs.

4. Sourcegraph's Cody engineering team recommends keeping background material to no more than what percentage of the context budget?

Correct. 40% for background material, leaving the majority for the current task, tool outputs, and Claude's working space. Exceeding this threshold triggers summarization.

Not quite. Sourcegraph's rule of thumb is 40% — if background material would exceed that share, summarize it first before including it in the call.

5. Which memory pattern is LEAST suitable for a long research-and-strategy task requiring that early decisions influence later drafting steps?

Correct. Windowed history truncates older turns and therefore loses early research decisions that need to inform later drafting — precisely the failure mode described in the lesson.

Not quite. A fixed window history (last N turns only) drops older turns, losing early decisions that affect later steps. That's the weakest pattern for this use case.

Lab 2 · Context Budget Planning

Design a context architecture for a real multi-step agent task.

Your Mission

You need to build an agent that: (1) reads a 30-page market research report, (2) identifies the top five competitive threats, and (3) drafts three strategic responses. Describe your context architecture plan — what goes in the system prompt, how you'll handle the report, and how you'll manage history across steps. Claude will pressure-test your plan.

Describe your context budget plan. Address: how you'll handle the 30-page report, what memory pattern you'll use, and how you'll ensure step 3's drafts reflect step 1's findings. Aim for at least three back-and-forth exchanges.

Context Architecture Advisor

Lab 2

Welcome to Lab 2. I'm your Context Architecture Advisor. Your task: design the memory and context architecture for a three-step research-to-strategy agent. Tell me how you'd handle the 30-page report, which memory pattern you'd use, and how you'd preserve early findings for later drafting. I'll push back on weak points and help you build something production-ready.

Lesson 3 · Module 2

Tool Use and Agentic Loops: Giving Claude Real-World Reach

How Salesforce's Agentforce gave Claude the ability to update CRM records — and the oversight layer that kept it from going rogue.

When Claude can take actions in the world — not just generate text — what new delegation risks emerge, and how do production teams manage them?

At Dreamforce 2024, Marc Benioff unveiled Agentforce — Salesforce's platform for deploying Claude-based agents that could autonomously update CRM records, send follow-up emails, schedule meetings, and escalate deals. Within 60 days of launch, Salesforce reported over 1,000 enterprise customers in pilot, with one early customer, Wiley (the publisher), claiming a 40% reduction in case resolution time.

What distinguished Agentforce architecturally was its action guardrail system: every tool the agent could call had a configurable confidence threshold. Below the threshold, the agent drafted the action and sent it to a human queue. Above the threshold, it executed. Benioff called this "supervised autonomy" — Claude held real-world reach, but the reach was graduated by trust level, not granted all at once.

What Tool Use Actually Means

In Claude's tool-use (function-calling) framework, the model does not directly execute code or API calls. Instead, it outputs a structured JSON object describing the action it wants to take — the tool name and its parameters. An external orchestrator receives that JSON, performs the actual call, and returns the result to Claude's context. Claude never "does" anything directly; it proposes, and your code decides whether to act.

This architecture is important for delegation design because it means you control the execution gate. You can intercept every tool call before it runs, log it, validate it against business rules, or require human approval. The agent's autonomy is precisely as wide as the execution gate you build around it — not wider.

Agentic LoopA pattern where Claude is called repeatedly in a cycle: observe → plan → tool call → observe result → plan next step. Each iteration updates the context with new information, allowing the agent to pursue multi-step goals without human intervention between steps.

Tool SchemaA JSON specification (name, description, parameters) provided to Claude describing the tools available to it. The quality of the description — especially the description field — heavily influences when and how Claude chooses to use each tool.

Confidence ThresholdA policy-level setting (as in Salesforce Agentforce) that routes low-confidence tool calls to a human queue rather than executing them automatically. Typically calibrated per tool type based on reversibility and cost of error.

Designing Safe Agentic Loops

Anthropic's own guidance on agentic systems (published in the Claude documentation, 2024) identifies three principles for safe loops. First, minimal footprint: request only the permissions needed for the current task, avoid storing sensitive data beyond immediate use. Second, prefer reversible actions: when Claude must choose between an action that can be undone and one that cannot, it should default to the reversible option. Third, pause and verify at key junctures: especially when the task requires taking actions outside the originally anticipated scope.

In practice, the teams at Linear (project management software) built their AI triage agent on this principle by defining two tiers of actions: read-only (search, retrieve, summarize — fully autonomous) and write (create issue, assign, close — requires a thumbs-up from the requesting engineer). The read/write split solved 90% of their oversight concerns at minimal friction cost.

The Tool Description Trap

Claude's decision to call a tool depends heavily on how you describe it. A tool described as "searches the database" will be used conservatively; the same tool described as "finds the best matching customer record for any query" will be invoked far more aggressively. Vague descriptions produce over-use; over-restrictive descriptions produce under-use. Precision in the description field is as important as precision in the task brief.

The "Last Mile" Problem

Research by MIT CSAIL published in 2024 analyzing 500 production AI agent deployments found that 73% of agent failures occurred not during reasoning steps but at the last mile — the moment a tool call was about to execute. The most common failure types: calling a tool with a correct intent but incorrect parameter format (38%), calling a redundant tool when prior results already contained the answer (27%), and calling a destructive tool when a read-only alternative would have sufficed (8%).

The implication for delegation design: your tool schemas need examples, not just descriptions. Claude dramatically over-performs on parameter formatting when each tool schema includes one or two concrete usage examples in the description field. This is the single highest-leverage improvement available for agentic reliability that requires no model changes.

Delegation Rule for Agentic Tasks

Before deploying an agentic loop, map every tool to a tier: (1) read-only/fully reversible → autonomous, (2) write/reversible within 24h → confirm once per session, (3) write/irreversible → always require explicit human approval. Never grant tier-3 autonomy to a tool just because the agent is usually right. "Usually" is not "always," and irreversible mistakes are irreversible.

Lesson 3 Quiz

Five questions · select the best answer

1. In Claude's tool-use framework, the model "takes action" by:

Correct. Claude never executes actions directly. It outputs a structured proposal; your orchestration layer decides whether to act on it. You control the execution gate.

Not quite. Claude only proposes actions via structured JSON. An external orchestrator decides whether to execute. This architecture is what gives you control over the execution gate.

2. Salesforce's Agentforce used confidence thresholds primarily to:

Correct. Below the threshold, the agent drafts the action and queues it for human review. Above the threshold, it executes. This is the "supervised autonomy" pattern Benioff described.

Not quite. Confidence thresholds were used to route tool calls: low confidence → human queue, high confidence → auto-execute. A graduated-trust architecture, not a quality filter.

3. According to MIT CSAIL's 2024 analysis of 500 agent deployments, the most common failure type at the "last mile" was:

Correct. 38% of last-mile failures were correct-intent/wrong-format errors — which is why including concrete usage examples in tool schemas dramatically improves reliability.

Not quite. The most common failure (38%) was correct intent but incorrect parameter formatting. Tool schema examples — not just descriptions — are the fix for this class of error.

4. Anthropic's "minimal footprint" principle for agentic systems means:

Correct. Minimal footprint is about permission scope and data retention — take only what you need for this task, don't accumulate access or data speculatively.

Not quite. Minimal footprint refers to permission scope and data handling: request only what the current task requires, and don't retain sensitive data beyond immediate need.

5. Linear's read/write tier split for their AI triage agent addressed oversight concerns because:

Correct. A simple read/write tier split — autonomous reads, human-approved writes — resolved 90% of oversight concerns without adding significant friction to the workflow.

Not quite. Linear kept read-only actions fully autonomous and required a human thumbs-up for write actions. This simple two-tier split solved 90% of their oversight concerns at minimal friction.

Lab 3 · Designing a Tool Schema with Guardrails

Build a safe tool architecture for a CRM-updating agent.

Your Mission

Your company wants to deploy a Claude agent that can search customer CRM records, update deal stages, and send follow-up emails. Design the tool schema and guardrail tier system: which tools are read-only/autonomous, which need confirmation, and which need hard human approval. Also write one tool description with a concrete usage example for the "update deal stage" tool.

Describe your three-tier tool architecture and paste your "update_deal_stage" tool description including a usage example. Claude will evaluate your guardrails for safety gaps and suggest improvements. Aim for at least three exchanges.

Agentic Safety Reviewer

Lab 3

Welcome to Lab 3. I'm your Agentic Safety Reviewer. Your task: design a three-tier tool architecture (autonomous / confirm-once / always-human) for a CRM agent, and write a schema description for the "update_deal_stage" tool that includes a concrete usage example. Share your design and I'll identify safety gaps, over-permissioning, and description quality issues.

Lesson 4 · Module 2

Evaluating and Iterating on Agent Outputs

How Stripe's document-intelligence team built an evals pipeline — and why it changed how they thought about delegation permanently.

Once Claude produces an output, how do you know whether to trust it, iterate on it, or throw it out and redesign the brief?

In 2023, Stripe's developer documentation team began using Claude to draft API reference updates. Early results were mixed: some outputs were publication-ready; others contained subtle technical inaccuracies that only expert reviewers caught. The problem was that the review process was entirely ad hoc — individual engineers used personal judgment, producing inconsistent accept/reject decisions on nearly identical outputs.

Jeff Weinstein, Stripe's product lead for AI tooling, commissioned an evals framework: a structured rubric that scored each output on five dimensions (technical accuracy, completeness, tone, code example correctness, adherence to Stripe's style guide) using a combination of automated checks and spot human review. Within three months, the acceptance rate for first drafts rose from 41% to 74% — not because Claude had improved, but because the briefs had been iteratively refined based on the patterns the evals surfaced.

The Evaluation Stack

Evaluating Claude outputs is not a single act — it is a stack of four layers, each catching a different class of error at a different cost. Teams that skip to human review for every output waste time on errors that automated checks could catch instantly. Teams that trust automated checks alone miss the subtle judgment-dependent errors that only humans can identify.

Layer 1 — Format ValidationAutomated checks that the output matches the required structure: correct JSON schema, right number of sections, word count within bounds, required headers present. Catches ~30% of output problems in under a millisecond.

Layer 2 — Factual Spot-CheckAutomated or LLM-assisted verification of specific factual claims against ground-truth sources. Used for tasks with verifiable outputs: prices, dates, statistics, code that compiles and runs.

Layer 3 — Rubric ScoringA structured evaluation (often another Claude call) that scores the output against the success criteria defined in the original brief. Requires that the brief's success criteria are specific enough to score against — circular dependency with brief quality.

Layer 4 — Human Spot ReviewTargeted human review of outputs that fall below the rubric threshold, outputs in high-stakes categories, and a random 5–10% sample for ongoing calibration. Not every output — that defeats the purpose of delegation.

The Iteration Flywheel

Stripe's most important insight was that evals data should feed back into briefs, not just into model selection. When rubric scores revealed that Claude consistently missed Stripe's tone guidelines, the fix was not to switch models — it was to add three annotated examples of correct tone to the brief. When code example accuracy dropped on edge-case API parameters, the fix was to add a reference document covering those parameters to the Inputs field.

This creates what teams at Anthropic call the iteration flywheel: deploy → eval → identify weak field in brief → strengthen brief → redeploy → eval again. The loop typically converges to stable high performance within three to five iterations for well-defined tasks. Tasks that never stabilize usually have a fundamentally ambiguous success criterion — the signal that the task design needs a deeper rethink.

The "Eval Before You Scale" Rule

Before routing more than 100 tasks per day through any Claude agent pipeline, run at least 50 outputs through a structured rubric evaluation manually. The patterns you find in those 50 will save you from deploying a systematically flawed brief to thousands of cases. Scaling a broken brief is far more costly than the week spent building proper evals first.

Failure Taxonomy: What to Do When Outputs Fail

Type A — Format Failure: Output has wrong structure. Fix: add explicit output format instructions with an example. Do not change the substance of the brief.

Type B — Context Gap: Output is plausible but wrong because Claude lacked information. Fix: identify the missing data, add it to the Inputs field. Often reveals that the original brief's Inputs section was specifying categories of data rather than the actual data itself.

Type C — Constraint Drift: Output is high-quality but violates a constraint (wrong tone, wrong length, off-brand). Fix: move the violated constraint from a generic list to a named, prioritized instruction. "Be concise" fails; "Maximum 150 words — cut ruthlessly" succeeds.

Type D — Objective Ambiguity: Multiple plausible outputs would each satisfy the brief. Fix: this is a brief design problem, not a model problem. Rewrite the Objective with a specific, measurable end-state. If you cannot, the task is not yet ready for delegation.

The Meta-Skill of Delegation

The teams that get the most from Claude agents are not the ones with the best prompting intuition. They are the ones with the most disciplined iteration process. Evals without iteration loops are expensive autopsies. Iteration without evals is guesswork. The combination — deploy, measure, diagnose, refine — is how Stripe took a 41% first-draft acceptance rate to 74% without touching the model at all.

Lesson 4 Quiz

Five questions · select the best answer

1. Stripe's documentation team improved first-draft acceptance from 41% to 74% primarily by:

Correct. The improvement came entirely from brief refinement informed by structured rubric evaluations — Claude didn't change; the instructions it received did.

Not quite. Stripe improved results by using evals to diagnose weak fields in their briefs and iteratively strengthening those fields. The model was the same throughout.

2. Layer 1 (Format Validation) in the evaluation stack is best suited for catching:

Correct. Format validation is automated, fast, and catches structural compliance errors — approximately 30% of all output problems at near-zero cost.

Not quite. Layer 1 catches structural/format errors (wrong schema, wrong length, missing headers). Factual and tonal issues require layers 2–4.

3. A Type C failure (Constraint Drift) is best fixed by:

Correct. Generic constraints ("be concise") fail; specific ones ("Maximum 150 words — cut ruthlessly") succeed. Constraint drift means the rule was too vague to enforce.

Not quite. Type C failures occur when constraints are too vague. The fix is precision: move from "be concise" to "Maximum 150 words." Don't change the Objective or the Inputs.

4. The "Eval Before You Scale" rule recommends running structured evaluations on at least how many outputs before routing 100+ tasks per day through an agent?

Correct. 50 manually evaluated outputs is enough to surface systematic patterns in a brief's weak fields before scaling a flawed pipeline to thousands of cases.

Not quite. The recommendation is 50 outputs — enough to identify systematic patterns without the full cost of scaled deployment on a flawed brief.

5. A task that "never stabilizes" across multiple iteration cycles of the flywheel most likely indicates:

Correct. When iteration doesn't converge, the root cause is almost always an unmeasurable success criterion — you can't optimize toward a target you can't define.

Not quite. Unstable task performance across multiple iterations signals a fundamentally ambiguous objective or unmeasurable success criterion — a brief design problem, not a model or schema problem.

Lab 4 · Building an Evals Rubric

Design a structured evaluation framework for a real agent output.

Your Mission

Your Claude agent has been producing weekly market summary reports for your investment team. First-draft acceptance is only 38%. You need to build a four-layer evaluation rubric to diagnose exactly what's failing. Define your Layer 1 format checks, Layer 2 factual spot-checks, Layer 3 rubric dimensions with scoring criteria, and Layer 4 human review triggers. Then ask Claude to stress-test your rubric against a sample failure scenario.

Draft your four-layer evaluation rubric for the market summary report agent. Be specific: for Layer 3, list at least three named dimensions with measurable scoring criteria. Claude will identify gaps and test your rubric against a described failure case. Aim for at least three exchanges.

Evals Framework Designer

Lab 4

Welcome to Lab 4. I'm your Evals Framework Designer. Your task: build a four-layer evaluation rubric for a market summary report agent with a 38% first-draft acceptance rate. Share your rubric — Layer 1 format checks, Layer 2 factual spot-checks, Layer 3 rubric dimensions with scoring criteria, and Layer 4 human review triggers. I'll stress-test it against failure scenarios and identify structural gaps.

Module 2 Test

15 questions · 80% required to pass · covers all four lessons

1. Which of the four properties must ALL be present for a task to be safely delegable to Claude?

Correct. All four properties must be present: clear success criterion, bounded context, recoverable failure, and no unreviewed irreversible actions.

Not quite. The four delegation properties from Lesson 1 are: clear success criterion, bounded context, recoverable failure mode, and no unreviewed irreversible side effects.

2. Klarna's customer-service agent handled 2.3 million conversations in its first month. The six-month prerequisite to that launch was:

Correct. The efficiency gains came from precision of task boundaries, not model capability. Six months of delegation design preceded the launch.

Not quite. Klarna's team spent six months defining exact task-ownership boundaries — what the agent could resolve autonomously versus what required a human agent.

3. In the five-field task brief framework, "Constraints" refers to:

Correct. Constraints are the fence that keeps Claude in the right field — tone, length, format, and off-limits areas are all constraint-layer content.

Not quite. Constraints in the five-field brief are operational rules for the output: tone, length, format, what to avoid. Not API limits or computing budgets.

4. According to the Stanford 2023 transcript study, what percentage of outputs rated "poor" by users contained information that was absent from the original prompt?

Correct. 61% of poor-rated outputs contained no information absent from the original prompt — Claude produced exactly what was specified; the spec itself was the problem.

Not quite. 61% of poor outputs reflected exactly what was specified — the prompt was just incomplete. Only 39% involved Claude diverging from the specified content.

5. Notion's 34% improvement in AI response satisfaction came from:

Correct. Context injection — fetching and prepending relevant prior documents — was the architectural change. Claude's model didn't change; its context did.

Not quite. Notion built a context injection layer: fetching and prepending the three most relevant prior documents per call. The model was unchanged.

6. The "bookending" technique recommended by Anthropic for important instructions involves:

Correct. Bookending compensates for recency bias by restating key instructions near the end of the context — close to where Claude actually generates its response.

Not quite. Bookending means stating key instructions at the system prompt start AND briefly restating them just before the user turn, compensating for recency bias in long contexts.

7. Which memory pattern is best for a long research task where early findings must influence a final deliverable written many steps later?

Correct. RAG with summarization compression retrieves semantically relevant early findings at each step while keeping the context budget manageable — the right pattern for multi-step research workflows.

Not quite. RAG with summarization compression is the right pattern: it retrieves relevant early-step findings at each later step without blowing the context budget on full history.

8. In Claude's tool-use framework, the critical implication for safe delegation is:

Correct. Claude outputs structured JSON proposals; your code intercepts and decides whether to act. The agent's autonomy is precisely as wide as the execution gate you build.

Not quite. The key insight is that you control the execution gate. Claude never directly executes — it proposes in JSON format, and your orchestration layer decides whether to proceed.

9. Salesforce Agentforce's "supervised autonomy" model graduated agent trust by:

Correct. Confidence thresholds per tool created a graduated system: below threshold → human queue; above threshold → auto-execute. Trust was calibrated, not binary.

Not quite. Agentforce used per-tool confidence thresholds: low-confidence actions went to a human queue; high-confidence actions executed automatically. Graduated trust, not all-or-nothing.

10. The most common type of "last mile" agent failure in MIT CSAIL's 2024 analysis (38% of failures) was:

Correct. 38% of last-mile failures involved right intent, wrong format — which is why concrete usage examples in tool descriptions are the highest-leverage reliability fix.

Not quite. The most common failure (38%) was correct intent, wrong parameter format. Adding usage examples to tool descriptions is the targeted fix for this class of error.

11. Anthropic's "minimal footprint" principle for agentic systems specifically addresses:

Correct. Minimal footprint is about permission scope and data handling: request only what you need now, don't retain sensitive data beyond immediate use, don't acquire access speculatively.

Not quite. Minimal footprint is about permission scope and data retention: take only the access needed for this task; don't store sensitive data beyond immediate use.

12. Stripe's evals framework improved first-draft acceptance rates primarily by feeding rubric data into:

Correct. Stripe's flywheel: evals identified weak brief fields, briefs were refined, outputs improved — all without touching the model.

Not quite. Stripe used rubric data to diagnose which brief field was responsible for each failure, then strengthened that field. Model unchanged; brief improved.

13. A Type B (Context Gap) failure is best resolved by:

Correct. Type B failures mean Claude was plausible but wrong because it lacked specific data. The fix is adding that data to the Inputs field — not changing the objective or constraints.

Not quite. Type B (Context Gap) means Claude had the right intent but lacked a specific piece of data. Fix: identify and add that data to the Inputs field of the brief.

14. The iteration flywheel (deploy → eval → diagnose → refine → redeploy) typically converges to stable performance for well-defined tasks within:

Correct. Three to five cycles is typical for well-defined tasks. Tasks that never stabilize after this many iterations usually have a fundamentally ambiguous success criterion.

Not quite. Three to five iterations is the typical convergence range for well-defined tasks. If it doesn't stabilize by then, the success criterion likely needs a fundamental redesign.

15. A task that produces multiple plausible outputs each satisfying the brief equally well signals which failure type?

Correct. Type D: when multiple valid outputs all satisfy the brief, the Objective lacks a specific, measurable end-state. The task is not yet ready to delegate until that is fixed.

Not quite. This is Type D — Objective Ambiguity. When any of several conflicting outputs could satisfy the brief, the Objective must be rewritten with a single measurable end-state.