In 2023, Salesforce released its Einstein Copilot architecture with a published library of over 40 discrete "actions" — each a self-contained skill covering tasks like summarizing a CRM record, drafting a follow-up email, or querying sales pipeline data. Rather than training a monolithic model for each use case, Salesforce engineers built these skills as modular units with defined inputs, outputs, and permission scopes. An agent orchestrating a sales workflow could invoke SummarizeAccount, then chain to DraftOutreachEmail, then call ScheduleMeeting — combining three library skills in sequence without any skill knowing about the others. This architecture let teams ship new capabilities in days instead of months, because adding a skill meant writing one new module, not retraining a system.
The lesson from Salesforce's published engineering blog: the leverage in agent design comes not from clever prompting but from how you package and expose capabilities as composable units.
A skill is a discrete, reusable capability that an agent can invoke. Unlike a monolithic agent that embeds all logic in a single prompt, a skill library separates concerns: each skill has a name, a description the agent uses to decide when to call it, a typed input schema, a typed output schema, and the execution logic itself.
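The five parts listed above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not a reference implementation: the `Skill` class, the `_check` helper, and the `extract_dates` skill are all hypothetical names, and the schemas are simplified to plain field-to-type mappings rather than full JSON Schema.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable


def _check(value: dict[str, Any], schema: dict[str, type], label: str) -> None:
    """Verify that every field in the schema is present with the expected type."""
    for key, expected in schema.items():
        if key not in value:
            raise ValueError(f"missing {label} field: {key}")
        if not isinstance(value[key], expected):
            raise TypeError(f"{label} field {key!r} must be {expected.__name__}")


@dataclass(frozen=True)
class Skill:
    name: str                          # what the agent calls
    description: str                   # what the agent reads to decide when to call it
    input_schema: dict[str, type]      # simplified typed input schema
    output_schema: dict[str, type]     # simplified typed output schema
    run: Callable[[dict[str, Any]], dict[str, Any]]  # the execution logic

    def invoke(self, args: dict[str, Any]) -> dict[str, Any]:
        _check(args, self.input_schema, "input")
        result = self.run(args)
        _check(result, self.output_schema, "output")
        return result


# Hypothetical skill: finds ISO-formatted dates (YYYY-MM-DD) and nothing else.
extract_dates = Skill(
    name="ExtractDates",
    description="Extract ISO-formatted dates from text. Does not summarize or schedule.",
    input_schema={"text": str},
    output_schema={"dates": list},
    run=lambda args: {"dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", args["text"])},
)
```

Keeping validation inside `invoke` means a caller cannot bypass the schemas, which is one way to make the interface contract enforceable rather than advisory.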
The most important property is atomicity: a skill does one thing well. The skill ExtractDates extracts dates from text. It does not also summarize the text, classify its sentiment, or schedule a calendar event. This constraint feels restrictive at first, but it is what makes skills composable. An agent can chain ExtractDates → LookupCalendarAvailability → BookMeeting precisely because no skill has hidden side effects on the others.
OpenAI's function-calling API, released in June 2023, formalized this pattern at scale. Each "function" registered with a model is structurally a skill: a JSON Schema defines its inputs, a description tells the model when to invoke it, and the model outputs structured calls rather than free-form text. Within months, developers had built shared libraries of hundreds of such functions — effectively community skill repositories.
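To make the pattern concrete, here is a rough sketch of what such a function definition and a resulting structured call can look like. The name/description/parameters layout follows the general shape of a function definition in that style of API, but field names should be checked against the current API reference. The `extract_dates` function, the sample call, and the `parse_call` helper are hypothetical.

```python
import json

# Hypothetical function definition: a name, a description the model reads,
# and a JSON Schema describing the arguments.
EXTRACT_DATES_FUNCTION = {
    "name": "extract_dates",
    "description": (
        "Extract calendar dates mentioned in a passage of text. "
        "Do not use for summarizing text or scheduling events."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "text": {"type": "string", "description": "The passage to scan for dates."},
        },
        "required": ["text"],
    },
}

# Assumed shape of a structured call emitted by a model: the arguments
# arrive as a JSON-encoded string rather than free-form prose.
model_call = {"name": "extract_dates", "arguments": '{"text": "Kickoff is 2024-05-02."}'}


def parse_call(call: dict, definition: dict) -> dict:
    """Decode a structured call and check it against the definition's required fields."""
    if call["name"] != definition["name"]:
        raise ValueError(f"unknown function: {call['name']}")
    args = json.loads(call["arguments"])
    missing = [key for key in definition["parameters"]["required"] if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return args
```

Because the call is structured data, the execution layer can reject a malformed invocation before any skill logic runs.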
A skill is not a prompt — it is an interface contract. The description is for the agent's reasoning layer; the schema is for the execution layer. These two concerns must be designed separately.
Before skill libraries became standard, teams built agent capabilities ad hoc: a custom function in the prompt for one task, an inline code block for another, a hardcoded API call somewhere else. The result was fragile systems where changing one capability could break five others, and where the same logic was reimplemented in slightly different ways across different agents.
When Microsoft released Semantic Kernel in 2023, their central design decision was the "plugin" — a versioned, documented skill module that could be registered into any kernel instance. Their published case studies showed teams at major enterprise clients reducing agent development time by 60–70% once they had a shared skill library, because new agents were assembled from existing tested modules rather than built from scratch.
The economic logic is the same as software libraries generally: write once, test thoroughly, reuse everywhere. But for agents, the stakes are higher — a skill invoked in a production pipeline may execute real-world actions (sending emails, writing to databases, calling APIs), so the thoroughness of the write-once step is critical.
The shift from "an agent with tools" to "a library of skills an agent can compose" is the difference between a tradesperson with a specific set of fixed tools and a workshop with a catalogued inventory of every tool ever built — the second can tackle problems the first never imagined.
You are going to practice the first and most important skill design decision: decomposing a broad capability into atomic skills. The AI tutor will give you a complex agent task and challenge you to identify the individual library skills it should invoke.
In early 2024, Anthropic published documentation on their "tool use" API, including specific guidance on why tool descriptions matter as much as schemas. Their engineering notes cited internal experiments where the same underlying function — a web search capability — was described in two ways: one description read "searches the web" and another read "retrieves up-to-date information from the internet when the answer may have changed since the training cutoff, or when a specific URL or current fact is needed." The second description reduced incorrect tool invocations by over 40% in their evaluations. The lesson: the natural language contract between description and model reasoning is as precise an engineering decision as the typed schema.
Every production skill needs four precisely designed components. Getting any one wrong causes subtle failures that are hard to debug at the agent orchestration level.
1. Name: Should be an unambiguous verb-noun pair that describes the action, not the implementation. SearchWeb and FetchURL are different skills — one queries a search engine, the other retrieves a specific page. Names that are too generic (GetData) force the agent to rely entirely on the description and often lead to mis-selection.
2. Description: This is the reasoning contract — the text the model uses to decide whether this skill is appropriate for the current step. A good description names: what the skill does, what it requires as preconditions, and critically, when not to use it. Anthropic's guidance explicitly recommends including negative cases: "Use this skill when X; do not use it when Y."
3. Input Schema: Every parameter should have a type, a description, and a specification of whether it is required. Optional parameters need their default values documented. Parameters should be named for what they represent semantically, not what they map to technically: customerName, not param1.
4. Output Schema: The shape of what comes back. This is often under-designed. A skill that returns an untyped string forces the calling agent to parse and interpret — introducing a reasoning step that should be structural. Returning a typed object with named fields ({ urgencyLevel: "high", confidence: 0.92 }) lets downstream skills consume the output reliably.
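The four components can be seen side by side in a small worked contract. This is an illustrative sketch only: the `ClassifyUrgency` skill, its field names, and the keyword rule standing in for real classification logic are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Literal

# 1. Name: a verb-noun pair describing the action, not the implementation.
NAME = "ClassifyUrgency"

# 2. Description: what it does, its precondition, and when not to use it.
DESCRIPTION = (
    "Classify how urgently a support ticket needs a response. "
    "Requires the full ticket text. "
    "Use this when routing a new ticket; do not use it to judge customer sentiment."
)


# 3. Input schema: typed, semantically named, with the optional default documented.
@dataclass(frozen=True)
class ClassifyUrgencyInput:
    ticket_text: str                 # required
    customer_tier: str = "standard"  # optional; defaults to "standard"


# 4. Output schema: a typed object with named fields, not a free-form string.
@dataclass(frozen=True)
class ClassifyUrgencyOutput:
    urgency_level: Literal["low", "high"]
    confidence: float


def classify_urgency(inp: ClassifyUrgencyInput) -> ClassifyUrgencyOutput:
    # Placeholder rule for illustration; a real skill would call a model or service.
    urgent_markers = ("down", "outage", "data loss")
    if any(marker in inp.ticket_text.lower() for marker in urgent_markers):
        return ClassifyUrgencyOutput(urgency_level="high", confidence=0.9)
    return ClassifyUrgencyOutput(urgency_level="low", confidence=0.6)
```

A downstream skill can branch on `urgency_level` directly, with no parsing step in between.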
The description is a contract with the model's reasoning layer. The input and output schemas are contracts with the execution layer. Conflating these two contracts is the most common skill design error in production systems.
A skill that can fail silently is a liability in any composed workflow. Production skill libraries must define their error contract: what error types can this skill return, what do they mean, and what should the calling agent do in each case?
LangChain's published tooling documentation distinguishes between three error categories for skills: transient errors (network timeout — retry is appropriate), input errors (malformed parameter — the agent should reformat and retry), and hard failures (permission denied — escalate to human or abandon the task). An agent that receives a skill error without this categorization has no principled way to decide how to proceed, and will either retry forever, fail silently, or hallucinate a result.
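One way to give the agent a principled decision is to map each error category to an action explicitly. The sketch below assumes the three categories described above; the `ErrorCategory` and `Action` enums, the `next_action` function, and the retry limit of three are hypothetical choices for illustration, not part of any library.

```python
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "transient"  # e.g. network timeout
    INPUT = "input"          # e.g. malformed parameter
    HARD = "hard"            # e.g. permission denied


class Action(Enum):
    RETRY = "retry"
    REFORMAT_AND_RETRY = "reformat_and_retry"
    ESCALATE = "escalate"


def next_action(category: ErrorCategory, attempt: int, max_attempts: int = 3) -> Action:
    """Decide what the agent should do after a categorized skill error.

    `attempt` is the number of attempts already made, starting at 1.
    """
    if category is ErrorCategory.HARD:
        return Action.ESCALATE
    if attempt >= max_attempts:
        # A bounded retry budget prevents the "retry forever" failure mode.
        return Action.ESCALATE
    if category is ErrorCategory.TRANSIENT:
        return Action.RETRY
    return Action.REFORMAT_AND_RETRY
```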
The most robust skill libraries implement what engineers call a "sealed result type" — the skill always returns either a typed success value or a typed error value, never throws an exception that propagates to the orchestrator. This means the agent's reasoning about errors is part of the documented interface, not a hidden failure mode.
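A sealed result type can be approximated in Python with two small classes and a wrapper that converts exceptions into typed errors. This is a sketch under assumptions: `Ok`, `Err`, the `sealed` decorator, the error code strings, and the `parse_amount` skill are illustrative names, and the mapping from exception types to codes is one possible convention.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err:
    code: str      # "transient", "input", or "hard" (assumed convention)
    message: str


Result = Union[Ok[T], Err]


def sealed(fn: Callable[..., T]) -> Callable[..., "Result[T]"]:
    """Wrap a skill so it always returns Ok or Err and never raises."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return Ok(fn(*args, **kwargs))
        except TimeoutError as exc:
            return Err("transient", str(exc))
        except ValueError as exc:
            return Err("input", str(exc))
        except Exception as exc:  # anything else is treated as a hard failure
            return Err("hard", str(exc))
    return wrapper


@sealed
def parse_amount(text: str) -> float:
    # Hypothetical skill body: float() raises ValueError on malformed input.
    return float(text)
```

The orchestrator then inspects the returned value with `isinstance(result, Ok)` instead of wrapping every call in its own try/except.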
When Google DeepMind published their Gemini function-calling architecture in late 2023, they emphasized that every registered function should declare its possible error codes alongside its output schema. This allowed their evaluation harness to test not just happy-path invocations but the agent's error-handling behavior — a quality bar that teams building with raw function calls typically never reached.
In a skill library serving multiple agents, an undocumented failure mode in one skill becomes a production incident across all agents that use it. Document errors with the same rigor as outputs.
The AI tutor will give you a skill to design. Write a complete interface contract: name, description (including when NOT to use it), input schema with typed parameters, and output schema. Then get feedback on the precision and completeness of your contract.
In 2024, Cognition AI publicly demonstrated their agent "Devin" completing a multi-step software engineering task: reading a GitHub issue, cloning the relevant repository, writing a code fix, running the test suite, and submitting a pull request. Each of these was a separate library skill. What drew attention from engineers was not the individual skills — any of them could be implemented by a junior developer — but the agent's runtime composition: Devin's orchestrator decided, based on the output of each skill, which skill to invoke next. When tests failed, it did not blindly retry; it invoked a code inspection skill to analyze the failure output, then a different patching skill. The composition was dynamic, not scripted. This is the core challenge of runtime skill composition: the agent must reason about skill outputs to plan the next invocation.
There are three fundamental patterns for composing skills at runtime. Understanding which pattern applies to a given task is an architectural decision that determines both the agent's efficiency and its robustness.
Sequential composition is the most common pattern: the output of skill A becomes the input to skill B. The customer support agent from Lesson 1 uses sequential composition — read ticket → classify urgency → look up account → draft response. Each step depends on the previous. The risk is latency: each skill must complete before the next begins, and a failure in step 2 aborts the entire chain.
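A sequential chain can be sketched as a loop that threads shared state through each skill and stops at the first failure. The four step functions below are hypothetical stubs standing in for the support workflow described above; `run_sequential` is likewise an illustrative helper, not an API from any framework.

```python
from typing import Callable

Step = Callable[[dict], dict]


def read_ticket(state: dict) -> dict:
    return {"text": state["raw"].strip()}


def classify_urgency(state: dict) -> dict:
    return {"urgency": "high" if "down" in state["text"].lower() else "low"}


def look_up_account(state: dict) -> dict:
    return {"account": {"id": state["customer_id"], "tier": "gold"}}


def draft_response(state: dict) -> dict:
    account_id = state["account"]["id"]
    return {"response": f"[{state['urgency']}] Hello {account_id}, we are looking into it."}


STEPS: list[tuple[str, Step]] = [
    ("ReadTicket", read_ticket),
    ("ClassifyUrgency", classify_urgency),
    ("LookUpAccount", look_up_account),
    ("DraftResponse", draft_response),
]


def run_sequential(steps: list[tuple[str, Step]], state: dict) -> dict:
    """Run each step in order, merging its output into the state.

    A failure aborts the chain and records which step failed.
    """
    for name, step in steps:
        try:
            state = {**state, **step(state)}
        except Exception as exc:
            return {**state, "failed_step": name, "error": str(exc)}
    return state
```

Recording `failed_step` in the returned state keeps the abort visible to the orchestrator instead of surfacing as an unexplained missing field.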
Parallel composition is appropriate when multiple skills need the same input but produce independent outputs that are combined later. A research agent might invoke SearchAcademic, SearchNews, and SearchPatents simultaneously, then pass all three results to a SynthesizeFindings skill. Parallel composition requires the orchestrator to manage concurrency and handle the case where some parallel branches fail while others succeed.
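The partial-failure case is the part most often overlooked, so it is worth sketching. The example below uses `asyncio.gather` with `return_exceptions=True` so that one failed branch does not cancel the others. The three search coroutines are hypothetical stubs (one is made to fail deliberately), and `run_parallel` is an illustrative helper.

```python
import asyncio
from typing import Awaitable, Callable


async def search_academic(query: str) -> list[str]:
    return [f"paper about {query}"]


async def search_news(query: str) -> list[str]:
    # Simulated failure so the partial-failure path is exercised.
    raise TimeoutError("news API timed out")


async def search_patents(query: str) -> list[str]:
    return [f"patent on {query}"]


SKILLS: dict[str, Callable[[str], Awaitable[list[str]]]] = {
    "SearchAcademic": search_academic,
    "SearchNews": search_news,
    "SearchPatents": search_patents,
}


async def run_parallel(query: str, skills: dict) -> tuple[dict, dict]:
    """Invoke every skill concurrently; separate successes from failures."""
    names = list(skills)
    outcomes = await asyncio.gather(
        *(skills[name](query) for name in names),
        return_exceptions=True,  # failed branches come back as exception objects
    )
    results, failures = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            failures[name] = str(outcome)
        else:
            results[name] = outcome
    return results, failures
```

A call such as `asyncio.run(run_parallel("batteries", SKILLS))` returns both dictionaries, leaving it to the synthesis step to decide whether two out of three sources are enough.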
Conditional composition is where agent reasoning becomes most visible. The agent evaluates the output of one skill and uses it to decide which skill to invoke next — not just what inputs to pass. The Cognition Devin case is conditional composition: test failure output determines whether to invoke AnalyzeFailure or SubmitPullRequest. This pattern requires the agent's reasoning layer to interpret skill outputs as decision inputs, which means output schemas must be designed with downstream branching in mind.
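Designing the output schema for branching can be illustrated with a small loop in which the test result decides the next skill. This is a hedged sketch, not a description of how Devin works internally: `RunTestsOutput`, `fix_until_green`, the skill names in the trace, and the three-round limit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RunTestsOutput:
    passed: bool           # the field the orchestrator branches on
    failure_log: str = ""  # only meaningful when passed is False


def fix_until_green(
    run_tests: Callable[[], RunTestsOutput],
    analyze_failure: Callable[[str], str],
    apply_patch: Callable[[str], None],
    max_rounds: int = 3,
) -> list[str]:
    """Return the sequence of skills invoked, chosen from each test result."""
    trace: list[str] = []
    for _ in range(max_rounds):
        result = run_tests()
        trace.append("RunTests")
        if result.passed:
            trace.append("SubmitPullRequest")
            return trace
        # Failure branch: inspect the log before attempting another patch.
        diagnosis = analyze_failure(result.failure_log)
        trace.append("AnalyzeFailure")
        apply_patch(diagnosis)
        trace.append("ApplyPatch")
    trace.append("EscalateToHuman")
    return trace
```

Because `passed` is an explicit boolean field rather than something inferred from log text, the branch is structural and does not depend on the model interpreting free-form output.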
The composition pattern is not chosen once at design time — sophisticated agents switch between sequential, parallel, and conditional composition within a single task. The skill library must support all three, which means skills cannot have hidden state that assumes they are always called in the same order.
The most advanced pattern, explored in systems like AutoGPT and in early LangChain agents, is dynamic skill discovery: the agent does not have a fixed set of skills pre-loaded, but queries a skill registry at runtime to find capabilities that match its current need. This requires skill descriptions to function as semantic search targets, not just documentation.
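A registry lookup of this kind can be sketched in a few lines. Note the swapped-in technique: a production registry would more likely rank descriptions by embedding similarity, whereas this example uses plain word overlap as a stand-in so that it runs without external dependencies. The `SkillRegistry` class and the three registered skills are hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


class SkillRegistry:
    def __init__(self) -> None:
        self._descriptions: dict[str, str] = {}

    def register(self, name: str, description: str) -> None:
        self._descriptions[name] = description

    def find(self, need: str, top_k: int = 1) -> list[str]:
        """Rank skills by word overlap between the need and each description."""
        need_tokens = _tokens(need)
        scored = [
            (len(need_tokens & _tokens(description)), name)
            for name, description in self._descriptions.items()
        ]
        # Drop skills with no overlap; break ties alphabetically for determinism.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [name for _, name in ranked[:top_k]]


registry = SkillRegistry()
registry.register(
    "SearchPatents",
    "Searches patent filings by keyword and returns patent numbers and titles.",
)
registry.register(
    "SearchNews",
    "Retrieves recent news articles about a topic from news publishers.",
)
registry.register("GetData", "Queries external data.")
```

Even with this crude matcher, the vaguely described `GetData` skill is effectively undiscoverable for most needs, which is the practical cost of a generic description.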
When researchers at Meta AI published the Toolformer paper in 2023, they trained a model to decide, for each step in a task, whether to call a tool, which tool to call, and how to format the call, learning from only a handful of demonstrations per tool. For agents that select tools from descriptions at runtime, the quality of those descriptions determines the accuracy of tool selection. Tools with vague descriptions like "queries external data" are reported to be selected incorrectly 3–4 times more often than tools with precise descriptions naming the specific data source and query format.
Runtime planning requires the agent to reason about skill composition before executing it. Systems like LangChain's "Plan and Execute" agent architecture, published in 2023, explicitly separated the planning step (which skills, in what order) from the execution step (actually invoking them). This separation allows the plan to be evaluated, logged, and even overridden by a human reviewer before execution — a safety property that is impossible in reactive (one-step-at-a-time) composition.
The tradeoff: plan-then-execute is more auditable but brittle if the execution context changes mid-plan. Reactive composition adapts better to changing state but is harder to inspect. Production systems increasingly use hybrid approaches: a high-level plan is generated upfront, but each skill invocation may trigger reactive replanning if its output is unexpected.
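The hybrid approach can be sketched as a plan that is reviewed upfront and then executed step by step, with a hook for replanning when an output is unexpected. Everything here is an illustrative assumption rather than any framework's actual design: the `PlanStep` type, the `execute_plan` function, the `approve` and `replan` callbacks, and the convention that a skill signals surprise with an `"unexpected"` key.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class PlanStep:
    skill: str
    args: dict = field(default_factory=dict)


def execute_plan(
    plan: list[PlanStep],
    skills: dict[str, Callable[..., dict]],
    approve: Callable[[list[PlanStep]], bool],
    replan: Callable[[PlanStep, dict, list[PlanStep]], list[PlanStep]],
    max_replans: int = 2,
) -> dict:
    """Review the plan first, then execute it, replanning reactively if needed."""
    # Planning and execution are separate: nothing runs until the plan is approved.
    if not approve(plan):
        return {"status": "rejected", "log": []}

    log: list[tuple[str, dict]] = []
    queue = list(plan)
    replans = 0
    while queue:
        step = queue.pop(0)
        output = skills[step.skill](**step.args)
        log.append((step.skill, output))
        if output.get("unexpected"):
            if replans == max_replans:
                return {"status": "aborted", "log": log}
            replans += 1
            # Reactive replanning: the remaining steps may be replaced.
            queue = replan(step, output, queue)
    return {"status": "done", "log": log}
```

The returned log doubles as the audit trail, and the cap on replans keeps the reactive path from looping indefinitely.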
A skill registry is not just a list — it's the agent's ability to discover what it can do. Designing skill descriptions as semantic search targets is as important as any other part of the interface contract.
Practice designing explicit composition plans — the kind that could be reviewed before execution. The AI tutor will give you a task and a skill library. You will write a plan specifying which composition pattern (sequential, parallel, conditional) to use at each step and why.
This lesson explores skill governance and safety: the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to skill governance and safety.