In 2022, DeepMind's AlphaCode system tackled competitive programming problems by decomposing each problem into a hierarchy: first identify the problem class (sorting, graph traversal, dynamic programming), then generate candidate algorithms, then implement and verify each piece independently. On Codeforces competitions, it ranked in the top 54% of human participants — not by solving problems whole, but by breaking them into recognizable subtask categories and solving each layer in sequence. The hierarchical decomposition was not optional — it was the architecture.
The same principle showed up in NASA's Mars Science Laboratory mission, where autonomous onboard planning software called AEGIS used a layered goal hierarchy: high-level science objectives decomposed into observation targets, then into specific instrument commands, then into power and timing constraints. When a lower-level step failed, the system could revise that layer without discarding the entire science plan.
Hierarchical task decomposition (HTD) is the process of taking a high-level goal and recursively breaking it into smaller subgoals until you reach atomic actions — steps the agent can execute directly. The result is a tree structure: the root is the goal, intermediate nodes are subgoals, and leaves are executable actions.
The formal version of this, Hierarchical Task Network (HTN) planning, has been used in AI systems since the 1990s. HTN planners like SHOP2 and SIPE-2 were deployed in real logistics operations and military mission planning. What makes the architecture powerful is that it encodes human expert knowledge about how to approach problems, not just what the goal is.
Flat planners ask: "What sequence of actions leads from state A to state B?" Hierarchical planners ask: "What is the expert decomposition of this task class, and how do I instantiate it here?" The hierarchy carries domain knowledge that flat search cannot recover on its own.
Modern LLM-based agents rediscover this pattern. When GPT-4 is prompted with a complex task and chain-of-thought reasoning, it naturally generates hierarchical subgoal trees — not because it was explicitly told to, but because decomposition is the structure of competent human reasoning that it learned from training data.
There are two axes of decomposition. Horizontal decomposition splits a task into parallel or sequential subtasks at the same level of abstraction — like splitting "prepare a report" into "gather data," "analyze data," and "write findings." Each piece is independent or lightly coupled. Vertical decomposition drills down through abstraction levels — "gather data" becomes "identify sources," then "query each source," then "parse response format," then "handle pagination."
Real planning systems combine both. The DARPA Urban Challenge autonomous vehicle teams in 2007 used horizontal decomposition to split driving into perception, planning, and control modules running in parallel — and vertical decomposition within the planning module to go from route-level decisions down to steering commands. Sebastian Thrun's Stanford team, which won the 2005 DARPA Grand Challenge, published detailed accounts of this layered architecture in their 2006 Journal of Field Robotics paper.
Premature concreteness — decomposing too deep too fast, without confirming the high-level structure is correct. NASA's Mars Observer mission failure in 1993 was partly attributed to teams working detailed subsystem designs before the system-level architecture was validated. Fix the tree structure first, then fill in leaves.
AutoGPT (2023), LangChain agents, and the BabyAGI architecture all implement implicit hierarchical decomposition. BabyAGI, released by Yohei Nakajima in March 2023, explicitly maintains a task list where an "executive" LLM generates subgoals, an "execution" LLM completes them, and a "prioritization" LLM reorders remaining tasks based on results. This three-layer structure is a direct implementation of HTN principles with language model components.
The key engineering decision in these systems is decomposition granularity: how many levels deep, and how specific each leaf node must be before handing to an executor. Too coarse and the executor fails on ambiguous instructions. Too fine and the planner spends more tokens on decomposition than on execution.
You're going to practice hierarchical task decomposition by working with an AI planning assistant. Choose a complex goal — something like "build and deploy a machine learning model for customer churn prediction" or "design a disaster response plan for a city of 500,000." Then:
In 2016, OpenAI's research into RL agent behavior on Montezuma's Revenge (an Atari game notorious for sparse rewards) revealed that standard reinforcement learning completely failed — the agent received no reward signal for thousands of steps and learned nothing. The breakthrough came from subgoal-based approaches: researchers used intrinsic motivation to generate intermediate objectives (reaching a new room, picking up a key) that served as stepping stones. When Uber AI Labs published their "Go-Explore" algorithm in 2018, it addressed the same problem by explicitly storing and returning to promising intermediate states, treating each as a subgoal. The game that defeated standard RL for years fell to subgoal generation.
IBM's Watson system during its 2011 Jeopardy! victory used a different form of subgoal generation: decomposing each clue into evidence-gathering subgoals across 100+ algorithms running in parallel, then merging their confidence scores. The subgoals were dynamically generated per question — Watson didn't have a fixed decomposition for "what is a Jeopardy! answer," it generated the appropriate evidence subgoals based on the clue structure.
In classical AI planning, landmarks are facts or states that must be true at some point in any valid solution to a planning problem. Automatically identifying landmarks gives a planner a set of necessary subgoals — not hypothetical intermediates, but provably required ones. The Fast Downward planning system (used extensively in academic and competition planning) includes landmark detection as a core component of its heuristic evaluation.
The landmark approach is powerful because it's deductive — subgoals are derived from the problem structure, not guessed. In a logistics planning domain, "the truck must visit the warehouse" might be a landmark because no valid solution exists without it. The planner can then focus search effort on achieving landmarks in a sensible order.
Landmark detection algorithms work by analyzing the causal structure of the planning domain: if achieving goal G requires action A, and action A requires precondition P, then P is a landmark. Disjunctive landmarks capture "at least one of these must hold" — more complex but more complete coverage of necessary intermediate states.
Means-ends analysis (MEA), introduced in the General Problem Solver (GPS) by Newell and Simon in 1961, generates subgoals by working backwards from the goal: "What is the difference between the current state and the goal? What operator reduces that difference? What preconditions must hold for that operator?" Each precondition becomes a new subgoal.
This backward chaining creates a subgoal chain that, when executed forward, achieves the original goal. STRIPS-style planners use a version of this called goal regression. When NASA's Remote Agent Experiment ran aboard Deep Space 1 in 1999 — the first time an AI planning system controlled a spacecraft autonomously — it used a goal-regression planner to generate the sequence of intermediate states needed to achieve mission objectives. For 11 hours in May 1999, the spacecraft flew itself.
Modern LLM-based agents generate subgoals through prompted reasoning rather than formal analysis. The ReAct framework (Yao et al., 2022) interleaves Reasoning and Acting: the model generates a "thought" that identifies the next subgoal, then takes an action to pursue it, then reasons about the result to identify the next subgoal. This creates a dynamic subgoal chain that adapts to observed results rather than following a pre-committed plan.
The Reflexion paper (Shinn et al., 2023) added a memory layer: after each failed attempt, the agent generates a verbal reflection on what went wrong, which is stored and prepended to future prompts. This means the subgoal generation process learns from failure within a session — not through gradient updates but through language-mediated self-critique.
LLM-generated subgoals are plausible but not provably necessary. Unlike landmark detection, there's no guarantee that a language model's suggested intermediate steps are actually required — or even sufficient. Hallucinated subgoals that seem reasonable can lead agents confidently down paths that cannot produce the goal. Verification layers are essential.
You'll practice identifying and critiquing subgoal chains using the ReAct-style reasoning approach. Pick a goal that requires multiple intermediate steps in an uncertain domain — for example, "identify the root cause of a production system outage" or "design a clinical trial protocol for a new drug candidate."
On July 4, 1997, the Pathfinder lander's onboard computer suffered repeated resets on the Martian surface due to a priority inversion bug. The software used VxWorks, a real-time operating system with a preemptive scheduler. A low-priority task held a shared resource that a high-priority task needed, but a medium-priority task kept preempting the low-priority task, preventing resource release. The system's watchdog timer detected that high-priority operations weren't completing and triggered a full reset — effectively "backtracking" to a safe initial state. Engineers on Earth diagnosed the issue remotely, enabled priority inheritance in the scheduler over a radio link, and the lander resumed normal operation. The backtracking was costly but correct: the system chose a known-safe state over an unknown-corrupted one.
DeepMind's AlphaGo, in its famous 2016 match against Lee Sedol, exhibited a form of plan revision in Game 4. After Lee Sedol played the unexpected "divine move" (Move 78), AlphaGo's evaluation function initially assigned it low value — a misestimation. The system continued executing its prior plan for several moves before its lookahead detected that the board position had shifted significantly. It then revised its strategy mid-game, a form of online replanning that ultimately failed — Lee Sedol won that game — but demonstrated that even state-of-the-art game AI faces the challenge of timely plan revision when unexpected moves occur.
Backtracking begins with detection. An agent cannot revise a plan it doesn't know has failed. There are three categories of failure detection in planning systems:
The Mars Pathfinder case illustrates a critical timing challenge: failure detection must be fast enough to prevent cascading damage, but the "backtrack to safe state" response must not be triggered so aggressively that normal operations are disrupted. This threshold calibration is an active research area in fault-tolerant systems, distinct from the planning problem itself.
In LLM-based agents, failure detection is harder because outputs are text, not formal state transitions. The Voyager system (Wang et al., 2023) for Minecraft used a code execution environment to detect failures precisely: if generated JavaScript code raised an exception or produced wrong game state, the agent knew definitively that the subtask had failed and triggered replanning.
Once failure is detected, the agent must decide where to rewind. The three main strategies are:
Chronological backtracking: Undo the most recent decision and try an alternative. Simple and cheap, but can lead to "thrashing" — repeatedly undoing and retrying the same level when the real problem is deeper in the decision tree. PROLOG's default search uses chronological backtracking.
Dependency-directed backtracking (DBT): Analyze the cause of the failure — which earlier decision created the constraint that is now violated — and jump back to that specific decision point, skipping irrelevant intermediate steps. The ATMS (Assumption-based Truth Maintenance System) developed by Johan de Kleer at Xerox PARC in the 1980s formalized this approach. DBT can recover in O(n) steps what chronological backtracking might need O(2^n) steps to find.
Partial plan preservation: Rather than discarding the entire plan, identify which subgoals have already been validly achieved and preserve them as fixed points, replanning only the failed portion. This is the approach used in CPEF (Continuous Planning and Execution Framework) and is essential in long-horizon tasks where early subgoal achievements are expensive to reproduce.
Commitment levels: an agent can hold plans at three levels of commitment — "committed" (will execute), "tentative" (preferred but revisable), and "open" (not yet decided). When failure occurs, only tentative and open commitments need revision. This is formalized in Bratman's theory of Practical Reasoning and implemented in BDI (Belief-Desire-Intention) agent architectures used in industrial autonomous systems.
When a plan fails, the agent has two broad options: replan from scratch (generate a new plan from the current state to the goal) or repair the existing plan (make minimal edits to the failed plan to make it valid again). Replanning from scratch is simpler to implement but computationally expensive and discards potentially valid plan segments. Plan repair is cheaper but requires formal analysis of which parts of the plan remain valid.
Research by Koenig and Sun (2009) on "D*-Lite" — an incremental replanning algorithm for robot navigation — showed that incremental repair can be orders of magnitude faster than full replanning when the plan changes are localized. D*-Lite was used in DARPA Urban Challenge vehicles that needed to replan routes when unexpected obstacles appeared. When a parked car blocked a planned route, the system updated only the affected portion of the navigation plan rather than recomputing from scratch.
You'll practice diagnosing plan failures and choosing the right backtracking strategy. Start with a multi-step plan in any domain, then introduce a mid-plan failure scenario. Your goal is to identify which type of backtracking is most appropriate.
This lesson explores lesson 4 — examining the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.
Use the AI below to explore the concepts from Lesson 4 in depth. Ask questions, challenge assumptions, and work through practical scenarios related to lesson 4.