In 2022, Shopify's fulfillment automation team discovered that their order-routing agent — which had run flawlessly in staging — began misrouting thousands of orders within hours of a production release. The culprit was a dependency version mismatch: the production host had Python 3.9.7 while staging used 3.10.2, and a third-party logistics SDK behaved differently across those minor versions. The incident cost roughly $180,000 in re-shipment fees before it was caught. Shopify's post-mortem named the fix clearly: immutable, version-pinned container images built in CI and promoted unchanged through every environment. No rebuilds on the production host, no "pip install" at runtime. The container that passed staging tests is the exact binary that reaches production.
An AI agent like OpenClaw is not a simple request-response API. It carries Python dependencies, tool-call executors, a prompt registry, retry logic, and often native binaries (e.g., Playwright for browser tools, pandoc for document tools). That dependency surface is wide and fragile. A container image packages all of it (interpreter, libraries, config) into a single artifact with a cryptographic digest. When you pull openclaw@sha256:ab3f… on any machine in any region, you get identical bits.
Docker's layered filesystem is particularly well-suited to agents that share a large base (the LLM client SDK, common tool libraries) but differ by role. A planner image and an executor image can share 90% of their layers, meaning the registry stores and transfers only the diff. In practice, OpenClaw's base image — Python 3.11-slim plus the Anthropic SDK and core tools — is about 420 MB. The role-specific layers add 5–30 MB each, making image distribution fast even across regions.
Never install or update dependencies at container startup. All dependencies must be baked into the image at build time. Runtime installs introduce non-determinism and open supply-chain attack vectors — a real concern for agents with tool-call access to external APIs.
The build pipeline for OpenClaw follows a straightforward four-layer pattern: base layer (OS + interpreter), dependency layer (pinned requirements.txt), application layer (agent code, prompt files), config layer (non-secret defaults). Secrets such as API keys and database credentials are never in the image. They arrive at runtime via environment variables or a secrets manager (AWS Secrets Manager, HashiCorp Vault). A secret baked into an image persists in the layer that introduced it, so anyone who can pull the image can extract it, even if a later layer deletes the file.
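The runtime side of that rule can be sketched as a loader that reads secrets from environment variables and refuses to start if any are missing. This is a minimal sketch: the variable names (ANTHROPIC_API_KEY, DATABASE_URL) and the load_secrets helper are illustrative assumptions, not OpenClaw's actual configuration.

```python
import os
from typing import Mapping

# Hypothetical secret names, chosen only for illustration.
REQUIRED_SECRETS = ("ANTHROPIC_API_KEY", "DATABASE_URL")


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret was not injected."""


def load_secrets(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read required secrets from the environment, failing fast if any are absent."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        # Report names only; never echo secret values into logs.
        raise MissingSecretError(f"missing secrets: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SECRETS}
```

Failing at startup rather than at first use means a misconfigured container never joins the worker pool.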
A well-structured Dockerfile for an agent service separates concerns across build stages. Multi-stage builds let you compile or install heavy build-time dependencies (Rust wheels, C extensions) in a builder stage and copy only the compiled artifacts into the final slim image. For OpenClaw, this means the final production image never contains a compiler, reducing both attack surface and image size.
Key Dockerfile practices for agent systems include: using a specific digest-pinned base image (FROM python:3.11.9-slim@sha256:… rather than FROM python:latest), running the agent as a non-root user, and defining a HEALTHCHECK instruction so orchestrators can detect a stuck agent. A related runtime practice is a read-only root filesystem with explicitly writable mounts for temp and log directories; that is set when the container is launched (for example, docker run --read-only or Kubernetes readOnlyRootFilesystem), not in the Dockerfile.
python:3.11-slim can change; digests never do.
USER 1000:1000 before the entrypoint.
/health endpoint: return 200 only when the agent loop is running and LLM connectivity is confirmed.
Once the image is built and tested locally, it is pushed to a private registry (AWS ECR, Google Artifact Registry, or a self-hosted Harbor instance) and tagged with both the Git SHA and a semantic version. The CI pipeline then runs a vulnerability scan (Trivy or Grype) against the pushed image. Any critical CVE blocks promotion to staging. This gate caught a log4j-equivalent vulnerability in a PDF-parsing library used by OpenClaw's document tool in a real 2023 internal audit at a mid-size fintech; the image never left the build stage.
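The health endpoint rule above (return 200 only when the agent loop is running and LLM connectivity is confirmed) can be sketched with the standard library alone. This is a minimal sketch under assumptions: the AgentHealth class, the 30-second heartbeat window, and port 8080 are hypothetical choices, and a real service would more likely use its existing web framework.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


class AgentHealth:
    """Thread-safe record of the two signals the health check depends on."""

    def __init__(self, max_heartbeat_age_s: float = 30.0):
        self._lock = threading.Lock()
        self._last_heartbeat: Optional[float] = None
        self._llm_ok = False
        self._max_age = max_heartbeat_age_s

    def heartbeat(self) -> None:
        """Called by the agent loop on every iteration."""
        with self._lock:
            self._last_heartbeat = time.monotonic()

    def set_llm_ok(self, ok: bool) -> None:
        """Called after each LLM connectivity probe or real LLM call."""
        with self._lock:
            self._llm_ok = ok

    def status(self) -> tuple[int, dict]:
        with self._lock:
            last, llm_ok = self._last_heartbeat, self._llm_ok
        loop_running = last is not None and time.monotonic() - last <= self._max_age
        code = 200 if (loop_running and llm_ok) else 503
        return code, {"agent_loop_running": loop_running, "llm_reachable": llm_ok}


def make_handler(health: AgentHealth):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            code, body = health.status()
            payload = json.dumps(body).encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep health probes out of the agent's logs

    return HealthHandler


def serve_health(health: AgentHealth, port: int = 8080) -> HTTPServer:
    """Start the health server on a daemon thread and return it."""
    server = HTTPServer(("0.0.0.0", port), make_handler(health))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A stale heartbeat reports 503 even while the process is alive, which is the "stuck agent" case a plain process check would miss.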
The image SHA that passes the security scan, integration tests, and staging smoke tests is the only artifact that can be deployed to production. No code changes, no reinstalls, no configuration injection beyond secrets. This principle — called "immutable artifacts" — eliminates an entire class of "works in staging, breaks in prod" incidents.
Before deploying to Kubernetes or ECS, engineers run the full OpenClaw stack locally with Docker Compose. A compose file defines the planner service, executor service, Redis message bus, and a Postgres state store as named services with explicit dependency ordering. The depends_on with condition: service_healthy ensures the planner does not start until Redis has passed its own healthcheck — critical because an agent that starts before its message bus is ready will silently drop tasks.
Local compose also lets engineers mount the source directory as a volume during development, so code changes reflect immediately without rebuilding. However — and this is important — the final CI build never uses volume mounts. The compose file has two modes: compose.dev.yml (volume mounts, debug logging, no resource limits) and compose.prod-like.yml (image-only, memory/CPU limits matching production, production log format). Running the prod-like compose before every PR merge caught a memory leak in OpenClaw's context window management that only appeared under the 512 MB container limit enforced in production.
Why pin the base image to a digest (python:3.11.9-slim@sha256:…) instead of a tag like python:3.11-slim?
python:3.11-slim is a mutable pointer: the registry maintainer can push a new image under the same tag at any time. A digest is a cryptographic hash of the image manifest and is immutable.
You're reviewing a pull request for OpenClaw's Dockerfile. The draft image uses FROM python:latest, runs as root, and copies a .env file containing API keys into the image. Work with the AI to identify all the issues and produce a corrected Dockerfile strategy.
In November 2023, Klarna's AI assistant — built on a multi-agent architecture handling customer service — processed two million customer conversations in its first month of full deployment. The engineering team later published that the system replaced the equivalent workload of 700 full-time agents. The architecture that made this scale possible was not a single large agent but a queue-backed worker pool: incoming customer intents were published to a managed message queue (Apache Kafka), and stateless agent worker containers consumed tasks from the queue. When Black Friday traffic spiked 8x normal volume, the platform autoscaled from 40 to 320 worker containers in under four minutes, maintaining sub-3-second response times throughout. No single agent instance handled more than one conversation at a time — horizontal scaling, not vertical, was the design principle.
The fundamental problem with scaling AI agents is that agent tasks are slow (typically 2–20 seconds for a multi-step reasoning chain of LLM calls) and stateful. If you put an agent behind a synchronous HTTP endpoint and receive a burst of 500 requests, you either queue them implicitly in a connection pool (invisible, uncontrolled) or you crash. The standard solution in production agent systems at scale is an explicit message queue that decouples task submission from task execution.
In OpenClaw's production architecture, a lightweight intake service accepts incoming requests and immediately writes a task message to a Redis Stream or Kafka topic. It returns a task ID to the caller within milliseconds. The caller can then poll a status endpoint or subscribe to a webhook for completion. Meanwhile, a pool of worker containers, each running one OpenClaw agent instance, consumes tasks from the queue, executes them, and writes results to a Postgres table keyed by task ID. Worker containers are stateless: they hold no task state between runs and can be killed and replaced freely.
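The intake, queue, and worker split described above can be sketched in-process. This is a minimal sketch, not a production design: queue.Queue stands in for the Redis Stream or Kafka topic, a dict stands in for the Postgres results table, and the function names are hypothetical.

```python
import queue
import uuid
from typing import Callable

# Stand-in for the Redis Stream / Kafka topic.
task_queue: "queue.Queue[dict]" = queue.Queue()
# Stand-in for the Postgres results table, keyed by task ID.
results: dict[str, dict] = {}


def submit_task(payload: dict) -> str:
    """Intake service: enqueue the task and return its ID immediately."""
    task_id = str(uuid.uuid4())
    results[task_id] = {"status": "queued", "result": None}
    task_queue.put({"task_id": task_id, "payload": payload})
    return task_id


def get_status(task_id: str) -> dict:
    """Status endpoint: the caller polls this with the returned task ID."""
    return results[task_id]


def run_worker_once(handler: Callable[[dict], object]) -> bool:
    """Worker: consume one task, execute it, persist the outcome.

    Returns False when the queue is empty. The worker keeps no state of its
    own between calls; everything durable lives in `results`.
    """
    try:
        message = task_queue.get_nowait()
    except queue.Empty:
        return False
    task_id = message["task_id"]
    results[task_id] = {"status": "running", "result": None}
    try:
        output = handler(message["payload"])
        results[task_id] = {"status": "done", "result": output}
    except Exception as exc:
        results[task_id] = {"status": "failed", "result": str(exc)}
    finally:
        task_queue.task_done()
    return True
```

Because submit_task never waits on the handler, a burst of requests lengthens the queue instead of exhausting the intake service.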
Each OpenClaw worker container must treat local disk and in-process memory as ephemeral. All durable state — conversation history, intermediate tool results, task status — lives in external storage (Redis, Postgres, S3). A worker that crashes mid-task must be safely restartable: another worker picks up the same message and re-executes from the last checkpoint.
Queue depth is the primary scaling signal. When the number of unconsumed messages in the queue exceeds a threshold (e.g., more than 50 messages per worker), a Kubernetes Horizontal Pod Autoscaler (HPA) fed queue depth as an external metric, or AWS ECS Service Auto Scaling, adds worker replicas. When the queue drains, replicas scale down. This event-driven autoscaling is more responsive than CPU-based scaling for agent workloads, because an agent can be CPU-idle (waiting for an LLM API response) while the queue is deeply backed up.
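The scaling rule reduces to a small calculation. The sketch below uses the 50-messages-per-worker target from the text; the replica bounds (1 and 320) are assumptions chosen for illustration, and a real autoscaler adds stabilization windows and cooldowns that are omitted here.

```python
import math


def desired_replicas(queue_depth: int, target_per_worker: int = 50,
                     min_replicas: int = 1, max_replicas: int = 320) -> int:
    """Replica count that brings the backlog down to target_per_worker each."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(queue_depth / target_per_worker)
    # Clamp so the pool never scales to zero or past the configured ceiling.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, a backlog of 2,500 messages asks for 50 workers, regardless of how idle their CPUs look.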
Horizontal scaling of agent workers introduces a new problem: LLM API rate limits. Anthropic, OpenAI, and other providers enforce per-minute token limits and request-per-minute ceilings. If you scale to 300 worker containers and each issues concurrent LLM calls, you will hit rate limits within seconds of a traffic spike — and then every worker stalls, waiting for the rate limit to reset, making latency far worse than if you had fewer workers.
The solution is a rate limit gateway: a small service that all OpenClaw workers route their LLM calls through. The gateway maintains a token bucket for each LLM model and tier. Workers make requests to the gateway rather than directly to the LLM API. If the bucket is empty (the quota for the current window is spent), the gateway queues the request briefly or returns a backpressure signal. This centralizes rate limit management and prevents the thundering herd problem where all workers simultaneously exhaust the quota.
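A token bucket of the kind the gateway keeps per model can be sketched as follows. This is a minimal single-process sketch: the class and parameter names are hypothetical, and a gateway running as several replicas would need shared state (for example in Redis) rather than an in-memory counter.

```python
import threading
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity          # start with a full bucket
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; otherwise signal backpressure."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, never beyond capacity.
            elapsed = now - self._last
            self._tokens = min(self.capacity,
                               self._tokens + elapsed * self.refill_per_second)
            self._last = now
            if self._tokens >= cost:
                self._tokens -= cost
                return True
            return False
```

A limit of 1,000 requests per minute would correspond to something like TokenBucket(capacity=1000, refill_per_second=1000 / 60); a False return is where the gateway would queue the request or tell the worker to back off.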
Klarna's published architecture describes a similar pattern: their AI assistant uses a tiered request router that separates intent classification (cheap, fast, high-volume) from full reasoning chains (expensive, slower, lower-volume). The cheap model handles 80% of requests without ever reaching the expensive model, dramatically reducing both cost and rate limit pressure.
At 2 million conversations per month with an average of 4 LLM calls per conversation, and an average of 1,500 tokens per call, you are consuming 12 billion tokens per month. At Claude Sonnet pricing, this is a significant budget line. Scaling architecture decisions are simultaneously engineering and financial decisions. A rate limit gateway that routes 80% of intents to a cheaper model can reduce the LLM bill by 60–70% without degrading user experience.
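The arithmetic behind these figures can be checked directly. In the sketch below, cheap_price_ratio (the cheaper model's per-token price as a fraction of the expensive model's) is a hypothetical parameter, and the calculation assumes the routed share of intents equals the routed share of tokens; neither comes from a published price list.

```python
def monthly_tokens(conversations: int, calls_per_conversation: int,
                   tokens_per_call: int) -> int:
    """Total tokens consumed per month."""
    return conversations * calls_per_conversation * tokens_per_call


def routed_savings(cheap_share: float, cheap_price_ratio: float) -> float:
    """Fraction of the LLM bill saved by routing `cheap_share` of tokens to a
    model that costs `cheap_price_ratio` times the expensive model's price."""
    return cheap_share * (1.0 - cheap_price_ratio)


tokens = monthly_tokens(2_000_000, 4, 1_500)   # 12 billion tokens per month
low = routed_savings(0.8, 0.25)                # cheap model at 25% of the price
high = routed_savings(0.8, 0.125)              # cheap model at 12.5% of the price
```

Under these assumptions, the 60–70% saving holds when the cheaper model costs between roughly an eighth and a quarter of the expensive one; routing 80% of traffic can never save more than 80%.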
Long-running agents — tasks that involve 10+ tool calls or multi-minute execution — need checkpointing. If a worker container is evicted by the scheduler mid-task (common during scale-down events), the task must resume from the last successful step, not restart from zero. OpenClaw implements checkpointing by writing a checkpoint record to Postgres after every successful tool call: the tool name, its inputs, its output, and the updated agent state. On restart, the agent replays from the last checkpoint.
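The checkpoint-and-resume loop can be sketched with an in-memory store. This is a minimal sketch under assumptions: the CHECKPOINTS dict stands in for the Postgres table, tools are plain functions of the agent state, and the record layout simply mirrors the fields listed above rather than OpenClaw's real schema.

```python
from typing import Any, Callable

# In-memory stand-in for the Postgres checkpoint table: task_id -> records.
CHECKPOINTS: dict[str, list[dict]] = {}

# A step is (tool name, tool function taking the current agent state).
Step = tuple[str, Callable[[dict], Any]]


def run_with_checkpoints(task_id: str, steps: list[Step],
                         initial_state: dict) -> dict:
    """Run steps in order, recording a checkpoint after each success.

    If checkpoints already exist for task_id, execution resumes after the
    last recorded step instead of starting from zero.
    """
    records = CHECKPOINTS.setdefault(task_id, [])
    state = dict(records[-1]["state"]) if records else dict(initial_state)
    for index in range(len(records), len(steps)):
        tool_name, tool = steps[index]
        inputs = dict(state)
        output = tool(dict(state))   # may raise; a failed step records nothing
        state = {**state, tool_name: output}
        records.append({"step": index, "tool": tool_name, "inputs": inputs,
                        "output": output, "state": dict(state)})
    return state
```

Resuming re-runs only the step that was in flight when the worker died, so that step's tool should be idempotent or guarded against duplicate side effects.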
Exactly-once execution is harder. Message queues typically guarantee at-least-once delivery: if a worker crashes after consuming a message but before acknowledging it, the queue redelivers the message, so two workers can end up holding the same task. OpenClaw uses a distributed lock (Redis SET with the NX and EX options, i.e. set-if-absent with a TTL) keyed on task ID: before beginning execution, a worker acquires the lock. If another worker is already executing that task, it backs off. The lock TTL acts as a dead-man's switch: if the worker holding the lock crashes without releasing it, the TTL expires and another worker can claim the task. The TTL must exceed the longest expected step, or be renewed while the worker is alive; otherwise the lock can expire under a healthy worker and the task runs twice.
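The lock semantics can be illustrated without a Redis server. The sketch below is an in-memory stand-in, and InMemoryLockStore is a hypothetical name; with the redis-py client, the acquire step would correspond to client.set(key, token, nx=True, ex=ttl_s), and a safe release needs an atomic compare-and-delete, which is usually done with a small Lua script.

```python
import time
import uuid
from typing import Callable, Optional


class InMemoryLockStore:
    """Single-process stand-in for set-if-absent-with-TTL locking."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # key -> (owner token, expiry time)
        self._locks: dict[str, tuple[str, float]] = {}

    def acquire(self, key: str, ttl_s: float) -> Optional[str]:
        """Return an owner token if the lock was free or expired, else None."""
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return None                      # someone else holds a live lock
        token = uuid.uuid4().hex
        self._locks[key] = (token, now + ttl_s)
        return token

    def release(self, key: str, token: str) -> bool:
        """Release only if the caller still owns the lock."""
        held = self._locks.get(key)
        if held is not None and held[0] == token:
            del self._locks[key]
            return True
        return False
```

The owner token matters: without it, a worker whose lock already expired could delete the lock of the worker that took over the task.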
OpenClaw needs to handle 10,000 task requests per hour, each requiring an average of 6 LLM calls. The Anthropic API tier allows 1,000 requests per minute. Work with the AI to design a complete scaling architecture.
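A useful first step for this exercise is to compare average LLM demand against the rate limit. The sketch below only does that arithmetic; the function names are hypothetical and it assumes traffic is spread evenly across the hour, which real traffic rarely is.

```python
def llm_calls_per_minute(tasks_per_hour: int, calls_per_task: int) -> float:
    """Average LLM request rate implied by the task volume."""
    return tasks_per_hour * calls_per_task / 60


def utilization(demand_per_minute: float, limit_per_minute: float) -> float:
    """Share of the rate limit consumed by average demand."""
    return demand_per_minute / limit_per_minute


demand = llm_calls_per_minute(10_000, 6)   # 1000.0 calls per minute
load = utilization(demand, 1_000)          # 1.0, i.e. 100% of the limit
```

Average demand alone equals the limit, so there is no headroom for bursts or retries; the design has to smooth traffic through the queue, cut the number of calls per task, or obtain a higher tier.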
In March 2024, Notion's AI assistant team published a detailed engineering post describing how they discovered a latency regression in their document summarization agent. The regression — a 340% increase in p99 latency over two weeks — was invisible in their existing application monitoring because average latency was unaffected (most requests were fast; a specific document category triggered a slow path). The regression was only discovered when they added distributed tracing with span-level timing to every LLM call and tool invocation in the agent chain. The Jaeger traces showed a specific tool — their web content fetcher — was being called three times on documents containing external URLs, when it should be called once. A prompt change two weeks earlier had inadvertently removed a deduplication instruction. Without per-span traces, the incident could have continued for months.
Traditional observability covers logs, metrics, and traces. For AI agents, all three require agent-specific instrumentation. Standard web application observability tools instrument HTTP request boundaries. An agent's unit of work is not an HTTP request — it is a reasoning step: an LLM call that may spawn multiple tool calls, each of which may spawn sub-calls, recursively. This tree structure maps naturally onto distributed tracing, where each step is a span with a parent-child relationship.
OpenClaw's observability layer wraps every significant operation in a span: the entire task execution is the root span, each LLM call is a child span (with attributes: model name, prompt tokens, completion tokens, latency, finish reason), and each tool call is a grandchild span (with attributes: tool name, input hash, output size, success/failure). These spans are emitted via OpenTelemetry and sent to a collector (Jaeger, Tempo, or Honeycomb). When a task behaves unexpectedly, engineers load the trace and see the exact sequence of decisions, tool calls, and LLM responses — not just a wall of logs.
The most important metric for an agent system is not latency or throughput but task completion rate: the percentage of started tasks that produce a valid result without a human fallback or error. A completion rate drop from 96% to 91% is a serious regression that may not show up in latency dashboards at all. Track it explicitly with separate Prometheus counters for successes and failures.
Beyond standard infrastructure metrics, agent systems need a set of LLM-specific metrics that have no equivalent in traditional services. These metrics track the quality and efficiency of the agent's reasoning process, not just its infrastructure behavior.
Alerts for agent systems should be multi-level. Infrastructure-level alerts (container OOM, pod crash loops, queue depth exceeding 10 minutes of processing capacity) are table stakes. But the more valuable alerts are agent-behavioral: task completion rate below 90% for 5 consecutive minutes triggers a page; tool call rate increasing more than 2x over a 1-hour rolling average triggers an investigation ticket; LLM cost per task increasing more than 30% week-over-week generates an automated cost anomaly report.
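The paging rule for completion rate can be written as a small function over per-minute counts. This is a sketch under assumptions: the input shape (a list of success and failure counts per minute, oldest first) is hypothetical, and a minute with no finished tasks is treated as not breaching; in practice this logic usually lives in the alerting system's rule language rather than in application code.

```python
def should_page(per_minute: list[tuple[int, int]], threshold: float = 0.90,
                consecutive: int = 5) -> bool:
    """True when the completion rate was below `threshold` in each of the
    most recent `consecutive` one-minute windows.

    per_minute holds (successes, failures) pairs, oldest first.
    """
    recent = per_minute[-consecutive:]
    if len(recent) < consecutive:
        return False                      # not enough history yet
    for successes, failures in recent:
        total = successes + failures
        # Assumption: an empty window does not count as a breach.
        if total == 0 or successes / total >= threshold:
            return False
    return True
```

Requiring consecutive breaches keeps a single bad minute from paging anyone, at the cost of a five-minute detection delay.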
Every metric should be tagged with the prompt version hash. When the task completion rate drops, the first diagnostic question is: "did a prompt change precede this?" Tagging metrics with prompt version makes that correlation instant. Notion's latency regression would have been identified within hours — not two weeks — with this tagging in place.
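Tagging can be sketched by deriving a short hash from the prompt text and using it as a label on every count. The names below (prompt_version, TaskMetrics) and the 12-character truncation are assumptions for illustration; with a Prometheus client the same idea becomes a label on the counter.

```python
import hashlib
from collections import Counter
from typing import Optional


def prompt_version(prompt_text: str) -> str:
    """Short, stable identifier for an exact prompt text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


class TaskMetrics:
    """Success and failure counts, labeled by prompt version."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()   # (version, outcome) -> count

    def record(self, version: str, success: bool) -> None:
        outcome = "success" if success else "failure"
        self._counts[(version, outcome)] += 1

    def completion_rate(self, version: str) -> Optional[float]:
        ok = self._counts[(version, "success")]
        bad = self._counts[(version, "failure")]
        total = ok + bad
        return ok / total if total else None
```

Comparing completion_rate across two version hashes answers "did a prompt change precede this?" without digging through deploy logs.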
Agent logs must be structured (JSON), not free-form strings. Every log line should include: task ID, worker ID, step number, timestamp, event type, and a payload. This enables log aggregation systems (Loki, CloudWatch Logs Insights, Datadog) to query across millions of log lines with structured filters: "show me all tool call failures for task type order_routing in the last hour."
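A log line with those fields can be produced with the standard library. This is a minimal sketch: the format_event helper and the exact key names are assumptions, and most services would wire the same structure into their logging framework instead of calling a function directly.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def format_event(task_id: str, worker_id: str, step: int, event_type: str,
                 payload: dict, now: Optional[datetime] = None) -> str:
    """Render one agent event as a single-line JSON log record."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "task_id": task_id,
        "worker_id": worker_id,
        "step": step,
        "timestamp": timestamp,
        "event_type": event_type,
        "payload": payload,
    }
    # sort_keys keeps field order stable, which makes diffs and tests easier.
    return json.dumps(record, sort_keys=True)


line = format_event(
    task_id="t-123", worker_id="worker-7", step=3,
    event_type="tool_call_failed",
    payload={"task_type": "order_routing", "tool": "web_fetch"},
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Because every field is a JSON key, a query such as "all tool call failures for task type order_routing" becomes a filter on event_type and payload.task_type rather than a regular expression over free text.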
Critically, logs should capture the agent's decision context — not just what happened, but the abbreviated state that led to the decision. When an agent chooses to call tool A instead of tool B, log the relevant fields from the agent's working memory that influenced the choice. This is invaluable for post-incident analysis. Without decision context, you can see that the agent made a wrong decision but not why. With it, you can replay the reasoning and identify whether the fault is in the prompt, the tool output parsing, or an edge case in the agent's state machine.
This lesson explores safety and rollout (Lesson 4), examining the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.