OpenAI released GPT-5.5 on April 23, rolling it out to Plus, Pro, Business, and Enterprise users in ChatGPT and in the Codex coding assistant. The company says the new model is stronger on agentic coding, computer use, knowledge work, and early scientific research, and that it matches the per-token latency of GPT-5.4 while running roughly 20% faster in real-world serving. A separate GPT-5.5 Pro tier is shipping to paid business plans for longer reasoning runs. Independent write-ups cite an 82.7% score on an internal agentic coding benchmark, and OpenAI claims the model outperforms Gemini 3.1 Pro and Claude Opus 4.5 on the same suite.
The release matters less as a capability jump than as a cadence signal. GPT-5.4 shipped in early April, and GPT-5.5 arrives two to three weeks later, a compressed schedule that suggests frontier labs now prefer small, frequent increments over large version-number launches. The gains are concentrated in what agents need: following long tool-call chains, navigating a desktop, and writing and debugging code over many turns. Each improvement there makes it cheaper and more reliable to let an AI finish a task without a human in the loop, which is the economic hinge the whole industry is pushing on.
The launch also tightens the three-way race among OpenAI, Google, and Anthropic. Google shipped Gemini 3.1 Flash and Gemini Enterprise this week, Anthropic previewed Claude Mythos, and DeepSeek V4 is already in circulation. None of the headline benchmarks separate these models by more than a few points; the real differentiation is increasingly about price, latency, tool integration, and ecosystem rather than raw IQ. Codex integration is OpenAI's lever, Google has the enterprise platform, and Anthropic has the agent-first ergonomics.
For learners: don't over-index on any single benchmark score. A 1-point gap on an agentic coding eval can reverse on your actual codebase, and model-quality differences are often dwarfed by differences in how you prompt, retrieve, and evaluate. The useful exercise is to pick one real task you care about (refactoring a repo, summarizing a policy document, answering customer questions) and benchmark two or three of the current frontier models on it directly. That is the skill that compounds.
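To make that exercise concrete, here is a minimal sketch of a side-by-side comparison in Python. Everything in it is illustrative: `Case`, `score`, and `compare` are hypothetical names, the two "models" are hard-coded stand-in functions rather than real vendor calls, and the phrase-matching rubric is only one crude way to grade an answer. In practice each stand-in would be replaced by a wrapper around whichever model API you use.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# A "model" is any function from prompt text to response text.
# Real vendor SDK wrappers are deliberately left out of this sketch.
Model = Callable[[str], str]


@dataclass
class Case:
    prompt: str
    required: List[str]  # phrases a good answer should contain (toy rubric)


def score(response: str, case: Case) -> float:
    """Fraction of required phrases found in the response, case-insensitive."""
    if not case.required:
        return 0.0
    text = response.lower()
    hits = sum(1 for phrase in case.required if phrase.lower() in text)
    return hits / len(case.required)


def compare(models: Dict[str, Model], cases: List[Case]) -> Dict[str, float]:
    """Run every model on every case and return the mean score per model."""
    return {
        name: mean(score(model(case.prompt), case) for case in cases)
        for name, model in models.items()
    }


# Stand-in models with canned answers, used only to show the harness running.
def model_a(prompt: str) -> str:
    return "Refunds are accepted within 30 days with a receipt."


def model_b(prompt: str) -> str:
    return "We accept refunds."


cases = [
    Case("What is the refund policy?", ["30 days", "receipt"]),
    Case("Can I get a refund?", ["refund"]),
]

results = compare({"model_a": model_a, "model_b": model_b}, cases)

for name, value in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f}")
```

The design point is that the cases and the scoring rule belong to you, not to the model vendor, so the same harness can be rerun whenever a new release lands. A phrase check is a weak grader; for code tasks, a test suite's pass rate would be a more meaningful score.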