OpenAI has taken an unusually transparent step by publishing a detailed technical explanation of how its Codex CLI agent works under the hood. The write-up, authored by OpenAI engineer Michael Bolin, provides one of the clearest looks yet at how a production-grade AI agent coordinates large language models, tools, and user input to perform real software development tasks.
At the core of Codex is what OpenAI calls an agent loop: an iterative cycle that alternates between model inference and tool execution. Each cycle begins when Codex assembles a prompt from structured inputs: system instructions, developer constraints, user messages, environment context, and the available tool definitions, and sends it to OpenAI's Responses API for inference.
The model's output can take one of two forms. It may produce a plain assistant message for the user, or it may request a tool call, such as running a shell command, reading a file, or invoking a patch or search utility. When a tool call is requested, Codex executes it locally (within the configured sandbox limits), appends the result to the conversation, and queries the model again. This loop continues until the model emits a final assistant message, marking the end of the turn.
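To make the shape of that loop concrete, here is a minimal Python sketch in the style the post describes. It is an illustration, not Codex's actual code: the message item shapes and the `call_model` and `run_tool` helpers are hypothetical stand-ins for a Responses API request and a sandboxed tool executor.

```python
from typing import Any

# Hypothetical scripted model output, standing in for real inference results.
_scripted_outputs = iter([
    {"type": "tool_call", "id": "call_1", "name": "shell", "arguments": "ls"},
    {"type": "assistant_message", "content": "Listed the project files."},
])

def call_model(conversation: list[dict[str, Any]]) -> dict[str, Any]:
    """Stand-in for one inference request carrying the full conversation."""
    return next(_scripted_outputs)

def run_tool(call: dict[str, Any]) -> str:
    """Stand-in for executing a tool call inside the sandbox."""
    return f"(output of `{call['arguments']}`)"

def agent_turn(conversation: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Alternate inference and tool execution until a plain assistant message."""
    while True:
        output = call_model(conversation)
        conversation.append(output)
        if output["type"] != "tool_call":
            return conversation          # final assistant message ends the turn
        conversation.append({            # feed the tool result back to the model
            "type": "tool_result",
            "call_id": output["id"],
            "output": run_tool(output),
        })

history = [{"type": "user_message", "content": "List the files in this repo."}]
print([item["type"] for item in agent_turn(history)])
# -> ['user_message', 'tool_call', 'tool_result', 'assistant_message']
```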
While this high-level pattern is common across many AI agents, the depth of OpenAI's write-up stands out. Bolin explains how prompts are assembled item by item, how roles (system, developer, user, assistant) determine priority, and how small design choices, like the order of tools in the tool list, can have outsized effects on performance.
One of the most notable architectural decisions is Codex’s completely stateless interaction model. Instead of relying on server-side conversation memory via the optional previous_response_id parameter, Codex retransmits the entire conversation history with each request. This approach simplifies the infrastructure and enables Zero Data Retention (ZDR) for customers who need strict privacy guarantees.
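In practice, a stateless turn looks roughly like the sketch below, written against the OpenAI Python SDK. The `input`, `store`, and `previous_response_id` parameters are real Responses API features, but the model name and the surrounding bookkeeping are placeholders rather than Codex's own client code.

```python
from openai import OpenAI

client = OpenAI()

# The full transcript lives client-side: instructions, messages, tool results.
history = [
    {"role": "developer", "content": "You are a coding agent. Follow repo conventions."},
    {"role": "user", "content": "Rename util.py to utils.py and fix the imports."},
]

# Stateless style: resend the entire history every turn instead of pointing at
# server-side state with previous_response_id.
response = client.responses.create(
    model="gpt-5",      # placeholder model name
    input=history,      # full conversation, retransmitted on every request
    store=False,        # don't persist the response server-side (ZDR-friendly)
)

# Append the new output items so the next request is a strict superset.
history += [item.model_dump() for item in response.output]
```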
The downside is obvious: prompt sizes grow with each interaction, so the total data sent over a conversation increases quadratically. OpenAI mitigates this with prompt caching, which lets the model reuse prior computation as long as each new prompt is an exact prefix extension of the previous one. With caching in effect, inference cost scales roughly linearly instead of quadratically.
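The arithmetic behind that trade-off is easy to illustrate. The sketch below is a back-of-the-envelope model with an assumed growth of 500 tokens per turn, not measured Codex numbers; the prefix check mirrors the condition under which a provider-side prompt cache can help at all.

```python
def is_prefix_extension(prev_prompt: list[str], new_prompt: list[str]) -> bool:
    """Caching only helps if the new prompt starts with the previous one verbatim."""
    return new_prompt[: len(prev_prompt)] == prev_prompt

# Rough cost intuition, assuming each turn appends `delta` tokens to the prompt.
turns, delta = 100, 500
without_cache = sum(delta * t for t in range(1, turns + 1))  # every token reprocessed: O(n^2)
with_cache = delta * turns                                   # only new suffix tokens: O(n)
print(without_cache, with_cache)  # 2_525_000 vs 50_000
```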
This prefix requirement imposes strict discipline on the system, however. Changing tools mid-conversation, switching models, modifying sandbox permissions, or even reordering tool definitions can cause cache misses and severely degrade performance. Bolin points out that early support for Model Context Protocol (MCP) tools exposed exactly this kind of fragility, forcing the team to carefully redesign how dynamic tool updates are handled.
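One way to stay cache-friendly is to serialize tool definitions deterministically so the prompt prefix stays byte-identical from turn to turn. The helper and tool shapes below are hypothetical and are not how Codex actually renders its tool list; they only illustrate the idea.

```python
import json

def render_tool_block(tools: list[dict]) -> str:
    """Serialize tool definitions deterministically; any reordering or
    reformatting here would change the prompt prefix and break the cache."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))

tools_a = [{"name": "shell", "params": ["cmd"]}, {"name": "apply_patch", "params": ["patch"]}]
tools_b = list(reversed(tools_a))  # same tools, different arrival order (e.g. from an MCP server)
assert render_tool_block(tools_a) == render_tool_block(tools_b)
```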
Prompt growth also runs into another hard limit: the model's context window. Because input and output tokens both count against this limit, a long-running agent that makes hundreds of tool calls risks exhausting its usable context.
To address this, Codex automatically performs conversation compaction. When the token count exceeds a configurable threshold, Codex replaces the full conversation history with a condensed representation generated server-side via a dedicated Responses API endpoint. Importantly, this compacted context includes an encoded payload that preserves the model's latent understanding of past interactions, allowing it to continue reasoning coherently without access to the full original history.
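The trigger logic can be sketched as follows. Note that `count_tokens`, `compact`, and the threshold values are all hypothetical; in the real system the condensed representation comes back from the Responses API rather than from a local summarizer.

```python
CONTEXT_LIMIT = 200_000     # assumed context window size, for illustration only
COMPACT_THRESHOLD = 0.8     # compact once 80% of the window is used (assumed)

def count_tokens(history: list[dict]) -> int:
    """Hypothetical token estimator for the serialized conversation."""
    return sum(len(str(item)) // 4 for item in history)  # crude ~4 chars/token

def compact(history: list[dict]) -> list[dict]:
    """Hypothetical compaction: keep system/developer items and replace the
    rest with a single condensed summary item."""
    head = [item for item in history if item.get("role") in ("system", "developer")]
    summary = {"role": "assistant", "content": "[condensed summary of prior turns]"}
    return head + [summary]

def maybe_compact(history: list[dict]) -> list[dict]:
    """Swap in the compacted history once the usable context runs low."""
    if count_tokens(history) > CONTEXT_LIMIT * COMPACT_THRESHOLD:
        return compact(history)
    return history
```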
Previous versions of Codex required users to trigger compaction manually; today the process is automated and largely invisible, an important usability improvement as agents take on longer, more complex tasks.
OpenAI has historically been reluctant to publish deep technical details about leading products like ChatGPT. However, Codex is treated differently. The result is a rare, frank account of the trade-offs involved in building a real-world AI agent: performance versus privacy, resilience versus cache efficiency, and autonomy versus safety. Bolin doesn’t shy away from describing mistakes, shortcomings, or hard-learned lessons, reinforcing the message that today’s AI agents are powerful but far from magical.
Beyond Codex itself, the post is a blueprint for anyone building agents on top of modern LLM APIs. It highlights emerging best practices: stateless design, stable prompt prefixes, and explicit context management, all of which are quickly becoming industry standards.