The first time I used Claude Code to refactor an entire module, it felt almost mystical. I described what I wanted, went to grab a coffee, and when I came back, there was a pull request with 14 files changed, updated tests, and a decent commit message. “This is magic,” I thought.

It’s not magic. It’s a while loop.

Michael Bolin from OpenAI recently published an article dissecting the internals of Codex CLI. And it turns out that the secret behind AI coding agents isn’t a groundbreaking algorithm or an enigmatic neural network. It’s a loop that calls an LLM, executes tools, and repeats until there’s nothing left to do.

Let’s take it apart.

The State Machine: 5 Phases and a Loop

Every coding agent — Codex, Claude Code, Cursor, doesn’t matter — follows the same fundamental pattern. Michael Bolin describes it as a loop with 5 phases:

flowchart TD
    A["1. Prompt Assembly\n(build the prompt)"] --> B["2. Inference\n(send to LLM)"]
    B --> C{Tool call?}
    C -->|Yes| D["3. Tool Invocation\n(run tool)"]
    D --> E["4. Tool Response\n(return result to LLM)"]
    E --> B
    C -->|No| F["5. Assistant Message\n(final response)"]
    F -->|New input| A

    style A fill:#2d3748,stroke:#4a9eed,color:#fff
    style B fill:#2d3748,stroke:#4a9eed,color:#fff
    style C fill:#4a3728,stroke:#ed9a4a,color:#fff
    style D fill:#2d3748,stroke:#4a9eed,color:#fff
    style E fill:#2d3748,stroke:#4a9eed,color:#fff
    style F fill:#283d28,stroke:#4aed5c,color:#fff

In plain language:

  1. Prompt Assembly: a massive prompt is built with everything the agent needs to know — your message, system instructions, available tools, files it’s read, and the complete conversation history.
  2. Inference: that prompt is tokenized and sent to the model. The model returns a stream of events: internal reasoning, tool calls, or response text.
  3. Tool Invocation: if the model requests a tool (read a file, run a command, write some code), the tool is executed. If it fails, the error is sent back to the model.
  4. Tool Response Loop: the tool’s result is returned to the model as additional context. Steps 2-4 repeat until the model stops requesting tools.
  5. Assistant Message: when the model decides it’s done, it outputs a final message and the loop ends.

That’s it. No knowledge graphs, no symbolic planners, no sophisticated architectures. It’s a while loop with an LLM inside.

The difference between a good agent and a bad one isn’t in the loop’s architecture — which is identical — but in the details of each phase.
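To make the loop concrete, here's a minimal sketch in Python. This is hypothetical pseudocode, not the actual Codex or Claude Code source; `call_llm` and `run_tool` stand in for the provider's API and the tool executor:

```python
def agent_loop(system_prompt, tools, user_message, call_llm, run_tool):
    """Minimal sketch of the five-phase agent loop (hypothetical API)."""
    # Phase 1: assemble the prompt -- system instructions first, then the
    # user's message; history accumulates below.
    history = [{"role": "system", "content": system_prompt},
               {"role": "user", "content": user_message}]
    while True:
        # Phase 2: send the whole history to the model.
        reply = call_llm(history, tools)
        if reply.get("tool_call") is None:
            # Phase 5: no tool requested -> final assistant message, loop ends.
            return reply["content"]
        # Phase 3: execute the requested tool (a failure would also be
        # returned to the model as text).
        result = run_tool(reply["tool_call"])
        # Phase 4: feed the result back as context and repeat.
        history.append({"role": "assistant", "tool_call": reply["tool_call"]})
        history.append({"role": "tool", "content": result})
```

That's the whole skeleton: everything interesting lives inside `call_llm`, the tools, and how `history` is managed.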

Phase 1: The Art of Building a Prompt

The first phase is where everything happens. Before the LLM even sees a single line of your code, the agent has to construct a prompt that includes:

flowchart LR
    subgraph Prompt["Prompt Assembly"]
        direction TB
        SP["System Prompt\n(personality, rules)"]
        Tools["Available Tools\n(Read, Write, Bash, MCP...)"]
        Ctx["Files/images\npreviously read"]
        Inst["CLAUDE.md / AGENTS.md\n(repo instructions)"]
        Env["Environment info\n(OS, shell, git status)"]
        Hist["Conversation\nhistory"]
        User["User's message"]
    end

    SP --> Final["Complete\nPrompt"]
    Tools --> Final
    Ctx --> Final
    Inst --> Final
    Env --> Final
    Hist --> Final
    User --> Final

    style Final fill:#283d28,stroke:#4aed5c,color:#fff

You’ll immediately see a critical design decision here: order matters. The prompt is built from most stable to least stable. The system prompt comes first (it never changes), then the tools (rarely change), followed by files and the conversation history (which grows with each interaction), and finally, your latest input.

Why this order? For prompt caching. Since caching works via exact prefix matching, putting stable content first maximizes the number of tokens read from the cache at each iteration. Changing anything at the start invalidates everything that follows. I covered this in detail in my article on prompt caching, but the key takeaway is: your prompt’s order isn’t cosmetic; it’s economical.
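The stable-first ordering can be sketched in a few lines. The flat string layout below is a simplification (real agents use structured messages), but the prefix-caching consequence is the same:

```python
def assemble_prompt(system, tools, repo_instructions, env, history, user_msg):
    """Build the prompt stable-first so the cached prefix stays intact.

    Illustrative layout: real agents send structured messages, but the
    ordering principle (most stable content first) is identical.
    """
    stable = [system, tools, repo_instructions]   # changes almost never
    volatile = [env] + history + [user_msg]       # changes every turn
    return "\n\n".join(stable + volatile)

# Across turns only the tail changes, so the shared prefix -- the part
# the provider can serve from cache -- keeps growing instead of being
# invalidated.
turn1 = assemble_prompt("SYS", "TOOLS", "CLAUDE.md", "env", [], "hi")
turn2 = assemble_prompt("SYS", "TOOLS", "CLAUDE.md", "env",
                        ["hi", "reply"], "next")
```

Note that `turn2` starts with `turn1` verbatim: that exact-prefix property is what the cache exploits, and why editing the system prompt mid-session is so expensive.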

Then there are the CLAUDE.md and AGENTS.md files. These work like leaving a note for the plumber before leaving your house: “the water valve is under the sink, don’t touch the blue pipe.” The agent reads them at startup and injects them into every prompt. These are your way of providing vital context without repeating yourself every time.

The Quadratic Problem: Why Context Grows Like a Snowball

Here comes the reality check: each loop iteration sends the entire conversation history to the model. The server is stateless; each request arrives with no memory of the one before.

Why? So the provider can promise Zero Data Retention — your data doesn’t linger on their servers between requests. It’s a privacy decision, not an efficiency one.

But it comes at a brutal cost:

flowchart LR
    subgraph Msg1["Iteration 1"]
        S1["System\n10K tokens"] --> U1["User\n500 tokens"]
    end

    subgraph Msg5["Iteration 5"]
        S5["System\n10K tokens"] --> H5["History\n40K tokens"] --> U5["User\n500 tokens"]
    end

    subgraph Msg20["Iteration 20"]
        S20["System\n10K tokens"] --> H20["History\n180K tokens"] --> U20["User\n500 tokens"]
    end

    style Msg1 fill:#1a2332,stroke:#4a9eed,color:#fff
    style Msg5 fill:#2a2332,stroke:#9a4eed,color:#fff
    style Msg20 fill:#3a1a1a,stroke:#ed4a4a,color:#fff

In iteration 1, you send 10K tokens. By iteration 5, it's 50K. By iteration 20, it's 190K. Each message resends the entire backlog. And because the transformer's self-attention mechanism has a cost quadratic in token count, the data volume isn't the only thing growing: the compute per request grows quadratically with it.

In other words: iteration 20 doesn’t cost 20 times more than iteration 1. It costs much more.
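A quick back-of-the-envelope using the token counts from the diagram above makes the point. Treating attention cost as proportional to the square of the token count is itself an approximation (the linear layers don't scale that way), but it captures the trend:

```python
# Token counts from the diagram: system + history + user message.
tokens = {
    1:  10_000 + 0       + 500,   # iteration 1:  ~10.5K tokens
    5:  10_000 + 40_000  + 500,   # iteration 5:  ~50.5K tokens
    20: 10_000 + 180_000 + 500,   # iteration 20: ~190.5K tokens
}

# Self-attention compute scales roughly with the square of the token
# count, so relative cost grows much faster than the token count itself.
token_ratio = tokens[20] / tokens[1]        # ~18x more tokens...
cost_ratio = (tokens[20] / tokens[1]) ** 2  # ...~329x more attention compute
```

Eighteen times the tokens, over three hundred times the attention compute. That gap is why context management is an engineering discipline, not a nicety.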

Compaction: Compressing Without Losing the Essentials

Both Codex and Claude Code have a solution for runaway context growth: compaction (or automatic compression).

When the history nears the context window limit, the agent does something clever: it sends the entire history to a special endpoint that generates a compressed representation. Instead of 180K tokens of conversation, you might get 20K tokens that capture crucial decisions, modified files, and the current task state.

flowchart TD
    Full["Full History\n180K tokens"] --> Check{Near the limit?}
    Check -->|No| Continue["Continue normally"]
    Check -->|Yes| Compact["Compaction endpoint"]
    Compact --> Summary["Compressed Summary\n~20K tokens"]
    Summary --> NewCtx["New Context\n= System + Summary + Latest message"]
    NewCtx --> Continue2["Resume with fresh context"]

    style Full fill:#3a1a1a,stroke:#ed4a4a,color:#fff
    style Summary fill:#283d28,stroke:#4aed5c,color:#fff
    style Compact fill:#2d3748,stroke:#4a9eed,color:#fff

Pay attention: compaction isn’t free. You lose details. The model no longer has access to the exact diff you made in step 7; instead, there’s a summary like “the authentication module was refactored.” For most tasks, that’s fine. For surgical debugging, it might be an issue.

Codex calls it compaction. Claude Code does something similar with automatic context compression. The idea is identical: when context gets out of hand, compress the past and move forward with a lighter version.
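The compaction check itself is simple; the hard part is the summarization. Here's a sketch with a hypothetical 90% threshold and a `summarize` callback standing in for the provider's compaction endpoint:

```python
def maybe_compact(history, token_count, limit, summarize):
    """Sketch of the compaction check (hypothetical threshold and API).

    `summarize` stands in for the provider's compaction endpoint. Real
    agents keep the system prompt and the latest message verbatim and
    compress everything in between.
    """
    if token_count < int(limit * 0.9):  # not near the limit yet
        return history
    system, *middle, latest = history
    # Replace the bulky middle with a compressed summary.
    return [system, summarize(middle), latest]
```

Everything sacrificed to the summary is gone for good, which is exactly the trade-off described above: fine for most tasks, risky for surgical debugging.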

Sandbox: The Gilded Cage

Both agents execute tools in a sandbox — a restricted environment where network and file system access are limited by default.

This is essential. Without a sandbox, an rm -rf / hallucinated by the model could destroy your machine. With a sandbox, the worst-case scenario is breaking something within confined, controlled limits.

Claude Code asks for confirmation before potentially destructive operations (unless you've explicitly pre-approved them). Codex CLI's default mode requires similarly explicit permissions.
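The confirmation gate boils down to something like this. The allowlist is purely illustrative; real sandboxes also restrict the filesystem and network, not just the command name:

```python
import shlex

SAFE_COMMANDS = {"ls", "cat", "grep", "git"}  # illustrative allowlist

def gate(command, approved=()):
    """Sketch of a permission gate: allowlisted commands run directly,
    anything else requires explicit user approval first."""
    prog = shlex.split(command)[0]
    if prog in SAFE_COMMANDS or command in approved:
        return "run"
    return "ask-user"
```

A read-only git status sails through; an rm stops and asks, unless that exact command was pre-approved.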

The lesson here isn’t technical; it’s philosophical: an agent that can do anything is an agent you can’t trust. Restrictions aren’t limitations — they’re guarantees.

Codex CLI vs Claude Code: Non-Identical Twins

Now for the fun part. Both agents are the same loop inside, but the design decisions diverge in fascinating ways:

flowchart TB
    subgraph Codex["Codex CLI (OpenAI)"]
        direction TB
        CG["Desktop GUI\n(Command Center)"]
        CS["Generic Shell\n(bash/terminal)"]
        CA["Automations\n(native scheduling)"]
        CD["Diffs with\ninline comments"]
    end

    subgraph Claude["Claude Code (Anthropic)"]
        direction TB
        CC["CLI-first\n(native terminal)"]
        CT["Dedicated tools\n(Read, Edit, Grep, Glob)"]
        CK["Skills\n(/blog, /improve...)"]
        CF["Conversational\nFeedback"]
    end

    style Codex fill:#1a2332,stroke:#4a9eed,color:#fff
    style Claude fill:#2a1a32,stroke:#9a4eed,color:#fff

Tools: Generic vs Specialized

Codex gives the model access to a generic shell. Want to read a file? The model runs cat file.py. Want to search text? It runs grep -r "pattern" . from the repo root.

Claude Code takes the opposite approach: it provides dedicated tools for each operation. Read to read files, Edit to edit them (with exact string replacement, not full rewrites), Grep for searches, and Glob to find files by pattern.

Which is better? It depends. The generic shell is more flexible — anything you can do in a terminal, the model can do. But dedicated tools are more secure and efficient. An Edit that sends only the diff of a change is faster and less error-prone than cat > file.py << 'EOF', which overwrites the entire file.

In my experience: dedicated tools win for 90% of cases. The generic shell wins when you need to do something exotic no tools cover.
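To see why the dedicated approach is safer, here's a sketch of what an Edit-style tool does at its core: exact, unique string replacement. This is my own illustration of the idea, not Anthropic's implementation:

```python
def edit(text, old, new):
    """Sketch of an Edit-style tool: replace an exact, unique substring.

    Failing loudly when `old` is missing or ambiguous is the whole
    point: the model gets a clear error to correct itself with, instead
    of silently clobbering the file.
    """
    count = text.count(old)
    if count == 0:
        raise ValueError("old string not found in file")
    if count > 1:
        raise ValueError(f"old string appears {count} times; be more specific")
    return text.replace(old, new, 1)
```

Compare that with cat > file.py << 'EOF': one mistyped line in the heredoc and the whole file is gone, with no error to tell the model anything went wrong.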

GUI vs CLI

Codex bets on a desktop GUI (Command Center), where you see diffs like in a pull request, add inline comments, and view graphical interfaces of agent activities.

Claude Code is CLI-only. Your terminal. Your shell. No windows. If you want to review a change, the agent shows you text. If you want to give feedback, you just type it into the conversation.

Which do I prefer? CLI, hands down. Not because I’m a hacker purist, but because CLI integrates with everything: tmux, scripts, cron jobs, CI pipelines, SSH-based remote control. A GUI ties you to a specific screen. For interactive sessions, the GUI is more visual, sure. But for real work — long tasks, automations, independent agents — CLI wins every time.

Scheduling: Native vs DIY

Codex has Automations: you can schedule tasks to run automatically (react to GitHub events, trigger agents every morning, etc.). It’s native scheduling within the platform.

Claude Code has none of that. If you want an agent to run every 30 minutes, you use cron or a systemd timer. If you need it to react to a webhook, you build the integration yourself.

Here, Codex clearly wins for teams needing out-of-the-box automation. But Claude Code’s DIY approach has an unexpected advantage: you control the infrastructure. If Anthropic changes their API, your cron keeps working because it’s on your machine. If OpenAI experiments with Automations, you’re at their mercy.

What Really Matters

After dissecting the guts of both agents, the conclusion is almost disappointingly simple:

A coding agent is a loop that builds a prompt, invokes an LLM, executes tools, and repeats. That’s it.

The magic isn’t in the loop. It’s in three things:

  1. The model’s quality. A while loop with GPT-3 does nothing useful. With Claude Opus or GPT-4, it can refactor entire modules. The loop is the same — the brain inside the loop makes the difference.

  2. Context management. The prompt can’t grow forever. How you order information, when you compress, what you prioritize when compressing — that’s where the real engineering happens. An agent that loses key context when compressing makes mistakes no human would.

  3. Tool design. Giving an LLM unrestricted bash access is like handing car keys to someone who’s never driven. Well-designed tools (with validation, restrictions, and clear error feedback) are the difference between an agent that helps and one that goes off the rails and deletes your node_modules folder at 3 a.m.

The next time your coding agent pulls off something that seems like magic, just remember: it’s a while True with an LLM inside. Elegant? Yes. Powerful? Absolutely. But magic? Not really.


Sources: The main article is “What Actually Happens Inside an AI Coding Agent (We Unrolled It)” by Michael Bolin (OpenAI). The comparison with Claude Code comes from firsthand experience and the official documentation from Anthropic. If you’re interested in context and caching, check out Why 99% of What You Send to Claude Is Already Cached and Your LLM’s Cache Charges You Twice to Save You Money.