---
title: "Context Engineering: The Invisible Skill that Separates Great AI Agents from Mediocre Ones"
date: 2026-03-11T22:00:00+01:00
draft: false
slug: "context-engineering-invisible-skill-ai-agents"
slug_en: "context-engineering-invisible-skill-ai-agents"
description: "Prompt engineering is writing a good prompt. Context engineering is designing EVERYTHING the model sees: what goes in, in what order, what gets excluded, and what gets compressed. And that’s what truly matters."
tags: ["llm", "agents", "context engineering", "openai", "claude code", "memory"]
categories: ["opinion"]

translation:
  hash: ""
  last_translated: ""
  notes: |
    - "dicho en cristiano": "in plain language". No religious connotation.
    - "ojo al dato": colloquial for "pay attention to this" / "here's the key point".
    - "chapuza": "hack/bodge/kludge". Quick-and-dirty solution, not derogatory.
    - "morro que te pisas": colloquial for "incredible nerve/audacity". Not offensive, humorous.
    - "te la juegas": "you're taking a risk" / "you're gambling".
    - "currar": colloquial for "to work". Common in Spain.
    - "barra del bar": "bar counter" — casual conversation metaphor, common in Spanish tech blogs.
---

Imagine you hire a brilliant consultant. They have two PhDs, speak seven languages, and solve problems you didn’t even know existed. You sit them down in a room and say, “I need you to refactor the authentication system for this project.”

The consultant nods, looks at you, and asks, “Which project?”

You haven’t given them access to the code. You haven’t explained the architecture. They don’t know if you’re using JWT tokens or session cookies. They don’t know what language you’re using, how many microservices there are, or why the last migration attempt ended in disaster.

That consultant is your LLM. And you’ve just made the same mistake 90% of people working with AI agents make: **caring more about the brain than what the brain sees**.

## Prompt Engineering Is Dead. Long Live Context Engineering.

For months now, I’ve been watching the same conversation unfold everywhere: forums, Twitter threads, team meetings. "GPT-5 or Claude Opus?" "Which model is better for coding?" "Which one reasons better?"

And every time I run the numbers, the answer is the same: **it doesn’t matter.** Well, it doesn’t *exactly* not matter. But the difference between one top-tier model and another is tiny compared to the difference between giving it good context or garbage.

A mediocre model with perfect context beats a top-tier model with garbage context. Every single time. No exceptions.

This has a name: **context engineering.** And no, it’s not the same as *prompt engineering.*

*Prompt engineering* is writing a good prompt: choosing the right words, structuring the request, adding examples. It’s important, but it’s just one piece of the puzzle.

*Context engineering* is designing **everything** the model sees: what goes in, in what order, what gets excluded when there’s no room, what gets compressed, what absolutely must stay. It’s information architecture for LLMs.

In plain language: *prompt engineering* is writing a good question. *Context engineering* is deciding which books the student has on their desk before taking the test.

## The Four Phases of Memory: A Lifecycle You Don’t See

OpenAI recently published two Cookbook articles breaking down how context management works in agents with long-term memory. It’s not RAG. It’s not a vector database. It’s a state-based system that operates like a field notebook with strict rules.

The pattern is *local-first* and *state-based*: a structured state object that travels with the agent and updates at every phase.

```mermaid
flowchart TD
    A["1. INJECTION\n(session start)"] --> B["2. DISTILLATION\n(during conversation)"]
    B --> C["3. CONSOLIDATION\n(post-session)"]
    C --> D["4. TRIMMING\n(preservation)"]
    D -->|"New session"| A

    A1["Render state as YAML\n+ global memories (max 6)\n+ precedence rules"] -.-> A
    B1["save_memory_note()\nValidate durability\nMandate actionability\nReject PII and speculation"] -.-> B
    C1["Async job\nMerge session → global\nLLM deduplication\nFilter ephemeral notes"] -.-> C
    D1["TrimmingSession: last N\nReinject trimmed notes\nin system prompt"] -.-> D

    style A fill:#2d3748,stroke:#4a9eed,color:#fff
    style B fill:#2d3748,stroke:#ed9a4a,color:#fff
    style C fill:#2d3748,stroke:#9a4eed,color:#fff
    style D fill:#2d3748,stroke:#4aed5c,color:#fff

```

### Phase 1: Injection — The Test Desk

When a session starts, the agent assembles its initial context. This is not random. It’s a well-defined structure:

- **YAML frontmatter** with the user’s state (preferences, configuration).
- **Global memory list:** up to 6 items, sorted by recency. Why 6? Because more than 6 start competing for the model’s attention and become diluted. Less is more.
- A **`<memory_policy>` block** with explicit precedence rules.
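The assembled prompt might look something like this sketch. It’s a rough Python approximation: `render_injection`, the field names, and the exact layout are my own placeholders, not the Cookbook’s, but the ingredients match (YAML-ish state, at most 6 global memories by recency, a `<memory_policy>` block):

```python
# Hypothetical sketch of the injection phase. Layout and names are
# illustrative assumptions, not the OpenAI Cookbook's actual code.

def render_injection(state: dict, global_memories: list[dict]) -> str:
    # Keep only the 6 most recent global memories.
    recent = sorted(global_memories, key=lambda m: m["ts"], reverse=True)[:6]
    lines = ["---"]
    lines += [f"{k}: {v}" for k, v in state.items()]  # YAML-style user state
    lines += ["---", "", "Global memories (most recent first):"]
    lines += [f"- {m['text']}" for m in recent]
    lines += [
        "",
        "<memory_policy>",
        "Current input > session memory > global memory > recency within scope.",
        "</memory_policy>",
    ]
    return "\n".join(lines)

prompt = render_injection(
    {"editor": "Vim", "language": "Python"},
    [{"text": f"fact {i}", "ts": i} for i in range(8)],  # 8 notes -> only 6 kept
)
print(prompt.count("- fact"))  # 6
```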

Precedence rules are critical: **Current Input > Session Memory > Global Memory > Recency within the same scope**. If the user tells you, “I’m using Vim now,” but your global memory says, “Uses VS Code,” the recent input wins. It seems obvious, but without explicit rules, the model sometimes clings to what it “remembers” rather than what it’s being told.
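Those precedence rules reduce to a tiny resolver. This is an illustrative sketch (the function and the tuple layout are mine, not OpenAI’s): scope wins first, and recency only breaks ties within the same scope.

```python
# Hypothetical precedence resolver. Lower number = higher precedence.
SCOPE_PRECEDENCE = {"input": 0, "session": 1, "global": 2}

def resolve(facts):
    """Given conflicting facts about the same key, keep the winner.

    Each fact is (key, value, scope, timestamp). Scope wins first;
    recency breaks ties within the same scope.
    """
    winners = {}
    for key, value, scope, ts in facts:
        current = winners.get(key)
        candidate = (SCOPE_PRECEDENCE[scope], -ts)  # smaller tuple wins
        if current is None or candidate < current[0]:
            winners[key] = (candidate, value)
    return {k: v for k, (_, v) in winners.items()}

facts = [
    ("editor", "VS Code", "global", 100),  # stale global memory
    ("editor", "Vim", "input", 200),       # what the user just said
]
print(resolve(facts))  # {'editor': 'Vim'}
```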

### Phase 2: Distillation — Capturing Without Contaminating

During the conversation, the agent can capture memories in real time using a tool like `save_memory_note()`. But not everything is fair game. The tool comes with strict guardrails:

- **Validate durability:** “The user wants pizza tonight” isn’t a durable memory. It gets rejected.
- **Mandate actionability:** the memory must serve a purpose in future sessions.
- **Reject PII:** full names, addresses, credit card numbers. Out.
- **Reject speculation:** “I think the user prefers Python” isn’t a fact. It’s a guess.
- **Require user confirmation:** ask before saving.

This filtering is ruthless, and for good reason. One contaminated memory can poison all future sessions. It’s like having a false note in your notebook: every time you refer to it, you make decisions based on incorrect information.
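A minimal version of those guardrails might look like the following. The keyword lists and regex are crude stand-ins for what a real implementation (or the model itself) would check, and `save_memory_note` here is a hypothetical local function:

```python
import re

# Illustrative guardrails for a save_memory_note() tool. The heuristics
# below are placeholder assumptions, not the Cookbook's actual checks.
EPHEMERAL_MARKERS = ("tonight", "today", "right now", "this time")
SPECULATION_MARKERS = ("i think", "probably", "maybe", "seems like")
PII_PATTERNS = [re.compile(r"\b\d{13,16}\b")]  # crude card-number check

def save_memory_note(note: str) -> bool:
    """Return True only if the note passes every guardrail."""
    lowered = note.lower()
    if any(m in lowered for m in EPHEMERAL_MARKERS):
        return False  # not durable
    if any(m in lowered for m in SPECULATION_MARKERS):
        return False  # a guess, not a fact
    if any(p.search(note) for p in PII_PATTERNS):
        return False  # reject PII
    return True  # a real agent would also ask the user to confirm

save_memory_note("The user wants pizza tonight")     # rejected: ephemeral
save_memory_note("Prefers PostgreSQL over MongoDB")  # accepted
```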

### Phase 3: Consolidation — The Nightly Cleanup

After every session, an asynchronous job collects session notes and merges them with global memory. It’s not an append. It’s intelligent consolidation:

- **LLM-assisted deduplication:** if two notes say the same thing in different words, they get merged.
- **Ephemeral-note filtering:** anything with “this time,” “today,” or “right now” gets discarded.
- **Recency-based conflict resolution:** if a new note contradicts an old one, the new one wins.

Think of this as someone cleaning your desk at the end of the day. They don’t throw everything away—they keep the essentials, consolidate post-its that say the same thing, and toss what’s no longer relevant.
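In code, the consolidation pass could be sketched like this. Keyword matching stands in for the LLM-assisted deduplication, and every name is an illustrative assumption:

```python
# Illustrative consolidation pass. In the Cookbook this merging and dedup
# step is LLM-assisted; simple keyword rules stand in for it here.
EPHEMERAL_MARKERS = ("this time", "today", "right now")

def consolidate(global_memory: list[dict], session_notes: list[dict]) -> list[dict]:
    """Merge session notes into global memory: drop ephemeral notes,
    let newer notes overwrite older ones about the same topic."""
    merged = {m["topic"]: m for m in global_memory}
    for note in session_notes:
        if any(m in note["text"].lower() for m in EPHEMERAL_MARKERS):
            continue  # filter ephemeral notes
        old = merged.get(note["topic"])
        if old is None or note["ts"] >= old["ts"]:
            merged[note["topic"]] = note  # recency wins on conflict
    return list(merged.values())

globals_ = [{"topic": "db", "text": "Uses MongoDB", "ts": 1}]
session = [
    {"topic": "db", "text": "Decided on PostgreSQL", "ts": 2},
    {"topic": "mood", "text": "Stressed today", "ts": 2},
]
print(consolidate(globals_, session))
# Only the PostgreSQL note survives; the "today" note is dropped.
```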

### Phase 4: Trimming — Cutting Without Losing

When history gets too long, you need to trim. `TrimmingSession` preserves only the last N turns. But — pay attention — memory notes that lived in those trimmed turns aren’t lost. They’re re-injected into the system prompt for the next turn.

It’s like tearing out old pages of a notebook but copying the important notes onto the first page before throwing them away.
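A stripped-down version of that idea (the Cookbook’s `TrimmingSession` does more; this sketch only shows the keep-last-N plus note re-injection trick, with invented field names):

```python
# Minimal trimming sketch: keep the last N turns, but copy memory notes
# out of the trimmed turns into the system prompt before discarding them.

def trim(history: list[dict], keep_last: int) -> tuple[str, list[dict]]:
    """Return (extra_system_prompt, trimmed_history)."""
    if len(history) <= keep_last:
        return "", history
    trimmed, kept = history[:-keep_last], history[-keep_last:]
    # Re-inject notes that lived in trimmed turns so they aren't lost.
    notes = [t["note"] for t in trimmed if t.get("note")]
    extra = "\n".join(f"[memory] {n}" for n in notes)
    return extra, kept

history = [
    {"role": "user", "text": "Use PostgreSQL", "note": "Prefers PostgreSQL"},
    {"role": "assistant", "text": "Done"},
    {"role": "user", "text": "Now add auth"},
]
extra, kept = trim(history, keep_last=2)
print(extra)      # [memory] Prefers PostgreSQL
print(len(kept))  # 2
```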

## Trimming vs. Summarization: Two Philosophies, One Dilemma

To handle short-term memory (conversation history within a session), there are two primary techniques. Each has its advantages and pitfalls.

```mermaid
flowchart LR
    subgraph Trimming["Trimming (Last-N Turns)"]
        direction TB
        T1["Full history\n(40 turns)"]
        T2["Trim turns 1-30"]
        T3["Retain turns 31-40\n(intact, unaltered)"]
        T1 --> T2 --> T3
    end

    subgraph Summarization["Summarization (Compression)"]
        direction TB
        S1["Full history\n(40 turns)"]
        S2["LLM summarizes turns 1-30\nin ~400 tokens"]
        S3["Inject synthetic summary\n+ turns 31-40"]
        S1 --> S2 --> S3
    end

    style Trimming fill:#1a2332,stroke:#4a9eed,color:#fff
    style Summarization fill:#2a1a32,stroke:#9a4eed,color:#fff
```

### Trimming: The Deterministic Guillotine

Scans history backward, keeps the last N complete interactions, and deletes everything before.

**Advantage:** total fidelity to recent context. What remains is unaltered, neither summarized nor interpreted.

**Disadvantage:** abrupt memory loss. The oldest turn inside the window exists in full detail; the turn just before it doesn’t exist at all. There’s no gradual fade — just a binary split between “remember everything” and “remember nothing.”

It’s like the memory of a goldfish with an external hard drive: the last 10 seconds are perfect; everything before that simply doesn’t exist.

### Summarization: Compression with Risk

When the history exceeds a threshold, an LLM compresses the old data and injects it as a synthetic user/assistant pair at the start of the conversation. The summarization prompt follows strict principles:

- Preserve milestones (decisions made, agreements reached).
- Maintain temporal order.
- Flag contradictions.
- Mark uncertain facts as “UNVERIFIED.”
- Limit each summary to ~400 tokens.

**Advantage:** the essence of the entire conversation is preserved, with no abrupt memory loss. The model “knows” that 30 turns ago you decided to use PostgreSQL instead of MongoDB, even if it no longer has the original messages.

**Disadvantage:** compounding errors. If a mistake makes it into a summary, it contaminates every turn that follows. And since the summary is generated by an LLM, it’s not immune to hallucinations. One bad summarization pass produces an incorrect summary that subsequent turns treat as absolute truth.

The nerve is outrageous: you’re using one LLM to summarize the history of another LLM, and if the first one messes up, the second inherits the error unknowingly.

To distinguish real from synthetic, every record includes observability metadata: `{"synthetic": bool, "kind": "...", "summary_for_turns": "..."}`. At least this way, you can audit which parts of the context are original and which ones are potentially flawed summaries.
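Here’s one way that metadata could be attached, with a placeholder `summarize()` standing in for the actual LLM call (everything else in this sketch is an invented approximation):

```python
# Sketch of tagging a synthetic summary so it can be audited later.

def summarize(turns: list[dict]) -> str:
    # Placeholder: in a real system this is an LLM call using the strict
    # summarization prompt (milestones, temporal order, UNVERIFIED flags,
    # ~400-token limit). Here we just join the texts.
    return " | ".join(t["text"] for t in turns)

def compress(history: list[dict], keep_last: int) -> list[dict]:
    if len(history) <= keep_last:
        return history
    old, recent = history[:-keep_last], history[-keep_last:]
    summary = {
        "role": "user",
        "text": f"Summary of earlier turns: {summarize(old)}",
        "synthetic": True,  # real vs. synthetic flag, for auditing
        "kind": "summary",
        "summary_for_turns": f"1-{len(old)}",
    }
    return [summary] + recent

history = [{"role": "user", "text": f"turn {i}", "synthetic": False} for i in range(1, 6)]
compressed = compress(history, keep_last=2)
print([m.get("synthetic") for m in compressed])  # [True, False, False]
```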

## You’re Already Doing This (But You Didn’t Know It)

If you’re using Claude Code, you already have a context engineering system in place. You just didn’t design it — Anthropic did. But if you take a closer look, the pieces line up:

Your global `CLAUDE.md`, project-level `CLAUDE.md`, and `SKILL.md` files = **manual injection**. You’re deciding what context the model gets at the start of each session. You’re the one choosing which “books go on the desk.”

The `~/.claude/projects/*/memory/` directory where Claude Code stores notes between sessions = a direct implementation of the **injection + distillation** pattern. The model captures facts during a session and retrieves them in the next.

The automatic context compression Claude Code applies to lengthy conversations = **trimming + summarization**. You don’t see it because it’s transparent, but every time your session exceeds a certain threshold, part of the conversation gets compressed.

The skills (`/blog`, `/commit`, etc.) = **on-demand injection** of specialized context. Instead of loading all possible context at the start, you load only what’s needed, when it’s needed.

Here’s what I find most interesting: the quality of your `CLAUDE.md` determines the quality of your agent far more than the model you’re using. A well-structured `CLAUDE.md` — with clear conventions, correct paths, and well-documented architectural decisions — can turn any decent model into a highly effective assistant. An empty or disorganized `CLAUDE.md` turns the best model in the world into a brilliant consultant locked in a dark room.

## Prompt Debt: The Technical Debt You Can’t See

You’ve heard of technical debt, right? Code that works but accumulates future problems. Shortcuts you’ll pay for later.

Context engineering has its own debt: **prompt debt**. It’s all those config files, instructions, memories, and notes piling up with nobody maintaining them.

A `CLAUDE.md` with contradictory instructions. Global memories that no longer apply. Skills with paths that changed three months ago. Implicit precedence rules no one documented.

Every piece of obsolete context is noise. And noise competes with signal for the model’s attention. More noise → worse results. Not because the model is worse, but because you’re feeding it garbage mixed with useful information and expecting it to figure it out.

Keeping your context engineering layer clean is just as important as clean code. Maybe even more so, because a bug in the code fails loudly. A bug in the context fails silently — the model just starts making worse decisions, and nobody notices.

## Actionable Tips: What You Can Do Today

All this theory is nice, but what do you do with it on a random Tuesday morning?

1. **Audit your `CLAUDE.md` (or equivalent).** Does it have contradictory instructions? Paths that no longer exist? Rules that no longer apply? Clean it up. Every extra line is noise.

2. **Organize your context by stability.** Put what never changes first (conventions, stack). Put what changes frequently last (the current task). This maximizes cache hits and reduces costs. It’s not just cosmetic — it’s economical.

3. **Define explicit precedence rules.** If the user says one thing and memory says another, who wins? If you don’t define this, the model decides for you. And then you’re gambling.

4. **Filter aggressively.** Not everything deserves to be remembered. An architecture decision? Yes. That the user prefers tabs over spaces? Maybe. That it was raining when the session started? No.

5. **Separate real from synthetic context.** If you use summarization, tag summaries as such. When something fails, you need to know whether the model was working with real data or a potentially flawed summary.

6. **Treat context maintenance as technical debt.** Put it in the backlog. Review it periodically. It’s not glamorous, but it’s what separates an agent that works from one that hallucinates.
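Tip 2 can be sketched in code: a hypothetical context builder that layers stable content before volatile content. The layering is the point; the section names and function are invented, since prompt caches typically match on exact prefixes.

```python
# Illustrative "stable-first" context assembly. Section names are invented;
# the idea is that the unchanging prefix maximizes prompt-cache hits.

def build_context(conventions: str, memories: list[str], current_task: str) -> str:
    parts = [
        "# Conventions (rarely changes)",
        conventions,
        "# Memories (changes per session)",
        *[f"- {m}" for m in memories[:6]],  # cap at 6, as in the injection phase
        "# Current task (changes every turn)",
        current_task,
    ]
    return "\n".join(parts)

ctx = build_context(
    "Python 3.12, FastAPI, tests with pytest",
    ["Prefers PostgreSQL"],
    "Refactor the auth module",
)
print(ctx.splitlines()[0])  # # Conventions (rarely changes)
```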

## The Skill Nobody Lists on Their Résumé

Context engineering is the invisible skill. It doesn’t show up in job listings. There’s no certification for it. There’s no 40-hour Udemy course with a certificate at the end.

But it’s what separates people who “use ChatGPT” from people who build agents that work. It’s the difference between asking a question to an LLM and designing a system where the LLM has everything it needs to give the right answer.

The next time your agent does something dumb, before blaming the model, take a look at what context you were giving it. Chances are, the problem isn’t the brain — it’s what the brain was seeing.

And unlike the model, that’s something you can control.


**Sources:** the two OpenAI Cookbook articles on *Context Engineering for Long-Term Personalization* and *Short-Term Memory Management with Sessions*. If you’re interested in how the internal loop of a coding agent works, read *Your AI Coding Agent is Just a While Loop with Delusions of Grandeur*. And if you want to understand why prompt order impacts cost, check out *Why 99% of What You Send to Claude is Already Cached*.