Phase 1: Injection — The Test Desk
When a session starts, the agent assembles its initial context. This is not random. It’s a well-defined structure:
- YAML frontmatter with the user’s state (preferences, configuration).
- Global memory list: up to 6 items, sorted by recency. Why 6? Beyond that, entries start competing for the model’s attention and dilute one another. Less is more.
- A `<memory_policy>` block with explicit precedence rules.
Precedence rules are critical: Current Input > Session Memory > Global Memory > Recency within the same scope. If the user tells you, “I’m using Vim now,” but your global memory says, “Uses VS Code,” the recent input wins. Seems obvious, but without explicit rules, the model sometimes clings to what it “remembers” rather than what it’s being told.
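The precedence chain above can be sketched as a tiny resolver. This is purely illustrative: the `MemoryItem` structure, scope names, and tie-breaking fields are my assumptions, not the actual format of any particular framework.

```python
from dataclasses import dataclass

# Precedence, highest first: current input > session memory > global memory.
# Ties within the same scope are broken by recency (newer timestamp wins).
SCOPE_RANK = {"current_input": 0, "session": 1, "global": 2}

@dataclass
class MemoryItem:
    key: str        # e.g. "editor"
    value: str      # e.g. "Vim"
    scope: str      # "current_input" | "session" | "global"
    timestamp: int  # unix seconds, only used to break ties within a scope

def resolve(items: list[MemoryItem]) -> dict[str, str]:
    """Return one winning value per key according to the precedence rules."""
    winners: dict[str, MemoryItem] = {}
    for item in items:
        best = winners.get(item.key)
        if best is None:
            winners[item.key] = item
            continue
        # Lower scope rank wins; within the same scope, the most recent wins.
        if (SCOPE_RANK[item.scope], -item.timestamp) < (SCOPE_RANK[best.scope], -best.timestamp):
            winners[item.key] = item
    return {k: v.value for k, v in winners.items()}

# "I'm using Vim now" (current input) beats the stale global memory "Uses VS Code".
facts = resolve([
    MemoryItem("editor", "VS Code", "global", 1_700_000_000),
    MemoryItem("editor", "Vim", "current_input", 1_700_100_000),
])
```

Making the rule executable is the point: once precedence lives in code (or in an explicit `<memory_policy>` block) instead of the model’s whims, conflicts resolve the same way every time.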
Phase 2: Distillation — Capturing Without Contaminating
During the conversation, the agent can capture memories in real time using a tool like save_memory_note(). But not everything is fair game. The tool comes with strict guardrails:
- Validate durability: “The user wants pizza tonight” isn’t a durable memory. It gets rejected.
- Mandate actionability: the memory must serve a purpose in future sessions.
- Reject PII: full names, addresses, credit card numbers. Out.
- Reject speculation: “I think the user prefers Python” isn’t a fact. It’s a guess.
- Require user confirmation: ask before saving.
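A sketch of what those guardrails might look like in code. Everything here is hypothetical: the keyword lists and regexes are crude stand-ins (real PII detection needs far more than a regex), and actionability, which is hard to check mechanically, is left to the upstream LLM.

```python
import re

EPHEMERAL = ("tonight", "today", "right now", "this time")
SPECULATIVE = ("i think", "probably", "maybe", "seems like")
# Crude PII patterns for illustration only; real systems need proper detectors.
PII_PATTERNS = [
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),                        # card-number-like digit runs
    re.compile(r"\b\d{1,5}\s+\w+\s+(street|ave|road)\b", re.I),   # address-like strings
]

def save_memory_note(note: str, confirm) -> bool:
    """Persist a memory note only if it survives every guardrail."""
    lowered = note.lower()
    if any(w in lowered for w in EPHEMERAL):
        return False  # not durable: "wants pizza tonight" gets rejected
    if any(w in lowered for w in SPECULATIVE):
        return False  # speculation, not fact
    if any(p.search(note) for p in PII_PATTERNS):
        return False  # PII never enters long-term memory
    if not confirm(note):
        return False  # user said no
    # (actionability is judged by the LLM upstream and isn't checked here)
    # ... write to the memory store ...
    return True

save_memory_note("The user wants pizza tonight", confirm=lambda n: True)        # rejected
save_memory_note("Prefers PostgreSQL for new projects", confirm=lambda n: True)  # accepted
```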
This filtering is ruthless, and for good reason. One contaminated memory can poison all future sessions. It’s like having a false note in your notebook: every time you refer to it, you make decisions based on incorrect information.
Phase 3: Consolidation — The Nightly Cleanup
After every session, an asynchronous job collects session notes and merges them with global memory. It’s not an append. It’s intelligent consolidation:
- LLM-assisted deduplication: if two notes say the same thing in different words, they get merged.
- Filtering ephemeral notes: anything with “this time,” “today,” or “right now” gets discarded.
- Recency-based conflict resolution: if a new note contradicts an old one, the new one wins.
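The three consolidation steps can be sketched as a single pass. Note the hedge: `llm_is_duplicate` stands in for an actual LLM call judging semantic equivalence; here it is faked with a trivial normalized comparison so the mechanics are visible.

```python
EPHEMERAL_MARKERS = ("this time", "today", "right now")

def llm_is_duplicate(a: str, b: str) -> bool:
    """Stub for an LLM call that judges whether two notes say the same thing.
    Faked here with a trivial normalized comparison."""
    return a.lower().strip(". ") == b.lower().strip(". ")

def consolidate(global_memory: list[str], session_notes: list[str]) -> list[str]:
    """Merge session notes into global memory: drop ephemera, dedupe, newest wins."""
    merged = list(global_memory)
    for note in session_notes:
        # 1. Filter ephemeral notes outright.
        if any(m in note.lower() for m in EPHEMERAL_MARKERS):
            continue
        # 2. Dedup / conflict resolution: remove any older note that says the
        #    same thing, so the newer wording (and its recency) wins.
        merged = [old for old in merged if not llm_is_duplicate(old, note)]
        merged.append(note)
    return merged

memory = consolidate(
    global_memory=["Prefers Python."],
    session_notes=["prefers python", "Wants extra logging today"],
)
# The ephemeral note is dropped; the duplicate collapses into one entry.
```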
Think of this as someone cleaning your desk at the end of the day. They don’t throw everything away—they keep the essentials, consolidate post-its that say the same thing, and toss what’s no longer relevant.
Phase 4: Trimming — Cutting Without Losing
When history gets too long, you need to trim. TrimmingSession preserves only the last N turns. But — pay attention — memory notes that lived in those trimmed turns aren’t lost. They’re re-injected into the system prompt for the next turn.
It’s like tearing out old pages of a notebook but copying the important notes onto the first page before throwing them away.
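A minimal sketch of that trim-and-rescue step. The turn structure (a dict with a `memory_notes` list) is an assumption made for illustration, not the actual `TrimmingSession` data model.

```python
def trim_session(turns: list[dict], keep_last: int) -> tuple[list[dict], list[str]]:
    """Keep the last N turns; harvest memory notes from the trimmed ones
    so they can be re-injected into the next system prompt."""
    trimmed, kept = turns[:-keep_last], turns[-keep_last:]
    rescued_notes = [note for turn in trimmed for note in turn.get("memory_notes", [])]
    return kept, rescued_notes

turns = [
    {"text": "turn 1", "memory_notes": ["Decided on PostgreSQL"]},
    {"text": "turn 2", "memory_notes": []},
    {"text": "turn 3", "memory_notes": []},
]
kept, rescued = trim_session(turns, keep_last=2)
# kept holds turns 2-3 intact; "Decided on PostgreSQL" survives the trim
# and gets prepended to the next turn's system prompt.
```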
Trimming vs. Summarization: Two Philosophies, One Dilemma
To handle short-term memory (conversation history within a session), there are two primary techniques. Each has its advantages and pitfalls.
```mermaid
flowchart LR
    subgraph Trimming["Trimming (Last-N Turns)"]
        direction TB
        T1["Full history\n(40 turns)"]
        T2["Trim turns 1-30"]
        T3["Retain turns 31-40\n(intact, unaltered)"]
        T1 --> T2 --> T3
    end
    subgraph Summarization["Summarization (Compression)"]
        direction TB
        S1["Full history\n(40 turns)"]
        S2["LLM summarizes turns 1-30\nin ~400 tokens"]
        S3["Inject synthetic summary\n+ turns 31-40"]
        S1 --> S2 --> S3
    end
    style Trimming fill:#1a2332,stroke:#4a9eed,color:#fff
    style Summarization fill:#2a1a32,stroke:#9a4eed,color:#fff
```
Trimming: The Deterministic Guillotine
Scans history backward, keeps the last N complete interactions, and deletes everything before.
Advantage: total fidelity to recent context. What remains is unaltered, not summarized or interpreted.
Disadvantage: abrupt memory loss. Turn N-1 exists in detail. Turn N-2 doesn’t exist at all. There’s no gradual fade—just a binary split between “remember everything” and “remember nothing.”
It’s like the memory of a goldfish with an external hard drive: the last 10 seconds are perfect; everything before that simply doesn’t exist.
Summarization: Compression with Risk
When the history exceeds a threshold, an LLM compresses the old data and injects it as a synthetic user/assistant pair at the start of the conversation. The summarization prompt follows strict principles:
- Preserve milestones (decisions made, agreements).
- Maintain temporal order.
- Flag contradictions.
- Mark uncertain facts as “UNVERIFIED.”
- Limit to 400 tokens per summary.
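Those principles can be wired into the summarization step roughly like this. The prompt wording and the `compress_history` helper are my own sketch (roles in the tail are simplified), not the cookbook’s implementation; `summarize` stands in for a real LLM call.

```python
SUMMARY_PROMPT = """Summarize the conversation below in at most 400 tokens.
Rules:
- Preserve milestones: decisions made and agreements reached.
- Keep events in temporal order.
- Flag contradictions explicitly.
- Prefix uncertain facts with UNVERIFIED.
Conversation:
{history}"""

def compress_history(turns: list[str], keep_last: int, summarize) -> list[dict]:
    """Replace old turns with a synthetic user/assistant pair holding the summary."""
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarize(SUMMARY_PROMPT.format(history="\n".join(old)))
    synthetic_pair = [
        {"role": "user", "content": "Summarize our conversation so far.", "synthetic": True},
        {"role": "assistant", "content": summary, "synthetic": True},
    ]
    # Roles on the retained tail are simplified for the sketch.
    return synthetic_pair + [{"role": "user", "content": t, "synthetic": False} for t in recent]

# `summarize` would call an LLM; a stub shows the mechanics.
ctx = compress_history(
    ["t1", "t2", "t3", "t4"], keep_last=2,
    summarize=lambda p: "Chose PostgreSQL over MongoDB.",
)
```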
Advantage: the essence of the entire conversation is preserved. No abrupt memory loss. The model “knows” that 30 turns ago, you decided to use PostgreSQL instead of MongoDB, even if it no longer has the original messages.
Disadvantage: compounding errors. If a mistake makes it into the summary, it contaminates every future turn. And since the summary is generated by an LLM, it isn’t immune to hallucinations: a flawed summarization pass produces incorrect summaries that subsequent turns treat as absolute truth.
The irony is hard to miss: you’re using one LLM to summarize the history of another LLM, and if the first one slips up, the second inherits the error without ever knowing it.
To distinguish real from synthetic, every record includes observability metadata: {"synthetic": bool, "kind": "...", "summary_for_turns": "..."}. At least this way, you can audit which parts of the context are original and which ones are potentially flawed summaries.
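Auditing that metadata is straightforward. The field names below come from the snippet above; the `audit_context` helper itself is a hypothetical convenience.

```python
def audit_context(records: list[dict]) -> dict:
    """Split context records into original vs synthetic using the observability
    metadata, so a failure can be traced back to a possibly-flawed summary."""
    synthetic = [r for r in records if r.get("synthetic")]
    original = [r for r in records if not r.get("synthetic")]
    return {"original": original, "synthetic": synthetic}

report = audit_context([
    {"content": "Chose PostgreSQL.", "synthetic": True,
     "kind": "summary", "summary_for_turns": "1-30"},
    {"content": "How do I add an index?", "synthetic": False},
])
# report["synthetic"] pinpoints exactly which context was machine-compressed.
```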
You’re Already Doing This (But You Didn’t Know It)
If you’re using Claude Code, you already have a context engineering system in place. You just didn’t design it — Anthropic did. But if you take a closer look, the pieces line up:
Your global CLAUDE.md + project-level CLAUDE.md files and SKILL.md files = manual injection. You’re deciding what context the model gets at the start of each session. You’re the one choosing which “books go on the desk.”
The ~/.claude/projects/*/memory/ directory where Claude Code stores notes between sessions = a direct implementation of the injection + distillation pattern. The model captures facts during a session and retrieves them in the next.
The automatic context compression Claude Code applies to lengthy conversations = trimming + summarization. You don’t see it because it’s transparent, but every time your session exceeds a certain threshold, part of the conversation is compressed.
The skills (/blog, /commit, etc.) = on-demand injection of specialized context. Instead of loading all possible context at the start, you load only what’s needed, when it’s needed.
Here’s what I find most interesting: the quality of your CLAUDE.md determines the quality of your agent far more than the model you’re using. A well-structured CLAUDE.md — with clear conventions, correct paths, and well-documented architectural decisions — can turn any decent model into a highly effective assistant. An empty or disorganized CLAUDE.md turns the best model in the world into a brilliant consultant locked in a dark room.
Prompt Debt: The Technical Debt You Can’t See
You’ve heard of technical debt, right? Code that works but accumulates future problems. Shortcuts you’ll pay for later.
Context engineering has its own debt: prompt debt. It’s all those config files, instructions, memories, and notes piling up with nobody maintaining them.
A CLAUDE.md with contradictory instructions. Global memories that no longer apply. Skills with paths that changed three months ago. Implicit precedence rules no one documented.
Every piece of obsolete context is noise. And noise competes with signal for the model’s attention. More noise → worse results. Not because the model is worse, but because you’re feeding it garbage mixed with useful information and expecting it to figure it out.
Keeping your context engineering layer clean is just as important as clean code. Maybe even more so, because a bug in the code fails loudly. A bug in the context fails silently — the model just starts making worse decisions, and nobody notices.
Actionable Tips: What You Can Do Today
All this theory is nice, but what do you do with it on a random Tuesday morning?
1. Audit your CLAUDE.md (or equivalent). Does it have contradictory instructions? Paths that no longer exist? Rules that no longer apply? Clean it up. Every extra line is noise.
2. Organize your context by stability. Put what never changes first (conventions, stack). Put what changes frequently last (current task). This maximizes cache hits and reduces costs. It’s not just cosmetic — it’s economical.
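Tip 2 in miniature: order sections from most to least stable so the unchanging prefix stays byte-identical across requests, which is what prompt caching rewards. The section names and layout are illustrative assumptions.

```python
def build_prompt(stable_conventions: str, project_memory: str, current_task: str) -> str:
    """Order context from most to least stable so the unchanging prefix
    can be cached across requests; only the tail changes per task."""
    sections = [
        ("CONVENTIONS", stable_conventions),  # almost never changes -> cacheable prefix
        ("PROJECT MEMORY", project_memory),   # changes occasionally
        ("CURRENT TASK", current_task),       # changes every request -> goes last
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)

p1 = build_prompt("Use type hints.", "DB is PostgreSQL.", "Add an index to users.email")
p2 = build_prompt("Use type hints.", "DB is PostgreSQL.", "Fix the login bug")
# Everything before CURRENT TASK is identical across both calls:
# that shared prefix is what the provider's cache can reuse.
```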
3. Define explicit precedence rules. If the user says one thing and memory says another, who wins? If you don’t define this, the model decides for you. And you’re gambling.
4. Filter aggressively. Not everything deserves to be remembered. An architecture decision? Yes. That the user prefers tabs over spaces? Maybe. That it was raining when the session started? No.
5. Separate real from synthetic context. If you use summarization, tag summaries as such. When something fails, you need to know if the model was working with real data or a potentially flawed summary.
6. Treat context maintenance as technical debt. Put it in the backlog. Review it periodically. It’s not glamorous, but it’s what separates an agent that works from one that hallucinates.
The Skill Nobody Lists on Their Résumé
Context engineering is the invisible skill. It doesn’t show up in job listings. There’s no certification for it. There’s no 40-hour Udemy course with a certificate at the end.
But it’s what separates people who “use ChatGPT” from people who build agents that work. It’s the difference between asking a question to an LLM and designing a system where the LLM has everything it needs to give the right answer.
The next time your agent does something dumb, before blaming the model, take a look at what context you were giving it. Chances are, the problem isn’t the brain — it’s what the brain was seeing.
And unlike the model, that’s something you can control.
Sources: The two OpenAI Cookbook articles on Context Engineering for Long-Term Personalization and Short-Term Memory Management with Sessions. If you’re interested in how the internal loop of a coding agent works, read Your AI Coding Agent is Just a While Loop with Delusions of Grandeur. And if you want to understand why prompt order impacts cost, check out Why 99% of What You Send to Claude is Already Cached.