A `Dictionary<String, UInt64>`. 900 entries. ~55KB. Nothing groundbreaking.
And here’s the kicker that makes this even more absurd: that file was created by the app itself. It’s not JSON from an external API. It doesn’t come from Claude Code. It’s an internal state file that Tokamak writes and reads to keep track of where it left off in each session. The AI was re-reading, from disk, 900 times, a file the app had generated itself.
“But why don’t you use Core Data or SQLite? They’re already in the app.” Good question. Because this file is a disposable progress cache. If it gets corrupted, you delete it, and the next scan reconstructs all offsets by reading all the files in full once. Zero data loss. Plus, I can just `cat session-offsets.json | jq .` to debug it (with Core Data, I’d need `sqlite3` and the sandboxed database path). It’s Sendable without messing around with background contexts. And if Core Data’s SQLite gets corrupted, it doesn’t drag the offsets down with it (or vice versa). For 55KB of a flat dictionary, the ceremony of setting up an entity with schema migration just isn’t worth it.
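For context, the cache is just a flat JSON object mapping session identifiers to byte offsets. A hypothetical excerpt (the real key format depends on how sessions are named):

```json
{
  "session-2026-01-12-a41f": 183492,
  "session-2026-01-13-9c02": 77216
}
```

Which is exactly why `jq` is all the tooling it needs.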
The format wasn’t the issue. The access was.
Here’s the code the AI generated for the scan loop:
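The original listing isn’t reproduced here, so below is a minimal reconstruction of the pattern, not the verbatim code. `SessionFile` and `OffsetStore` are illustrative stand-ins, and the real scan work is replaced by an increment; what matters is that each accessor call hits the disk.

```swift
import Foundation

// Illustrative reconstruction of the generated pattern, not the real code.
struct SessionFile { let id: String }

struct OffsetStore {
    let url: URL

    // Re-reads and re-decodes the entire JSON file on every call.
    func offset(for id: String) -> UInt64 {
        guard let data = try? Data(contentsOf: url),
              let offsets = try? JSONDecoder().decode([String: UInt64].self, from: data)
        else { return 0 }
        return offsets[id] ?? 0
    }

    // Re-reads, mutates, re-encodes, and rewrites the entire file on every call.
    func setOffset(_ value: UInt64, for id: String) throws {
        var offsets = (try? JSONDecoder().decode([String: UInt64].self,
                                                 from: Data(contentsOf: url))) ?? [:]
        offsets[id] = value
        try JSONEncoder().encode(offsets).write(to: url, options: .atomic)
    }
}

func scanAll(_ files: [SessionFile], store: OffsetStore) throws {
    for file in files {                               // ~900 iterations
        let offset = store.offset(for: file.id)      // IO #1: disk read + JSON decode
        let newOffset = offset + 1                   // stand-in for the real scan work
        try store.setOffset(newOffset, for: file.id) // IO #2: JSON encode + disk write
    }
}
```

Every pass through the loop pays a full decode and a full encode of the whole 55KB file, even though only one key changes.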
Two disk IO operations per iteration. 900 iterations. 1,800 IO operations when there should be exactly two: one read at the start, one write at the end.
## The numbers (xctrace doesn’t lie)
I caught it with Instruments (Time Profiler). Here’s the data:
| Metric | Before | After |
|---|---|---|
| Total samples | 7,260 | 489 |
| Samples in `OffsetStore.load()` | 1,704 (88%) | 10 (2%) |
| Scan time | >20s | <0.5s |
| CPU | 81% | ~1.5% |
Eighty-eight percent of the scan time was spent reading and parsing a 900-line JSON file. Over and over again. Like Sisyphus pushing the same rock, but with JSONDecoder.
## The fix (which should embarrass you)
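The fixed listing isn’t reproduced here either; a sketch with illustrative names (`SessionFile`, `OffsetStore`, an increment standing in for the real scan) shows the shape of it: hoist the IO out of the loop, so there are exactly two disk operations no matter how many files get scanned.

```swift
import Foundation

// Sketch of the fix: IO hoisted out of the loop. Illustrative names.
struct SessionFile { let id: String }

struct OffsetStore {
    let url: URL

    // One full read + decode. A missing or corrupt file yields an empty cache.
    func load() -> [String: UInt64] {
        guard let data = try? Data(contentsOf: url),
              let offsets = try? JSONDecoder().decode([String: UInt64].self, from: data)
        else { return [:] }
        return offsets
    }

    // One full encode + atomic write.
    func save(_ offsets: [String: UInt64]) throws {
        try JSONEncoder().encode(offsets).write(to: url, options: .atomic)
    }
}

func scanAll(_ files: [SessionFile], store: OffsetStore) throws {
    var offsets = store.load()               // IO #1: one read, before the loop
    for file in files {                      // ~900 iterations, all in memory
        let offset = offsets[file.id] ?? 0
        offsets[file.id] = offset + 1        // stand-in for the real scan work
    }
    try store.save(offsets)                  // IO #2: one write, after the loop
}
```

The loop body is now pure dictionary access; the two IO calls bracket it.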
Here’s the key point: the data structure didn’t change. It was still a `Dictionary<String, UInt64>`. The hash table was already optimal. What wasn’t optimal was rebuilding it from disk on every iteration.
## What doesn’t work: adding “don’t do this” to your CLAUDE.md
After the fix, I added this to the project’s CLAUDE.md:
> “NEVER perform IO (disk, network, JSON decoding, Core Data fetch) inside a loop if it can be done beforehand. Load data once before the loop, operate in memory, save once after.”
And here’s the real takeaway: it didn’t work.
Weeks later, while adding a second service (Codex), the AI generated exactly the same pattern. With the instruction right in front of it. It’s like putting up a “Don’t walk on the grass” sign and expecting it to work.
Why? Because the LLM doesn’t understand the rule. It has seen the rule. Statistically, most of the code it trained on performs IO sporadically, not inside 900-iteration loops. The load → use → save pattern in a function is more common. Whether that function is called inside a loop of 900 iterations is a contextual detail the model has no incentive to track.
## What doesn’t work either: linters
There’s no linter that catches this. Not SwiftLint, ESLint, Ruff, or Clippy. Think about it: the code is syntactically correct and semantically valid. Each individual call to `offsetStore.offset(for:)` is perfectly reasonable. The problem isn’t in any single line—it’s in the composition.
If we think in terms of code layers of meaning (an idea I teach in my adversarial development course):
| Layer | Question | Fails here? |
|---|---|---|
| 1. Signal | Is this code? | No |
| 2. Language | Is it valid Swift? | No |
| 3. Syntax | Does it compile? | No |
| 4. Local semantics | Does the function do what it promises? | No |
| 5. System semantics | Does it respect contracts and performance? | Yes |
| 6. Architecture | Does it scale without degrading? | Yes |
The failure is at layers 5-6. Exactly where LLMs fail today, in 2026. The syntax and local logic are spotless. The problem is emergent: it arises when a correct function is used in a context that turns it into a bottleneck.
A linter operates at layers 2-4. It has no visibility into composition or performance. Asking a linter to catch this would be like expecting Microsoft Word’s spell checker to catch a logical fallacy.
## What does work: performance tests after the fact
After the first fix, I wrote this test:
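The test itself isn’t reproduced in this excerpt. A hedged sketch of what it described, assuming hypothetical `Scanner` and `makeTemporarySessionFiles(count:)` helpers that the real project would provide:

```swift
import XCTest

// Sketch of the regression test; Scanner and makeTemporarySessionFiles(count:)
// are illustrative stand-ins for the project's real scanner and test fixtures.
final class ScanPerformanceTests: XCTestCase {
    func testScanningAThousandFilesStaysUnderThreeSeconds() throws {
        let files = try makeTemporarySessionFiles(count: 1_000)
        let start = Date()
        _ = try Scanner().scan(files)
        let elapsed = Date().timeIntervalSince(start)
        // With IO hoisted out of the loop this runs in ~0.2s; put IO back
        // inside the loop and it blows past 30s, failing loudly.
        XCTAssertLessThan(elapsed, 3.0)
    }
}
```

A wall-clock threshold is deliberately crude, but the gap between the correct pattern (~0.2s) and the broken one (~30s) is two orders of magnitude, so the 3-second line never flakes.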
It’s a brutally simple regression test. A thousand files, less than 3 seconds, or the test fails. If anyone (human or AI) puts IO back into the loop, the scan time jumps from 0.2 seconds to 30, and the test explodes.
And that’s exactly what happened. When the AI generated the second service with the same bug, the performance test for the first service still passed (it was a different service). But when I wrote the equivalent test for the new service, it immediately failed. The test did its job: catch the regression that neither the CLAUDE.md nor any linter could spot.
## What this confirms
This bug is a textbook example of the core thesis of what I call adversarial development: never trust, always verify.
You can’t trust the AI not to make beginner-level mistakes. It will. Repeatedly. Even if you tell it not to.
You can’t trust linters to catch it. They can’t. The mistake is above their abstraction level.
What you can do:
- Performance tests as a safety net after the fact
- Real profiling (xctrace, Instruments) to measure, not guess
- Defense in depth: multiple layers, because no single layer covers everything
In plain language: the defense isn’t a wall. It’s an onion. Layer upon layer. And when one layer fails, the next catches it.
## For the skeptics
“But Fernando, wouldn’t a human programmer make the same mistake?”
A junior, yes. A senior, probably not—they have the pattern internalized. Even if they did make the mistake, code reviews would catch it. The problem with AI-generated code is volume: 50 files in 10 minutes. Nobody reviews 50 files line by line. Discriminator fatigue is real.
And that’s why your verification needs to be automatic, not human. A performance test doesn’t get tired, doesn’t get distracted, doesn’t suffer discriminator fatigue. It runs every time you hit `make test` and tells you if something smells off.
This is the same principle I follow in The 5 Defenses Against Hallucinations in Code: the verification system must be external to the generator. If the AI writes the code, verification has to come from somewhere else. In this case, from a clock measuring how long something takes.