Last week I told the story of how my AI invented an entire JSON structure and wrapped it in DTOs, fixtures, and passing tests. 90 green tests. All lies.
That post was the diagnosis. This is the treatment.
After discovering the disaster, I did what any engineer with wounded pride does: research obsessively for days to make sure it never happens again. I read papers, tried tools, analyzed real data from my APIs, and built a defense system for my app.
What I found surprised me. Of the 5 reactive measures I identified, only 3 actually work. The other two are, at best, security theater with good intentions.
The Mental Model: You vs. the AI (Literally)
Before diving into the measures, you need to understand the framework. And the best analogy I found comes from deep learning.
In a GAN (Generative Adversarial Network) there are two neural networks competing:
- The generator produces content (images, text, whatever)
- The discriminator tries to detect if the content is real or fake
The system improves because both push each other. The generator learns to fool better. The discriminator learns to detect better.
When you program with an LLM, you’re in an involuntary GAN:
- The LLM is the generator. It produces code, DTOs, tests, fixtures.
- You are the discriminator. You must detect what’s real and what’s invented.
But there’s a brutal asymmetry: the generator is tireless and you get tired. The LLM can generate 50 files without breaking a sweat. You review 10, get fatigued, and file 11 passes without you looking at it.
It’s the same authorization fatigue I wrote about with 1Password asking for Touch ID 47 times a day. Security that depends on a human being permanently alert is cardboard security.
What the Discriminator Should Watch
You can’t (and shouldn’t) review every line. What you have to watch are the boundaries — where your code touches the outside world:
| Boundary | Key Question |
|---|---|
| External APIs | Do the DTO fields exist in the real API? |
| Packages | Does the dependency exist and is it named like this? |
| DB Schemas | Does the table actually have those columns? |
| URLs/endpoints | Does the endpoint exist and respond as expected? |
Rule: everything the LLM declares about the outside world is suspect until verified. Saying it with confidence is not evidence. Anthropic acknowledges this in its own documentation:
“Claude can sometimes generate responses that contain fabricated information… presented in a confident, authoritative manner.”
An LLM that says “I’m sure” and one that says “I think” have exactly the same probability of being wrong.
Automating the Discriminator
The ultimate goal is to stop depending on your discipline and automate verification:
BEFORE:
LLM generates → You review (sometimes) → Merge
AFTER:
LLM generates → CI verifies against real data → You review discrepancies → Merge
The 5 measures that follow are ways to automate parts of that discriminator role. Some work. Others not so much.
The Hard Data (For Skeptics)
Before you think “this doesn’t happen to me,” here are numbers from real studies:
- 21.7% of packages recommended by open-source LLMs are invented. In commercial models it drops to 5.2%, which is still one package in 20.
- GPT-4o only achieves 38.58% valid invocations for infrequent APIs. Less than 40%. Flip a coin.
- The best current methods for locating code hallucinations achieve 22-33% precision. In plain English: we detect one out of four.
- A researcher uploaded an empty package with a name that LLMs frequently hallucinated. 30,000 downloads in 3 months. They call it slopsquatting.
And there’s a formal taxonomy. The CodeHalu paper (AAAI 2025) defines 4 categories of code hallucinations:
| Category | What it is | Real Example |
|---|---|---|
| Mapping | Fields mapped incorrectly | Confusing user_id with account_id |
| Naming | Invented names | response.quota.percentage when it’s response.utilization |
| Resource | Resources that don’t exist | active_flags field in an API that doesn’t have it |
| Logic | Plausible but incorrect logic | isPaid = !activeFlags.isEmpty with an always-empty field |
My case was a Resource that led to a Logic. The field didn’t exist, and the logic depending on it looked perfect. Textbook-coherent fiction.
Measure 1: Contract Testing Against Real APIs
The Idea
Define a “contract” of what the API returns and automatically verify that your code is compatible. If your DTO has fields that the contract doesn’t define: alarm.
How It Works
Imagine you have a DTO that decodes the API response into typed fields.
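As a stack-agnostic stand-in (the article’s project is Swift), here is a hypothetical Python dataclass. The type name OrganizationDTO is invented; the field names are borrowed from the phantom-field example later in the post:

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class OrganizationDTO:
    # Hypothetical DTO; 'activeFlags' is the phantom field that
    # does not exist in the real API response.
    uuid: str
    name: str
    activeFlags: Optional[List[str]] = None

# The declared field names play the role of Swift's CodingKeys.allCases
dto_keys = {f.name for f in fields(OrganizationDTO)}
print(dto_keys)
```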
A contract test takes the real API response, extracts the JSON keys, and compares them with your DTO’s CodingKeys. If your DTO has a field the API doesn’t return, it’s a PHANTOM — a ghost field, possibly invented.
Keys in real API: {uuid, name, capabilities, billing_type}
Keys in DTO: {uuid, name, activeFlags}
PHANTOM: activeFlags ← In DTO but NOT in API. Hallucinated?
UNCONSUMED: capabilities, billing_type ← In API but not in DTO.
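The comparison boils down to set arithmetic over the two key sets from the example above:

```python
# Keys from the captured real response vs. keys declared in the DTO
api_keys = {"uuid", "name", "capabilities", "billing_type"}
dto_keys = {"uuid", "name", "activeFlags"}

phantom = dto_keys - api_keys      # in DTO but not in API -> possibly hallucinated
unconsumed = api_keys - dto_keys   # in API but never consumed by the app

print(phantom)             # {'activeFlags'}
print(sorted(unconsumed))  # ['billing_type', 'capabilities']
```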
Pros
- Deterministic. Doesn’t depend on another LLM or your gut feeling. If the field isn’t in the API, it fails.
- Eliminates phantom fields by construction. It’s impossible for an invented field to pass.
- Automatable in CI. You run it on every push.
Cons
- You need the API spec. If the API doesn’t have an OpenAPI spec (like Claude’s), you have to capture responses manually.
- Doesn’t detect incorrect naming. If the field exists but is named differently (active_flags vs capabilities), it doesn’t catch it automatically.
- Requires credentials. To capture the real response you need a valid session.
Tools by Stack
| Stack | Tool | Approach |
|---|---|---|
| Python | Pydantic extra='forbid' | Rejects JSON fields not declared in model |
| TypeScript | Zod .strict() | Same concept, rejects extras |
| Swift | Custom decoder or manual key comparison | Codable ignores unknown keys by default |
| Dart | json_serializable + disallowUnrecognizedKeys | Rejects undeclared fields |
| Agnostic | oasdiff, Specmatic | Compare OpenAPI specs |
What I Implemented
In my app (Swift/SPM) there’s no OpenAPI spec for Claude’s API. So I built bidirectional validation by hand:
- make capture downloads real responses from all APIs and saves them as fixtures in Fixtures/real/
- SchemaValidationTests compares each DTO’s CodingKeys.allCases against the real fixture’s keys
- If there’s a discrepancy → PHANTOM (field in DTO but not in API) or UNCONSUMED (field in API we don’t consume)
$ make doctor
✅ OrganizationInfo: 4 common, 0 phantom, 8 unconsumed
✅ UsageResponse: 9 common, 0 phantom, 1 unconsumed
⚠️ StatsCache: PHANTOM field 'totalSpeculationTimeSaved' — not in real data
Intentionally unconsumed fields go in a documented allowlist with the reason. If a new field appears in the API tomorrow, the test fails with UNCONSUMED and I find out.
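The allowlist check itself is tiny. A sketch in Python (entries and reasons are invented for illustration):

```python
# Documented allowlist: field -> reason we deliberately don't consume it
ALLOWLIST = {
    "billing_type": "billing is handled by a different screen",
}

def unexpected_unconsumed(api_keys, dto_keys, allowlist):
    """API fields that are neither consumed nor explicitly allowlisted."""
    return (api_keys - dto_keys) - set(allowlist)

# A new field appearing in the API tomorrow fails the check:
result = unexpected_unconsumed(
    {"uuid", "name", "billing_type", "new_field"},
    {"uuid", "name"},
    ALLOWLIST,
)
print(result)  # {'new_field'}
```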
Verdict: the most important measure. If you only implement one, make it this one.
Measure 2: Fixture Validation (Real Fixtures, Not Invented Ones)
The Idea
Test fixtures should come from real captured data, not written by hand by the LLM. If the LLM generates the fixture, you’re validating fiction against fiction.
The Problem It Solves
George Tsiokos nailed it in a February 2025 post:
“Tests don’t validate that software meets business needs — they simply confirm that the code does exactly what it was written to do, including bugs.”
When the LLM generates the code AND the tests AND the fixtures:
LLM invents field → LLM writes fixture with that field → LLM writes test
→ Test passes ✅ → Nobody verified against reality ❌
The Solution: Record-Replay
Record-replay frameworks record real HTTP responses and replay them in tests. There’s no possibility of invention because the fixture comes from the API, not the model.
| Stack | Tool |
|---|---|
| Python | VCR.py, pytest-recording |
| TypeScript | Polly.js (Netflix), MSW |
| Swift | Replay (mattt) |
| Agnostic | Hoverfly |
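If none of these fit your stack, the record-replay core is small enough to hand-roll. A minimal Python sketch, assuming fixtures are JSON files with a sibling .meta capture record (the fetch argument stands in for a real HTTP call):

```python
import json
import time
from pathlib import Path

def record(name, fetch, fixtures_dir: Path):
    """Call the real API once; persist the response plus capture metadata."""
    body = fetch()  # stand-in for a real HTTP call
    fixtures_dir.mkdir(parents=True, exist_ok=True)
    (fixtures_dir / f"{name}.json").write_text(json.dumps(body))
    (fixtures_dir / f"{name}.meta").write_text(
        json.dumps({"captured_at": time.time()})
    )
    return body

def replay(name, fixtures_dir: Path):
    """Tests read the captured fixture instead of touching the network."""
    return json.loads((fixtures_dir / f"{name}.json").read_text())
```

Tests only ever call replay(); record() runs in a credentialed environment, and the resulting files get committed so reviewers see exactly what the API returned.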
Pros
- Impossible to invent. The fixture comes from the network, not the model.
- Includes metadata. URL, timestamp, status code. You can trace where it came from.
- Gets committed to repo. Reviewers see exactly what the API returned.
Cons
- Fixture ages. If the API changes, the captured fixture is no longer representative.
- Credentials in CI. You need to be able to call the API to record.
- Doesn’t scale to all variations. You capture one response, but the API might return many different shapes.
What I Implemented
Two fixture layers:
Tests/Fixtures/ ← Static, written by LLM
Used for decode unit tests
CAN contain errors (that's acceptable)
Tests/Fixtures/real/ ← Captured by make capture
With .meta file (capture timestamp)
Source of truth for schema validation
Static fixtures are useful for testing edge cases (truncated JSON, empty fields, weird formats). But the validation of “do these fields really exist?” is always done by the real fixture.
Each real fixture has a .meta file with the capture timestamp. If a fixture is more than 30 days old, you know it’s time to refresh.
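Enforcing that 30-day rule takes a few lines, assuming the .meta file is JSON with a captured_at epoch timestamp (the real format isn’t specified in the post):

```python
import json
import time

MAX_AGE_DAYS = 30  # staleness threshold from the article

def fixture_is_stale(meta_text, now=None):
    """True if the capture timestamp in a .meta file is older than 30 days."""
    captured_at = json.loads(meta_text)["captured_at"]
    if now is None:
        now = time.time()
    return (now - captured_at) > MAX_AGE_DAYS * 86400

meta = json.dumps({"captured_at": 1_000_000})
print(fixture_is_stale(meta, now=1_000_000 + 86400))       # False: 1 day old
print(fixture_is_stale(meta, now=1_000_000 + 40 * 86400))  # True: 40 days old
```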
Verdict: essential as a complement to contract testing. Alone it’s not enough (you need the comparison from Measure 1), but without real fixtures Measure 1 has nothing to compare against.
Measure 3: Smoke Tests with Real Data (make doctor)
The Idea
Before approving a change, make a real call to the API and verify that your DTOs parse the response without silent loss.
How It Works
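The flow is capture followed by comparison. A stand-in sketch in Python (function and field names are illustrative; the real project wires this through Swift tests and make targets):

```python
def doctor(fetch, dto_keys):
    """Capture a fresh response and cross it against the DTO's declared keys."""
    api_keys = set(fetch().keys())     # make capture: fresh production data
    phantom = dto_keys - api_keys      # fields the API never returns
    unconsumed = api_keys - dto_keys   # fields we silently drop
    status = "OK" if not phantom else "PHANTOM"
    return status, phantom, unconsumed

# Illustrative stand-in for the real API call:
fake_fetch = lambda: {"uuid": "1", "name": "acme", "capabilities": []}
print(doctor(fake_fetch, {"uuid", "name", "activeFlags"}))
```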
It’s make capture + make test in one step. Captures fresh production data and crosses it against the DTOs.
Pros
- The most honest defense. Real data, direct comparison, unambiguous result.
- Fast. 30 seconds locally.
- Detects drift. If the API adds or removes fields, you know immediately.
Cons
- Requires active session. You need to be logged in to capture.
- Doesn’t go in CI (in my case). Claude’s API doesn’t have service credentials, only session cookies.
- It’s manual. Depends on you remembering to run it.
What I Implemented
make doctor is my project’s most important command. I run it:
- After every DTO change
- Once a week as routine
- When something “smells weird” in the app
For APIs I can’t call in CI, the trick is to save the doctor result as a real fixture that does go to the repo. CI validates against that fixture. It’s not real-time, but it’s better than nothing.
Additionally, the system emits early signals at runtime: if the SessionFileReader reads assistant type lines without a usage field, it logs a .notice. If the SessionTokenService reads files but finds 0 new entries, it also logs. The idea is for the app to warn if the format changed, even if it doesn’t crash (because graceful degradation can hide the problem).
Verdict: the most practical measure. Low cost, high value. If you have 30 seconds, you have make doctor.
Measure 4: Parse Anomaly Detection (Always-Null Fields)
The Idea
Monitor at runtime which fields in your models get populated with real data and which are always nil. A field that’s been nil for 50 consecutive parses is suspicious of being invented.
The Mental Model
GraphQL has this solved. Tools like Apollo GraphOS report usage by field: how many times it was requested, how many times it returned data, first and last time it was used. Fields with 0% usage get marked for removal.
For REST, there’s no equivalent. You have to build it yourself.
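A small counter is enough to start. A sketch of the idea (the class, the threshold, and the field names are invented for illustration):

```python
from collections import defaultdict

class NullFieldMonitor:
    """Counts, per field, how many parses produced a value vs. null."""

    def __init__(self, suspect_after=50):
        self.seen = defaultdict(int)
        self.non_null = defaultdict(int)
        self.suspect_after = suspect_after

    def observe(self, parsed):
        for field, value in parsed.items():
            self.seen[field] += 1
            if value is not None:
                self.non_null[field] += 1

    def suspects(self):
        """Fields observed many times that never carried data."""
        return {f for f, n in self.seen.items()
                if n >= self.suspect_after and self.non_null[f] == 0}

m = NullFieldMonitor(suspect_after=3)
for _ in range(3):
    m.observe({"uuid": "1", "activeFlags": None})
print(m.suspects())  # {'activeFlags'}
```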
Pros
- Detects in production. You don’t need to capture manually; the app’s own usage generates the data.
- Complements other measures. A field that passes contract testing (exists in API) but is always null in practice is still suspicious.
Cons
- You need volume. With 5 calls you can’t conclude anything. You need hundreds.
- False positives. A field can legitimately be null 95% of the time (e.g. seven_day_opus: null in my API is normal if you didn’t use Opus that week).
- Manual implementation. There’s no tool you can plug in. You have to write the monitor.
- In client apps, no APM. In a backend with Datadog or Sentry, you emit custom metrics. In a macOS menu bar app, you’re on your own.
What I Implemented
Partially. I don’t have a formal always-nil field monitor, but I do have early warning signals in the logs.
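A Python stand-in for that kind of signal — the check is invented for illustration, mirroring the “read lines but found nothing usable” case described in Measure 3:

```python
import logging

log = logging.getLogger("parse-anomalies")

def parse_line(line):
    # Illustrative rule: a line is usable only if it carries a usage marker.
    return line if "usage" in line else None

def parse_entries(lines):
    """Parse raw lines into entries; warn when everything decodes to nothing."""
    entries = [e for e in (parse_line(l) for l in lines) if e is not None]
    if lines and not entries:
        # Graceful degradation can hide a format change -- surface it.
        log.warning("read %d lines but produced 0 usable entries", len(lines))
    return entries

print(parse_entries(["type=assistant", "type=assistant"]))  # []
```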
It’s the low-tech version of anomaly detection. It doesn’t count by field, but it detects the big case: “I’m reading data but nothing useful is coming out.”
Verdict: useful as an alert signal, but not as primary defense. It’s a canary in the mine, not a wall.
Measure 5: Post-Generation Semantic Diff (LLM-as-Judge)
The Idea
Use a second LLM (or the same one with a different prompt) to audit generated code, looking for fields or structures it can’t verify against known documentation.
State of the Art
There are serious tools working on this:
| Tool | What it does |
|---|---|
| VERDICT (Haize Labs) | Modular pipeline: verification + debate + aggregation |
| DeepEval | pytest-like framework with HallucinationMetric |
| Patronus Lynx | SOTA hallucination detection model, open-source |
| Vectara HHEM | Model + API, reduces hallucinations to ~0.9% in enterprise |
And the homemade option: ask GPT-4o to generate DTOs for the same API without seeing your code, and compare:
Claude says: activeFlags: [String]
GPT-4o says: capabilities: [String]
→ DISCREPANCY: at least one is hallucinating. Verify against real API.
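Mechanically, the homemade cross-check is one more set comparison (field names taken from the example above):

```python
# Fields each model proposes for the same API, independently
claude_fields = {"uuid", "name", "activeFlags"}
gpt4o_fields = {"uuid", "name", "capabilities"}

# Symmetric difference: fields only one model proposes.
# At least one of the two is hallucinating -> verify against the real API.
discrepancies = claude_fields ^ gpt4o_fields
print(sorted(discrepancies))  # ['activeFlags', 'capabilities']
```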
Pros
- Scales without manual effort. You put it in CI and it runs automatically.
- Detects subtle patterns. A second model might notice things you don’t.
Cons
And this is where things get ugly.
- The judge can hallucinate too. If the second LLM doesn’t know the API, it can “confirm” invented fields.
- Systematic hallucinations. If both models were trained on similar data, they can share the same invention. SelfCheckGPT (Cambridge, EMNLP 2023) showed that multi-sample consistency doesn’t detect systematic hallucinations.
- Dismal precision. Collu-Bench: the best methods achieve 22-33% precision localizing code hallucinations. You detect one out of four. That’s not a defense, it’s a lottery.
- Cost. Each layer multiplies LLM calls. You’re paying for a detector that’s right a third of the time.
- Position bias. LLM judges prefer longer responses and those that appear first. They don’t judge; they have aesthetic preferences.
Evidently AI summarized it with a devastating question:
“How do you monitor a system that occasionally hallucinates with another system that occasionally hallucinates?”
What I Implemented
Nothing. Zero.
And it’s a conscious decision. The deterministic measures (1, 2, and 3) give me reliable, reproducible detection with no false positives or per-call costs. Putting an LLM to watch another LLM is like putting an intern to supervise another intern. Better to put up a camera.
Verdict: interesting research, premature production. When precision goes from 33% to 90%, we’ll talk. Today, it’s theater with an R&D budget.
The Final Score
| Measure | Reliability | Cost | Implemented? | Why? |
|---|---|---|---|---|
| 1. Contract testing | High | Medium | Yes | Mechanically detects phantom fields |
| 2. Fixture validation | High | Low | Yes | Real fixtures eliminate fiction-validates-fiction |
| 3. Smoke tests (make doctor) | High | Low | Yes | 30 seconds, maximum value |
| 4. Anomaly detection | Medium | Low | Partial | Signals in logs, no formal monitor |
| 5. LLM-as-Judge | Low | High | No | 22-33% precision = lottery |
Measures 1, 2, and 3 form a tripod. Each covers a different angle:
- Contract testing answers: “Do these fields exist?”
- Fixture validation answers: “Is this data real?”
- Smoke tests answer: “Does this work right now?”
Together, they make an invented field have to survive three independent filters. It’s not impossible, but it’s much harder than fooling a unit test with an invented fixture.
The Golden Rule
I want to end with the most important rule I took from all this:
The verification system must be external to the generator.
If the LLM generates:
- The code → OK, that’s its job
- The logic tests → OK, they verify behavior
- The fixtures → NO, they must come from real data
- The schemas → NO, they must come from the API spec
- The validation that the data is correct → NO, a deterministic system does that
It’s separation of powers applied to development. Whoever writes the law can’t be the one who judges it. Whoever generates the code can’t be the one who verifies it’s correct.
You can have 200 green tests and be living in the Matrix. Or you can have a make doctor that in 30 seconds tells you if your data is real or fiction.
I prefer the red pill.
Complete series: This post is the fourth chapter of an involuntary series about AI failures in production. First came the 44 invented emails (the AI that acts without permission). Then MEMORY.md (the AI that forgets). After that the silent failure (the AI that invents and passes tests). And now, the defenses. Each failure different, one common denominator: we need mechanical systems, not promises of good behavior.