Last week I told the story of how my AI invented a complete JSON structure and wrapped it in DTOs, fixtures, and passing tests. 90 green tests. All lies.

That post was the diagnosis. This is the treatment.

After discovering the disaster, I did what any engineer with wounded pride does: I researched obsessively for days to make sure it never happens again. I read papers, tried tools, analyzed real data from my APIs, and built a defense system for my app.

What I found surprised me. Of the 5 reactive measures I identified, only 3 actually work. The other two are, at best, security theater with good intentions.

The Mental Model: You vs. the AI (Literally)

Before diving into the measures, you need to understand the framework. And the best analogy I found comes from deep learning.

In a GAN (Generative Adversarial Network) there are two neural networks competing:

  • The generator produces content (images, text, whatever)
  • The discriminator tries to detect if the content is real or fake

The system improves because both push each other. The generator learns to fool better. The discriminator learns to detect better.

When you program with an LLM, you’re in an involuntary GAN:

  • The LLM is the generator. It produces code, DTOs, tests, fixtures.
  • You are the discriminator. You must detect what’s real and what’s invented.

But there’s a brutal asymmetry: the generator is tireless and you get tired. The LLM can generate 50 files without breaking a sweat. You review 10, get fatigued, and file 11 passes without you looking at it.

It’s the same authorization fatigue I wrote about with 1Password asking for Touch ID 47 times a day. Security that depends on a human being permanently alert is cardboard security.

What the Discriminator Should Watch

You can’t (and shouldn’t) review every line. What you have to watch are the boundaries — where your code touches the outside world:

| Boundary | Key Question |
|---|---|
| External APIs | Do the DTO fields exist in the real API? |
| Packages | Does the dependency exist, and is it really named that? |
| DB schemas | Does the table actually have those columns? |
| URLs/endpoints | Does the endpoint exist and respond as expected? |

Rule: everything the LLM declares about the outside world is suspect until verified. Saying it with confidence is not evidence. Anthropic acknowledges this in their own documentation:

“Claude can sometimes generate responses that contain fabricated information… presented in a confident, authoritative manner.”

An LLM that says “I’m sure” and one that says “I think” have exactly the same probability of being wrong.

Automating the Discriminator

The ultimate goal is to stop depending on your discipline and automate verification:

BEFORE:
  LLM generates → You review (sometimes) → Merge

AFTER:
  LLM generates → CI verifies against real data → You review discrepancies → Merge

The 5 measures that follow are ways to automate parts of that discriminator role. Some work. Others not so much.

The Hard Data (For Skeptics)

Before you think “this doesn’t happen to me,” here are numbers from real studies:

  • 21.7% of packages recommended by open-source LLMs are invented. In commercial models it drops to 5.2%, which is still one package in 20.
  • GPT-4o only achieves 38.58% valid invocations for infrequent APIs. Less than 40%. Worse than a coin flip.
  • The best current methods for locating code hallucinations achieve 22-33% precision. In plain English: we detect one out of four.
  • A researcher uploaded an empty package with a name that LLMs frequently hallucinated. 30,000 downloads in 3 months. They call it slopsquatting.

And there’s a formal taxonomy. The CodeHalu paper (AAAI 2025) defines 4 categories of code hallucinations:

| Category | What it is | Real Example |
|---|---|---|
| Mapping | Fields mapped incorrectly | Confusing user_id with account_id |
| Naming | Invented names | response.quota.percentage when it’s response.utilization |
| Resource | Resources that don’t exist | active_flags field in an API that doesn’t have it |
| Logic | Plausible but incorrect logic | isPaid = !activeFlags.isEmpty with an always-empty field |

My case was a Resource that led to a Logic: the field didn’t exist, and the logic that depended on it looked perfect. Internally coherent fiction.

Measure 1: Contract Testing Against Real APIs

The Idea

Define a “contract” of what the API returns and automatically verify that your code is compatible. If your DTO has fields that the contract doesn’t define: alarm.

How It Works

Imagine you have a DTO like this:

struct OrganizationInfo: Decodable {
    let uuid: String
    let name: String
    let activeFlags: [String]  // ← Does this really exist?
}

A contract test takes the real API response, extracts the JSON keys, and compares them with your DTO’s CodingKeys. If your DTO has a field the API doesn’t return, it’s a PHANTOM — a ghost field, possibly invented.

Keys in real API:  {uuid, name, capabilities, billing_type}
Keys in DTO:       {uuid, name, activeFlags}

PHANTOM: activeFlags  ← In DTO but NOT in API. Hallucinated?
UNCONSUMED: capabilities, billing_type ← In API but not in DTO.
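The comparison itself is trivial to automate. A minimal, stack-agnostic sketch in Python, using the field names from the example above (in a real project you'd compare the keys of a captured fixture against your DTO's declared keys):

```python
def diff_schema(api_keys: set[str], dto_keys: set[str]) -> dict[str, set[str]]:
    """Compare the keys of a real API response with the keys a DTO declares."""
    return {
        "phantom": dto_keys - api_keys,      # in DTO but not in API: hallucination suspect
        "unconsumed": api_keys - dto_keys,   # in API but not consumed by the DTO
    }

# Field names from the example above
result = diff_schema(
    api_keys={"uuid", "name", "capabilities", "billing_type"},
    dto_keys={"uuid", "name", "activeFlags"},
)
# result["phantom"] contains "activeFlags"; "capabilities" and "billing_type" are unconsumed
```

Two set subtractions are the whole trick; everything else (capturing the real response, extracting CodingKeys) is plumbing around them.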

Pros

  • Deterministic. Doesn’t depend on another LLM or your gut feeling. If the field isn’t in the API, it fails.
  • Eliminates phantom fields by construction. It’s impossible for an invented field to pass.
  • Automatable in CI. You run it on every push.

Cons

  • You need the API spec. If the API doesn’t have an OpenAPI spec (like Claude’s), you have to capture responses manually.
  • Doesn’t detect incorrect naming. If the field exists but is named differently (active_flags vs capabilities), it doesn’t catch it automatically.
  • Requires credentials. To capture the real response you need a valid session.

Tools by Stack

| Stack | Tool | Approach |
|---|---|---|
| Python | Pydantic extra='forbid' | Rejects JSON fields not declared in the model |
| TypeScript | Zod .strict() | Same concept, rejects extras |
| Swift | Custom decoder or manual key comparison | Codable ignores unknown keys by default |
| Dart | json_serializable + disallowUnrecognizedKeys | Rejects undeclared fields |
| Agnostic | oasdiff, Specmatic | Compare OpenAPI specs |
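If your stack has none of these, the extra='forbid' idea takes only a few lines of standard library. A hedged sketch (field names are illustrative, not from any real API):

```python
import json

def strict_decode(raw: str, declared: set[str]) -> dict:
    """Parse JSON and fail loudly on keys the model does not declare,
    mimicking Pydantic's extra='forbid' and Zod's .strict()."""
    data = json.loads(raw)
    extras = set(data) - declared
    if extras:
        raise ValueError(f"undeclared fields in payload: {sorted(extras)}")
    return data

payload = '{"uuid": "abc", "name": "Acme", "billing_type": "team"}'
strict_decode(payload, {"uuid", "name", "billing_type"})  # passes
```

The point is the failure mode: an unexpected field is a loud error at decode time, not a silent shrug.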

What I Implemented

In my app (Swift/SPM) there’s no OpenAPI spec for Claude’s API. So I built bidirectional validation by hand:

  1. make capture downloads real responses from all APIs and saves them as fixtures in Fixtures/real/
  2. SchemaValidationTests compares each DTO’s CodingKeys.allCases against the real fixture’s keys
  3. If there’s a discrepancy → PHANTOM (field in DTO but not in API) or UNCONSUMED (field in API we don’t consume)
$ make doctor
✅ OrganizationInfo: 4 common, 0 phantom, 8 unconsumed
✅ UsageResponse: 9 common, 0 phantom, 1 unconsumed
⚠️  StatsCache: PHANTOM field 'totalSpeculationTimeSaved' — not in real data

Intentionally unconsumed fields go in a documented allowlist with the reason. If a new field appears in the API tomorrow, the test fails with UNCONSUMED and I find out.

Verdict: the most important measure. If you only implement one, make it this one.

Measure 2: Fixture Validation (Real Fixtures, Not Invented Ones)

The Idea

Test fixtures should come from real captured data, not written by hand by the LLM. If the LLM generates the fixture, you’re validating fiction against fiction.

The Problem It Solves

George Tsiokos nailed it in a February 2025 post:

“Tests don’t validate that software meets business needs — they simply confirm that the code does exactly what it was written to do, including bugs.”

When the LLM generates the code AND the tests AND the fixtures:

LLM invents field → LLM writes fixture with that field → LLM writes test
→ Test passes ✅ → Nobody verified against reality ❌

The Solution: Record-Replay

Record-replay frameworks record real HTTP responses and replay them in tests. There’s no possibility of invention because the fixture comes from the API, not the model.

| Stack | Tool |
|---|---|
| Python | VCR.py, pytest-recording |
| TypeScript | Polly.js (Netflix), MSW |
| Swift | Replay (mattt) |
| Agnostic | Hoverfly |

Pros

  • Impossible to invent. The fixture comes from the network, not the model.
  • Includes metadata. URL, timestamp, status code. You can trace where it came from.
  • Gets committed to repo. Reviewers see exactly what the API returned.

Cons

  • Fixture ages. If the API changes, the captured fixture is no longer representative.
  • Credentials in CI. You need to be able to call the API to record.
  • Doesn’t scale to all variations. You capture one response, but the API might return many different shapes.

What I Implemented

Two fixture layers:

Tests/Fixtures/          ← Static, written by LLM
                           Used for decode unit tests
                           CAN contain errors (that's acceptable)

Tests/Fixtures/real/     ← Captured by make capture
                           With .meta file (capture timestamp)
                           Source of truth for schema validation

Static fixtures are useful for testing edge cases (truncated JSON, empty fields, weird formats). But the validation of “do these fields really exist?” is always done by the real fixture.

Each real fixture has a .meta file with the capture timestamp. If a fixture is more than 30 days old, you know it’s time to refresh.
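The staleness check is a one-liner worth automating. A sketch, assuming the .meta file stores an ISO-8601 capture timestamp (that format is my convention, not a standard):

```python
from datetime import datetime, timedelta, timezone

def is_stale(captured_at: str, max_age_days: int = 30) -> bool:
    """True if a fixture's ISO-8601 capture timestamp is older than max_age_days."""
    captured = datetime.fromisoformat(captured_at)
    if captured.tzinfo is None:                      # treat naive timestamps as UTC
        captured = captured.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - captured > timedelta(days=max_age_days)
```

Run it over Fixtures/real/*.meta in CI and a forgotten re-capture becomes a failing check instead of a quiet rot.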

Verdict: essential as a complement to contract testing. Alone it’s not enough (you need the comparison from Measure 1), but without real fixtures Measure 1 has nothing to compare against.

Measure 3: Smoke Tests with Real Data (make doctor)

The Idea

Before approving a change, make a real call to the API and verify that your DTOs parse the response without silent loss.

How It Works

$ make doctor
Capturing /api/organizations... OK (2 orgs)
Capturing /api/organizations/{id}/usage... OK (9 windows)
Capturing ~/.claude/stats-cache.json... OK (115 sessions)
Capturing session JSONL... OK (847 entries)

Validating schemas...
✅ OrganizationInfo: OK
✅ UsageResponse: OK
✅ StatsCache: OK
✅ SessionEntry: OK

0 phantom fields, 0 new unconsumed fields

It’s make capture + make test in one step. It captures fresh production data and cross-checks it against the DTOs.

Pros

  • The most honest defense. Real data, direct comparison, unambiguous result.
  • Fast. 30 seconds locally.
  • Detects drift. If the API adds or removes fields, you know immediately.

Cons

  • Requires active session. You need to be logged in to capture.
  • Doesn’t go in CI (in my case). Claude’s API doesn’t have service credentials, only session cookies.
  • It’s manual. Depends on you remembering to run it.

What I Implemented

make doctor is my project’s most important command. I run it:

  • After every DTO change
  • Once a week as routine
  • When something “smells weird” in the app

For APIs I can’t call in CI, the trick is to save the doctor result as a real fixture that does go to the repo. CI validates against that fixture. It’s not real-time, but it’s better than nothing.

Additionally, the system emits early signals at runtime: if the SessionFileReader reads assistant type lines without a usage field, it logs a .notice. If the SessionTokenService reads files but finds 0 new entries, it also logs. The idea is for the app to warn if the format changed, even if it doesn’t crash (because graceful degradation can hide the problem).

Verdict: the most practical measure. Low cost, high value. If you have 30 seconds, you have make doctor.

Measure 4: Parse Anomaly Detection (Always-Null Fields)

The Idea

Monitor at runtime which fields in your models get populated with real data and which are always nil. A field that’s been nil for 50 consecutive parses is suspicious of being invented.

The Mental Model

GraphQL has this solved. Tools like Apollo GraphOS report usage by field: how many times it was requested, how many times it returned data, first and last time it was used. Fields with 0% usage get marked for removal.

For REST, there’s no equivalent. You have to build it yourself.
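Building it yourself is less work than it sounds. A hedged sketch of a per-field counter (the 50-parse threshold and field names are placeholders; in my app this would live next to the decoder, in Swift):

```python
from collections import defaultdict

class FieldUsageMonitor:
    """Count, per field, how many parses populated it vs. left it null.
    A field that stays null across many parses is a hallucination suspect."""

    def __init__(self):
        self.seen = defaultdict(int)
        self.populated = defaultdict(int)

    def record(self, parsed: dict) -> None:
        for field, value in parsed.items():
            self.seen[field] += 1
            if value is not None:
                self.populated[field] += 1

    def suspects(self, min_parses: int = 50) -> list[str]:
        """Fields never populated after at least min_parses observations."""
        return [f for f, n in self.seen.items()
                if n >= min_parses and self.populated[f] == 0]
```

Feed every decoded response through record() and review suspects() periodically — remembering to exempt legitimately sparse fields (the seven_day_opus case) before deleting anything.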

Pros

  • Detects in production. You don’t need to capture manually; the app’s own usage generates the data.
  • Complements other measures. A field that passes contract testing (exists in API) but is always null in practice is still suspicious.

Cons

  • You need volume. With 5 calls you can’t conclude anything. You need hundreds.
  • False positives. A field can legitimately be null 95% of the time (e.g. seven_day_opus: null in my API is normal if you didn’t use Opus that week).
  • Manual implementation. There’s no tool you can plug in. You have to write the monitor.
  • In client apps, no APM. In a backend with Datadog or Sentry, you emit custom metrics. In a macOS menu bar app, you’re on your own.

What I Implemented

Partially. I don’t have a formal always-nil field monitor, but I do have early warning signals in the logs:

// SessionTokenService.swift
if totalFilesRead > 0 && totalNewEntries == 0 {
    logger.notice("read \(totalFilesRead) files but 0 new entries — possible format change")
}

It’s the low-tech version of anomaly detection. It doesn’t count by field, but it detects the big case: “I’m reading data but nothing useful is coming out.”

Verdict: useful as an alert signal, but not as a primary defense. It’s a canary in the coal mine, not a wall.

Measure 5: Post-Generation Semantic Diff (LLM-as-Judge)

The Idea

Use a second LLM (or the same one with a different prompt) to audit generated code, looking for fields or structures it can’t verify against known documentation.

State of the Art

There are serious tools working on this:

| Tool | What it does |
|---|---|
| VERDICT (Haize Labs) | Modular pipeline: verification + debate + aggregation |
| DeepEval | pytest-like framework with HallucinationMetric |
| Patronus Lynx | SOTA hallucination detection model, open-source |
| Vectara HHEM | Model + API, reduces hallucinations to ~0.9% in enterprise |

And the homemade option: ask GPT-4o to generate DTOs for the same API without seeing your code, and compare:

Claude says:  activeFlags: [String]
GPT-4o says:  capabilities: [String]
→ DISCREPANCY: at least one is hallucinating. Verify against real API.

Pros

  • Scales without manual effort. You put it in CI and it runs automatically.
  • Detects subtle patterns. A second model might notice things you don’t.

Cons

And this is where things get ugly.

  • The judge can hallucinate too. If the second LLM doesn’t know the API, it can “confirm” invented fields.
  • Systematic hallucinations. If both models were trained on similar data, they can share the same invention. SelfCheckGPT (Cambridge, EMNLP 2023) showed that multi-sample consistency doesn’t detect systematic hallucinations.
  • Deplorable precision. Collu-Bench: the best methods achieve 22-33% precision localizing code hallucinations. You detect one out of four. That’s not a defense, it’s a lottery.
  • Cost. Each layer multiplies LLM calls. You’re paying for a detector that’s right a third of the time.
  • Position bias. LLM judges prefer longer responses and those that appear first. They don’t judge; they have aesthetic preferences.

Evidently AI summarized it with a devastating question:

“How do you monitor a system that occasionally hallucinates with another system that occasionally hallucinates?”

What I Implemented

Nothing. Zero.

And it’s a conscious decision. The deterministic measures (1, 2, and 3) give me reliable, reproducible detection with no false positives or per-call costs. Putting an LLM to watch another LLM is like putting an intern to supervise another intern. Better to put up a camera.

Verdict: interesting research, premature production. When precision goes from 33% to 90%, we’ll talk. Today, it’s theater with an R&D budget.

The Final Score

| Measure | Reliability | Cost | Implemented? | Why? |
|---|---|---|---|---|
| 1. Contract testing | High | Medium | Yes | Mechanically detects phantom fields |
| 2. Fixture validation | High | Low | Yes | Real fixtures eliminate fiction-validates-fiction |
| 3. Smoke tests (make doctor) | High | Low | Yes | 30 seconds, maximum value |
| 4. Anomaly detection | Medium | Low | Partial | Signals in logs, no formal monitor |
| 5. LLM-as-Judge | Low | High | No | 22-33% precision = lottery |

Measures 1, 2, and 3 form a tripod. Each covers a different angle:

  • Contract testing answers: “Do these fields exist?”
  • Fixture validation answers: “Is this data real?”
  • Smoke tests answer: “Does this work right now?”

Together, they make an invented field have to survive three independent filters. It’s not impossible, but it’s much harder than fooling a unit test with an invented fixture.

The Golden Rule

I want to end with the most important rule I took from all this:

The verification system must be external to the generator.

If the LLM generates:

  • The code → OK, that’s its job
  • The logic tests → OK, they verify behavior
  • The fixtures → NO, they must come from real data
  • The schemas → NO, they must come from the API spec
  • The validation that the data is correct → NO, a deterministic system does that

It’s separation of powers applied to development. Whoever writes the law can’t be the one who judges it. Whoever generates the code can’t be the one who verifies it’s correct.

You can have 200 green tests and be living in the Matrix. Or you can have a make doctor that in 30 seconds tells you if your data is real or fiction.

I prefer the red pill.


Complete series: This post is the fourth chapter of an involuntary series about AI failures in production. First came the 44 invented emails (the AI that acts without permission). Then MEMORY.md (the AI that forgets). After that the silent failure (the AI that invents and passes tests). And now, the defenses. Each failure different, one common denominator: we need mechanical systems, not promises of good behavior.