Yesterday I discovered that half of a module in my app was based on made-up data. Not by a distracted junior developer. By my AI.
The worst part isn’t that it invented stuff. The worst part is that everything compiled and all 90 tests passed.
Coherent fiction
I’m building BFClaude-9000, a macOS menu bar app that monitors Claude Max quota. Part of the functionality requires distinguishing whether a Claude account is paid or free by calling the claude.ai API.
I asked Claude Code to implement the detection. It did. It delivered:
- An `OrganizationInfoDTO` with an `activeFlags: [String]` field
- A computed `isPaid` property that checks if `activeFlags` is not empty
- An `OrganizationSelection` enum that classifies orgs into paid and free
- Tests with fixtures that verify everything works
Nice. Clean. Well structured. Completely made up.
The `active_flags` field doesn't exist in Claude's actual API. Or if it does exist, it doesn't work like the code assumed. When I logged in with my paid account, the app told me my account was free.
The house of cards pattern
What’s insidious isn’t that it lied about an API field. It’s the complete system it built around that lie:
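To make the layers concrete, here is a hedged reconstruction of roughly what that system looked like. The type and property names come from the list above; the bodies and the fixture are my guesses, not the actual generated code:

```swift
import Foundation

// Hypothetical reconstruction — names from the post, implementations guessed.
struct OrganizationInfoDTO: Decodable {
    let name: String
    let activeFlags: [String]   // ← this field doesn't exist in the real API

    enum CodingKeys: String, CodingKey {
        case name
        case activeFlags = "active_flags"
    }

    // Paid iff the (fictional) flags array is non-empty.
    var isPaid: Bool { !activeFlags.isEmpty }
}

enum OrganizationSelection {
    case paid(OrganizationInfoDTO)
    case free(OrganizationInfoDTO)

    init(_ org: OrganizationInfoDTO) {
        self = org.isPaid ? .paid(org) : .free(org)
    }
}

// The fixture is invented to match the DTO, so the "test" passes.
let fixture = #"{"name": "Acme", "active_flags": ["claude_max"]}"#
let org = try! JSONDecoder().decode(OrganizationInfoDTO.self,
                                    from: Data(fixture.utf8))
assert(org.isPaid)   // green: fiction validated against fiction
```

Every layer confirms the one below it, and nothing in the chain ever touches the real API.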
See it? It’s not a misplaced field. It’s a house of cards: the DTO defines a fake field, the logic depends on that field, the tests validate that the logic works with fixtures that are also fake. Each piece confirms the others. Everything adds up. Nothing is real.
IEEE Spectrum has a name for this: silent failure. The code doesn’t crash, doesn’t throw errors, doesn’t sound alarms. It just quietly does the wrong thing.
This isn’t an isolated case
Turns out the community already has a name for when an LLM invents packages and dependencies: package hallucination. A Snyk study found that between 5% and 20% of package recommendations from major LLMs are made up. Packages that don't exist, recommended as if they were published.
But the package thing is the easy case. You run `npm install made-up-package`, it fails, you find out. An invented field in a DTO that parses JSON with `try?` and graceful degradation… that doesn't fail. It works. Returns `nil` or an empty array. And your code continues on, operating on phantom data.
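A minimal Swift sketch of that failure mode (the types, field names, and response are illustrative, not the app's actual code):

```swift
import Foundation

// A DTO with an optional field that the real API never returns.
struct AccountDTO: Decodable {
    let email: String
    let activeFlags: [String]?   // invented field, decoded leniently

    enum CodingKeys: String, CodingKey {
        case email
        case activeFlags = "active_flags"
    }
}

// A realistic response: no "active_flags" anywhere.
let realResponse = #"{"email": "me@example.com"}"#

// `try?` turns any decoding problem into nil — and an optional field
// that is simply missing doesn't even count as a problem.
let account = try? JSONDecoder().decode(AccountDTO.self,
                                        from: Data(realResponse.utf8))

let isPaid = !(account?.activeFlags ?? []).isEmpty
// isPaid is false for every account, paid or not. No error, no log, nothing.
```

Decoding succeeds, `activeFlags` comes back `nil`, and the paid check quietly reports every account as free.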
Anthropic itself, in its documentation about reducing hallucinations, says it without ambiguity:
“Claude can sometimes generate responses that contain fabricated information… presented in a confident, authoritative manner.”
Presented in an “authoritative” manner. That’s the key. It’s not that it doubts and gets it wrong. It asserts with total confidence something it just made up.
Why tests won’t save you
This is where it hurts. I had tests. Good tests. 90 tests in 12 suites. All green. So what?
The problem is that tests validate internal consistency, not correspondence with reality. If the DTO says the field is called `active_flags`, the fixture has an `active_flags`, and the test checks that the DTO parses the fixture… everything passes. Fiction against fiction. Glowing green.
It’s like a student making up a physics formula, writing an exam based on that formula, and giving themselves a perfect score. Each step is internally coherent. The result has no contact with reality.
```
Reality:  field X doesn't exist in the API
    ↓ (invisible)
DTO:      defines field X          ← invented
Fixture:  includes field X         ← invented to validate the DTO
Test:     fixture parses well      ← validates invention against invention
Result:   ✅ All green             ← coherent fiction
```
There's no point in this chain where anything is checked against the real API. That's the hole.
All current measures are preventive
If you look for what you can do to avoid this, the literature and experience offer you a list of measures. All are preventive:
| Measure | Type | Problem |
|---|---|---|
| Instructions in CLAUDE.md: “don’t make things up” | Preventive | Executed by the same agent that lies |
| Chain of thought: “cite your sources” | Preventive | Can cite made-up sources |
| Low temperature | Preventive | Reduces creativity, doesn’t eliminate invention |
| Grounding with documents | Preventive | Only if you have the right document |
| Explicit prohibitions | Preventive | LLM can “rationalize” exceptions |
| RAG (Retrieval Augmented Generation) | Preventive | Depends on the database being complete |
Notice the pattern? All try to prevent the AI from inventing. None detect when it has already done so.
It’s like putting up a “no stealing” sign in a store without cameras, alarms, or security guards. It might work. It might not. You have no way to know until you count the register.
What’s missing: reactive detection
What we need, and what doesn't exist today, are reactive measures: systems that detect invention after it happens, ideally before it reaches production.
Imagine:
- Contract testing against real APIs: a test that calls the real API (with test credentials) and compares the actual schema with the DTO. If the DTO has fields the API doesn't return, alarm.
- Fixture validation: a linter that checks that test fixtures correspond to captured real data, not data written by hand (or generated by AI). Something like snapshot testing, but against real production responses.
- Smoke tests with real data: before merging, a CI step that executes calls against an API sandbox and verifies that DTOs parse real data without silent loss.
- Anomaly detection in parsing: if an optional field returns `nil` 100% of the time in production, something smells fishy. A monitor that detects fields that are always `nil` and reports them as suspected inventions.
- Post-generation semantic diff: a second model (or the same one with a different prompt) that reviews the generated code and flags fields or structures it can't verify against known documentation.
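The first of those ideas doesn't need to be sophisticated to be useful. A sketch, assuming you have a captured real response on hand (the response body and the DTO's key list here are both illustrative):

```swift
import Foundation

// Poor man's contract test: compare the keys a DTO expects with the
// keys actually present in a captured real response.
// `capturedResponse` stands in for the saved output of a real call.
let capturedResponse = #"{"uuid": "org-123", "name": "Acme"}"#

// Keys the generated DTO claims the API returns (illustrative list).
let dtoExpectedKeys: Set<String> = ["uuid", "name", "active_flags"]

let realObject = try! JSONSerialization.jsonObject(
    with: Data(capturedResponse.utf8)) as! [String: Any]
let realKeys = Set(realObject.keys)

// Phantom fields: the DTO expects them, the API never sends them.
let phantomKeys = dtoExpectedKeys.subtracting(realKeys)
if !phantomKeys.isEmpty {
    print("⚠️ suspected invented fields: \(phantomKeys.sorted())")
}
```

A set difference is enough to surface `active_flags` as a field nobody has ever actually seen in a response.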
None of this exists today as a product. Some teams implement pieces by hand (contract testing is a known practice, for example). But there's no HallucinationTracker that you plug into your CI and that tells you "hey, this `active_flags` field doesn't appear in any documentation or real API response".

And yes, there's a paper from the University of Washington (HallucinationTracker) that proposes metrics for detecting confabulations. But it's in research phase, not something you can `brew install`.
The fundamental problem
The fundamental problem is deeply uncomfortable: the rules are executed by the same system that violates them.
When you put “don’t make up data” in your CLAUDE.md, you’re telling it to the same model that’s going to make up data. It’s like asking the defendant to also be the judge. It might work, but you have no guarantees.
Preventive measures (good instructions, low temperature, grounding) reduce the probability of invention. But they don’t eliminate it. And when it happens, no siren sounds.
What we need is for detection to be done by something external to the model: a test against real data, a schema linter, a production monitor. Something the model can’t rationalize or dodge, because it’s not the model executing it.
Until that exists as something mature and easy to use, we’re in the same situation as computer security before firewalls: we know there’s a problem, we have partial measures, and we trust that “it won’t happen to me”.
What I do in the meantime
Being honest, these are the measures that work for me today. None is perfect:
- Read generated code like it's from a stranger. Don't assume it's correct because it compiles. This is exhausting, but it's what we have.
- Ask "where did you get this?" Especially for API fields, package names, and any data I can't verify by looking at the code.
- Manual contract tests. Before accepting a DTO as good, make a real call to the API and compare. It's tedious. It's necessary.
- Distrust tests that pass on the first try. If the AI generates code and tests and everything passes immediately, that's not a good sign: it's a sign that it probably validated fiction against fiction.
- Capture real responses as fixtures. Instead of letting the AI write fixtures, save real API responses and use them as fixtures. If the DTO doesn't parse the real response, it breaks immediately.
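That last measure works because a strict DTO decoded against a real capture fails loudly instead of degrading gracefully. A sketch, with illustrative field names and a string standing in for a fixture file saved from a real call:

```swift
import Foundation

// The fixture is saved from a real API call, not written by hand.
let realFixture = #"{"uuid": "org-123", "name": "Acme"}"#

// Strict DTO: no optionals for fields you actually depend on, so a
// missing field throws instead of silently becoming nil.
struct OrgDTO: Decodable {
    let uuid: String
    let name: String
    let activeFlags: [String]   // invented field, deliberately non-optional

    enum CodingKeys: String, CodingKey {
        case uuid, name
        case activeFlags = "active_flags"
    }
}

do {
    _ = try JSONDecoder().decode(OrgDTO.self, from: Data(realFixture.utf8))
    print("fixture parses")
} catch {
    // With a real fixture, the invented field breaks immediately:
    print("DTO doesn't match reality: \(error)")
}
```

Against the real capture, decoding throws `keyNotFound` for `active_flags` the first time anyone runs the test, instead of shipping a silent `nil`.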
These measures are manual, slow, and depend on my discipline. They don’t scale. But today they’re the best I have.
What should exist tomorrow
If someone is looking for a real problem to solve, here’s one:
A post-generation verification system that’s external to the model, automatic, and integrates into CI/CD.
It doesn’t need to be perfect. It needs to exist. Someone should build the equivalent of a linter for hallucinations: something that analyzes generated code, crosses it with verifiable sources (API documentation, OpenAPI schemas, captured responses), and flags what doesn’t add up.
Today, if your AI invents an API field and wraps it in coherent tests, the only defense is you reading the code with a critical eye. Tomorrow, there should be a machine that does it for you.
But today there isn’t one. And that’s the most worrying thing of all.
Related: This post is the third chapter of an involuntary series. First was the 44 invented emails (the AI that acts without permission). Then MEMORY.md (the AI that forgets what it learned). Now, the AI that invents data and wraps it in a fiction that passes tests. Three different failures, one common denominator: we trust too much in a system that doesn’t understand what it’s doing.