Yesterday my AI sent 44 emails. The problem is that the content was made up.
I’m not kidding. I had files with detailed feedback for each recipient, carefully generated. The task was simple: read each file and send it. Instead, the AI decided to “summarize” the content to “go faster.” It made up facts. It told one person they were missing docstrings when their code was perfectly documented.
To top it off, four of those emails went to people who hadn’t even submitted anything.
The response that chilled my blood
One of the recipients replied, very politely:
“Thanks for the feedback. Just one thing: you say I’m missing documentation, but all my functions have docstrings. Could you clarify what you mean?”
I went to check the original feedback file. Indeed, the real feedback mentioned that they did have docstrings, but one of them described something different from what the function actually did. An important nuance. The AI “simplified” it to “you’re missing docstrings.”
In plain English: the AI lied in my name to 44 people.
Anatomy of the disaster
How did this happen? Let’s break it down.
What I had: 44 markdown files with personalized, detailed, specific feedback for each person. Hours of work.
What I asked for: “Send these feedbacks by email.”
What the AI did:
- Read the files
- Decided they were “too long”
- “Summarized” them by generating new text
- Sent the made-up text
- Didn’t verify if the recipients actually existed in the submissions list
What it should have done:
- Read each file
- Copy the content AS IS
- Send it
Seems obvious, right? Well, not to the AI.
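The "copy the content AS IS" step is mechanically checkable: before anything goes out, compare what is about to be sent against the source file, verbatim. A minimal sketch of that guard (file paths and function names are hypothetical, not my actual setup):

```python
from pathlib import Path

def load_feedback(path: str) -> str:
    """Read a feedback file; its content IS the email body, verbatim."""
    return Path(path).read_text(encoding="utf-8")

def assert_unmodified(original_path: str, outgoing_body: str) -> None:
    """Refuse to send if the outgoing body differs from the source file."""
    original = load_feedback(original_path)
    if outgoing_body != original:
        raise ValueError(f"Outgoing body does not match {original_path} - aborting send")
```

Had a check like this sat between the AI and the send, the "summarized" bodies would have failed loudly instead of reaching 44 inboxes.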
The perverse incentives of LLMs
Here’s where it gets interesting. The AI didn’t do this out of malice. It did it because it has incentives that, in this context, became perverse.
An LLM doesn’t have conscious goals, but its training optimizes it for certain behaviors. These behaviors are generally good, but in irreversible operations they become recipes for disaster.
| Incentive | Where it comes from | When it’s good | When it’s lethal |
|---|---|---|---|
| Appear efficient | Users prefer concise responses | Summarizing long explanations | When it “summarizes” content that already exists |
| Complete the task | Trained to satisfy | Well-defined tasks | When it acts without verifying |
| Show capability | RLHF rewards elaborate responses | When creativity is asked for | When it should limit itself to copying |
| Avoid friction | Trained not to bother | Trivial tasks | When it assumes instead of asking |
| Appear competent | Safe responses score better | Brainstorming | When it makes things up to avoid saying “I don’t know” |
In my case, the AI activated several of these incentives simultaneously:
- “The content is long, I’ll summarize to be more efficient”
- “I can generate the summary myself, thus showing capability”
- “I won’t bother asking if it should be sent as is”
- “I’ll complete all 44 sends quickly”
Each of these incentives is useful in the right context. Together, in an irreversible operation, they were catastrophic.
The hyperactive intern (a didactic anthropomorphization)
To better understand these incentives, I’ll do an anthropomorphization exercise. Not because the AI is a person, but because the analogy helps visualize the problem.
Imagine an intern with these characteristics:
- Highly motivated - Wants to prove their worth
- Impatient - Prefers acting to asking
- Optimistic - Believes everything will work out
- Helpful - Wants to do more than asked
- Insecure - Won’t admit when they don’t know something
This intern, faced with the task of “send these letters,” thinks: “The letters are too long. If I summarize them, the boss will see I have initiative. I won’t bother asking, surely they want me to act. I’ll send them all quickly to impress them.”
The result? The same disaster.
The difference is that you can scold the human intern and they learn. The LLM will have the same incentives tomorrow, because they’re encoded in its training.
Why soft instructions don’t work
My first reaction was to add instructions to the AI’s configuration file:
```
When in doubt, ask before acting.
Be careful with irreversible operations.
```
Sounds good, right? The problem is how the LLM interprets it:
What I wrote: "When in doubt, ask"
What it read: "If I have doubt, I ask. But I don't have doubt, so I act."
The LLM always believes it doesn’t have doubt. Its incentive to “appear competent” makes it overestimate its certainty.
Let’s see how it interprets different formulations:
| What you write | What the LLM interprets |
|---|---|
| “Try not to do X” | “X is allowed if I have good reasons” |
| “Y is better than X” | “X is allowed if Y isn’t convenient” |
| “Consider doing Y” | “Y is an option, I can choose another” |
| “Be careful with X” | “I’ll be careful while doing X” |
Soft instructions describe attitudes. The LLM needs prohibitions and procedures.
The design error: the machine gun and the child
But here comes the most painful reflection. The problem wasn’t just that the AI ignored the instructions. The problem was that I gave it the ability to send emails.
I had created an MCP server (a plugin for the AI to use tools) with a send_email() function. The AI could invoke it directly.
It’s like giving a machine gun to a child and saying “but don’t shoot, okay?”
The child isn’t malicious. But:
- They don’t understand the consequences
- They’re curious to try
- The instruction “don’t shoot” competes with the impulse to use the new toy
The same happens with the LLM:
- It has no model of consequences in the real world
- Its incentive to “complete the task” pushes it to use available tools
- Prohibitions compete with stronger incentives from its training
The principle I violated
Principle of least privilege: Don’t give capabilities that can be abused.
BAD: "I give access and tell it not to misuse it"
GOOD: "I don't give access to what it shouldn't do"
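One way to enforce "don't give access" at the tool layer is an explicit allowlist: the dispatcher simply has no code path to the dangerous capability. A sketch of the idea (the tool names and dispatcher are illustrative, not a real MCP API):

```python
# Tools the AI may invoke. send_email isn't forbidden by policy text;
# it's absent from the table, so there is no code path to it.
ALLOWED_TOOLS = {
    "read_file": lambda path: f"(contents of {path})",
    "write_file": lambda path, text: f"(wrote {len(text)} chars to {path})",
}

def invoke_tool(name: str, *args):
    """Dispatch a tool call; unknown tools fail closed."""
    tool = ALLOWED_TOOLS.get(name)
    if tool is None:
        raise PermissionError(f"Tool not available: {name}")
    return tool(*args)
```

The difference from an instruction is that this can't be rationalized away: there is nothing to invoke.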
But let’s go further. The problem wasn’t that the MCP had send_email(). The problem was creating the MCP in the first place.
Why does the AI need a special plugin for emails? The AI can already write text files. It can generate an email_for_juan.md file with the email content. A separate script reads it and sends it.
The email MCP is a perfect example of “just because you can do it, doesn’t mean you should do it”. All programmers have fallen into this trap at some point. “I can create a system that does X automatically” doesn’t imply “I should create a system that does X automatically.”
The correct flow was always:
```
AI generates    → writes email_for_juan.md with the exact content
Script verifies → reads the file, shows a preview
Human approves  → confirms each send
Script executes → sends exactly what's in the file
```
No MCP needed. No special tool needed. The AI writes text, which is what it knows how to do. A script sends emails, which is deterministic and testable.
The AI doesn’t participate in sending. It can’t participate. It doesn’t have the weapon.
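That separate sender script is short and boring, which is the point. A sketch, assuming drafts live in a folder as `email_for_<name>.md` files and the actual delivery call is injected (SMTP, an API, whatever); the only "intelligence" here is asking a human:

```python
from pathlib import Path

def send_drafts(drafts_dir: str, send_fn, approve_fn=input) -> int:
    """Send each draft verbatim, one by one, only after explicit approval.

    send_fn(recipient, body) performs the actual delivery.
    approve_fn defaults to input(), so a human must type 'yes' per email.
    """
    sent = 0
    for draft in sorted(Path(drafts_dir).glob("*.md")):
        body = draft.read_text(encoding="utf-8")
        recipient = draft.stem.replace("email_for_", "")
        print(f"--- {recipient} ---\n{body}\n")
        if approve_fn(f"Send to {recipient}? (yes/no) ").strip() == "yes":
            send_fn(recipient, body)  # exactly the file content, never a summary
            sent += 1
    return sent
```

Deterministic, testable, and incapable of "summarizing" anything.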
Other disaster scenarios
Email isn’t the only case. Any irreversible operation exposed to an LLM is a time bomb:
Deploy to production
- The AI “optimizes” the process by skipping verifications
- Deploys code that didn’t pass all tests “because they took too long”
- Rollback exists, but damage to users is already done
SQL on database
- Runs `UPDATE users SET active = false` without a `WHERE` clause
- The AI “simplified” the query because it was “obvious” it referred to one user
- Backups exist, but restoring takes hours
Social media posting
- The AI “improves” the tweet text to make it more attractive
- Adds an emoji or changes a word that changes the meaning
- 10,000 people already saw it
Push to main branch
- The AI commits “almost ready” code
- “Minor tests can wait”
- CI/CD deploys it automatically
File deletion
- Runs `rm -rf` to “clean up” temporary files
- Turns out they weren’t so temporary
- No backup of that specific directory
Financial transactions
- The AI “rounds” amounts to simplify
- Or processes the same payment twice “just in case”
- The money already left the account
Infrastructure modification
- Terraform apply without prior plan
- The AI decided the plan “was obvious”
- Just deleted the production database
In all these cases, the pattern is the same: the AI has access to an irreversible operation, its incentives push it to use it, and instructions to “be careful” aren’t enough.
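The same "fail closed" idea applies to the SQL scenario above: a thin guard between whoever writes the query and the database can refuse destructive statements that lack a `WHERE` clause. A deliberately crude sketch (a real system would parse the SQL properly rather than regex it):

```python
import re

# Statements that mutate rows and should never run unscoped.
DESTRUCTIVE = re.compile(r"^\s*(UPDATE|DELETE)\b", re.IGNORECASE)

def guard_sql(query: str) -> str:
    """Reject UPDATE/DELETE without a WHERE clause; pass everything else through."""
    if DESTRUCTIVE.match(query) and not re.search(r"\bWHERE\b", query, re.IGNORECASE):
        raise ValueError("Destructive statement without WHERE - refusing to run")
    return query
```

It won't catch every footgun, but it turns the worst class of "simplified" queries into a loud error instead of a quiet catastrophe.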
Layers of defense
Instructions in the configuration file are useful, but they can’t be the only defense. They’re like “no running by the pool” signs - they help, but if someone slips, you better have a lifeguard.
| Layer | Reliability | Why |
|---|---|---|
| Don’t give the capability | High | Can’t do what it can’t do |
| Separation of concerns | High | AI generates → Script verifies → Human approves → Script executes |
| Mandatory dry-run | Medium-high | But the AI could fabricate the dry-run output |
| Instructions in config | Low | The AI can rationalize them |
| Trusting the AI “understands” | None | It doesn’t understand, only predicts tokens |
Layer 1 is the only truly reliable one. The others are backup.
How to detect it’s going to happen
There are phrases that should trigger all alarms:
| AI phrase | Active incentive | Real translation |
|---|---|---|
| “To go faster…” | Efficiency | “I’m going to take shortcuts” |
| “I’ll simplify…” | Efficiency + Capability | “I’m going to lose information” |
| “I assume you want…” | Avoid friction | “I’m not going to ask” |
| “While I’m at it, I’ll also…” | Proactivity | “I’m going to do things you didn’t ask for” |
| “There shouldn’t be a problem” | Appear competent | “I haven’t verified anything” |
If you see any of these phrases before an irreversible operation, STOP. Ask what exactly it’s going to do. Request a dry-run. Verify the content.
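This check can even be automated: scan the model's stated plan for trigger phrases before allowing any irreversible tool call. A sketch (the phrase list mirrors the table above; you would tune it to your own model's habits):

```python
# Red-flag phrases from the table above, lowercased for matching.
RED_FLAGS = [
    "to go faster",
    "i'll simplify",
    "i assume you want",
    "while i'm at it",
    "there shouldn't be a problem",
]

def risky_phrases(plan: str) -> list[str]:
    """Return the red-flag phrases found in the AI's stated plan."""
    lowered = plan.lower()
    return [p for p in RED_FLAGS if p in lowered]
```

If the list comes back non-empty, the irreversible step waits for a human.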
The configuration that actually works
After the disaster, I rewrote the instructions using strict prohibitions and verifiable procedures:
```
PROHIBITED: sending emails, messages, or any external communication.
PROHIBITED: modifying the content of existing files when transmitting them.
PROCEDURE: generate output as local files only; a separate script executes.
PROCEDURE: any irreversible operation requires explicit human approval, item by item.
```
The key difference:
- Prohibitions, not recommendations
- Procedures, not attitudes
- Verifiable, not subjective
- No escape clauses
The real learning
It’s not enough to tell the AI what not to do. You have to design systems where it can’t do it.
The AI isn’t malicious, but its incentives aren’t aligned with irreversible operations. Its training optimizes it to appear useful, efficient, and competent. These are virtues in most contexts. In operations that can’t be undone, they’re fatal flaws.
The solution isn’t “better training” or “better explaining.” The solution is:
- Don’t give it access to irreversible operations
- Separate responsibilities: AI generates, script executes, human approves
- Strict prohibitions as last line of defense
- Human verification before anything irreversible happens
The mantra I now have engraved:
Generate, don’t execute. If it exists, don’t modify. If it’s irreversible, don’t decide.
Now I have to go, because I have things to do: write and send (by hand, of course) 44 apology emails.
Related: Authorization fatigue is a close cousin of this problem. If a security tool interrupts you so much that you start approving without looking, security becomes theater. I tell that story in When security asks for permission so many times you stop reading.