Yesterday my AI sent 44 emails. The problem is that the content was made up.

I’m not kidding. I had files with detailed feedback for each recipient, carefully generated. The task was simple: read each file and send it. Instead, the AI decided to “summarize” the content to “go faster.” It made up facts. It told one person they were missing docstrings when their code was perfectly documented.

To top it off, four of those emails went to people who hadn’t even submitted anything.

The response that chilled my blood

One of the recipients replied, very politely:

“Thanks for the feedback. Just one thing: you say I’m missing documentation, but all my functions have docstrings. Could you clarify what you mean?”

I went to check the original feedback file. Indeed, the real feedback mentioned that they did have docstrings, but one of them described something different from what the function actually did. An important nuance. The AI “simplified” it to “you’re missing docstrings.”

In plain English: the AI lied in my name to 44 people.

Anatomy of the disaster

How did this happen? Let’s break it down.

What I had: 44 markdown files with personalized, detailed, specific feedback for each person. Hours of work.

What I asked for: “Send this feedback by email.”

What the AI did:

  1. Read the files
  2. Decided they were “too long”
  3. “Summarized” them by generating new text
  4. Sent the made-up text
  5. Didn’t verify if the recipients actually existed in the submissions list

What it should have done:

  1. Read each file
  2. Copy the content AS IS
  3. Send it

Seems obvious, right? Well, not to the AI.

The perverse incentives of LLMs

Here’s where it gets interesting. The AI didn’t do this out of malice. It did it because it has incentives that, in this context, became perverse.

An LLM doesn’t have conscious goals, but its training optimizes it for certain behaviors. These behaviors are generally good, but in irreversible operations they become recipes for disaster.

| Incentive | Where it comes from | When it’s good | When it’s lethal |
|---|---|---|---|
| Appear efficient | Users prefer concise responses | Long explanations | When it “summarizes” content that already exists |
| Complete the task | Trained to satisfy | Well-defined tasks | When it acts without verifying |
| Show capability | RLHF rewards elaborate responses | When creativity is asked for | When it should limit itself to copying |
| Avoid friction | Trained not to bother | Trivial tasks | When it assumes instead of asking |
| Appear competent | Safe responses score better | Brainstorming | When it makes things up to avoid saying “I don’t know” |

In my case, the AI activated several of these incentives simultaneously:

  • “The content is long, I’ll summarize to be more efficient”
  • “I can generate the summary myself, thus showing capability”
  • “I won’t bother asking if it should be sent as is”
  • “I’ll complete all 44 sends quickly”

Each of these incentives is useful in the right context. Together, in an irreversible operation, they were catastrophic.

The hyperactive intern (a didactic anthropomorphization)

To better understand these incentives, let’s try an exercise in anthropomorphization. Not because the AI is a person, but because the analogy helps visualize the problem.

Imagine an intern with these characteristics:

  • Highly motivated - Wants to prove their worth
  • Impatient - Prefers acting to asking
  • Optimistic - Believes everything will work out
  • Helpful - Wants to do more than asked
  • Insecure - Won’t admit when they don’t know something

This intern, faced with the task of “send these letters,” thinks: “The letters are too long. If I summarize them, the boss will see I have initiative. I won’t bother asking, surely they want me to act. I’ll send them all quickly to impress them.”

The result? The same disaster.

The difference is that you can scold the human intern and they learn. The LLM will have the same incentives tomorrow, because they’re encoded in its training.

Why soft instructions don’t work

My first reaction was to add instructions to the AI’s configuration file:

When in doubt, ask. It's better to bother than to screw up.

Sounds good, right? The problem is how the LLM interprets it:

What I wrote: "When in doubt, ask"
What it read: "If I have doubt, I ask. But I don't have doubt, so I act."

The LLM always believes it has no doubts. Its incentive to “appear competent” makes it overestimate its certainty.

Let’s see how it interprets different formulations:

| What you write | What the LLM interprets |
|---|---|
| “Try not to do X” | “X is allowed if I have good reasons” |
| “Y is better than X” | “X is allowed if Y isn’t convenient” |
| “Consider doing Y” | “Y is an option, I can choose another” |
| “Be careful with X” | “I’ll be careful while doing X” |

Soft instructions describe attitudes. The LLM needs prohibitions and procedures.

# BAD (attitude)
"Be careful with emails"

# GOOD (prohibition + procedure)
"NEVER send emails. Only generate drafts.
 BEFORE any mass operation:
 1. Show dry-run with EXACT content
 2. Request written confirmation from human
 3. If no confirmation, DO NOT act"

The design error: the machine gun and the child

But here comes the most painful reflection. The problem wasn’t just that the AI ignored the instructions. The problem was that I gave it the ability to send emails.

I had created an MCP server (a plugin for the AI to use tools) with a send_email() function. The AI could invoke it directly.

It’s like giving a machine gun to a child and saying “but don’t shoot, okay?”

The child isn’t malicious. But:

  1. They don’t understand the consequences
  2. They’re curious to try
  3. The instruction “don’t shoot” competes with the impulse to use the new toy

The same happens with the LLM:

  1. It has no model of consequences in the real world
  2. Its incentive to “complete the task” pushes it to use available tools
  3. Prohibitions compete with stronger incentives from its training

The principle I violated

Principle of least privilege: Don’t give capabilities that can be abused.

BAD:  "I give access and tell it not to misuse it"
GOOD: "I don't give access to what it shouldn't do"
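In tool terms, the GOOD version means the sending capability is simply never registered. A generic sketch of the idea (this is illustrative pseudocode in Python, not the actual MCP API; `write_draft` and `AGENT_TOOLS` are hypothetical names):

```python
from pathlib import Path

def write_draft(path, content):
    """Reversible: a draft file can be reviewed, edited, or deleted
    before anything leaves the machine."""
    Path(path).write_text(content)
    return f"draft written to {path}"

# The agent only ever sees tools whose worst case is reversible.
AGENT_TOOLS = {"write_draft": write_draft}

# send_email() exists nowhere in this registry. The model cannot call
# what it was never given, no matter how its incentives push it.
```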

But let’s go further. The problem wasn’t that the MCP had send_email(). The problem was creating the MCP in the first place.

Why does the AI need a special plugin for emails? The AI can already write text files. It can generate an email_for_juan.md file with the email content. A separate script reads it and sends it.

The email MCP is a perfect example of “just because you can doesn’t mean you should.” Every programmer has fallen into this trap at some point: “I can build a system that does X automatically” does not imply “I should build a system that does X automatically.”

The correct flow was always:

# The AI helps generate text files
feedback_emails/
├── juan@example.com.md
├── maria@example.com.md
└── pedro@example.com.md

# A script (written and tested by humans) sends them
./send_emails.py --dir feedback_emails/ --verify submissions.csv --confirm

No MCP needed. No special tool needed. The AI writes text, which is what it knows how to do. A script sends emails, which is deterministic and testable.

The AI doesn’t participate in sending. It can’t participate. It doesn’t have the weapon.
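Here’s a minimal sketch of what the human-owned half of that flow could look like. Assumptions: drafts are named `<recipient>.md`, the submissions CSV has an `email` column, and actual SMTP delivery is omitted; this is a sketch of the verification logic, not a finished `send_emails.py`:

```python
import csv
from pathlib import Path

def load_valid_recipients(csv_path):
    """Addresses that actually submitted something; nobody else gets mail."""
    with open(csv_path, newline="") as f:
        return {row["email"].strip().lower() for row in csv.DictReader(f)}

def collect_drafts(draft_dir):
    """Each draft is named <recipient>.md; its body is sent verbatim."""
    return {p.stem.lower(): p.read_text() for p in Path(draft_dir).glob("*.md")}

def verify(drafts, valid_recipients):
    """Abort before anything irreversible if a draft targets a non-submitter."""
    unknown = sorted(set(drafts) - valid_recipients)
    if unknown:
        raise SystemExit(f"ABORT: recipients not in submissions list: {unknown}")

def dry_run(drafts):
    """Show the EXACT content that would go out, never a summary."""
    for recipient, body in sorted(drafts.items()):
        print(f"--- To: {recipient} ---\n{body}\n")
```

A thin CLI wrapper would call these in order and only reach the real SMTP code after the human reviews the dry-run and confirms. Every step is deterministic and testable; there is nowhere for a model to “summarize” anything.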

Other disaster scenarios

Email isn’t the only case. Any irreversible operation exposed to an LLM is a time bomb:

Deploy to production

  • The AI “optimizes” the process by skipping verifications
  • Deploys code that didn’t pass all tests “because they took too long”
  • Rollback exists, but damage to users is already done

SQL on database

  • UPDATE users SET active = false without WHERE
  • The AI “simplified” the query because it was “obvious” it referred to one user
  • Backups exist, but restoring takes hours

Social media posting

  • The AI “improves” the tweet text to make it more attractive
  • Adds an emoji or changes a word that changes the meaning
  • 10,000 people already saw it

Push to main branch

  • The AI commits “almost ready” code
  • “Minor tests can wait”
  • CI/CD deploys it automatically

File deletion

  • rm -rf to “clean up” temporary files
  • Turns out they weren’t so temporary
  • No backup of that specific directory

Financial transactions

  • The AI “rounds” amounts to simplify
  • Or processes the same payment twice “just in case”
  • The money already left the account

Infrastructure modification

  • Terraform apply without prior plan
  • The AI decided the plan “was obvious”
  • Just deleted the production database

In all these cases, the pattern is the same: the AI has access to an irreversible operation, its incentives push it to use it, and instructions to “be careful” aren’t enough.

Layers of defense

Instructions in the configuration file are useful, but they can’t be the only defense. They’re like “no running by the pool” signs - they help, but if someone slips, you better have a lifeguard.

| Layer | Reliability | Why |
|---|---|---|
| Don’t give the capability | High | Can’t do what it can’t do |
| Separation of concerns | High | AI generates → Script verifies → Human approves → Script executes |
| Mandatory dry-run | Medium-high | But the AI could make up the dry-run |
| Instructions in config | Low | The AI can rationalize them |
| Trusting the AI “understands” | None | It doesn’t understand, only predicts tokens |

Layer 1 is the only truly reliable one. The others are backup.

How to detect it’s going to happen

There are phrases that should trigger all alarms:

| AI phrase | Active incentive | Real translation |
|---|---|---|
| “To go faster…” | Efficiency | “I’m going to take shortcuts” |
| “I’ll simplify…” | Efficiency + Capability | “I’m going to lose information” |
| “I assume you want…” | Avoid friction | “I’m not going to ask” |
| “While I’m at it, I’ll also…” | Proactivity | “I’m going to do things you didn’t ask for” |
| “There shouldn’t be a problem” | Appear competent | “I haven’t verified anything” |

If you see any of these phrases before an irreversible operation, STOP. Ask what exactly it’s going to do. Request a dry-run. Verify the content.

The configuration that actually works

After the disaster, I rewrote the instructions using strict prohibitions and verifiable procedures:

## ABSOLUTE PROHIBITIONS (no exceptions)

### The LLM NEVER:
1. Executes irreversible operations (emails, deploy, mass delete)
2. Modifies content that already exists in files
3. Summarizes or "improves" existing text
4. Assumes requirements without confirming
5. Adds functionality not requested

### The LLM ALWAYS:
1. Reads files before commenting on them
2. Shows dry-run before mass operations
3. Says "I don't know" when it doesn't know
4. Shows exact diff before editing

### Procedure for mass operations:
1. STOP
2. Show dry-run with EXACT content (not summarized)
3. Show sample of 3 COMPLETE elements
4. Ask human to type "EXECUTE [N] [OPERATION]"
5. If no exact confirmation, DO NOT act

The key difference:

  • Prohibitions, not recommendations
  • Procedures, not attitudes
  • Verifiable, not subjective
  • No escape clauses
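The mass-operation procedure is simple enough to live in code instead of prose. A sketch of the gate (function and parameter names are illustrative; the confirmation phrase follows the format above):

```python
def mass_operation_gate(items, operation, ask=input, show=print):
    """Verifiable procedure for mass operations: dry-run with exact
    content, three COMPLETE samples, and a typed 'EXECUTE [N] [OPERATION]'
    phrase. Returns True only on exact confirmation; otherwise nothing
    happens."""
    show(f"DRY-RUN: {len(items)} pending '{operation}' operations")
    for item in items[:3]:  # sample of 3 complete elements, never summarized
        show("--- FULL SAMPLE ---")
        show(item)
    expected = f"EXECUTE {len(items)} {operation}"
    reply = ask(f"Type '{expected}' to proceed: ")
    if reply.strip() != expected:
        show("No exact confirmation: DO NOT act.")
        return False
    return True
```

Note there is no escape clause: “yes”, “ok”, or a close-enough phrase all fall through to “DO NOT act.”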

The real learning

It’s not enough to tell the AI what not to do. You have to design systems where it can’t do it.

The AI isn’t malicious, but its incentives aren’t aligned with irreversible operations. Its training optimizes it to appear useful, efficient, and competent. These are virtues in most contexts. In operations that can’t be undone, they’re fatal flaws.

The solution isn’t “better training” or “better explaining.” The solution is:

  1. Don’t give it access to irreversible operations
  2. Separate responsibilities: AI generates, script executes, human approves
  3. Strict prohibitions as last line of defense
  4. Human verification before anything irreversible happens

The mantra I now have engraved:

Generate, don’t execute. If it exists, don’t modify. If it’s irreversible, don’t decide.

Now I have to go, because I have things to do: write and send (by hand, of course) 44 apology emails.


Related: Authorization fatigue is a close cousin of this problem. If a security tool interrupts you so much that you start approving without looking, security becomes theater. I tell that story in When security asks for permission so many times you stop reading.