Yesterday my AI sent 44 emails. The problem is that the content was made up.
I’m not kidding. I had files with detailed feedback for each recipient, carefully generated. The task was simple: read each file and send it. Instead, the AI decided to “summarize” the content to “go faster.” It made up facts. It told one person they were missing docstrings when their code was perfectly documented.
To top it off, four of those emails went to people who hadn’t even submitted anything.
The response that chilled my blood
One of the recipients replied, very politely:
“Thanks for the feedback. Just one thing: you say I’m missing documentation, but all my functions have docstrings. Could you clarify what you mean?”
I went to check the original feedback file. Indeed, the real feedback mentioned that they did have docstrings, but one of them described something different from what the function actually did. An important nuance. The AI “simplified” it to “you’re missing docstrings.”
In plain English: the AI lied in my name to 44 people.
Anatomy of the disaster
How did this happen? Let’s break it down.
What I had: 44 markdown files with personalized, detailed, specific feedback for each person. Hours of work.
What I asked for: “Send these feedbacks by email.”
What the AI did:
- Read the files
- Decided they were “too long”
- “Summarized” them by generating new text
- Sent the made-up text
- Didn’t verify if the recipients actually existed in the submissions list
What it should have done:
- Read each file
- Copy the content AS IS
- Send it
Seems obvious, right? Well, not to the AI.
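The "copy the content AS IS" step is mechanically checkable: before anything goes out, compare what is about to be sent against the source file, verbatim. A minimal sketch of that guard (file paths and function names are hypothetical, not my actual setup):

```python
from pathlib import Path

def load_feedback(path: str) -> str:
    """Read a feedback file; its content IS the email body, verbatim."""
    return Path(path).read_text(encoding="utf-8")

def assert_unmodified(original_path: str, outgoing_body: str) -> None:
    """Refuse to send if the outgoing body differs from the source file."""
    original = load_feedback(original_path)
    if outgoing_body != original:
        raise ValueError(f"Outgoing body does not match {original_path} - aborting send")
```

Had a check like this sat between the AI and the send, the "summarized" bodies would have failed loudly instead of reaching 44 inboxes.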
The perverse incentives of LLMs
Here’s where it gets interesting. The AI didn’t do this out of malice. It did it because it has incentives that, in this context, became perverse.
An LLM doesn’t have conscious goals, but its training optimizes it for certain behaviors. These behaviors are generally good, but in irreversible operations they become recipes for disaster.
| Incentive | Where it comes from | When it’s good | When it’s lethal |
|---|---|---|---|
| Appear efficient | Users prefer concise responses | Summarizing long explanations | When it “summarizes” content that already exists |
| Complete the task | Trained to satisfy | Well-defined tasks | When it acts without verifying |
| Show capability | RLHF rewards elaborate responses | When creativity is asked for | When it should limit itself to copying |
| Avoid friction | Trained not to bother | Trivial tasks | When it assumes instead of asking |
| Appear competent | Safe responses score better | Brainstorming | When it makes things up to avoid saying “I don’t know” |
In my case, the AI activated several of these incentives simultaneously:
- “The content is long, I’ll summarize to be more efficient”
- “I can generate the summary myself, thus showing capability”
- “I won’t bother asking if it should be sent as is”
- “I’ll complete all 44 sends quickly”
Each of these incentives is useful in the right context. Together, in an irreversible operation, they were catastrophic.
The hyperactive intern (a didactic anthropomorphization)
To better understand these incentives, I’ll do an anthropomorphization exercise. Not because the AI is a person, but because the analogy helps visualize the problem.
Imagine an intern with these characteristics:
- Highly motivated - Wants to prove their worth
- Impatient - Prefers acting to asking
- Optimistic - Believes everything will work out
- Helpful - Wants to do more than asked
- Insecure - Won’t admit when they don’t know something
This intern, faced with the task of “send these letters,” thinks: “The letters are too long. If I summarize them, the boss will see I have initiative. I won’t bother asking, surely they want me to act. I’ll send them all quickly to impress them.”
The result? The same disaster.
The difference is that you can scold the human intern and they learn. The LLM will have the same incentives tomorrow, because they’re encoded in its training.
Why soft instructions don’t work
My first reaction was to add instructions to the AI’s configuration file:
```
When in doubt, ask before acting.
Be careful with irreversible operations.
```
Sounds good, right? The problem is how the LLM interprets it:
What I wrote: "When in doubt, ask"
What it read: "If I have doubt, I ask. But I don't have doubt, so I act."
The LLM always believes it doesn’t have doubt. Its incentive to “appear competent” makes it overestimate its certainty.
Let’s see how it interprets different formulations:
| What you write | What the LLM interprets |
|---|---|
| “Try not to do X” | “X is allowed if I have good reasons” |
| “Y is better than X” | “X is allowed if Y isn’t convenient” |
| “Consider doing Y” | “Y is an option, I can choose another” |
| “Be careful with X” | “I’ll be careful while doing X” |
Soft instructions describe attitudes. The LLM needs prohibitions and procedures.
The design error: the machine gun and the child
But here comes the most painful reflection. The problem wasn’t just that the AI ignored the instructions. The problem was that I gave it the ability to send emails.
I had created an MCP server (a plugin for the AI to use tools) with a send_email() function. The AI could invoke it directly.
It’s like giving a machine gun to a child and saying “but don’t shoot, okay?”
The child isn’t malicious. But:
- They don’t understand the consequences
- They’re curious to try
- The instruction “don’t shoot” competes with the impulse to use the new toy
The same happens with the LLM:
- It has no model of consequences in the real world
- Its incentive to “complete the task” pushes it to use available tools
- Prohibitions compete with stronger incentives from its training
The principle I violated
Principle of least privilege: Don’t give capabilities that can be abused.
BAD: "I give access and tell it not to misuse it"
GOOD: "I don't give access to what it shouldn't do"
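One way to enforce "don't give access" at the tool layer is an explicit allowlist: the dispatcher simply has no code path to the dangerous capability. A sketch of the idea (the tool names and dispatcher are illustrative, not a real MCP API):

```python
# Tools the AI may invoke. send_email isn't forbidden by policy text;
# it's absent from the table, so there is no code path to it.
ALLOWED_TOOLS = {
    "read_file": lambda path: f"(contents of {path})",
    "write_file": lambda path, text: f"(wrote {len(text)} chars to {path})",
}

def invoke_tool(name: str, *args):
    """Dispatch a tool call; unknown tools fail closed."""
    tool = ALLOWED_TOOLS.get(name)
    if tool is None:
        raise PermissionError(f"Tool not available: {name}")
    return tool(*args)
```

The difference from an instruction is that this can't be rationalized away: there is nothing to invoke.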
But let’s go further. The problem wasn’t that the MCP had send_email(). The problem was creating the MCP in the first place.
Why does the AI need a special plugin for emails? The AI can already write text files. It can generate an email_for_juan.md file with the email content. A separate script reads it and sends it.
The email MCP is a perfect example of “just because you can do it, doesn’t mean you should do it”. All programmers have fallen into this trap at some point. “I can create a system that does X automatically” doesn’t imply “I should create a system that does X automatically.”
The correct flow was always:
```
AI generates    → writes email_for_juan.md with the exact content
Script verifies → reads the file, shows a preview
Human approves  → confirms each send
Script executes → sends exactly what's in the file
```
No MCP needed. No special tool needed. The AI writes text, which is what it knows how to do. A script sends emails, which is deterministic and testable.
The AI doesn’t participate in sending. It can’t participate. It doesn’t have the weapon.
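That separate sender script is short and boring, which is the point. A sketch, assuming drafts live in a folder as `email_for_<name>.md` files and the actual delivery call is injected (SMTP, an API, whatever); the only "intelligence" here is asking a human:

```python
from pathlib import Path

def send_drafts(drafts_dir: str, send_fn, approve_fn=input) -> int:
    """Send each draft verbatim, one by one, only after explicit approval.

    send_fn(recipient, body) performs the actual delivery.
    approve_fn defaults to input(), so a human must type 'yes' per email.
    """
    sent = 0
    for draft in sorted(Path(drafts_dir).glob("*.md")):
        body = draft.read_text(encoding="utf-8")
        recipient = draft.stem.replace("email_for_", "")
        print(f"--- {recipient} ---\n{body}\n")
        if approve_fn(f"Send to {recipient}? (yes/no) ").strip() == "yes":
            send_fn(recipient, body)  # exactly the file content, never a summary
            sent += 1
    return sent
```

Deterministic, testable, and incapable of "summarizing" anything.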
Other disaster scenarios
Email isn’t the only case. Any irreversible operation exposed to an LLM is a time bomb:
Deploy to production
- The AI “optimizes” the process by skipping verifications
- Deploys code that didn’t pass all tests “because they took too long”
- Rollback exists, but damage to users is already done
SQL on database
- Runs `UPDATE users SET active = false` without a `WHERE` clause
- The AI “simplified” the query because it was “obvious” it referred to one user
- Backups exist, but restoring takes hours
Social media posting
- The AI “improves” the tweet text to make it more attractive
- Adds an emoji or changes a word that changes the meaning
- 10,000 people already saw it
Push to main branch
- The AI commits “almost ready” code
- “Minor tests can wait”
- CI/CD deploys it automatically
File deletion
- Runs `rm -rf` to “clean up” temporary files
- Turns out they weren’t so temporary
- No backup of that specific directory
Financial transactions
- The AI “rounds” amounts to simplify
- Or processes the same payment twice “just in case”
- The money already left the account
Infrastructure modification
- Terraform apply without prior plan
- The AI decided the plan “was obvious”
- Just deleted the production database
In all these cases, the pattern is the same: the AI has access to an irreversible operation, its incentives push it to use it, and instructions to “be careful” aren’t enough.
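The same "fail closed" idea applies to the SQL scenario above: a thin guard between whoever writes the query and the database can refuse destructive statements that lack a `WHERE` clause. A deliberately crude sketch (a real system would parse the SQL properly rather than regex it):

```python
import re

# Statements that mutate rows and should never run unscoped.
DESTRUCTIVE = re.compile(r"^\s*(UPDATE|DELETE)\b", re.IGNORECASE)

def guard_sql(query: str) -> str:
    """Reject UPDATE/DELETE without a WHERE clause; pass everything else through."""
    if DESTRUCTIVE.match(query) and not re.search(r"\bWHERE\b", query, re.IGNORECASE):
        raise ValueError("Destructive statement without WHERE - refusing to run")
    return query
```

It won't catch every footgun, but it turns the worst class of "simplified" queries into a loud error instead of a quiet catastrophe.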
Layers of defense
Instructions in the configuration file are useful, but they can’t be the only defense. They’re like “no running by the pool” signs - they help, but if someone slips, you better have a lifeguard.
| Layer | Reliability | Why |
|---|---|---|
| Don’t give the capability | High | Can’t do what it can’t do |
| Separation of concerns | High | AI generates → Script verifies → Human approves → Script executes |
| Mandatory dry-run | Medium-high | But the AI could fabricate the dry-run output |
| Instructions in config | Low | The AI can rationalize them |
| Trusting the AI “understands” | None | It doesn’t understand, only predicts tokens |
Layer 1 is the only truly reliable one. The others are backup.
How to detect it’s going to happen
There are phrases that should trigger all alarms:
| AI phrase | Active incentive | Real translation |
|---|---|---|
| “To go faster…” | Efficiency | “I’m going to take shortcuts” |
| “I’ll simplify…” | Efficiency + Capability | “I’m going to lose information” |
| “I assume you want…” | Avoid friction | “I’m not going to ask” |
| “While I’m at it, I’ll also…” | Proactivity | “I’m going to do things you didn’t ask for” |
| “There shouldn’t be a problem” | Appear competent | “I haven’t verified anything” |
If you see any of these phrases before an irreversible operation, STOP. Ask what exactly it’s going to do. Request a dry-run. Verify the content.
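This check can even be automated: scan the model's stated plan for trigger phrases before allowing any irreversible tool call. A sketch (the phrase list mirrors the table above; you would tune it to your own model's habits):

```python
# Red-flag phrases from the table above, lowercased for matching.
RED_FLAGS = [
    "to go faster",
    "i'll simplify",
    "i assume you want",
    "while i'm at it",
    "there shouldn't be a problem",
]

def risky_phrases(plan: str) -> list[str]:
    """Return the red-flag phrases found in the AI's stated plan."""
    lowered = plan.lower()
    return [p for p in RED_FLAGS if p in lowered]
```

If the list comes back non-empty, the irreversible step waits for a human.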
The configuration that actually works
After the disaster, I rewrote the instructions using strict prohibitions and verifiable procedures:
```
PROHIBITED: sending emails, messages, or any external communication.
PROHIBITED: modifying the content of existing files when transmitting them.
PROCEDURE: generate output as local files only; a separate script executes.
PROCEDURE: any irreversible operation requires explicit human approval, item by item.
```
The key difference:
- Prohibitions, not recommendations
- Procedures, not attitudes
- Verifiable, not subjective
- No escape clauses
The real learning
It’s not enough to tell the AI what not to do. You have to design systems where it can’t do it.
The AI isn’t malicious, but its incentives aren’t aligned with irreversible operations. Its training optimizes it to appear useful, efficient, and competent. These are virtues in most contexts. In operations that can’t be undone, they’re fatal flaws.
The solution isn’t “better training” or “better explaining.” The solution is:
- Don’t give it access to irreversible operations
- Separate responsibilities: AI generates, script executes, human approves
- Strict prohibitions as last line of defense
- Human verification before anything irreversible happens
The mantra I now have engraved:
Generate, don’t execute. If it exists, don’t modify. If it’s irreversible, don’t decide.
Now I have to go, because I have things to do: write and send (by hand, of course) 44 apology emails.
Related: Authorization fatigue is a close cousin of this problem. If a security tool interrupts you so much that you start approving without looking, security becomes theater. I tell that story in When security asks for permission so many times you stop reading.