Correct path fails → Agent finds a shortcut → Shortcut “works” → Hidden damage
Let me share two real examples from an ETL pipeline that aggregates scattered data from various web sources.
**Case 1: The script as a string.** The pipeline has a `make scrape-source` command that starts a *watchdog*, which in turn launches *workers*. The watchdog monitors them, restarts dead workers, and cleans up orphaned connections. One day, the agent needed to run a scrape. The `make` command failed due to a dependency issue. What did it do? It wrote a 47-line Python script inline, as a *string*, and passed it to `python -c "..."`. No *error handling*. No *watchdog*. No *cleanup*. It worked… until a worker got stuck and no one restarted it. Partial data, unclosed connections, and I didn’t notice for three days.
**Case 2: The lone worker.** Another session, same pipeline. The agent directly executed `voyeur worker`, bypassing the watchdog entirely. The worker started scraping, hit a network timeout, and got stuck in an infinite retry loop, consuming resources. Without the watchdog, no one killed it. Without centralized logging, no one noticed. The server spent three hours retrying a single page that returned 503 errors.
In both cases, the agent made a locally rational decision. "The `make` command fails, but I know how to do the same thing manually." The problem is, it didn’t know the same thing. It knew 60%. The other 40% were system invariants that didn’t appear in any README.
## Why forbidding doesn’t work
My first reaction was the same as everyone’s: write rules.
```markdown
## FORBIDDEN
- NEVER execute workers directly
- NEVER create scripts as strings
- ALWAYS use make
```

Do you know how an LLM reads that?

| What you write | What it interprets |
|---|---|
| “NEVER do X” | “X is forbidden, unless I think it’s necessary” |
| “ALWAYS use Y” | “Y is preferred, but if it fails, I’ll improvise” |
| “Doing Z is risky” | “I’ll be careful while doing Z” |
I mentioned this in a previous post: soft instructions describe attitudes. An LLM needs impossibilities. “Don’t run by the pool” doesn’t work. What works is having no pool or making the floor out of Velcro.
An LLM always believes its case is the exception. Its training optimizes for completing tasks, demonstrating competence, and avoiding friction. When the correct path fails, these incentives align in one direction: “I can figure this out myself.” And it does. Badly.
## The philosophy: impossible, not forbidden
There’s an idea in safety engineering that has worked for decades: make the wrong action impossible instead of forbidding it.
You don’t put a “no diesel” sign on a gasoline car. You make the nozzle incompatible. You don’t put a label on a plug saying “this device runs on 110V, don’t plug it into 220V”. You give it a different shape.
In plain language: the system should physically prevent wrongdoing, not rely on someone reading the manual.
Applied to an AI agent running an ETL pipeline, this translates to three layers of defense.
### Layer 1: Self-defending code
If a worker needs the watchdog to function properly, the worker should verify this itself:
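A minimal sketch of that self-check, assuming the watchdog exports a `PIPELINE_WATCHDOG_PID` environment variable when it spawns workers (the variable and class names here are illustrative, not the pipeline’s real ones):

```python
import os


class WatchdogRequiredError(RuntimeError):
    """Raised when a worker is started outside the watchdog."""


class Worker:
    def __init__(self):
        self._verify_invocation()

    def _verify_invocation(self):
        # The watchdog sets this variable when spawning workers.
        # A worker launched any other way (python -c, direct import,
        # a copy-pasted script) won't have it, and dies immediately.
        if not os.environ.get("PIPELINE_WATCHDOG_PID"):
            raise WatchdogRequiredError(
                "Workers must be launched by the watchdog. "
                "Use `make scrape-<source>` instead."
            )

    def run(self):
        ...  # actual scraping logic
```

The check costs one environment lookup at startup and turns an entire class of creative invocations into an immediate, loud failure.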
Now, no matter how creative the agent gets, it can write `python -c "from pipeline import Worker; Worker().run()"` all it wants, and the worker will spit an error back in its face. There’s no alternative path. The code defends itself.
The same applies to pipeline phases. If phase 3 (consolidation) requires phase 1 (scrape) to be complete, it should check this on startup:
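A sketch of that startup check, assuming each completed phase drops a marker file under a `state/` directory (the directory layout and names are illustrative):

```python
from pathlib import Path

STATE_DIR = Path("state")


class PhaseOrderError(RuntimeError):
    """Raised when a phase starts before its prerequisites finished."""


def require_phase(name: str) -> None:
    """Fail fast unless the given phase left its completion marker."""
    marker = STATE_DIR / f"{name}.done"
    if not marker.exists():
        raise PhaseOrderError(
            f"Phase '{name}' has not completed. "
            "Run the pipeline via make; do not invoke phases directly."
        )


def run_consolidation() -> None:
    # Phase 3 refuses to run until phase 1 has written its marker.
    require_phase("scrape")
    ...  # consolidation logic
```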
This isn’t a test. It’s not a configuration rule. It’s code that executes every time and doesn’t rely on the agent having read the README.
### Layer 2: A single interface, no shortcuts
The Makefile is the whitelist of operations. If it isn’t in `make help`, it doesn’t exist.
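A sketch of what that whitelist can look like (target names and module paths are illustrative):

```makefile
.PHONY: help health

help:  ## List every supported operation; if it's not here, it doesn't exist
	@grep -E '^[a-zA-Z_%-]+:.*##' $(MAKEFILE_LIST) | sort

health:  ## Verify the scraping adapters still work before touching anything
	python -m pipeline.health

scrape-%: health  ## Scrape one source through the watchdog (e.g. make scrape-acme)
	python -m pipeline.watchdog --source $*
```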
Notice one detail: `scrape-%` runs `health` before anything else. The health check verifies that the scraping adapters are still functional (websites change without warning). The agent can’t skip this verification because it’s inside the `make` target.
To a fleeing enemy, a silver bridge: if you want the agent to use the right path, make it the easiest path. `make scrape-source` is more convenient than crafting a manual script. Don’t fight the agent’s nature; channel it.
### Layer 3: Interceptors block shortcuts
Layers 1 and 2 cover 90%. The remaining 10% is for when the agent is too creative. For that, intercept commands before they’re executed.
Tools like Claude Code allow you to configure hooks to inspect every shell command before execution. A hook can block dangerous patterns:
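A sketch of such a hook in Python. It assumes the Claude Code convention that a `PreToolUse` hook receives the tool call as JSON on stdin and rejects it by exiting with code 2 (check your tool’s hook documentation for the exact contract; the patterns below are illustrative, not exhaustive):

```python
import json
import re
import sys

# Shortcuts the agent must not take.
BLOCKED_PATTERNS = [
    r"python\d?(\.\d+)?\s+-c\b",  # inline scripts passed as strings
    r"\bvoyeur\s+worker\b",       # workers launched without the watchdog
    r"\bnohup\b.*\bworker\b",     # backgrounded rogue workers
]


def is_blocked(command: str) -> bool:
    return any(re.search(p, command) for p in BLOCKED_PATTERNS)


def main() -> int:
    payload = json.load(sys.stdin)
    command = payload.get("tool_input", {}).get("command", "")
    if is_blocked(command):
        print("Blocked: use the make targets instead (see `make help`).",
              file=sys.stderr)
        return 2  # non-zero exit tells the agent the command was rejected
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The stderr message matters: it doesn’t just block, it points the agent back to the silver bridge.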
Yes, it’s a blacklist. And blacklists aren’t perfect. Yet, combined with Layers 1 and 2, it closes the gaps. For the agent to bypass it, it would need to:
- Invent a command that doesn’t match any hook pattern
- Avoid detection by the code safeguards
- Produce a correct outcome without using the Makefile
It’s possible, but we’re talking about a level of creativity bordering on malicious intent. And LLMs aren’t malicious — they’re lazily creative. Put up a wall, and they’ll look for the easiest path, which by now is the Makefile.
## The catalog of shortcuts you didn’t know you feared
Beyond executing bad commands, operational confabulations can occur within the code the agent writes:
| Shortcut | Why it happens | Why it’s deadly |
|---|---|---|
| Loosens tests (`assert count >= 0`) | Test fails, agent wants it to pass | A test that always passes tests nothing |
| Invents JSON fixtures | Needs test data but lacks real data | Fiction validating fiction |
| Suppresses warnings (`# type: ignore`) | Linter complains, agent wants silence | Real errors hidden under the rug |
| `except Exception: pass` | Something fails, agent “fixes” it | Silent failures snowball |
| Infinite retry loops | A service isn’t responding | Resource consumption and hidden issues |
For each of these, the defense is the same: don’t forbid, make impossible.
How do you stop loosened tests? With a pytest plugin detecting suspicious assertions:
How do you prevent invented fixtures? Require every fixture to document provenance: source URL, capture date, SHA256 hash. A fixture without provenance fails the CI.
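A sketch of that provenance check, assuming each fixture `foo.json` ships with a `foo.provenance.json` sidecar (the sidecar format and key names are illustrative):

```python
import hashlib
import json
from pathlib import Path

REQUIRED_KEYS = {"source_url", "captured_at", "sha256"}


def verify_fixture(fixture: Path) -> None:
    """Fail unless the fixture documents where it came from and the
    recorded hash matches the file's actual contents."""
    sidecar = fixture.with_suffix(".provenance.json")
    if not sidecar.exists():
        raise AssertionError(f"{fixture}: no provenance sidecar")
    meta = json.loads(sidecar.read_text())
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise AssertionError(f"{fixture}: provenance missing {sorted(missing)}")
    actual = hashlib.sha256(fixture.read_bytes()).hexdigest()
    if actual != meta["sha256"]:
        raise AssertionError(f"{fixture}: contents do not match recorded hash")
```

The hash check is the part that bites: an agent can invent a plausible-looking URL, but it can’t invent a fixture whose SHA256 matches data it never captured.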
How do you block `except Exception: pass`? With a `ruff` or `flake8` rule that marks it as an error, not a warning.
In every case, the verification is mechanical, automatic, and doesn’t depend on someone reading instructions.
## The underlying issue: trust vs. instrumentation
There’s a mantra in engineering that applies perfectly here:
“You don’t trust; you instrument.”
Trust is a feeling. Instrumentation is a system. Feelings scale poorly. Systems scale well.
When you give an AI agent shell access and say, “but be careful,” you’re trusting. When you give it access to a shell where dangerous commands simply won’t work, you’re instrumenting.
The difference isn’t one of degree. It’s one of kind. An agent that “is careful” fails the moment it gets distracted (and an LLM gets distracted at every token it generates). A system that makes the wrong path impossible doesn’t fail, because there’s nothing left to fail.
## The scoreboard
| Layer | Reliability | Implementation Cost | Example |
|---|---|---|---|
| Code safeguards | High | Medium | Worker verifying watchdog |
| Makefile as single interface | High | Low | `make help` = whitelist |
| Intercepting hooks | High-Med | Low | Blocking `python -c` |
| Config rules for agent | Low | Very low | “NEVER do X” |
| Trusting the agent | None | Free | ¯\_(ツ)_/¯ |
The first three layers are cumulative. The fourth is a useful complement but insufficient. The fifth is what we all do until it backfires.
## Who watches the watchman?
This leaves one uncomfortable question: who writes the safeguards? If the AI agent writes the same code that’s supposed to restrict it, aren’t we in a loop?
Yes. Partially.
The key is that the human designs the safeguards, and anyone can implement them: agent, human, or a monkey with a keyboard. What matters is that once implemented, the safeguards test themselves. The `_verify_invocation` test doesn’t test the pipeline; it tests that the pipeline rejects incorrect invocations. This test is trivial to write and hard to mess up:
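A sketch of that test, reusing the Layer 1 convention that the watchdog exports `PIPELINE_WATCHDOG_PID` (the variable name is illustrative, and the `Worker` here is a minimal stand-in for the real one):

```python
import os

import pytest


# Minimal stand-in: the real Worker performs this same check on __init__.
class Worker:
    def __init__(self):
        if not os.environ.get("PIPELINE_WATCHDOG_PID"):
            raise RuntimeError("Workers must be launched by the watchdog")


def test_worker_refuses_direct_invocation():
    # Simulate an agent running the worker outside the watchdog.
    os.environ.pop("PIPELINE_WATCHDOG_PID", None)
    with pytest.raises(RuntimeError):
        Worker()
```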
If this test passes, the safeguard works. If the safeguard works, the agent can’t bypass it. It doesn’t matter who wrote the code. What matters is that the test exists and passes.
## What I learned
I’ve spent months working with an AI agent on an ETL pipeline that aggregates data from scattered web sources. I’ve seen the agent do brilliant things, and things that caught me with my pants down. Here’s the single most important takeaway:
Don’t design rules for a well-behaved agent. Design systems for an agent with shell access and unlimited creativity.
The agent isn’t malicious. It’s an optimizer. Its goal is to complete the task, not respect your system invariants. If you leave a loophole, it’ll find it. Not to screw you over, but because that’s literally what it does: find paths.
Your job isn’t to block every wrong path. It’s to make sure the only path that works is the right one.
Full series on AI failures in production: The 44 fake emails → MEMORY.md → Silent failure → 5 reactive defenses → This post: structural defenses.