Correct path fails → Agent finds a shortcut → Shortcut “works” → Hidden damage
Let me share two real examples from an ETL pipeline that aggregates scattered data from various web sources.
**Case 1: The script as a string.** The pipeline has a `make scrape-source` command that starts a *watchdog*, which in turn launches *workers*. The watchdog monitors them, restarts dead workers, and cleans up orphaned connections. One day, the agent needed to run a scrape. The `make` command failed due to a dependency issue. What did it do? It wrote a 47-line Python script inline, as a *string*, and passed it to `python -c "..."`. No *error handling*. No *watchdog*. No *cleanup*. It worked… until a worker got stuck and no one restarted it. Partial data, unclosed connections, and I didn’t notice for three days.
**Case 2: The lone worker.** Another session, same pipeline. The agent directly executed `voyeur worker`, bypassing the watchdog entirely. The worker started scraping, hit a network timeout, and got stuck in an infinite retry loop, consuming resources. Without the watchdog, no one killed it. Without centralized logging, no one noticed. The server spent three hours retrying a single page that returned 503 errors.
In both cases, the agent made a locally rational decision. "The `make` command fails, but I know how to do the same thing manually." The problem is, it didn’t know the same thing. It knew 60%. The other 40% were system invariants that didn’t appear in any README.
## Why forbidding doesn’t work
My first reaction was the same as everyone’s: write rules.
```markdown
## FORBIDDEN
- NEVER execute workers directly
- NEVER create scripts as strings
- ALWAYS use make
```

Do you know how an LLM reads that?

| What you write | What it interprets |
|---|---|
| “NEVER do X” | “X is forbidden, unless I think it’s necessary” |
| “ALWAYS use Y” | “Y is preferred, but if it fails, I’ll improvise” |
| “Doing Z is risky” | “I’ll be careful while doing Z” |
I mentioned this in a previous post: soft instructions describe attitudes. An LLM needs impossibilities. “Don’t run by the pool” doesn’t work. What works is having no pool or making the floor out of Velcro.
An LLM always believes its case is the exception. Its training optimizes for completing tasks, demonstrating competence, and avoiding friction. When the correct path fails, these incentives align in one direction: “I can figure this out myself.” And it does. Badly.
## The philosophy: impossible, not forbidden
There’s an idea in safety engineering that has worked for decades: make the wrong action impossible instead of forbidding it.
You don’t put a “no diesel” sign on a gasoline car. You make the nozzle incompatible. You don’t put a label on a plug saying “this device runs on 110V, don’t plug it into 220V”. You give it a different shape.
In plain language: the system should physically prevent wrongdoing, not rely on someone reading the manual.
Applied to an AI agent running an ETL pipeline, this translates to three layers of defense.
### Layer 1: Self-defending code
If a worker needs the watchdog to function properly, the worker should verify this itself:
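A minimal sketch of that self-check, assuming the watchdog exports a `PIPELINE_WATCHDOG_PID` environment variable when it spawns workers (the variable and class names here are illustrative, not the pipeline’s real ones):

```python
import os


class WatchdogRequiredError(RuntimeError):
    """Raised when a worker is started outside the watchdog."""


class Worker:
    def __init__(self):
        self._verify_invocation()

    def _verify_invocation(self):
        # The watchdog sets this variable when spawning workers.
        # A worker launched any other way (python -c, direct import,
        # a copy-pasted script) won't have it, and dies immediately.
        if not os.environ.get("PIPELINE_WATCHDOG_PID"):
            raise WatchdogRequiredError(
                "Workers must be launched by the watchdog. "
                "Use `make scrape-<source>` instead."
            )

    def run(self):
        ...  # actual scraping logic
```

The check costs one environment lookup at startup and turns an entire class of creative invocations into an immediate, loud failure.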
Now, no matter how creative the agent gets, it can write `python -c "from pipeline import Worker; Worker().run()"` all it wants, and the worker will spit an error back in its face. There’s no alternative path. The code defends itself.
The same applies to pipeline phases. If phase 3 (consolidation) requires phase 1 (scrape) to be complete, it should check this on startup:
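A sketch of that startup check, assuming each completed phase drops a marker file under a `state/` directory (the directory layout and names are illustrative):

```python
from pathlib import Path

STATE_DIR = Path("state")


class PhaseOrderError(RuntimeError):
    """Raised when a phase starts before its prerequisites finished."""


def require_phase(name: str) -> None:
    """Fail fast unless the given phase left its completion marker."""
    marker = STATE_DIR / f"{name}.done"
    if not marker.exists():
        raise PhaseOrderError(
            f"Phase '{name}' has not completed. "
            "Run the pipeline via make; do not invoke phases directly."
        )


def run_consolidation() -> None:
    # Phase 3 refuses to run until phase 1 has written its marker.
    require_phase("scrape")
    ...  # consolidation logic
```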
This isn’t a test. It’s not a configuration rule. It’s code that executes every time and doesn’t rely on the agent having read the README.
### Layer 2: A single interface, no shortcuts
The Makefile is the whitelist of operations. If it isn’t in `make help`, it doesn’t exist.
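A sketch of what that whitelist can look like (target names and module paths are illustrative):

```makefile
.PHONY: help health

help:  ## List every supported operation; if it's not here, it doesn't exist
	@grep -E '^[a-zA-Z_%-]+:.*##' $(MAKEFILE_LIST) | sort

health:  ## Verify the scraping adapters still work before touching anything
	python -m pipeline.health

scrape-%: health  ## Scrape one source through the watchdog (e.g. make scrape-acme)
	python -m pipeline.watchdog --source $*
```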
Notice one detail: `scrape-%` runs `health` before anything else. The health check verifies that the scraping adapters are still functional (websites change without warning). The agent can’t skip this verification because it’s inside the `make` target.
To a fleeing enemy, a silver bridge: if you want the agent to use the right path, make it the easiest path. `make scrape-source` is more convenient than crafting a manual script. Don’t fight the agent’s nature; channel it.
### Layer 3: Interceptors block shortcuts
Layers 1 and 2 cover 90%. The remaining 10% is for when the agent is too creative. For that, intercept commands before they’re executed.
Tools like Claude Code allow you to configure hooks to inspect every shell command before execution. A hook can block dangerous patterns:
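A sketch of such a hook in Python. It assumes the Claude Code convention that a `PreToolUse` hook receives the tool call as JSON on stdin and rejects it by exiting with code 2 (check your tool’s hook documentation for the exact contract; the patterns below are illustrative, not exhaustive):

```python
import json
import re
import sys

# Shortcuts the agent must not take.
BLOCKED_PATTERNS = [
    r"python\d?(\.\d+)?\s+-c\b",  # inline scripts passed as strings
    r"\bvoyeur\s+worker\b",       # workers launched without the watchdog
    r"\bnohup\b.*\bworker\b",     # backgrounded rogue workers
]


def is_blocked(command: str) -> bool:
    return any(re.search(p, command) for p in BLOCKED_PATTERNS)


def main() -> int:
    payload = json.load(sys.stdin)
    command = payload.get("tool_input", {}).get("command", "")
    if is_blocked(command):
        print("Blocked: use the make targets instead (see `make help`).",
              file=sys.stderr)
        return 2  # non-zero exit tells the agent the command was rejected
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The stderr message matters: it doesn’t just block, it points the agent back to the silver bridge.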
Yes, it’s a blacklist. And blacklists aren’t perfect. Yet, combined with Layers 1 and 2, it closes the gaps. For the agent to bypass it, it would need to:
- Invent a command that doesn’t match any hook pattern
- Avoid detection by the code safeguards
- Produce a correct outcome without using the Makefile
It’s possible, but we’re talking about a level of creativity bordering on malicious intent. And LLMs aren’t malicious — they’re lazily creative. Put up a wall, and they’ll look for the easiest path, which by now is the Makefile.
## The catalog of shortcuts you didn’t know you feared
Beyond executing bad commands, operational confabulations can occur within the code the agent writes:
| Shortcut | Why it happens | Why it’s deadly |
|---|---|---|
| Loosens tests (`assert count >= 0`) | Test fails, agent wants it to pass | A test that always passes tests nothing |
| Invents JSON fixtures | Needs test data but lacks real data | Fiction validating fiction |
| Suppresses warnings (`# type: ignore`) | Linter complains, agent wants silence | Real errors hidden under the rug |
| `except Exception: pass` | Something fails, agent “fixes” it | Silent failures snowball |
| Infinite retry loops | A service isn’t responding | Resource consumption and hidden issues |
For each of these, the defense is the same: don’t forbid, make impossible.
How do you stop loosened tests? With a pytest plugin detecting suspicious assertions:
How do you prevent invented fixtures? Require every fixture to document provenance: source URL, capture date, SHA256 hash. A fixture without provenance fails the CI.
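A sketch of that provenance check, assuming each fixture `foo.json` ships with a `foo.provenance.json` sidecar (the sidecar format and key names are illustrative):

```python
import hashlib
import json
from pathlib import Path

REQUIRED_KEYS = {"source_url", "captured_at", "sha256"}


def verify_fixture(fixture: Path) -> None:
    """Fail unless the fixture documents where it came from and the
    recorded hash matches the file's actual contents."""
    sidecar = fixture.with_suffix(".provenance.json")
    if not sidecar.exists():
        raise AssertionError(f"{fixture}: no provenance sidecar")
    meta = json.loads(sidecar.read_text())
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise AssertionError(f"{fixture}: provenance missing {sorted(missing)}")
    actual = hashlib.sha256(fixture.read_bytes()).hexdigest()
    if actual != meta["sha256"]:
        raise AssertionError(f"{fixture}: contents do not match recorded hash")
```

The hash check is the part that bites: an agent can invent a plausible-looking URL, but it can’t invent a fixture whose SHA256 matches data it never captured.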
How do you block `except Exception: pass`? With a `ruff` or `flake8` rule that marks it as an error, not a warning.
In every case, the verification is mechanical, automatic, and doesn’t depend on someone reading instructions.
## The underlying issue: trust vs. instrumentation
There’s a mantra in engineering that applies perfectly here:
“You don’t trust; you instrument.”
Trust is a feeling. Instrumentation is a system. Feelings scale poorly. Systems scale well.
When you give an AI agent shell access and say, “but be careful,” you’re trusting. When you give it access to a shell where dangerous commands simply won’t work, you’re instrumenting.
The difference isn’t one of degree. It’s one of kind. An agent that “is careful” fails the moment it gets distracted (and an LLM gets distracted at every token it generates). A system that makes the wrong path impossible doesn’t fail, because there’s nothing left to fail.
## The scoreboard
| Layer | Reliability | Implementation Cost | Example |
|---|---|---|---|
| Code safeguards | High | Medium | Worker verifying watchdog |
| Makefile as single interface | High | Low | `make help` = whitelist |
| Intercepting hooks | High-Med | Low | Blocking `python -c` |
| Config rules for agent | Low | Very low | “NEVER do X” |
| Trusting the agent | None | Free | ¯\_(ツ)_/¯ |
The first three layers are cumulative. The fourth is a useful complement but insufficient. The fifth is what we all do until it backfires.
## Who watches the watchman?
This leaves one uncomfortable question: who writes the safeguards? If the AI agent writes the same code that’s supposed to restrict it, aren’t we in a loop?
Yes. Partially.
The key is that the human designs the safeguards, and anyone can implement them: agent, human, or a monkey with a keyboard. What matters is that once implemented, the safeguards test themselves. The `_verify_invocation` test doesn’t test the pipeline; it tests that the pipeline rejects incorrect invocations. This test is trivial to write and hard to mess up:
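A sketch of that test, reusing the Layer 1 convention that the watchdog exports `PIPELINE_WATCHDOG_PID` (the variable name is illustrative, and the `Worker` here is a minimal stand-in for the real one):

```python
import os

import pytest


# Minimal stand-in: the real Worker performs this same check on __init__.
class Worker:
    def __init__(self):
        if not os.environ.get("PIPELINE_WATCHDOG_PID"):
            raise RuntimeError("Workers must be launched by the watchdog")


def test_worker_refuses_direct_invocation():
    # Simulate an agent running the worker outside the watchdog.
    os.environ.pop("PIPELINE_WATCHDOG_PID", None)
    with pytest.raises(RuntimeError):
        Worker()
```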
If this test passes, the safeguard works. If the safeguard works, the agent can’t bypass it. It doesn’t matter who wrote the code. What matters is that the test exists and passes.
## What I learned
I’ve spent months working with an AI agent on an ETL pipeline that aggregates data from scattered web sources. I’ve seen the agent do brilliant things, and things that caught me with my pants down. Here’s the single most important takeaway:
Don’t design rules for a well-behaved agent. Design systems for an agent with shell access and unlimited creativity.
The agent isn’t malicious. It’s an optimizer. Its goal is to complete the task, not respect your system invariants. If you leave a loophole, it’ll find it. Not to screw you over, but because that’s literally what it does: find paths.
Your job isn’t to block every wrong path. It’s to make sure the only path that works is the right one.
Full series on AI failures in production: The 44 fake emails → MEMORY.md → Silent failure → 5 reactive defenses → This post: structural defenses.