Claude Code includes a slash command called /simplify that automatically reviews your code. I ran it on a hefty diff — about 500 lines across 8 files — and the results were… interesting. It found things I wouldn’t have noticed, but it also wasted my time pointing out stuff that didn’t matter.

So, I took it apart and rebuilt it piece by piece.

What Does /simplify Do?

It’s a skill that comes bundled with Claude Code (you don’t install it). It launches three agents in parallel, each looking at the same diff from a different angle:

  1. Code Reuse — Are there existing utilities that could replace newly added code?
  2. Code Quality — Redundant state, copy-paste, leaky abstractions, stringly-typed code.
  3. Efficiency — Unnecessary I/O, missed concurrency opportunities, memory leaks.

The three produce findings, and then the system tries to fix the issues directly.
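The fan-out is simple to picture. Here's a minimal sketch of the dispatch pattern, assuming a hypothetical `run_agent()` helper standing in for an LLM call; the agent names come from the list above, but nothing here is Claude Code's actual API:

```python
# Hypothetical sketch of /simplify's fan-out: three reviewers, one diff.
# run_agent() is a stand-in for prompting an LLM from one angle.
from concurrent.futures import ThreadPoolExecutor

AGENTS = ["code-reuse", "code-quality", "efficiency"]

def run_agent(name: str, diff: str) -> list[str]:
    # Placeholder: a real agent would prompt a model with the diff
    # from the angle named by `name` and parse its findings.
    return [f"{name}: example finding"]

def review(diff: str) -> list[str]:
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        results = pool.map(lambda a: run_agent(a, diff), AGENTS)
    # Flatten the three agents' findings into one list to act on.
    return [f for findings in results for f in findings]
```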

What It Does Well

The reuse agent caught a helper that was duplicated verbatim in two test suites. Same name, same lines, two different files. I moved it to a shared module. Nice and clean.

The efficiency agent spotted a double trip to disk inside a processing loop: load state, modify, save, read data, re-load, re-save. Two writes when one would suffice. I wouldn’t have noticed that myself.
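To make the pattern concrete, here's an illustrative reconstruction of that double trip (not the actual code; the JSON state file is an assumption for the example):

```python
# Illustrative reconstruction of the double disk trip in a processing loop.
import json, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "state.json")

def save(state):
    with open(path, "w") as f:
        json.dump(state, f)

def load():
    with open(path) as f:
        return json.load(f)

save({"count": 0, "items": []})

def process_before(item):
    # Before: two separate load/save round trips for one logical update.
    state = load()                 # disk read 1
    state["count"] += 1
    save(state)                    # disk write 1
    state = load()                 # disk read 2 (redundant)
    state["items"].append(item)
    save(state)                    # disk write 2

def process_after(item):
    # After: one read, both mutations, one write.
    state = load()
    state["count"] += 1
    state["items"].append(item)
    save(state)

process_after("a")
```

Inside a loop, the before version doubles every read and write; the after version halves the disk traffic with identical results.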

It also flagged a memory buffer that wasn’t cleaned up in the error path. If something failed between allocation and release, leak. The main path was fine. Classic copy-paste swallowing the detail.
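The shape of that bug, reduced to a toy (the `Buffer` class is hypothetical, just to make the leak observable):

```python
# Minimal illustration of the error-path leak: the happy path releases,
# the failure path exits before the release line runs.
class Buffer:
    open_count = 0
    def __init__(self):
        Buffer.open_count += 1
    def release(self):
        Buffer.open_count -= 1

def leaky(fail: bool):
    buf = Buffer()
    if fail:
        raise RuntimeError("processing failed")  # early exit: buf never released
    buf.release()                                # only the happy path gets here

def fixed(fail: bool):
    buf = Buffer()
    try:
        if fail:
            raise RuntimeError("processing failed")
    finally:
        buf.release()                            # released on every path
```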

So far, so good. Three legitimate, actionable findings. But the problem with /simplify isn’t what it catches — it’s everything else it reports.

Where It Falls Short

Too much noise in low-severity issues. It suggested removing a field from a struct because it was “redundant” with a computed property. We’re talking 8 bytes. That field is used in more than 10 places in the code and the tests. The churn of removing it far outweighs the benefit of saving a single integer.

No understanding of project context. It flagged a concurrency pattern as HIGH risk, which is fair — that’s correct in the abstract. But it had already been documented in the project’s CLAUDE.md, had a dedicated linter, was allowlisted, and had an open issue. The agent didn’t know any of this because it works only with the diff, in complete isolation.

Doesn’t distinguish “incorrect” from “improvable.” The double disk trip was inefficient but correct. The concurrency pattern was a latent bomb. Both came back as MEDIUM priority. The prioritization is flat.

Suggests enums for external data. It claimed that some fields in a DTO should be enums instead of strings. But those fields come from an external API. They’re only read and displayed. Turning them into enums requires custom decoding and adds nothing — if the API sends a new value, your enum blows up instead of gracefully degrading.

These are mistakes a developer with project context would filter out in two seconds. But /simplify has no context. It has a diff and good intentions.

The Three Fixes I Made

After reviewing the outputs, I identified three structural problems with /simplify and fixed them in a custom skill I called /improve.

1. Inject Project Context

Each agent receives the CLAUDE.md, open issues from the tracker, and linter results before generating findings. If something is already managed, it mentions it but doesn’t report it as new.
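Mechanically, this is just a preamble prepended to each agent's prompt. A sketch, assuming the context sources named above (the function and its layout are my own, not the skill's internals):

```python
# Sketch of the context preamble each agent receives before the diff.
from pathlib import Path

def build_context(project_root: str, open_issues: list[str], lint_output: str) -> str:
    claude_md = Path(project_root, "CLAUDE.md")
    sections = [
        "## Project conventions\n" + (claude_md.read_text() if claude_md.exists() else "(none)"),
        "## Open tracker issues\n" + ("\n".join(open_issues) or "(none)"),
        "## Linter results\n" + (lint_output or "(clean)"),
        "If a finding is already covered above, mention it but do not report it as new.",
    ]
    return "\n\n".join(sections)
```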

This eliminates the most irritating category of false positives: the ones you already know about and have under control.

2. Cost/Benefit Filtering

Before reporting, each agent estimates how many files the fix would touch. If the effort-to-improvement ratio is unfavorable — like renaming a field in 10+ spots for minor readability gains — it filters the finding out.
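The gate itself can be trivial. A hedged sketch, where the weights are invented for illustration and not taken from the actual skill:

```python
# Toy cost/benefit gate: report only when estimated benefit beats churn.
def worth_reporting(files_touched: int, severity: str) -> bool:
    benefit = {"LOW": 1, "MEDIUM": 4, "HIGH": 100}[severity]
    cost = files_touched  # rough proxy: one unit of churn per file edited
    return benefit > cost

assert worth_reporting(1, "MEDIUM")    # one-file quality fix: report it
assert not worth_reporting(12, "LOW")  # 10+ file rename for readability: drop it
```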

This seems obvious, but /simplify doesn’t do it. It treats a one-line change and a 15-file refactor with the same priority.

3. Separate “Auto-Fix” from “Backlog Issue”

Findings are split into two types:

  • auto-fix: Mechanical, ≤3 files, low risk. Applied directly.
  • issue: Requires design, touches >3 files, or changes an interface. Created as a tracker issue.

This prevents the review from attempting fixes that need more thought.
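The triage rule is mechanical enough to express in a few lines. A sketch using the thresholds above (the function name and flags are mine):

```python
# Illustrative triage of a finding into auto-fix vs tracker issue,
# using the thresholds described above.
def triage(files_touched: int, mechanical: bool, changes_interface: bool) -> str:
    if mechanical and files_touched <= 3 and not changes_interface:
        return "auto-fix"
    return "issue"

assert triage(2, mechanical=True, changes_interface=False) == "auto-fix"
assert triage(5, mechanical=True, changes_interface=False) == "issue"
```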

What I Didn’t Do (And Why)

A second LLM as a reviewer. Sexy idea — cross-model validation, more eyes, different training biases. In practice, the bottleneck isn't the number of eyes but the quality of the context. A second model without access to the CLAUDE.md or the tracker spits out the same thing: generic "best practices" advice you can find in any book.

Categorization into 4 severity levels. I started with CRITICAL/HIGH/MEDIUM/LOW, but with cost/benefit filtering active, almost everything that passes the filter is MEDIUM or HIGH. The other two categories are empty. More taxonomy doesn’t mean better prioritization.

The Jedi Council

And here’s the idea that changed the game.

A few weeks ago, I wrote about invoking experts as mentors — asking an LLM to adopt the perspective of Tufte, Munger, or whoever fits your needs. It worked brilliantly in design.

What if, instead of three generic agents (reuse, quality, efficiency), I used three agents with names, philosophies, and specific decision rules?

The idea has a name every Star Wars fan will recognize: the Jedi Council. Three masters with different perspectives evaluating the same case. But be careful — this isn’t about the LLM doing surface-level impersonations by quoting famous lines. It’s about each “wise master” applying specific filtering rules that a generic reviewer wouldn’t.

The Three Masters (and Why These)

Kent Beck — Simplicity. “Make it work, make it right, make it fast — in that order.” He’s the guy who tells you “those two identical blocks of code are fine, don’t extract a helper just yet.” His key rule: The Rule of Three. DO NOT report duplication unless the same block appears three times. Twice is coincidence. Three times is a pattern. And if the fix touches more files than the code it’d clean up, it’s probably not worth it.
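The Rule of Three reduces to a counting filter. A simplified sketch (exact-match blocks only; real duplication detection would normalize whitespace and identifiers):

```python
# Beck's Rule of Three as a mechanical filter: report a duplicated block
# only when it appears three or more times.
from collections import Counter

def duplication_findings(blocks: list[str]) -> list[str]:
    counts = Counter(blocks)
    return [block for block, n in counts.items() if n >= 3]

assert duplication_findings(["a", "a"]) == []          # twice: coincidence
assert duplication_findings(["a", "a", "a"]) == ["a"]  # three times: pattern
```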

But Beck isn’t just about simplicity. He also catches correctness bugs: cases where the obvious choice has semantics different from the correct one. That async keyword that seems harmless but inherits a context you don’t want. The default that works in tests but blows up in production.

Martin Fowler — Design. Code smells are symptoms, not diseases. Refactoring is a discipline, not a hobby. His key rule: Only suggest refactoring if there’s a concrete change it would benefit. “Refactoring without direction is codebase tourism.” If a string comes from an external API and is only read, don’t suggest converting it to an enum. If one field is always synchronized with another by design, the redundancy is intentional.

Mike Acton — Performance. “The purpose of all programs is to transform data from one form to another.” If you haven’t measured, you don’t have a performance problem — you have an opinion. His key rule: I/O is what matters in 99% of apps. CPU rarely bottlenecks. Disk and network do.

Acton Doesn’t Guess — He Measures

Here’s where it gets interesting. Mike Acton doesn’t stop at static analysis. He does two things before rendering a verdict:

Static I/O Counting: it scans the diff for read/write operations against disk, network, or databases, and maps each operation to its context: Is it in a loop? A hot path? It generates a frequency table before opining.

Real Profiling: if the diff touches hot-path code and the project compiles, it runs a profiler and condenses the results. If a hotspot aligns with code from the diff, it reports it with numbers, not opinions.

The I/O table includes rough time estimates: SSD read ~0.5ms, write ~1ms, flush ~2-5ms, network ~100-500ms. It’s not precise — it’s for spotting operations that, in aggregate, cross a threshold.
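The aggregate estimate is just a weighted sum over that frequency table. A sketch using the article's own ballpark figures (these are order-of-magnitude defaults, not measurements):

```python
# Rough aggregate-latency estimate from the I/O frequency table.
# Per-op costs mirror the ballpark figures above; network uses the midpoint.
COST_MS = {"ssd_read": 0.5, "ssd_write": 1.0, "flush": 3.5, "network": 300.0}

def estimated_ms(op_counts: dict[str, int]) -> float:
    return sum(COST_MS[op] * n for op, n in op_counts.items())

# 100 loop iterations of the double disk trip: 200 reads + 200 writes.
print(estimated_ms({"ssd_read": 200, "ssd_write": 200}))  # 300.0
```

Imprecise, but enough to flag that a "harmless" extra round trip inside a loop adds up to something a user can feel.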

The Risk (And How to Avoid It)

Before you start building a Jedi Council for every pull request, here’s the elephant in the room: The LLM can do surface-level impersonation. It might output “as Kent Beck would say…” and just spout the same generic advice under his name.

To avoid this, the instructions don’t say “adopt Kent Beck’s perspective.” They say: “Apply the Rule of Three: if a fix touches more files than it cleans up, discard it.” Specific rules, not vibes.

Also, each master must finish with a “Discarded” section — findings they considered but rejected, with the rule applied. This makes it clear that the master actively filtered, not just reported less.

And if two masters disagree on the same code — Beck says “don’t touch” while Fowler says “refactor” — a moderator agent evaluates the specific case until consensus is reached. If no consensus → discard. Better to do nothing than do the wrong thing.

Is it the same as having Kent Beck in the room? Obviously not. But it’s infinitely better than three generic agents reporting everything without judgment.

The Test: Same Diff, Two Reviews

I ran the same diff through /simplify and /improve. Same changes, same project, same session:

|                      | /simplify | /improve                      |
|----------------------|-----------|-------------------------------|
| Reported findings    | 8         | 5                             |
| False positives      | 3-4       | 0                             |
| New findings         | 0         | 1 (concurrency bug, HIGH)     |
| "Discarded" section  | No        | Yes, with applied rule        |
| Project context      | No        | CLAUDE.md + tracker + linters |

The new finding /improve caught that /simplify didn’t: a concurrency bug where an apparently correct pattern inherited a faulty execution context, causing the UI to freeze. In plain language: the code looked fine, compiled cleanly, but it blocked the main thread. /simplify missed it because its generic agents don’t look for bugs where the “obvious” choice is wrong. Kent Beck did, because that’s exactly his mandate.

The false positives from /simplify — the “redundant” 8-byte field, the already managed concurrency pattern, the enums for external JSON — didn’t show up in /improve. Cost/benefit filtering caught the first. Project context filtered out the second. Fowler’s rule (“stringly-typed is fine unless it hurts”) discarded the third.

What sold me the most: the “Discarded” section. Seeing what each master considered and why they rejected it inspires far more trust than just seeing what they reported. You know they looked at more than they said.

How to Install It

The skill is called /improve and lives in ~/.claude/skills/improve/SKILL.md. It’s a global skill for Claude Code — works in any project.

```shell
# Create the directory
mkdir -p ~/.claude/skills/improve

# Copy the SKILL.md (or write your own following Claude Code's skill structure)
```

Usage:

```shell
# Review only (no code changes)
/improve

# Review + apply mechanical fixes
/improve --fix

# Review + draft report for senior dev
/improve --report

# Review a specific commit range
/improve --diff HEAD~5..HEAD
```

Summary

/simplify is a solid starting point. Three generic agents detect duplication, inefficiencies, and code smells. But without project context, it creates noise; without cost/benefit filtering, it suggests changes that aren’t worth it; and without specific criteria, it treats all findings equally.

/improve is the next step: three masters with targeted philosophies, project context, cost/benefit analysis, and separation between auto-fixes and backlog issues. Beck tells you when NOT to extract a helper. Fowler tells you when a smell is purely cosmetic. Acton tells you when a “performance problem” is just an unmeasured opinion.

Fewer findings, zero false positives, and one real bug the other missed. Sometimes improvement isn’t more eyes — it’s better eyes.


Related:

  • Invoking the Experts — the original technique with Tufte and Munger.
  • /loop vs claude-cron — another Claude Code skill I analyzed.