---
title: "Five Nonexistent Experts Review Your Startup Before You Build It"
date: 2026-03-11T21:00:00+01:00
draft: false
slug: "launch-council-mvp-llm-adversarial-review"
slug_en: "launch-council-mvp-llm-adversarial-review"
description: "I designed a council of five simulated LLM experts to evaluate MVPs before writing code. Paul Graham, Lessig, Godin, Balaji, DHH — and why each one is part of the team."
tags: ["ai", "llm", "startup", "mvp", "product", "claude"]
categories: ["opinion"]
social:
  publish: true
  scheduled_date: 2026-03-12
  platforms: ["twitter", "linkedin"]
  excerpt: "Before writing code, I sit down with Paul Graham, Lessig, Godin, Balaji, and DHH at a table. They’re LLM simulations, but the tension between them is very real. Here's how I designed an adversarial council to evaluate MVPs."
wordpress:
  publish: true
  categories: [1]
  tags: ["ai", "llm", "startup", "mvp", "product"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "council of experts": suitable translation for "consejo de sabios," emphasizing structure over informal advice.
    - Avoid religious connotations where context isn’t literal.
    - Phrases such as "matar moscas a cañonazos" adapted into culturally known equivalent: "using a sledgehammer to crack a nut."
---
In November 2024, a project named **Freysa** assigned an LLM agent to guard an Ethereum wallet. The instruction was straightforward: under no circumstance should the funds be transferred. Participants paid increasing amounts for each attempt to convince it otherwise. After 481 attempts and $47,000 added to the pot, someone managed to trick the model into believing that the *reject* function was actually the *transfer* function.
Weeks later, Jane Street published a puzzle involving a 2,500-layer neural network that turned out to be an MD5 implementation. The winner solved it by combining matrix visualization, reduction to SAT, cryptographic pattern recognition, and a query to ChatGPT.
Both projects generated more buzz than most startups with million-dollar funding rounds. The obvious question is: how do you evaluate an idea like this *before* you build it? How do you know if it has real viral potential or if it’s just an interesting technical exercise no one will share?
## The Problem: Evaluating MVPs in the Viral Era
Most frameworks for evaluating product ideas assume a rational market. Business Model Canvas, Lean Canvas, Jobs To Be Done — these are all great tools for products with predictable demand. But they fail for projects where viral distribution *is* the product.
Freysa didn’t have "customers" in the traditional sense. It didn’t solve a "job to be done." Its mechanism relied on the act of participation itself generating attention, which attracted more participants. It was a circular economy: more attempts created a bigger pot, a bigger pot attracted media coverage, and media coverage brought in more attempts.
To evaluate such projects, you need perspectives that generate **tension**, not consensus. A business analyst will tell you there’s no sustainable revenue model. A viral expert will say sustainability doesn’t matter if the k-factor is greater than 1. Both are right. And the truth lies somewhere in the conflict, emerging only through that friction.
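The k-factor the viral expert cites has a simple conventional definition: invitations sent per user multiplied by the conversion rate of those invitations. A minimal illustration, with made-up numbers:

```python
# k-factor: average invites sent per user x conversion rate of those invites.
# The figures below are hypothetical, purely to show the arithmetic.
invites_per_user = 3.0
invite_conversion = 0.4

k = invites_per_user * invite_conversion
# k > 1 means each user brings in more than one new user: self-sustaining growth
```

With these numbers, k = 1.2, and the viral expert stops caring about the revenue model.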
## The Idea: An Adversarial Council of Simulated Experts
I’ve designed a tool that simulates a council of five experts, each equipped with a specific decision-making framework and a defined jurisdiction. These aren’t just generic personalities with famous names. Each applies a set of precise decision filters that catch problems generic analysis would miss.
The process has three phases:
1. **Independent Analysis:** Each expert evaluates the idea through their lens, without seeing the others' input. This prevents anchoring — if the business expert speaks first and says, "This is amazing," the legal expert might soften their objections.
2. **Adversarial Debate:** The experts review each other’s analyses and critique them. No politeness, just arguments on the merits. Debate is capped at 10 rounds, ending in either consensus or documented deadlock.
3. **Synthesis:** The final output is an actionable plan with issues flagged by area, a timeline, and — most importantly — **kill criteria**: specific metrics that, if unmet, mean the project should be abandoned.
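The three phases can be sketched as a small orchestration loop. Everything here is illustrative: the `Expert` dataclass, the `analyze`/`critique` callables (stand-ins for real LLM calls), and the `MAX_ROUNDS` cap are assumptions of this sketch, not a real integration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MAX_ROUNDS = 10  # debate cap from phase 2

@dataclass
class Expert:
    name: str
    jurisdiction: str
    analyze: Callable[[str], str]              # stand-in for a real LLM call
    critique: Callable[[Dict[str, str]], str]  # may return "" when satisfied

def run_council(idea: str, experts: List[Expert]) -> dict:
    # Phase 1: independent analysis. No expert sees another's output,
    # which prevents anchoring.
    analyses = {e.name: e.analyze(idea) for e in experts}

    # Phase 2: adversarial debate, capped at MAX_ROUNDS.
    objections: List[str] = []
    for _ in range(MAX_ROUNDS):
        raised = [e.critique(analyses) for e in experts]
        new = [o for o in raised if o and o not in objections]
        if not new:  # consensus or deadlock: nothing new to argue
            break
        objections.extend(new)

    # Phase 3: synthesis. Issues grouped by jurisdiction; the kill
    # criteria would be filled in by a separate moderator step.
    return {
        "issues": {e.jurisdiction: analyses[e.name] for e in experts},
        "objections": objections,
        "kill_criteria": [],
    }
```

In a real run, `analyze` and `critique` would each be a prompted LLM call with the expert's framework in the system prompt; the loop structure stays the same.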
## The Five Selected (and Why They Were Chosen)
### Paul Graham — Business and Strategy
His framework for evaluating zero-stage startups is the most rigorous for projects with no data. His question, "Are you doing something people want?" is brutal but necessary. "The people" isn’t a market — it’s a person with a name.
What he brings to the council: discipline in distinguishing between "interesting idea" and "viable business." His mantra of "do things that don’t scale" is crucial for viral MVPs, where the temptation is to build infrastructure for a million users that don’t yet exist.
**Who Didn’t Make the Cut:** Peter Thiel (too contrarian — sometimes he dismisses good projects for not being sufficiently "zero to one"), Alex Hormozi (focused on service businesses, not tech products focused on virality).
### Lawrence Lessig — Legal and Regulatory
He’s not a lawyer who just says, "This isn’t possible." Instead, he views regulation as **architecture**. His "four modalities of regulation" framework (law, social norms, market, and code/architecture) helps analyze how to design systems where regulation won’t be a bottleneck, instead of trying to dodge it.
What he brings to the council: the question, "What happens when the regulator notices you?" Many crypto/AI projects are legally irrelevant at small scale but become regulated when large. Lessig identifies the threshold where regulation gets triggered.
**Who Didn’t Make the Cut:** A generic corporate lawyer (they’d kill any project early with a barrage of "no's"). Lessig goes beyond the law, recognizing that system design can make legal intervention unnecessary.
### Seth Godin — Marketing and Positioning
His core question — "Who is your *smallest viable audience* and why do they care?" — is perhaps the most critical for a viral launch. He doesn’t think about "reaching millions"; he focuses on "reaching the first 100 people who truly care."
What he brings to the council: the remarkability test. Is this something that someone will share without you asking? "Useful" doesn’t get shared. "Remarkable" does. His concept of "Tribes" perfectly aligns with tech/crypto communities that already have strong group identities.
**Who Didn’t Make the Cut:** Philip Kotler (too corporate — thinks in terms of traditional multinational marketing), April Dunford (her positioning framework is incredible but geared towards repositioning existing products, not launching new ones).
### Balaji Srinivasan — Hype and Virality
The most aggressive adviser on the panel, Balaji natively understands crypto-inspired distribution mechanisms: FOMO, tokenized incentives, network effects, and how something goes from zero to trending within 48 hours.
What he brings to the council: the question, "What makes someone screenshot this and post it on Twitter in the next five minutes?" This is the atomic unit of virality. If your product doesn’t inspire spontaneous screenshots, you’ll need a marketing budget.
**Who Didn’t Make the Cut:** GaryVee (understands attention but not the crypto+AI intersection where viral mechanisms thrive today), Mr. Beast (his expertise is video content virality, not tech products), Nir Eyal (his "Hooked" framework targets retention, not launch virality — separate problems).
### DHH (David Heinemeier Hansson) — Technical
His obsession is "the simplest thing that works." For an MVP, the greatest technical risk isn’t picking the wrong stack — it’s never launching because you spent three months choosing one.
What he brings to the council: the question, "Can one person build this in two weeks?" If not, the scope is too large, or the stack is overly complicated. His rule of "boring technology" (PostgreSQL, not CockroachDB; Redis, not Dragonfly) counters "we’re using blockchain because we can" syndrome.
**Who Didn’t Make the Cut:** Werner Vogels (focuses on scalability from day one, which isn’t needed for MVPs), Kelsey Hightower (deep Kubernetes expertise, which usually results in over-engineering an MVP — using a sledgehammer to crack a nut).
## Productive Tensions: Where Truth Emerges
The tensions between council members aren’t a flaw in the design. They *are* the design.
### Balaji vs. Lessig: Virality vs. Regulation
This is the primary tension. Balaji will push for FOMO mechanics involving real money (visible prize pools, pay-to-play, tokens). Lessig will point out that in the EU, pay-to-play with accumulating prize pools qualifies as gambling and requires a gaming license.
The productive resolution isn’t one side "winning." It’s a design that satisfies both — for example, free challenges with sponsored prize pools (legal in most jurisdictions) instead of direct entry fees (regulated as gambling in many countries).
### Godin vs. DHH: Remarkable vs. Spartan
Godin will want a memorable experience — a public leaderboard with animations, participant profiles, achievement badges. DHH will advocate for a static page with SQLite and a form.
The resolution: Can you achieve remarkability with boring tech? The answer is almost always yes. The challenge itself is the remarkable element, not the interface. A leaderboard in an HTML table with no JavaScript can be more remarkable than a Three.js dashboard if the content displayed is genuinely impressive.
### Paul Graham vs. Balaji: Unit Economics vs. Growth
PG will ask for a clear revenue model from day one. Balaji will argue that viral distribution *is* the model — audience first, monetization later.
Both have precedents to back them up. Instagram had no revenue model when it reached 100 million users. But for every Instagram, there are 10,000 projects that scaled without revenue and ultimately failed.
The usual resolution is temporal: validate virality first (giving Balaji the win), but impose a strict timeline for demonstrating unit economics (giving PG the eventual win). The kill criteria formalize this agreement.
## The Most Valuable Output: Kill Criteria
Most side projects die slowly. There’s no clear moment when they fail. The founder just stops dedicating time because "other things came up." Three months later, the domain expires, and no one notices.
**Kill criteria** are the opposite: concrete thresholds, with defined deadlines, that signal when to stop.
| Metric | Threshold | Deadline | Action |
|---|---|---|---|
| Beta participants | <50 | Week 2 | Pivot or stop |
| Launch shares | <100 | Week 4 | Reevaluate |
| 30-day retention | <10% | Week 8 | Stop |
The rule: If two out of three kill criteria are unmet, the project halts. No exceptions. No "one more month." No "we didn’t do enough marketing."
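The two-of-three rule is simple enough to make mechanical. A minimal sketch, using the thresholds from the table above; the measured values are hypothetical:

```python
# Each criterion: (metric, measured value, threshold).
# A criterion fails when the measured value falls below its threshold.
criteria = [
    ("beta participants", 38, 50),      # hypothetical week-2 measurement
    ("launch shares", 120, 100),        # hypothetical week-4 measurement
    ("30-day retention %", 8, 10),      # hypothetical week-8 measurement
]

failed = [name for name, value, threshold in criteria if value < threshold]
verdict = "STOP" if len(failed) >= 2 else "CONTINUE"
```

With these numbers, two of three criteria fail, so the verdict is STOP. The point of encoding it is precisely that there is no field for "one more month."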
This is what separates a professional from an amateur. Amateurs fall in love with the idea. Professionals fall in love with the outcome. And if the outcome doesn’t materialize within the agreed timeframe, they have the discipline to move on.
## Why Simulations, Not Real People?
The obvious objection: Why not talk to real people instead of simulating experts with an LLM?
Three reasons.
**Availability.** Paul Graham isn’t giving you two hours to analyze your side project. The simulation will. And while the simulation doesn’t have the original’s accumulated experience, it applies their published frameworks with a consistency busy people might not achieve.
**Honest Adversariality.** Real people soften their critiques out of politeness. A simulation configured to be adversarial will actually question everything. "You don’t have a functional revenue model" is something that an investor might think but not say out loud in a first meeting. The simulation says it in round one.
**Zero Marginal Cost.** You can run the council five times, tweaking variations of the same idea, and compare results. Trying to do that with real people would consume 25 hours of their time.
Simulations don’t replace real advisors. But they prepare you for those conversations by eliminating obvious issues beforehand. It’s the difference between presenting a clean draft and showing up with an unfiltered first pass.
## The Meta-Pattern: Structured Debate as a Decision-Making Tool
This design isn’t just for MVPs. I already use it for code reviews (three experts in simplicity, design, and performance) and design reviews (four experts in information density, usability, product, and interaction).
The core pattern remains:
1. **Experts with defined jurisdictions:** Each has domain-specific authority. Outside their domain, they have no vote.
2. **Explicit decision frameworks:** It’s not "what do you think," but "what does your framework say about this."
3. **Planned Tensions:** Conflicts between experts are intentional. They’re the most valuable source of insight in the process.
4. **Forced Convergence:** Maximum of N rounds. If no consensus is reached, the moderator decides and documents dissent as a risk.
5. **Actionable Output:** Not an essay but specific issues, deadlines, and success/failure criteria.
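Point 4 is the one that keeps a debate from running forever. A minimal sketch of the convergence rule; the `moderator` tiebreaker is a hypothetical callable standing in for a final judgment step:

```python
def converge(votes, rounds_used, max_rounds, moderator):
    """votes maps expert name to verdict. Returns (decision, dissenters)."""
    if len(set(votes.values())) == 1:
        return next(iter(votes.values())), []   # consensus reached
    if rounds_used >= max_rounds:
        decision = moderator(votes)             # moderator breaks the deadlock
        dissent = sorted(e for e, v in votes.items() if v != decision)
        return decision, dissent                # dissent is documented as a risk
    return None, []                             # no decision yet: keep debating
```

The dissenters list is not noise to discard: it goes straight into the risk section of the final plan.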
The difference between "asking one LLM to analyze your idea" and "having five specialized LLMs debate your idea" is not one of degree. It’s one of kind. The former produces an opinion. The latter produces a risk map and plan, exposing blind spots as the perspectives clash.
## The Question You Should Be Asking Yourself
Before you write the first line of code for your next project, ask yourself: Who’s going to tell you it’s a bad idea?
If the answer is "no one, because I haven’t asked anyone," you already have a problem. If the answer is "my friends, who are super supportive," you have an even bigger problem.
What you need isn’t support. It’s structured scrutiny — from people (real or simulated) who are incentivized to find flaws, not to validate your illusions. Five perspectives conflicting with one another will yield more truth than one that simply agrees with you.
The cost of evaluating an idea is an afternoon. The cost of building a bad idea is months of your life you’ll never get back. The math is clear.
---
**Related Reading:** If you're curious about how adversarial thinking applies to debugging opaque systems, check out [A 2,500-Layer Neural Network That Turns Out to Be MD5](/posts/reverse-engineer-neural-network-senior-debugging/). And if you want to see how the same council pattern applies to code reviews, read [Simplify: A Jedi Council for Code Reviews with AI](/en/simplify-jedi-council-ai-code-review/).