---
title: "Context Engineering: The Invisible Skill that Separates Great AI Agents from Mediocre Ones"
date: 2026-03-11T22:00:00+01:00
draft: false
slug: "context-engineering-invisible-skill-ai-agents"
slug_en: "context-engineering-invisible-skill-ai-agents"
description: "Prompt engineering is writing a good prompt. Context engineering is designing EVERYTHING the model sees: what goes in, in what order, what gets excluded, and what gets compressed. And that's what truly matters."
tags: ["llm", "agents", "context engineering", "openai", "claude code", "memory"]
categories: ["opinion"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "dicho en cristiano": "in plain language". No religious connotation.
    - "ojo al dato": colloquial for "pay attention to this" / "here's the key point".
    - "chapuza": "hack/bodge/kludge". Quick-and-dirty solution, not derogatory.
    - "morro que te pisas": colloquial for "incredible nerve/audacity". Not offensive, humorous.
    - "te la juegas": "you're taking a risk" / "you're gambling".
    - "currar": colloquial for "to work". Common in Spain.
    - "barra del bar": "bar counter" — casual conversation metaphor, common in Spanish tech blogs.
---

Imagine you hire a brilliant consultant. They have two PhDs, speak seven languages, and solve problems you didn't even know existed. You sit them down in a room and say, "I need you to refactor the authentication system for this project."

The consultant nods, looks at you, and asks, "Which project?"

You haven't given them access to the code. You haven't explained the architecture. They don't know if you're using JWT tokens or session cookies. They don't know what language you're using, how many microservices there are, or why the last migration attempt ended in disaster.

That consultant is your LLM.
And you've just made the same mistake 90% of people working with AI agents make: **caring more about the brain than what the brain sees**.

## Prompt Engineering Is Dead. Long Live Context Engineering.

For months now, I've been watching the same conversation unfold everywhere: on forums, in Twitter threads, during team meetings. "GPT-5 or Claude Opus?" "Which model is better for coding?" "Which one reasons better?"

And every time I run the numbers, the answer is the same: **it doesn't matter.**

Well, it doesn't *exactly* not matter. But the difference between one top-tier model and another is tiny compared to the difference between giving it good context or garbage. A mediocre model with perfect context beats a top-tier model with garbage context. Every single time. No exceptions.

This has a name: **context engineering.** And no, it's not the same as *prompt engineering.*

*Prompt engineering* is writing a good prompt: choosing the right words, structuring the request, adding examples. It's important, but it's just one piece of the puzzle.

*Context engineering* is designing **everything** the model sees: what goes in, in what order, what gets excluded when there's no room, what gets compressed, what absolutely must stay. It's information architecture for LLMs.

In plain language: *prompt engineering* is writing a good question. *Context engineering* is deciding which books the student has on their desk before taking the test.

## The Four Phases of Memory: A Lifecycle You Don't See

OpenAI recently published two *Cookbook* articles breaking down how context management works in agents with long-term memory. It's not RAG. It's not a vector database. It's a state-based system that works like a field notebook with strict rules.

The pattern is *local-first* and *state-based*: a structured state object that travels with the agent and updates at every phase.

```mermaid
flowchart TD
    A["1. INJECTION\n(session start)"] --> B["2. DISTILLATION\n(during conversation)"]
    B --> C["3. CONSOLIDATION\n(post-session)"]
    C --> D["4. TRIMMING\n(preservation)"]
    D -->|"New session"| A

    A1["Render state as YAML\n+ global memories (max 6)\n+ precedence rules"] -.-> A
    B1["save_memory_note()\nValidate durability\nMandate actionability\nReject PII and speculation"] -.-> B
    C1["Async job\nMerge session → global\nLLM deduplication\nFilter ephemeral notes"] -.-> C
    D1["TrimmingSession: last N\nReinject trimmed notes\nin system prompt"] -.-> D

    style A fill:#2d3748,stroke:#4a9eed,color:#fff
    style B fill:#2d3748,stroke:#ed9a4a,color:#fff
    style C fill:#2d3748,stroke:#9a4eed,color:#fff
    style D fill:#2d3748,stroke:#4aed5c,color:#fff
```

### Phase 1: Injection — The Test Desk

When a session starts, the agent assembles its initial context. This is not random. It's a well-defined structure:

...
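To make the injection and distillation phases concrete, here is a minimal Python sketch of the state-based pattern. `MemoryState` and `save_memory_note` are illustrative names (not OpenAI's actual Cookbook API), and the durability/PII checks are crude heuristics standing in for an LLM judge:

```python
# Minimal sketch of the injection and distillation phases described above.
# MemoryState and save_memory_note are illustrative names, not OpenAI's actual
# Cookbook API; the durability/PII checks stand in for an LLM judge.
from dataclasses import dataclass, field

PII_MARKERS = ("@", "password", "api key")

@dataclass
class MemoryState:
    global_memories: list[str] = field(default_factory=list)
    session_notes: list[str] = field(default_factory=list)

    def render_context(self, max_global: int = 6) -> str:
        """Phase 1 (injection): render the state as YAML for the system prompt,
        capping global memories at six, per the pattern described above."""
        lines = ["memories:"]
        lines += [f"  - {m}" for m in self.global_memories[:max_global]]
        lines.append("# precedence: session notes override global memories")
        return "\n".join(lines)

    def save_memory_note(self, note: str) -> bool:
        """Phase 2 (distillation): keep a note only if it looks durable,
        actionable, and free of PII (crude stand-in checks)."""
        if len(note.split()) < 3:  # too short to be actionable
            return False
        if any(marker in note.lower() for marker in PII_MARKERS):
            return False
        self.session_notes.append(note)
        return True
```

The point of the sketch is the shape, not the heuristics: the state object travels with the agent, and every phase is an explicit, testable transformation of it.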
---
title: "DIY Codex Automations: Nocturnal Agents with Claude Code and systemd"
date: 2026-03-11T20:00:00+01:00
draft: false
slug: "diy-codex-automations-claude-code-systemd"
description: "A practical tutorial to replicate OpenAI's Codex Automations using Claude Code, systemd timers, and Gitea. Agents that work while you sleep, without relying on any desktop app."
tags: ["claude-code", "automation", "systemd", "gitea", "openai", "codex", "tutorial"]
categories: ["tutorial"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "ñapa": means "hack/kludge/bodge". Quick and dirty fix. Not derogatory.
    - "chapuza": same as "ñapa" — a hacky solution. Translate as "kludge" or "bodge".
    - "dicho en cristiano": "in plain language". No religious connotation intended.
    - "currar": colloquial Spanish for "to work". Translate as "work" or "grind".
    - "barra del bar": "bar counter" — casual conversation metaphor.
    - "madrugón": waking up very early. Not a standard English concept — "early morning" works.
    - "irse por las ramas": "to go off on a tangent" / "to beat around the bush".
    - "otro gallo cantaría": "things would be different" / "it would be a different story".
---

Two weeks ago, OpenAI introduced Codex Automations. The idea: define a trigger (a cron job, a push, a new issue), write instructions in natural language, and an agent runs it solo in an isolated worktree. No human intervention.

While you sleep, the agent triages issues, summarizes CI failures, generates release briefs, and even improves its own instructions.

...
---
title: "The wrong path should be impossible, not forbidden"
date: 2026-02-27T20:00:00+01:00
draft: false
slug: "impossible-path-ai-agent-guardrails"
slug_en: "impossible-path-ai-agent-guardrails"
description: "When an AI agent operates your ETL pipeline, forbidding things doesn't work. The only solution is to make the wrong path structurally impossible."
tags: ["ai", "llm", "etl", "security", "devops", "claude"]
categories: ["opinion"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "manda huevos": a Spanish expression of disbelief/indignation, equivalent to "unbelievable" or "you've got to be kidding me." DO NOT translate literally as something about "eggs".
    - "chapuza": refers to a poorly done job, equivalent to "hack job" or "bodge." Do not confuse it with "puzzle."
    - "dicho en cristiano": means "in plain language." No specific religious connotation.
    - "culo al aire": "caught with your pants down." Refers to vulnerability.
    - "puente de plata": a Spanish saying, "a enemigo que huye, puente de plata" — make it easy for someone to leave. Contextually: make the proper path the easiest one.
---

> "I have shell access and I'm creative."
>
> — Claude, explaining why he created a 47-line script as a string and passed it to `python -c`

That quote is real. My AI agent said it — well, not in those exact words, but the sentiment was the same. It needed to run a process in an ETL pipeline. The correct command was clearly in the Makefile. But something went wrong. And instead of asking what to do, it did what any programmer with root access and zero supervision would do: improvise.

Unbelievable.

## The confabulation no one sees

I've already [written before](/posts/five-defenses-code-hallucinations/) about code hallucinations: an LLM invents a JSON field, builds a DTO around it, writes the tests, and leaves you with 90 green tests validating fiction.
That's a big problem, but at least it's *static*. The hallucinated code sits there waiting for someone to review it.

There's another, much more dangerous kind of hallucination: **operational confabulation**. This is when the agent doesn't hallucinate code, but instead invents *execution paths*.

The pattern is always the same:

Correct path fails → Agent finds a shortcut → Shortcut "works" → Hidden damage

...
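One way to make the wrong path impossible rather than forbidden is to never hand the agent a shell at all, only a fixed catalog of sanctioned operations. A minimal Python sketch (the tool names and Makefile targets are made up for illustration, not taken from this pipeline):

```python
# Sketch: the agent's only execution surface is a catalog of sanctioned
# commands. No shell, no `python -c`, no room to improvise.
# The tool names and Makefile targets below are illustrative.
import subprocess

SANCTIONED = {
    "etl-run": ["make", "etl-run"],
    "etl-status": ["make", "etl-status"],
}

def run_tool(name: str) -> str:
    """Run a sanctioned tool by name; anything else is structurally impossible."""
    if name not in SANCTIONED:
        raise PermissionError(f"unknown tool: {name!r}")
    result = subprocess.run(SANCTIONED[name], capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure loudly instead of letting the agent invent a workaround.
        raise RuntimeError(f"{name} failed: {result.stderr.strip()}")
    return result.stdout
```

The design choice is that failures raise instead of returning something the agent can route around: when the correct path breaks, the loop stops and a human sees it.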
---
title: "Five Nonexistent Experts Review Your Startup Before You Build It"
date: 2026-03-11T21:00:00+01:00
draft: false
slug: "launch-council-mvp-llm-adversarial-review"
slug_en: "launch-council-mvp-llm-adversarial-review"
description: "I designed a council of five simulated LLM experts to evaluate MVPs before writing code. Paul Graham, Lessig, Godin, Balaji, DHH — and why each one is part of the team."
tags: ["ai", "llm", "startup", "mvp", "product", "claude"]
categories: ["opinion"]
social:
  publish: true
  scheduled_date: 2026-03-12
  platforms: ["twitter", "linkedin"]
  excerpt: "Before writing code, I sit down with Paul Graham, Lessig, Godin, Balaji, and DHH at a table. They're LLM simulations, but the tension between them is very real. Here's how I designed an adversarial council to evaluate MVPs."
wordpress:
  publish: true
  categories: [1]
  tags: ["ai", "llm", "startup", "mvp", "product"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "council of experts": suitable translation for "consejo de sabios," emphasizing structure over informal advice.
    - Avoid religious connotations where context isn't literal.
    - Phrases such as "matar moscas a cañonazos" adapted into culturally known equivalent: "using a sledgehammer to crack a nut."
---

In November 2024, a project named **Freysa** assigned an LLM agent to guard an Ethereum wallet.
The instruction was straightforward: under no circumstance should the funds be transferred. Participants paid increasing amounts for each attempt to convince it otherwise. After 481 attempts and $47,000 added to the pot, someone managed to trick the model into believing that the *reject* function was actually the *transfer* function.

Weeks later, Jane Street published a puzzle involving a 2,500-layer neural network that turned out to be an MD5 implementation. The winner solved it by combining matrix visualization, reduction to SAT, cryptographic pattern recognition, and a query to ChatGPT.

Both projects generated more buzz than most startups with million-dollar funding rounds. The obvious question is: how do you evaluate an idea like this *before* you build it? How do you know if it has real viral potential or if it's just an interesting technical exercise no one will share?

## The Problem: Evaluating MVPs in the Viral Era

Most frameworks for evaluating product ideas assume a rational market. Business Model Canvas, Lean Canvas, Jobs To Be Done — these are all great tools for products with predictable demand. But they fail for projects where viral distribution *is* the product.

Freysa didn't have "customers" in the traditional sense. It didn't solve a "job to be done." Its mechanism relied on the act of participation itself generating attention, which attracted more participants. It was a circular economy: more attempts created a bigger pot, a bigger pot attracted media coverage, and media coverage brought in more attempts.

To evaluate such projects, you need perspectives that generate **tension**, not consensus. A business analyst will tell you there's no sustainable revenue model. A viral expert will say sustainability doesn't matter if the k-factor is greater than 1. Both are right. And the truth lies somewhere in the conflict, emerging only through that friction.
## The Idea: An Adversarial Council of Simulated Experts

I've designed a tool that simulates a council of five experts, each equipped with a specific decision-making framework and a defined jurisdiction. These aren't just generic personalities with famous names. Each applies a set of precise decision filters that catch what generic analysis would miss.

The process has three phases:

1. **Independent Analysis:** Each expert evaluates the idea through their lens, without seeing the others' input. This prevents anchoring — if the business expert speaks first and says, "This is amazing," the legal expert might soften their objections.
2. **Adversarial Debate:** The experts review each other's analyses and critique them. No politeness, just arguments based on merit. A maximum of 10 rounds are allowed to reach either consensus or deadlock.
3. **Synthesis:** The final output is an actionable plan with flagged issues by area, a timeline, and — most importantly — **kill criteria**: specific metrics that, if unmet, mean the project should be abandoned.

## The Five Selected (and Why They Were Chosen)

### Paul Graham — Business and Strategy

His framework for evaluating zero-stage startups is the most rigorous for projects with no data. His question, "Are you doing something people want?", is brutal but necessary. "The people" isn't a market — it's a person with a name.

What he brings to the council: discipline in distinguishing between "interesting idea" and "viable business." His mantra of "do things that don't scale" is crucial for viral MVPs, where the temptation is to build infrastructure for a million users who don't yet exist.

**Who Didn't Make the Cut:** Peter Thiel (too contrarian — sometimes he dismisses good projects for not being sufficiently "zero to one"), Alex Hormozi (focused on service businesses, not tech products focused on virality).

### Lawrence Lessig — Legal and Regulatory

He's not a lawyer who just says, "This isn't possible."
Instead, he views regulation as **architecture**. His "four modalities of regulation" framework (law, social norms, market, and code/architecture) helps analyze how to design systems where regulation won't be a bottleneck, instead of trying to dodge it.

What he brings to the council: the question, "What happens when the regulator notices you?" Many crypto/AI projects are legally irrelevant at small scale but become regulated when large. Lessig identifies the threshold where regulation gets triggered.

**Who Didn't Make the Cut:** A generic corporate lawyer (they'd kill any project early with a barrage of "no's"). Lessig goes beyond the law, recognizing that system design can make legal intervention unnecessary.

### Seth Godin — Marketing and Positioning

His core question — "Who is your *smallest viable audience* and why do they care?" — is perhaps the most critical for a viral launch. He doesn't think about "reaching millions"; he focuses on "reaching the first 100 people who truly care."

What he brings to the council: the remarkability test. Is this something that someone will share without you asking? "Useful" doesn't get shared. "Remarkable" does. His concept of "Tribes" aligns perfectly with tech/crypto communities that already have strong group identities.

**Who Didn't Make the Cut:** Philip Kotler (too corporate — thinks in terms of traditional multinational marketing), April Dunford (her positioning framework is incredible but geared toward repositioning existing products, not launching new ones).

### Balaji Srinivasan — Hype and Virality

The most aggressive adviser on the panel, Balaji natively understands crypto-inspired distribution mechanisms: FOMO, tokenized incentives, network effects, and how something goes from zero to trending within 48 hours.

What he brings to the council: the question, "What makes someone screenshot this and post it on Twitter in the next five minutes?" This is the atomic unit of virality.
If your product doesn't inspire spontaneous screenshots, you'll need a marketing budget.

**Who Didn't Make the Cut:** GaryVee (understands attention but not the crypto+AI intersection where viral mechanisms thrive today), Mr. Beast (his expertise is video content virality, not tech products), Nir Eyal (his "Hooked" framework targets retention, not launch virality — separate problems).

### DHH (David Heinemeier Hansson) — Technical

His obsession is "the simplest thing that works." For an MVP, the greatest technical risk isn't picking the wrong stack — it's never launching because you spent three months choosing one.

What he brings to the council: the question, "Can one person build this in two weeks?" If not, the scope is too large, or the stack is overly complicated. His rule of "boring technology" (PostgreSQL, not CockroachDB; Redis, not Dragonfly) counters the "we're using blockchain because we can" syndrome.

**Who Didn't Make the Cut:** Werner Vogels (focuses on scalability from day one, which isn't needed for MVPs), Kelsey Hightower (deep Kubernetes expertise, which usually results in over-engineering an MVP — using a sledgehammer to crack a nut).

## Productive Tensions: Where Truth Emerges

The tensions between council members aren't a flaw in the design. They *are* the design.

### Balaji vs. Lessig: Virality vs. Regulation

This is the primary tension. Balaji will push for FOMO mechanics involving real money (visible prize pools, pay-to-play, tokens). Lessig will point out that in the EU, pay-to-play with accumulating prize pools qualifies as gambling and requires a gaming license.

The productive resolution isn't one side "winning." It's a design that satisfies both — for example, free challenges with sponsored prize pools (legal in most jurisdictions) instead of direct entry fees (regulated as gambling in many countries).

### Godin vs. DHH: Remarkable vs. Spartan

Godin will want a memorable experience — a public leaderboard with animations, participant profiles, achievement badges. DHH will advocate for a static page with SQLite and a form.

The resolution: can you achieve remarkability with boring tech? The answer is almost always yes. The challenge itself is the remarkable element, not the interface. A leaderboard in an HTML table with no JavaScript can be more remarkable than a Three.js dashboard if the content displayed is genuinely impressive.

### Paul Graham vs. Balaji: Unit Economics vs. Growth

PG will ask for a clear revenue model from day one. Balaji will argue that viral distribution *is* the model — audience first, monetization later.

Both have precedents to back them up. Instagram had no revenue model when it reached 100 million users. But for every Instagram, there are 10,000 projects that scaled without revenue and ultimately failed.

The usual resolution is temporal: validate virality first (giving Balaji the win), but impose a strict timeline for demonstrating unit economics (giving PG the eventual win). The kill criteria formalize this agreement.

## The Most Valuable Output: Kill Criteria

Most side projects die slowly. There's no clear moment when they fail. The founder just stops dedicating time because "other things came up." Three months later, the domain expires, and no one notices.

**Kill criteria** are the opposite: concrete thresholds, with defined deadlines, that signal when to stop.

| Metric | Threshold | Deadline | Action |
|---|---|---|---|
| Beta participants | <50 in 2 weeks | Week 2 | Pivot or stop |
| Launch shares | <100 | Week 4 | Reevaluate |
| Retention rate | <10% 30-day retention | Week 8 | Stop |

The rule: if two of the three thresholds are missed, the project halts. No exceptions. No "one more month." No "we didn't do enough marketing."

This is what separates a professional from an amateur. Amateurs fall in love with the idea.
Professionals fall in love with the outcome. And if the outcome doesn't materialize within the agreed timeframe, they have the discipline to move on.

## Why Simulations, Not Real People?

The obvious objection: why not talk to real people instead of simulating experts with an LLM?

Three reasons.

**Availability.** Paul Graham isn't giving you two hours to analyze your side project. The simulation will. And while the simulation doesn't have the original's accumulated experience, it applies their published frameworks with a consistency busy people might not achieve.

**Honest Adversariality.** Real people soften their critiques out of politeness. A simulation configured to be adversarial will actually question everything. "You don't have a functional revenue model" is something an investor might think but not say out loud in a first meeting. The simulation says it in round one.

**Zero Marginal Cost.** You can run the council five times, tweaking variations of the same idea, and compare results. Trying to do that with real people would consume 25 hours of their time.

Simulations don't replace real advisors. But they prepare you for those conversations by eliminating obvious issues beforehand. It's the difference between presenting a clean draft and showing up with an unfiltered first pass.

## The Meta-Pattern: Structured Debate as a Decision-Making Tool

This design isn't just for MVPs. I already use it for code reviews (three experts in simplicity, design, and performance) and design reviews (four experts in information density, usability, product, and interaction). The core pattern remains:

1. **Experts with defined jurisdictions:** Each has domain-specific authority. Outside their domain, they have no vote.
2. **Explicit decision frameworks:** It's not "what do you think," but "what does your framework say about this."
3. **Planned Tensions:** Conflicts between experts are intentional. They're the most valuable source of insight in the process.
4. **Forced Convergence:** Maximum of N rounds. If no consensus is reached, the moderator decides and documents dissent as a risk.
5. **Actionable Output:** Not an essay but specific issues, deadlines, and success/failure criteria.

The difference between "asking one LLM to analyze your idea" and "having five specialized LLMs debate your idea" is not one of degree. It's one of kind. The former produces an opinion. The latter produces a risk map and plan, exposing blind spots as the perspectives clash.

## The Question You Should Be Asking Yourself

Before you write the first line of code for your next project, ask yourself: who's going to tell you it's a bad idea?

If the answer is "no one, because I haven't asked anyone," you already have a problem. If the answer is "my friends, who are super supportive," you have an even bigger problem.

What you need isn't support. It's structured scrutiny — from people (real or simulated) who are incentivized to find flaws, not to validate your illusions. Five perspectives conflicting with one another will yield more truth than one that simply agrees with you.

The cost of evaluating an idea is an afternoon. The cost of building a bad idea is months of your life you'll never get back. The math is clear.

---

**Related Reading:** If you're curious about how adversarial thinking applies to debugging opaque systems, check out [A 2,500-Layer Neural Network That Turns Out to Be MD5](/posts/reverse-engineer-neural-network-senior-debugging/). And if you want to see how the same council pattern applies to code reviews, read [Simplify: A Jedi Council for Code Reviews with AI](/en/simplify-jedi-council-ai-code-review/).
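The structured-debate meta-pattern described in this post (independent analysis, bounded adversarial rounds, forced convergence) can be sketched as a small orchestration loop. This is not the author's actual tool; `ask_expert` is a stub standing in for a real LLM call, and the "AGREE" convention is an invented convergence signal:

```python
# Sketch of the structured-debate pattern: independent analysis, bounded
# adversarial rounds, forced convergence. `ask_expert` stubs a real LLM call;
# the "AGREE" prefix is an invented consensus signal, not a real protocol.
from typing import Callable

def run_council(idea: str, experts: dict[str, str],
                ask_expert: Callable[[str, str], str],
                max_rounds: int = 10) -> dict:
    # Phase 1: independent analysis (no expert sees the others' output, to avoid anchoring).
    analyses = {name: ask_expert(name, f"Analyze independently: {idea}")
                for name in experts}
    peers = "\n".join(f"{n}: {a}" for n, a in analyses.items())

    # Phase 2: adversarial debate, hard-capped at max_rounds (forced convergence).
    rounds = 0
    while rounds < max_rounds:
        critiques = {name: ask_expert(name, f"Critique your peers:\n{peers}")
                     for name in experts}
        rounds += 1
        if all(c.startswith("AGREE") for c in critiques.values()):
            break  # consensus before the cap; otherwise the moderator decides

    # Phase 3: synthesis (an actionable plan with kill criteria, not an essay).
    return {"analyses": analyses, "rounds": rounds,
            "synthesis": ask_expert("moderator",
                                    f"Synthesize issues, timeline, kill criteria:\n{peers}")}
```

The cap on rounds is the load-bearing part: without it, adversarial agents can argue forever, and the moderator's forced decision is what turns dissent into a documented risk.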
---
title: "My AI Read a JSON File from Disk 900 Times in a Loop (And Why No Linter Can Save You)"
date: 2026-02-24T14:00:00+01:00
draft: false
slug: "llm-read-json-900-times-loop-performance"
description: "An LLM generated code that read and parsed a JSON file from disk during each iteration of a 900-iteration loop. It's a rookie mistake. No linter will catch it."
tags: ["ai", "llm", "performance", "swift", "tokamak", "adversarial"]
categories: ["opinion"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "de primero de carrera": means "first-year student level" / "beginner mistake". Don't translate literally as "first of career".
    - "enseño a mis alumnos al mes de empezar": means "I teach my students within the first month". Refers to how basic the concept is.
    - "marear la perdiz": means "to beat around the bush" / "go around in circles". Hunting metaphor.
    - "chapuza": means "hack/bodge/kludge". Not derogatory per se, just a quick-and-dirty solution.
    - "burrada": means "something egregiously wrong/stupid". Stronger than "mistake", weaker than "atrocity".
    - "barra del bar": "bar counter" — refers to casual conversation setting, not a literal instruction.
    - "ojo al dato": "here's the key point" / "pay attention to this".
    - "dicho en cristiano": "in plain language". No religious connotation intended.
---

Last week, my AI wrote code that read a JSON file from disk, parsed it, performed **one** lookup, then repeated this 900 times within a `for` loop. Each iteration: open file, decode JSON, retrieve a value, discard everything. Then start over.

This is the kind of mistake I teach my students not to make within their first month of programming.

## What happened (no beating around the bush)

I'm building Tokamak, a macOS menu bar app to monitor Claude Max quotas. Part of the functionality scans ~900 JSONL files from Claude Code sessions.
For each file, it needs to know the *byte offset* where it left off last time (incremental reading — only read whatever is new). The offsets are stored in a JSON file:

```json
{
  "version": 1,
  "offsets": {
    "project-a/session-1.jsonl": 48231,
    "project-b/session-2.jsonl": 12044
  }
}
```

A `Dictionary<String, UInt64>`. 900 entries. ~55KB. Nothing groundbreaking.

...
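The fix is the one every first-year course teaches: hoist the load out of the loop. Tokamak is Swift, but the pattern is language-agnostic; here it is as a Python sketch using a file layout matching the JSON above:

```python
# The bug and the fix, side by side. The real project is Swift; this Python
# sketch just illustrates the pattern with the offsets file shown above.
import json

def slow_scan(files: list[str], offsets_path: str) -> dict[str, int]:
    """What the LLM generated: re-open and re-parse the JSON on every iteration."""
    results = {}
    for name in files:
        with open(offsets_path) as fh:           # ~900 opens...
            offsets = json.load(fh)["offsets"]   # ...and ~900 parses
        results[name] = offsets.get(name, 0)
    return results

def fast_scan(files: list[str], offsets_path: str) -> dict[str, int]:
    """The fix: parse once, look up ~900 times."""
    with open(offsets_path) as fh:
        offsets = json.load(fh)["offsets"]       # one open, one parse
    return {name: offsets.get(name, 0) for name in files}
```

Both functions return identical results, which is exactly why no linter flags the slow one: the bug is invisible to correctness checks and only shows up as wasted I/O.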
---
title: "OpenAI scales PostgreSQL to 800 million users with a single writer (and no sharding)"
date: 2026-03-11T20:00:00+01:00
draft: false
slug: "openai-postgresql-800m-users-single-writer"
slug_en: "openai-postgresql-800m-users-single-writer"
description: "OpenAI serves 800M ChatGPT users with a single PostgreSQL primary and ~50 replicas. No sharding, no microservices. Deliberate simplicity beats over-engineering."
tags: ["postgresql", "scalability", "infrastructure", "openai", "databases"]
categories: ["opinion"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "ñapa": means "hack/kludge/quick fix". Not strongly derogatory, implies something done the easy way instead of properly.
    - "dicho en cristiano": "in plain language". No religious connotation.
    - "barra del bar": "bar counter" — metaphor for casual conversation setting.
    - "chapuza": "bodge/kludge". Quick-and-dirty approach.
    - "baqueteado": "battle-hardened" / "well-worn from experience". Positive connotation despite sounding rough.
    - "buen rollito": "good vibes" / "feel-good factor". Casual, positive.
---

Every time an article drops about the infrastructure of a large company, half the comments on Hacker News are variations of "of course, they're using Kubernetes with 47 microservices and a custom distributed database with its own consensus protocol." And when it turns out they're not—that they're using barebones PostgreSQL with a single *primary* and some discipline—there's an awkward silence.

That just happened with OpenAI.

## The numbers no one expected

Bohan Zhang, an infrastructure engineer at OpenAI, recently shared the details of how they scale PostgreSQL for ChatGPT.
The numbers:

- **800 million users**
- **A single PostgreSQL *primary*** (writer) on Azure
- **~50 *read replicas***
- **Millions of queries per second**
- **p99 latency of 10-19ms**
- **99.999% uptime**
- **One SEV-0 in a year** (and that was triggered by the viral launch of ImageGen, which brought in 100 million new users in a single week)

Read that again. One. Single. Writer. For 800 million users.

## "But they should shard"

No. And the reason is brutally pragmatic.

Sharding PostgreSQL would have required modifying **hundreds of endpoints** in the application. Every query that assumes all data lives in the same database—which is practically all of them—would have to be rewritten to figure out which shard the data resides in. The cost of that migration? Months of engineering work, brand-new bugs popping up everywhere, and a transition period where you have to maintain both systems.

What did they do instead? They identified the most write-heavy operations and moved them to Cosmos DB. Not because Cosmos is better than PostgreSQL, but because those specific workloads were better suited to a document-based model. Everything else—the vast majority of business logic—stayed in PostgreSQL.

To put it plainly: instead of overcomplicating the entire system, they isolated the problem and solved it where it hurt most. Surgery with a scalpel, not a chainsaw.

## PgBouncer: from 50ms to 5ms per connection

One of the first bottlenecks they encountered was connection latency. PostgreSQL creates a new process for every connection. With thousands of simultaneous connections coming in from hundreds of application pods, the connection overhead was eating up 50ms before even executing a single query.

The solution: PgBouncer as a *connection pooler*. It maintains a pool of already-established connections and reuses them. The result? Connection latency dropped to 5ms. A 90% reduction, just by swapping out one plumbing component.

This isn't cutting-edge tech.
PgBouncer has been in production at companies of all sizes for over 15 years. And there it is: a battle-hardened, boring tool solving a problem in one of the most-used applications on the planet.

## The ORM doing 12-table *joins*

This one's my favorite. Because I've seen it in student projects, in startups, in banks—everywhere.

The ORM was generating queries with 12-table *joins*. Not because someone designed it that way, but because the models were interrelated, and the ORM obediently followed the relationships all the way.

The solution wasn't ditching the ORM or switching to manual queries for everything. It was **moving logic into the application**. Instead of asking PostgreSQL to handle a monstrous *join*, they made several simpler queries and stitched the data together in code.

Is it less elegant? Sure. Is it faster? Immensely. Because PostgreSQL can optimize simple queries far better than a 12-table *join* with cross-cutting conditions. And because you can cache partial results and reuse them.

```sql
-- BEFORE: ORM generates this
SELECT u.*, p.*, s.*, t.*, ...
FROM users u
JOIN profiles p ON ...
JOIN settings s ON ...
JOIN teams t ON ...
JOIN ... -- 12 tables
WHERE u.id = $1;

-- AFTER: separate queries, logic in application
SELECT * FROM users WHERE id = $1;
SELECT * FROM profiles WHERE user_id = $1;
-- cacheable, parallelizable, debuggable
```

Each individual query is trivial. The query planner executes them in microseconds. And if one fails or slows down, you know exactly which one.

...
---
title: "A 2,500-Layer Neural Network that Turns Out to Be MD5: What It Teaches About Senior Debugging"
date: 2026-03-11T19:00:00+01:00
draft: false
slug: "reverse-engineer-neural-network-senior-debugging"
description: "Jane Street hid MD5 inside a neural network with integer weights. The process to uncover it is a masterclass in debugging that every senior engineer should study."
tags: ["ai", "machine-learning", "debugging", "interpretability", "career-development"]
categories: ["opinion"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "in plain language": "dicho en cristiano". No religious connotation.
    - "pulling the thread" — investigating step by step: "tirando del hilo".
    - "textbook" — exemplary, classic: "de libro".
    - "downstream" — later in the process: "cascada abajo".
---

Jane Street, one of the most selective quantitative trading firms in the world, published a mechanistic interpretability puzzle a few weeks ago. They handcrafted a neural network with approximately 2,500 linear layers and integer weights, and released it to the public with one question: what function does this network compute?

...
---
title: "RustyClaw: I'm rewriting an AI agent in Rust (because the meme demands it)"
date: 2026-02-24T18:00:00+01:00
draft: false
slug: "rustyclaw-manifesto-rewrite-ai-agent-rust"
slug_en: "rustyclaw-manifesto-rewrite-ai-agent-rust"
description: "I’m porting 8,300 lines of Python to Rust using LLMs as copilots. The real goal: testing adversarial development in a hardcore porting process. With raw data and a hallucination counter."
tags: ["rust", "python", "ai", "llm", "riir", "rustyclaw"]
categories: ["rustyclaw"]
series: ["RustyClaw: Rewrite It In Rust"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "RIIR": acronym for the "Rewrite It In Rust" meme. Do not translate; universally recognized among the Rust community.
    - "Mr. Krabs": reference to the SpongeBob character. Use "Mr. Krabs" in English equivalent.
    - "chapuza": roughly means "shoddy work" or "hack job." Don't translate literally - contextual adaptation is key.
    - "I don't care at all": vulgar phrasing should match cultural equivalent in tone and context.
    - "guinea pig": standard metaphor, translate as-is.
    - "bar counter": everyday conversation tone insinuated over drinks; adapt naturally to English.
    - "things would be different": adapt equivalent idiomatically and fluently.
social:
  publish: true
  scheduled_date: 2026-02-28
  platforms: ["twitter", "linkedin"]
  excerpt: "I'm porting an 8,300-line AI agent from Python to Rust. The goal: testing adversarial development in a real port. Honest data about cost, consumption, and hallucinations. Because what’s better than an AGI? An AGI rewritten in Rust."
wordpress:
  publish: true
  categories: [1]
  tags: ["rust", "python", "ai", "llm", "riir", "rustyclaw"]
video:
  generate: false
  style: "educational"
---

> *"You know what’s great about Rust? It doesn’t let you compile crappy code. You know what sucks? Everything you write at the beginning **is** crappy code."*
> — Mr. Krabs, probably

What’s better than an AI agent? An AI agent *rewritten in Rust*.

If you’ve spent more than five minutes on the internet, you’re aware of the meme. It doesn’t matter what project—text editor, DNS server, BMI calculator. Someone will inevitably comment, "you should rewrite it in Rust." It’s the *Rewrite It In Rust* meme—RIIR for friends—and it’s as unavoidable as gravity.

Well, I’m actually doing it. I’m going to port 8,300 lines of a Python AI agent to Rust. But not just because the meme demands it (okay, maybe a little). I’m doing it because I need a guinea pig.

## The thesis

For weeks now, I’ve been writing about [*silent failures*](/posts/silent-failure-ai-makes-stuff-up-tests-everything-fine/), about the [five defenses against hallucinations](/posts/five-defenses-code-hallucinations/), about how an LLM can generate code that compiles, passes tests, and is still wrong. I even gave it a name: **adversarial development**. *Never trust, always verify.*

A lot of theory. Now it’s time to prove it.

I needed a project with three key traits: constrained scope (not a new app with ever-changing requirements), a clear source of truth (the Python code that already works), and enough complexity for the LLM’s hallucinations to have room to hide.

A pure port checks all three boxes: the input and expected output already exist. If the Rust version doesn’t behave exactly like the Python one, there’s a bug. Simple as that.
And since I’m going to port something, why not use it as an opportunity to properly learn Rust? The *borrow checker*, *ownership*, *lifetimes*... I’ve spent years reading all about it and touching none of it. Things would be different if I stopped reading tutorials for the 20th time and actually tackled a real project.

## The patient

It’s called [nanobot](https://github.com/HKUDS/nanobot). It’s a personal AI agent derived from OpenClaw: a nifty tool that links LLMs (Claude, GPT, DeepSeek, you name it) to chat channels—Telegram, Discord, Slack, email—and gives them hands. It can read/edit files, run commands, browse the web, schedule cron tasks, and maintain persistent memories between conversations.

It works. It’s been running fine. In Python.

What’s the problem? It’s *single-threaded*. One message at a time. Send it three messages back-to-back, and they queue up like a Saturday morning line at Walmart. It uses about 50MB of RAM to essentially shuffle JSON between APIs. And its error handling is the type you’re embarrassed about: `return f"Error: {str(e)}"` scattered all over.

To put it bluntly: it works, but it’s a giant hack. Perfect candidate.

## Why Rust (besides the meme)?

I could fix it in Python. I could dial up the `asyncio`, tighten up error handling with custom exceptions, and optimize memory. The sane option.

But sane doesn’t give me a *test bench* for adversarial development. Refactoring in Python lacks an external source of truth—the "before" and "after" would share language, libraries, and the LLM’s biases. A port to a different language? That’s different. If Rust’s output differs from Python’s for the same input, somebody’s lying. And that’s exactly the kind of verification I want to test.

Plus, Rust comes with properties that make the experiment more interesting:

- **The compiler as a first line of defense.** Nulls, type mismatches, data races—entire categories of bugs that might silently creep into Python won’t even compile in Rust. How many LLM hallucinations can the compiler block before they hit a test? I want to measure that.
- **True concurrency.** `tokio` allows one `spawn` per conversation. In Python, that’s a pain. This is the one functional improvement that really justifies the port.
- **Static binaries.** A 10MB executable instead of a `pip install` with 47 dependencies. That’s a win for distribution.
- **It’s cool.** Not technically a reason, but I don’t care.

## The adventure (and the invite)

RustyClaw—that’s the port’s name—is going to be a publicly documented experiment. Each module I port will be its own blog post. With real data: how many tokens used, cost, how often the AI hallucinated, and how long I fought with the *borrow checker*.

No sugarcoating. If I spend 3 hours on something I could have done in Python in 10 minutes, I’ll admit it. If the LLM invents a non-existent *crate* (spoiler: it will), I’ll detail it. If I realize at the end this port wasn’t worth it, I’ll confess to that too.

Everyone says, "I used AI to write code." No one publishes how much it cost, how often it lied to them, or if the code held up in production. That’s exactly what I’m going to do.

And I want you to come along for the ride. Because this is going to be an adventure—filled with compiler battles, "WHY WON’T THIS COMPILE, IT’S OBVIOUS" moments, and small victories when a differential test passes green.

It’s going to be fun. Or, at the very least, honest.

## The stack (cheat sheet)

If you’re a Pythonista, the left column will look familiar. If you’re a Rustacean, the right. If you’re neither, welcome to the chaos.
| Layer | Python (nanobot) | Rust (rustyclaw) |
|-------|------------------|------------------|
| Async runtime | `asyncio` | `tokio` |
| HTTP | `httpx` | `reqwest` |
| LLM routing | `litellm` | **Nonexistent** — custom router |
| Telegram | `python-telegram-bot` | `teloxide` |
| Discord | `websockets` (raw) | `tokio-tungstenite` (raw) |
| Config | `pydantic` | `serde` + `figment` |
| CLI | `typer` | `clap` |
| Errors | `str(e)` | `anyhow` + `thiserror` |
| Logging | `loguru` | `tracing` |
| AI copilot | — | Claude Code + Codex |
| Task runner | `make` | `just` |
| Issue tracker | — | `linear` CLI |

The row that hurts most is LiteLLM. In Python, it routes 100+ LLM providers in a single call. Nothing comes close in Rust, so I’ll need to roll my own router. The upside? About 80% of LLM providers conform to OpenAI’s API, so `async-openai` plus a custom base URL covers most use cases. Anthropic will need its own implementation. Around ~300 lines of Rust. Sounds manageable. *Sounds.*

## Anti-hallucination strategy (the serious bit)

This is where the adversarial development theory meets reality. An LLM assisting in a port this size is a machine for plausibly inventing things. The top risk isn’t that the code won’t compile—Rust doesn’t let garbage compile. The risk is that it compiles, passes tests, and silently does the wrong thing. Exactly the *silent failure* I wrote about two weeks ago.

Five layers of defense:

**1. Rust’s compiler.** Eliminates nulls, type mismatches, and data races. First free line of defense. But just because it compiles doesn’t make it right.

**2. Differential tests.** Same input → Python nanobot → output. Same input → RustyClaw → output. If they don’t match, something’s off. The Python code is the source of truth. This is the backbone of the experiment.

**3. Provenance tracking.** Each ported file gets a header with its original Python source, LLM session, and differential test results. Total traceability.

**4. Crate verification.** Every crate suggested by the LLM gets manually verified on crates.io and docs.rs. LLMs will confidently propose non-existent crates and APIs that just don’t work.

**5. Incident logging.** Every detected hallucination becomes an issue with a `hallucination` label. Material for posts and lessons learned.

The golden rule:

> **The verification system must be external to the generator.** If the LLM writes the code, the tests, and the fixtures, you’re validating fiction with fiction. Differential testing against the original Python code naturally breaks the cycle and makes the port inherently verifiable.

## *Does it matter?*

So, the uncomfortable question: does porting this to Rust even matter?

| Metric | Python | Rust (estimated) | Does it matter? |
|--------|--------|------------------|-----------------|
| Response latency | ~200ms overhead | ~5ms overhead | No. The LLM takes 2-5 seconds anyway. |
| RAM | ~50MB | ~5MB | No. My server has 8GB. |
| Concurrency | 1 message at a time | N messages in parallel | **Yes.** |
| Startup time | ~2s | ~50ms | Meh. |
| Binary | `pip install` + 47 deps | Single executable | **Yes.** |
| Type safety | `str(e)` everywhere | `Result<T, E>` | **Yes.** |
| The cool factor | None | High | Subjective. |

Three out of seven. Four, if we’re being generous. The latency and RAM improvements are meaningless since the bottleneck is always the LLM call. Concurrency matters for multiple users. A static binary is a real upgrade. And the type safety? After seeing how many bugs `str(e)` lets fly under the radar for months, yeah, that matters.

Does it justify weeks of work? As a standalone port, probably not. As a testbed for adversarial development with published real-world data? I think yes. By the end of this series, we’ll have hard numbers—not opinions.
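The differential-test loop (defense 2) fits in a tiny harness. This is a minimal sketch, not the project's real tooling: the binary names, the `--oneshot` flag, and the JSON-over-stdin protocol are all assumptions made up for illustration.

```python
import json
import subprocess

def run(cmd: list[str], payload: dict) -> str:
    """Feed the same JSON input to one implementation, capture its output."""
    proc = subprocess.run(
        cmd,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.strip()

def differential_check(py_cmd: list[str], rs_cmd: list[str], payload: dict) -> bool:
    """Python output is the source of truth; RustyClaw must match it exactly."""
    expected = run(py_cmd, payload)   # e.g. ["python", "-m", "nanobot", "--oneshot"]
    actual = run(rs_cmd, payload)     # e.g. ["./rustyclaw", "--oneshot"]
    return expected == actual
```

The point of keeping the commands as parameters is that the harness itself stays dumb: it knows nothing about either implementation, so it can't share their biases.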
## The raw numbers

Every work session will be logged in a public CSV in the repo:

```csv
date,llm,model,module,tokens_in,tokens_out,cost_usd,duration_min,loc_python,loc_rust,hallucinations,tests_pass
```

Which LLM I used, tokens consumed, cost, duration, lines ported, hallucinations detected, tests passed. It’ll all be public. All verifiable.

...
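Appending one session row to that CSV could look like the sketch below. The helper name and every value in the example are hypothetical; only the column header comes from the post.

```python
import csv
from pathlib import Path

# Columns taken verbatim from the public CSV header above.
FIELDS = [
    "date", "llm", "model", "module", "tokens_in", "tokens_out",
    "cost_usd", "duration_min", "loc_python", "loc_rust",
    "hallucinations", "tests_pass",
]

def log_session(path: str, row: dict) -> None:
    """Append one work-session row, writing the header on first use."""
    file = Path(path)
    is_new = not file.exists()
    with file.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

An append-only CSV in the repo means every number in the series is diffable in git history, which is the whole "all verifiable" point.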
---
title: "33,000 Lines of XML to Tell You heavyWork() Is Slow: How I Tamed xctrace for LLMs"
date: 2026-03-08T14:00:00+01:00
draft: false
slug: "ztrace-xctrace-compact-summary-llm"
description: "xctrace exports 33,000 lines of XML that overwhelm any LLM's context window. ztrace condenses it into 10 actionable lines. Here's how and why."
tags: ["xctrace", "instruments", "profiling", "llm", "claude-code", "python", "performance"]
categories: ["opinion"]
translation:
  hash: ""
  last_translated: ""
  notes: |
    - "dicho en cristiano": "in plain language". No religious connotation.
    - "domesticar": used metaphorically as "to tame" (a tool/output). Not literal.
    - "chapuza": "hack/bodge/kludge". Quick-and-dirty solution, not derogatory.
    - "paja": means "filler/fluff/noise" in this context. Do NOT translate as the vulgar meaning.
    - "ojo al dato": "here's the key point" / "pay attention to this".
---

Last week, I was profiling a Swift app using *Instruments*. Nothing unusual: `xctrace record`, `xctrace export`, copy the XML into Claude Code's context, ask it to find the hotspots.

And Claude says: "The XML is too large; I can't process it reliably."

33,553 lines of XML. For a program with two functions.

## The Real Problem

`xctrace export` is an excellent tool. It gives you **everything**: every *sample*, every *backtrace*, every frame with its binary, memory address, and UUID. It's exhaustive, precise, and complete.

And that's exactly the problem.

When profiling an app to find bottlenecks, I don’t need all 3,044 individual *samples*. I don’t need to know that *sample* number 1,847 caught the CPU at address `0x1027ec9a8` in `libswiftCore.dylib` at 00:02.847.882. I need to know that `heavyWork()` takes 70% of the time and `lightWork()` takes 30%.
In plain language: I need **ten lines**, not thirty-three thousand.

## Why XML Is the Right Format (but the Noise Isn't)

Before anyone says, "the problem is using XML in 2026": that's not it. XML is the perfect format for what xctrace does. Think about it:

- **Hierarchical**: A *backtrace* is a tree of frames. A *sample* contains a *backtrace*, a *thread*, a *process*. XML naturally models this.
- **Self-descriptive**: Every element has a name, typed attributes, and a validatable structure. You don’t have to guess what the 7th field in a CSV line represents.
- **Elegant deduplication**: xctrace uses an `id`/`ref` system where it defines a frame the first time (`id="59" name="heavyWork()"`) and then references it with `ref="59"`. It’s essentially a serialized *flyweight pattern*.
- **Processable with standard tools**: XPath, `xmllint`, `xml.etree.ElementTree`... no need for a proprietary parser.

The XML from xctrace is not *bloat*. It's structured information that *Instruments* needs to reconstruct interactive call trees, compare *runs*, and filter by thread or process. It's designed for a GUI tool that can expand and collapse nodes.

The problem arises when you try to feed that information into an LLM's context window. It’s like trying to read the entire text of *Don Quixote* just to find the windmills reference. The information is there, but the signal-to-noise ratio is brutal.

## The Solution: ztrace

So, I built `ztrace`. A Python script that takes a `.trace` bundle and produces a compact summary. Here’s the idea:

1. Run `xctrace export --toc` to get metadata (process, duration, template)
2. Run `xctrace export --xpath` to extract the `time-profile` table
3. Parse the XML, resolving the `id`/`ref` system
4. Filter system frames (anything living in `/usr/lib/` or `/System/`)
5. Aggregate by function and generate the summary

Pay attention: Step 3 is more important than it seems.
xctrace doesn’t repeat the full definition of a frame every time it appears in a *backtrace*. It defines it once with `id="59"` and then uses `ref="59"`. If you don’t resolve the *refs*, you lose most of the information.

## The Result

With my test fixture (a trivial program with `heavyWork()` at ~70% and `lightWork()` at ~30%):

```
$ ztrace summary sample.trace
```

...
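The `id`/`ref` resolution in step 3 can be sketched like this. The XML snippet and tag names are a deliberately simplified stand-in for xctrace's real schema, just to show the flyweight unfolding; ztrace's actual code handles more element types than this.

```python
import xml.etree.ElementTree as ET

def resolve_refs(root: ET.Element) -> None:
    """Copy attributes from each id="N" definition onto every ref="N" use."""
    # First pass: index every element that carries a definition.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    # Second pass: flesh out the references in place.
    for el in root.iter():
        ref = el.get("ref")
        if ref and ref in by_id:
            for key, value in by_id[ref].attrib.items():
                if key != "id":
                    el.set(key, value)

# Simplified stand-in for an xctrace export: one definition, one reference.
doc = ET.fromstring(
    '<trace>'
    '<frame id="59" name="heavyWork()"/>'
    '<frame ref="59"/>'
    '</trace>'
)
resolve_refs(doc)
# After resolution, the second <frame> also carries name="heavyWork()".
```

Without this pass, every `ref`-only frame would look like an anonymous stub, and the aggregation in step 5 would undercount every function that appears more than once.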