I Let 3 AI Agents Debate My Code — Here's What They Found
I pointed my multi-agent orchestration system at my own website and ran a debate-pattern code review. The results were genuinely useful — and a little humbling.
I’ve been building a multi-agent AI orchestration system. It lets me run tasks through multiple AI models — Claude, GPT-4, Gemini — using different collaboration patterns. Routing, parallel, sequential, debate, consensus, hierarchical. Each pattern solves a different problem.
I decided to point it at something I know intimately: this website.
The Setup
The orchestration system supports several patterns. For a code review, debate is the natural choice. It works like this:
- Agent A makes a proposal (initial review)
- Agent B critiques it (pushes back, finds gaps)
- Agent C synthesizes (resolves disagreements, ranks findings)
The idea is that a single reviewer misses things. Two reviewers catch more. Three reviewers who are actively disagreeing with each other catch the most.
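In code, the whole pattern is just three chained model calls. Here is a minimal sketch of the shape, where `callModel`, the types, and the prompts are placeholders rather than the system's real API:

```typescript
// Sketch of the debate pattern: propose, critique, synthesize.
// `callModel` stands in for the real per-model API calls.
type Agent = { name: string; persona: string };

async function callModel(agent: Agent, prompt: string): Promise<string> {
  throw new Error(`wire up ${agent.name}'s API here`); // placeholder
}

async function debateReview(
  [proposer, critic, synthesizer]: [Agent, Agent, Agent],
  context: string
): Promise<string> {
  // Round 1: initial review.
  const proposal = await callModel(
    proposer,
    `Review this codebase. Be specific, cite line numbers.\n\n${context}`
  );

  // Round 2: pushback. What did the first pass miss or overstate?
  const critique = await callModel(
    critic,
    `Critique this review. Find gaps and weak findings.\n\n${proposal}`
  );

  // Round 3: resolve disagreements and rank the surviving findings.
  return callModel(
    synthesizer,
    `Synthesize these into a ranked list of findings.\n\n${proposal}\n\n${critique}`
  );
}
```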
I assigned each agent a persona:
- Codex — pragmatic, implementation-focused. Bugs, performance, DRY violations, code smells.
- Gemini — broad-thinking, architecture-focused. Accessibility, SEO, design system consistency.
- Claude — security-minded, depth-focused. XSS vectors, edge cases, missing production hardening.
All three got the same codebase context. Same key files. Same instructions: be specific, cite line numbers, don’t hold back.
What Came Back
The debate ran for a few minutes. Each agent produced a detailed review, then they critiqued each other’s findings, and the system synthesized the results into a ranked list. The most interesting part was seeing where agents converged independently — the highest-confidence findings are the ones all three flagged without prompting from each other.
All three agents flagged:
- My OG image is an SVG. Most social platforms silently ignore SVGs for link previews. Every share to Twitter, LinkedIn, or Discord has been showing a blank card. I had no idea.
- 136 hardcoded hex color values across 23 files, despite having a full set of CSS custom properties defined in my design system. The tokens exist — nothing uses them.
- The same `.section-label` CSS block copy-pasted identically across four pages.
- A variable named `process` that shadows the Node.js global.
Two of three agents flagged:
- `preconnect` hints to Google Fonts when the site uses self-hosted @fontsource packages. Dead DNS lookups on every page load.
- The GlassPanel component’s `::before` pseudo-element defined twice with conflicting approaches — once in the component, once in global CSS.
- Three.js (~600KB) hydrating eagerly via `client:load` when `client:idle` would be fine.
The token discipline finding was the most damning. I built the design system. I defined `--color-cyan-glow`, `--font-mono`, `--color-text-muted`. And then I hardcoded `#00ffd5`, `'JetBrains Mono', monospace`, and `#777` everywhere anyway. Classic case of building infrastructure you don’t use.
The Security Angle
Agent Claude found things the other two didn’t look for:
- My newsletter subscribe endpoint has zero rate limiting. An attacker could burn my Buttondown API quota in seconds.
- No CSRF protection — HTML forms can POST cross-origin without triggering CORS preflight.
- No Content-Security-Policy headers at all. No `vercel.json` with security headers.
- The email validation on my subscribe form is `email.includes('@')`. That accepts `@` as a valid email (a stricter check is sketched after this list).
- `set:html` with `JSON.stringify` for JSON-LD structured data — if a blog post title ever contains `</script>`, it breaks out of the tag.
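Two of those fixes are small enough to sketch. Assuming the subscribe endpoint is a server-rendered Astro API route, a naive in-memory limiter plus a slightly stricter email check could look like this (the window, limit, and regex are placeholder choices, not what the agents prescribed):

```typescript
import type { APIRoute } from 'astro';

// Naive in-memory fixed-window limiter. Fine as a sketch for a single
// server instance; a real deployment would want a shared store.
const hits = new Map<string, { count: number; windowStart: number }>();
const WINDOW_MS = 60_000; // 1-minute window (placeholder)
const MAX_HITS = 5;       // max requests per window (placeholder)

function rateLimited(ip: string): boolean {
  const now = Date.now();
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now });
    return false;
  }
  entry.count += 1;
  return entry.count > MAX_HITS;
}

// Still not RFC 5322, but it rejects a bare "@" and other obvious junk.
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

export const POST: APIRoute = async ({ request, clientAddress }) => {
  if (rateLimited(clientAddress)) {
    return new Response('Too many requests', { status: 429 });
  }
  const email = (await request.formData()).get('email');
  if (typeof email !== 'string' || !EMAIL_RE.test(email)) {
    return new Response('Invalid email', { status: 400 });
  }
  // ...forward the address to Buttondown here...
  return new Response('Subscribed', { status: 200 });
};
```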
None of these are exotic attacks. They’re baseline production hardening that I skipped because I was focused on the fun parts (shaders, animations, design).
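The JSON-LD one is close to a one-line fix: escape `<` before interpolating, so a hostile title can’t close the script tag. Something like this (the helper name is mine):

```typescript
// Escape "<" so a title containing "</script>" can't break out of the
// JSON-LD <script> tag; "\u003c" still parses back to "<" as JSON.
const safeJsonLd = (data: unknown): string =>
  JSON.stringify(data).replace(/</g, '\\u003c');

// Astro usage (sketch):
// <script type="application/ld+json" set:html={safeJsonLd(schema)} />
```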
The Accessibility Angle
Agent Gemini caught several things I would have missed:
- `#777` text on `#000` background has a contrast ratio of ~4.0:1. WCAG AA requires 4.5:1. My muted text color — used everywhere — fails accessibility standards.
- The mobile navigation has active styling but no `aria-current="page"` attribute. The desktop nav has it. I just forgot the mobile version.
- Emoji icons on my services page (⚡, 🔧, 🧭) lack `role="img"` and `aria-label`. Screen readers announce the Unicode names — “high voltage sign” instead of something useful.
- If JavaScript fails to load, my homepage hero is a solid black rectangle with zero content. No `<noscript>` fallback.
What I Learned About Multi-Agent Reviews
Convergence is signal. When three agents independently flag the same issue, it’s almost certainly real and important. The OG image problem and the token discipline problem were both things I’d been vaguely aware of but hadn’t prioritized. Seeing three reviewers converge on them made it impossible to ignore.
Specialization surfaces different things. A generalist reviewer would probably catch the DRY violations and maybe the contrast issue. But the security findings (rate limiting, CSRF, CSP) and the deep accessibility findings (aria-current gap, emoji semantics) only came out because I gave each agent a specific lens. The agents weren’t actually smarter — they were looking at different things.
The debate pattern is ideal for reviews. Code reviews are inherently adversarial in a healthy way. You want someone to push back. The debate pattern formalizes this: propose, critique, synthesize. The back-and-forth between agents surfaced issues that a single pass never would have — each agent’s critique forced the others to dig deeper.
Agents find what you normalize. I’d been looking at `#00ffd5` hardcoded in my CSS for months without registering it as a problem. I built the tokens and then didn’t use them. The agents don’t have that blind spot because they don’t have familiarity. Everything is equally fresh.
The Fix List
In order of impact:
- Generate a PNG OG image (biggest visibility fix — every social share has been broken)
- Create `vercel.json` with security headers + rate limit the subscribe endpoint (a sketch follows this list)
- Replace 136 hardcoded hex values with CSS custom properties
- Extract shared section styles into global utilities
- Fix WCAG contrast on muted text colors
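For the headers item, the starting point is a `vercel.json` roughly like the sketch below. The values are placeholders, and the CSP in particular will need tuning before it ships, since the Three.js scene and any inline scripts need their own allowances:

```json
{
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "X-Content-Type-Options", "value": "nosniff" },
        { "key": "X-Frame-Options", "value": "DENY" },
        { "key": "Referrer-Policy", "value": "strict-origin-when-cross-origin" },
        { "key": "Content-Security-Policy", "value": "default-src 'self'" }
      ]
    }
  ]
}
```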
Not bad for a few minutes of compute time. I’m going to keep running these periodically — the codebase will keep evolving, and the agents will keep finding things I’ve normalized.
Next in this series: I turned the review system on itself · How the orchestration system actually works
Want me to point this system at your codebase? Let’s talk.