I Let 3 AI Agents Debate My Code — Here's What They Found
I pointed my multi-agent orchestration system at my own website and ran a debate-pattern code review. The results were genuinely useful — and a little humbling.
I’ve been building a multi-agent AI orchestration system. It lets me run tasks through multiple AI models — Claude, GPT-4, Gemini — using different collaboration patterns. Routing, parallel, sequential, debate, consensus, hierarchical. Each pattern solves a different problem.
I decided to point it at something I know intimately: this website.
The Setup
The orchestration system supports several patterns. For a code review, debate is the natural choice. It works like this:
- Agent A makes a proposal (initial review)
- Agent B critiques it (pushes back, finds gaps)
- Agent C synthesizes (resolves disagreements, ranks findings)
The idea is that a single reviewer misses things. Two reviewers catch more. Three reviewers who are actively disagreeing with each other catch the most.
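In code, the whole pattern is just three chained model calls. Here is a minimal sketch of the shape, where `callModel`, the types, and the prompts are placeholders rather than the system's real API:

```typescript
// Sketch of the debate pattern: propose, critique, synthesize.
// `callModel` stands in for the real per-model API calls.
type Agent = { name: string; persona: string };

async function callModel(agent: Agent, prompt: string): Promise<string> {
  throw new Error(`wire up ${agent.name}'s API here`); // placeholder
}

async function debateReview(
  [proposer, critic, synthesizer]: [Agent, Agent, Agent],
  context: string
): Promise<string> {
  // Round 1: initial review.
  const proposal = await callModel(
    proposer,
    `Review this codebase. Be specific, cite line numbers.\n\n${context}`
  );

  // Round 2: pushback. What did the first pass miss or overstate?
  const critique = await callModel(
    critic,
    `Critique this review. Find gaps and weak findings.\n\n${proposal}`
  );

  // Round 3: resolve disagreements and rank the surviving findings.
  return callModel(
    synthesizer,
    `Synthesize these into a ranked list of findings.\n\n${proposal}\n\n${critique}`
  );
}
```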
I assigned each agent a persona:
- Codex — pragmatic, implementation-focused. Bugs, performance, DRY violations, code smells.
- Gemini — broad-thinking, architecture-focused. Accessibility, SEO, design system consistency.
- Claude — security-minded, depth-focused. XSS vectors, edge cases, missing production hardening.
All three got the same codebase context. Same key files. Same instructions: be specific, cite line numbers, don’t hold back.
What Came Back
The debate ran for a few minutes. Each agent produced a detailed review, then they critiqued each other’s findings, and the system synthesized the results into a ranked list. The most interesting part was seeing where agents converged independently — the highest-confidence findings are the ones all three flagged without prompting from each other.
All three agents flagged:
- My OG image is an SVG. Most social platforms silently ignore SVGs for link previews. Every share to Twitter, LinkedIn, or Discord has been showing a blank card. I had no idea.
- 136 hardcoded hex color values across 23 files, despite having a full set of CSS custom properties defined in my design system. The tokens exist — nothing uses them.
- The same `.section-label` CSS block copy-pasted identically across four pages.
- A variable named `process` that shadows the Node.js global.
Two of three agents flagged:
- `preconnect` hints to Google Fonts when the site uses self-hosted @fontsource packages. Dead DNS lookups on every page load.
- The GlassPanel component’s `::before` pseudo-element defined twice with conflicting approaches — once in the component, once in global CSS.
- Three.js (~600KB) hydrating eagerly via `client:load` when `client:idle` would be fine.
The token discipline finding was the most damning. I built the design system. I defined `--color-cyan-glow`, `--font-mono`, `--color-text-muted`. And then I hardcoded `#00ffd5`, `'JetBrains Mono', monospace`, and `#777` everywhere anyway. Classic case of building infrastructure you don’t use.
The Security Angle
Agent Claude found things the other two didn’t look for:
- My newsletter subscribe endpoint has zero rate limiting. An attacker could burn my Buttondown API quota in seconds.
- No CSRF protection — HTML forms can POST cross-origin without triggering CORS preflight.
- No Content-Security-Policy headers at all. No `vercel.json` with security headers.
- The email validation on my subscribe form is `email.includes('@')`. That accepts `@` as a valid email (a stricter check is sketched after this list).
- `set:html` with `JSON.stringify` for JSON-LD structured data — if a blog post title ever contains `</script>`, it breaks out of the tag.
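Two of those fixes are small enough to sketch. Assuming the subscribe endpoint is a server-rendered Astro API route, a naive in-memory limiter plus a slightly stricter email check could look like this (the window, limit, and regex are placeholder choices, not what the agents prescribed):

```typescript
import type { APIRoute } from 'astro';

// Naive in-memory fixed-window limiter. Fine as a sketch for a single
// server instance; a real deployment would want a shared store.
const hits = new Map<string, { count: number; windowStart: number }>();
const WINDOW_MS = 60_000; // 1-minute window (placeholder)
const MAX_HITS = 5;       // max requests per window (placeholder)

function rateLimited(ip: string): boolean {
  const now = Date.now();
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now });
    return false;
  }
  entry.count += 1;
  return entry.count > MAX_HITS;
}

// Still not RFC 5322, but it rejects a bare "@" and other obvious junk.
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

export const POST: APIRoute = async ({ request, clientAddress }) => {
  if (rateLimited(clientAddress)) {
    return new Response('Too many requests', { status: 429 });
  }
  const email = (await request.formData()).get('email');
  if (typeof email !== 'string' || !EMAIL_RE.test(email)) {
    return new Response('Invalid email', { status: 400 });
  }
  // ...forward the address to Buttondown here...
  return new Response('Subscribed', { status: 200 });
};
```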
None of these are exotic attacks. They’re baseline production hardening that I skipped because I was focused on the fun parts (shaders, animations, design).
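The JSON-LD one is close to a one-line fix: escape `<` before interpolating, so a hostile title can’t close the script tag. Something like this (the helper name is mine):

```typescript
// Escape "<" so a title containing "</script>" can't break out of the
// JSON-LD <script> tag; "\u003c" still parses back to "<" as JSON.
const safeJsonLd = (data: unknown): string =>
  JSON.stringify(data).replace(/</g, '\\u003c');

// Astro usage (sketch):
// <script type="application/ld+json" set:html={safeJsonLd(schema)} />
```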
The Accessibility Angle
Agent Gemini caught several things I would have missed:
- `#777` text on `#000` background has a contrast ratio of ~4.0:1. WCAG AA requires 4.5:1. My muted text color — used everywhere — fails accessibility standards.
- The mobile navigation has active styling but no `aria-current="page"` attribute. The desktop nav has it. I just forgot the mobile version.
- Emoji icons on my services page (⚡, 🔧, 🧭) lack `role="img"` and `aria-label`. Screen readers announce the Unicode names — “high voltage sign” instead of something useful.
- If JavaScript fails to load, my homepage hero is a solid black rectangle with zero content. No `<noscript>` fallback.
What I Learned About Multi-Agent Reviews
Convergence is signal. When three agents independently flag the same issue, it’s almost certainly real and important. The OG image problem and the token discipline problem were both things I’d been vaguely aware of but hadn’t prioritized. Seeing three reviewers converge on them made it impossible to ignore.
Specialization surfaces different things. A generalist reviewer would probably catch the DRY violations and maybe the contrast issue. But the security findings (rate limiting, CSRF, CSP) and the deep accessibility findings (aria-current gap, emoji semantics) only came out because I gave each agent a specific lens. The agents weren’t actually smarter — they were looking at different things.
The debate pattern is ideal for reviews. Code reviews are inherently adversarial in a healthy way. You want someone to push back. The debate pattern formalizes this: propose, critique, synthesize. The back-and-forth between agents surfaced issues that a single pass never would have — each agent’s critique forced the others to dig deeper.
Agents find what you normalize. I’d been looking at `#00ffd5` hardcoded in my CSS for months without registering it as a problem. I built the tokens and then didn’t use them. The agents don’t have that blind spot because they don’t have familiarity. Everything is equally fresh.
The Fix List
In order of impact:
- Generate a PNG OG image (biggest visibility fix — every social share has been broken)
- Create `vercel.json` with security headers + rate limit the subscribe endpoint (a sketch follows this list)
- Replace 136 hardcoded hex values with CSS custom properties
- Extract shared section styles into global utilities
- Fix WCAG contrast on muted text colors
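For the headers item, the starting point is a `vercel.json` roughly like the sketch below. The values are placeholders, and the CSP in particular will need tuning before it ships, since the Three.js scene and any inline scripts need their own allowances:

```json
{
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "X-Content-Type-Options", "value": "nosniff" },
        { "key": "X-Frame-Options", "value": "DENY" },
        { "key": "Referrer-Policy", "value": "strict-origin-when-cross-origin" },
        { "key": "Content-Security-Policy", "value": "default-src 'self'" }
      ]
    }
  ]
}
```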
Not bad for a few minutes of compute time. I’m going to keep running these periodically — the codebase will keep evolving, and the agents will keep finding things I’ve normalized.
Next in this series: I turned the review system on itself · How the orchestration system actually works
Want me to point this system at your codebase? Let’s talk.