The Model Isn't the Agent: The Harness Is

The model is the least interesting part of this system.

That sounds backwards if you’ve spent the last year watching benchmark screenshots and people arguing about which model is 3% better. But after building this thing, I care a lot less about the raw model and a lot more about the harness wrapped around it.

The first two posts in this series were about what the system found when I pointed it at my website and then at itself. This one is about the machinery that made those reviews possible.

The Harness Is the Agent

A model by itself is not an agent.

It’s a text generator with no tools, no memory, no boundaries, no cleanup, and no idea what to do when something goes sideways. The harness is what turns it into something useful. Anthropic wrote a whole article about this — the harness manages the loop, tool execution, context engineering, memory, and error handling. Two products using the same underlying model can produce wildly different results depending entirely on the quality of their harness.

My orchestration system is a harness that wraps multiple harnesses. Each agent adapter — Codex, Gemini, Claude Code — is itself a harness around a model. My system orchestrates the orchestrators. It’s harnesses all the way down.

In practice that means the loop around the model. How it gets context. Which tools it can call. How subprocesses are launched. How timeouts work. What happens after a 429. How outputs are truncated before they poison the next step. Whether errors are logged or silently buried.

Same model, different harness, wildly different result.

The design is simple on paper. Each agent sits behind an adapter with the same interface: execute() and health_check(). Codex runs through a subprocess. Gemini uses the async SDK. Claude can run as API or Claude Code. Then the patterns compose those agents in different ways.

That boring layer is where almost all the real work is.

Six Patterns, Six Failure Modes

I ended up with six orchestration patterns because different tasks break in different ways.

Routing picks the agent based on the task. Fast, cheap, and honestly the default most of the time.
Parallel asks multiple agents at once. Good when I want different angles quickly, or when one agent is fast and another is thorough.
Sequential creates a pipeline. Agent A writes, Agent B critiques, Agent C refines. Useful, but dangerous — bad output gets promoted downstream unless you put guardrails around it.
Debate is my favorite for reviews. One agent proposes, another attacks it, a third synthesizes. Healthy disagreement is the whole point.
Consensus asks agents to solve independently, then hands the answers to a judge. More expensive than debate, but great for trade-off heavy decisions.
Hierarchical breaks a big task into subtasks, fans them out, and recombines the results. Also the easiest way to burn a pile of tokens doing the wrong thing with confidence.

This is the part people over-romanticize.

Multi-agent workflows do not make sense for most development work. Most tasks want routing. Some want parallel. A few want debate. Hierarchical decomposition sounds cool right up until a planner invents five mediocre subtasks and all your workers sprint in the wrong direction.

If your single-agent task is vague, your multi-agent version is just vague six times.

MCP Was the Right Start, Then the Wrong Runtime

I started this project as an MCP server.

That was the obvious move. MCP solves a real problem: I don’t want custom glue for every tool and every client. One protocol is cleaner than a mess of one-off integrations. So I built the orchestration engine as an MCP tool that Claude Code could call directly.

For light patterns, it worked well. Routing was fine. Small parallel jobs were fine. Sequential was usually fine.

Then I tried the heavier stuff.

Debate, consensus, and big parallel runs regularly took 2 to 5 minutes. MCP did not love that. In practice, the orchestration server would time out somewhere around the 60 to 120 second mark. Which means the exact patterns that benefit most from multiple agents were the ones least suited to the protocol I started with.

So I split the system in two.

MCP stayed as the thin access layer for lightweight tasks. Then I built a proper CLI for the long-running jobs. Same core engine, same patterns, same agents, but without trying to squeeze a multi-step debate through a tool call that wants a quick answer.

That fixed the timeout problem. It also made the system more annoying to use.

Typing long CLI commands inside Claude Code gets old fast. Pattern, agents, model overrides, context files, project dir, format flags. All technically fine. All friction.

So the third version was a Claude Code skill. Now I type /orch @debate codex gemini: review this diff instead of a full Python incantation. The skill handles the ugly part: which jobs can stay native, which need the CLI, which agent combination makes sense, when to avoid MCP entirely.

I did not plan that evolution upfront.

It went MCP server, then CLI, then skill because each layer exposed the weakness of the previous one. That felt a lot more honest than pretending I had some elegant roadmap from day one.

The Bugs Were Embarrassingly Real

The most useful bugs in this system were not abstract “AI alignment” bugs. They were normal software bugs. Boring bugs. The kind you get when you’re tired and a happy-path test passes.

Zombie subprocesses. Codex and Claude Code both run as subprocesses. On timeout, I caught TimeoutError, returned a failure, and moved on. I forgot to kill the child process. So timed-out jobs kept running in the background. Quietly. For weeks. I had basically built a tiny graveyard of stranded PIDs because I handled the exception and forgot the cleanup.

Throttle bypass in hierarchical. I built a ThrottleManager because rate limits are real and AI APIs love throwing 429s at the worst possible moment. Every pattern used it except hierarchical. Of course it was hierarchical — the one pattern that fans out the most subtasks, the one most likely to hammer an agent, was the only pattern bypassing the throttle layer completely.

Silent Gemini failures. The Gemini adapter had a catch-all exception handler that returned an empty string with success=False and logged nothing. Not the exception class. Not the message. Nothing. For a while I thought Gemini was just being inconsistent. Turns out it was failing correctly into a black hole. Silent failure is worse than a loud crash because it makes you doubt your inputs instead of your code.

Cross-agent prompt injection. In sequential and debate flows, Agent A’s output becomes Agent B’s input with zero sanitization. Zero boundary. That’s less “collaboration” and more “prompt injection relay race.”

Again — harness problems. Not model problems.

Agents Need Shift Notes

One thing becomes obvious the second you actually use multiple agents in a real codebase: they do not remember anything unless you make them.

Every run starts cold.

It’s like staffing a software project with engineers working in shifts, except each new engineer wakes up with amnesia and full confidence. If you don’t leave notes, they repeat work, step on old decisions, and reopen bugs you already accepted as trade-offs.

So I built a handoff protocol. It lives in a boring little .claude/ directory at the workspace root, and it does more real work than half the fancy orchestration code.

handoff.md is the active queue and the “do not be clever here” document.
changelog.md is session history. One entry per session, not a stream of tiny diary entries.
known-issues.md is exactly what it sounds like. Open bugs and accepted caveats only.
codex-prompt-template.md is the task scaffold so Codex gets the same structure every time.

This is not glamorous. It is also the difference between “multi-agent system” and “three models randomly touching the same repo.”

I trust file-based handoffs more than I trust giant conversational context windows. Files are inspectable. Files can be pruned. Files do not hallucinate what was decided last Tuesday.

What Actually Matters

The flashy part of this project is the debate pattern. Or the consensus pattern. Or watching three different models argue about my code.

The real part is the harness. It’s timeout handling. Cleanup. Context selection. Rate limiting. Output bounds. Handoff notes. Knowing when not to use multi-agent orchestration at all.

That last part matters most.

If you have not already learned how to give one agent a tight, well-scoped task, adding five more agents will not save you. It will just waste more compute with better formatting.

The first two posts in this series were about the findings. This post is about the plumbing. The plumbing is less fun to demo, but it’s the whole reason the interesting part works at all.

The model is not the agent. The harness is.

Previous posts: AI agents reviewed my website · The system reviewed itself · See the project page

Want to put multi-agent systems to work — code, design, or product? Let’s talk.

The Harness Is the Agent

Six Patterns, Six Failure Modes

MCP Was the Right Start, Then the Wrong Runtime

The Bugs Were Embarrassingly Real

Agents Need Shift Notes

What Actually Matters

Keep Reading

I Turned My AI Review System on Itself — It Found 16 Issues

I Let 3 AI Agents Debate My Code — Here's What They Found

Stay in the loop