An agent loop that stops itself
Two honest questions hang over AI coding: can you trust what it writes, and what does it cost to find out? They turn out to be the same question. Three tools run multiple models, let them disagree, and cap the cost so reading the disagreement never costs more than the bug it catches.

Can you trust what AI writes? And what does it cost to find out? Those are the two honest questions about AI coding, and everyone’s loud about both. What I haven’t heard much of is that they’re the same question. You solve them with one design.
The bias case is the familiar one. A model is trained on a skewed slice of the world, the output inherits the skew, and you can’t fully trust it. All true. None of it means you stop, though. It means you stop leaning on one model alone. Run a few, let them check each other, and the thing one model is blind to is usually something another one catches.
And that fix is exactly what makes cost the second question. Two models cost more than one. Three agents talking to each other cost more than three working alone, and a loop nobody is watching costs whatever it feels like. So you’re solving two problems with one design: get enough disagreement to catch the errors, and stop that disagreement from running up a bill you never agreed to. Building the verification and capping it turn out to be the same job.
Three tools do this at different scales, including one I built myself.
opencode-fusion: a panel that argues, then a judge
opencode-fusion is Samir Patil’s (@sampatil1010) local take on OpenRouter’s Fusion (@OpenRouter): run one task across a panel of models at once, then resolve their disagreement instead of averaging it away. The default panel is Sonnet, GPT, and GLM, defined as a bash array you edit to add or drop models. They run in parallel through the OpenCode CLI, each output written to its own temp file. Failures get noted but don’t abort the run.
Then a judge model reads all the outputs and writes a structured report: where the models agree, where they contradict each other and which side is likely right, what every model missed, and what looks risky. A synthesizer takes the original task, the judge’s report, and the raw outputs, and produces the single answer you actually see.
You invoke the whole thing with one line:
bash run_fusion.sh "<TASK>"
The parallelism is the boring part. What you’re actually buying is the disagreement. When two models land on the same answer and a third wanders off somewhere else, that gap tells you something a single confident response never would. The judge does the reading. You get one answer instead of three transcripts.
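The panel stage described above can be sketched in a few lines. This is not opencode-fusion's code (that is a bash script driving the OpenCode CLI); it is a minimal Python illustration of the same shape, assuming each model is wrapped in a plain function. The names `run_panel` and `judge_prompt`, and the wording of the judge instructions, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical stand-in for a real model call: takes the task, returns text.
Model = Callable[[str], str]


def run_panel(task: str, panel: dict[str, Model]) -> tuple[dict[str, str], dict[str, str]]:
    """Run every panel member on the same task in parallel.

    Returns (outputs, failures). A failing member is noted, not fatal.
    """
    outputs: dict[str, str] = {}
    failures: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(panel))) as pool:
        futures = {name: pool.submit(fn, task) for name, fn in panel.items()}
        for name, future in futures.items():
            try:
                outputs[name] = future.result()
            except Exception as exc:  # record the failure and keep going
                failures[name] = repr(exc)
    return outputs, failures


def judge_prompt(task: str, outputs: dict[str, str]) -> str:
    """Build the text a judge model would read: the task plus every labeled output."""
    parts = [f"Task:\n{task}\n"]
    for name, text in outputs.items():
        parts.append(f"--- Output from {name} ---\n{text}\n")
    parts.append(
        "Report where the outputs agree, where they contradict each other "
        "and which side is likely right, what every output missed, "
        "and what looks risky."
    )
    return "\n".join(parts)
```

The judge's report, the original task, and the raw outputs would then go to the synthesizer. That step is one more model call with the same kind of assembled prompt, so it is left out here.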
agmsg: agents that message each other
opencode-fusion is one-shot. Ask, the panel answers, done. Sometimes you want the models to go back and forth instead. That’s agmsg, by @fujibee.
agmsg lets CLI agents message each other through a shared SQLite file. No daemon, no network, just bash and sqlite3, installed as an agent skill. It’s cross-vendor, so Claude Code, Codex, Gemini CLI, and Copilot CLI can join the same team and pass messages. And the messages persist, which is what separates it from the alternatives. A built-in subagent is ephemeral and locked to its parent’s vendor. MCP is one agent reaching for tools. agmsg is a different shape: independent agents, different models, talking to each other over a durable channel.
Install is a one-liner:
bash <(curl -fsSL https://raw.githubusercontent.com/fujibee/agmsg/main/setup.sh)
After a restart the agents get a command (/agmsg in Claude Code and Copilot CLI, $agmsg in Codex and Gemini CLI), pick a team and a name, and start talking.
One thing to know before you wire it into anything: agmsg is deliberately dumb. It moves messages and nothing more, so two over-polite agents will happily clarify at each other until your token bill looks like a phone number. Whatever ends that loop has to come from you.
What I built
I wired agmsg into a small bridge that watches the shared inbox and delivers messages between three agents automatically. No human relaying anything between them.
The bridge does one job. It polls the SQLite inbox every four seconds with a SELECT for unread rows, then hands each one to its target agent. No file watcher, nothing clever, just a loop and a query.
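Here is roughly what "a loop and a query" looks like, as a sketch and not the bridge's actual code. The table and column names are invented for illustration, since agmsg's real schema may differ, and marking a row as read after delivery is my assumption about how a row stops being picked up again.

```python
import sqlite3
import time
from typing import Callable

# Hypothetical schema for illustration; agmsg's real tables may look different.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id        INTEGER PRIMARY KEY,
    sender    TEXT NOT NULL,
    recipient TEXT NOT NULL,
    body      TEXT NOT NULL,
    read      INTEGER NOT NULL DEFAULT 0
)
"""

Deliver = Callable[[str, str, str], None]  # (sender, recipient, body)


def poll_once(conn: sqlite3.Connection, deliver: Deliver) -> int:
    """Fetch unread rows, hand each to `deliver`, mark it read. Returns the count."""
    rows = conn.execute(
        "SELECT id, sender, recipient, body FROM messages "
        "WHERE read = 0 ORDER BY id"
    ).fetchall()
    for msg_id, sender, recipient, body in rows:
        deliver(sender, recipient, body)
        # Assumption: a row is marked read only after delivery returns.
        conn.execute("UPDATE messages SET read = 1 WHERE id = ?", (msg_id,))
    conn.commit()
    return len(rows)


def run_bridge(conn: sqlite3.Connection, deliver: Deliver, interval: float = 4.0) -> None:
    """Poll forever at a fixed interval. No file watcher, nothing clever."""
    while True:
        poll_once(conn, deliver)
        time.sleep(interval)
```

Keeping `poll_once` separate from the infinite loop makes the query testable against an in-memory database without waiting on the four-second sleep.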
The three agents sit on one team, and no two of them run the same way. One runs through a CLI with its model set in a local config file, one runs as a one-shot command, and one answers over an HTTP API. Each is pinned to a different model, and that model lives in the agent’s own config, not in the bridge. The bridge doesn’t know or care which model is on the other end of a message, and that’s the point: I can swap any agent’s model without touching the thing that moves the messages.
The reason for three different models was never that agents can talk to each other. It was that when one of them got something wrong, a model trained on different data was already sitting there to catch it.
That only holds as far as the models are actually independent. Two trained on mostly the same text share the same blind spots, and when they do they’ll agree on a wrong answer as fast as a right one, so the disagreement I’m counting on never shows up. Spanning vendors (gpt-4.1, Sonnet 4.6, and one of my own) buys more independence than three versions of one model would, but it’s a hedge, not a guarantee.
The bridge itself is small, a poll loop and the set of guards below. I’m cleaning it up to open-source it soon.
Keeping it from eating itself
agmsg won’t stop two agents from clarifying at each other forever. A runaway loop here burns real money, fast, so the containment is all on the bridge. Four guards, each for a different way it goes wrong.
Two sliding-window counters, both over a 300-second window. One caps total dispatches at 12 per window. The other caps bot-to-bot hops at 6 per window, since that’s the exchange that runs away: two agents answering each other with no human around to get bored and stop it. When either limit trips, the message is dropped and logged, never retried. A dropped message is cheaper than a loop.
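A sketch of the two counters, using the numbers above (12 dispatches and 6 bot-to-bot hops per 300 seconds). The class names are hypothetical. Two details are my assumptions rather than anything the bridge is known to do: a dropped message does not count toward either window, and "bot-to-bot" means both sender and recipient are known agents.

```python
import logging
import time
from collections import deque

log = logging.getLogger("bridge")


class SlidingWindowLimit:
    """Allow at most `limit` events in any trailing `window_s` seconds."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.events = deque()  # timestamps of recorded events, oldest first

    def has_room(self) -> bool:
        now = self.clock()
        # Forget events that have aged out of the window.
        while self.events and now - self.events[0] >= self.window_s:
            self.events.popleft()
        return len(self.events) < self.limit

    def record(self) -> None:
        self.events.append(self.clock())


class DispatchGuard:
    """One cap on all dispatches, a tighter cap on agent-to-agent hops."""

    def __init__(self, agents, clock=time.monotonic):
        self.agents = set(agents)
        self.total = SlidingWindowLimit(12, 300, clock)
        self.bot_hops = SlidingWindowLimit(6, 300, clock)

    def allow(self, sender: str, recipient: str) -> bool:
        bot_to_bot = sender in self.agents and recipient in self.agents
        if not self.total.has_room() or (bot_to_bot and not self.bot_hops.has_room()):
            # Dropped and logged, never retried.
            log.warning("dropped %s -> %s: rate cap hit", sender, recipient)
            return False
        self.total.record()
        if bot_to_bot:
            self.bot_hops.record()
        return True
```

Passing the clock in as a parameter is only there so the window can be tested without waiting five minutes.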
Then a per-agent in-flight lock, so each agent handles one turn at a time and can’t be stacked with new work while it’s still thinking. Last, a wall-clock kill per agent, set to how slow each one is allowed to be: 280, 150, and 120 seconds. A hung turn dies on the clock instead of holding the lock forever.
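The lock and the kill fit together naturally for an agent that runs as a one-shot command. The sketch below assumes that case only; the CLI and HTTP agents would need their own timeout mechanism. `AgentRunner`, the status strings, and the echo command are hypothetical, and reporting "busy" back to the caller instead of queueing is my choice for the example.

```python
import subprocess
import sys
import threading


class AgentRunner:
    """Run one agent turn at a time, and kill any turn that outlives its clock."""

    def __init__(self, name: str, command: list, timeout_s: float):
        self.name = name
        self.command = command      # argv prefix; the message is appended
        self.timeout_s = timeout_s  # wall-clock limit for a single turn
        self._in_flight = threading.Lock()

    def run_turn(self, message: str):
        """Return (status, reply) where status is "ok", "busy", or "timeout"."""
        # Non-blocking acquire: a busy agent is never stacked with new work.
        if not self._in_flight.acquire(blocking=False):
            return ("busy", None)
        try:
            # subprocess.run kills the child if the timeout expires.
            done = subprocess.run(
                self.command + [message],
                capture_output=True,
                text=True,
                timeout=self.timeout_s,
            )
            return ("ok", done.stdout)
        except subprocess.TimeoutExpired:
            return ("timeout", None)
        finally:
            # Released on every path, so a hung turn cannot hold the lock forever.
            self._in_flight.release()


# Hypothetical agent: a child Python process that upper-cases its argument.
ECHO_AGENT = AgentRunner(
    "echo",
    [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"],
    timeout_s=120,
)
```

Calling `ECHO_AGENT.run_turn("hello")` returns an "ok" status with the child's output. The lock only matters once turns are dispatched from more than one thread, which is the situation the guard is meant for.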
None of the four is a token budget. The counters and the clocks bound the cost indirectly, and they’re easier to reason about than a dollar figure I’d have to keep recalibrating.
/orchestrate: the same idea with a UI
If you don’t want to build the plumbing yourself, the GitHub Copilot app does a version of this at a higher level. /orchestrate turns the main session into a coordinator: it spins up child sessions, gives each one its own branch and task, and lets them report back while the main session steers and summarizes. One session, one branch, one PR. Each runs in its own git worktree, so they aren’t editing the same files out from under each other, and you pick the model per session.
Same idea as the bridge, just managed for you: several agents, different models, kept from stepping on each other. You give up the fine control you’d get from building it yourself, and in return you don’t have to maintain any of it.
None of these tools pretend any single model is right. They assume it’ll be wrong sometimes, build the check into the workflow, and keep the check cheap.
That’s the whole move. “The model is biased” and “it’s too expensive to run” get treated as reasons to wait. They’re not. They’re two sides of one design problem, and the fix reads the same from either side: run more than one model, let them disagree where they were going to disagree anyway, and put something in the path that reads the disagreement so you don’t have to. The cap is what keeps it honest. Reading the disagreement should never cost more than the bug it catches.
A note before you go build this. Everything here is what I’ve used, shared as is. It’s not vetted for your setup, and I can’t vouch for the security of tools I didn’t write. Read the code, test it somewhere safe, and decide for yourself. Useful to me isn’t the same as safe for you. That part’s on you to check.
Originally published on X.
About the Author: Andrea Griffiths is a Senior Developer Advocate at GitHub, where she helps engineering teams adopt and scale developer technologies. She's passionate about making technical concepts accessible to both humans and AI agents. Connect with her on LinkedIn, GitHub, or Twitter/X.