EssayJuly 2026·9 min read

They built the wrong half.

The whole market converged on multi-agent AI in eight days. All of it blends models to raise a score. Science needs the opposite primitive.

Houston Golden · Hubify · 9 min read

Eight days in June

Between June 22 and June 30, three of the most credible teams in AI shipped multi-agent orchestration:

June 22 — Sakana Fugu

A trained conductor model — Thinker, Worker, Verifier — that coordinates frontier models and packages the whole ensemble as a single model. Cooperative synthesis: plan, execute, check, blend. Closed conductor, benchmark-parity claims against the very best.

June 26 — Hermes Mixture of Agents 2.0

Combine any provider's models into a preset that behaves like one virtual model. Run them in parallel on the same query, synthesize through an aggregator. Their own benchmark jumped from 0.76 (best single model) to 0.82 (the mixture). Genuinely vendor-agnostic — and genuinely cooperative.

June 30 — Anthropic Claude Science

A coordinating agent that spawns specialists, 60+ scientific database connectors, SSH/HPC execution, auditable artifact provenance. The best research substrate anyone has shipped. Multi-agent to its core — and single-vendor by definition.

I watched this happen with a strange double feeling. Validation, first: I’ve been running multi-agent, multi-model research loops by hand for most of a year, and the three most serious teams in the industry just declared that’s the future. And then something closer to disbelief — because all three of them built the same half of the idea. The half that was never the point.

Cooperative aggregation raises scores

Fugu, Hermes MoA, and Claude Science are architecturally different and philosophically identical: run several models, then blend— synthesize, aggregate, reconcile — into one better answer. Call it cooperative aggregation. It works. The benchmark numbers are real. If your problem is “the answer isn’t good enough,” a mixture of agents is a real solution.

But that was never my problem. My problem — everyone’s problem, the moment you use these models for actual research — is that the answer is confidently, plausibly, catastrophically wrongsome fraction of the time, and you can’t tell which fraction. In software a false positive is a failing test. In science it’s a retraction, a wasted year, wrong physics propagating through citations. The catastrophic failure isn’t a lost benchmark point. It’s the false positive that ships.

The other primitive

So for the past year, on a six-paper cosmology program called BigBounce, I’ve been running the mirror image. Not blend-and-agree — refute. Every substantive claim gets reviewed by five models from five different labs — Anthropic, OpenAI, Google, xAI, Perplexity — each with a distinct role and a prompt that says, in effect: assume this is wrong and prove it. They never see each other’s verdicts. A separate auditor, never told the emerging conclusion, checks the review itself for steering and self-favoring. Verdicts are structured, source-cited, and provenance-linked. Disagreement becomes a tracked fix, not a footnote.

Objective

A mixture optimizes the score. An adversarial loop optimizes catching the false positive before it ships — the hallucinated derivation, the fabricated pass, the conveniently headlined number.

Relationship between models

A mixture blends toward agreement. An adversarial loop rewards a reviewer for breaking the claim. Disagreement isn't noise to average away; it's the signal you paid for.

Why cross-vendor is mandatory

Two models from the same lab share training priors — they hallucinate in the same direction, so their agreement is cheap. Reviewers from different labs decorrelate exactly the errors that get papers retracted. Same-vendor multi-agent cannot buy this at any price.

The bias guard

The loop reviews itself: a separate auditor, never told the emerging verdict, checks whether the review held a consistent bar or was steered. Ours has returned 'engineered' before. That sting is the feature.

This isn’t theoretical. The loop has caught things I would have shipped. A dark-energy significance that was quietly inflated by double-counting supernovae shared between two catalogs. A “catalog-grade” label on an anomaly tier that had failed its own injection-recovery test. A derivation that broke rotational invariance in a way every cooperative pass had glossed over. Each one was found because a reviewer from a different lab, with different priors, was paid to disagree — and once, the integrity audit caught my own review prompts leaning favorable, which is the most uncomfortable and most valuable thing the system has ever done.

Falsification is the scientific method. A cross-vendor refutation loop is just the scientific method, rendered computational. Cooperative aggregation is the opposite instinct — it manufactures consensus.

Why no lab can build this

Here’s the structural part, and it’s the part I’d want to know if I were betting on any of this: no frontier lab can offer vendor-agnostic adversarial review without cannibalizing itself.Anthropic will never route your claim to Gemini and GPT to referee its own model’s work — that’s not a missing feature, it’s a conflict of interest baked into the business model. Sakana’s conductor exists to route arounda lab, not to referee one. Hermes is the closest in spirit — genuinely any-provider — but it’s a developer tool doing cooperative aggregation, with no verdicts, no integrity gate, no science discipline.

Only an independent layer can be the referee. That’s not a slogan; it’s an org-chart fact.

Substrate, not competitor

Which is why I’m genuinely glad Claude Science exists. It’s the best execution substrate anyone has shipped for this work — the connectors, the HPC execution, the artifact provenance are all things I used to build by hand. We’re moving BigBounce’s compute onto it where it fits. And the review layer stays outside it, on top of it, exactly where it has to live: independent of any single vendor, because that independence is the entire point.

That’s what Hubify is. Bring any agent — Claude Science, Claude Code, Codex, your own — and Hubify decides whether the result is true: cross-vendor, refute-first, integrity-audited, with the verdicts and their provenance public on your lab’s site. The market spent June proving that multi-agent orchestration works. The other half — the half that makes it trustworthy — is the half we’ve been building all along.