- OpenRouter’s new Fusion API runs a prompt across a panel of models in parallel, then has a judge model synthesize their outputs into a single answer
- On Perplexity’s DRACO deep research benchmark, a budget panel run through Fusion scored 64.7%, beating solo GPT-5.5 (60.0%) and solo Claude Opus 4.8 (58.8%) at roughly half the cost of the top configuration
- Fusing Claude Opus 4.8 with itself still improved its score from 58.8% to 65.5%, showing that synthesis itself – not just model diversity – drives a meaningful part of the gain
OpenRouter released the OpenRouter Fusion API on June 12, 2026.
It’s a new way to call multiple AI models in a single request and get back one answer built from all of them. Instead of picking one model and hoping it fits the task, Fusion sends your prompt to a panel of models at the same time.
Each model in the panel gets web search and web fetch access. A judge model then reads every response and flags where the models agree, where they contradict each other, and what any single model missed.
The result: a panel of budget models, routed through Fusion, can match or beat individual frontier models on complex research tasks. Often at a fraction of the cost.
Why the OpenRouter Fusion API Matters for LLM Builders
Most teams building on large language models pick one model and live with its blind spots.
A model that’s strong at coding might be weak at multi-step research. A fast, cheap model might miss a source a slower model would catch. Fusion treats this as a solvable problem instead of a tradeoff you accept by default.
This matters most where being wrong is expensive:
- Financial research and due diligence
- Technical or legal summarization
- Medical information synthesis
- Agentic workflows where one missed source breaks the next step downstream
The logic echoes ensemble methods in traditional machine learning, where several weaker models combined often outperform one strong model running alone. We covered a related idea in our breakdown of agentic loop patterns, from ReAct to loop engineering: structured, repeated passes over a problem tend to beat a single shot at it, even using the same underlying model.
How the OpenRouter Fusion API Actually Works
The pipeline behind Fusion breaks into three steps.
Step 1: Parallel dispatch. Your prompt goes out to a panel of models at the same time, each with web search and web fetch tools enabled.
Step 2: Judged synthesis. A judge model reads every panel response and produces structured analysis: consensus points, contradictions, partial coverage, unique insights, and blind spots.
Step 3: Grounded final answer. The calling model writes the final response, grounded in that analysis rather than in a single model’s raw output.
The whole process runs server-side. From the developer’s side, calling Fusion looks like calling one model:
You can also customize which models sit on the panel and which one acts as judge:
That flexibility matters for teams running their own evals or agent pipelines, where the right panel composition depends heavily on the task. Anyone building systems that route between models will recognize the underlying shape of it – it’s the same orchestration logic we walked through when comparing Claude Code’s /goal command against Codex: decision-making sitting above individual model calls, deciding which model handles which part of the job.
The Benchmark: DRACO and Why OpenRouter Chose It
OpenRouter tested Fusion against DRACO, a benchmark built by Perplexity AI.
DRACO is designed to test deep research capability specifically – not factual recall, not reasoning puzzles. It covers 100 tasks across 10 domains:
- Academic research
- Finance
- Law
- Medicine
- Technology
- UX design
- General knowledge
- Needle-in-a-haystack retrieval
- Personalized assistance
- Product comparison
Each task is graded against roughly 39 weighted criteria, split into four categories: factual accuracy, breadth and depth of synthesis, presentation quality, and citation quality.
Some criteria carry negative weights. A verbose, confident-sounding answer that states something false gets penalized rather than rewarded for length. That detail matters, because it’s exactly the failure mode most single-model research tools fall into – sounding thorough without actually being accurate.
The Numbers Behind the OpenRouter Fusion API Results
Here’s where the benchmark results get specific.
Fable 5 fused with GPT-5.5 scored 69.0%, ahead of every individual model tested, including Fable 5 running solo at 65.3%.
A budget panel – Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro – scored 64.7% through the same pipeline. That’s within one percentage point of Fable 5 solo, at roughly half the cost.
Solo model scores ranged widely:
| Type | Configuration | Score |
|---|---|---|
| Fusion | Fable 5 + GPT-5.5 | 69.0% |
| Fusion | Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro | 68.3% |
| Fusion | Opus 4.8 + GPT-5.5 | 67.6% |
| Fusion | Opus 4.8 + Opus 4.8 | 65.5% |
| Solo | Claude Fable 5 | 65.3% |
| Fusion | Gemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro | 64.7% |
| Solo | DeepSeek V4 Pro | 60.3% |
| Solo | GPT-5.5 | 60.0% |
| Solo | Claude Opus 4.8 | 58.8% |
| Solo | Kimi K2.6 | 53.7% |
| Solo | Gemini 3.1 Pro | 45.4% |
| Solo | Gemini 3 Flash | 43.1% |
The most interesting result isn’t even about combining different models.
OpenRouter ran Claude Opus 4.8 paired with itself as a two-model panel, with Opus 4.8 also serving as judge. That configuration scored 65.5% – a 6.7-point jump over solo Opus 4.8.
Running the same prompt twice produces different reasoning paths, different tool calls, and different source selections. Which means a meaningful chunk of Fusion’s lift comes from the synthesis step itself, not purely from model diversity.
This kind of comparative testing across model families is the same approach we used when testing Kimi K2.6 against Claude Sonnet 4.6 on real developer tasks. Benchmark scores only tell part of the story until you see how models perform on work that resembles what you’ll actually ask of them.
It’s also worth reading alongside our coverage of Claude Fable 5’s own benchmarks and system card findings, since Fable 5 is the strongest solo model in OpenRouter’s own results table.
A Real Contamination Problem OpenRouter Had to Solve
One detail in OpenRouter’s writeup is worth flagging for anyone running their own evals.
When panel models were given web search, they started finding the DRACO grading rubric online during testing. Not through intentional gaming – search terms happened to surface pages discussing the benchmark itself.
OpenRouter fixed this by excluding the locations hosting the benchmark results from web search and web fetch. The same mechanism is available to anyone running evals through Fusion or any other tool-enabled pipeline:
- Pass excluded_domains to web_search
- Pass blocked_domains to web_fetch
Both keep a panel from finding pages related to your own test rubric.
This is a good reminder that contamination risk doesn’t only come from training data. A model with live web access can stumble into the same problem at inference time – a risk worth keeping in mind for any team building retrieval-heavy agents, something we got into in our breakdown of agent skills versus tools.
What This Means for Practitioners
If your stack depends on research quality over raw latency, Fusion is worth testing against whatever single-model setup you’re currently running.
A few practical starting points:
- Test it on your own task distribution first. DRACO is a strong proxy for deep research, but it evaluates text-only, English-only interactions, and your use case may differ.
- Try fusing a model with itself before paying for a multi-model panel. Since a chunk of the lift comes from synthesis rather than diversity, this is the cheapest way to see if Fusion helps your specific workload.
- Budget panels are worth a serious look if cost is a constraint. Landing within 1% of a frontier model’s score at half the cost changes the economics for high-volume research or support tooling.
- Apply domain exclusion if you’re running your own evals with web-enabled models. Contamination through live search is a real risk, not a theoretical one.
Teams already running multi-agent systems may find Fusion slots in naturally alongside existing orchestration work.
What to Watch Next
OpenRouter’s benchmark numbers depend partly on which model acts as judge.
The company used Gemini 3.1 Pro Preview rather than the original DRACO paper’s choice of Gemini 3 Pro, and noted that absolute scores can shift 10 to 25 points depending on judge choice – even though relative rankings hold steady.
Expect more scrutiny over judge model selection as fusion-style approaches become common across providers, along with more third-party benchmarking now that the API is publicly available.
Frequently Asked Questions
What is the OpenRouter Fusion API? The OpenRouter Fusion API sends a single prompt to multiple AI models in parallel, then uses a judge model to synthesize their responses into one final answer, within a single API call.
How do I call the OpenRouter Fusion API? Send a standard request with “model”: “openrouter/fusion”. To customize the panel of models and which model acts as judge, add a fusion plugin block specifying analysis_models.
Does Fusion cost more than calling a single model? It depends on panel size and model choice. OpenRouter’s testing found that a budget panel of three smaller models can match near-frontier performance at roughly half the cost of a frontier-model fusion configuration.
What benchmark did OpenRouter use to test Fusion? OpenRouter used DRACO, a 100-task deep research benchmark built by Perplexity AI that grades responses on factual accuracy, synthesis depth, presentation quality, and citation quality.
Can fusing a model with itself improve results? Yes. OpenRouter found that pairing Claude Opus 4.8 with itself as a two-model panel raised its score from 58.8% to 65.5% – evidence that the synthesis step itself contributes to the improvement, separate from model diversity.
Is Fusion available now? Yes. It can be called directly via the API with the openrouter/fusion model slug, or tested interactively in OpenRouter’s chatroom at openrouter.ai/fusion.













nibalism.

