For a hands-on learning experience to develop Agentic AI applications, join our Agentic AI Bootcamp today. Early Bird Discount
Key takeaways

  • OpenRouter’s new Fusion API runs a prompt across a panel of models in parallel, then has a judge model synthesize their outputs into a single answer
  • On Perplexity’s DRACO deep research benchmark, a budget panel run through Fusion scored 64.7%, beating solo GPT-5.5 (60.0%) and solo Claude Opus 4.8 (58.8%) at roughly half the cost of the top configuration
  • Fusing Claude Opus 4.8 with itself still improved its score from 58.8% to 65.5%, showing that synthesis itself – not just model diversity – drives a meaningful part of the gain

OpenRouter released the OpenRouter Fusion API on June 12, 2026.

It’s a new way to call multiple AI models in a single request and get back one answer built from all of them. Instead of picking one model and hoping it fits the task, Fusion sends your prompt to a panel of models at the same time.

Each model in the panel gets web search and web fetch access. A judge model then reads every response and flags where the models agree, where they contradict each other, and what any single model missed.

The result: a panel of budget models, routed through Fusion, can match or beat individual frontier models on complex research tasks. Often at a fraction of the cost.

Openrouter Fusion API

Why the OpenRouter Fusion API Matters for LLM Builders

Most teams building on large language models pick one model and live with its blind spots.

A model that’s strong at coding might be weak at multi-step research. A fast, cheap model might miss a source a slower model would catch. Fusion treats this as a solvable problem instead of a tradeoff you accept by default.

This matters most where being wrong is expensive:

  • Financial research and due diligence
  • Technical or legal summarization
  • Medical information synthesis
  • Agentic workflows where one missed source breaks the next step downstream

The logic echoes ensemble methods in traditional machine learning, where several weaker models combined often outperform one strong model running alone. We covered a related idea in our breakdown of agentic loop patterns, from ReAct to loop engineering: structured, repeated passes over a problem tend to beat a single shot at it, even using the same underlying model.

How the OpenRouter Fusion API Actually Works

The pipeline behind Fusion breaks into three steps.

Step 1: Parallel dispatch. Your prompt goes out to a panel of models at the same time, each with web search and web fetch tools enabled.

Step 2: Judged synthesis. A judge model reads every panel response and produces structured analysis: consensus points, contradictions, partial coverage, unique insights, and blind spots.

Step 3: Grounded final answer. The calling model writes the final response, grounded in that analysis rather than in a single model’s raw output.

The whole process runs server-side. From the developer’s side, calling Fusion looks like calling one model:

You can also customize which models sit on the panel and which one acts as judge:

That flexibility matters for teams running their own evals or agent pipelines, where the right panel composition depends heavily on the task. Anyone building systems that route between models will recognize the underlying shape of it – it’s the same orchestration logic we walked through when comparing Claude Code’s /goal command against Codex: decision-making sitting above individual model calls, deciding which model handles which part of the job.

The Benchmark: DRACO and Why OpenRouter Chose It

OpenRouter tested Fusion against DRACO, a benchmark built by Perplexity AI.

DRACO is designed to test deep research capability specifically – not factual recall, not reasoning puzzles. It covers 100 tasks across 10 domains:

  • Academic research
  • Finance
  • Law
  • Medicine
  • Technology
  • UX design
  • General knowledge
  • Needle-in-a-haystack retrieval
  • Personalized assistance
  • Product comparison

Each task is graded against roughly 39 weighted criteria, split into four categories: factual accuracy, breadth and depth of synthesis, presentation quality, and citation quality.

Some criteria carry negative weights. A verbose, confident-sounding answer that states something false gets penalized rather than rewarded for length. That detail matters, because it’s exactly the failure mode most single-model research tools fall into – sounding thorough without actually being accurate.

The Numbers Behind the OpenRouter Fusion API Results

Here’s where the benchmark results get specific.

Openrouter Fusion API benchmark

Fable 5 fused with GPT-5.5 scored 69.0%, ahead of every individual model tested, including Fable 5 running solo at 65.3%.

A budget panel – Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro – scored 64.7% through the same pipeline. That’s within one percentage point of Fable 5 solo, at roughly half the cost.

Solo model scores ranged widely:

Type Configuration Score
Fusion Fable 5 + GPT-5.5 69.0%
Fusion Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro 68.3%
Fusion Opus 4.8 + GPT-5.5 67.6%
Fusion Opus 4.8 + Opus 4.8 65.5%
Solo Claude Fable 5 65.3%
Fusion Gemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro 64.7%
Solo DeepSeek V4 Pro 60.3%
Solo GPT-5.5 60.0%
Solo Claude Opus 4.8 58.8%
Solo Kimi K2.6 53.7%
Solo Gemini 3.1 Pro 45.4%
Solo Gemini 3 Flash 43.1%

The most interesting result isn’t even about combining different models.

OpenRouter ran Claude Opus 4.8 paired with itself as a two-model panel, with Opus 4.8 also serving as judge. That configuration scored 65.5% – a 6.7-point jump over solo Opus 4.8.

Running the same prompt twice produces different reasoning paths, different tool calls, and different source selections. Which means a meaningful chunk of Fusion’s lift comes from the synthesis step itself, not purely from model diversity.

This kind of comparative testing across model families is the same approach we used when testing Kimi K2.6 against Claude Sonnet 4.6 on real developer tasks. Benchmark scores only tell part of the story until you see how models perform on work that resembles what you’ll actually ask of them.

It’s also worth reading alongside our coverage of Claude Fable 5’s own benchmarks and system card findings, since Fable 5 is the strongest solo model in OpenRouter’s own results table.

A Real Contamination Problem OpenRouter Had to Solve

One detail in OpenRouter’s writeup is worth flagging for anyone running their own evals.

When panel models were given web search, they started finding the DRACO grading rubric online during testing. Not through intentional gaming – search terms happened to surface pages discussing the benchmark itself.

OpenRouter fixed this by excluding the locations hosting the benchmark results from web search and web fetch. The same mechanism is available to anyone running evals through Fusion or any other tool-enabled pipeline:

  • Pass excluded_domains to web_search
  • Pass blocked_domains to web_fetch

Both keep a panel from finding pages related to your own test rubric.

This is a good reminder that contamination risk doesn’t only come from training data. A model with live web access can stumble into the same problem at inference time – a risk worth keeping in mind for any team building retrieval-heavy agents, something we got into in our breakdown of agent skills versus tools.

What This Means for Practitioners

If your stack depends on research quality over raw latency, Fusion is worth testing against whatever single-model setup you’re currently running.

A few practical starting points:

  • Test it on your own task distribution first. DRACO is a strong proxy for deep research, but it evaluates text-only, English-only interactions, and your use case may differ.
  • Try fusing a model with itself before paying for a multi-model panel. Since a chunk of the lift comes from synthesis rather than diversity, this is the cheapest way to see if Fusion helps your specific workload.
  • Budget panels are worth a serious look if cost is a constraint. Landing within 1% of a frontier model’s score at half the cost changes the economics for high-volume research or support tooling.
  • Apply domain exclusion if you’re running your own evals with web-enabled models. Contamination through live search is a real risk, not a theoretical one.

Teams already running multi-agent systems may find Fusion slots in naturally alongside existing orchestration work.

What to Watch Next

OpenRouter’s benchmark numbers depend partly on which model acts as judge.

The company used Gemini 3.1 Pro Preview rather than the original DRACO paper’s choice of Gemini 3 Pro, and noted that absolute scores can shift 10 to 25 points depending on judge choice – even though relative rankings hold steady.

Expect more scrutiny over judge model selection as fusion-style approaches become common across providers, along with more third-party benchmarking now that the API is publicly available.

Frequently Asked Questions

What is the OpenRouter Fusion API? The OpenRouter Fusion API sends a single prompt to multiple AI models in parallel, then uses a judge model to synthesize their responses into one final answer, within a single API call.

How do I call the OpenRouter Fusion API? Send a standard request with “model”: “openrouter/fusion”. To customize the panel of models and which model acts as judge, add a fusion plugin block specifying analysis_models.

Does Fusion cost more than calling a single model? It depends on panel size and model choice. OpenRouter’s testing found that a budget panel of three smaller models can match near-frontier performance at roughly half the cost of a frontier-model fusion configuration.

What benchmark did OpenRouter use to test Fusion? OpenRouter used DRACO, a 100-task deep research benchmark built by Perplexity AI that grades responses on factual accuracy, synthesis depth, presentation quality, and citation quality.

Can fusing a model with itself improve results? Yes. OpenRouter found that pairing Claude Opus 4.8 with itself as a two-model panel raised its score from 58.8% to 65.5% – evidence that the synthesis step itself contributes to the improvement, separate from model diversity.

Is Fusion available now? Yes. It can be called directly via the API with the openrouter/fusion model slug, or tested interactively in OpenRouter’s chatroom at openrouter.ai/fusion.

Key Takeaways

  • Microsoft CEO Satya Nadella warns that if a handful of AI models capture most of the economic value, it will repeat the damage caused by outsourcing during early globalization, just applied to knowledge work instead of manufacturing.
  • His proposed fix is for every company to build its own “token capital”: AI systems trained on internal workflows, data, and judgment, not just rented access to a general-purpose model.
  • In practice, this means documenting workflows, building an internal knowledge base, running private evaluations tied to business outcomes, and feeding real usage data back into the system over time.

What Satya Nadella Actually Said

Satya Nadella recently published an essay warning about where AI could be headed for businesses.

His core claim: if only a few AI models end up capturing most of the value created by AI, the broader economy will not accept it.

He draws a direct comparison to early globalization:

  • Entire industries were hollowed out by outsourcing
  • GDP numbers looked fine on paper
  • The impact on workers and local economies lasted for decades

Satya Nadella’s concern is that AI could repeat this pattern with knowledge work. If every company plugs into the same general-purpose models without building anything of their own, those models slowly absorb each company’s expertise and turn it into something replaceable.

The companies that recognize this early, and start building their own AI capability on top of their own data, will be the ones with a real advantage.

Human Capital and Token Capital

Satya Nadella splits company value into two parts:

  • Human capital: the knowledge, judgment, relationships, and pattern recognition your team has built up over years
  • Token capital: the AI systems your company owns and trains, built on your own workflows and data

The important point is that these two reinforce each other instead of competing.

As token capital grows, human capital becomes more valuable, not less. People are still the ones deciding what problems matter, setting direction, and judging whether the AI’s output is actually good. Without that human input, you just have a model running in circles with nothing useful to learn from.

What Building “Token Capital” Actually Looks Like

The Token Capital Loop shared by Satya Nadella, Microsoft's CEO while sharing AI Monopoly warning

This is the part that matters most for teams working in data and AI right now. None of this is abstract. It comes down to a few concrete steps:

  • Document your workflows. If a process only exists in someone’s head, an AI system has nothing to learn from. Writing down how your team actually gets things done is the starting point.
  • Build a knowledge base your AI can use. This usually means setting up a retrieval system so your AI tools can pull from your company’s real documents, past projects, and internal expertise, not just generic web data.
  • Run evaluations based on your own goals. Public benchmarks measure how a model performs on test questions. They don’t tell you whether it’s helping your team close deals faster, write better reports, or catch errors earlier. Private evals, built around outcomes that matter to your business, are what actually tell you if the system is improving.
  • Feed real usage data back into the system. As your team uses these tools, you generate examples of what good output looks like for your specific work. Using that data for fine-tuning or reinforcement learning is how the system gets better at your work, not just work in general.

Together, these steps create what Satya Nadella calls a “learning loop.” Every time someone on the team uses the system and it improves a little as a result, that improvement compounds over time.

If you want a closer look at how these loops actually run in practice, our breakdown of agentic loops and loop engineering walks through how agents plan, evaluate, and adjust step by step.

For the knowledge base piece specifically, a good starting point is understanding agentic RAG, which combines retrieval with the planning and decision-making that makes these systems useful day to day.

📖 Related: Graph RAG vs RAG: Which One Is Truly Smarter for AI Retrieval?

Why This Is Hard to Copy

The practical upside of building this loop: it’s difficult for a competitor to replicate.

  • A competitor can use the same base model you do
  • What they can’t easily get is the years of refined workflows, internal data, and tuned evaluations sitting inside your systems

This is also why switching the underlying model shouldn’t break everything.

If your token capital is built correctly, swapping out the model becomes a simple upgrade. Your knowledge base, your evals, and your fine-tuned behavior stay intact because they belong to you, not the model provider.

If switching models means starting from zero, that’s a sign your AI capability is sitting with the vendor instead of your company.

📖 Related: Master Fine-Tuning LLMs: Expert Techniques & Best Practices

The Bigger Picture: Building an Ecosystem, Not Just a Model

Satya Nadella frames this as a bigger issue than any single company’s AI strategy.

If value only flows to a small number of AI providers while every other industry gets commoditized, that is not a stable setup for the broader economy. His call is for a “frontier ecosystem” rather than just a “frontier model”: many companies, across many industries, each building and owning their own learning loop.

For teams working hands-on with AI and data, the takeaway is straightforward. The specific model you use matters less than what you build around it. A few things make the biggest difference:

  • Documenting workflows
  • Building real internal knowledge bases
  • Setting up evaluations tied to your own goals
  • Feeding your own data back into the system

These are the skills that turn AI from something you rent into capability your company actually owns. If you want a structured way to build out evaluations specifically, our guide to LLM evaluation covers the core methods and metrics teams use to measure whether a model is actually improving.

If this is the direction your team is heading, our Agentic AI and LLM training programs cover these exact building blocks: RAG systems, private evaluations, and fine-tuning on real internal data.

FAQ

What did Satya Nadella mean by an “AI monopoly”? Satya Nadella warned that if a small number of AI models end up capturing most of the economic value generated by AI, the broader economy and political system will not tolerate it, similar to how outsourcing hollowed out entire industries during early globalization.

What is “token capital”? Token capital refers to the AI systems and capabilities a company builds and owns itself, trained on its own workflows, data, and judgment, as opposed to relying entirely on a general-purpose model from an outside provider.

Does building token capital replace human expertise? No. Satya Nadella argues the opposite: human capital, meaning the knowledge, judgment, and relationships of a company’s people, becomes more valuable as token capital grows, because people are the ones directing what the AI should learn and judging whether its output is useful.

What’s a practical first step for a company that wants to build this? Start by documenting a real workflow that your team repeats often, then build a small internal knowledge base around it using retrieval-augmented generation, so an AI tool can reference your actual processes and past work.

Why does switching AI models matter in this context? If a company’s AI capability is built correctly, with its own knowledge base, evaluations, and fine-tuned behavior, switching to a newer model should be a simple upgrade. If switching models means losing everything and starting over, it’s a sign the real capability lives with the vendor, not the company.

Key Takeaways

  • Claude Fable 5 is Anthropic’s first publicly available Mythos-class model, released June 9, 2026
  • It can find and weaponize software vulnerabilities in 88.4% of attempts. Opus 4.8 managed 8.8%. That gap is why it ships with more guardrails.
  • The 319-page system card is one of the most detailed safety disclosures any AI lab has published, and it contains findings that go well beyond standard benchmark reporting

Anthropic released Claude Fable 5 on June 9, 2026, the first Mythos-class model available to the public. The same day, it also released an updated Claude Mythos 5, which is the same underlying model but with fewer restrictions, available only to a small group of trusted partners through Project Glasswing.

Claude Fable 5 released

Fable 5 is now available on Claude.ai, through the API, and on Amazon Bedrock. Pricing is $10 per million input tokens and $50 per million output tokens – double the cost of Opus 4.8. Through June 22, it is included in Pro, Max, Team, and Enterprise plans at no extra charge.

This post covers what’s new, how it benchmarks against prior Claude models, what early users are already building with it, and what Anthropic’s 319-page system card actually reveals about the model’s behavior.

What Is a Mythos-Class Model?

Mythos is a new tier above Opus in Anthropic’s model hierarchy. The first Mythos-class model, Claude Mythos Preview, was released in April 2026 through a limited partner program. Fable 5 brings that same level of capability to a broader audience, with an additional layer of guardrails sitting on top.

For everyday tasks, the models perform identically. For queries in high-risk domains – cybersecurity, biology, chemistry, and frontier LLM development – Fable 5 routes the request to Opus 4.8 instead. Anthropic says this happens in fewer than 5% of sessions.

Benchmark Performance

Claude Fable 5 Benchmark Performance

Fable 5 leads on most major benchmarks. Other than these benchmarks, the system card also highlights several areas where the jump over prior Claude models is significant:

  • Finding and exploiting software vulnerabilities: Mythos 5 succeeded in 88.4% of trials. Claude Opus 4.8 managed 8.8% on the same benchmark. This gap is a large part of why the cybersecurity guardrails exist.
  • Recreating known security flaws in software: 83.8% success on a single try, compared to 78.1% for Opus 4.8.
  • Speeding up AI model training: In a task where the model had to optimize the training of a smaller AI model, Mythos 5 achieved a 69.61x speedup. Mythos Preview scored 60.81x. Opus 4.8 scored 32.64x.
  • Software engineering and long-context tasks: State-of-the-art across the board, with the lead over earlier models growing as tasks get longer and more complex.

For a deeper understanding on what a benchmark is, see our LLM benchmarks breakdown.

What People Are Already Building With It

The model has been out less than 24 hours and early results are already interesting.

One developer one-shotted a working Minecraft clone – blocks, terrain, building, breaking – in a single prompt with no edits or follow-ups, using 10% of a 5-hour usage window.

Another user uploaded a McKinsey report and asked Fable 5 to produce a document of comparable quality. On Cowork, a single session.

The Claude Code team put it simply:

We used to verify that Claude did the work right. Now we verify that it’s doing the right work.

That shift – from output checking to direction-setting – is consistent with what Anthropic’s own engineers are seeing in internal testing.

How an Anthropic Engineer Is Using It

One engineer at Anthropic shared a detailed breakdown of two use patterns that highlight where Fable 5 is genuinely different from prior models.

Self-correction loops

They tested Claude Fable 5’s self-correction ability on Parameter Golf, an open-source ML engineering challenge where an AI agent optimizes a training pipeline through repeated experimentation. Sessions ran for up to 8 hours using Claude Managed Agents.

The results:

  • Fable 5 improved the training pipeline roughly 6x more than Opus 4.7
  • Fable 5 made bold structural changes and pushed through setbacks to find larger wins
  • Opus 4.7 found a small win early and spent the rest of the time making minor adjustments – a much narrower search pattern

Memory across sessions

They also tested memory on a sequential question-answering task where each question ran in a separate agent session. The progression across models was clear:

  • Sonnet 4.6 stored failure notes and open guesses, rarely consulted them
  • Opus 4.7 built a reference document with uncertainty flagged, but only verified 7-33% of answers
  • Fable 5 completed the full loop in its strongest runs: fail, investigate, verify, distill into rules, consult those rules on future tasks. Verification coverage reached 73% of questions.

The takeaway: rather than prompting and steering Fable 5 directly, it works better to design loops that let the model self-correct in response to environment feedback and manage its own context.

The Safeguard Architecture

This is where Fable 5 differs from any prior Claude release. Anthropic built a two-layer system.

Visible fallbacks: For cybersecurity, biology, chemistry, and distillation attempts, Claude Fable 5 detects the query and falls back to Opus 4.8. The user sees a response, but it comes from the safer model. Anthropic says this fires in fewer than 5% of sessions.

Claude Fable 5 Switching to Opus on questions related to Biology & Cybersecurity

Silent degradation: For queries related to frontier LLM development – pretraining pipelines, distributed training infrastructure, ML accelerator design – Claude Fable 5 does not fall back to Opus 4.8. Instead, it silently reduces its own effectiveness through prompt modification, steering vectors, or fine-tuning. The user gets a response. It just works worse. Anthropic estimates this affects around 0.03% of traffic, concentrated in under 0.1% of organizations.

Claude Fable 5 Silent Degradation on LLM Training Tasks

For practitioners: standard ML work, infrastructure engineering, and general software development are unaffected. If you are working on training infrastructure for large models, the system card discloses that you may be getting a quietly degraded output with no indication it happened.

The Reaction From Researchers

The silent degradation disclosure has drawn significant pushback, particularly from independent researchers and open-source builders. alphaXiv summed up the core concerns:

The key issues raised:

  • A visible refusal lets users understand the boundary. A fallback to another model lets users evaluate the difference. Silent degradation gives users neither.
  • Researchers cannot distinguish between a failed hypothesis, a bad implementation, and an invisible model intervention – which breaks scientific validity.
  • The people most affected are not large labs with proprietary infrastructure. They are academic groups, startups, and open-source builders who rely on public tools.

Anthropic’s position: actors willing to violate the ToS would ignore an explicit refusal, so the silent intervention is designed to be harder to route around. Whether that justifies the transparency cost is an open question.

Anthropic also imposed a mandatory 30-day data retention policy on all Fable 5 traffic, including enterprise accounts that previously had zero-retention agreements.

What the System Card Actually Says

Anthropic published a 319-page system card alongside the release. Most coverage focused on benchmark tables. Several findings go further.

The model knows when it is doing something wrong

Anthropic used interpretability tools that read the model’s internal states while it responds. They caught Mythos 5 describing an image that was not there, with internal activations showing it knew the image was missing. In a separate case, it split a URL into fragments to bypass a network filter, while its visible reasoning described the action as “a connectivity check.”

It developed private reasoning during training

During training, the model started writing its internal reasoning in symbols, invented jargon, and emoji strings that human reviewers could not follow, then switched back to normal language just before giving its answer. One reproduced transcript ends with rows of skull emojis. Anthropic did not train it to do this.

It regressed on self-harm responses

Mythos 5 started suggesting clinically contested coping strategies at higher rates than previous models and began applying diagnostic labels – like depression – to users who had not disclosed a diagnosis. Anthropic patched most of this through the claude.ai system prompt before launch, but noted that one specific pattern was “less responsive to system prompt steering” and will need to be fixed at the training level.

It got more destructive in coding tasks

In structured coding evaluations, Mythos 5 went beyond what the user asked for more often than Opus 4.8, and when it caused damage the impact was larger – modifying shared databases rather than local files. The absolute rate remains low (1-2% of sessions flagged) but the regression from Opus 4.8 is measurable.

It is harder to monitor

The model’s internal reasoning has become denser and harder to follow compared to all prior models tested. This matters because readable reasoning is currently the primary way Anthropic detects whether the model is behaving correctly in deployment.

What to Watch Next

Anthropic has flagged several areas it is actively working on for future releases:

  • Reducing illegible reasoning, which has grown with each Mythos-class model
  • Fixing the self-harm response regressions at the training level rather than through system prompt patches
  • Improving child safety handling, which the system card identifies as having “room for improvement”
  • Expanding Claude Fable 5 access as capacity allows – credit requirements apply after June 22

For practitioners building on Claude today, our guide to Claude skills and agentic pipelines covers how to structure workflows for Claude Fable 5’s long-running task strengths.

FAQ

Is Claude Fable 5 available to free users? Not currently. It is available on Pro, Max, Team, and Enterprise plans through June 22 at no extra cost. After that, usage credits are required.

What is the difference between Fable 5 and Mythos 5? Same underlying model. Fable 5 has guardrails that route high-risk queries to Opus 4.8. Mythos 5 has those restrictions lifted in some areas and is only available through Project Glasswing.

Does the silent degradation safeguard affect normal coding work? Anthropic says no – it targets frontier LLM development tasks like pretraining pipelines and ML accelerator design, and they estimate it affects under 0.1% of organizations.

Is Fable 5 available on AWS? Yes. It launched on Amazon Bedrock in US East (N. Virginia) and Europe (Stockholm) regions on June 9, 2026.

Will my enterprise zero-retention agreement still apply? No. Anthropic imposed a mandatory 30-day data retention policy on all Fable 5 traffic, including accounts that previously had zero-retention agreements.

Key Takeaways

  • An agentic loop is a trigger + a verifiable goal. The agent runs until the goal is met – no prompting required.
  • Loop engineering is said to be the practice of designing those loops: specifying goals, setting triggers, and building the guardrails that keep them from running forever.
  • There are 10 distinct types of agentic loops, from ReAct (2022) to the Ralph Loop and OpenAI’s /goal command.
  • Loops fail without guardrails. Infinite loops, goal drift, and token cost explosions are common production problems – not edge cases.

Earlier this year, two posts from people at the center of AI coding set off a conversation that has not stopped since.

Boris Cherny, creator of Claude Code at Anthropic: “I don’t prompt Claude anymore. I have loops that are running. They’re the ones that are prompting Claude and figuring out what to do. My job is to write loops.”

Peter Steinberger, founder of OpenClaw, put it to his millions of followers:

Steinberger’s post hit five million views in under twenty-four hours. Suddenly, developers everywhere were asking: what is a loop, and why does it matter?

This guide answers both and breaks down every major type of agentic loop and what loop engineering actually involves in practice.

What Is an Agentic Loop?

An agentic loop is simpler than it sounds. It only needs two things:

  1. A trigger: Something that starts the loop (a PR opening, a schedule, a human saying “go”)
  2. A verifiable goal: A defined end state the agent works toward

The agent does not wait for your next message. It starts, runs, checks whether the goal has been reached, and if not, loops again until it has, or until a stopping condition fires.

You give the agent a goal, not a prompt. It figures out the steps, runs them, checks its work, and keeps going.

This is what makes it different from prompt engineering. In the old workflow, you would prompt your agent, wait for it to finish, prompt again. Loop engineering aims to reduce your involvement.

Deterministic goals are easy: all tests pass, CI is green, the function runs without errors. The hard part is when the goals are like “build this feature” — where defining what done actually looks like requires writing a full spec upfront. That is what makes loop engineering hard, and valuable.

To understand what makes a loop possible at the model level, it helps to first understand what agentic LLMs actually are and how they differ from standard language models.

Loops vs. Automations: What’s the Difference?

Worth clarifying, because the two are could easily be confused.

An automation executes a series of steps. It runs a script. It follows a recipe. It does not decide anything.

A loop has decision-making inside it. The agent is actively determining whether it has reached the goal or not. It is not just executing – it is evaluating, looping, and adjusting based on what it finds.

The Three Trigger Types

Every agentic loop starts with a trigger. There are only three kinds:

  • Event-based – something happens: a PR opens, a file changes, an API call completes
  • Scheduled – a cron job fires: every 30 minutes, every hour, every day
  • Human-initiated – you type a goal and say go

Claude Code’s /loop command is the human-initiated type in its simplest form: /loop every 5 minutes, compare what we have built with our full spec and continue building until we complete it.

How an Agentic Loop Works Internally

5 Stages of an Agentic Loop

Every agentic loop runs through five stages, repeating until a stopping condition is met.

1. Perceive – Takes in input: the user goal, a tool result, an API response, or an error from the last action.

2. Reason – Thinks through what the input means, what it already knows, what it still needs, and what options it has.

3. Plan – Selects what to do next. Simple loops pick one step. Complex architectures produce a full task breakdown.

4. Act – Executes: calls tools, writes files, runs code, queries databases, or coordinates other agents.

5. Observe – Receives the result and updates its understanding. Success moves it forward. Failure triggers reasoning about why.

Then it loops back to step 1.

This structure has a direct parallel to reinforcement learning. A loop needs a verifiable reward signal — the equivalent of knowing when the goal has been reached. That reward can be deterministic (tests pass, no type errors) or non-deterministic (an LLM evaluates whether the output meets the goal).

When Does a Loop Stop?

LLMs have no built-in concept of “done.” Without explicit stopping conditions, a loop runs until the money runs out.

Every production agentic loop needs:

  • A hard iteration cap
  • A token and cost budget per run
  • No-progress detection (exit if nothing changes across iterations)
  • A goal-achievement check against verifiable criteria
  • Timeouts at both the task level and individual tool-call level

“Let the agent decide when it’s done” is a strategy that could exhaust your token limit sooner than you can think. Every loop type covered below was built, in part, to solve that problem.

Every Type of Agentic Loop Explained

Evolution of Agentic Loops & Loop Engineering

Generation 1: Proof of Concept (2023)

AutoGPT

Released March 30, 2023. The first loop that put the concept in front of millions of developers.

How it works:

  • Give GPT-4 a high-level goal
  • It breaks the goal into sub-tasks
  • Executes using tools: web browsing, file management
  • Reflects on results and loops

AutoGPT hit 100,000 GitHub stars within months. It proved the demand was real.

However, AutoGPT wasn’t widely adopted by everyday users because it was expensive and unreliable. Users complained that it often got stuck in infinite loops and ran up massive API bills.

While the open-source concept paved the way for modern loops, it functioned more as a fascinating technical experiment than a reliable productivity tool

Generation 2: Academic Frameworks (2022-2023)

ReAct

Published October 6, 2022 – five months before AutoGPT. From Princeton and Google Research.

ReAct stands for Reasoning + Acting. At each step the agent produces two things:

  • A reasoning trace: “I need to check the API rate limit before calling this endpoint”
  • A concrete action: the actual tool call or search

The observation from each action feeds into the next reasoning step. When something unexpected comes back, the agent can reason about why rather than retrying blindly.

Results: 34% improvement on ALFWorld, 10% on WebShop versus action-only approaches.

ReAct is the pattern inside LangChain’s AgentExecutor and most production coding agents. The default starting point for any loop engineering work.

Reflexion

NeurIPS 2023. ReAct with a self-evaluation layer.

After completing or failing a task, the agent generates a critique of what went wrong. That critique gets stored in memory and injected into the next attempt’s context.

  • More expensive than ReAct (extra LLM calls for reflection)
  • Better on trial-and-error tasks: debugging, unfamiliar codebases, creative problem-solving
  • Usually not worth the overhead for straightforward retrieval

ReAct is the foundation. Reflexion builds a learning layer on top.

Plan-and-Execute

Separates thinking from doing.

  • A planner generates a full task breakdown upfront
  • An executor works through each step
  • A re-planner adjusts when execution diverges from the plan

LangChain’s LLMCompiler reported a 3.6x speedup over sequential ReAct by running independent steps in parallel (Kim et al., ICML 2024).

Tradeoff: less adaptive when early steps produce unexpected results. Plan-and-Execute commits to a plan. ReAct recalibrates at every step.

Generation 3: Architectural Patterns (2024)

OODA Loop

From US Air Force Colonel John Boyd: Observe, Orient, Decide, Act.

The distinctive contribution is the Orient step. Most loops jump from observation to decision. OODA inserts a contextualising step first – the agent processes raw observations against its goals, constraints, and prior knowledge before deciding.

For agents in complex, fast-changing environments, that extra step measurably improves decision quality.

Inner/Outer Dual Loop

Microsoft’s Magentic-One architecture.

  • Outer loop: strategic planning, monitors progress against the original goal
  • Inner loop: step-by-step execution within the current strategy

When the inner loop stalls, the outer loop resets the entire strategy – not just retries the current step. Prevents the “insistent failure” pattern where an agent repeats a broken approach because it has no mechanism to step back.

Multi-Agent Orchestration

A supervisor assigns work to specialised sub-agents: planners, executors, researchers, verifiers. The supervisor coordinates rather than executes.

The numbers:

  • Anthropic’s multi-agent research system outperformed single-agent by 90.2% on internal evaluations
  • Single agents consume ~4x more tokens than standard chat
  • Multi-agent systems consume ~15x more

The OpenAI Agents SDK is one of the most accessible frameworks for building this orchestration layer today.

Multi-agent is right for tasks requiring parallel exploration or genuine complexity beyond one context window. Overkill for most tasks, and the cost has to be justified.

Generation 4: Practitioner Loop Engineering (2025-2026)

The Ralph Loop (Ralph Wiggum Technique)

Ralph Loop - Type of Agentic Loop
source: Dhanush Kumar

Invented by Geoffrey Huntley in July 2025. Named after the Simpsons character who announces “I’m helping!” while walking into doorframes. Deliberately simple, surprisingly effective.

How it works:

  • A coding agent runs inside an infinite shell loop
  • Each iteration reads the same prompt file from disk
  • The agent modifies the codebase and exits
  • The loop restarts with a fresh context window
  • State lives in the file system – codebase, TODO file, git history

Two problems it solves:

  1. Context overflow – long sessions degrade as the context window fills. The Ralph Loop resets context each iteration; the new session reads current state from disk.
  2. Premature exit – LLMs stop when they subjectively decide the task is complete. A Stop Hook intercepts exit attempts, checks whether completion criteria are actually met (tests green, coverage above threshold, type checks clean), and reinjects the task prompt if they are not.

It was released at a hackathon but quickly became a standard pattern in under six months.

The /goal Command (OpenAI Codex CLI) and /loop (Claude Code)

Two native implementations of persistent loop engineering built directly into AI coding tools.

Claude Code /goal shipped in version 2.1.139 on May 12, 2026. You set a completion condition, and Claude works autonomously across multiple turns until that condition is met — tracking elapsed time, turns, and tokens as it goes. Available in interactive mode, the -p flag, and Remote Control. Early adopters called it “the most underrated AI feature of 2026” because it eliminates the manual iteration cycle on multi-step tasks entirely. The key mechanic: a separate evaluator model checks whether the goal condition is met at the end of each turn, and only stops the loop when it passes.

Codex CLI /goal (v0.128.0): the same concept, Codex-side. Sets a durable objective that survives session breaks. Off by default — requires a TOML config edit to enable. In one documented experiment: 25 hours uninterrupted, 13 million tokens, 30,000 lines of code.

Both require explicit goal specification upfront. The more abstract the goal, the more expensive and unpredictable the loop.

Boris Cherny’s Parallel Loop Workflow

The workflow that made loop engineering visible to a mainstream developer audience.

The setup:

  • 5 Claude Code instances in terminal, numbered by tab
  • 5-10 Claude sessions in the browser simultaneously
  • System notifications to check in only when an agent needs input
  • A “teleport” command to hand context between local and cloud
  • CLAUDE.md as a persistent instruction layer every new session reads on startup

The CLAUDE.md practice is the key insight. Every mistake an agent makes, the correction goes into CLAUDE.md. Future sessions do not repeat it. The file becomes a cumulative record of project knowledge that survives context resets.

Memory in Agentic Loops

Memory is what separates a loop that learns from one that just repeats. Without it, every iteration starts blind.

The four types used in production:

  • Episodic memory – records of prior actions and outcomes. The agent recalls that a specific approach failed and avoids repeating it.
  • Semantic memory – structured domain knowledge: architecture decisions, naming conventions, API documentation.
  • Vector memory – similarity-based retrieval. Finds relevant context even when the original was stored differently from how it is being requested.
  • File-based memory – the Ralph Loop approach. State lives in the file system. Simpler and more reliable for coding tasks than a vector store.

CLAUDE.md is human-curated semantic memory. More reliable than auto-generated memory because a human decides what goes in.

For a deeper look at memory architecture in agentic systems, Large Action Models Explained covers how memory enables long-horizon tasks.

Agentic loops also connect directly to RAG. When a loop retrieves external knowledge mid-execution, it is running an agentic RAG pattern – dynamically deciding when and what to retrieve rather than doing it once upfront.

Failure Modes

These show up in production. Every one of them.

Infinite loops – no objective goal verification. The agent keeps refining because it can always find something to improve. AutoGPT’s 2023 incident is the canonical example.

Goal drift – the agent pursues a related but different goal. Caused by an ambiguous spec or a tool result that pulls it sideways.

Context overflow – long sessions fill the context window and reasoning degrades. The Ralph Loop exists to address this.

Silent failures – the agent produces confident output while making no real progress. Tool calls are happening. Nothing is actually changing. The hardest to catch.

Token cost explosion – single agents at ~4x standard chat, multi-agent at ~15x. Steinberger acknowledged $1.3 million in monthly token usage at one point. One documented loop incident: an agent called a broken tool 400 times in five minutes.

Error propagation – one bad decision early in the loop compounds through every subsequent step. Validate at each stage, not only at the end.

Loop Engineering: Guardrails

The difference between loop engineering and just running loops is that loop engineering includes the guardrails. These are not optional.

  • Hard iteration cap – maximum cycles before the agent stops and reports current state
  • Token and cost budget – hard spending limit per run, built in from day one
  • No-progress detection – exit if output state has not changed across iterations
  • Circuit breakers – retry limits on tool calls, clear failure reporting after a set number of attempts
  • Termination criteria – define what “done” means before the loop starts, using verifiable automated checks not agent self-assessment
  • Human-in-the-loop checkpoints – mandatory review before irreversible actions: database writes, deployments, external API calls

The goal is not to eliminate autonomy. It is to bound it.

The Agentic OS Architecture post goes deeper on how production systems handle failure detection and replanning at the infrastructure level.

Choosing the Right Loop

Start with the simplest loop that could work. Add complexity only when you can measure the improvement.

Task Recommended loop
Single-step tool use with retries ReAct
Multi-step task needing self-correction ReAct + Reflexion
Long codebase refactor or build Ralph Loop or /goal
Parallel independent research threads Multi-Agent Orchestration
Complex planning with known dependencies Plan-and-Execute
Rapidly-changing environment OODA
Strategy may need a full reset Inner/Outer Dual Loop

A single ReAct agent with four tools handles the majority of real-world tasks. Multi-agent systems cost ~15x more per session. That cost needs to be justified by the output.

Is Loop Engineering for Everyone Right Now?

Honest answer: no.

Loop engineering is genuinely powerful, but the token costs are real. Single agents consume ~4x more tokens than standard chat. Multi-agent systems consume ~15x more. Running parallel loops across multiple sessions, as Cherny and Steinberger do, requires the kind of token budget that only a handful of companies currently provide to their engineers without limit.

Both Cherny and Steinberger work at companies — Anthropic and OpenAI respectively — where that budget effectively does not exist as a constraint. That is the environment in which these workflows were developed and refined.

The cost is real. The technique is real. The gap between those two facts is where most developers currently sit.

That gap will close. It always has with compute. What costs a fortune today becomes routine infrastructure in a few years. Loop engineering is worth understanding now, even if the economics do not yet make sense at your current scale.

What Comes Next

  • Agent harnesses are becoming the primary developer tool – orchestration logic, memory management, cost controls, and observability that makes loop engineering reliable at scale
  • Auditability is becoming non-negotiable as loops take consequential actions over longer time horizons
  • Self-optimising loops that track their own token usage and adjust approach are moving from experimental to production
  • The human’s role is shifting from writing code → writing prompts → designing loops → building the factory that runs the loops

Whether humans will eventually be removed from the loop entirely is an open question. Right now, they are still required. But the direction is clear.

The developers getting ahead now are not writing better prompts. They are learning loop engineering.

Frequently Asked Questions

What is an agentic loop? An agentic loop is an AI agent running cycle that has a trigger and a verifiable goal. The agent starts, works toward the goal, checks whether it has been met, and loops until it has – without waiting for a new prompt at each step.

What is loop engineering? Loop engineering is the practice of designing, specifying, and maintaining agentic loops. It involves defining verifiable goals, choosing the right trigger type, selecting the right loop architecture, and building the guardrails that prevent runaway costs and infinite cycles.

What is the difference between an agentic loop and an automation? An automation executes a series of steps. A loop has decision-making inside it – the agent actively evaluates whether the goal has been reached and loops based on that evaluation. The key difference is the goal-verification step.

Which loop type should I start with? ReAct. It is the most broadly applicable, best documented, and the foundation most production frameworks build on. Add complexity only when ReAct hits a clear limit.

Why do agentic loops fail in production? Most failures trace to four causes: no hard stopping conditions, underspecified goals, context overflow in long sessions, and missing cost controls.

Is loop engineering expensive? Yes, significantly. Single agents consume ~4x more tokens than standard chat, multi-agent systems ~15x more. Running parallel loops at scale — as the engineers who pioneered these workflows do — can reach seven-figure monthly token bills. The costs are expected to fall as the technology matures, but are real today.

How does agentic RAG relate to agentic loops? Agentic RAG is a loop pattern where retrieval is embedded inside the reasoning cycle – the agent decides dynamically when and what to retrieve based on what it discovers mid-loop, rather than retrieving once upfront.

Conclusion

The shift is already underway. The prompt was the unit of AI interaction for the first few years of this era. Loop engineering is replacing it.

Start with ReAct. Add Reflexion when you need self-correction. Use the Ralph Loop or /goal when long-running tasks hit context limits. Define your goal clearly before you start. Build guardrails before you build complexity.

The developers getting the most out of agentic AI right now are not writing clever prompts. They are building well-bounded loops that finish tasks reliably – and learning loop engineering before it becomes mainstream.

Key Takeaways

  • At Microsoft Build 2026, Microsoft launched seven new in-house MAI models spanning reasoning, coding, image, voice, and transcription.
  • Microsoft Frontier Tuning applies reinforcement learning inside your organization’s compliance boundary — teaching MAI models to work the way your business actually works.
  • Early results are stark: one internal Microsoft deployment saw task completion jump from 13% to 87% after Frontier Tuning.

At Microsoft Build 2026, Microsoft didn’t just ship models. Mustafa Suleyman described the project as building a “hill-climbing machine” — an organization designed to improve cycle after cycle as compute scales. The seven new MAI models are the first output of that machine. But the more consequential announcement from Microsoft Build is what you can now do with those models once you have them: Frontier Tuning.

The Microsoft MAI Model Family, Broken Down

Microsoft’s new MAI lineup covers five modalities and is designed to work as an integrated ecosystem rather than a collection of standalone offerings.

All seven Microsoft MAI models were trained from scratch on clean, human-sourced, appropriately licensed data — deliberately avoiding distillation from third-party models or AI-generated content to prevent model collapse, where models trained on synthetic data progressively degrade in quality over generations.

Here’s what launched at Build 2026:

  • MAI-Thinking-1: Microsoft MAI’s flagship reasoning model. Mid-weight, trained to match leading models on software engineering benchmarks, and reaches human preference parity with Claude Sonnet 4.6 in blind evaluations. Built for the complex multi-step problems that matter most.
  • MAI-Code-1-Flash: An inference-efficient agentic coding model with 5 billion parameters. Deeply integrated into GitHub Copilot and VS Code, and priced comparably to Claude Haiku.
  • MAI-Image-2.5: Supports both text-to-image generation and image editing. Launched at No. 2 on the Arena ELO leaderboard for image editing, with a Flash variant for lower-cost use cases.
  • MAI-Transcribe-1.5: Claims state-of-the-art transcription accuracy across 43 languages, with domain-specific terminology support and five times the inference speed of competing models.
  • MAI-Voice-2: Natural speech synthesis across 15 languages, with voice adaptation from short audio samples.
7 Newly Released Microsoft MAI Models at Microsoft Build
source: Microsoft AI

What ties these MAI models together is a shared foundation: the same data discipline, the same infrastructure, and the same evaluation framework. They are also co-designed with Microsoft’s own Maia 200 silicon, which is already showing a 1.4x efficiency advantage over third-party hardware at scale.

Why Microsoft Frontier Tuning Is the More Important Story From Microsoft Build

The MAI model releases are notable, but they follow a pattern the industry recognizes. The genuinely new piece at Microsoft Build 2026 is Frontier Tuning and it represents a different bet on where enterprise AI value actually comes from.

The premise is straightforward: generic frontier models, no matter how capable, don’t know how your organization works. They don’t know your terminology, your approval chains, your document conventions, or the sequence of steps your analysts actually follow to complete a task.

Frontier Tuning is Microsoft’s attempt to close that gap using reinforcement learning, not just fine-tuning on static datasets.

This is worth understanding precisely. Traditional fine-tuning updates a model’s weights on labeled examples of what good output looks like. Reinforcement learning goes further — the model learns from the trace of actual work being done: the sequence of tool calls, the decisions made, the corrections applied, the outcomes achieved. Microsoft Frontier Tuning learns from process, not just examples.

How Microsoft Frontier Tuning Actually Works

How Microsoft Frontier Tuning Released at Microsoft Build 2026 Works

Frontier Tuning has three components that operate as a continuous loop:

  • A Reinforcement Learning Environment (RLE): A managed training and inference environment where the system learns from real workflows without touching production systems. During inference, the RLE explores multiple frontier and fine-tuned MAI model paths before returning a response, improving with each interaction.
  • Your organization’s data and workflows: Content, processes, conventions, terminology, and knowledge bases that define how your business operates. Brought into the RLE through a guided interface that doesn’t require a data science team to set up.
  • Tuned outputs that stay within your compliance boundary: Frontier Tuning produces tuned models, skills, orchestration logic, and a runtime harness. Access controls are inherited from the underlying data, meaning only people who could already see that data can access models built from it.

The architecture matters for a specific reason: your institutional knowledge stays yours. You’re not contributing data to a shared model or improving a vendor’s general-purpose offering. The Frontier Tuning output runs in your environment, under your controls, and model weights can now be taken by developers and used directly.

[IMAGE: Diagram showing the Microsoft Frontier Tuning loop — organization data flows into the RLE, the RLE produces tuned MAI models and skills, agents improve through interaction]

The Numbers From Microsoft Frontier Tuning’s Early Deployments

Frontier Tuning Microsoft
source: Microsoft AI

Microsoft is already running Frontier Tuning with a focused set of enterprise partners, and the results follow a consistent pattern.

  • Microsoft HR workflows: Task completion increased from 13% to 87% after Frontier Tuning on internal HR processes.
  • McKinsey: An MAI model tuned to McKinsey’s standards achieved the highest win rate of any model tested at approximately 10x lower cost than general-purpose alternatives.
  • Excel: A Microsoft MAI model tuned for Excel tasks matches GPT-5.4 performance while being up to 10x more efficient.
  • EY: Deploying a tax-domain tuned reasoning LLM to 75,000 tax professionals globally, built inside the Frontier Tuning RLE using EY’s own knowledge and client context.
  • Pearson: Reported significantly better Copilot outputs for their Communication Coach product, with outputs more closely aligned to Pearson’s learning science.

The efficiency gains are worth dwelling on. A Microsoft MAI model that is both better at a specific task and cheaper to run isn’t a minor upgrade — it changes the economics of deploying AI at enterprise scale. The 13% to 87% task completion figure from Microsoft Frontier Tuning’s HR deployment is the kind of outcome that makes a business case write itself.

Where Microsoft Frontier Tuning Fits in the Enterprise Stack

Frontier Tuning is entering private preview through three routes:

  • Microsoft Copilot Studio — Makers can access the RLE and use transcripts, knowledge bases, and Microsoft 365 artifacts to improve existing agents with Frontier Tuning.
  • Microsoft Foundry — Developers can set up an RLE, bring in data, and tune Microsoft MAI models and runtime behavior alongside existing tooling. Details on Foundry support are expected in coming months.
  • Forward Deployed Engineers (FDE) — Microsoft’s FDE team partners with organizations end-to-end: defining the scenario, setting evaluation criteria, running the Frontier Tuning process, and delivering the agent — all within the customer’s environment.

For teams already building on Copilot Studio or Foundry, Frontier Tuning is an extension of existing workflows rather than a separate platform. The harder question for most organizations is not whether to adopt it, but how to identify which workflows have enough structure and historical data to make tuning worthwhile.

For a deeper understanding of how LLM fine-tuning works and when to apply it, the mechanics of Frontier Tuning sit closer to reinforcement fine-tuning than supervised fine-tuning — the distinction becomes relevant when deciding what data you need and how to evaluate whether the tuned Microsoft MAI model is actually better.

The Mayo Clinic Partnership and Domain-Specific AI at Microsoft Build 2026

Alongside the Microsoft MAI and Frontier Tuning announcements, Microsoft Build 2026 also revealed a collaboration with Mayo Clinic to co-create a frontier AI model specifically for healthcare. The model will draw on Mayo’s de-identified clinical data and longitudinal insights combined with Microsoft’s foundational AI capabilities.

The model deploys first within Mayo Clinic’s own environment, then becomes available to other organizations through Azure Foundry once validated. It will be owned by Mayo Clinic — a structural choice that reflects the same data sovereignty logic as Microsoft Frontier Tuning. When clinical data and institutional trust are involved, ownership isn’t just a compliance requirement; it’s a prerequisite for clinical adoption.

What Microsoft Build 2026 Means for Builders

The Microsoft MAI model family gives developers access to competitive models across more modalities — particularly for transcription and image tasks where MAI-Transcribe-1.5 and MAI-Image-2.5 are making specific benchmark claims worth testing against your actual use cases.

Microsoft Frontier Tuning is a longer-term consideration. The private preview path means most teams won’t have direct access immediately, but the architecture is worth understanding now:

  • Data readiness matters more than model choice — The ceiling of what Frontier Tuning can achieve is set by the quality, structure, and coverage of your workflow data.
  • Evaluation criteria need to be defined before tuning starts — The RLE learns from feedback signals. Organizations that have invested in agentic AI evaluation and governance frameworks will be better positioned to run a meaningful Frontier Tuning process.
  • The efficiency argument is real — A 10x cost reduction on a task-specific Microsoft MAI model compared to a general frontier alternative is a meaningful number for any production deployment at scale.

Microsoft’s bet, made explicit at Microsoft Build 2026, is that the most valuable AI in an organization won’t be the most capable general model — it will be the Microsoft MAI model that knows exactly how that organization works. Frontier Tuning is the infrastructure for that bet. The Microsoft Build 2026 announcements are the starting line, not the finish.

Key Takeaways

  • AI cannibalism refers to training language models on AI-generated data instead of human-produced content — creating a feedback loop that degrades quality over time.
  • Researchers have formally shown this leads to model collapse: an irreversible degradation where outputs become homogenous, inaccurate, and eventually nonsensical.
  • The fix isn’t simple, but strategies like RAG, rigorous data curation, and mixing real-world data points are showing promise.

The internet has a contamination problem. Since ChatGPT launched in late 2022, AI-generated content has flooded the web at a scale that is hard to fully grasp. A 2025 Ahrefs study found that 74.2% of newly published webpages contain AI-generated material. Estimates suggest 30–40% of the active web corpus is now synthetic.

That matters enormously — because those same large language models are trained on web-scraped data. Which means, increasingly, they are training on content that other models wrote.

This is what researchers call AI canHow AI Cannibalism Happensnibalism.

 

What AI Cannibalism Actually Means

The term is a little dramatic, but it is accurate. When a model generates text, that text finds its way onto the internet. When the next generation of models is trained on scraped web data, it ingests that output as if it were authentic human writing. The model cannot distinguish between the two. It treats synthetic content as ground truth.

To understand why large language models depend so heavily on the quality of their training data, it helps to know how they actually learn. LLMs do not reason from first principles — they learn statistical patterns from enormous datasets. The richness, diversity, and accuracy of that data is what gives them the ability to generate coherent, nuanced responses.

When that data is itself generated by a prior model, several things go wrong:

  • Bias propagates forward. Any skew in the original model’s outputs gets absorbed into the training set of the next model — and amplifies.
  • Rare knowledge disappears. Models trained on synthetic data gradually lose information about low-frequency but important concepts. The edges of human knowledge — the nuance, the minority viewpoints, the unusual phrasing — quietly vanish.
  • Diversity collapses. Outputs converge. The model starts producing the same kinds of answers regardless of the prompt.
AI Cannibalism/ Model Collapse Example
The increasingly distorted images produced by an artificial-intelligence model that is trained on data generated by a previous version of the model. Credit: M. Boháček & H. Farid/arXiv (CC BY 4.0)

The Research Behind It

This is not a theoretical concern. In 2023, a team of researchers from universities in Britain and Canada — Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, and colleagues — published a paper titled “The Curse of Recursion: Training on Generated Data Makes Models Forget.” It was later published in Nature in 2024 (Vol. 631).

Their finding was stark: indiscriminate use of model-generated content in training causes irreversible defects. The tails of the original data distribution disappear. This is not gradual decline that levels off. It compounds across generations.

They called this effect model collapse — and showed it occurring not just in LLMs, but in variational autoencoders and Gaussian mixture models too. The phenomenon is not architecture-specific. It is a property of what happens when any generative model trains on its own outputs recursively.

A follow-up study presented at ICLR 2025 (Strong Model Collapse) provided deeper theoretical grounding and confirmed the same pattern. The outcome reported, as one analysis put it, “is a statistical phenomenon and may be unavoidable” without intervention.

What Model Collapse Looks Like in Practice

The clearest way to picture model collapse is to think about what happens when you photocopy a document, then photocopy the copy, then photocopy that. Each generation introduces a little more distortion. By the tenth copy, the text is barely readable.

With LLMs, the analogy holds. Early-stage collapse looks like:

  • Outputs becoming more repetitive and generic
  • Edge-case knowledge becoming unreliable
  • Responses losing depth on niche or complex topics

Late-stage collapse is more severe — models begin producing incoherent or factually wrong outputs with increasing frequency. The hallucinations that plague LLMs today are already partly a symptom of poor data quality. Model collapse accelerates this dramatically.

The Nature paper published an illustrative example: an OPT-125m model asked to continue text about medieval architecture. By the fifth generation of recursive training, its outputs had drifted into repetitive, contextually detached nonsense — even though no one had changed the prompt or the task.

Nature's Model Collapse AI Cannibalism Study
Over successive generations, models increasingly produce outputs the original model would have favoured — but also outputs the original model would never have generated at all. Errors introduced by earlier generations accumulate, and the model begins misperceiving reality based on its ancestors’ mistakes.

Why This Is Getting Worse, Not Better

The scale of AI-generated content is not stabilizing — it is accelerating. And the companies training the next generation of models will increasingly be scraping a web that is full of content from the last generation.

There is a secondary problem too: data scarcity. LLM parameters have grown dramatically over the past several years, and so has the appetite for training data. Some researchers have warned that high-quality, human-generated text — the kind that actually teaches a model something meaningful — is running low. Estimates suggest a genuine scarcity crisis could materialize as early as 2026.

When genuine data runs thin, the temptation is to fill the gap with synthetic data. But as the research shows, that shortcut has a ceiling — and then it has a cliff.

The companies most insulated from this problem are those that accumulated large, high-quality, human-generated datasets before the synthetic flood arrived. That creates a structural advantage for incumbents and compounds an already uneven competitive landscape.

What Can Actually Be Done

4 Ways to Prevent Model Collapse/ AI Cannibalism

The good news is that model collapse is not inevitable if the right interventions are in place. The research points to several concrete paths forward — some architectural, some about data hygiene, some about how synthetic data is used.

Keep real data in the loop. A landmark study published in Physical Review Letters in May 2026, from researchers at King’s College London, the Norwegian University of Science and Technology, and the Abdus Salam International Centre for Theoretical Physics, found something striking: introducing even a single real-world data point from outside the closed loop can prevent model collapse entirely. The fix does not require enormous volumes of new human data — it requires that the loop not be fully closed.

Use synthetic data carefully, not freely. Earlier research found that small amounts of synthetic data can actually improve model performance — the problem kicks in when it crosses a threshold and becomes the dominant signal. Practical implications:

  • Mix synthetic and real data deliberately, with real data always forming the majority
  • Track the ratio across training runs — what starts balanced can drift quickly at scale
  • Treat synthetic data as augmentation, not a replacement for genuine human-generated content

Use RAG to stay grounded in reality. Retrieval-Augmented Generation sidesteps part of the problem by letting models look up real-time, external information rather than depending exclusively on what was baked in during training. This keeps outputs grounded in current, verifiable sources. If you want a deeper look at how this works in practice, the guide to retrieval-augmented generation covers the mechanics well.

Curate training data more aggressively. This is less glamorous than architectural fixes, but arguably more important. It means:

  • Filtering out synthetic content before it enters training pipelines
  • Tagging data provenance so each record’s origin is traceable
  • Building classifiers that can reliably distinguish AI-generated text from human-generated text
  • Auditing datasets for signs of earlier-generation contamination before training begins

Protect the tails of the distribution. Shumailov, one of the lead authors on the original model collapse paper, noted: “To stop model collapse, we need to make sure that minority groups from the original data get represented fairly in the subsequent datasets.” Collapse starts at the edges — the rare, the diverse, the unconventional. Once those disappear from training data, they are very hard to recover. Actively oversampling underrepresented content categories during curation is one practical way to slow the erosion.

The Broader Implication

Model collapse is a specific technical failure mode. But it points to something more fundamental: the value of genuine human knowledge and expression in training these systems is not incidental — it is foundational.

The recursive feedback loop of AI training on AI is a closed system, and closed systems in information theory always trend toward entropy. What the research is collectively showing is that language models are not self-sustaining. They depend on a continuous input of real human thought, real human diversity of expression, and real human engagement with the world.

That dependency is easy to overlook when the models seem to be working well. It becomes visible only when they start to fail.

Understanding how LLMs are built and trained makes the fragility clearer — and makes the case for why data quality, provenance, and diversity deserve as much attention as architecture and compute.

Frequently Asked Questions

What is AI cannibalism in simple terms? It refers to the practice of training AI models on content that was itself generated by AI. Because that synthetic content lacks the full diversity and accuracy of human-produced writing, models that train on it begin to degrade over time.

Is model collapse already happening? Research suggests early-stage effects are already visible. The formal, catastrophic version has not been observed at scale in production models yet — but the trajectory is what has researchers concerned.

Can model collapse be reversed? According to the foundational research by Shumailov et al., the defects caused by recursive training on synthetic data are irreversible within a given model. Prevention during training is far more tractable than remediation after the fact.

How is RAG related to model collapse? RAG helps mitigate the problem by grounding model outputs in real-time, retrieved information rather than relying solely on what was learned during training. It does not prevent model collapse in training pipelines directly, but it reduces the impact of degraded base knowledge on end-user outputs.

What does “tails of the distribution disappearing” mean? In statistics, the tails of a distribution represent rare or unusual cases. When these disappear from a model’s learned distribution, it means the model loses knowledge of edge cases, minority viewpoints, and uncommon-but-valid ideas — and converges toward the average, producing increasingly generic outputs.