AI cannibalism refers to training language models on AI-generated data instead of human-produced content — creating a feedback loop that degrades quality over time.
Researchers have formally shown this leads to model collapse: an irreversible degradation where outputs become homogenous, inaccurate, and eventually nonsensical.
The fix isn’t simple, but strategies like RAG, rigorous data curation, and mixing real-world data points are showing promise.
The internet has a contamination problem. Since ChatGPT launched in late 2022, AI-generated content has flooded the web at a scale that is hard to fully grasp. A 2025 Ahrefs study found that 74.2% of newly published webpages contain AI-generated material. Estimates suggest 30–40% of the active web corpus is now synthetic.
That matters enormously — because those same large language models are trained on web-scraped data. Which means, increasingly, they are training on content that other models wrote.
This is what researchers call AI cannibalism.
What AI Cannibalism Actually Means
The term is a little dramatic, but it is accurate. When a model generates text, that text finds its way onto the internet. When the next generation of models is trained on scraped web data, it ingests that output as if it were authentic human writing. The model cannot distinguish between the two. It treats synthetic content as ground truth.
To understand why large language models depend so heavily on the quality of their training data, it helps to know how they actually learn. LLMs do not reason from first principles; they learn statistical patterns from enormous datasets. The richness, diversity, and accuracy of that data are what give them the ability to generate coherent, nuanced responses.
When that data is itself generated by a prior model, several things go wrong:
Bias propagates forward. Any skew in the original model’s outputs gets absorbed into the training set of the next model — and amplifies.
Rare knowledge disappears. Models trained on synthetic data gradually lose information about low-frequency but important concepts. The edges of human knowledge — the nuance, the minority viewpoints, the unusual phrasing — quietly vanish.
Diversity collapses. Outputs converge. The model starts producing the same kinds of answers regardless of the prompt.
Figure: Increasingly distorted images produced by an artificial-intelligence model trained on data generated by a previous version of the same model. Credit: M. Boháček & H. Farid/arXiv (CC BY 4.0)
The Research Behind It
This is not a theoretical concern. In 2023, a team of researchers from universities in Britain and Canada — Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, and colleagues — published a paper titled “The Curse of Recursion: Training on Generated Data Makes Models Forget.” It was later published in Nature in 2024 (Vol. 631).
Their finding was stark: indiscriminate use of model-generated content in training causes irreversible defects. The tails of the original data distribution disappear. This is not gradual decline that levels off. It compounds across generations.
They called this effect model collapse — and showed it occurring not just in LLMs, but in variational autoencoders and Gaussian mixture models too. The phenomenon is not architecture-specific. It is a property of what happens when any generative model trains on its own outputs recursively.
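A toy simulation makes the recursive mechanism easier to see. The sketch below is not the paper's experiment. It assumes the simplest possible "model" (a one-dimensional Gaussian), a made-up sample size of 10 per generation, and a hypothetical function name, `simulate_collapse`. Each generation is fit only to samples drawn from the previous generation's fit.

```python
import random
import statistics


def simulate_collapse(generations=500, samples_per_gen=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the standard deviation of every generation, with the original
    ("human") distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original data distribution
    sigmas = [sigma]
    for _ in range(generations):
        # Draw a small synthetic dataset from the current model...
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_gen)]
        # ...and fit the next model to that synthetic data only.
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        sigmas.append(sigma)
    return sigmas


sigmas = simulate_collapse()
print(f"std at generation 0:   {sigmas[0]:.3f}")
print(f"std at generation 500: {sigmas[-1]:.3e}")
```

Because every fit sees only a finite sample, rare values are under-sampled and the estimated spread tends to shrink. The loss compounds across generations instead of levelling off, which is the tail-erosion effect described above in miniature.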
A follow-up study presented at ICLR 2025, “Strong Model Collapse,” provided deeper theoretical grounding and confirmed the same pattern. As one analysis put it, the outcome “is a statistical phenomenon and may be unavoidable” without intervention.
What Model Collapse Looks Like in Practice
The clearest way to picture model collapse is to think about what happens when you photocopy a document, then photocopy the copy, then photocopy that. Each generation introduces a little more distortion. By the tenth copy, the text is barely readable.
With LLMs, the analogy holds. Early-stage collapse looks like:
Outputs becoming more repetitive and generic
Edge-case knowledge becoming unreliable
Responses losing depth on niche or complex topics
Late-stage collapse is more severe — models begin producing incoherent or factually wrong outputs with increasing frequency. The hallucinations that plague LLMs today are already partly a symptom of poor data quality. Model collapse accelerates this dramatically.
The Nature paper includes an illustrative example: an OPT-125m model asked to continue text about medieval architecture. By the fifth generation of recursive training, its outputs had drifted into repetitive, contextually detached nonsense, even though no one had changed the prompt or the task.
Over successive generations, models increasingly produce outputs the original model would have favoured — but also outputs the original model would never have generated at all. Errors introduced by earlier generations accumulate, and the model begins misperceiving reality based on its ancestors’ mistakes.
Why This Is Getting Worse, Not Better
The scale of AI-generated content is not stabilizing — it is accelerating. And the companies training the next generation of models will increasingly be scraping a web that is full of content from the last generation.
There is a secondary problem too: data scarcity. LLM parameter counts have grown dramatically over the past several years, and so has the appetite for training data. Some researchers have warned that high-quality, human-generated text, the kind that actually teaches a model something meaningful, is running low. Estimates suggest a genuine scarcity crisis could materialize as early as 2026.
When genuine data runs thin, the temptation is to fill the gap with synthetic data. But as the research shows, that shortcut has a ceiling — and then it has a cliff.
The companies most insulated from this problem are those that accumulated large, high-quality, human-generated datasets before the synthetic flood arrived. That creates a structural advantage for incumbents and compounds an already uneven competitive landscape.
What Can Actually Be Done
The good news is that model collapse is not inevitable if the right interventions are in place. The research points to several concrete paths forward — some architectural, some about data hygiene, some about how synthetic data is used.
Keep real data in the loop. A landmark study published in Physical Review Letters in May 2026, from researchers at King’s College London, the Norwegian University of Science and Technology, and the Abdus Salam International Centre for Theoretical Physics, found something striking: introducing even a single real-world data point from outside the closed loop can prevent model collapse entirely. The fix does not require enormous volumes of new human data — it requires that the loop not be fully closed.
Use synthetic data carefully, not freely. Earlier research found that small amounts of synthetic data can actually improve model performance — the problem kicks in when it crosses a threshold and becomes the dominant signal. Practical implications:
Mix synthetic and real data deliberately, with real data always forming the majority
Track the ratio across training runs — what starts balanced can drift quickly at scale
Treat synthetic data as augmentation, not a replacement for genuine human-generated content
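One way to act on these three points is to enforce the ratio in code rather than by convention. The sketch below is an assumption-laden illustration: the record layout (a dict with a `source` key), the 20% default cap, and the function names `mix_with_cap` and `synthetic_fraction` are all hypothetical and not taken from any cited study.

```python
import math
import random


def mix_with_cap(real, synthetic, max_synthetic_fraction=0.2, seed=0):
    """Keep every real record and add synthetic ones up to a fixed share."""
    if not 0 <= max_synthetic_fraction < 0.5:
        raise ValueError("real data must stay the majority")
    # Solve s / (r + s) <= f for s, given r real records.
    limit = math.floor(
        len(real) * max_synthetic_fraction / (1 - max_synthetic_fraction)
    )
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, min(limit, len(synthetic)))
    mixed = list(real) + chosen
    rng.shuffle(mixed)
    return mixed


def synthetic_fraction(records):
    """Share of records tagged as synthetic; worth logging on every run."""
    if not records:
        return 0.0
    return sum(r["source"] == "synthetic" for r in records) / len(records)
```

Logging `synthetic_fraction` for each training run is the cheap part of "track the ratio": if the number creeps upward between runs, the drift is visible before it becomes the dominant signal.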
Use RAG to stay grounded in reality. Retrieval-Augmented Generation sidesteps part of the problem by letting models look up real-time, external information rather than depending exclusively on what was baked in during training. This keeps outputs grounded in current, verifiable sources. If you want a deeper look at how this works in practice, the guide to retrieval-augmented generation covers the mechanics well.
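The retrieve-then-generate idea can be sketched in a few lines. This is a deliberately simplified illustration, not how any production RAG system works: it assumes word overlap as a stand-in for vector search, and the function names `retrieve` and `build_prompt` are hypothetical. The model call itself is left out.

```python
import re


def tokens(text):
    """Lowercase word set, with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Rank documents by shared words with the query (toy stand-in for vector search)."""
    q = tokens(query)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Put the retrieved sources in front of the question before calling a model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"
```

The point of the pattern is that the answer is conditioned on documents fetched at query time, so the quality of the document store matters as much as the quality of the model's training data.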
Curate training data more aggressively. This is less glamorous than architectural fixes, but arguably more important. It means:
Filtering out synthetic content before it enters training pipelines
Tagging data provenance so each record’s origin is traceable
Building classifiers that can reliably distinguish AI-generated text from human-generated text
Auditing datasets for signs of earlier-generation contamination before training begins
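A curation step along these lines might look like the sketch below. Everything in it is illustrative: the `Record` fields, the `curate` function, and the `looks_synthetic` callable are hypothetical names, and the classifier is passed in from outside because reliable AI-text detection is an open problem that this sketch does not solve.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    text: str
    origin: Optional[str]       # e.g. a URL or dataset name; None if unknown
    collected: Optional[str]    # ISO date the text was collected, if known
    declared_synthetic: bool = False


def curate(
    records: Iterable[Record],
    looks_synthetic: Callable[[str], bool],
    require_origin: bool = True,
) -> Tuple[List[Record], List[Tuple[Record, str]]]:
    """Split records into kept and dropped, with a reason for every drop."""
    kept, dropped = [], []
    for rec in records:
        if rec.declared_synthetic:
            dropped.append((rec, "declared synthetic"))
        elif require_origin and not rec.origin:
            dropped.append((rec, "no provenance"))
        elif looks_synthetic(rec.text):
            dropped.append((rec, "flagged by classifier"))
        else:
            kept.append(rec)
    return kept, dropped
```

Keeping the drop reasons alongside the dropped records gives the audit step something concrete to review before training begins.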
Protect the tails of the distribution. Shumailov, one of the lead authors on the original model collapse paper, noted: “To stop model collapse, we need to make sure that minority groups from the original data get represented fairly in the subsequent datasets.” Collapse starts at the edges — the rare, the diverse, the unconventional. Once those disappear from training data, they are very hard to recover. Actively oversampling underrepresented content categories during curation is one practical way to slow the erosion.
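Oversampling can be as simple as weighting each record by the inverse of its category size. The sketch below assumes every record already carries a category label, and the names `balanced_weights` and `oversample` are hypothetical. Full equalization, as shown here, is the most aggressive setting; a real pipeline would likely temper it.

```python
import random
from collections import Counter


def balanced_weights(categories):
    """Weight each record by 1 / (size of its category)."""
    counts = Counter(categories)
    return [1.0 / counts[c] for c in categories]


def oversample(records, categories, k, seed=0):
    """Draw k records so that every category is equally likely to be picked."""
    rng = random.Random(seed)
    return rng.choices(records, weights=balanced_weights(categories), k=k)
```

With eight records in a common category and two in a rare one, each rare record is four times as likely to be drawn as each common one, so the two categories contribute roughly equally to the resampled set.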
The Broader Implication
Model collapse is a specific technical failure mode. But it points to something more fundamental: the value of genuine human knowledge and expression in training these systems is not incidental — it is foundational.
The recursive feedback loop of AI training on AI is a closed system, and a closed system that only resamples its own outputs loses information with every pass. What the research is collectively showing is that language models are not self-sustaining. They depend on a continuous input of real human thought, real human diversity of expression, and real human engagement with the world.
That dependency is easy to overlook when the models seem to be working well. It becomes visible only when they start to fail.
Understanding how LLMs are built and trained makes the fragility clearer — and makes the case for why data quality, provenance, and diversity deserve as much attention as architecture and compute.
Frequently Asked Questions
What is AI cannibalism in simple terms? It refers to the practice of training AI models on content that was itself generated by AI. Because that synthetic content lacks the full diversity and accuracy of human-produced writing, models that train on it begin to degrade over time.
Is model collapse already happening? Research suggests early-stage effects are already visible. The formal, catastrophic version has not been observed at scale in production models yet — but the trajectory is what has researchers concerned.
Can model collapse be reversed? According to the foundational research by Shumailov et al., the defects caused by recursive training on synthetic data are irreversible within a given model. Prevention during training is far more tractable than remediation after the fact.
How is RAG related to model collapse? RAG helps mitigate the problem by grounding model outputs in real-time, retrieved information rather than relying solely on what was learned during training. It does not prevent model collapse in training pipelines directly, but it reduces the impact of degraded base knowledge on end-user outputs.
What does “tails of the distribution disappearing” mean? In statistics, the tails of a distribution represent rare or unusual cases. When these disappear from a model’s learned distribution, it means the model loses knowledge of edge cases, minority viewpoints, and uncommon-but-valid ideas — and converges toward the average, producing increasingly generic outputs.
Anthropic shipped /goal in Claude Code v2.1.139 on May 12, 2026 — set a completion condition once, and the agent keeps working across turns until it’s met
OpenAI’s Codex CLI shipped a comparable /goal feature weeks earlier in April 2026, with persistent state that survives process restarts
The real story isn’t who got there first — it’s that both frontier labs converged on the same interaction model independently, signaling a structural shift in how AI coding tools are built
Two of the most widely used AI coding tools shipped the same feature within weeks of each other.
Anthropic added /goal to Claude Code on May 12 with version 2.1.139. OpenAI shipped a comparable feature to Codex CLI in April. Neither team was copying the other — they arrived at the same design because the problem they were solving is identical.
AI coding assistants have been optimized for a one-prompt-one-response rhythm. That rhythm breaks down the moment a task requires more than a few turns to complete. The broader shift toward agentic AI (systems that pursue goals rather than respond to prompts) has been building for years, and /goal is the first widely deployed mechanism to bring that model directly into a developer’s terminal.
/goal is the fix for that.
You define a completion condition — something like “all tests in test/auth pass and the lint step is clean” — and the agent keeps working until a small, fast evaluator model confirms the condition has been satisfied. No manual prompting to continue. No babysitting.
How do you keep Claude working until the job is done? Claude Code helps with this in a few ways, including one we shipped recently: /goal.
Run /goal followed by the condition you want satisfied. After each turn, a lightweight evaluator model checks whether the condition holds. If it doesn’t, Claude starts another turn automatically instead of returning control to you. The goal clears once the condition is met.
Key things to know about the session behavior:
One goal per session — a new /goal command replaces the active one
Status indicator — a ◎ /goal active badge shows elapsed time and tokens spent while a goal is running
Evaluator transparency — after each turn, the evaluator returns a short reason explaining why the condition is or isn’t met yet, visible in both the status view and the transcript
Manual override — run /goal clear to cancel anytime, or /goal with no argument to check progress
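The turn-then-evaluate loop described above can be modelled in a few lines. To be clear, this is not Claude Code's implementation, which is not shown in the source. It is a minimal sketch in which `run_goal`, `Verdict`, `take_turn`, and `evaluate` are all hypothetical names, and the agent and evaluator are plain Python callables standing in for model calls.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    met: bool
    reason: str  # short explanation, mirroring the evaluator's per-turn reason


def run_goal(condition, take_turn, evaluate, max_turns=20):
    """Keep taking turns until the evaluator says the condition holds."""
    transcript = []
    for turn in range(1, max_turns + 1):
        output = take_turn(condition, transcript)
        verdict = evaluate(condition, output)
        transcript.append((turn, output, verdict.reason))
        if verdict.met:
            return True, transcript  # goal clears
    return False, transcript  # turn limit reached without meeting the condition


# Demo with stand-ins: an "agent" that fixes one failing test per turn.
state = {"failing": 3}


def fix_one_test(condition, transcript):
    state["failing"] -= 1
    return f"{state['failing']} tests failing"


def check_tests(condition, output):
    met = output.startswith("0 ")
    return Verdict(met, "all tests pass" if met else output)


done, log = run_goal("all tests in test/auth pass", fix_one_test, check_tests)
print(done, len(log))
```

The structure shows why the evaluator has to be cheap: it runs once per turn, every turn, for as long as the goal is active.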
What matters about the design is how Anthropic framed what /goal is actually for.
The official docs position it for “substantial work with a verifiable end state” — not vague tasks, not exploration. Work that already has a clear finish line.
Use cases Anthropic explicitly calls out:
Migrating a module until every call site compiles and tests pass
Implementing a design doc until all acceptance criteria hold
Splitting a large file into focused modules until each is under a size budget
Running through a labeled issue backlog until the queue is empty
That framing defines the right mental model: /goal is a control surface for work that can be verified, not a shortcut for tasks you haven’t fully defined.
Writing Conditions That Actually Work
This is where most people will get tripped up early on.
A condition that holds across many turns needs three things:
One measurable end state — a test result, a build exit code, a file count, an empty queue
A stated check — how Claude should prove it (“npm test exits 0”, “git status is clean”)
Constraints that matter — anything that must not change along the way (“no other test file is modified”)
The condition can be up to 4,000 characters. You can also include a turn or time clause to bound how long a goal runs — “or stop after 20 turns” is a simple guardrail worth building into most conditions by default.
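If you write many conditions, a small helper can keep them consistent. The sketch below only assembles a string to paste after /goal. The helper name `build_goal` and the "Check:" and "Constraints:" labels are assumptions for illustration, not syntax the tool requires. The 4,000-character limit comes from the text above.

```python
MAX_CONDITION_CHARS = 4000  # limit stated in the text above


def build_goal(task, end_state, check, constraints=(), max_turns=20):
    """Assemble a /goal line from an end state, a check, constraints, and a turn cap."""
    parts = [f"{task} until {end_state}.", f"Check: {check}."]
    if constraints:
        parts.append("Constraints: " + "; ".join(constraints) + ".")
    parts.append(f"Or stop after {max_turns} turns.")
    condition = " ".join(parts)
    if len(condition) > MAX_CONDITION_CHARS:
        raise ValueError(
            f"condition is {len(condition)} characters; limit is {MAX_CONDITION_CHARS}"
        )
    return f"/goal {condition}"


print(
    build_goal(
        task="Fix the auth module",
        end_state="all tests in test/auth pass",
        check="npm test exits 0",
        constraints=["no other test file is modified"],
    )
)
```

Making the turn cap a default argument means the guardrail is present unless someone deliberately removes it.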
Writing effective /goal conditions is an extension of good prompt engineering. The same principles that make a standard prompt precise — specificity, clear success criteria, explicit constraints — apply here, but the stakes are higher because the agent will keep acting on a vague condition until it runs out of turns. If you’re newer to crafting structured instructions for LLMs, this primer on prompt engineering strategies covers the foundations well.
A few examples from the cheatsheets circulating on X that illustrate the pattern well:
/goal Refactor this repo to TypeScript strict mode. Success: zero ‘any’ types, all tests pass, no functional regressions, build clean, summary of changes.
/goal Make every test in this repo pass. Success: npm test exits 0, no skipped tests, root-cause notes for each fix, no test-mocking shortcuts.
/goal Migrate this app from Supabase to Postgres + Drizzle. Success: schema parity, all queries working, seed data preserved, tests pass, migration guide written.
Each of those conditions has a clear binary outcome. The agent either hits it or it doesn’t — and the evaluator can tell the difference.
The Trust and Safety Model
/goal is deliberately gated.
The feature only runs in workspaces where the trust dialog has been accepted, because the evaluator is part of the hooks system. It’s also unavailable when disableAllHooks is set at any settings level, or when allowManagedHooksOnly is set in managed settings.
This isn’t a footnote — it tells you something about how Anthropic is thinking about autonomous workflows. The trust dialog is the boundary. Teams deploying Claude Code in managed environments need to account for this before building /goal into any pipeline.
Security becomes a first-order concern as agents run longer and touch more of your codebase unsupervised. The trust model here is also relevant for teams using Claude Code Remote Control, where the agent is running locally but being accessed from another device — a long /goal run in that context means your machine is executing code autonomously while you’re away from it.
For individual developers, the practical implication is simple: if /goal silently does nothing when you run it, check the trust settings first.
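The gating rules in this section reduce to a short decision procedure. The sketch below only encodes the three conditions as stated here. It does not read any real settings file, and the function name `goal_available` and the layer names ("user", "project", "managed") are hypothetical. How Claude Code actually stores and merges settings is not modelled.

```python
def goal_available(trust_accepted, settings_layers, managed_settings=None):
    """Return (available, reason) for the three gates described above.

    settings_layers maps a layer name to that layer's settings dict.
    """
    if not trust_accepted:
        return False, "workspace trust dialog not accepted"
    layers = dict(settings_layers)
    if managed_settings is not None:
        layers["managed"] = managed_settings
    # disableAllHooks blocks /goal when set at any settings level.
    for name, layer in layers.items():
        if layer.get("disableAllHooks"):
            return False, f"disableAllHooks is set in {name} settings"
    # allowManagedHooksOnly only matters in managed settings.
    if (managed_settings or {}).get("allowManagedHooksOnly"):
        return False, "allowManagedHooksOnly is set in managed settings"
    return True, "ok"


print(goal_available(True, {"user": {}, "project": {"disableAllHooks": True}}))
```

Returning a reason alongside the boolean matches the troubleshooting advice above: when /goal appears to do nothing, the first question is which gate closed.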
How Codex’s /goal Is Different
Codex shipped its version roughly a month earlier, and the key architectural difference is persistence.
Where Claude Code’s goal lives within an active session, Codex’s implementation is built on app-server APIs and runtime continuation. The agent can survive process restarts, reboots, and terminal crashes. You can pick up where you left off even if your session died mid-task.
Other meaningful differences:
Checkpoint model: Codex defaults to “plan-mode nudges,” pausing at key decision points to confirm direction rather than running fully unattended. Full-auto mode is available via codex --approval-mode full-auto but isn’t the default.
Setup — Claude Code: launch CLI, type /goal. Codex Desktop: Settings → Configuration → goals = true. Different surfaces, different onboarding friction.
Multi-agent scope — Codex’s May 2026 release expanded MultiAgentV2 support, so multiple goals can be active across different environments, each tied to its own thread.
The philosophical difference between the two implementations is real.
Codex leans toward inline confirmation at decision points — the agent checks in before making consequential moves. Claude Code leans toward a blanket trust model — grant trust at the workspace level, then let it run.
Neither is wrong. They reflect different assumptions about who is using the tool and how much they want to stay in the loop during a long-running task.
The Formula Both Tools Share
Despite the architectural differences, the prompt structure that works is essentially the same across both tools.
The three-element formula:
/goal [do the work] until [measurable end state] without [constraints]
For more complex tasks, both tools benefit from an extended structure:
Tips that apply regardless of which tool you’re using:
One goal at a time — scope it tightly. A goal that tries to do too many things at once is harder for the evaluator to verify.
Let the model write its own /goal — describe the task in plain language and ask Claude or Codex to generate the condition. The model often writes a tighter condition than a human would.
Pair with /plan — run /goal → /plan → /goal clear for complex tasks where you want the agent to map the work before executing it.
Attach a .md checklist — the agent can use it as a running log, which makes the evaluator’s job easier and gives you a readable audit trail.
Add turn limits — “or stop after 20 turns” is a cheap safeguard against runaway sessions.
The Token Cost Risk Is Real
This is the part that doesn’t show up in the launch posts.
Neither Codex nor Claude Code currently has a native “set budget cap per goal” feature. A poorly scoped condition running across 50 turns with Sonnet as the evaluator model can cost significantly more than expected.
Part of what makes this worth understanding is the underlying model architecture. The /goal evaluator is itself a language model — a small one, but it’s running on every turn. If you’re using a larger model as the evaluator, costs compound fast. The shift toward using SLMs for evaluator-style tasks in agentic systems is exactly why tools like these tend to route lightweight verification work to smaller, cheaper models rather than the primary reasoning model.
Practical mitigations:
Hardcode a turn limit directly into the condition — the single most effective safeguard
Use Haiku as the evaluator model — evaluation speed and costs stay predictable; Sonnet as the evaluator spikes overhead fast
Set platform-level budget alerts before kicking off any long-running goal
Start with a dry run — test the condition on a small scope before pointing /goal at your entire codebase
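A back-of-the-envelope estimate before launching a goal is often enough to catch a badly scoped condition. The sketch below is simple arithmetic under loud assumptions: the function name `estimate_goal_cost` is hypothetical, the token counts and per-million-token prices in the example are made-up placeholders rather than real pricing, and input and output tokens are lumped together.

```python
def estimate_goal_cost(
    turns,
    agent_tokens_per_turn,
    evaluator_tokens_per_turn,
    agent_price_per_mtok,
    evaluator_price_per_mtok,
):
    """Rough cost of a goal: every turn pays for the agent AND the evaluator."""
    agent_cost = turns * agent_tokens_per_turn / 1_000_000 * agent_price_per_mtok
    evaluator_cost = (
        turns * evaluator_tokens_per_turn / 1_000_000 * evaluator_price_per_mtok
    )
    return agent_cost + evaluator_cost


# Placeholder numbers only; substitute your own measurements and current prices.
for turns in (20, 50):
    cost = estimate_goal_cost(
        turns,
        agent_tokens_per_turn=40_000,
        evaluator_tokens_per_turn=2_000,
        agent_price_per_mtok=3.0,
        evaluator_price_per_mtok=1.0,
    )
    print(f"{turns} turns: about ${cost:.2f}")
```

Cost scales linearly with turns, which is why a hard turn limit in the condition is the most direct lever available until native budget caps exist.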
The community is calling out token consumption as the main friction point right now. One widely shared take on X: “Already active in Claude Code and Codex — you need to use it now.” The enthusiasm is warranted. The cost awareness isn’t always there alongside it.
Comparing the Two Side by Side
| | Claude Code | Codex CLI |
| --- | --- | --- |
| Shipped | May 12, 2026 (v2.1.139) | April 2026 |
| Persistence | Session-scoped | Survives restarts/crashes |
| Default approval mode | Trust dialog (workspace-level) | Plan-mode nudges (inline) |
| Full-auto mode | Auto mode (approve tool calls) | codex --approval-mode full-auto |
| Turn tracking | ◎ /goal active + evaluator reason | Terminal title indicator |
| Multi-agent | One goal per session | Multiple goals across environments |
| Mobile | Yes (Claude Code Mobile) | Desktop CLI focus |
| Remote Control | Yes | N/A |
| Works with | Claude Code CLI, Remote Control, -p flag | Codex CLI, Codex Desktop |
The Actual Story: A Pattern Becoming Infrastructure
The more significant thing happening here is not the feature — it’s the convergence.
When two competing labs ship the same interaction primitive within weeks of each other without coordinating, that’s independent validation. /goal is becoming the default way to express “keep working on this until it’s done” across agentic coding tools. The fact that it’s also appeared in Hermes reinforces that this is a cross-platform pattern, not a product feature.
This is a natural extension of how agentic LLMs have been evolving — from models that respond to prompts, to models that reason across steps, to models that now pursue defined objectives autonomously across an unbounded number of turns. /goal is essentially the user-facing surface of that architectural shift. That has real implications for how developers should think about workflows going forward:
Tasks that previously required babysitting — multi-file refactors, migration jobs, test cleanup backlogs — are now first-class use cases with native tooling
The “keep going” prompt is effectively deprecated. You define the condition once and hand it off.
The session model of AI coding tools is shifting from discrete exchanges to durable objectives
Anthropic doubled Claude Code’s five-hour rate limits for paid plans in early May, a timing that makes more sense now that /goal is live and encouraging longer unsupervised runs. If those limits extend further, it signals Anthropic is prepared to bet on multi-hour autonomous workflows as a core product pattern.
The underlying reason both labs arrived here simultaneously is that the Model Context Protocol and the broader agentic tooling ecosystem have matured enough to make persistent, verifiable agent loops tractable. A year ago, the infrastructure to reliably evaluate conditions across many turns didn’t exist in a form that shipped cleanly to developers. It does now.
What Practitioners Should Do Right Now
If you’re on Claude Code:
Update to v2.1.139 if you haven’t already
Pick one task you currently babysit — anything where you keep prompting “continue” — and reframe it as a /goal condition
Start with test-driven refactoring — passing tests make a natural, verifiable end state
Add “or stop after 20 turns” to every condition until you’ve calibrated what your typical goals cost
If you’re on Codex:
Enable goals in Settings → Configuration → goals = true
Use the persistence layer for anything long enough that your terminal might close mid-task
Keep plan-mode on by default unless you’re confident in the condition — it’s a useful safety net for new task types
If you’re evaluating both:
Choose Codex if persistence across restarts matters for your workflow
Choose Claude Code if you want cleaner Remote Control integration or mobile access
Both work. The formula is the same. Start with whichever tool you’re already using.
What to Watch Next
A few signals worth tracking over the coming months:
Rate limit expansion — Anthropic’s May rate limit doubling looks like preparation for longer /goal runs. Further increases would confirm autonomous workflows as a priority.
Native budget caps — neither tool has this yet. The first to ship a “max spend per goal” control wins the trust of teams running this in production.
Evaluator model choice — both tools currently handle evaluator model selection implicitly. Explicit developer control over which model evaluates the condition would meaningfully change the cost calculus.
Cross-vendor standardization — if Hermes, Cursor, and other tools adopt the same /goal primitive, it may evolve into a shared spec rather than competing implementations.
The pattern is validated. The tooling will keep improving around it.
FAQ
What is the /goal command in Claude Code?
/goal is a command introduced in Claude Code v2.1.139 that lets you define a completion condition for an agent. After each turn, a lightweight evaluator model checks whether the condition is met. If not, Claude continues working automatically — no prompting required. The goal clears once the condition is satisfied.
How is Claude Code’s /goal different from Codex’s /goal?
The biggest difference is persistence. Codex’s implementation survives process restarts and terminal crashes using app-server APIs. Claude Code’s goal is session-scoped. Codex also defaults to inline confirmation checkpoints; Claude Code uses a workspace trust dialog as the access control layer.
What kinds of tasks is /goal designed for?
Tasks with a verifiable end state — migrating a module until every call site compiles, running tests until a suite passes, cleaning a backlog until it’s empty. It’s not well-suited for open-ended tasks without a clearly defined finish line.
Is /goal available in Claude Code Remote Control and mobile?
Yes. As of v2.1.139, /goal works in interactive mode, the -p flag, Remote Control, and Claude Code Mobile.
What’s the biggest risk with /goal?
Token cost. Neither Claude Code nor Codex has a native per-goal budget cap. A long-running goal with a large model as the evaluator can consume significantly more tokens than expected. Always include a turn limit in your condition and set platform-level budget alerts before running anything substantial.
Does /goal work the same way in both Claude Code and Codex?
The underlying pattern is the same — define a condition, let the agent work until it’s met — but the implementations differ in persistence, approval model, and setup. The three-element formula (/goal [task] until [end state] without [constraints]) works in both.
“Agentic OS” is not a product you install — it’s an architectural pattern that adds a management layer on top of AI agents so they can coordinate, share memory, and improve over time.
Without this layer, multi-agent systems break in predictable ways: agents contradict each other, forget context, and fail silently.
The pattern borrows directly from how operating systems manage processes — and that analogy turns out to be more useful than it sounds.
The Honest Answer Up Front
“Agentic OS” has become one of those terms that means everything and nothing at the same time.
Ask five engineers what it means and you’ll get five different answers. Ask a vendor and they’ll tell you their product is the Agentic OS. Ask Reddit and you’ll mostly get skepticism.
Here’s the fair take: the term is overused, but the underlying pattern is real and worth understanding.
This guide explains what an Agentic OS actually is, why the pattern exists, what its core components look like in practice, and where current implementations still fall short.
What Problem Does Agentic OS Actually Solve?
Before getting into what it is, it helps to understand why it exists.
Most people building with LLMs start with a single agent. It works well for simple tasks. Then requirements grow — the agent needs to search the web, write code, query a database, summarize documents, and make decisions across all of it. So you add tools. Then memory. Then you realize one agent doing everything is fragile, slow, and hard to debug.
The natural next step is splitting the work across multiple specialized agents. But now you have a different problem: who coordinates them?
Without a coordination layer:
Agents don’t know what other agents have done, so they repeat work or contradict each other
There’s no shared memory, so every agent starts from scratch on every run
When one agent fails, nothing knows how to recover — the whole pipeline stalls
Context bleeds between agents in unintended ways, producing inconsistent outputs
This is exactly the problem an Agentic OS is designed to solve. It’s the layer that sits above your agents and manages how they work together.
An Agentic OS is a software layer that manages multiple AI agents — coordinating how they plan, act, share memory, and learn — without requiring a human to intervene at every step.
The OS analogy holds up better than most tech analogies. A traditional operating system doesn’t do your work. It manages the resources — memory, CPU, I/O — that make work possible. It decides which process runs when, what memory each process can access, and how they communicate with each other.
An Agentic OS does the same thing, but for agents:
It allocates context and decides what each agent knows before it runs, so agents get exactly the information they need and nothing they don’t
It routes tasks and determines which agent is responsible for which part of a goal, based on capability and availability
It manages memory and maintains a shared knowledge layer that agents can read from and write to across sessions
It handles failures and detects when an agent produces a bad output or gets stuck, and triggers replanning instead of halting
Without this layer, you have a collection of agents. With it, you have a system.
The agents doing the actual work inside this system are LLM-based — models that can reason, use tools, and act across multiple steps. For a detailed look at how those models work and what makes them genuinely agentic, Agentic LLMs in 2025: How AI Is Becoming Self-Directed, Tool-Using & Autonomous covers the landscape well.
What Makes This Different From a Regular Multi-Agent Pipeline
This is the question the definition doesn’t answer on its own — and it’s worth being direct about.
A standard multi-agent pipeline is static. You define the flow upfront: agent A runs first, passes output to agent B, agent B passes to agent C. The coordination logic is hardcoded into the pipeline itself. It works well when inputs are predictable and nothing breaks. But change the input shape, add a new requirement, or have one agent fail — and the whole thing needs to be manually updated or it stops.
An Agentic OS moves coordination out of the pipeline and into a runtime layer. Instead of following a fixed script, the orchestrator decides at runtime how to break down a goal, which agents to involve, and in what order — based on the actual task in front of it. If a sub-task fails, it doesn’t halt. It replans. If a different approach is needed for a specific input, it routes differently. The pipeline adapts to the work, rather than forcing the work to fit the pipeline.
The simplest way to put it: a multi-agent pipeline follows a script. An Agentic OS writes the script on the fly and rewrites it when something goes wrong.
The Five Core Components
Every serious implementation of this pattern, whether you’re building it yourself or using a framework, needs these five components working together.
1. The Orchestrator
The orchestrator is the entry point for every goal that enters the system. It receives a high-level task, figures out what needs to happen, and coordinates the agents that execute it.
Think of it as the kernel of your Agentic OS — the component everything else reports to.
What a well-built orchestrator does:
Decomposes goals into sub-tasks that are specific enough for a specialist agent to execute without ambiguity
Routes each sub-task to the right agent based on what that agent is designed to do, not just what’s available
Tracks completion across all running agents and knows when to wait, when to proceed, and when to replan
Handles failures without halting — if a sub-task fails, the orchestrator tries an alternative path rather than crashing the whole pipeline
The key quality that separates a good orchestrator from a fragile one is replanning. Anyone can build an orchestrator that works when everything goes right. A reliable one keeps moving when things go wrong.
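To make the replanning idea concrete, here is a minimal sketch of an orchestrator loop. It is an illustration only: the `SubTask` type, the `registry` layout, and the toy agents are all hypothetical, and "replanning" is reduced to trying the next registered agent when one fails.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical agent signature: takes a sub-task description, returns a result
# string, and raises an exception when it fails.
Agent = Callable[[str], str]


@dataclass
class SubTask:
    description: str
    capability: str  # which kind of specialist should handle this sub-task


def orchestrate(subtasks: list[SubTask],
                registry: dict[str, list[Agent]]) -> dict[str, str]:
    """Run each sub-task, falling back to alternative agents on failure."""
    results: dict[str, str] = {}
    for task in subtasks:
        candidates = registry.get(task.capability, [])
        for agent in candidates:  # primary agent first, then alternatives
            try:
                results[task.description] = agent(task.description)
                break
            except Exception:
                continue  # "replan": move on to the next candidate
        else:
            # No candidate succeeded (or none was registered): record it
            # instead of crashing the whole run.
            results[task.description] = "UNRESOLVED"
    return results


def flaky_search(task: str) -> str:
    raise TimeoutError("search backend unavailable")


def cached_search(task: str) -> str:
    return f"cached findings for: {task}"


def writer(task: str) -> str:
    return f"draft for: {task}"


registry = {"research": [flaky_search, cached_search], "write": [writer]}
plan = [SubTask("competitor A", "research"), SubTask("summary report", "write")]
outcome = orchestrate(plan, registry)
```

In this run the first research agent fails, the fallback answers, and the writer still gets to run. A real orchestrator would typically re-decompose the goal rather than just swap agents, but the control flow has the same shape.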
2. Memory Architecture
This is where most early multi-agent systems break. If agents have no persistent memory, every run starts from scratch. Your agentic system would just be a collection of stateless API calls dressed up as agents.
A proper Agentic OS maintains three distinct memory layers:
| Memory Type | What It Stores | Lifespan |
| --- | --- | --- |
| Working Memory | The current task, intermediate results, and agent outputs mid-run | Lives for the duration of one task |
| Episodic Memory | Records of past interactions, decisions, and outcomes | |
Before an agent runs, the system queries the relevant memory stores and injects only the entries that matter for that specific task into the agent’s context. The agent doesn’t get a dump of everything the system knows — it gets a targeted slice. This retrieval step is essentially RAG applied to agent memory, which is covered in depth in Agentic RAG: A Powerful Leap Forward in Context-Aware AI.
Writing to memory is just as important as reading from it. Not every agent should have write access to long-term memory. Entries follow a defined schema, and in most production systems, new entries are reviewed before becoming permanent. This keeps the knowledge base from silently accumulating garbage that degrades agent behavior over time.
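The read and write rules above can be sketched in a few lines. This is a toy model under stated assumptions: `MemoryStore` and `MemoryEntry` are hypothetical names, word overlap stands in for whatever relevance scoring a real system would use (usually embeddings), and "review" is reduced to an `approved` flag.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    topic: str
    content: str
    approved: bool = False  # new entries stay pending until reviewed


@dataclass
class MemoryStore:
    writers: set[str]  # names of agents allowed to write
    entries: list[MemoryEntry] = field(default_factory=list)

    def write(self, agent: str, entry: MemoryEntry) -> bool:
        """Accept an entry only from an authorized writer and only if it fits the schema."""
        if agent not in self.writers:
            return False
        if not entry.topic.strip() or not entry.content.strip():
            return False
        self.entries.append(entry)
        return True

    def retrieve(self, query: str, k: int = 2) -> list[MemoryEntry]:
        """Return the top-k approved entries whose topic overlaps the query."""
        words = set(query.lower().split())
        scored = []
        for entry in self.entries:
            if not entry.approved:
                continue  # pending entries never reach an agent's context
            overlap = len(words & set(entry.topic.lower().split()))
            if overlap:
                scored.append((overlap, entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:k]]


store = MemoryStore(writers={"memory_agent"})
store.write("memory_agent", MemoryEntry("competitor pricing", "A charges per seat", approved=True))
store.write("memory_agent", MemoryEntry("brand tone", "Plain, direct language", approved=True))
store.write("memory_agent", MemoryEntry("competitor roadmap", "unverified rumor"))  # pending review
rejected = store.write("research_agent", MemoryEntry("competitor pricing", "a guess"))
hits = store.retrieve("competitor pricing research")
```

Only the approved, relevant entry comes back; the unrelated entry, the pending one, and the write from an unauthorized agent never reach the agent's context.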
3. Context Management
Context windows have hard limits. What you put in them determines the quality of every output.
“Fresh context” means each agent gets a purpose-built context window assembled specifically for its task — not a copy-paste of everything the system has seen so far.
A well-assembled context includes:
A scoped system prompt that defines the agent’s role and constraints for this specific task — not a generic “you are a helpful assistant” prompt
Retrieved memory entries pulled from the relevant memory layers, filtered to the top results most relevant to the current task
Tool definitions for only the tools the agent actually needs to complete its job
Handoff data from the previous agent in the pipeline, structured and clean
What gets deliberately excluded:
Conversation history from other agents’ runs, which introduces noise and causes unexpected behavior
Memory entries from unrelated tasks or past sessions that don’t apply here
Tool definitions for tools the agent won’t use — these take up context space and can confuse the model into attempting actions it shouldn’t
Clean context boundaries make the system predictable and debuggable. When something goes wrong, you know exactly what the agent saw when it made a bad decision — because you controlled what went in.
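As a rough sketch of what "purpose-built context" means in code, the function below assembles a context from the four included pieces and nothing else. The `AgentContext` type, the `ALL_TOOLS` catalog, and the tool names are hypothetical, and the memory list is assumed to arrive already ranked by relevance.

```python
from dataclasses import dataclass


@dataclass
class AgentContext:
    system_prompt: str
    memory: list[str]
    tools: dict[str, str]
    handoff: dict


# Hypothetical catalog of every tool the system knows about.
ALL_TOOLS = {
    "web_search": "Search the public web",
    "run_code": "Execute Python in a sandbox",
    "send_email": "Send an email on the user's behalf",
}


def build_context(role: str, task: str, needed_tools: list[str],
                  ranked_memory: list[str], handoff: dict,
                  top_k: int = 3) -> AgentContext:
    """Assemble a fresh context: scoped prompt, top-k memory, needed tools, handoff."""
    return AgentContext(
        system_prompt=f"You are the {role} agent. Your only task: {task}",
        memory=ranked_memory[:top_k],  # a targeted slice, not the whole store
        tools={name: ALL_TOOLS[name] for name in needed_tools},
        handoff=handoff,
    )


ctx = build_context(
    role="research",
    task="collect pricing data for competitor A",
    needed_tools=["web_search"],
    ranked_memory=["m1", "m2", "m3", "m4"],
    handoff={"competitor": "A"},
)
```

Because the context is an explicit value, you can log it alongside the agent's output and later see exactly what the agent was given.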
4. Specialist Agents
Instead of one large agent trying to handle everything, an Agentic OS runs a network of agents where each one is purpose-built for a specific type of task.
This is the part that makes the system genuinely scalable. A specialist agent has a tightly scoped system prompt, access to only the tools it needs, and a well-defined output format. It’s easier to build, easier to test, and much easier to fix when it breaks.
Common specialist roles in production systems:
Research agent — queries the web or internal knowledge bases to gather raw information, then structures it into a clean format that downstream agents can actually use
Writer agent — takes a brief and structured inputs and produces a draft, operating within brand or tone guidelines stored in semantic memory
Code agent — writes, reviews, or executes code against a defined spec, and returns structured results including errors and test outputs
QA agent — evaluates another agent’s output against a rubric before it moves to the next step, acting as a quality gate in the pipeline
Tool agent — handles direct integrations like API calls, database queries, and file operations — the parts of the workflow that touch external systems
Memory agent — decides what gets written to long-term memory after a task completes, applying the schema and governance rules that keep the knowledge base clean
Agents communicate through structured interfaces — defined input/output schemas, not free-form conversation. The orchestrator calls a specialist with a structured payload, the specialist returns a structured result, and the orchestrator uses that result to decide what happens next.
For agents to communicate reliably at scale, they need standardized protocols. Agentic AI Communication Protocols: MCP, A2A, and ACP explains how these standards work and why MCP in particular has become the default way agents connect to external tools and services.
This is what makes the whole system composable. You can swap out one specialist, improve another, or add a new one without touching the rest of the pipeline.
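One way to picture a structured interface is a pair of typed records plus a shared `run` method. The sketch below is an assumption about shape, not a reference to MCP or any specific framework: `TaskPayload`, `TaskResult`, and the toy `QAAgent` rubric are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class TaskPayload:
    task_id: str
    instruction: str
    inputs: dict = field(default_factory=dict)


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    status: str  # "ok" or "error"
    output: str = ""
    error: str = ""


class Specialist(Protocol):
    """Any agent with this method can be swapped in without touching the orchestrator."""
    def run(self, payload: TaskPayload) -> TaskResult: ...


class QAAgent:
    """Toy quality gate: rejects drafts shorter than five words."""
    def run(self, payload: TaskPayload) -> TaskResult:
        draft = payload.inputs.get("draft", "")
        if len(draft.split()) < 5:
            return TaskResult(payload.task_id, "error", error="draft too short")
        return TaskResult(payload.task_id, "ok", output=draft)


def next_step(result: TaskResult) -> str:
    """The orchestrator decides from the structured result, not from free text."""
    return "proceed" if result.status == "ok" else "retry"


qa: Specialist = QAAgent()
thin = qa.run(TaskPayload("t1", "review", {"draft": "too thin"}))
good = qa.run(TaskPayload("t2", "review", {"draft": "Competitor A raised prices twice this year"}))
```

The orchestrator only ever reads `status`, `output`, and `error`, which is what lets you replace `QAAgent` with a better implementation later.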
5. Feedback Loops and Self-Learning
A static multi-agent pipeline executes the same way every time regardless of whether its outputs are good or bad. A self-learning one gets better.
This doesn’t require retraining the underlying model. Most useful self-improvement happens at the workflow level through feedback loops that are built into the system.
Two types of feedback worth capturing:
Explicit feedback — A human reviews an output and signals whether it was good or bad. This could be a rating, a correction, or an approval/rejection in a review step. Good signals reinforce the current approach. Bad signals trigger a review of the relevant memory entries or system prompts that fed into that output.
Implicit feedback — Behavioral signals the system can observe without anyone rating anything. If a user consistently rewrites the opening of every email the writer agent drafts, that pattern is feedback. If outputs from a particular agent keep getting flagged in the QA step, that’s feedback too. The system captures these signals and surfaces them for review.
The goal is to build feedback collection into the workflow as a first-class feature — not bolt it on later.
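A minimal sketch of feedback as a first-class feature might look like the following. The `FeedbackEvent` record and the thresholds are hypothetical; the point is only that explicit and implicit signals land in the same log and can be aggregated per agent.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    agent: str
    kind: str       # "explicit" (a human rated it) or "implicit" (observed behavior)
    positive: bool
    note: str = ""


def agents_needing_review(events: list[FeedbackEvent],
                          min_events: int = 3,
                          max_negative_rate: float = 0.5) -> list[str]:
    """Surface agents whose share of negative signals exceeds the threshold."""
    totals = Counter()
    negatives = Counter()
    for event in events:
        totals[event.agent] += 1
        if not event.positive:
            negatives[event.agent] += 1
    return sorted(
        agent for agent, n in totals.items()
        if n >= min_events and negatives[agent] / n > max_negative_rate
    )


events = [
    FeedbackEvent("writer", "implicit", False, "user rewrote the opening"),
    FeedbackEvent("writer", "implicit", False, "user rewrote the opening"),
    FeedbackEvent("writer", "explicit", True, "approved"),
    FeedbackEvent("research", "explicit", True, "approved"),
    FeedbackEvent("research", "implicit", False, "flagged in QA"),
]
flagged = agents_needing_review(events)
```

Here the writer agent is surfaced for review because two of its three signals are negative, while the research agent has too few events to judge.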
How the Components Work Together: A Real Example
Here’s a concrete walkthrough. Say you ask an Agentic OS: “Research our three main competitors and draft a summary report.”
Step 1 — Orchestrator receives the goal and decomposes it: research competitor A, research competitor B, research competitor C, then synthesize everything into a report. It identifies the agents needed and sequences the work.
Step 2 — Context Manager builds a fresh context for each research task. It queries semantic memory for any prior research on these competitors, scopes the system prompt to research-only, and passes only the web search tool to each agent.
Step 3 — Research Agents run in parallel, one per competitor. Each searches, retrieves, and structures its findings into a clean output format that the next stage can consume.
Step 4 — QA Agent reviews each research output against a completeness rubric before anything moves forward. If one output is thin or off-target, it flags it and the orchestrator either retries or routes around it.
Step 5 — Writer Agent receives the validated research from all three agents and drafts the report. It pulls tone and formatting guidelines from semantic memory and structures the output to spec.
Step 6 — Memory Agent stores the final report and key findings in episodic memory so future runs can reference them without starting from scratch.
Step 7 — Feedback Loop kicks in when you read the report. If you edit sections, those changes are logged as implicit feedback on the writer agent’s prompt. If you approve it without changes, that’s a positive signal.
No human stepped in during steps 2–6. The system handled decomposition, coordination, quality checking, and memory management on its own. That’s the pattern in action.
Where Current Implementations Still Break
The Agentic OS pattern is sound. Most real-world implementations are still far from fully realizing it. Here’s where they actually fall apart:
Reliability: Agents hallucinate actions, not just text. An agent told to call an API might call the wrong endpoint or construct a malformed request, and do it confidently. Gartner projects that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls.
Memory drift: Without strict governance on what gets written to shared memory, the knowledge base silently accumulates bad entries. Agents start behaving inconsistently in ways that are hard to trace because the root cause is buried in stale or incorrect memory.
Context bleed: When agents share context carelessly, or when the context manager isn’t properly isolating each agent’s input, outputs from one task contaminate another. A support agent that carries over context from a code review run produces outputs that are confused and off-brand in ways that are hard to reproduce and harder to fix.
Infinite loops: Agents without well-defined exit conditions can get stuck. The orchestrator keeps replanning, the agent keeps retrying the same failing tool call, and the system burns tokens and time without making progress.
Cost at scale: Running multiple specialist agents per task, each making its own LLM call with a carefully assembled context, adds up fast. One way teams address this is by replacing large models with smaller, task-specific ones for routine agent roles, a shift covered in detail in From LLMs to SLMs: Redefining Intelligence in Agentic AI Systems. Production systems also need aggressive context pruning and result caching to stay economically viable at scale.
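The infinite-loop and cost problems share a simple mitigation: hard exit conditions. The sketch below is one hedged way to express that, with a hypothetical `run_with_limits` helper that caps both the number of attempts and the tokens spent. The token counts are made-up numbers reported by the step itself.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when an agent step runs out of attempts or tokens."""


def run_with_limits(step: Callable[[int], tuple[bool, int]],
                    max_attempts: int = 3,
                    token_budget: int = 10_000) -> tuple[int, int]:
    """Retry `step` until it succeeds or a limit is hit.

    `step(attempt)` returns (succeeded, tokens_used).
    Returns (attempts_made, tokens_spent) on success.
    """
    spent = 0
    for attempt in range(1, max_attempts + 1):
        ok, tokens = step(attempt)
        spent += tokens
        if ok:
            return attempt, spent
        if spent >= token_budget:
            raise BudgetExceeded(f"token budget hit after {attempt} attempts")
    raise BudgetExceeded(f"gave up after {max_attempts} attempts")


def succeeds_second_time(attempt: int) -> tuple[bool, int]:
    return attempt == 2, 1_500


def always_fails(attempt: int) -> tuple[bool, int]:
    return False, 4_000


attempts, spent = run_with_limits(succeeds_second_time)

try:
    run_with_limits(always_fails)
    reason = ""
except BudgetExceeded as exc:
    reason = str(exc)
```

The failing step stops with an explicit error the orchestrator can act on, rather than retrying indefinitely.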
The Buzzword Test: Is What You’re Looking At Actually an Agentic OS?
The term is being applied to things that don’t deserve it. Before you buy into a platform’s claim or evaluate your own system, ask three questions:
1. Does it have persistent, structured memory across sessions? If the system starts from scratch every time a new session begins, it’s not an Agentic OS. It’s a stateless pipeline with an LLM at the front.
2. Do specialized agents delegate work to each other through defined interfaces? If there’s one model handling every type of task with a single long prompt, that’s not an OS architecture — that’s just a capable model. The multi-agent structure with defined roles and clean handoffs is what makes the pattern work.
3. Does it replan when something fails? If the system halts, throws an error, or requires a human to restart whenever an agent produces a bad output, it’s a workflow tool. An Agentic OS handles failures as a normal operating condition, not an exception.
Build vs. Buy
If you’re deciding whether to build this pattern from scratch or use an existing framework, the tradeoff is straightforward.
Build from scratch if:
Your workflows are specific enough that no framework covers them without significant workarounds
Your security or data requirements mean you can’t route data through external APIs
You have the engineering capacity to maintain a custom orchestration layer long-term
Use an existing framework if:
You need to move quickly and don’t want to build memory management and agent routing from scratch
Your use case fits within what existing frameworks support — which covers most common patterns
You want built-in observability and debugging tools without building your own
What no platform decides for you:
How your memory layers are structured and who has write access
What your agent roles are and how they hand off to each other
How feedback signals get captured and acted on
What your failure and replanning logic looks like
The framework handles the plumbing. The architecture — the decisions that actually determine whether your system works — is still yours to design.
FAQ
What’s the difference between an AI agent and an Agentic OS? An agent is a single unit: it receives input, reasons, and produces an output or takes an action. An Agentic OS is the layer above that — it manages multiple agents, decides what each one knows, routes tasks between them, and handles what happens when things go wrong. The agent is the process; the Agentic OS is what runs and coordinates the processes.
Is Agentic OS the same as AGI? No. An Agentic OS is an architectural pattern for organizing AI agents. The agents inside it are still LLM calls with defined roles and scoped context — not general intelligence. The architecture makes them more capable as a system, but each individual agent is still narrow.
What is MCP and why does it matter here? Model Context Protocol (MCP) is an open standard that gives agents a consistent way to connect to external tools and services. Before MCP, every tool integration was custom-built — a different connector for every API. MCP acts like a universal adapter, so agents can call tools without the orchestration layer needing to know the implementation details of each one. For the full picture on MCP and other agent communication standards, see Agentic AI Communication Protocols: MCP, A2A, and ACP.
Can a small team realistically build this? Yes. Frameworks like LangGraph handle most of the infrastructure so you’re not building orchestration from scratch. A small team can get a functional multi-agent system running in weeks. The harder work is designing the memory governance, the agent interfaces, and the failure handling. Those require deliberate thought, not just code.
What are the biggest risks when deploying this in production? Three things cause the most problems: agents taking unintended actions with real-world consequences (sending emails, modifying records, making API calls that can’t be undone), memory drift degrading system behavior in ways that are slow and hard to diagnose, and runaway costs from uncontrolled LLM calls across many agents. All three are manageable — but only if you design for them upfront, not after you’re already in production.
The Bottom Line
Agentic OS is a real architectural pattern — not a product, not a marketing term, and not just AI hype.
The core idea is simple: multi-agent systems need a management layer the same way computers need an operating system. Without it, agents are powerful but ungovernable. With it, they become a system you can actually build on, debug, and improve over time.
Most of what’s being sold as “Agentic OS” today doesn’t fully deliver on the pattern yet. The implementations are catching up to the architecture. But the pattern itself — orchestration, structured memory, clean context, specialist agents, feedback loops — is the right foundation for any multi-agent system that needs to work reliably at scale.
If your current agent setup keeps hitting walls, this is the architecture that fixes it.
We ran Kimi K2.6 and Claude Sonnet 4.6 through four real developer tasks: code generation, debugging, code review, and security architecture reasoning.
Kimi K2.6 has three modes (Agent, Thinking, and Agent Swarm), and they behave meaningfully differently, not just faster or slower.
Claude Sonnet 4.6 was more consistent across tasks and leaned toward production-ready thinking; Kimi K2.6 went deeper on completeness when it ran at full capacity.
Mid-test, Kimi K2.6 dropped from Thinking to Instant mode due to high demand. That’s worth factoring in before you build workflows around it.
The timing of this comparison wasn’t random. The week we ran these tests, a lot of developers were already eyeing Kimi as a Claude alternative — not because of benchmarks, but because Anthropic spooked them on pricing.
On April 21, 2026, Anthropic’s pricing page briefly showed Claude Code removed from the $20/month Pro plan. No email, no changelog entry, just an “X” where the checkmark used to be. Reddit and Hacker News moved fast. Within hours there were hundreds of comments, and the alternatives people were naming most often were Kimi, Minimax, and Qwen. By end of day, Anthropic’s Head of Growth had clarified it was an A/B test on roughly 2% of new signups, and the page was restored the next morning. But the comment he left behind stuck: “Usage has changed a lot and our current plans weren’t built for this.”
The change was reversed, but the anxiety wasn’t. The timing also happened to coincide almost exactly with the release of Kimi K2.6 on April 20. So we decided to actually test it.
We paired Kimi K2.6 against Claude Sonnet 4.6, Anthropic’s mid-tier model, rather than Opus, because that’s the fair fight. Both sit in the everyday-use tier in their respective families. Both are what most developers have running in production right now. Comparing it to Opus would skew the results in ways that don’t reflect how people actually choose between models.
Before we get into the tasks, it’s worth understanding how Kimi K2.6 is structured, because it’s genuinely different from how Claude works.
Kimi K2.6 Agent operates as a single autonomous agent with tool access. It takes actions rather than just responding, closer to a coding assistant that can actually do things.
Kimi K2.6 Thinking is the deliberative mode. It takes longer, reasons through more steps before committing, and tends to surface tradeoffs. For review and architecture tasks, this is the right mode to use.
Agent Swarm is Kimi K2.6’s most distinctive offering: up to 300 parallel sub-agents coordinating across thousands of steps. There’s nothing quite like it in Claude’s current interface. We had planned to test it on an agentic planning task, but Agent Swarm and Agent modes currently require priority access. We couldn’t complete that test, so this comparison covers four tasks instead of five. That access gap is worth noting if you’re evaluating it for production.
For Claude Sonnet 4.6, we used standard mode across all tasks.
Kimi K2.6 vs Claude Sonnet 4.6: Feature Comparison
Before the task results, here’s the side-by-side on specs, pricing, and capabilities so you have the full picture in one place.
| | Kimi K2.6 | Claude Sonnet 4.6 |
| --- | --- | --- |
| API pricing | $0.95 input / $4.00 output per 1M tokens | $3.00 input / $15.00 output per 1M tokens |
| Context window | 256K tokens | 1M tokens (200K standard; 1M in beta) |
| Input modalities | Text, image, video | Text, image |
| Agentic modes | Agent, Thinking, Agent Swarm (waitlisted) | Standard + Claude Code |
| Open source | Yes (Modified MIT, self-hostable) | No |
| SWE-Bench Verified | 80.2% | 79.6% |
A few things are worth calling out from this table. The pricing gap is real: at $0.95/$4.00 per million tokens versus $3.00/$15.00, Kimi K2.6 is roughly 3–4x cheaper on the API. For teams running high-volume coding agents or processing long contexts regularly, that difference adds up fast. A startup consuming 100M input tokens and 10M output tokens monthly pays around $135 with Kimi K2.6 (100 × $0.95 + 10 × $4.00) versus $450 with Claude Sonnet 4.6 (100 × $3.00 + 10 × $15.00).
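The monthly cost for any workload can be worked out directly from the per-million-token prices in the table. This short sketch does the arithmetic for the 100M input / 10M output scenario; the function and dictionary names are ours, and the prices are the ones listed above.

```python
# (input price, output price) in USD per 1M tokens, taken from the table above.
PRICES_PER_MILLION = {
    "Kimi K2.6": (0.95, 4.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD for the given monthly token volumes."""
    price_in, price_out = PRICES_PER_MILLION[model]
    cost = input_tokens / 1_000_000 * price_in + output_tokens / 1_000_000 * price_out
    return round(cost, 2)


kimi = monthly_cost("Kimi K2.6", 100_000_000, 10_000_000)            # 95 + 40 = 135.0
claude = monthly_cost("Claude Sonnet 4.6", 100_000_000, 10_000_000)  # 300 + 150 = 450.0
ratio = round(claude / kimi, 2)                                      # 3.33
```

For this mix of input and output tokens the blended ratio lands at about 3.3x, inside the 3–4x range implied by the individual prices.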
The context window comparison needs a caveat though. Kimi K2.6’s 256K is generous, but Claude Sonnet 4.6’s 1M token beta window is a meaningful advantage for full-codebase analysis and long document workflows. If you need to load an entire repository into a single prompt, Sonnet 4.6 can do it at standard pricing. And while Kimi K2.6 is open source and self-hostable (a real differentiator for teams with data residency requirements or cost constraints at scale), Agent Swarm access currently requires a priority waitlist, so the most powerful mode on paper isn’t yet available to everyone on demand.
Task 1: Code Generation — Building a FastAPI Endpoint
The prompt: build a FastAPI endpoint that takes user_id and action, validates the action against an allowed list, stores events in memory, and returns a summary for that user.
Both models returned working code and neither needed cleanup. That’s the baseline and both passed.
The interesting part was the pattern each one reached for. Kimi K2.6 used a field_validator with Pydantic v2. Totally valid. Claude used Literal["login", "logout", "purchase"] as the type annotation itself, which means FastAPI rejects invalid input at the type level before the handler even runs. It’s a small difference on the surface, but it reflects how you think about where constraints should live: in a method, or in the type system. For Pydantic v2 specifically, the type-level approach is the more idiomatic pattern.
Claude also added a DELETE endpoint without being asked, flagged that the in-memory store should be replaced with Redis in multi-process deployments, and mentioned Swagger UI at /docs. Kimi K2.6 added a GET endpoint and solid curl examples. Both went beyond the prompt, just in different directions. Claude’s additions were the kind of things that come up in code review. Kimi K2.6’s additions were the kind of things that make the output immediately usable.
One more practical difference: Claude rendered the endpoint as a testable artifact you could interact with inline. With Kimi K2.6’s Agent mode, you copy the code, save the files, and run it locally. For developers iterating quickly, that friction adds up.
Task 2: Debugging — A Logic Bug That Looks Fine on the Surface
The function was supposed to return unique emails from a list of user dictionaries. The bug: seen was checked on every loop but never populated, so duplicates passed through silently. The code looked syntactically correct. There was nothing to catch in a linter.
Both models found it immediately. Both fixed it and recommended a set for O(1) lookups over the original list. On the core task, they were equal.
The difference showed up in what each model offered next. Kimi K2.6 threw in a one-liner using seen.add() inside a boolean short-circuit expression. It works, and you can see why it’s tempting to include. It’s also the kind of thing that gets flagged in a code review because it trades readability for conciseness in a way that doesn’t pay off in a real codebase.
Claude’s bonus was dict.fromkeys(). It’s a standard library idiom, it preserves insertion order, and any Python developer who reads it knows exactly what it’s doing. The O(n) vs O(1) explanation was also cleaner — not just “use a set” but a brief walkthrough of why the performance difference matters as the input scales.
Both models went beyond what was asked. One went toward showing off, the other went toward teaching.
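For readers who want to see the bug and the three fixes side by side, here is a reconstruction. The test function itself isn't reproduced in this article, so everything below is our own sketch based on the description: the function names, the `"email"` key, and the sample data are assumptions.

```python
def unique_emails_buggy(users):
    seen = []
    result = []
    for user in users:
        email = user["email"]
        if email not in seen:  # `seen` is never updated, so this is always True
            result.append(email)
    return result


def unique_emails_fixed(users):
    seen = set()  # set membership is O(1) on average; a list is O(n)
    result = []
    for user in users:
        email = user["email"]
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


def unique_emails_short_circuit(users):
    seen = set()
    # set.add() returns None, so `not seen.add(e)` is always True and is
    # evaluated only for its side effect. Compact, but easy to misread.
    return [e for e in (u["email"] for u in users)
            if e not in seen and not seen.add(e)]


def unique_emails_fromkeys(users):
    # Dict keys are unique and keep insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(u["email"] for u in users))


users = [{"email": "a@x.io"}, {"email": "b@x.io"}, {"email": "a@x.io"}]
```

All three fixed versions return the same de-duplicated, order-preserving list; they differ only in how much the reader has to decode.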
Task 3: Code Review — A Dangerous Database Function
This one had a classic SQL injection via f-string, a connection that’s never closed, SELECT * pulling every column, no error handling, no input validation, and a hardcoded database path. Six issues stacked in a short function.
Both models found all of them. Neither missed the SQL injection, and neither missed the resource leak. At the level of “does the model know what production code quality looks like,” both cleared the bar.
Where they diverged was in how they organized the findings. Claude led with severity labels (Critical, High, Medium) and finished with a summary table. That structure matters in practice: it tells you what to fix before you ship and what can wait for the next sprint. It also framed SELECT * as a security issue rather than just a performance one. Most developers know that pulling all columns is wasteful; fewer think about the fact that it likely returns password hashes, tokens, and admin flags to wherever the result lands. Claude made that explicit.
K2.6 caught two issues Claude didn’t mention (a missing docstring and absent type hints), and its refactored version reflected that. The rewrite came back with a full docstring including Args, Returns, and Raises sections, typed parameters using Optional[Tuple[Any, ...]], and a ValueError for empty or invalid inputs. If you needed a drop-in replacement you could commit immediately, its output was closer to ready.
The practical split: Claude’s output helps you triage. K2.6’s output gives you the replacement. Depending on what stage you’re at, one of those is more useful than the other.
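To show what addressing those issues can look like, here is a small sketch of our own. It is not either model's output, and the reviewed function isn't reproduced here: we assume SQLite (suggested by the hardcoded database path), and the `users` schema and `get_user` name are hypothetical.

```python
import sqlite3
from contextlib import closing
from typing import Optional


def get_user(conn: sqlite3.Connection, username: str) -> Optional[tuple]:
    """Fetch one user's non-sensitive fields by username, or None if not found.

    Raises:
        ValueError: if username is empty or not a string.
    """
    if not isinstance(username, str) or not username.strip():
        raise ValueError("username must be a non-empty string")
    # closing() releases the cursor even if execute() raises.
    with closing(conn.cursor()) as cur:
        # The `?` placeholder makes the driver bind the value, so user input
        # is never spliced into the SQL text. Explicit columns keep fields
        # such as password_hash out of the result.
        cur.execute(
            "SELECT id, username, email FROM users WHERE username = ?",
            (username,),
        )
        return cur.fetchone()


# Demo against an in-memory database; the connection is passed in rather than
# opened from a hardcoded path, and closing() guarantees it is closed.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, "
        "email TEXT, password_hash TEXT)"
    )
    conn.execute(
        "INSERT INTO users (username, email, password_hash) VALUES (?, ?, ?)",
        ("ada", "ada@example.com", "not-a-real-hash"),
    )
    found = get_user(conn, "ada")
    injected = get_user(conn, "ada' OR '1'='1")
```

The classic injection string is treated as a literal username and matches nothing, which is the behavior an f-string query cannot give you.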
Task 4: Multi-Step Reasoning — Rate Limiting an Auth Flow
The task: Restructure a six-step login service to add IP-based rate limiting before any database query, identify what new components are needed, and describe what could go wrong if implemented incorrectly.
Before the results, something happened mid-test that’s worth being upfront about. Kimi K2.6 hit high demand during this task and automatically dropped from Thinking to Instant mode. It told us, and offered an upgrade path. The response we got was Instant mode output, not Thinking mode. That matters for interpreting the results below and it matters for anyone evaluating K2.6 for workflows where consistent reasoning depth is a requirement.
The response itself was still solid. K2.6 restructured the flow correctly with the rate limit as the first gate, identified Redis with atomic INCR + EXPIRE as the right approach, flagged race conditions in non-atomic read-then-write patterns, laid out the fail-open vs fail-closed tradeoff, and caught the shared-IP / NAT problem with per-IP rate limiting. It also flagged clock skew in sliding window implementations — a genuinely obscure edge case that a lot of architects wouldn’t think to include.
Claude covered the same core ground and found a few things on top of it. One was a design decision that’s easy to overlook: should the rate limiter count all login attempts from an IP, or only the failed ones? If you only count failures, an attacker who occasionally succeeds with a throwaway account can keep resetting their counter. Claude called this out explicitly and explained why it matters under adversarial conditions. It also caught a timing side-channel: if the rate limiter sits after the database query, response latency differences can reveal whether a username exists even when the request is ultimately rejected. And it added the Retry-After header — not in the prompt, not something most people think about first, but something that prevents legitimate clients from hammering the endpoint during backoff.
The gap between the outputs here reflects something real: Claude’s response read like it was written by someone thinking about what breaks in production, not just what the correct architecture looks like on a whiteboard. Whether that gap would have been smaller if Kimi K2.6 had stayed in Thinking mode, we can’t say. But the mode degradation itself is part of the result.
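The core of the restructured flow can be sketched without any infrastructure. The class below is an in-memory stand-in for the Redis INCR + EXPIRE counter both models recommended; it is our own illustration, it only works inside one process, and it does not address the shared-IP or fail-open questions raised above. `FixedWindowLimiter`, `login`, and the status codes are assumptions for the example.

```python
import math
import time
from typing import Callable


class FixedWindowLimiter:
    """In-memory stand-in for a Redis INCR + EXPIRE counter, keyed by client IP."""

    def __init__(self, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._counters: dict[str, tuple[int, float]] = {}  # ip -> (count, expires_at)

    def check(self, ip: str) -> tuple[bool, int]:
        """Count one attempt and return (allowed, retry_after_seconds)."""
        now = self.clock()
        entry = self._counters.get(ip)
        if entry is None or now >= entry[1]:
            count, expires_at = 0, now + self.window  # start a new window
        else:
            count, expires_at = entry
        count += 1  # every attempt counts, not only failed ones
        self._counters[ip] = (count, expires_at)
        if count > self.limit:
            return False, max(1, math.ceil(expires_at - now))
        return True, 0


def login(limiter: FixedWindowLimiter, ip: str,
          check_credentials: Callable[[], bool]) -> tuple[int, dict]:
    allowed, retry_after = limiter.check(ip)  # first gate, before any database work
    if not allowed:
        return 429, {"Retry-After": str(retry_after)}
    return (200, {}) if check_credentials() else (401, {})


# Demo with a fake clock so the window behavior is deterministic.
now = [0.0]
db_calls = []


def bad_password() -> bool:
    db_calls.append(1)  # stands in for the credential lookup
    return False


limiter = FixedWindowLimiter(limit=2, window_seconds=60, clock=lambda: now[0])
r1 = login(limiter, "203.0.113.7", bad_password)
r2 = login(limiter, "203.0.113.7", bad_password)
now[0] = 15.0
r3 = login(limiter, "203.0.113.7", bad_password)  # over the limit: rejected early
now[0] = 61.0
r4 = login(limiter, "203.0.113.7", bad_password)  # new window
```

The third request is rejected before the credential check runs and carries a Retry-After value, which covers two of the points Claude raised. In a multi-process deployment the read-then-write in `check` would be exactly the race condition K2.6 flagged, which is why the atomic Redis operations matter there.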
Kimi K2.6 is genuinely capable — and in some areas, notably code completeness and certain deep reasoning tasks, it goes further than Sonnet 4.6. Its Thinking mode produces thorough output when it runs at full capacity, and the refactored code it returns is often closer to production-ready than what Claude gives you. The three-mode interface is also a real differentiator: being able to choose between a fast agent, a deliberative reasoner, and a massively parallel swarm depending on the task is something no other model in this comparison class currently offers.
Claude Sonnet 4.6 is more consistent. Across four tasks, it ran without degradation, and its outputs reflected a stronger read on what code needs to be maintainable over time — not just correct at the moment of generation. The things it added unprompted (the Literal type, the Retry-After header, the security framing on SELECT *) were the kind of additions that save you a ticket later.
The mode reliability issue is the most honest thing we can say about the current state of using Kimi K2.6 in a real workflow. If you’re evaluating a model for something you need to depend on, “it fell back to a different mode under load” is a relevant data point, separate from how good the output is when everything runs as intended.
If you’re building agentic workflows and want to explore what an open-source model purpose-built for long-horizon execution looks like, Kimi K2.6 is worth your time once access opens more broadly. If you need a reliable, production-aware model for everyday developer work right now, Sonnet 4.6 is the more consistent choice today.
What is Kimi K2.6? It is Moonshot AI’s latest open-source model, released April 20, 2026. It runs on a Mixture-of-Experts architecture with 1 trillion total parameters (32 billion active per token), supports text, image, and video input, has a 256K context window, and offers three execution modes: Agent, Thinking, and Agent Swarm. It’s built specifically for long-horizon coding and autonomous multi-agent workflows.
What is Claude Sonnet 4.6? Claude Sonnet 4.6 is Anthropic’s mid-tier model in the Claude 4.6 family, released February 17, 2026. It’s the default model on Claude.ai’s free tier and the one most developers are using in production coding workflows today.
Why compare it to Sonnet and not Opus? Both models are the practical everyday-use choice in their respective families. Comparing Kimi K2.6 against Opus 4.6 would tell you less about where these two actually compete, since most developers choosing between them aren’t in the Opus pricing tier.
How does it benchmark against Claude on coding tasks? On SWE-Bench Pro at release, K2.6 scores 58.6 vs Claude Opus 4.6’s 53.4. On SWE-Bench Verified, K2.6 scores 80.2 and Claude Sonnet 4.6 scores 79.6 — essentially the same. The benchmarks are close enough that practical output quality, consistency, and workflow fit matter more than the numbers alone.
What is K2.6 Agent Swarm and what is it good for? Agent Swarm is K2.6’s most distinctive mode — it coordinates up to 300 parallel sub-agents across up to 4,000 steps. It’s designed for tasks that can be broken into parallel, specialized workstreams: large-scale codebase migrations, comprehensive research pipelines, multi-format content generation at scale. There’s no direct equivalent in Claude’s current product. Access currently requires a priority waitlist.
Is Kimi K2.6 free to use? Yes, it is available free at kimi.com. Paid plans unlock higher usage limits and additional features. The model weights are also open-sourced under a Modified MIT License for developers who want to self-host using vLLM or SGLang.
Claude Design launched on April 17, 2026. It is Anthropic’s boldest move beyond chatbots, turning Claude into a full prototyping engine that outputs live HTML, CSS, and React
Google Stitch evolved from a single-screen experiment at Google I/O 2025 into a multi-screen AI canvas with voice commands and interactive prototyping by March 2026
Figma’s stock has fallen ~35% year-to-date in 2026 — the market is already pricing in a design tools disruption that product teams need to understand now
The design tool market has a new war on its hands, and it started in earnest this April. On April 17, 2026, Anthropic launched Claude Design, a workspace that lets teams go from a text prompt to a live, interactive prototype without opening Figma. Days later, the internet had a new debate: does this kill the $3.2 billion design tools industry, or does it just reshape it?
The honest answer is more interesting than either extreme. To understand what’s really happening, you need to look at both tools in detail — what they do, how they differ, and what each one means for designers, product managers, and developers trying to move faster in 2026.
Claude Design vs. Google Stitch: Feature-by-Feature Breakdown
| Feature | Claude Design | Google Stitch |
| --- | --- | --- |
| Launch | April 17, 2026 | May 2025 (major update March 2026) |
| Underlying AI | Claude Opus 4.7 | Gemini 2.5 Flash / Gemini 2.5 Pro |
| Output type | Live HTML, CSS, React components | UI mockups + HTML/TailwindCSS |
| Multi-screen | Yes | Yes (up to 5 screens per generation) |
Brand/design system
Auto-ingests codebase + design files on onboarding
Individual designers, fast ideation, Figma workflows
What Is Claude Design? Anthropic’s New Creative Workspace
Claude Design is a new product from Anthropic Labs that lets you collaborate with Claude to create polished visual work — designs, prototypes, slide decks, one-pagers, and more. It is powered by Claude Opus 4.7, Anthropic’s most capable vision model, and is currently available in research preview for Claude Pro, Max, Team, and Enterprise subscribers.
The key distinction worth understanding immediately: Claude Design is not an image generator. It is a prototyping engine. When you describe what you need — a landing page, a dashboard, a checkout flow — Claude builds a first version as live HTML, CSS, and React components that render in real time. You are not getting a static mockup to send to a developer. You are getting code.
This matters because it closes the gap between design and development in a way that earlier AI tools couldn’t. As we explored in our breakdown of Claude vs. ChatGPT, one of Claude’s consistent strengths has been its ability to reason about code and structure simultaneously — and Claude Design is exactly what happens when that capability gets a dedicated creative surface.
How the Workflow Works
The experience follows a natural creative loop. During onboarding, Claude reads your team’s codebase and design files to build a design system automatically. Every project that follows uses your brand’s colors, typography, and components without you having to specify them again. Teams maintaining multiple design systems — say, one for a consumer product and one for an enterprise dashboard — can manage both.
From there, you can start a project in several ways: a text prompt, an uploaded document (DOCX, PPTX, XLSX), a screenshot of your existing product, or by pointing Claude at your codebase. There is also a web capture tool that pulls visual elements directly from your website so that prototypes look like the real thing rather than a generic template.
Refinement happens through conversation. You can comment inline on specific elements, edit text directly, use adjustment knobs to tweak spacing and color in real time, and ask Claude to apply any of those changes across the entire design in one instruction. When a design is ready to hand off, Claude packages everything into a bundle that you pass to Claude Code with a single instruction — no manual spec writing, no back-and-forth briefs.
Who It’s Built For
The clearest use cases Anthropic has highlighted:
Designers who want to explore more directions quickly and turn static mockups into shareable interactive prototypes without a code review cycle
Product managers who need to sketch feature flows and hand them off directly to engineering or to designers for refinement
Founders and marketers who need a pitch deck or landing page and do not have a design background
Enterprise teams who want code-accurate, brand-consistent prototypes at scale
What Is Google Stitch? From Experiment to Figma Rival
Google Stitch launched quietly at Google I/O in May 2025 as a Google Labs experiment. The pitch was simple: describe a UI in plain English, and Stitch generates a screen for you. It was fast, impressively accurate for a first version, and clearly a test of appetite. The market responded with enthusiasm, and less than a year later, Stitch is a fundamentally different product.
The March 2026 update transformed Stitch into an AI-native software design canvas. Where the original tool generated single screens, the new version generates up to five interconnected screens simultaneously from a single natural language description. Where the original had a basic prompt input, the new version has an infinite canvas, a design agent that tracks the project’s evolution, voice commands, and an Agent manager that lets you work on multiple design directions in parallel.
Stitch’s origins trace back to Galileo AI, a startup founded in 2022 that built one of the earliest text-to-UI tools. Google acquired Galileo AI in early 2025 and rebranded it as Stitch, integrating it with the Gemini model family. This acquisition context matters: Stitch is not a side experiment Google spun up to test generative UI. It is Google’s most serious attempt to enter the professional design tools market, and it is backed by Gemini’s multimodal reasoning.
The Two Modes
Stitch runs on two versions of Gemini depending on what you need:
Standard Mode uses Gemini 2.5 Flash — fast, good for text-based prompt generation, supports Figma export, and gives you 350 generations per month
Experimental Mode uses Gemini 2.5 Pro — higher-fidelity output, accepts image inputs (sketches, screenshots, wireframes), and gives you 200 generations per month
Both modes are currently free through Google Labs, which is an important factor for individual designers and small teams evaluating the tool against paid alternatives.
What the March 2026 Canvas Introduced
The infinite canvas is the most significant structural change. Traditional design tools give you a blank page and expect you to fill it. Stitch’s canvas is intelligent — it understands the project’s entire history, can suggest next screens based on a user’s likely journey through the app, and allows you to bring in context from images, text, or code directly onto the canvas.
Voice is the other major new capability. You can speak to the canvas directly — asking for real-time design critiques, requesting layout variations, or triggering specific changes like “show me three different menu options” while a design is open. This is not a gimmick. For designers who think out loud or work with stakeholders during live reviews, voice interaction meaningfully changes how feedback loops work.
Stitch also introduced DESIGN.md — an agent-friendly markdown file that lets you export or import your design rules to and from other tools, including other Stitch projects. This addresses one of the biggest practical friction points in AI design tools: the inability to carry brand context across projects without starting from scratch.
Figma shares fell more than 4% the week of the Google Stitch March 2026 update. The stock is down approximately 35% year-to-date in 2026 — the design tools market is already repricing around AI disruption.
Where Claude Design Pulls Ahead
The strongest argument for Claude Design is the depth of its enterprise workflow integration. The design system ingestion during onboarding is not a feature you will find in Stitch — it means that from project one, every output reflects your actual brand rather than a generic interpretation of it. For teams managing complex visual identities across multiple products, this alone justifies the switch for prototyping work.
The Claude Code handoff is the other structural advantage. When a prototype is ready to build, Claude packages the entire design context into a bundle that passes directly to Claude Code. There is no specification document to write, no annotated Figma file to export, no brief to translate. The design and the implementation instructions are one artifact. Given how much time is lost in most product teams at exactly this handoff moment, this is a meaningful efficiency gain.
Product teams at companies like Datadog have reported going from rough idea to working prototype before a meeting ends, with the output staying true to brand guidelines without manual correction. Brilliant’s design team noted that pages requiring 20+ prompts in other tools only needed two prompts in Claude Design. These are not generic testimonials — they reflect a genuine reduction in friction at the most painful parts of the design cycle.
Understanding why Claude performs so well here requires some context about how the underlying model has evolved. For a deeper look, the post on how Claude 3.5 Sonnet introduced Artifacts, the feature that laid the groundwork for Claude Design’s real-time rendering, explains the architectural shift that made this possible.
Where Google Stitch Pulls Ahead
Stitch’s multi-screen generation is its most practically powerful feature. Describing a full application flow and receiving five interconnected, coherent screens in one operation is something Claude Design does not currently offer at the same fidelity. A product manager who needs to communicate an entire checkout flow — cart, shipping, payment, confirmation, order tracking — can have that as a coherent design artifact in a single Stitch prompt.
The Figma integration is the other reason Stitch fits better into many existing workflows. Designers who live in Figma do not want to abandon it — they want faster ideation before they open Figma. Stitch’s paste-to-Figma function makes that transition seamless. Claude Design, by contrast, is building a parallel workflow that competes with Figma rather than plugging into it.
Stitch is also genuinely free in a way that matters for independent designers and early-stage teams. 350 standard mode generations per month is enough for rapid prototyping without any subscription cost. Claude Design requires a Pro plan at $20/month for meaningful access beyond the free tier — which is still competitive, but it is not free.
Voice-driven design critique is a genuine differentiator that is hard to overstate for teams that work collaboratively. The ability to talk through a design with an AI agent that responds in real time — making adjustments, offering critiques, suggesting alternatives — is a fundamentally different mode of working than typing prompts in a chat interface.
Pricing: What Do These Tools Actually Cost?
| Plan | Claude Design | Google Stitch |
| --- | --- | --- |
| Free | Available with generation limits | Free via Google Labs (350 standard/200 pro generations/month) |
| Paid | Claude Pro at $20/month (included in subscription) | No paid tier announced yet |
| Team/Enterprise | Claude Team and Enterprise plans (admin controls, org sharing) | Not yet available |
Both tools undercut Figma’s team pricing significantly. For context, Figma’s professional plans run $12–15 per editor per month, with organization plans considerably higher. The AI design tools entering this space are doing so at a price point that makes evaluation essentially free, which accelerates adoption.
What This Means for Designers, PMs, and Developers in 2026
The question most teams are actually asking is not “which tool wins” — it is “which tool do I reach for and when.” The answer follows logically from what each tool prioritizes.
Reach for Claude Design when:
You need a code-accurate prototype that reflects your actual brand and design system
Your next step after prototyping is sending something to an engineering team
You are working within the Anthropic ecosystem and want Claude Code to implement the design
Your team needs org-scoped collaboration with version tracking inside a single tool
Reach for Google Stitch when:
You need to generate multiple screens of a full application flow in one operation
Your existing workflow centers on Figma and you need faster ideation before opening it
You are an independent designer or early-stage team where free access matters
You want to extract a design system from an existing URL and use it as a starting point
The deeper shift both tools represent is what the Data Science Dojo breakdown of top LLM companies describes as a transition from models as utilities to models as embedded collaborators. Both Anthropic and Google are building tools where the AI does not assist the workflow — it is the workflow. That distinction is what makes 2026 different from 2024.
For teams that want to understand how the underlying models power these capabilities, our guide to the best large language models covers the model families that both tools are built on, including Gemini’s multimodal architecture and Anthropic’s approach to instruction following and code generation.
The Bigger Picture: Who Wins the AI Design Wars?
Neither tool is a Figma killer yet. Both are genuinely missing things that production design teams depend on — precise vector editing, persistent component libraries with tokens, deep developer handoff with measurements and annotations, plugin ecosystems, and the kind of version history that large teams need to work without overwriting each other. These are not small gaps.
But the trajectory matters as much as the current state. Stitch went from a single-screen experiment to a five-screen canvas with voice and interactive prototyping in under a year. Claude Design launched with design system ingestion, Claude Code handoff, and org-level collaboration on day one. Both companies are investing heavily and iterating fast.
The financial markets have already drawn a conclusion. Figma shares fell more than 4% in the days following the March 2026 Stitch update and are down roughly 35% year-to-date. That is not just sentiment — it is institutional capital pricing in a fundamental shift in how design tools will work. This pattern mirrors what the generative AI art tools space went through between 2022 and 2024, where established creative software providers were forced to restructure their product roadmaps around AI-native competitors.
What is clear is that the “design handoff problem” — the friction-heavy translation of visual intent into buildable code — is being solved at the model level rather than the tooling level. Claude Design solves it by making the design output be the code. Stitch solves it by integrating into Figma so that the code generation happens downstream. Both approaches are valid, and both will continue to improve.
The teams that win in this environment are not the ones that pick the right tool in April 2026; they are the ones that build the organizational habit of evaluating and integrating these tools as they evolve. For teams that are already building AI-powered workflows and want to understand the underlying model landscape better, the LLM guide for beginners is a practical starting point for understanding what makes these tools work the way they do.
FAQ: Claude Design and Google Stitch Explained
Is Claude Design free? Claude Design has a free tier with usage limits. Full access — including longer conversations and higher usage limits — is included in a Claude Pro subscription at $20/month. It is also available on Claude Max, Team, and Enterprise plans.
Is Google Stitch free? Yes. Google Stitch is currently free through Google Labs. Standard mode gives you 350 generations per month, and Experimental mode (higher fidelity, supports image input) gives you 200 generations per month. Google has not announced a paid tier as of April 2026.
Does Claude Design replace Figma? Not for production design work. Real-time multi-editor collaboration, persistent component libraries, precise vector editing, and developer handoff with measurements are areas where Figma still leads. Claude Design bypasses Figma for many early-stage use cases — prototyping, wireframing, pitch decks — but it is not a replacement for teams doing production-level UI work.
Can Google Stitch export to Figma? Yes. In Standard Mode, Stitch includes a paste-to-Figma function that lets you move generated designs directly into a Figma file for further editing. Experimental Mode does not currently support Figma export.
Who is Claude Design best for? Product teams, PMs, designers, and founders who want prototypes that are code-accurate, brand-consistent, and ready to hand off to engineering — particularly those already using Claude Code in their development workflow.
What language does Claude Design export code in? Claude Design generates HTML, CSS, and React components. Google Stitch exports HTML and TailwindCSS.
Can I use both tools together? Yes, and for many teams this makes sense. Stitch is stronger for rapid multi-screen ideation and Figma-compatible flows; Claude Design is stronger for code-accurate prototyping and enterprise brand consistency. Using Stitch to explore directions and Claude Design to produce the final handoff artifact is a workflow worth considering.
Conclusion: The Design Workflow Is Being Rewritten
The AI design wars of 2026 are not a zero-sum competition. Claude Design and Google Stitch are solving adjacent problems in adjacent ways, and the result is that teams have more capability than ever to close the gap between an idea and a working product.
The practical takeaway is this: if you are a product team or designer who has not yet built a prototyping workflow around AI tools, the cost of staying on the sideline is rising. Both tools are accessible right now — Claude Design through claude.ai/design, Google Stitch through stitch.withgoogle.com — and both have free or low-cost entry points that make experimentation essentially free.
The companies that figure out when to use each tool, and how to integrate both into their existing workflows, will not just move faster. They will build better products because the feedback loop between idea and prototype has been compressed from days to minutes.
For teams that want to go deeper on the models powering these tools, exploring Anthropic’s Claude 3 model family provides useful context on how Anthropic’s approach to reasoning and code generation has evolved into what powers Claude Design today.
Most AI for enterprise business cases stall because they start at the wrong ROI stage — justifying cost savings when the real value is further upstream
The 3-stage ROI maturity model (cost savings → revenue generation → new possibilities) gives decision-makers a clear benchmark for where their organization stands
The current enterprise sweet spot is 7-figure wins in the $2–3M range — but targets of $100M+ are being pursued by companies that have been building for over a year
Building a credible AI for enterprise business case has become one of the most mishandled challenges facing decision-makers today. The pressure to deploy agentic AI is real. So is the organizational skepticism that greets every new initiative. The result is a cycle of approved pilots, stalled deployments, and ROI numbers that never match what was promised.
The problem is rarely the technology. At the Future of Data and AI: Agentic AI Conference, Raja Iqbal, moderating the panel on enterprise economics, put it plainly at the outset: for many use cases, the technology works. The blockers are organizational friction, operating model, culture, and how people think about agents.
This article walks through the 3-stage agentic AI ROI maturity model introduced by Joao Moura, CEO and founder of CrewAI, during that panel. It explains what each stage looks like, what it requires, and how to build a credible AI for enterprise business case depending on where your company actually is.
Why Most AI for Enterprise Business Cases Get the ROI Framing Wrong
The most common mistake is strategic, not technical. Teams build the business case around cost reduction because it is the easiest number to put in a spreadsheet. Finance approves it, the project launches, and somewhere between the demo and production the returns shrink or disappear.
David Park, who leads the applied AI team at Landing AI, identified exactly why this happens:
“The durable value will come from being able to restructure those workflows themselves, not just adding an agent or an LLM on top of it. Today we have augmentation without simplification.”
The second failure mode is the demo-to-production gap. A polished proof of concept creates internal momentum, but production requires answering questions that demos never surface:
“In demos the system works beautifully. But in production the critical questions are: who owns the output, how is this monitored, can it be audited and traced back to source with calibrated confidence?”
— David Park, Applied AI Lead, Landing AI
Joao Moura framed the broader challenge as the “last mile” problem. Building the agent is not the hard part, because the tooling is increasingly commoditized. Projects fail on data readiness, legacy integration, governance, and change management. As Joao said at the panel, that last mile turns out to be more like a thousand miles once production makes its full set of demands.
Joao Moura introduced this model as the lens he uses to gauge how mature a customer is on their AI for enterprise journey:
“Everyone starts on the early days talking about cost savings because that’s the horizon they can see. But then they go into how they can generate money from this. No one grows a massive business by playing defense. And the final frontier is: what can I do now that I could not even consider doing before, because it was not even feasible?”
— Joao Moura, CEO & Founder, CrewAI
That progression, defense to offense to new territory, is the spine of the model.
3-stage agentic AI ROI maturity model for AI for enterprise deployments – Joao Moura
Stage 1: Cost Savings (Playing Defense)
Stage 1 is where most AI for enterprise deployments begin. Cost savings is the horizon most organizations can see at the start — it is the easiest ROI case to make internally, the easiest to measure, and the lowest-risk entry point for organizations still building confidence in the technology.
At this stage, agents automate repetitive workflows, reduce manual processing time, and cut costs in specific, bounded operations. The business case is a cost-displacement argument: here is what this process costs today, here is what it will cost with agents, here is the payback period.
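The cost-displacement argument above reduces to simple arithmetic. The sketch below is one minimal way to compute a payback period. The function name `payback_months` and every figure in the example are hypothetical and do not come from the panel.

```python
import math


def payback_months(upfront_cost: float,
                   monthly_cost_today: float,
                   monthly_cost_with_agents: float) -> float:
    """Months until cumulative monthly savings cover the upfront investment."""
    monthly_savings = monthly_cost_today - monthly_cost_with_agents
    if monthly_savings <= 0:
        # The agent workflow costs as much as or more than today's process,
        # so the investment never pays back.
        return math.inf
    return upfront_cost / monthly_savings


# Hypothetical inputs, for illustration only.
months = payback_months(
    upfront_cost=120_000,             # one-time build and integration cost
    monthly_cost_today=50_000,        # what the process costs now
    monthly_cost_with_agents=30_000,  # projected run cost with agents
)
print(f"Payback period: {months:.1f} months")
```

A real business case would likely add ongoing monitoring and governance costs to `monthly_cost_with_agents`, since those are the items the article notes are most often left out.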
The risk of staying here too long is that the organization optimizes existing processes rather than reimagining them. Companies that treat Stage 1 as a destination rather than a foundation tend to cap their returns early.
What Stage 1 requires: Defined workflows with measurable baselines. Clean enough data for agents to act on. A governance model for automated outputs. A team willing to own agent behavior in production.
Stage 2: Revenue Generation (Playing Offense)
Stage 2 is where the AI for enterprise business case shifts from defense to offense. Instead of reducing costs, the argument is about accelerating revenue: shipping faster, closing deals more efficiently, personalizing at scale, capturing revenue that was previously out of reach.
This stage requires more from the organization. Data readiness matters more because agents are now operating on revenue-critical workflows. Monitoring matters more because the cost of a failure is not just an efficiency loss — it is a customer or a deal.
The current benchmark: 7-figure wins in the $2–3M range are becoming more common. Joao shared a concrete example at the conference — a large CPG company used agents to handle stalled orders across shipping, invoice reconciliation, and routing bottlenecks. A relatively simple workflow redesign generated $2 million in value within two weeks by unblocking over 800,000 orders. As Joao noted, wins like that are no longer exceptional for well-executed Stage 2 AI for enterprise deployments.
What Stage 2 requires: A stable agent infrastructure from Stage 1. Production-grade monitoring and clear ownership of outputs. A workflow redesign mentality, not just an automation mentality. Executive sponsorship that understands the difference between the two.
Stage 3: New Possibilities (The Compounding Moat)
Stage 3 is where the AI for enterprise business case changes entirely. The question is no longer “can we do this more efficiently?” It is “can we do things that were not economically feasible before we had agents?”
At this stage, enterprises are using agentic AI to create entirely new products, serve new customer segments, or operate in markets that were previously too complex or expensive to enter. The competitive advantage does not depreciate quickly because it is built on proprietary data and workflows that cannot be replicated by deploying a third-party agent on a standard stack.
The conference benchmarks here are instructive. Joao described one customer whose goal is to save $100 million with agents in a single year:
“They have a goal for this year that they want to save $100 million with agents. They’re shooting for the moon, but we have been working with them for over a year and now it’s getting to amazing results. It’s not a magic thing where you just snap your fingers and it works.”
— Joao Moura, CEO & Founder, CrewAI
That timeline is the reality of what Stage 3 AI for enterprise requires. The $100M target is the outcome of a deliberate progression through Stages 1 and 2.
What Stage 3 requires: 12 or more months of serious investment in Stages 1 and 2. A platform team that owns identity, logging, governance, and cost metering. Leadership willing to fund a multi-year roadmap without demanding immediate returns.
Stage 3 | Do things that were not economically feasible before agents existed | $10M–$100M+ | 12+ months of Stage 1 and 2 investment, dedicated platform team, multi-year roadmap | 12+ months
Which Stage Is Your AI for Enterprise Program Actually At?
This is the question most teams get wrong — not because they are dishonest, but because the signals are easy to misread. A company with several active pilots and a growing AI team often assumes it is at Stage 2. Operationally, it is frequently still at Stage 1.
Use these five questions to assess your actual stage:
Do you have clean, classified data that agents can reliably act on? If not, you are at Stage 1 regardless of what your pilots are doing.
Do you have production monitoring and a defined owner for agent outputs? A working demo is not a production deployment.
Have you restructured at least one workflow around agent capabilities — not just automated it? Augmentation without simplification is Stage 1 behavior dressed as Stage 2.
Can your organization absorb a Stage 2 failure without killing the entire AI program? If not, your organizational maturity has not caught up with your ambition.
Do you have a platform team that owns agent infrastructure independently of any specific use case? If every deployment rebuilds from scratch, Stage 3 is not yet accessible.
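The five questions can be turned into a rough self-assessment. The sketch below is one possible reading, not a scoring method from the panel. It assumes that a "no" on any of the first four questions keeps a program at Stage 1, and that a "no" on the platform-team question caps it at Stage 2. The `Readiness` fields and the `assess_stage` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Readiness:
    """Yes/no answers to the five assessment questions (hypothetical field names)."""
    clean_classified_data: bool
    production_monitoring_and_owner: bool
    workflow_restructured: bool
    can_absorb_stage2_failure: bool
    platform_team: bool


def assess_stage(r: Readiness) -> int:
    """Return the highest stage the answers support (1, 2, or 3).

    Assumption: the mapping from answers to stages is an interpretation
    of the questions, not an official rubric.
    """
    stage2_prerequisites = [
        r.clean_classified_data,
        r.production_monitoring_and_owner,
        r.workflow_restructured,
        r.can_absorb_stage2_failure,
    ]
    if not all(stage2_prerequisites):
        return 1
    if not r.platform_team:
        return 2
    return 3


# Example: several active pilots, but no restructured workflow yet.
pilot_heavy = Readiness(
    clean_classified_data=True,
    production_monitoring_and_owner=True,
    workflow_restructured=False,
    can_absorb_stage2_failure=True,
    platform_team=False,
)
print(assess_stage(pilot_heavy))
```

The example reflects the pattern described above: a team with active pilots that has automated a workflow without restructuring it still lands at Stage 1.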
A common pattern from the conference: companies get early success with a proprietary model, bills stack up, and they re-architect on open-source stacks without first establishing the governance layer that makes that transition safe. The stage they thought they were at and the stage they actually were at did not match.
The Hidden Blockers That Kill AI for Enterprise ROI
Even a well-constructed business case fails if the organization has not addressed the conditions that determine whether agents can deliver in production.
Data readiness is the most underestimated blocker at every stage. Unlike human workers who bring implicit background knowledge, an agent operating on an incomplete dataset will fill gaps with plausible but wrong answers. Data classification is a prerequisite to everything else.
Change management surprises teams the most. The resistance is rarely to the technology. It is to new ownership structures, new accountability models, and new ways of evaluating performance.
The demo-to-production gap is where most hidden cost lives. A proof of concept on clean, curated data will behave very differently in production. Not accounting for governance, monitoring, and change management in the business case is the single most common reason these investments underdeliver.
Frequently Asked Questions
What is the agentic AI ROI maturity model? The agentic AI ROI maturity model is a three-stage framework for how enterprise value from AI agents compounds over time. Stage 1 is cost savings, Stage 2 is revenue generation, and Stage 3 is new possibilities that were not economically feasible before agents existed. It was introduced by Joao Moura of CrewAI at the Agentic AI Conference.
How do I build a business case for agentic AI? Start by identifying which stage your organization is actually at. Stage 1 cases are operational efficiency arguments with clear baselines and payback periods. Stage 2 cases require evidence of production-grade governance and workflow redesign. Stage 3 cases are multi-year strategic pitches that require documented Stage 1 and Stage 2 outcomes.
What ROI can enterprises realistically expect from agentic AI? Current benchmarks from AI for enterprise deployments show 7-figure wins in the $2–3M range becoming common at Stage 2. Enterprises targeting $100M+ outcomes have been building for over a year and have invested heavily in data infrastructure and governance.
What is the difference between Stage 1 and Stage 2 AI ROI? Stage 1 is a cost-displacement argument: reducing headcount, automating workflows, cutting operational spend. Stage 2 is a revenue argument: shipping faster, closing more deals, capturing revenue previously out of reach. Stage 2 requires a workflow redesign mindset, not just automation.
How long does it take to see ROI from agentic AI? For most AI for enterprise programs, Stage 1 returns can appear within months of a well-scoped deployment. Stage 2 requires a Stage 1 foundation first. Stage 3 outcomes, including $100M+ targets, require 12 or more months of dedicated investment.
What are the biggest blockers to enterprise AI ROI? Data readiness, change management, and the demo-to-production gap. The technology is rarely the reason AI for enterprise projects fail.
The Stage You Start At Determines the Returns You Get
The organizations winning at AI for enterprise did not start with the most sophisticated agents or the largest budgets. They started with an honest answer to a simple question: which stage are we actually at, and what does it take to execute well here before moving to the next one?
As Joao Moura said at the conference:
“It’s not a magic thing where you just snap your fingers and you have agents and now you’re a hundred times more productive. But if you put in the engineering work, you can achieve something remarkable.”
— Joao Moura, CEO & Founder, CrewAI
The enterprises targeting $100M+ started exactly where you are. Start at the right stage, build the foundation, and the returns compound from there.