
Key Takeaways

  • Anthropic shipped /goal in Claude Code v2.1.139 on May 12, 2026 — set a completion condition once, and the agent keeps working across turns until it’s met
  • OpenAI’s Codex CLI shipped a comparable /goal feature weeks earlier in April 2026, with persistent state that survives process restarts
  • The real story isn’t who got there first — it’s that both frontier labs converged on the same interaction model independently, signaling a structural shift in how AI coding tools are built

Two of the most widely used AI coding tools shipped the same feature within weeks of each other.

Anthropic added /goal to Claude Code on May 12 with version 2.1.139. OpenAI shipped a comparable feature to Codex CLI in April. Neither team was copying the other — they arrived at the same design because the problem they were solving is identical.

AI coding assistants have been optimized for a one-prompt-one-response rhythm. That rhythm breaks down the moment a task requires more than a few turns to complete. The broader shift toward agentic AI — systems that pursue goals rather than respond to prompts — has been building for years, and /goal is the first widely deployed mechanism to bring that model directly into a developer’s terminal.

/goal is the fix for that breakdown.

You define a completion condition — something like “all tests in test/auth pass and the lint step is clean” — and the agent keeps working until a small, fast evaluator model confirms the condition has been satisfied. No manual prompting to continue. No babysitting.

How Claude Code’s /goal Works

The mechanics are clean and deliberate.

Run /goal followed by the condition you want satisfied. After each turn, a lightweight evaluator model checks whether the condition holds. If it doesn’t, Claude starts another turn automatically instead of returning control to you. The goal clears once the condition is met.

Key things to know about the session behavior:

  • One goal per session — a new /goal command replaces the active one
  • Status indicator — a ◎ /goal active badge shows elapsed time and tokens spent while a goal is running
  • Evaluator transparency — after each turn, the evaluator returns a short reason explaining why the condition is or isn’t met yet, visible in both the status view and the transcript
  • Manual override — run /goal clear to cancel anytime, or /goal with no argument to check progress

What matters about the design is how Anthropic framed what /goal is actually for.

The official docs position it for “substantial work with a verifiable end state” — not vague tasks, not exploration. Work that already has a clear finish line.

Use cases Anthropic explicitly calls out:

  • Migrating a module until every call site compiles and tests pass
  • Implementing a design doc until all acceptance criteria hold
  • Splitting a large file into focused modules until each is under a size budget
  • Running through a labeled issue backlog until the queue is empty

That framing defines the right mental model: /goal is a control surface for work that can be verified, not a shortcut for tasks you haven’t fully defined.

Writing Conditions That Actually Work

This is where most people will get tripped up early on.

A condition that holds across many turns needs three things:

  1. One measurable end state — a test result, a build exit code, a file count, an empty queue
  2. A stated check — how Claude should prove it (“npm test exits 0”, “git status is clean”)
  3. Constraints that matter — anything that must not change along the way (“no other test file is modified”)

The condition can be up to 4,000 characters. You can also include a turn or time clause to bound how long a goal runs — “or stop after 20 turns” is a simple guardrail worth building into most conditions by default.

Writing effective /goal conditions is an extension of good prompt engineering. The same principles that make a standard prompt precise — specificity, clear success criteria, explicit constraints — apply here, but the stakes are higher because the agent will keep acting on a vague condition until it runs out of turns. If you’re newer to crafting structured instructions for LLMs, this primer on prompt engineering strategies covers the foundations well.

A few examples from the cheatsheets circulating on X that illustrate the pattern well:

  • /goal Refactor this repo to TypeScript strict mode. Success: zero ‘any’ types, all tests pass, no functional regressions, build clean, summary of changes.
  • /goal Make every test in this repo pass. Success: npm test exits 0, no skipped tests, root-cause notes for each fix, no test-mocking shortcuts.
  • /goal Migrate this app from Supabase to Postgres + Drizzle. Success: schema parity, all queries working, seed data preserved, tests pass, migration guide written.

Each of those conditions has a clear binary outcome. The agent either hits it or it doesn’t — and the evaluator can tell the difference.

The Trust and Safety Model

/goal is deliberately gated.

The feature only runs in workspaces where the trust dialog has been accepted, because the evaluator is part of the hooks system. It’s also unavailable when disableAllHooks is set at any settings level, or when allowManagedHooksOnly is set in managed settings.

This isn’t a footnote — it tells you something about how Anthropic is thinking about autonomous workflows. The trust dialog is the boundary. Teams deploying Claude Code in managed environments need to account for this before building /goal into any pipeline.

Security becomes a first-order concern as agents run longer and touch more of your codebase unsupervised. The trust model here is also relevant for teams using Claude Code Remote Control, where the agent is running locally but being accessed from another device — a long /goal run in that context means your machine is executing code autonomously while you’re away from it.

For individual developers, the practical implication is simple: if /goal silently does nothing when you run it, check the trust settings first.

How Codex’s /goal Is Different

Codex shipped its version roughly a month earlier, and the key architectural difference is persistence.

Where Claude Code’s goal lives within an active session, Codex’s implementation is built on app-server APIs and runtime continuation. The agent can survive process restarts, reboots, and terminal crashes. You can pick up where you left off even if your session died mid-task.

Other meaningful differences:

  • Checkpoint model — Codex defaults to “plan-mode nudges,” pausing at key decision points to confirm direction rather than running fully unattended. Full-auto mode is available via codex --approval-mode full-auto but isn’t the default.
  • Setup — Claude Code: launch CLI, type /goal. Codex Desktop: Settings → Configuration → goals = true. Different surfaces, different onboarding friction.
  • Multi-agent scope — Codex’s May 2026 release expanded MultiAgentV2 support, so multiple goals can be active across different environments, each tied to its own thread.

The philosophical difference between the two implementations is real.

Codex leans toward inline confirmation at decision points — the agent checks in before making consequential moves. Claude Code leans toward a blanket trust model — grant trust at the workspace level, then let it run.

Neither is wrong. They reflect different assumptions about who is using the tool and how much they want to stay in the loop during a long-running task.

The Formula Both Tools Share

Despite the architectural differences, the prompt structure that works is essentially the same across both tools.

The three-element formula:

/goal [do the work] until [measurable end state] without [constraints]

For more complex tasks, both tools benefit from an extended structure:

/goal [primary objective]
Context: [what the project is]
Success criteria: [measurable outcome 1] [outcome 2]
Constraints: [rule 1] [rule 2]
Checklist: [attach .md file for tracking]

Tips that apply regardless of which tool you’re using:

  • One goal at a time — scope it tightly. A goal that tries to do too many things at once is harder for the evaluator to verify.
  • Let the model write its own /goal — describe the task in plain language and ask Claude or Codex to generate the condition. The model often writes a tighter condition than a human would.
  • Pair with /plan — run /goal → /plan → /goal clear for complex tasks where you want the agent to map the work before executing it.
  • Attach a .md checklist — the agent can use it as a running log, which makes the evaluator’s job easier and gives you a readable audit trail.
  • Add turn limits — “or stop after 20 turns” is a cheap safeguard against runaway sessions.
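
Putting the extended structure and the tips above together, a fully illustrative example (the paths, file names, and success criteria here are hypothetical, not taken from either tool’s docs) might read:

/goal Migrate src/payments to the new billing client
Context: TypeScript service; the legacy client lives in src/payments/legacyClient.ts
Success criteria: npm test exits 0; no remaining imports of legacyClient; build is clean
Constraints: no changes outside src/payments; no test files modified
Checklist: migration-checklist.md
Or stop after 20 turns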

/goal Command by Claude Code & Codex

The Token Cost Risk Is Real

This is the part that doesn’t show up in the launch posts.

Neither Codex nor Claude Code currently has a native “set budget cap per goal” feature. A poorly scoped condition running across 50 turns with Sonnet as the evaluator model can cost significantly more than expected.

Part of what makes this worth understanding is the underlying model architecture. The /goal evaluator is itself a language model — a small one, but it’s running on every turn. If you’re using a larger model as the evaluator, costs compound fast. The shift toward using SLMs for evaluator-style tasks in agentic systems is exactly why tools like these tend to route lightweight verification work to smaller, cheaper models rather than the primary reasoning model.

Practical mitigations:

  • Hardcode a turn limit directly into the condition — the single most effective safeguard
  • Use Haiku as the evaluator model — evaluation speed and costs stay predictable; Sonnet as the evaluator spikes overhead fast
  • Set platform-level budget alerts before kicking off any long-running goal
  • Start with a dry run — test the condition on a small scope before pointing /goal at your entire codebase

The community is calling out token consumption as the main friction point right now. One widely shared take on X: “Already active in Claude Code and Codex — you need to use it now.” The enthusiasm is warranted. The cost awareness isn’t always there alongside it.

Comparing the Two Side by Side

| | Claude Code | Codex CLI |
| --- | --- | --- |
| Shipped | May 12, 2026 (v2.1.139) | April 2026 |
| Persistence | Session-scoped | Survives restarts/crashes |
| Default approval mode | Trust dialog (workspace-level) | Plan-mode nudges (inline) |
| Full-auto mode | Auto mode (approve tool calls) | codex --approval-mode full-auto |
| Turn tracking | ◎ /goal active + evaluator reason | Terminal title indicator |
| Multi-agent | One goal per session | Multiple goals across environments |
| Mobile | Yes (Claude Code Mobile) | Desktop CLI focus |
| Remote Control | Yes | N/A |
| Works with | Claude Code CLI, Remote Control, -p flag | Codex CLI, Codex Desktop |

The Actual Story: A Pattern Becoming Infrastructure

The more significant thing happening here is not the feature — it’s the convergence.

When two competing labs ship the same interaction primitive within the same month without coordinating, that’s independent validation. /goal is becoming the default way to express “keep working on this until it’s done” across agentic coding tools. The fact that it’s also appeared in Hermes reinforces that this is a cross-platform pattern, not a product feature.

This is a natural extension of how agentic LLMs have been evolving — from models that respond to prompts, to models that reason across steps, to models that now pursue defined objectives autonomously across an unbounded number of turns. /goal is essentially the user-facing surface of that architectural shift. That has real implications for how developers should think about workflows going forward:

  • Tasks that previously required babysitting — multi-file refactors, migration jobs, test cleanup backlogs — are now first-class use cases with native tooling
  • The “keep going” prompt is effectively deprecated. You define the condition once and hand it off.
  • The session model of AI coding tools is shifting from discrete exchanges to durable objectives

Anthropic doubled Claude Code’s five-hour rate limits for paid plans in early May — timing that makes more sense now that /goal is live and encouraging longer unsupervised runs. If those limits extend further, it signals Anthropic is prepared to bet on multi-hour autonomous workflows as a core product pattern.

The underlying reason both labs arrived here simultaneously is that the Model Context Protocol and the broader agentic tooling ecosystem have matured enough to make persistent, verifiable agent loops tractable. A year ago, the infrastructure to reliably evaluate conditions across many turns didn’t exist in a form that shipped cleanly to developers. It does now.

What Practitioners Should Do Right Now

If you’re on Claude Code:

  1. Update to v2.1.139 if you haven’t already
  2. Pick one task you currently babysit — anything where you keep prompting “continue” — and reframe it as a /goal condition
  3. Start with test-driven refactoring — passing tests make a natural, verifiable end state
  4. Add “or stop after 20 turns” to every condition until you’ve calibrated what your typical goals cost

If you’re on Codex:

  1. Enable goals in Settings → Configuration → goals = true
  2. Use the persistence layer for anything long enough that your terminal might close mid-task
  3. Keep plan-mode on by default unless you’re confident in the condition — it’s a useful safety net for new task types

If you’re evaluating both:

  • Choose Codex if persistence across restarts matters for your workflow
  • Choose Claude Code if you want cleaner Remote Control integration or mobile access
  • Both work. The formula is the same. Start with whichever tool you’re already using.

What to Watch Next

A few signals worth tracking over the coming months:

  • Rate limit expansion — Anthropic’s May rate limit doubling looks like preparation for longer /goal runs. Further increases would confirm autonomous workflows as a priority.
  • Native budget caps — neither tool has this yet. The first to ship a “max spend per goal” control wins the trust of teams running this in production.
  • Evaluator model choice — both tools currently handle evaluator model selection implicitly. Explicit developer control over which model evaluates the condition would meaningfully change the cost calculus.
  • Cross-vendor standardization — if Hermes, Cursor, and other tools adopt the same /goal primitive, it may evolve into a shared spec rather than competing implementations.

The pattern is validated. The tooling will keep improving around it.

FAQ

What is the /goal command in Claude Code?

/goal is a command introduced in Claude Code v2.1.139 that lets you define a completion condition for an agent. After each turn, a lightweight evaluator model checks whether the condition is met. If not, Claude continues working automatically — no prompting required. The goal clears once the condition is satisfied.

How is Claude Code’s /goal different from Codex’s /goal?

The biggest difference is persistence. Codex’s implementation survives process restarts and terminal crashes using app-server APIs. Claude Code’s goal is session-scoped. Codex also defaults to inline confirmation checkpoints; Claude Code uses a workspace trust dialog as the access control layer.

What kinds of tasks is /goal designed for?

Tasks with a verifiable end state — migrating a module until every call site compiles, running tests until a suite passes, cleaning a backlog until it’s empty. It’s not well-suited for open-ended tasks without a clearly defined finish line.

Is /goal available in Claude Code Remote Control and mobile?

Yes. As of v2.1.139, /goal works in interactive mode, the -p flag, Remote Control, and Claude Code Mobile.

What’s the biggest risk with /goal?

Token cost. Neither Claude Code nor Codex has a native per-goal budget cap. A long-running goal with a large model as the evaluator can consume significantly more tokens than expected. Always include a turn limit in your condition and set platform-level budget alerts before running anything substantial.

Does /goal work the same way in both Claude Code and Codex?

The underlying pattern is the same — define a condition, let the agent work until it’s met — but the implementations differ in persistence, approval model, and setup. The three-element formula (/goal [task] until [end state] without [constraints]) works in both.

Key Takeaways

  • “Agentic OS” is not a product you install — it’s an architectural pattern that adds a management layer on top of AI agents so they can coordinate, share memory, and improve over time.
  • Without this layer, multi-agent systems break in predictable ways: agents contradict each other, forget context, and fail silently.
  • The pattern borrows directly from how operating systems manage processes — and that analogy turns out to be more useful than it sounds.

The Honest Answer Up Front

“Agentic OS” has become one of those terms that means everything and nothing at the same time.

Ask five engineers what it means and you’ll get five different answers. Ask a vendor and they’ll tell you their product is the Agentic OS. Ask Reddit and you’ll mostly get skepticism.

Here’s the fair take: the term is overused, but the underlying pattern is real and worth understanding.

This guide explains what an Agentic OS actually is, why the pattern exists, what its core components look like in practice, and where current implementations still fall short.

What Problem Does Agentic OS Actually Solve?

How Agentic OS brings coordination in an otherwise chaotic system

Before getting into what it is, it helps to understand why it exists.

Most people building with LLMs start with a single agent. It works well for simple tasks. Then requirements grow — the agent needs to search the web, write code, query a database, summarize documents, and make decisions across all of it. So you add tools. Then memory. Then you realize one agent doing everything is fragile, slow, and hard to debug.

The natural next step is splitting the work across multiple specialized agents. But now you have a different problem: who coordinates them?

Without a coordination layer:

  • Agents don’t know what other agents have done, so they repeat work or contradict each other
  • There’s no shared memory, so every agent starts from scratch on every run
  • When one agent fails, nothing knows how to recover — the whole pipeline stalls
  • Context bleeds between agents in unintended ways, producing inconsistent outputs

This is exactly the problem an Agentic OS is designed to solve. It’s the layer that sits above your agents and manages how they work together.

If you’re still getting familiar with what makes an AI agent tick in the first place, What Is Agentic AI? Master 6 Steps to Build Smart Agents is a good starting point before going deeper into the architecture.

What Is an Agentic OS?

The Agentic OS Architecture for Multi-agent systems

An Agentic OS is a software layer that manages multiple AI agents — coordinating how they plan, act, share memory, and learn — without requiring a human to intervene at every step.

The OS analogy holds up better than most tech analogies. A traditional operating system doesn’t do your work. It manages the resources — memory, CPU, I/O — that make work possible. It decides which process runs when, what memory each process can access, and how they communicate with each other.

An Agentic OS does the same thing, but for agents:

  • It allocates context and decides what each agent knows before it runs, so agents get exactly the information they need and nothing they don’t
  • It routes tasks and determines which agent is responsible for which part of a goal, based on capability and availability
  • It manages memory and maintains a shared knowledge layer that agents can read from and write to across sessions
  • It handles failures and detects when an agent produces a bad output or gets stuck, and triggers replanning instead of halting

Without this layer, you have a collection of agents. With it, you have a system.

The agents doing the actual work inside this system are LLM-based — models that can reason, use tools, and act across multiple steps. For a detailed look at how those models work and what makes them genuinely agentic, Agentic LLMs in 2025: How AI Is Becoming Self-Directed, Tool-Using & Autonomous covers the landscape well.

What Makes This Different From a Regular Multi-Agent Pipeline

This is the question the definition doesn’t answer on its own — and it’s worth being direct about.

A standard multi-agent pipeline is static. You define the flow upfront: agent A runs first, passes output to agent B, agent B passes to agent C. The coordination logic is hardcoded into the pipeline itself. It works well when inputs are predictable and nothing breaks. But change the input shape, add a new requirement, or have one agent fail — and the whole thing needs to be manually updated or it stops.

An Agentic OS moves coordination out of the pipeline and into a runtime layer. Instead of following a fixed script, the orchestrator decides at runtime how to break down a goal, which agents to involve, and in what order — based on the actual task in front of it. If a sub-task fails, it doesn’t halt. It replans. If a different approach is needed for a specific input, it routes differently. The pipeline adapts to the work, rather than forcing the work to fit the pipeline.

The simplest way to put it: a multi-agent pipeline follows a script. An Agentic OS writes the script on the fly and rewrites it when something goes wrong.

The Five Core Components

Every serious implementation of this pattern, whether you’re building it yourself or using a framework, needs these five components working together.

1. The Orchestrator

How the Orchestrator Works in an Agentic OS Pattern

The orchestrator is the entry point for every goal that enters the system. It receives a high-level task, figures out what needs to happen, and coordinates the agents that execute it.

Think of it as the kernel of your Agentic OS — the component everything else reports to.

What a well-built orchestrator does:

  • Decomposes goals into sub-tasks that are specific enough for a specialist agent to execute without ambiguity
  • Routes each sub-task to the right agent based on what that agent is designed to do, not just what’s available
  • Tracks completion across all running agents and knows when to wait, when to proceed, and when to replan
  • Handles failures without halting — if a sub-task fails, the orchestrator tries an alternative path rather than crashing the whole pipeline

The key quality that separates a good orchestrator from a fragile one is replanning. Anyone can build an orchestrator that works when everything goes right. A reliable one keeps moving when things go wrong.
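
None of this assumes a particular framework. A minimal sketch of the decompose, route, and replan loop might look like this (the AGENTS registry and the decompose callable are hypothetical stand-ins for real specialists):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubTask:
    description: str
    agent: str          # name of the specialist agent that should run it
    attempts: int = 0

# Hypothetical registry: each specialist takes a task description and returns
# a result string, raising an exception when it fails.
AGENTS: dict[str, Callable[[str], str]] = {}

def orchestrate(goal: str,
                decompose: Callable[[str], list[SubTask]],
                max_attempts: int = 3) -> dict[str, str]:
    """Decompose a goal, route sub-tasks to specialists, and replan on failure."""
    results: dict[str, str] = {}
    queue = decompose(goal)                  # goal -> ordered list of sub-tasks
    while queue:
        task = queue.pop(0)
        task.attempts += 1
        try:
            results[task.description] = AGENTS[task.agent](task.description)
        except Exception:
            if task.attempts < max_attempts:
                queue.append(task)           # replan: retry later instead of halting
            else:
                results[task.description] = "failed: escalate for human review"
    return results
```

The important property is the except branch: a failed sub-task goes back on the queue or gets marked for escalation, but the loop keeps moving.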

2. Memory Architecture

3 Layers of Agentic Memory - Agentic OS Pattern

This is where most early multi-agent systems break. If agents have no persistent memory, every run starts from scratch. Your agentic system would just be a collection of stateless API calls dressed up as agents.

A proper Agentic OS maintains three distinct memory layers:

| Memory Type | What It Stores | Lifespan |
| --- | --- | --- |
| Working Memory | The current task, intermediate results, and agent outputs mid-run | Lives for the duration of one task |
| Episodic Memory | Records of past interactions, decisions, and outcomes | Persists across sessions |
| Semantic Memory | Stable knowledge: documentation, rules, product facts, brand guidelines | Long-term, updated deliberately |

How memory actually works at runtime:

Before an agent runs, the system queries the relevant memory stores and injects only the entries that matter for that specific task into the agent’s context. The agent doesn’t get a dump of everything the system knows — it gets a targeted slice. This retrieval step is essentially RAG applied to agent memory, which is covered in depth in Agentic RAG: A Powerful Leap Forward in Context-Aware AI.

Writing to memory is just as important as reading from it. Not every agent should have write access to long-term memory. Entries follow a defined schema, and in most production systems, new entries are reviewed before becoming permanent. This keeps the knowledge base from silently accumulating garbage that degrades agent behavior over time.
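
As a rough illustration of those two rules (targeted reads, gated writes), here is a minimal sketch; the tag-overlap scoring is a stand-in for the embedding-based retrieval a real system would use, and the class and field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    layer: str            # "working", "episodic", or "semantic"
    content: str
    tags: set[str]

class MemoryStore:
    def __init__(self) -> None:
        self.entries: list[MemoryEntry] = []

    def retrieve(self, task_tags: set[str], limit: int = 5) -> list[MemoryEntry]:
        """Return only the entries relevant to this task, not a full dump."""
        scored = [(len(entry.tags & task_tags), entry) for entry in self.entries]
        relevant = [entry for overlap, entry in sorted(scored, key=lambda s: -s[0]) if overlap]
        return relevant[:limit]

    def write(self, entry: MemoryEntry, reviewed: bool = False) -> bool:
        """Working memory is freely writable; long-term layers require review."""
        if entry.layer != "working" and not reviewed:
            return False          # held back instead of silently becoming permanent
        self.entries.append(entry)
        return True
```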

3. Context Management

Context Engineering for Agentic OS
source: Philschmid

Context windows have hard limits. What you put in them determines the quality of every output.

“Fresh context” means each agent gets a purpose-built context window assembled specifically for its task — not a copy-paste of everything the system has seen so far.

A well-assembled context includes:

  • A scoped system prompt that defines the agent’s role and constraints for this specific task — not a generic “you are a helpful assistant” prompt
  • Retrieved memory entries pulled from the relevant memory layers, filtered to the top results most relevant to the current task
  • Tool definitions for only the tools the agent actually needs to complete its job
  • Handoff data from the previous agent in the pipeline, structured and clean

What gets deliberately excluded:

  • Conversation history from other agents’ runs, which introduces noise and causes unexpected behavior
  • Memory entries from unrelated tasks or past sessions that don’t apply here
  • Tool definitions for tools the agent won’t use — these take up context space and can confuse the model into attempting actions it shouldn’t

Clean context boundaries make the system predictable and debuggable. When something goes wrong, you know exactly what the agent saw when it made a bad decision — because you controlled what went in.
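
Continuing the sketch above (and reusing the illustrative MemoryStore), assembling a fresh context for one agent run might look something like this; the exact fields a real framework expects will differ:

```python
def build_context(agent_role: str,
                  task: str,
                  memory: MemoryStore,
                  tool_definitions: dict[str, dict],
                  needed_tools: list[str],
                  handoff: dict | None = None) -> dict:
    """Assemble a purpose-built context for a single agent run."""
    return {
        # Scoped system prompt: role and constraints for this task only.
        "system": f"You are the {agent_role} agent. Your task: {task}. "
                  "Stay within this scope and return structured output.",
        # Targeted memory slice, not everything the system has ever seen.
        "memory": [entry.content
                   for entry in memory.retrieve(task_tags=set(task.lower().split()))],
        # Only the tools this agent actually needs.
        "tools": {name: tool_definitions[name]
                  for name in needed_tools if name in tool_definitions},
        # Clean, structured handoff from the previous agent, if there is one.
        "handoff": handoff or {},
    }
```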

The discipline of deliberately designing what goes into an agent’s context is increasingly its own field. What Is Context Engineering? The New Foundation for Reliable AI and RAG Systems goes into the full framework if you want to go deeper on this component specifically.

4. Specialist Agents

Instead of one large agent trying to handle everything, an Agentic OS runs a network of agents where each one is purpose-built for a specific type of task.

This is the part that makes the system genuinely scalable. A specialist agent has a tightly scoped system prompt, access to only the tools it needs, and a well-defined output format. It’s easier to build, easier to test, and much easier to fix when it breaks.

Common specialist roles in production systems:

  • Research agent — queries the web or internal knowledge bases to gather raw information, then structures it into a clean format that downstream agents can actually use
  • Writer agent — takes a brief and structured inputs and produces a draft, operating within brand or tone guidelines stored in semantic memory
  • Code agent — writes, reviews, or executes code against a defined spec, and returns structured results including errors and test outputs
  • QA agent — evaluates another agent’s output against a rubric before it moves to the next step, acting as a quality gate in the pipeline
  • Tool agent — handles direct integrations like API calls, database queries, and file operations — the parts of the workflow that touch external systems
  • Memory agent — decides what gets written to long-term memory after a task completes, applying the schema and governance rules that keep the knowledge base clean

Agents communicate through structured interfaces — defined input/output schemas, not free-form conversation. The orchestrator calls a specialist with a structured payload, the specialist returns a structured result, and the orchestrator uses that result to decide what happens next.
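
As an illustration of what a structured interface can look like, here is a hypothetical schema pair for the research agent; the field names are assumptions, not a standard:

```python
from pydantic import BaseModel

class ResearchRequest(BaseModel):
    topic: str
    questions: list[str]
    max_sources: int = 5

class ResearchResult(BaseModel):
    topic: str
    findings: list[str]
    sources: list[str]

def call_research_agent(request: ResearchRequest) -> ResearchResult:
    """The orchestrator sends a typed payload and gets a typed result back,
    so downstream agents never have to parse free-form conversation."""
    # ... the model call and tool use would happen here ...
    return ResearchResult(topic=request.topic, findings=[], sources=[])
```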

For agents to communicate reliably at scale, they need standardized protocols. Agentic AI Communication Protocols: MCP, A2A, and ACP explains how these standards work and why MCP in particular has become the default way agents connect to external tools and services.

This is what makes the whole system composable. You can swap out one specialist, improve another, or add a new one without touching the rest of the pipeline.

5. Feedback Loops and Self-Learning

A static multi-agent pipeline executes the same way every time regardless of whether its outputs are good or bad. A self-learning one gets better.

This doesn’t require retraining the underlying model. Most useful self-improvement happens at the workflow level through feedback loops that are built into the system.

Two types of feedback worth capturing:

  • Explicit feedback — A human reviews an output and signals whether it was good or bad. This could be a rating, a correction, or an approval/rejection in a review step. Good signals reinforce the current approach. Bad signals trigger a review of the relevant memory entries or system prompts that fed into that output.
  • Implicit feedback — Behavioral signals the system can observe without anyone rating anything. If a user consistently rewrites the opening of every email the writer agent drafts, that pattern is feedback. If outputs from a particular agent keep getting flagged in the QA step, that’s feedback too. The system captures these signals and surfaces them for review.

The goal is to build feedback collection into the workflow as a first-class feature — not bolt it on later.
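
A bare-bones sketch of what capturing those signals might look like (the event fields and example signals are illustrative):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    agent: str
    kind: str        # "explicit" (rating, approval) or "implicit" (observed behavior)
    signal: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

feedback_log: list[FeedbackEvent] = []

def record_feedback(agent: str, kind: str, signal: str) -> None:
    """Append a signal now; review it later to adjust prompts or memory entries."""
    feedback_log.append(FeedbackEvent(agent, kind, signal))

# Example signals matching the scenarios described above:
record_feedback("writer", "implicit", "user rewrote the email opening")
record_feedback("qa", "explicit", "output approved without changes")
```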

How the Components Work Together: A Real Example

Here’s a concrete walkthrough. Say you ask an Agentic OS: “Research our three main competitors and draft a summary report.”

Step 1 — Orchestrator receives the goal and decomposes it: research competitor A, research competitor B, research competitor C, then synthesize everything into a report. It identifies the agents needed and sequences the work.

Step 2 — Context Manager builds a fresh context for each research task. It queries semantic memory for any prior research on these competitors, scopes the system prompt to research-only, and passes only the web search tool to each agent.

Step 3 — Research Agents run in parallel, one per competitor. Each searches, retrieves, and structures its findings into a clean output format that the next stage can consume.

Step 4 — QA Agent reviews each research output against a completeness rubric before anything moves forward. If one output is thin or off-target, it flags it and the orchestrator either retries or routes around it.

Step 5 — Writer Agent receives the validated research from all three agents and drafts the report. It pulls tone and formatting guidelines from semantic memory and structures the output to spec.

Step 6 — Memory Agent stores the final report and key findings in episodic memory so future runs can reference them without starting from scratch.

Step 7 — Feedback Loop kicks in when you read the report. If you edit sections, those changes are logged as implicit feedback on the writer agent’s prompt. If you approve it without changes, that’s a positive signal.

No human stepped in during steps 2–6. The system handled decomposition, coordination, quality checking, and memory management on its own. That’s the pattern in action.

Where Current Implementations Still Break

The Agentic OS pattern is sound. Most real-world implementations are still far from fully realizing it. Here’s where they actually fall apart:

Reliability
Agents hallucinate actions, not just text. An agent told to call an API might call the wrong endpoint or construct a malformed request — and do it confidently. According to Gartner, over 40% of ambitious agentic AI pilots are projected to be cancelled by 2027, with reliability failures as the primary cause.

Memory drift
Without strict governance on what gets written to shared memory, the knowledge base silently accumulates bad entries. Agents start behaving inconsistently in ways that are hard to trace because the root cause is buried in stale or incorrect memory.

Context bleed
When agents share context carelessly — or when the context manager isn’t properly isolating each agent’s input — outputs from one task contaminate another. A support agent that carries over context from a code review run produces outputs that are confused and off-brand in ways that are hard to reproduce and harder to fix.

Infinite loops
Agents without well-defined exit conditions can get stuck. The orchestrator keeps replanning, the agent keeps retrying the same failing tool call, and the system burns tokens and time without making progress.
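
One cheap mitigation is to give every agent loop an explicit budget, in the same spirit as the turn limits discussed for /goal earlier in this post. A minimal sketch (the step callable is a stand-in for one agent turn):

```python
from typing import Callable

def run_with_budget(step: Callable[[], bool], max_turns: int = 20) -> bool:
    """Run one agent turn at a time; stop when the exit condition is met
    or the turn budget runs out, instead of retrying forever."""
    for _ in range(max_turns):
        if step():           # step() returns True once its exit condition holds
            return True
    return False             # budget exhausted: surface for review, stop burning tokens
```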

Cost at scale
Running multiple specialist agents per task, each making its own LLM call with a carefully assembled context, adds up fast. One way teams address this is by replacing large models with smaller, task-specific ones for routine agent roles — a shift covered in detail in From LLMs to SLMs: Redefining Intelligence in Agentic AI Systems. Production systems also need aggressive context pruning and result caching to stay economically viable at scale.

The Buzzword Test: Is What You’re Looking At Actually an Agentic OS?

The term is being applied to things that don’t deserve it. Before you buy into a platform’s claim or evaluate your own system, ask three questions:

1. Does it have persistent, structured memory across sessions? If the system starts from scratch every time a new session begins, it’s not an Agentic OS. It’s a stateless pipeline with an LLM at the front.

2. Do specialized agents delegate work to each other through defined interfaces? If there’s one model handling every type of task with a single long prompt, that’s not an OS architecture — that’s just a capable model. The multi-agent structure with defined roles and clean handoffs is what makes the pattern work.

3. Does it replan when something fails? If the system halts, throws an error, or requires a human to restart whenever an agent produces a bad output, it’s a workflow tool. An Agentic OS handles failures as a normal operating condition, not an exception.

Build vs. Buy

If you’re deciding whether to build this pattern from scratch or use an existing framework, the tradeoff is straightforward.

Build from scratch if:

  • Your workflows are specific enough that no framework covers them without significant workarounds
  • Your security or data requirements mean you can’t route data through external APIs
  • You have the engineering capacity to maintain a custom orchestration layer long-term

Use a framework like LangGraph if:

  • You need to move quickly and don’t want to build memory management and agent routing from scratch
  • Your use case fits within what existing frameworks support — which covers most common patterns
  • You want built-in observability and debugging tools without building your own

What no platform decides for you:

  • How your memory layers are structured and who has write access
  • What your agent roles are and how they hand off to each other
  • How feedback signals get captured and acted on
  • What your failure and replanning logic looks like

The framework handles the plumbing. The architecture — the decisions that actually determine whether your system works — is still yours to design.

FAQ

What’s the difference between an AI agent and an Agentic OS? An agent is a single unit: it receives input, reasons, and produces an output or takes an action. An Agentic OS is the layer above that — it manages multiple agents, decides what each one knows, routes tasks between them, and handles what happens when things go wrong. The agent is the process; the Agentic OS is what runs and coordinates the processes.

Is Agentic OS the same as AGI? No. An Agentic OS is an architectural pattern for organizing AI agents. The agents inside it are still LLM calls with defined roles and scoped context — not general intelligence. The architecture makes them more capable as a system, but each individual agent is still narrow.

What is MCP and why does it matter here? Model Context Protocol (MCP) is an open standard that gives agents a consistent way to connect to external tools and services. Before MCP, every tool integration was custom-built — a different connector for every API. MCP acts like a universal adapter, so agents can call tools without the orchestration layer needing to know the implementation details of each one. For the full picture on MCP and other agent communication standards, see Agentic AI Communication Protocols: MCP, A2A, and ACP.

Can a small team realistically build this? Yes. Frameworks like LangGraph handle most of the infrastructure so you’re not building orchestration from scratch. A small team can get a functional multi-agent system running in weeks. The harder work is designing the memory governance, the agent interfaces, and the failure handling — those require deliberate thought, not just code.

What are the biggest risks when deploying this in production? Three things cause the most problems: agents taking unintended actions with real-world consequences (sending emails, modifying records, making API calls that can’t be undone), memory drift degrading system behavior in ways that are slow and hard to diagnose, and runaway costs from uncontrolled LLM calls across many agents. All three are manageable — but only if you design for them upfront, not after you’re already in production.

The Bottom Line

Agentic OS is a real architectural pattern — not a product, not a marketing term, and not just AI hype.

The core idea is simple: multi-agent systems need a management layer the same way computers need an operating system. Without it, agents are powerful but ungovernable. With it, they become a system you can actually build on, debug, and improve over time.

Most of what’s being sold as “Agentic OS” today doesn’t fully deliver on the pattern yet. The implementations are catching up to the architecture. But the pattern itself — orchestration, structured memory, clean context, specialist agents, feedback loops — is the right foundation for any multi-agent system that needs to work reliably at scale.

If your current agent setup keeps hitting walls, this is the architecture that fixes it.

Key takeaways

  • We ran Kimi K2.6 and Claude Sonnet 4.6 through four real developer tasks: code generation, debugging, code review, and security architecture reasoning.
  • Kimi K2.6 has three modes (Agent, Thinking, and Agent Swarm), and they behave meaningfully differently, not just faster or slower.
  • Claude Sonnet 4.6 was more consistent across tasks and leaned toward production-ready thinking; Kimi K2.6 went deeper on completeness when it ran at full capacity.
  • Mid-test, Kimi K2.6 dropped from Thinking to Instant mode due to high demand. That’s worth factoring in before you build workflows around it.

The timing of this comparison wasn’t random. The week we ran these tests, a lot of developers were already eyeing Kimi as a Claude alternative — not because of benchmarks, but because Anthropic spooked them on pricing.

On April 21, 2026, Anthropic’s pricing page briefly showed Claude Code removed from the $20/month Pro plan. No email, no changelog entry, just an “X” where the checkmark used to be. Reddit and Hacker News moved fast. Within hours there were hundreds of comments, and the alternatives people were naming most often were Kimi, Minimax, and Qwen. By end of day, Anthropic’s Head of Growth had clarified it was an A/B test on roughly 2% of new signups, and the page was restored the next morning. But the comment he left behind stuck: “Usage has changed a lot and our current plans weren’t built for this.”

The change was reversed, but the anxiety wasn’t, and the timing coincided almost exactly with the release of Kimi K2.6 on April 20. So we decided to actually test it.

What You’re Actually Comparing Here

We paired Kimi K2.6 against Claude Sonnet 4.6, Anthropic’s mid-tier model, rather than Opus, because that’s the fair fight. Both sit in the everyday-use tier in their respective families. Both are what most developers have running in production right now. Comparing it to Opus would skew the results in ways that don’t reflect how people actually choose between models.

Before we get into the tasks, it’s worth understanding how Kimi K2.6 is structured, because it’s genuinely different from how Claude works.

3 Modes of Kimi K2.6

Kimi K2.6 Agent operates as a single autonomous agent with tool access. It takes actions rather than just responding, closer to a coding assistant that can actually do things.

Kimi K2.6 Thinking is the deliberative mode. It takes longer, reasons through more steps before committing, and tends to surface tradeoffs. For review and architecture tasks, this is the right mode to use.

Agent Swarm is Kimi K2.6’s most distinctive offering. Up to 300 parallel sub-agents coordinating across thousands of steps. There’s nothing quite like it in Claude’s current interface. We had planned to test it on an agentic planning task, but Agent Swarm and Agent modes currently require priority access. We couldn’t complete that test, so this comparison covers four tasks instead of five. That access gap is worth noting if you’re evaluating it for production.

For Claude Sonnet 4.6, we used standard mode across all tasks.

Kimi K2.6 vs Claude Sonnet 4.6: Feature Comparison

Before the task results, here’s the side-by-side on specs, pricing, and capabilities so you have the full picture in one place.

| | Kimi K2.6 | Claude Sonnet 4.6 |
| --- | --- | --- |
| API pricing | $0.95 input / $4.00 output per 1M tokens | $3.00 input / $15.00 output per 1M tokens |
| Context window | 256K tokens | 1M tokens (200K standard; 1M in beta) |
| Input modalities | Text, image, video | Text, image |
| Agentic modes | Agent, Thinking, Agent Swarm (waitlisted) | Standard + Claude Code |
| Open source | Yes — Modified MIT, self-hostable | No |
| SWE-Bench Verified | 80.2% | 79.6% |

A few things worth calling out from this table. The pricing gap is real: at $0.95/$4.00 per million tokens versus $3.00/$15.00, Kimi K2.6 is roughly 3–4x cheaper on the API. For teams running high-volume coding agents or processing long contexts regularly, that difference adds up fast. A startup consuming 100M input tokens and 10M output tokens monthly pays around $135 (100 × $0.95 + 10 × $4.00) with Kimi K2.6 versus $450 (100 × $3.00 + 10 × $15.00) with Claude Sonnet 4.6.

The context window comparison needs a caveat though. Kimi K2.6’s 256K is generous, but Claude Sonnet 4.6’s 1M token beta window is a meaningful advantage for full-codebase analysis and long document workflows. If you need to load an entire repository into a single prompt, Sonnet 4.6 can do it at standard pricing. And while Kimi K2.6 is open source and self-hostable (a real differentiator for teams with data residency requirements or cost constraints at scale), Agent Swarm access currently requires a priority waitlist, so the most powerful mode on paper isn’t yet available to everyone on demand.

 How We Tested Kimi K2.6 vs Claude Sonnet 4.6

Task 1: Code Generation — Building a FastAPI Endpoint

Asking Kimi K2.6 to write a REST API

The prompt: build a FastAPI endpoint that takes user_id and action, validates the action against an allowed list, stores events in memory, and returns a summary for that user.

Both models returned working code and neither needed cleanup. That’s the baseline and both passed.

The interesting part was the pattern each one reached for. Kimi K2.6 used a field_validator with Pydantic v2. Totally valid. Claude used Literal["login", "logout", "purchase"] as the type annotation itself, which means FastAPI rejects invalid input at the type level before the handler even runs. It’s a small difference on the surface, but it reflects how you think about where constraints should live — in a method, or in the type system. For Pydantic v2 specifically, the type-level approach is the more idiomatic pattern.
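
Neither model’s exact output is reproduced here, but a minimal sketch of the Literal-based, type-level approach to the prompt above might look like this (the route paths and response shapes are illustrative):

```python
from collections import defaultdict
from typing import Literal

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# In-memory event store, as the prompt asked for; swap for Redis/DB in production.
events: dict[str, list[str]] = defaultdict(list)

class Event(BaseModel):
    user_id: str
    # Literal constrains the field in the type system itself, so FastAPI
    # rejects invalid actions with a 422 before the handler ever runs.
    action: Literal["login", "logout", "purchase"]

@app.post("/events")
def record_event(event: Event) -> dict:
    events[event.user_id].append(event.action)
    return {"user_id": event.user_id, "recorded": event.action}

@app.get("/events/{user_id}")
def summarize(user_id: str) -> dict:
    actions = events[user_id]
    return {
        "user_id": user_id,
        "total_events": len(actions),
        "by_action": {action: actions.count(action) for action in set(actions)},
    }
```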

Claude also added a DELETE endpoint without being asked, flagged that the in-memory store should be replaced with Redis in multi-process deployments, and mentioned Swagger UI at /docs. Kimi K2.6 added a GET endpoint and solid curl examples. Both went beyond the prompt, just in different directions. Claude’s additions were the kind of things that come up in code review. Kimi K2.6’s additions were the kind of things that make the output immediately usable.

One more practical difference: Claude rendered the endpoint as a testable artifact you could interact with inline. With Kimi K2.6’s Agent mode, you copy the code, save the files, and run it locally. For developers iterating quickly, that friction adds up.

Task 2: Debugging — A Logic Bug That Looks Fine on the Surface

Asking Claude Sonnet 4.6 to fix a bug in Python code

The function was supposed to return unique emails from a list of user dictionaries. The bug: seen was checked on every loop but never populated, so duplicates passed through silently. The code looked syntactically correct. There was nothing for a linter to catch.

Both models found it immediately. Both fixed it and recommended a set for O(1) lookups over the original list. On the core task, they were equal.

The difference showed up in what each model offered next. Kimi K2.6 threw in a one-liner using seen.add() inside a boolean short-circuit expression. It works, and you can see why it’s tempting to include. It’s also the kind of thing that gets flagged in a code review because it trades readability for conciseness in a way that doesn’t pay off in a real codebase.

Claude’s bonus was dict.fromkeys(). It’s a standard library idiom, it preserves insertion order, and any Python developer who reads it knows exactly what it’s doing. The O(n) vs O(1) explanation was also cleaner — not just “use a set” but a brief walkthrough of why the performance difference matters as the input scales.
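
The original snippet isn’t shown above, so this is a reconstruction rather than either model’s output: the fixed loop with the missing seen.add() restored, plus the dict.fromkeys() idiom Claude suggested (the email field name is an assumption):

```python
def unique_emails(users: list[dict]) -> list[str]:
    """Return each email once, preserving first-seen order."""
    seen: set[str] = set()       # set gives O(1) membership checks
    result: list[str] = []
    for user in users:
        email = user["email"]
        if email not in seen:
            seen.add(email)      # the original bug: this line was missing
            result.append(email)
    return result

def unique_emails_idiomatic(users: list[dict]) -> list[str]:
    """Standard-library idiom: dict.fromkeys() deduplicates and keeps insertion order."""
    return list(dict.fromkeys(user["email"] for user in users))
```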

Both models went beyond what was asked. One went toward showing off, the other went toward teaching.

Task 3: Code Review — A Dangerous Database Function

Asking Kimi K2.6 to review Python Code

This one had a classic SQL injection via f-string, a connection that’s never closed, SELECT * pulling every column, no error handling, no input validation, and a hardcoded database path. Six issues stacked in a short function.

Both models found all of them. Neither missed the SQL injection, and neither missed the resource leak. At the level of “does the model know what production code quality looks like,” both cleared the bar.

Where they diverged was in how they organized the findings. Claude led with severity labels (Critical, High, Medium) and finished with a summary table. That structure matters in practice: it tells you what to fix before you ship and what can wait for the next sprint. It also framed SELECT * as a security issue rather than just a performance one. Most developers know that pulling all columns is wasteful; fewer think about the fact that it likely returns password hashes, tokens, and admin flags to wherever the result lands. Claude made that explicit.

K2.6 caught two issues Claude didn’t mention — missing docstring and absent type hints — and its refactored version reflected that. The rewrite came back with a full docstring including Args, Returns, and Raises sections, typed parameters using Optional[Tuple[Any, ...]], and a ValueError for empty or invalid inputs. If you needed a drop-in replacement you could commit immediately, its output was closer to ready.

The practical split: Claude’s output helps you triage. K2.6’s output gives you the replacement. Depending on what stage you’re at, one of those is more useful than the other.
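
To make the six issues concrete, here is a sketch of a version that addresses them: a parameterized query instead of an f-string, an explicit column list instead of SELECT *, a connection that is actually closed, basic validation, and a configurable path. The table and column names are assumptions, and this is not a reproduction of either model’s rewrite.

```python
import sqlite3
from contextlib import closing
from typing import Any, Optional, Tuple

def get_user(user_id: int, db_path: str = "app.db") -> Optional[Tuple[Any, ...]]:
    """Fetch selected user columns by id.

    Raises:
        ValueError: if user_id is not a positive integer.
    """
    if not isinstance(user_id, int) or user_id <= 0:
        raise ValueError("user_id must be a positive integer")

    # Parameterized query: the driver handles escaping, closing the injection hole.
    query = "SELECT id, username, email FROM users WHERE id = ?"
    with closing(sqlite3.connect(db_path)) as conn:   # connection closed even on error
        return conn.execute(query, (user_id,)).fetchone()
```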

Task 4: Multi-Step Reasoning — Rate Limiting an Auth Flow

Asking Claude Sonnet 4.6 to perform multi step reasoning

The task: Restructure a six-step login service to add IP-based rate limiting before any database query, identify what new components are needed, and describe what could go wrong if implemented incorrectly.

Before the results, something happened mid-test that’s worth being upfront about. Kimi K2.6 hit high demand during this task and automatically dropped from Thinking to Instant mode. It told us, and offered an upgrade path. The response we got was Instant mode output, not Thinking mode. That matters for interpreting the results below and it matters for anyone evaluating K2.6 for workflows where consistent reasoning depth is a requirement.

The response itself was still solid. K2.6 restructured the flow correctly with the rate limit as the first gate, identified Redis with atomic INCR + EXPIRE as the right approach, flagged race conditions in non-atomic read-then-write patterns, laid out the fail-open vs fail-closed tradeoff, and caught the shared-IP / NAT problem with per-IP rate limiting. It also flagged clock skew in sliding window implementations — a genuinely obscure edge case that a lot of architects wouldn’t think to include.
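
For reference, the INCR + EXPIRE pattern both responses converged on looks roughly like this with redis-py; the key prefix, limits, and window length are illustrative, and a production version would wrap the two commands in a pipeline or Lua script to keep them fully atomic:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

MAX_ATTEMPTS = 5        # allowed login attempts per window
WINDOW_SECONDS = 60     # fixed window length

def allow_login_attempt(ip: str) -> bool:
    """Return True if this IP is still under its per-window attempt budget."""
    key = f"login_attempts:{ip}"
    count = r.incr(key)                 # atomically creates the key at 1 if absent
    if count == 1:
        r.expire(key, WINDOW_SECONDS)   # start the window on the first attempt
    return count <= MAX_ATTEMPTS
```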

Claude covered the same core ground and found a few things on top of it. One was a design decision that’s easy to overlook: should the rate limiter count all login attempts from an IP, or only the failed ones? If you only count failures, an attacker who occasionally succeeds with a throwaway account can keep resetting their counter. Claude called this out explicitly and explained why it matters under adversarial conditions. It also caught a timing side-channel: if the rate limiter sits after the database query, response latency differences can reveal whether a username exists even when the request is ultimately rejected. And it added the Retry-After header — not in the prompt, not something most people think about first, but something that prevents legitimate clients from hammering the endpoint during backoff.

The gap between the outputs here reflects something real: Claude’s response read like it was written by someone thinking about what breaks in production, not just what the correct architecture looks like on a whiteboard. Whether that gap would have been smaller if Kimi K2.6 had stayed in Thinking mode, we can’t say. But the mode degradation itself is part of the result.

What We Actually Took Away From This

Kimi K2.6 is genuinely capable — and in some areas, notably code completeness and certain deep reasoning tasks, it goes further than Sonnet 4.6. Its Thinking mode produces thorough output when it runs at full capacity, and the refactored code it returns is often closer to production-ready than what Claude gives you. The three-mode interface is also a real differentiator: being able to choose between a fast agent, a deliberative reasoner, and a massively parallel swarm depending on the task is something no other model in this comparison class currently offers.

Claude Sonnet 4.6 is more consistent. Across four tasks, it ran without degradation, and its outputs reflected a stronger read on what code needs to be maintainable over time — not just correct at the moment of generation. The things it added unprompted (the Literal type, the Retry-After header, the security framing on SELECT *) were the kind of additions that save you a ticket later.

The mode reliability issue is the most honest thing we can say about the current state of using it in a real workflow. If you’re evaluating a model for something you need to depend on, “it fell back to a different mode under load” is a relevant data point — separate from how good the output is when everything runs as intended.

If you’re building agentic workflows and want to explore what an open-source model purpose-built for long-horizon execution looks like, Kimi K2.6 is worth your time once access opens more broadly. If you need a reliable, production-aware model for everyday developer work right now, Sonnet 4.6 is the more consistent choice today.

FAQs

What is Kimi K2.6? It is Moonshot AI’s latest open-source model, released April 20, 2026. It runs on a Mixture-of-Experts architecture with 1 trillion total parameters (32 billion active per token), supports text, image, and video input, has a 256K context window, and offers three execution modes: Agent, Thinking, and Agent Swarm. It’s built specifically for long-horizon coding and autonomous multi-agent workflows.

What is Claude Sonnet 4.6? Claude Sonnet 4.6 is Anthropic’s mid-tier model in the Claude 4.6 family, released February 17, 2026. It’s the default model on Claude.ai’s free tier and the one most developers are using in production coding workflows today.

Why compare it to Sonnet and not Opus? Both 4.6 models are the practical everyday-use choice in their respective families. Comparing it against Opus 4.6 would tell you less about where these two actually compete — most developers choosing between them aren’t in the Opus pricing tier.

How does it benchmark against Claude on coding tasks? On SWE-Bench Pro at release, K2.6 scores 58.6 vs Claude Opus 4.6’s 53.4. On SWE-Bench Verified, K2.6 scores 80.2 and Claude Sonnet 4.6 scores 79.6 — essentially the same. The benchmarks are close enough that practical output quality, consistency, and workflow fit matter more than the numbers alone.

What is K2.6 Agent Swarm and what is it good for? Agent Swarm is K2.6’s most distinctive mode — it coordinates up to 300 parallel sub-agents across up to 4,000 steps. It’s designed for tasks that can be broken into parallel, specialized workstreams: large-scale codebase migrations, comprehensive research pipelines, multi-format content generation at scale. There’s no direct equivalent in Claude’s current product. Access currently requires a priority waitlist.

Is it free to use? Yes, it is available free at kimi.com. Paid plans unlock higher usage limits and additional features. The model weights are also open-sourced under a Modified MIT License for developers who want to self-host using vLLM or SGLang.

Key Takeaways

  • Claude Design launched on April 17, 2026 as Anthropic’s boldest move beyond chatbots, turning Claude into a full prototyping engine that outputs live HTML, CSS, and React
  • Google Stitch evolved from a single-screen experiment at Google I/O 2025 into a multi-screen AI canvas with voice commands and interactive prototyping by March 2026
  • Figma’s stock has fallen ~35% year-to-date in 2026 — the market is already pricing in a design tools disruption that product teams need to understand now

The design tool market has a new war on its hands, and it started in earnest this April. On April 17, 2026, Anthropic launched Claude Design, a workspace that lets teams go from a text prompt to a live, interactive prototype without opening Figma. Days later, the internet had a new debate: does this kill the $3.2 billion design tools industry, or does it just reshape it?

The honest answer is more interesting than either extreme. To understand what’s really happening, you need to look at both tools in detail — what they do, how they differ, and what each one means for designers, product managers, and developers trying to move faster in 2026.

Claude Design vs. Google Stitch: Feature-by-Feature Breakdown

| Feature | Claude Design | Google Stitch |
| --- | --- | --- |
| Launch | April 17, 2026 | May 2025 (major update March 2026) |
| Underlying AI | Claude Opus 4.7 | Gemini 2.5 Flash / Gemini 2.5 Pro |
| Output type | Live HTML, CSS, React components | UI mockups + HTML/TailwindCSS |
| Multi-screen | Yes | Yes (up to 5 screens per generation) |
| Brand/design system | Auto-ingests codebase + design files on onboarding | URL extraction + DESIGN.md file |
| Voice input | No (at launch) | Yes — real-time design critique via voice |
| Figma export | No — exports to Canva, PDF, PPTX, HTML | Yes — paste directly to Figma |
| Developer handoff | Native Claude Code handoff bundle | AI Studio and Antigravity integration |
| Collaboration | Org-scoped sharing + group conversation editing | MCP server, SDK, Agent manager |
| Pricing | Free tier with limits; Pro at $20/month | Free via Google Labs |
| Best for | Enterprise product teams, code-accurate prototypes | Individual designers, fast ideation, Figma workflows |

What Is Claude Design? Anthropic’s New Creative Workspace

Claude Design Interface

Claude Design is a new product from Anthropic Labs that lets you collaborate with Claude to create polished visual work — designs, prototypes, slide decks, one-pagers, and more. It is powered by Claude Opus 4.7, Anthropic’s most capable vision model, and is currently available in research preview for Claude Pro, Max, Team, and Enterprise subscribers.

The key distinction worth understanding immediately: Claude Design is not an image generator. It is a prototyping engine. When you describe what you need — a landing page, a dashboard, a checkout flow — Claude builds a first version as live HTML, CSS, and React components that render in real time. You are not getting a static mockup to send to a developer. You are getting code.

This matters because it closes the gap between design and development in a way that earlier AI tools couldn’t. As we explored in our breakdown of Claude vs. ChatGPT, one of Claude’s consistent strengths has been its ability to reason about code and structure simultaneously — and Claude Design is exactly what happens when that capability gets a dedicated creative surface.

How the Workflow Works

Claude Design - Setting Up Your Design System

The experience follows a natural creative loop. During onboarding, Claude reads your team’s codebase and design files to build a design system automatically. Every project that follows uses your brand’s colors, typography, and components without you having to specify them again. Teams maintaining multiple design systems — say, one for a consumer product and one for an enterprise dashboard — can manage both.

From there, you can start a project in several ways: a text prompt, an uploaded document (DOCX, PPTX, XLSX), a screenshot of your existing product, or by pointing Claude at your codebase. There is also a web capture tool that pulls visual elements directly from your website so that prototypes look like the real thing rather than a generic template.

Refinement happens through conversation. You can comment inline on specific elements, edit text directly, use adjustment knobs to tweak spacing and color in real time, and ask Claude to apply any of those changes across the entire design in one instruction. When a design is ready to hand off, Claude packages everything into a bundle that you pass to Claude Code with a single instruction — no manual spec writing, no back-and-forth briefs.

Who It’s Built For

The clearest use cases Anthropic has highlighted:

  • Designers who want to explore more directions quickly and turn static mockups into shareable interactive prototypes without a code review cycle
  • Product managers who need to sketch feature flows and hand them off directly to engineering or to designers for refinement
  • Founders and marketers who need a pitch deck or landing page and do not have a design background
  • Enterprise teams who want code-accurate, brand-consistent prototypes at scale

What Is Google Stitch? From Experiment to Figma Rival

Google Stitch Interface

Google Stitch launched quietly at Google I/O in May 2025 as a Google Labs experiment. The pitch was simple: describe a UI in plain English, and Stitch generates a screen for you. It was fast, impressively accurate for a first version, and clearly a test of appetite. The market responded with enthusiasm, and less than a year later, Stitch is a fundamentally different product.

The March 2026 update transformed Stitch into an AI-native software design canvas. Where the original tool generated single screens, the new version generates up to five interconnected screens simultaneously from a single natural language description. Where the original had a basic prompt input, the new version has an infinite canvas, a design agent that tracks the project’s evolution, voice commands, and an Agent manager that lets you work on multiple design directions in parallel.

Stitch’s origins trace back to Galileo AI, a startup founded in 2022 that built one of the earliest text-to-UI tools. Google acquired Galileo AI in early 2025 and rebranded it as Stitch, integrating it with the Gemini model family. This acquisition context matters: Stitch is not a side experiment Google spun up to test generative UI. It is Google’s most serious attempt to enter the professional design tools market, and it is backed by Gemini’s multimodal reasoning.

The Two Modes

Stitch runs on two versions of Gemini depending on what you need:

  • Standard Mode uses Gemini 2.5 Flash — fast, good for text-based prompt generation, supports Figma export, and gives you 350 generations per month
  • Experimental Mode uses Gemini 2.5 Pro — higher-fidelity output, accepts image inputs (sketches, screenshots, wireframes), and gives you 200 generations per month

Both modes are currently free through Google Labs, which is an important factor for individual designers and small teams evaluating the tool against paid alternatives.

What the March 2026 Canvas Introduced

The infinite canvas is the most significant structural change. Traditional design tools give you a blank page and expect you to fill it. Stitch’s canvas is intelligent — it understands the project’s entire history, can suggest next screens based on a user’s likely journey through the app, and allows you to bring in context from images, text, or code directly onto the canvas.

Voice is the other major new capability. You can speak to the canvas directly — asking for real-time design critiques, requesting layout variations, or triggering specific changes like “show me three different menu options” while a design is open. This is not a gimmick. For designers who think out loud or work with stakeholders during live reviews, voice interaction meaningfully changes how feedback loops work.

Stitch also introduced DESIGN.md — an agent-friendly markdown file that lets you export or import your design rules to and from other tools, including other Stitch projects. This addresses one of the biggest practical friction points in AI design tools: the inability to carry brand context across projects without starting from scratch.
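
Google has not published a fixed schema for DESIGN.md in the material covered here, so treat the following as a hypothetical sketch of the kind of rules a team might keep in the file; the section names and values are illustrative, not Stitch's official format.

# DESIGN.md (design rules for a hypothetical "Acme Dashboard" project)

## Colors
Primary: #1A73E8. Surface: #FFFFFF. Body text: #202124.

## Typography
Headings: Inter, weight 600. Body: Inter, weight 400, 16px base size.

## Components
Buttons: 8px corner radius, primary fill, no drop shadows.
Cards: 16px padding, 1px #E0E0E0 border.

## Voice
Sentence case for labels; keep empty-state copy short and plain.

Because the file is plain markdown, it can be versioned alongside the codebase and carried between Stitch projects, which is exactly the portability the feature is meant to provide.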

Google Stitch - Setting Up DESIGN.md

Figma shares fell more than 4% the week of the Google Stitch March 2026 update. The stock is down approximately 35% year-to-date in 2026 — the design tools market is already repricing around AI disruption.

Where Claude Design Pulls Ahead

The strongest argument for Claude Design is the depth of its enterprise workflow integration. The design system ingestion during onboarding is not a feature you will find in Stitch — it means that from project one, every output reflects your actual brand rather than a generic interpretation of it. For teams managing complex visual identities across multiple products, this alone justifies the switch for prototyping work.

The Claude Code handoff is the other structural advantage. When a prototype is ready to build, Claude packages the entire design context into a bundle that passes directly to Claude Code. There is no specification document to write, no annotated Figma file to export, no brief to translate. The design and the implementation instructions are one artifact. Given how much time is lost in most product teams at exactly this handoff moment, this is a meaningful efficiency gain.

Product teams at companies like Datadog have reported going from rough idea to working prototype before a meeting ends, with the output staying true to brand guidelines without manual correction. Brilliant’s design team noted that pages requiring 20+ prompts in other tools only needed two prompts in Claude Design. These are not generic testimonials — they reflect a genuine reduction in friction at the most painful parts of the design cycle.

Understanding why Claude performs so well here requires some context about how the underlying model has evolved. Our deeper look at how Claude 3.5 Sonnet introduced Artifacts — the feature that laid the groundwork for Claude Design’s real-time rendering — explains the architectural shift that made this possible.

Where Google Stitch Pulls Ahead

Stitch’s multi-screen generation is its most practically powerful feature. Describing a full application flow and receiving five interconnected, coherent screens in one operation is something Claude Design does not currently offer at the same fidelity. A product manager who needs to communicate an entire checkout flow — cart, shipping, payment, confirmation, order tracking — can have that as a coherent design artifact in a single Stitch prompt.

The Figma integration is the other reason Stitch fits better into many existing workflows. Designers who live in Figma do not want to abandon it — they want faster ideation before they open Figma. Stitch’s paste-to-Figma function makes that transition seamless. Claude Design, by contrast, is building a parallel workflow that competes with Figma rather than plugging into it.

Stitch is also genuinely free in a way that matters for independent designers and early-stage teams. 350 standard mode generations per month is enough for rapid prototyping without any subscription cost. Claude Design requires a Pro plan at $20/month for meaningful access beyond the free tier — which is still competitive, but it is not free.

Voice-driven design critique is a genuine differentiator that is hard to overstate for teams that work collaboratively. The ability to talk through a design with an AI agent that responds in real time — making adjustments, offering critiques, suggesting alternatives — is a fundamentally different mode of working than typing prompts in a chat interface.

Pricing: What Do These Tools Actually Cost?

| Plan | Claude Design | Google Stitch |
|---|---|---|
| Free | Available with generation limits | Free via Google Labs (350 standard / 200 pro generations per month) |
| Paid | Claude Pro at $20/month (included in subscription) | No paid tier announced yet |
| Team/Enterprise | Claude Team and Enterprise plans (admin controls, org sharing) | Not yet available |

Both tools undercut Figma’s team pricing significantly. For context, Figma’s professional plans run $12–15 per editor per month, with organization plans considerably higher. The AI design tools entering this space are doing so at a price point that makes evaluation essentially free, which accelerates adoption.

What This Means for Designers, PMs, and Developers in 2026

The question most teams are actually asking is not “which tool wins” — it is “which tool do I reach for and when.” The answer follows logically from what each tool prioritizes.

Reach for Claude Design when:

  • You need a code-accurate prototype that reflects your actual brand and design system
  • Your next step after prototyping is sending something to an engineering team
  • You are working within the Anthropic ecosystem and want Claude Code to implement the design
  • Your team needs org-scoped collaboration with version tracking inside a single tool

Reach for Google Stitch when:

  • You need to generate multiple screens of a full application flow in one operation
  • Your existing workflow centers on Figma and you need faster ideation before opening it
  • You are an independent designer or early-stage team where free access matters
  • You want to extract a design system from an existing URL and use it as a starting point

The deeper shift both tools represent is what the Data Science Dojo breakdown of top LLM companies describes as a transition from models as utilities to models as embedded collaborators. Both Anthropic and Google are building tools where the AI does not assist the workflow — it is the workflow. That distinction is what makes 2026 different from 2024.

For teams that want to understand how the underlying models power these capabilities, our guide to the best large language models covers the model families that both tools are built on, including Gemini’s multimodal architecture and Anthropic’s approach to instruction following and code generation.

The Bigger Picture: Who Wins the AI Design Wars?

Neither tool is a Figma killer yet. Both are genuinely missing things that production design teams depend on — precise vector editing, persistent component libraries with tokens, deep developer handoff with measurements and annotations, plugin ecosystems, and the kind of version history that large teams need to work without overwriting each other. These are not small gaps.

But the trajectory matters as much as the current state. Stitch went from a single-screen experiment to a five-screen canvas with voice and interactive prototyping in under a year. Claude Design launched with design system ingestion, Claude Code handoff, and org-level collaboration on day one. Both companies are investing heavily and iterating fast.

The financial markets have already drawn a conclusion. Figma shares fell more than 4% in the days following the March 2026 Stitch update and are down roughly 35% year-to-date. That is not just sentiment — it is institutional capital pricing in a fundamental shift in how design tools will work. This pattern mirrors what the generative AI art tools space went through between 2022 and 2024, where established creative software providers were forced to restructure their product roadmaps around AI-native competitors.

What is clear is that the “design handoff problem” — the friction-heavy translation of visual intent into buildable code — is being solved at the model level rather than the tooling level. Claude Design solves it by making the design output be the code. Stitch solves it by integrating into Figma so that the code generation happens downstream. Both approaches are valid, and both will continue to improve.

The teams that win in this environment are not the ones that pick the right tool in April 2026 — they are the ones that build the organizational habit of evaluating and integrating these tools as they evolve. For teams that are already building AI-powered workflows and want to understand the underlying model landscape better, the LLM guide for beginners is a practical starting point for understanding what makes these tools work the way they do.

FAQ: Claude Design and Google Stitch Explained

Is Claude Design free? Claude Design has a free tier with usage limits. Full access — including longer conversations and higher usage limits — is included in a Claude Pro subscription at $20/month. It is also available on Claude Max, Team, and Enterprise plans.

Is Google Stitch free? Yes. Google Stitch is currently free through Google Labs. Standard mode gives you 350 generations per month, and Experimental mode (higher fidelity, supports image input) gives you 200 generations per month. Google has not announced a paid tier as of April 2026.

Does Claude Design replace Figma? Not for production design work. Real-time multi-editor collaboration, persistent component libraries, precise vector editing, and developer handoff with measurements are areas where Figma still leads. Claude Design bypasses Figma for many early-stage use cases — prototyping, wireframing, pitch decks — but it is not a replacement for teams doing production-level UI work.

Can Google Stitch export to Figma? Yes. In Standard Mode, Stitch includes a paste-to-Figma function that lets you move generated designs directly into a Figma file for further editing. Experimental Mode does not currently support Figma export.

Who is Claude Design best for? Product teams, PMs, designers, and founders who want prototypes that are code-accurate, brand-consistent, and ready to hand off to engineering — particularly those already using Claude Code in their development workflow.

What language does Claude Design export code in? Claude Design generates HTML, CSS, and React components. Google Stitch exports HTML and TailwindCSS.

Can I use both tools together? Yes, and for many teams this makes sense. Stitch is stronger for rapid multi-screen ideation and Figma-compatible flows; Claude Design is stronger for code-accurate prototyping and enterprise brand consistency. Using Stitch to explore directions and Claude Design to produce the final handoff artifact is a workflow worth considering.

Conclusion: The Design Workflow Is Being Rewritten

The AI design wars of 2026 are not a zero-sum competition. Claude Design and Google Stitch are solving adjacent problems in adjacent ways, and the result is that teams have more capability than ever to close the gap between an idea and a working product.

The practical takeaway is this: if you are a product team or designer who has not yet built a prototyping workflow around AI tools, the cost of staying on the sideline is rising. Both tools are accessible right now — Claude Design through claude.ai/design, Google Stitch through stitch.withgoogle.com — and both have free or low-cost entry points that make experimentation essentially free.

The companies that figure out when to use each tool, and how to integrate both into their existing workflows, will not just move faster. They will build better products because the feedback loop between idea and prototype has been compressed from days to minutes.

For teams that want to go deeper on the models powering these tools, exploring Anthropic’s Claude 3 model family provides useful context on how Anthropic’s approach to reasoning and code generation has evolved into what powers Claude Design today.

Key Takeaways

  • Most AI for enterprise business cases stall because they start at the wrong ROI stage — justifying cost savings when the real value is further upstream
  • The 3-stage ROI maturity model (cost savings → revenue generation → new possibilities) gives decision-makers a clear benchmark for where their organization stands
  • The current enterprise sweet spot is 7-figure wins in the $2–3M range — but targets of $100M+ are being pursued by companies that have been building for over a year

Building a credible AI for enterprise business case has become one of the most mishandled challenges facing decision-makers today. The pressure to deploy agentic AI is real. So is the organizational skepticism that greets every new initiative. The result is a cycle of approved pilots, stalled deployments, and ROI numbers that never match what was promised.

The problem is rarely the technology. At the Future of Data and AI: Agentic AI Conference, Raja Iqbal, moderating the panel on enterprise economics, put it plainly at the outset: for many use cases, the technology works. The blockers are organizational friction, operating model, culture, and how people think about agents.

This article walks through the 3-stage agentic AI ROI maturity model introduced by Joao Moura, CEO and founder of CrewAI, during that panel. It explains what each stage looks like, what it requires, and how to build a credible AI for enterprise business case depending on where your company actually is.

Why Most AI for Enterprise Business Cases Get the ROI Framing Wrong

The most common mistake is strategic, not technical. Teams build the business case around cost reduction because it is the easiest number to put in a spreadsheet. Finance approves it, the project launches, and somewhere between the demo and production the returns shrink or disappear.

David Park, who leads the applied AI team at Landing AI, identified exactly why this happens:

The durable value will come from being able to restructure those workflows themselves, not just adding an agent or an LLM on top of it. Today we have augmentation without simplification.

The second failure mode is the demo-to-production gap. A polished proof of concept creates internal momentum, but production requires answering questions that demos never surface:

In demos the system works beautifully. But in production the critical questions are: who owns the output, how is this monitored, can it be audited and traced back to source with calibrated confidence?

David Park, Applied AI Lead, Landing AI

Joao Moura framed the broader challenge as the “last mile” problem. Building the agent is not the hard part — the tooling is increasingly commoditized. Projects fail on data readiness, legacy integration, governance, and change management. As Joao put it on the panel, that last mile turns out to be more like a thousand miles once a project runs into everything production actually demands.

The 3-Stage Agentic AI ROI Maturity Model

 

Joao Moura introduced this model as the lens he uses to gauge how mature a customer is on their AI for enterprise journey:

Everyone starts on the early days talking about cost savings because that’s the horizon they can see. But then they go into how they can generate money from this. No one grows a massive business by playing defense. And the final frontier is: what can I do now that I could not even consider doing before, because it was not even feasible?

— Joao Moura, CEO & Founder, CrewAI

That progression, defense to offense to new territory, is the spine of the model.

3-stage agentic AI ROI maturity model for AI for enterprise deployments – Joao Moura

Stage 1: Cost Savings (Playing Defense)

Stage 1 is where most AI for enterprise deployments begin. Cost savings is the horizon most organizations can see at the start — it is the easiest ROI case to make internally, the easiest to measure, and the lowest-risk entry point for organizations still building confidence in the technology.

At this stage, agents automate repetitive workflows, reduce manual processing time, and cut costs in specific, bounded operations. The business case is a cost-displacement argument: here is what this process costs today, here is what it will cost with agents, here is the payback period.
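
A minimal sketch of that payback math, with hypothetical numbers rather than benchmarks from the panel, looks like this:

# Hypothetical Stage 1 cost-displacement calculation (illustrative figures only)
current_annual_cost = 600_000   # what the manual process costs today
agent_annual_cost = 180_000     # model usage, infrastructure, and human oversight
implementation_cost = 250_000   # one-time build and integration effort

annual_savings = current_annual_cost - agent_annual_cost        # 420,000
payback_months = implementation_cost / (annual_savings / 12)    # roughly 7 months

print(f"Annual savings: ${annual_savings:,}")
print(f"Payback period: {payback_months:.1f} months")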

The risk of staying here too long is that the organization optimizes existing processes rather than reimagining them. Companies that treat Stage 1 as a destination rather than a foundation tend to cap their returns early.

What Stage 1 requires: Defined workflows with measurable baselines. Clean enough data for agents to act on. A governance model for automated outputs. A team willing to own agent behavior in production.

Stage 2: Revenue Generation (Playing Offense)

Stage 2 is where the AI for enterprise business case shifts from defense to offense. Instead of reducing costs, the argument is about accelerating revenue: shipping faster, closing deals more efficiently, personalizing at scale, capturing revenue that was previously out of reach.

This stage requires more from the organization. Data readiness matters more because agents are now operating on revenue-critical workflows. Monitoring matters more because the cost of a failure is not just an efficiency loss — it is a customer or a deal.

The current benchmark: 7-figure wins in the $2–3M range are becoming more common. Joao shared a concrete example at the conference — a large CPG company used agents to handle stalled orders across shipping, invoice reconciliation, and routing bottlenecks. A relatively simple workflow redesign generated $2 million in value within two weeks by unblocking over 800,000 orders. As Joao noted, wins like that are no longer exceptional for well-executed Stage 2 AI for enterprise deployments.

What Stage 2 requires: A stable agent infrastructure from Stage 1. Production-grade monitoring and clear ownership of outputs. A workflow redesign mentality, not just an automation mentality. Executive sponsorship that understands the difference between the two.

Stage 3: New Possibilities (The Compounding Moat)

Stage 3 is where the AI for enterprise business case changes entirely. The question is no longer “can we do this more efficiently?” It is “can we do things that were not economically feasible before we had agents?”

At this stage, enterprises are using agentic AI to create entirely new products, serve new customer segments, or operate in markets that were previously too complex or expensive to enter. The competitive advantage does not depreciate quickly because it is built on proprietary data and workflows that cannot be replicated by deploying a third-party agent on a standard stack.

The conference benchmarks here are instructive. Joao described one customer whose goal is to save $100 million with agents in a single year:

They have a goal for this year that they want to save $100 million with agents. They’re shooting for the moon — but we have been working with them for over a year and now it’s getting to amazing results. It’s not a magic thing where you just snap your fingers and it works.

— Joao Moura, CEO & Founder, CrewAI

That timeline is the reality of what Stage 3 AI for enterprise requires. The $100M target is the outcome of a deliberate progression through Stages 1 and 2.

What Stage 3 requires: 12 or more months of serious investment in Stages 1 and 2. A platform team that owns identity, logging, governance, and cost metering. Leadership willing to fund a multi-year roadmap without demanding immediate returns.

| Stage | Core ROI Argument | Typical Win Size | Key Requirement | Time Horizon |
|---|---|---|---|---|
| Stage 1: Cost Savings | Reduce operational spend, automate repetitive workflows, displace manual effort | $50K–$500K | Clean data, defined workflows, governance model for agent outputs | Weeks to months |
| Stage 2: Revenue Generation | Ship faster, close more deals, capture revenue previously out of reach | $1M–$3M | Redesigned workflows, production-grade monitoring, cross-functional alignment | 3–9 months post Stage 1 |
| Stage 3: New Possibilities | Do things that were not economically feasible before agents existed | $10M–$100M+ | 12+ months of Stage 1 and 2 investment, dedicated platform team, multi-year roadmap | 12+ months |

Which Stage Is Your AI for Enterprise Program Actually At?

This is the question most teams get wrong — not because they are dishonest, but because the signals are easy to misread. A company with several active pilots and a growing AI team often assumes it is at Stage 2. Operationally, it is frequently still at Stage 1.

Use these five questions to assess your actual stage:

  1. Do you have clean, classified data that agents can reliably act on? If not, you are at Stage 1 regardless of what your pilots are doing.
  2. Do you have production monitoring and a defined owner for agent outputs? A working demo is not a production deployment.
  3. Have you restructured at least one workflow around agent capabilities — not just automated it? Augmentation without simplification is Stage 1 behavior dressed as Stage 2.
  4. Can your organization absorb a Stage 2 failure without killing the entire AI program? If not, your organizational maturity has not caught up with your ambition.
  5. Do you have a platform team that owns agent infrastructure independently of any specific use case? If every deployment rebuilds from scratch, Stage 3 is not yet accessible.

A common pattern from the conference: companies get early success with a proprietary model, bills stack up, and they re-architect on open-source stacks without first establishing the governance layer that makes that transition safe. The stage they thought they were at and the stage they actually were at did not match.

The Hidden Blockers That Kill AI for Enterprise ROI

Even a well-constructed business case fails if the organization has not addressed the conditions that determine whether agents can deliver in production.

Data readiness is the most underestimated blocker at every stage. Unlike human workers who bring implicit background knowledge, an agent operating on an incomplete dataset will fill gaps with plausible but wrong answers. Data classification is a prerequisite to everything else.

Change management surprises teams the most. The resistance is rarely to the technology. It is to new ownership structures, new accountability models, and new ways of evaluating performance.

The demo-to-production gap is where most hidden cost lives. A proof of concept on clean, curated data will behave very differently in production. Not accounting for governance, monitoring, and change management in the business case is the single most common reason these investments underdeliver.

Frequently Asked Questions

What is the agentic AI ROI maturity model? The agentic AI ROI maturity model is a three-stage framework for how enterprise value from AI agents compounds over time. Stage 1 is cost savings, Stage 2 is revenue generation, and Stage 3 is new possibilities that were not economically feasible before agents existed. It was introduced by Joao Moura of CrewAI at the Agentic AI Conference.

How do I build a business case for agentic AI? Start by identifying which stage your organization is actually at. Stage 1 cases are operational efficiency arguments with clear baselines and payback periods. Stage 2 cases require evidence of production-grade governance and workflow redesign. Stage 3 cases are multi-year strategic pitches that require documented Stage 1 and Stage 2 outcomes.

What ROI can enterprises realistically expect from agentic AI? Current benchmarks from AI for enterprise deployments show 7-figure wins in the $2–3M range becoming common at Stage 2. Enterprises targeting $100M+ outcomes have been building for over a year and have invested heavily in data infrastructure and governance.

What is the difference between Stage 1 and Stage 2 AI ROI? Stage 1 is a cost-displacement argument: reducing headcount, automating workflows, cutting operational spend. Stage 2 is a revenue argument: shipping faster, closing more deals, capturing revenue previously out of reach. Stage 2 requires a workflow redesign mindset, not just automation.

How long does it take to see ROI from agentic AI? For most AI for enterprise programs, Stage 1 returns can appear within months of a well-scoped deployment. Stage 2 requires a Stage 1 foundation first. Stage 3 outcomes, including $100M+ targets, require 12 or more months of dedicated investment.

What are the biggest blockers to enterprise AI ROI? Data readiness, change management, and the demo-to-production gap. The technology is rarely the reason AI for enterprise projects fail.

The Stage You Start At Determines the Returns You Get

The organizations winning at AI for enterprise did not start with the most sophisticated agents or the largest budgets. They started with an honest answer to a simple question: which stage are we actually at, and what does it take to execute well here before moving to the next one?

As Joao Moura said at the conference:

It’s not a magic thing where you just snap your fingers and you have agents and now you’re a hundred times more productive. But if you put in the engineering work, you can achieve something remarkable.

— Joao Moura, CEO & Founder, CrewAI

The enterprises targeting $100M+ started exactly where you are. Start at the right stage, build the foundation, and the returns compound from there.

Explore our resources on building smarter agentic AI workflows and open-source tools for agentic AI development to take your next step.

Ready to build robust and scalable LLM Applications?
Explore our LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI.

Key Takeaways

  • An LLM wiki is a structured, AI-maintained knowledge base that grows smarter every time you add a source — unlike RAG, which rediscovers knowledge from scratch on every query.
  • The pattern was introduced by Andrej Karpathy in a GitHub Gist in April 2026 and went viral among developers within days.
  • You can build your first LLM wiki in under 30 minutes using five free research papers, a folder on your computer, and Claude Code or Claude.ai

If you have ever uploaded a PDF to ChatGPT, asked a question, and then uploaded the same PDF again the next day to ask a follow-up, you already understand the problem an LLM wiki solves.

Most AI knowledge tools today are stateless. Every session starts from zero. Nothing you learn in one conversation carries over to the next. The model retrieves, answers, and forgets. Ask the same question tomorrow and it rebuilds the answer from scratch.

Andrej Karpathy, co-founder of OpenAI and former Director of AI at Tesla, proposed a different approach in April 2026. He called it an LLM wiki: a persistent, structured knowledge base that an AI agent actively builds and maintains, so that knowledge compounds over time instead of evaporating between sessions.

This tutorial walks you through exactly how to build one, using five foundational AI research papers as your starting material.

How does LLM Wiki by Andrej Karpathy work?
LLM Wiki By Andrej Karpathy

What Is an LLM Wiki and Why Does It Matter?

An LLM wiki is a folder of plain markdown files that an AI agent reads, writes, and maintains on your behalf. Each file is an entity page: a structured, Wikipedia-style entry for one concept, linked to related concepts using [[wiki-links]].
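
To make that concrete, here is a hypothetical entity page of the kind the agent might produce from the papers used in this tutorial. The exact headings are up to you; what matters is one concept per page and the [[wiki-links]].

# Attention Mechanism

Summary: lets a model weigh every token in a sequence against every other token, replacing recurrence as the main way information is mixed.

Introduced in: [[Attention Is All You Need]] (2017)

Related: [[Transformer Architecture]], [[BERT]], [[GPT-3]], [[Self-Attention]]

Contradictions: none flagged yet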

The key difference from every other knowledge tool is what happens when you add a new source.

In a standard RAG system (NotebookLM, ChatGPT file uploads, most enterprise tools), adding a new document means it gets indexed and sits alongside your other documents. When you ask a question, the system retrieves relevant chunks and generates an answer. The documents themselves never change. Nothing is synthesized. Nothing is connected.

In an LLM wiki, adding a new document triggers a compilation step. The agent reads the new source and the existing wiki, then:

  • Updates existing pages with new information
  • Creates new entity pages for concepts that appear for the first time
  • Adds [[wiki-links]] connecting the new concept to related ones already in the wiki
  • Flags contradictions between the new source and what was previously written

Over time, the wiki becomes a connected knowledge graph, not just a pile of documents. At 10 pages it answers basic questions. At 50 pages it starts synthesizing across ideas you never explicitly connected. At 100+ pages, it can answer questions where the answer doesn’t exist in any single source, because the answer lives in the relationships between pages.
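
If you later want to script that compilation step yourself rather than paste prompts by hand, the loop looks roughly like the sketch below. It is purely illustrative: call_model is a placeholder for whatever LLM client you use, and the tutorial that follows uses Claude.ai or Claude Code instead.

from pathlib import Path

RAW = Path("my-wiki/raw")    # source documents (this sketch assumes .md or .txt; PDFs need text extraction first)
WIKI = Path("my-wiki/wiki")  # compiled entity pages

def call_model(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM of choice and return its markdown output."""
    raise NotImplementedError

def compile_source(source: Path) -> None:
    # Give the model the new source plus the current wiki so it can update existing pages,
    # create new ones, add [[wiki-links]], and flag contradictions.
    existing = "\n\n".join(p.read_text() for p in sorted(WIKI.glob("*.md")))
    prompt = (
        "You maintain a wiki of one-concept-per-page markdown files.\n\n"
        f"Existing pages:\n{existing or '(none yet)'}\n\n"
        f"New source ({source.name}):\n{source.read_text(errors='ignore')}\n\n"
        "Update existing pages, add pages for new concepts, connect them with [[wiki-links]], "
        "and note any contradictions. Return every page as a block starting with a '# Title' heading."
    )
    # Naive parsing: one file per '# Title' block in the model's response.
    for block in ("\n" + call_model(prompt)).split("\n# ")[1:]:
        title = block.splitlines()[0].strip()
        (WIKI / f"{title.lower().replace(' ', '-')}.md").write_text("# " + block)

if __name__ == "__main__":
    WIKI.mkdir(parents=True, exist_ok=True)
    for src in sorted(RAW.glob("*.md")) + sorted(RAW.glob("*.txt")):
        compile_source(src)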

LLM Wiki vs RAG: What’s the Real Difference?

| | RAG | LLM Wiki |
|---|---|---|
| Knowledge persistence | None — stateless | Full — builds over time |
| Multi-document synthesis | Per query, from scratch | Pre-compiled into pages |
| Contradiction detection | No | Yes — flagged during compilation |
| Source traceability | High | Moderate (page-level) |
| Setup complexity | Low | Low–Medium |
| Best for | Quick Q&A on documents | Deep, growing research topics |

The tradeoff worth knowing: RAG is better when your data changes daily or when exact source traceability matters for every claim. LLM wiki is better when you are building expertise on a topic over weeks or months, and want the model to reason across your knowledge base rather than just retrieve from it.

What You Need Before You Start

Tools:

  • A computer with a folder you can access (Mac, Windows, or Linux)
  • Claude.ai account (free tier works for the tutorial) or Claude Code if you prefer the terminal
  • Obsidian: free markdown editor (optional but recommended for the graph view)

Files:

  • 5 research papers downloaded as PDFs (links in the next section)

Knowledge assumed:

  • You know how to create a folder on your computer
  • You know how to download a file from a URL
  • No coding required for the Claude.ai version of this tutorial

Estimated time: 25–35 minutes for your first wiki

Step 1: Download Your Starting Papers

For this tutorial, we are using five foundational AI research papers. They are ideal because they build on each other sequentially — the LLM will naturally create rich connections between concepts like attention, fine-tuning, scaling, and alignment.

All five are free on arXiv. Download each as a PDF and save them somewhere easy to find.

Paper 1: Attention Is All You Need (2017) The original transformer paper. Foundation for everything modern.

Paper 2: BERT (2018) Bidirectional transformers for language understanding — builds directly on attention.

Paper 3: GPT-3 (2020) Large language models as few-shot learners — introduces emergent capabilities at scale.

Paper 4: Foundation Models (2021) A broad survey tying together transformers, scaling, and downstream applications.

Paper 5: RLHF (2022) How GPT models are aligned using human feedback — the bridge to modern assistants.

Download Research Papers for LLM Wiki Tutorial
Research Papers added to /raw Folder

After this step you should have: Five PDF files saved to your computer.

Step 2: Create Your Folder Structure

Create a new folder anywhere on your computer — your Desktop, Documents, wherever makes sense. Name it my-wiki.

Inside it, create two folders:

my-wiki/
├── raw/
└── wiki/

  • raw/ is where you drop all your source files — PDFs, articles, notes. You never edit anything in here manually.
  • wiki/ is where the compiled entity pages live. The LLM writes here.

Now move your five downloaded PDFs into the raw/ folder.
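
If you prefer to do this from the terminal, the same structure can be created and verified with a few lines of Python; this is optional and only mirrors what you just did by hand.

from pathlib import Path

root = Path("my-wiki")
(root / "raw").mkdir(parents=True, exist_ok=True)
(root / "wiki").mkdir(parents=True, exist_ok=True)

# Confirm the five source PDFs made it into raw/
print(sorted(p.name for p in (root / "raw").glob("*.pdf")))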

LLM wiki folder structure with raw and wiki directories

After this step you should have: A folder structure with five PDFs sitting inside raw/.

Step 3: Run the Compilation Prompt

This is the core step, where the LLM wiki pattern actually kicks in.

Option A: Using Claude.ai (no terminal needed)

Open Claude.ai and upload all five PDFs at once using the attachment button. Then send this prompt:
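
(The exact wording is flexible; an illustrative version that covers the essentials looks like this.)

“Read all of the attached papers and build a wiki from them. Create one markdown entity page per key concept, with exactly one concept per page. Each page should include a short summary, an explanation in your own words, [[wiki-links]] to related concepts that also have pages, and a note on any contradictions you find between the papers. Output each page as a separate markdown block so I can save it as its own .md file.”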

That is genuinely all you need. Claude will generate one markdown entity page per key concept — each with a summary, an explanation, wiki-links to related concepts, and any contradictions it finds between the papers.

Copy each page into a .md file in your wiki/ folder.

Additionally: If you want more structure as your wiki grows, you can extend the prompt to also ask Claude to create an index.md listing every entity page with a one-line description, and a log.md tracking what was compiled and when. These become useful navigational tools once you have 30+ pages, but they are not needed to get started.

Option B: Using Claude Code (terminal)

If you have Claude Code installed, open a terminal, navigate to your wiki folder, and launch it:
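
Assuming a standard install where the claude command is on your PATH, that looks like:

cd my-wiki
claude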

Then paste the same prompt above. Claude Code will read the files directly and write the pages into wiki/ for you — no copy-pasting needed.

Claude Code prompt for creating LLM wiki
Entity pages created for LLM Wiki by Claude Code

After this step you should have: 10–20 markdown entity pages in your wiki/ folder.

Step 4: Open Your Wiki in Obsidian

Install Obsidian (free, no account needed). When it launches, click Open folder as vault and select your wiki/ folder.

Using Obsidian to create graphs for LLM Wiki

What to look at immediately:

Graph View — press Ctrl+G (or Cmd+G on Mac). You will see your entity pages as nodes, with [[wiki-links]] rendered as edges connecting them. After just five papers, you should see a small but meaningful graph — transformer architecture linking to attention mechanism, BERT linking to fine-tuning, RLHF linking to alignment and GPT.

Obsidian graph view of an LLM wiki showing linked entity pages on transformer concepts

After this step you should have: A visual, navigable knowledge graph in Obsidian.

Step 5: Add More Sources and Watch It Compound

Drop a new paper into raw/; any paper related to transformers, language models, or AI alignment works well. Then run the compilation prompt again, this time with a small addition:
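
(Illustrative wording; the substance is what matters.)

“A new source has just been added. Update the existing wiki: revise pages the new paper adds to, create new entity pages only for genuinely new concepts, add [[wiki-links]] between new and existing pages, and flag any claims in the new source that contradict what the wiki already says.”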

This is where the compound effect becomes visible. The new paper does not just create new pages; it enriches the pages already there. A page on “attention mechanism” that had two outgoing links might now have five. A claim that went unchallenged might now have a contradiction flagged.

Step 6: Run a Linting Pass

Every time your wiki reaches roughly 20 new pages, run this maintenance prompt:
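
(Again, an illustrative version; tune it to the issues you care about most.)

“Read every page in the wiki. Fix broken or missing [[wiki-links]], merge pages that cover the same concept, split pages that cover more than one, list any orphan pages that nothing links to, and list any unresolved contradictions so I can decide how to handle them.”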

This is the self-healing step. It is what keeps the wiki accurate as it grows, rather than slowly drifting into quiet inconsistency.

Tip: “Run linting after every 20 new pages, or any time you add a source that significantly updates a topic already in the wiki.”

After this step you should have: A clean, internally consistent wiki with no orphan pages and all flagged contradictions resolved or noted.

Common Mistakes to Avoid

Putting too much in one page. Each entity page should cover exactly one concept. If a page starts covering two ideas, split it. Dense single-concept pages create better links and better answers.

Never running linting. Small errors propagate fast in a wiki. A wrong claim on one page gets linked to by three others, and now you have organized misinformation. Run the audit pass regularly.

Adding too many unrelated topics at once. The wiki compounds best when sources are topically related. Starting with five papers on the same subject produces a richer graph than five papers on five different subjects.

Frequently Asked Questions

What is an LLM wiki? An LLM wiki is a personal knowledge base made of plain markdown files that an AI agent actively builds and maintains. Unlike RAG systems that search raw documents on every query, an LLM wiki pre-compiles knowledge into structured, interlinked entity pages — so answers compound over time instead of being rediscovered from scratch.

Who created the LLM wiki concept? Andrej Karpathy, co-founder of OpenAI and former Director of AI at Tesla, described the concept in a GitHub Gist published in April 2026. The post went viral in the developer community within days of publication.

Do I need to know how to code to build an LLM wiki? No. The Claude.ai version of this tutorial requires no coding — just uploading PDFs and pasting prompts. Claude Code makes the workflow faster and more automated, but it is not required to get started.

How is an LLM wiki different from Notion or Obsidian alone? Notion and Obsidian are tools for human-written notes — you organize and write everything yourself. An LLM wiki uses those same tools as the viewing interface, but the actual compilation, linking, and maintenance is done by the AI agent. You supply raw sources; the agent builds the structure.

How big can an LLM wiki get? Karpathy’s own wiki reached approximately 100 articles and 400,000 words before he noted that the LLM could still navigate it efficiently using the index and summaries. At that scale, the system was still faster and more accurate than a RAG pipeline for his research use case.

What file types work in the raw/ folder? PDFs work best for research papers. Markdown files work well for articles clipped from the web (the Obsidian Web Clipper browser extension converts any webpage to markdown automatically). Plain text, exported chat conversations, and .md notes all work. The LLM reads whatever you drop in.

What to Build Next

Once your first wiki is running, a few natural next steps:

  • Add the Obsidian Web Clipper browser extension. It converts any webpage to markdown and saves it directly to your raw/ folder. This makes ingesting articles as fast as bookmarking them.
  • Try topic-specific wikis. One wiki per research area tends to produce cleaner graphs than one giant wiki. Start a separate one for a new topic rather than mixing everything together.
  • Fine-tune on your wiki. At 100+ well-maintained pages, the wiki becomes a high-quality training set. You can eventually fine-tune a smaller model on it — turning your personal research into a custom private intelligence.

Ready to build robust and scalable LLM Applications?
Explore our LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI.