Beyond RAG: Karpathy's LLM Wiki Pattern Explained (2026)
Hit RAG's limits? Andrej Karpathy's LLM Wiki pattern is a markdown-first alternative — 3-layer architecture, 95% token savings, exploding ecosystem in 2 weeks.
3 key points
- A single gist that Andrej Karpathy published in late March 2026 racked up 16 million views and became the reference point for “life after RAG.” The core idea is a file-over-app philosophy: drop the vector DB and incrementally compile knowledge into a markdown wiki.
- Thanks to the 3-layer separation (Raw / Wiki / Schema), reports cite roughly 95% token savings per query. Within two weeks of the gist going public, five derivative implementations had shipped — the ecosystem moved first.
- That said, scale ceilings, hallucination contamination, and skipped lints are real traps, and the “RAG with extra steps” criticism has weight. This article covers the concept, the architecture, and the limits as of April 2026.
Table of contents
- Why did Karpathy suddenly bring up 'LLM Wikis'?
- Why isn't RAG a fit for personal knowledge management?
- Why isn't a long context window enough?
- How does the LLM Wiki's 3-layer architecture work?
- What exactly is different between RAG and the LLM Wiki?
- How explosive has the LLM Wiki ecosystem been?
- What are actual users saying?
- Three limits of the LLM Wiki — what to watch out for
- If you're reading this now, where do you start?
- Frequently asked questions
Why did Karpathy suddenly bring up ‘LLM Wikis’?
In late March 2026, Andrej Karpathy posted a GitHub gist. The title was unremarkable: “How I use LLMs to manage my knowledge base.” Within weeks it had racked up 16 million views (source: remio.ai analysis). For a “how I use it” note from an OpenAI founding member and former Tesla AI director, that’s an unusual reaction.
Why a simple tip post turned into a paradigm fight
The argument in the gist boils down to one sentence: “RAG is a stateless system; personal knowledge management calls for stateful compilation” (source: Karpathy gist). RAG (Retrieval-Augmented Generation) chops source documents into chunks, stores them in a vector DB, and re-interprets whatever it retrieves on each query. Karpathy’s pattern instead has the LLM synthesize (compile) the material once at ingest time and store it as human-readable markdown. After that, you just read what’s there.
VentureBeat summarized the claim as “Karpathy shares an LLM knowledge base architecture that bypasses RAG” (source: VentureBeat coverage). The key word was “bypass” — not interpret, not replace, not complement. Bypass. Meaning the premise itself gets overturned.
The shape of the debate — supporters and critics
Supporters said the pattern “finally puts into words why personal-scale RAG felt so bad.” Critics said it was “a Semantic Web rerun, a structure that only works at small scale, dressed up as a RAG alternative.” Both sides came with data. Supporters cited the 84% reduction in context tokens reported on Reddit’s r/ClaudeCode. Critics pointed to the scale ceiling (around 400,000 words), beyond which a hybrid is ultimately required.
What this article aims to do
This article lays out what the Karpathy LLM Wiki pattern (LLM-KB, file-over-app) is, what problem it tries to solve, where it’s been validated, and where it breaks — as of April 2026. It is a concept piece, not a product walkthrough. Specific implementation tools (Graphify, llm-wiki-mcp, etc.) are mentioned only in the ecosystem timeline section.
Why isn’t RAG a fit for personal knowledge management?
The claim isn’t that RAG itself is wrong. For scenarios with low update frequency and a constrained query pool — enterprise FAQ chatbots, internal document search — RAG still works well. The trouble is in the kind of environment Karpathy assumed: a “personal research archive” where sources are heterogeneous and knowledge has to accumulate.
Stateless — re-interpreting from scratch every time
RAG runs a similarity search on every query, pulls chunks, and pushes them into the LLM. Ask the same question yesterday and today, and the LLM interprets the source like it’s seeing it for the first time, every time. Yesterday’s insight doesn’t automatically fold into today’s answer. remio.ai called this the “absence of compounding” (source: remio.ai). Add as many sources as you want — the knowledge graph doesn’t get denser.
Black box — vector DBs aren’t readable
A vector embedding is an opaque float array (1,536 dimensions in a typical OpenAI setup). You can’t open it up and confirm “ah, this document was understood like that.” It’s hard to track whether the index is wrong, whether chunk boundaries were cut badly, or how often retrieval misses happen. techbuddies.io flagged this as “non-auditability” (source: techbuddies.io). Worse, when retrieval fails, the system doesn’t say “couldn’t find it” — it fabricates a plausible answer. Silent failure.
A ceiling on synthesis depth — chunks are just chunks
The essence of personal knowledge management is “cross-referencing what multiple sources say about the same concept.” RAG grabs top-k chunks and stitches them together one-shot. If five papers contain different claims by the same author, RAG pulls some of those chunks and tells the LLM to “figure it out.” That’s closer to sampling than synthesis. Frontier-model gains don’t fix this either: even a jump like Gemma 4’s reported function-calling improvement from 6% to 86% can’t deepen synthesis when the input itself is thin.
Karpathy himself confirmed in a gist comment: “For scenarios like large-scale enterprise search or legal Q&A, RAG is still the right fit. I’m talking about personal-scale static archives” (source: Karpathy gist comment). The comparisons in this article are bounded to the ‘personal / small-scale knowledge management’ context.
Why isn’t a long context window enough?
“If a model has a 1M-token context, can’t I just dump everything in?” That’s the most common pushback. With Gemini 3.1 Pro’s 1M tokens, Claude Opus 4.7’s 1M context, and other frontier models rolling out serious long-context support, the objection comes up more often. But in practice you hit three walls at once.
Token cost isn’t linear
Doubling the context doubles the cost. Meanwhile, the LLM Wiki pattern reports roughly 95% token savings (source: levelup.gitconnected analysis). The reason is simple: instead of loading 400,000 source words on every query, you follow a pre-compiled summary page and 5–10 related wikilinks.
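To see how the savings get that large, here is a back-of-envelope sketch in Python. Every number below is an illustrative assumption, not a measured figure from the cited sources:

```python
# Back-of-envelope token math. All counts are illustrative assumptions.
RAW_WORDS = 400_000        # roughly the archive size discussed in this article
TOKENS_PER_WORD = 1.3      # rough English tokenization ratio

full_dump = RAW_WORDS * TOKENS_PER_WORD      # ~520,000 tokens on every query
wiki_path = 2_000 + 1_500 + 5 * 1_200        # index.md + summary page + 5 linked pages

print(f"full dump: {full_dump:,.0f} tokens/query")
print(f"wiki path: {wiki_path:,} tokens/query")
print(f"savings:   {1 - wiki_path / full_dump:.0%}")   # ~98% under these assumptions
```

The point is structural: the wiki path's cost is bounded by page size, while the full-dump path scales with the archive.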
The lost-in-the-middle problem
Research has repeatedly reproduced the finding that models miss material in the middle when the context is packed. Actually using 1M tokens requires a retrieval strategy — at which point a search layer enters the picture again. Long context isn’t a one-shot fix.
Permanent storage vs. ephemeral session
Information loaded into the context window disappears when the session ends. Tomorrow’s same task means loading it again. No compounding. The wiki, on the other hand, lives on disk. The next query starts from an already-synthesized state.
| Approach | Tokens per query | Persistence | Maintenance cost |
|---|---|---|---|
| RAG retrieval | Medium (top-k chunks) | Persistent vector index | Reindexing pipeline upkeep |
| Long-context full dump | Very high (full load every query) | Lost when session ends | Low (but cost explodes) |
| LLM Wiki compile | Low (summary + wikilinks) | Permanent markdown on disk | Heavy investment in lint/ingest |
How does the LLM Wiki’s 3-layer architecture work?
The substantive contribution of Karpathy’s gist is the ‘3-layer separation principle.’ Each layer has a different owner and different write permissions.
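Put concretely, the resulting layout looks something like this. The root name and comments are illustrative; the file roles are as described in this section:

```
knowledge-base/
├── raw/        # source of truth: PDFs, markdown, transcripts, images (read-only)
├── wiki/       # LLM-compiled pages: markdown + YAML frontmatter + [[wikilinks]]
├── index.md    # category-grouped page list with one-line summaries
└── CLAUDE.md   # schema layer: the ingest / query / lint contract
```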
Raw layer — the source of truth, hands off
The raw/ directory holds PDFs, markdown, transcripts, and images in their original form. Use Obsidian Web Clipper to clip a web article in full, and the images come down locally too. This layer is read-only. If you decide it’s been corrupted, you wipe the wiki layer and rebuild from raw. It plays the same role as the original commits in a Git history.
Wiki layer — the LLM’s synthesized knowledge
The wiki/ directory accumulates compiled pages as markdown + YAML frontmatter + [[wikilinks]]. When a new source comes in, the LLM modifies 10–15 pages simultaneously in a single ingest pass. It writes the summary page, creates or updates entity/concept pages, plants cross-reference links, flags any contradictions it finds, and updates index.md. That’s what “compile” really means.
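For concreteness, a compiled page might look like the sketch below. The gist pattern, as described here, specifies markdown plus YAML frontmatter plus wikilinks; the specific field names, apart from last_verified (which reappears in the limits section), are illustrative assumptions:

```markdown
---
title: retrieval-augmented-generation
aliases: [RAG]
sources: [raw/rag-survey.pdf, raw/karpathy-gist.md]
last_verified: 2026-04-02
---

Retrieves top-k chunks from a vector index at query time. Contrast with the
compile-at-ingest approach of the [[llm-wiki-pattern]].

> Contradiction: [[source-a]] and [[source-b]] disagree on whether query-time
> retrieval fits personal-scale archives.
```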
Schema layer — the rules humans and LLMs grow together
CLAUDE.md or AGENTS.md holds the workflow contract: ingest / query / lint order, naming rules, directory structure, conventions like “how to flag contradictions when found.” This file already exists in the dev tools ecosystem. Claude Code’s CLAUDE.md system is a prime example. The anchor-file pattern for telling an LLM about codebase conventions has been transplanted directly to knowledge work.
index.md — the search mechanism without a vector DB
The special file index.md collects category-grouped page lists with one-line summaries. When a query comes in, the LLM reads index.md first, then drills down into the relevant page. That’s the answer to “how does search work without a vector DB?” The catch: the practical limit is reportedly around 100–200 pages (source: rohitg00 extension gist). Beyond that, index.md itself bloats and has to be split.
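A minimal sketch of that query path in Python, assuming the layout shown earlier. The keyword matcher is a deliberately naive stand-in for the LLM's judgment about which index entries are relevant:

```python
from pathlib import Path

KB = Path("knowledge-base")  # assumed layout from the tree above: index.md at the root

def pick_relevant_pages(index_text: str, question: str) -> list[str]:
    """Naive stand-in for the LLM's judgment. Assumes index lines shaped like
    '- [[page-name]]: one-line summary' and keyword-matches against the question."""
    words = set(question.lower().split())
    hits = []
    for line in index_text.splitlines():
        if "[[" in line and words & set(line.lower().split()):
            hits.append(line.split("[[")[1].split("]]")[0] + ".md")
    return hits

def drill_down(question: str) -> str:
    """Two-step query: scan index.md first, then load only the pages it names."""
    index = (KB / "index.md").read_text()
    pages = pick_relevant_pages(index, question)
    if not pages:
        return "No matching page in index.md"  # an explicit miss, not a silent failure
    # Only these few pages enter the model's context; that is where the savings come from.
    return "\n\n".join((KB / "wiki" / p).read_text() for p in pages)
```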
| Layer | Owner | File format | Change rule |
|---|---|---|---|
| Raw | Human | PDF / MD / images / transcripts | Read-only, original preserved |
| Wiki | LLM | Markdown + frontmatter + [[wikilinks]] | LLM modifies 10–15 files simultaneously on ingest |
| Schema | Human + LLM (co-evolved) | CLAUDE.md / AGENTS.md | Manual commit when contract changes |
What exactly is different between RAG and the LLM Wiki?
Now that the architecture is clear, on to the head-to-head. Three axes: token cost, accuracy, maintenance.
Token cost — why 95% savings is possible
Reports cite roughly 95% token savings versus RAG (source: levelup.gitconnected analysis). The reason is structural. RAG re-encodes top-k chunks (usually 5–10) on every question. The LLM Wiki only follows 1–3 already-synthesized summary pages plus relevant wikilinks, so the input shrinks to a single-digit percentage of the raw load. Among the derivative implementations, one benchmark reported 71.5x fewer tokens per query vs. loading the full raw files (source: Analytics Vidhya analysis).
Accuracy — retrieval miss disappears
When RAG can’t pull the answer in top-k, it fails silently. With the LLM Wiki, if index.md misses a relevant page, the LLM can at least confirm “that page doesn’t exist.” Within the context limit, recall is 100%. The catch: once you exceed that limit, a search layer is needed again — covered in the limits section.
Maintenance — self-healing lint
The LLM Wiki periodically runs a ‘lint mode’: merge duplicate entity pages, repair broken wikilinks, refresh stale summaries, clean up contradiction flags. The process self-heals the knowledge graph. To do the equivalent with RAG, you re-run the reindexing pipeline — and embedding-level errors still aren’t visible.
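Parts of lint need the LLM in the loop (merging duplicates, refreshing summaries), but link integrity can be enforced mechanically. A minimal sketch, assuming the layout from the earlier sketches:

```python
import re
from pathlib import Path

WIKI = Path("knowledge-base/wiki")  # assumed layout from earlier sketches

def lint_wikilinks() -> list[str]:
    """Report [[wikilinks]] that point at pages which don't exist."""
    pages = {p.stem for p in WIKI.glob("*.md")}
    problems = []
    for page in WIKI.glob("*.md"):
        for target in re.findall(r"\[\[([^\]|#]+)", page.read_text()):
            if target.strip() not in pages:
                problems.append(f"{page.name}: broken link [[{target.strip()}]]")
    return problems

if __name__ == "__main__":
    for issue in lint_wikilinks():
        print(issue)
```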
| Dimension | RAG | LLM Wiki |
|---|---|---|
| State | Stateless (rediscovered each query) | Stateful (incrementally compiled) |
| Storage format | Opaque vector embeddings | Human-readable markdown |
| Search pipeline | Similarity search (vector DB) | index.md scan → drill-down |
| Infra complexity | Vector DB + embedding model + chunker | Filesystem + LLM agent |
| Token cost | Medium (top-k chunks re-encoded per query) | ~95% savings vs raw load |
| Reliability | Silent failure on retrieval miss | 100% recall within context limit |
| Maintenance | Reindexing pipeline | Active LLM self-healing lint |
| Auditability | Hard (vector black box) | Transparent (file diff) |
| Synthesis depth | Shallow (one-shot top-k) | Deep (incremental compile) |
This table isn’t a ‘tech performance comparison’ — it’s a ‘data model comparison.’ RAG solves knowledge management as a search problem. The LLM Wiki solves it as a compilation problem. That framing difference determines all nine rows of the table.
How explosive has the LLM Wiki ecosystem been?
The theory was in the gist. Implementation was the issue. And yet within two weeks at least five derivative implementations had shipped — concrete evidence of “what you can do with this pattern.”
Watch the narrative speed
Within 48 hours of the gist going up, the first serious implementation was on GitHub (source: viral post). In days, the projects branched out in different directions: MCP-server packaging, automatic Claude Code session compilation, multi-agent extensions, agentmemory integration. The ecosystem became a speed contest over “who grabs the slot first.”
A map of the derivatives — what angles split off
Each implementation emphasizes a different facet of the same pattern. Part 1 only covers the per-category positions. Hands-on installation and use is covered in depth in Part 2 of this series.
| Implementation | Axis (angle of attack) | One-line take |
|---|---|---|
| safishamsi/graphify | Multimodal + graph topology | Leiden community detection clusters code, PDFs, images, video; ships as a CLI skill |
| llm-wiki-mcp (npm) | Protocolization | MCP server for native Claude Desktop integration |
| Pratiyush/llm-wiki | Auto-compile session logs | Claude Code conversations get loaded directly as wiki pages |
| redmizt/multi-agent-wiki-toolkit | Multi-agent safety | Contamination firewall blocks cross-agent pollution |
| rohitg00/llm-wiki-v2 | Long-term memory layer | agentmemory layer keeps past context alive |
The ecosystem’s flagship — a quick look at Graphify
Of the five, the project the community piled into first is safishamsi/graphify. Three things drew attention. First, the input range is broad — not just code but PDFs, images, and video, automatically converted into a knowledge graph. Second, where other derivatives stop at a set of markdown files, Graphify builds graph topology with Leiden community detection — compressing meaning at the cluster level rather than via simple wikilinks. Third, it plugs straight into Claude Code, Cursor, and Gemini CLI as a “skill.” Layered on top of existing workflows instead of replacing them. These three traits combined to push it past 30,000 GitHub stars in two weeks.
This article is the concept piece, though. Part 2 of this series owns the actual Graphify install, folder design, first-ingest verification, Korean-source pipeline, and how to measure whether it really is a RAG alternative. For now, just remember that “Graphify is one axis of this ecosystem.”
Why so fast
The Karpathy gist itself already contained all the implementation hints. The 3-layer directory structure, index.md drill-down, lint routine, wikilink syntax — a blueprint you could just follow. No original research required, low enough complexity for an engineer to carve out over a weekend, so the bar to entry was low and the contest became a sprint.
What are actual users saying?
- "I compiled six months of stale research notes and a pattern showed up — I'd reached the same conclusion several times and forgotten it. A connection search would never have surfaced." — Reddit r/learnmachinelearning
- "I automated it as a Claude Code plugin and context tokens dropped 84%. The cost difference is tangible." — Reddit r/ClaudeCode (103 upvotes)
- "Manual note-taking has basically vanished. The conversation becomes the wiki page. A real paradigm shift." — Reddit r/ClaudeCode plugin showcase
- "It still doesn't solve the contradiction-structure problem the Semantic Web has failed at for 30 years. Just a renamed rerun." — Hacker News comment
- "There's a contradiction — it only works at small scale. At scale you end up bolting vector search onto RAG, so isn't this just RAG with extra steps?" — Reddit r/MachineLearning
- "Having an LLM write your wiki for you is worthless for your own growth. The act of organizing is the learning." — Reddit comment (75 upvotes)
The shared keyword in positive reactions — ‘compounding’
The common thread in the positive reports is the felt sense that “it gets stronger over time.” Many testimonies say the first week is underwhelming, but answer quality starts shifting after 3–6 months.
The shared keyword in negative reactions — ‘self-deception’
The criticism, in contrast, converges on “is this really new?” Semantic Web rerun, RAG with extra steps, learning outsourcing. Three branches with the same root: suspicion that the packaging is overblown.
Three limits of the LLM Wiki — what to watch out for
This section isn’t neutral — it’s closer to defense. The traps you should know before adopting the pattern.
Scale ceiling — it breaks at 400,000 words
Karpathy’s own usage is around 100 documents, roughly 400,000 words (source: techbuddies.io). Reports of index.md-based search hitting its limit cluster around that figure. Beyond it, a hybrid (markdown + vector search) becomes necessary — and that’s the foundation of the “RAG with extra steps” critique.
Hallucination contamination becomes permanent — a wrong fact, once written, sticks
If the LLM records a wrong fact on a wiki page during compilation, every subsequent query references that error. RAG at least re-interprets from the source each time, so contamination is one-shot. The wiki makes it permanent. Periodic lint + cross-checking against the raw source + Silent Drift detection (freshness metadata) is essential.
Lint skipping — the first step people drop
In DIY implementations, the most commonly skipped step is lint. Ingest is fun; cleanup is boring. But a wiki without lint collapses under internal contradictions and duplication within three months. That’s why the rohitg00 extension highlights lint automation as a core feature.
| Limit | Symptom | Mitigation |
|---|---|---|
| Scale ceiling (~400k words) | index.md bloat, rising drill-down failure rate | Sub-indexes per category, run hybrid vector search alongside |
| Permanent hallucination contamination | Wrong entity pages get referenced repeatedly | Periodic lint + raw-source diff + freshness metadata |
| Silent Drift | Stale information returned with confidence | Per-page last_verified field + expiry alerts |
| Lint skipping | Duplicates and contradictions degrade the wiki | Force lint into the ingest workflow |
| Missing evaluation metrics | Can't measure health | Awaiting community standard; for now use coverage / link density as proxies |
The scale ceiling is a problem for later. Contamination accumulates from week one. Right after ingest, on the same day, make a habit of doing a first-pass audit against the raw source. The cost of fixing it later grows exponentially.
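The freshness mitigation from the table above (a last_verified field plus expiry alerts) is easy to mechanize. A minimal sketch, assuming each wiki page carries a last_verified date in its YAML frontmatter:

```python
import re
from datetime import date, timedelta
from pathlib import Path

WIKI = Path("knowledge-base/wiki")   # assumed layout from earlier sketches
MAX_AGE = timedelta(days=90)         # illustrative expiry window

def stale_pages() -> list[str]:
    """Flag pages whose last_verified date is missing or past the expiry window."""
    stale = []
    for page in WIKI.glob("*.md"):
        m = re.search(r"^last_verified:\s*(\d{4}-\d{2}-\d{2})", page.read_text(), re.M)
        if m is None or date.fromisoformat(m.group(1)) < date.today() - MAX_AGE:
            stale.append(page.name)  # a missing date counts as stale, too
    return stale

if __name__ == "__main__":
    print("re-verify against raw/:", ", ".join(stale_pages()) or "none")
```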
If you’re reading this now, where do you start?
The one-line takeaway
The LLM Wiki is a ‘knowledge compilation’ paradigm, not a ‘search optimization’ one. For personal and small-scale archives, it leads RAG on token efficiency and auditability — but if you don’t dodge the three traps (the 400k-word ceiling, hallucination contamination, lint skipping), it self-destructs in three months. Starting takes 30 minutes — make the raw / wiki / schema folders.
- Anyone with hundreds of personal research notes, papers, and bookmarks where search is increasingly failing
- Developers and researchers already comfortable in Claude Code / Cursor / Obsidian environments
- Anyone who built a RAG-based personal knowledge bot and felt “why are the answers so shallow?”
- Conversely, not recommended for beginners, writing learners, or those who want to “internalize knowledge through the act of organizing.” The learning-outsourcing critique applies precisely to this group.
Three things you can do today
Set up an Obsidian vault + 3 folders
Create just the three: raw/, wiki/, and index.md. The Web Clipper plugin alone is enough. 30 minutes start to finish.
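If you'd rather script the setup, here is a minimal scaffold sketch. The starter contents for index.md and CLAUDE.md (the contract from step three below) are illustrative assumptions, not the gist's wording:

```python
from pathlib import Path

KB = Path("knowledge-base")  # pick any vault location; the name is illustrative

def scaffold() -> None:
    """Create the raw / wiki / schema skeleton described in this article."""
    (KB / "raw").mkdir(parents=True, exist_ok=True)
    (KB / "wiki").mkdir(exist_ok=True)

    # index.md: the drill-down entry point, empty but categorized from day one.
    (KB / "index.md").write_text("# Index\n\n## Uncategorized\n")

    # CLAUDE.md: a stub of the schema-layer contract (see step three below).
    (KB / "CLAUDE.md").write_text(
        "# Knowledge base contract\n\n"
        "1. ingest: compile new raw/ sources into wiki/ pages, update index.md\n"
        "2. query: read index.md first, then drill into linked pages only\n"
        "3. lint (weekly): merge duplicates, fix broken [[wikilinks]], "
        "refresh stale summaries\n\n"
        "Naming: lowercase-hyphenated page names; flag contradictions inline.\n"
    )

if __name__ == "__main__":
    scaffold()
```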
Pick one derivative implementation and run an ingest test
From the five in the ecosystem, pick the one that fits your editor and workflow, then ingest 10–20 of your existing notes. Always compare the first compile result against the raw source to check for hallucinations. Per-implementation install and hands-on guides are covered in Part 2 of this series.
Draft your CLAUDE.md and lock in a weekly lint
In about 100 words, write the ingest/query/lint order and naming rules. Run a lint session on the same day every week to clean up duplicates and contradictions. Without this routine, the wiki collapses in three months.
From Part 1 to Part 2 — what’s covered, what’s next
That covers “why the LLM Wiki appeared, what its structure is, and why it’s blowing up now” — the theory side. The question readers usually feel at this point is the same: “I get the concept. So how do I plug it into my own folders?” That question is the starting point of Part 2.
This is Part 1 — the concept piece — of the Karpathy LLM Wiki series. Parts 2 and 3 turn the things only mentioned here into something concrete.
- Part 2 — Building Graphify in practice (read now): Take Graphify, only sketched here, and actually build a wiki with it. Folder design (raw/, wiki/, index.md), the /graphify command flow, how to verify hallucinations by comparing the first ingest against the raw source, the Korean-source (HWP, PDF, news clippings) pipeline, and integration with Claude Code’s CLAUDE.md — followed step by step at the screenshot level.
- Part 3 — Six-month operational report (forthcoming): Validates the “three limits” section above with real-world data. What actually breaks at the 400k-word ceiling, how hallucination contamination gets discovered and recovered, and how long it takes for the wiki to collapse when lint is skipped.
For now, just create the three folders — raw/, wiki/, index.md — and you can pick up directly in Part 2.
Sources
- Karpathy original gist (16M views; the comments are as valuable as the body)
- VentureBeat coverage (industry context)
- levelup.gitconnected analysis on 95% token savings (measured numbers)
- rohitg00 extension gist (measured 100–200 page index.md ceiling)