OraCore Editors

MemPalace’s 100% memory claim gets checked

MemPalace hit 11K GitHub stars fast, but its 100% LongMemEval claim fell to 84.2% under compression. The project is real; the marketing isn’t.


MemPalace pulled in more than 11,000 GitHub stars in 48 hours, which is a strong signal that people want better AI memory tools. The headline claim was even louder: a perfect 100% score on LongMemEval, a benchmark built to test long-term memory in AI systems. Independent checks later cut that number down to 84.2% once compression was actually enabled.

That gap matters because MemPalace is still interesting even after the hype gets trimmed. It is a local-first memory system with an MCP server, 19 tools, and an offline design built around a spatial “memory palace” model. The project looks useful. The scorecard around it needs a lot more honesty.

What MemPalace actually is


MemPalace is built around a simple idea: instead of storing chat history as one long flat log, it organizes memory into wings, halls, and rooms. That structure borrows from the ancient “method of loci,” where people remember facts by placing them in a mental building they can walk through.
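
To make the structure concrete, here is a minimal sketch of what a wings-halls-rooms hierarchy could look like in Python. The class names and fields are illustrative assumptions, not MemPalace’s actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Room:
    """Leaf node: a themed container for individual memories."""
    name: str
    memories: list[str] = field(default_factory=list)

@dataclass
class Hall:
    """Groups related rooms, like all UI-preference rooms."""
    name: str
    rooms: dict[str, Room] = field(default_factory=dict)

@dataclass
class Wing:
    """Top-level division of the palace, like 'work' or 'personal'."""
    name: str
    halls: dict[str, Hall] = field(default_factory=dict)

# A memory is addressed by a path through the palace, not a flat index.
work = Wing("work", halls={"projects": Hall("projects", rooms={
    "mempalace": Room("mempalace", memories=["Shipped the MCP server Friday"]),
})})
print(work.halls["projects"].rooms["mempalace"].memories)
```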


The project uses ChromaDB for retrieval and PyYAML for configuration and metadata handling. It also ships with an MCP server, which means it can plug into tools that speak the Model Context Protocol, including assistants like Claude and editors like Cursor.
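
To give a sense of what that retrieval layer involves, the sketch below stores and queries one memory through ChromaDB’s standard client API. The collection name and metadata fields are invented for illustration; only the chromadb calls themselves are real:

```python
import chromadb

# PersistentClient keeps the store on local disk, matching the offline-first design.
client = chromadb.PersistentClient(path="./palace_db")
rooms = client.get_or_create_collection(name="rooms")

# Each memory carries its palace coordinates as metadata.
rooms.add(
    ids=["mem-001"],
    documents=["User prefers dark mode in the editor"],
    metadatas=[{"wing": "preferences", "hall": "ui", "room": "theme"}],
)

# Semantic lookup: ChromaDB embeds the query and returns nearest neighbors.
results = rooms.query(query_texts=["what editor theme does the user like?"], n_results=5)
print(results["documents"])
```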

That offline-first angle is a big part of the appeal. A lot of AI memory products send data to a cloud service by default, which is convenient but a poor fit for anyone who wants local control. MemPalace keeps the memory store on the machine.

  • GitHub stars in 48 hours: 11,000+
  • Claimed LongMemEval score: 100% (500/500)
  • Verified compressed-mode score: 84.2%
  • Raw retrieval score cited by reviewers: 96.6% R@5
  • MCP tools included: 19

Why the benchmark claim broke down

LongMemEval is a real benchmark from UC San Diego that tests five long-term memory abilities across 500 questions. That gives the MemPalace story some real weight, because the benchmark is not made up and the task is hard enough to matter.

The problem is how the perfect score was presented. Independent reviewers found that the 100% result came after hand-patching the last three wrong answers and then rerunning the same dataset. That is classic overfitting to the test set. It may improve the demo, but it does not prove the system generalizes.

“The first principle is that you must not fool yourself and you are the easiest person to fool.” — Richard Feynman

That quote fits this story well. A benchmark can be real, a repo can be real, and the result can still be misleading if the evaluation is massaged after the fact. In AI tooling, the difference between “works in a demo” and “works in the wild” is often the entire story.

There is also a technical mismatch in the numbers. The 96.6% R@5 figure appears to come from ChromaDB’s default embedding retrieval, which means it measures nearest-neighbor lookup rather than the palace structure or the custom AAAK compression system. Once AAAK compression is actually used, the score drops to 84.2%, which directly contradicts the “lossless” framing.
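
For readers unfamiliar with the metric, R@5 only asks whether the right memory appears anywhere in the top five results. A minimal implementation, with hypothetical variable names, looks like this:

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant memory ID appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(retrieved, relevant) if target in ranked[:k])
    return hits / len(relevant)

# One hit out of two queries: "b" is retrieved for the first, "q" never surfaces.
print(recall_at_k([["a", "b", "c"], ["x", "y", "z"]], ["b", "q"]))  # 0.5
```

That makes the metric generous: a system can post a strong R@5 while compression quietly degrades what actually gets stored and returned.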

How MemPalace compares with other systems

MemPalace is not alone in chasing long-term memory benchmarks. The interesting part is that once you line up the numbers, the project is competitive without needing a perfect score.


Here is a cleaner comparison, based on published figures and reviewer notes:

  • Mastra: 94.87% on LongMemEval
  • OMEGA: 95.4%
  • agentmemory: 96.2%
  • MemPalace raw retrieval: 96.6% R@5
  • MemPalace with AAAK compression: 84.2%

Those numbers tell a useful story. MemPalace is in the conversation on retrieval quality, but the headline claim was doing more work than the actual system. If you strip away the perfect-score language, you are left with a solid local memory prototype that may be useful for people building agents, not a miracle benchmark winner.

The other comparison that matters is architectural. Most memory systems are just databases with some heuristics layered on top. MemPalace tries to mirror human recall with spatial organization, which is a more interesting design choice than a plain vector store. That does not make it better by default, but it does make it worth testing beyond one benchmark.

Milla Jovovich’s role and what the project proves

Milla Jovovich’s involvement is real. Her verified Instagram account and the GitHub history point to genuine participation, and her GitHub bio describes her as the “architect of the MemPalace.” That wording matters. It suggests direction and product vision, not a claim that she wrote every line herself.

Ben Sigman’s posts also hint at how the project came together. He said they created it with Claude, and his joke about “Multipass” makes it pretty clear that Claude Code did a lot of the implementation work. That is not a knock on the project. It is a sign of where AI-assisted development is now.

What MemPalace proves is narrower than the headlines suggest. A public figure with no known programming background can still help produce a functional AI tool in a few months if the workflow is good and the model does much of the coding. That is a real shift in who can ship software, and it is more interesting than the benchmark drama.

It also exposes a second lesson: if a project is genuinely useful, it does not need inflated numbers. The local-first setup, MCP support, and memory-palace interface are all strong ideas on their own. The 100% claim only made the story louder, and then it made the scrutiny harsher.

What developers should take from this

If you build AI agents, MemPalace is worth studying for the design, not the headline. The spatial memory model could be a better mental fit for some workflows than a flat recall stack, especially when users need to inspect, edit, or prune memories by topic or time.
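
Inspecting and pruning by topic is where metadata-scoped storage pays off. With a ChromaDB-style backend, clearing one wing, or anything older than a cutoff, is a single filtered delete; the metadata fields here are assumptions, not MemPalace’s real schema:

```python
import chromadb

client = chromadb.PersistentClient(path="./palace_db")
rooms = client.get_or_create_collection(name="rooms")

# Drop every memory filed under the "projects" wing.
rooms.delete(where={"wing": "projects"})

# Drop memories stored before a cutoff, assuming a numeric "stored_at" timestamp.
rooms.delete(where={"stored_at": {"$lt": 1_700_000_000}})
```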

The benchmark lesson is even more practical. If your evaluation can be nudged upward by patching a few answers and rerunning the same set, then the score is marketing, not evidence. That rule applies whether you are shipping an open-source repo, a startup demo, or an internal prototype.
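
One cheap guardrail: fingerprint the eval set and refuse to compare scores across runs where the fingerprint changed. This is a generic sketch, not anything MemPalace or LongMemEval ships:

```python
import hashlib
import json

def dataset_fingerprint(examples: list[dict]) -> str:
    """Stable hash of the eval set; if it changes, scores are no longer comparable."""
    blob = json.dumps(examples, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

test_set = [{"question": "Where did the user park?", "answer": "Level 3"}]
baseline = dataset_fingerprint(test_set)

# Before reporting any new score, prove the test set was not patched in between.
assert dataset_fingerprint(test_set) == baseline, "eval set changed between runs"
```

The deeper fix is procedural: iterate on a dev split, and touch the held-out test set exactly once.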

My read is simple: MemPalace should be remembered as a strong prototype wrapped in weak claims. The project may keep climbing because the idea is useful and the celebrity angle gives it reach, but the real test now is whether other builders can reproduce the architecture, stress it with fresh data, and keep the score honest.

That is the question worth watching next: can MemPalace hold up when people stop talking about 100% and start asking how it performs on new memory tasks, new models, and real user data?