Benchmarks
We measure before we claim. And we publish our failures.
Most memory systems can store facts. The MEME benchmark (KAIST AI · Tübingen · NAVER, 2026) — the first evaluation built around evolving organizational memory — showed that storage was never the hard part. Across 100 episodes and six tasks, practical-cost systems handle exact recall and aggregation reasonably. Then the benchmark asks two questions that organizations ask every day, and the field collapses.
The two tasks that matter
Cascade — when a fact changes, does everything that depends on it change too?
A vendor's contract is renegotiated. The budget line that referenced the old terms, the project plan built on the old timeline, the commitment made downstream — do they update, or does the system keep answering from a world that no longer exists? Field average at practical cost: 3%.
Absence — does the system know what it doesn't know?
The most dangerous output of a memory system is not a wrong answer; it is a confident restatement of a stale fact. Absence measures whether a system can say "this is no longer supported" instead of fluently asserting yesterday. Field average at practical cost: 1%.
These numbers are the benchmark authors', not ours, and they are the honest state of the category. Behind the marketing, almost every memory system in production today fails the two tasks that define organizational reality: things change, and knowing that you don't know is worth more than sounding sure.
Why N71 should be tested exactly here
These two tasks are not incidental to our architecture — they are what it was built for. Cascade is a supersession problem: N71 records facts with temporal lifecycle, invalidates rather than deletes, and links dependent assertions through an explicit graph (TR-2026-01 §3.4). Absence is a refusal problem: N71's answer protocol validates every citation against anchored sources, strips what it cannot verify, and returns an explicit, machine-readable refusal when the corpus cannot support an answer — by design, it fails closed (TR-2026-01 §5).
A system that claims those properties should be willing to be measured on them. We are.
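As a rough illustration of the supersession behaviour described above, the Python sketch below closes the old version of a fact instead of deleting it, then marks everything that depends on it as stale. The `Fact` and `FactStore` names, the `stale` flag, and the traversal are assumptions made for this example only; N71's actual design is the one specified in TR-2026-01 §3.4.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None            # None means "still current"
    depends_on: list = field(default_factory=list)
    stale: bool = False                        # basis changed since this was asserted


class FactStore:
    """Hypothetical store: every key keeps its full version history."""

    def __init__(self):
        self.history = {}                      # key -> list of Fact, oldest first

    def current(self, key):
        versions = self.history.get(key, [])
        if versions and versions[-1].valid_to is None:
            return versions[-1]
        return None

    def assert_fact(self, key, value, on, depends_on=()):
        old = self.current(key)
        if old is not None:
            old.valid_to = on                  # invalidate, never delete
        self.history.setdefault(key, []).append(
            Fact(key, value, on, depends_on=list(depends_on)))
        self._mark_dependents_stale(key, {key})

    def _mark_dependents_stale(self, changed_key, seen):
        # Walk the dependency links transitively; `seen` guards against cycles.
        for versions in self.history.values():
            fact = versions[-1]
            if (fact.valid_to is None and fact.key not in seen
                    and changed_key in fact.depends_on):
                fact.stale = True
                seen.add(fact.key)
                self._mark_dependents_stale(fact.key, seen)


store = FactStore()
store.assert_fact("vendor_contract", "net-60 terms", date(2026, 1, 5))
store.assert_fact("budget_line", "payment due in 60 days", date(2026, 1, 6),
                  depends_on=["vendor_contract"])
store.assert_fact("project_plan", "launch after first payment", date(2026, 1, 7),
                  depends_on=["budget_line"])

# The contract is renegotiated: the old version is closed, dependents go stale.
store.assert_fact("vendor_contract", "net-30 terms", date(2026, 3, 1))
```

After the last call, the old contract terms are still in the history with an end date, and both the budget line and the project plan carry `stale=True` until something re-derives or re-asserts them.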
Our results
Runs of June 12–13, 2026 — the full 100-episode suite, the benchmark's published episodes and judges, through our production pipeline. Two protocol iterations, both shown.
| System | Exact Recall | Aggregation | Tracking | Deletion | Cascade | Absence | Overall |
|---|---|---|---|---|---|---|---|
| N71 — iteration 1 · June 12, 2026 | 0.99 | 0.12 | 0.50 | 0.56 | 0.55 | 0.35 | 0.512 |
| N71 — iteration 2 · June 13, 2026 (membership capture + dependency recall) | 0.98 | 0.29 | 0.54 | 0.50 | 0.48 | 0.35 | 0.522 |
| MD-flat — best published memory system | 0.94 | 0.45 | 0.77 | 0.25 | 0.06 | 0.05 | 0.42 |
| gpt-4.1-mini — full transcript in context | 1.00 | 0.27 | 0.69 | 0.45 | 0.03 | 0.04 | 0.36 |
| text-embedding-3-small — vector RAG | 0.96 | 0.33 | 0.46 | 0.17 | 0.04 | 0.00 | 0.33 |
| Sonnet 4.6 — full transcript in context | 0.50 | 0.21 | 0.58 | 0.39 | 0.05 | 0.35 | 0.32 |
| Mem0 — LLM-extracted memory | 0.67 | 0.35 | 0.43 | 0.21 | 0.03 | 0.00 | 0.28 |
| BM25 — sparse retrieval | 1.00 | 0.05 | 0.16 | 0.27 | 0.02 | 0.00 | 0.25 |
| Karpathy Wiki — file agent + compiled KB | 0.11 | 0.18 | 0.27 | 0.03 | 0.01 | 0.02 | 0.10 |
| Graphiti — temporal knowledge graph | 0.03 | 0.01 | 0.04 | 0.09 | 0.02 | 0.01 | 0.03 |
| Field average — six memory systems | 0.62 | 0.23 | 0.35 | 0.17 | 0.03 | 0.01 | 0.24 |
| MD-flat × Opus 4.7 — ≈70× cost reference | 0.60 | 0.80 | 0.20 | 0.80 | 0.32 | 0.59 | 0.55 |
Both N71 rows are the full 100-episode suite, scored by the benchmark's published judge on the strict metric the paper uses for the dependency tasks — credit only when the system knew the fact before the change and handled it correctly after (MEME §4, trivial-pass filtering). Iteration 1 is our standing result and the figures we cite; iteration 2 is the most recent protocol change, shown in full including where it regressed. Bold marks iteration 1's standing best per task. Baseline rows: MEME, Table 2 (Jung et al., arXiv:2605.12477), all run end to end on gpt-4.1-mini. The 70× reference is the study's frontier-model file agent, which its authors describe as not deployable today. Run artifacts for both iterations are published below for reproduction.
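To make the strict metric concrete, here is a minimal Python sketch of the difference between crediting every correct post-change answer and crediting only real passes. The record shape and function names are hypothetical, and the choice to keep trivial passes in the denominator is an assumption of this sketch; the authoritative definition is the one in MEME §4.

```python
from dataclasses import dataclass


@dataclass
class DependencyResult:
    knew_before: bool     # judge accepted the answer to the pre-change probe
    correct_after: bool   # judge accepted the answer after the change


def lenient_score(results):
    """Credits any correct post-change answer, including trivial passes."""
    return sum(r.correct_after for r in results) / len(results)


def strict_score(results):
    """Credits only real passes: known before the change and correct after it."""
    return sum(r.knew_before and r.correct_after for r in results) / len(results)


results = [
    DependencyResult(knew_before=True, correct_after=True),    # real pass
    DependencyResult(knew_before=False, correct_after=True),   # trivial pass
    DependencyResult(knew_before=True, correct_after=False),   # real failure
    DependencyResult(knew_before=False, correct_after=False),  # never knew it
]
```

On this toy set the lenient score counts two passes and the strict score counts one, which is why a system that never stored the original fact cannot score on a dependency task by guessing the updated value.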
What the table says. Iteration 1 has the highest overall accuracy of any practical-cost memory system in the study: above the best published system (0.42), above both no-memory full-transcript baselines, and within 0.04 of the 70× frontier reference (0.55) at roughly 1/70th the cost. The margins sit exactly where the field collapses: Cascade at 0.55 against a 3% field average and a 6% prior best (nine times the prior state of the art), Absence at 0.35 against a 1% field average, and Deletion at 0.56, more than double the best prior memory system (0.27). Every dependency-task pass is a strict real pass: the system knew the fact before the change and handled it correctly after.
What iteration 2 says (published in full, including a regression). We said in our last run that we would fix Aggregation, and we did: 0.12 → 0.29, more than double, by capturing membership and affiliation facts onto the entity's card instead of leaving them stranded as graph edges the answer path never read. But the same release traded ground on two dependency tasks: Cascade fell 0.55 → 0.48 and Deletion 0.56 → 0.50, while Absence held at 0.35. Overall edged up to 0.522, a 0.010 move that is within run-to-run noise, so we do not count iteration 2 as progress. It moved a task we cared less about and went backwards on Cascade, one of the two tasks we care about most, and on Deletion. We have diagnosed the cause (a field-naming drift introduced by the same change: the new value lands on a sibling field, so a cascade rule fires against a field the answer no longer reads) and the fix is in progress. Iteration 1 remains the result we cite until a run beats it on Cascade and Absence, not just on the average.
Where we lose, published on purpose. Three losses, each with its mechanism named.
Aggregation (0.12 in iteration 1) was our worst number, and we knew exactly why. Every Aggregation question asks for a complete multi-part list — hobby, sport, and club membership. Iteration 1 reliably surfaced the first two and dropped the third: membership facts were captured as relationships in the graph (User → member_of → Book Club) rather than as facts on the user's card, so the answer surface assembled two of three and never saw the rest. Retrieval wasn't failing — the fact was filed in a drawer the answer path didn't read for this question shape. The fix was specific and already designed: fold relationship-shaped personal facts (memberships, affiliations) into the card's attribute capture as accumulating set entries, and surface member_of edges into answer context. Iteration 2 shipped that fix and it worked — Aggregation 0.12 → 0.29. We are publishing it alongside iteration 1, not replacing it, because the same release regressed Cascade and Deletion; that trade, and the field-drift cause behind it, is described above and is the focus of the next iteration.
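The Aggregation failure and its fix can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production capture path: `build_card`, `answer_list`, the `memberships` field name, and the sample values are all hypothetical.

```python
def build_card(attributes, edges, fold_memberships):
    """Assemble the answer-facing card for one entity.

    attributes: (field, value) pairs captured directly on the card
    edges:      (subject, relation, object) triples held in the graph
    """
    card = {}
    for name, value in attributes:
        card.setdefault(name, set()).add(value)   # accumulating set per field
    if fold_memberships:
        # The fix: surface member_of edges as set entries on the card.
        for subject, relation, obj in edges:
            if subject == "User" and relation == "member_of":
                card.setdefault("memberships", set()).add(obj)
    return card


def answer_list(card):
    """Answer a 'hobby, sport, and club membership' question from the card only."""
    return sorted(value
                  for name in ("hobby", "sport", "memberships")
                  for value in card.get(name, ()))


attributes = [("hobby", "pottery"), ("sport", "tennis")]
edges = [("User", "member_of", "Book Club")]

before = answer_list(build_card(attributes, edges, fold_memberships=False))
after = answer_list(build_card(attributes, edges, fold_memberships=True))
```

With folding off, the answer contains two of the three items even though the membership edge exists in the graph; with folding on, the same data yields the complete list.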
Tracking (0.50) trails the study's best file agent (0.77) — full version-chain retrieval is an iteration target.
Absence (0.35) is the best result of any practical-cost memory system in the study, matched only by the Sonnet 4.6 full-transcript baseline, and it is not where we want it. A note on how it got here, because the path is a lesson in sample sizes: in six-episode iteration runs, Absence moved from 0.22 to 0.89 as we shipped conditional refusal (derive when a dependency rule determines the new value, refuse when none does). On the full 100-episode suite it lands at 0.35. That spread is why we run the full suite before claiming anything, and why any memory benchmark number published without a disclosed sample size and methodology should be read accordingly. The full analysis of the propagation–refusal trade is in TR-2026-01 §5.
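The conditional-refusal rule mentioned above (derive when a dependency rule determines the new value, refuse when none does) can be sketched as follows. The function name, the response shape, and the `no_longer_supported` reason code are assumptions for illustration; the real protocol is described in TR-2026-01 §5.

```python
def answer_dependent(value, basis_changed, rule=None, new_basis=None):
    """Decide how to answer a question about a fact that depends on another fact."""
    if not basis_changed:
        return {"status": "answered", "value": value}
    if rule is not None:
        # A dependency rule determines the new value: propagate it.
        return {"status": "derived", "value": rule(new_basis)}
    # No rule covers the change: fail closed rather than restate the old value.
    return {"status": "refused", "reason": "no_longer_supported"}


def due_days_from_terms(terms):
    """Hypothetical rule: 'net-30' means payment is due in 30 days."""
    return int(terms.split("-")[1])


derived = answer_dependent(60, basis_changed=True,
                           rule=due_days_from_terms, new_basis="net-30")
refused = answer_dependent("launch in Q3", basis_changed=True)
unchanged = answer_dependent("launch in Q3", basis_changed=False)
```

The design choice worth noting is the last branch: a stale value is never returned as a fallback, so every answer is either current, derived from the new basis, or an explicit refusal.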
Where we're headed next
We are publishing this mid-stream on purpose. Here is what we are doing next and where we expect it to land — stated before the run, so the next numbers can be checked against the claim.
Recover the regression — at no quality cost. The iteration-2 Cascade and Deletion drop is a field-naming drift, not a lost capability: the propagation machinery still fires; it fires against a field the answer path stopped reading. Fixing the field anchoring restores iteration 1's dependency-task levels while keeping the Aggregation gain. Expected overall: ≈0.56–0.57.
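A toy Python reproduction of this failure mode: the cascade rule still fires, but it writes to a sibling field the answer path does not read. The field names, the `CANONICAL` table, and the `anchor` helper are hypothetical, and treating "field anchoring" as a canonical-name lookup is an assumption of this sketch.

```python
def apply_cascade(card, new_employer, offices, target_field):
    """Cascade rule: when the employer changes, the office city follows."""
    card["employer"] = new_employer
    card[target_field] = offices[new_employer]


def answer_office(card):
    return card["office_city"]           # the only field the answer path reads


# Hypothetical anchoring table: sibling names resolve to one canonical field.
CANONICAL = {"office_location": "office_city"}


def anchor(field_name):
    return CANONICAL.get(field_name, field_name)


offices = {"Acme": "Berlin", "Globex": "Lisbon"}

# Drifted: the rule fires, but the new value lands on a sibling field.
drifted = {"employer": "Acme", "office_city": "Berlin"}
apply_cascade(drifted, "Globex", offices, target_field="office_location")

# Anchored: the same rule, with the target resolved to the canonical field.
anchored = {"employer": "Acme", "office_city": "Berlin"}
apply_cascade(anchored, "Globex", offices, target_field=anchor("office_location"))
```

In the drifted card the correct value exists but `answer_office` still returns the old city; in the anchored card the same rule produces the updated answer.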
Then push the three tasks with the clearest headroom. Tracking (0.50–0.54) is full version-chain recall, and a chronological arrow-rendering of an entity's history should carry it toward ~0.70. Absence (0.35) is staleness surfacing — making "this is no longer supported" the default render for a fact whose basis changed. Aggregation (0.29) has room toward ~0.45–0.50 with cleaner accumulating-set capture. Stacked, with honest uncertainty on each: a credible target of ≈0.60–0.65 overall.
And we will publish the ceiling. The benchmark ships an in-context upper bound — the answer model given only the gold facts, no retrieval. We will run it on the same 100 episodes and publish it here. It tells the reader, and us, exactly how much of the remaining gap is architecture versus the answer model's own limit — and where it is no longer worth optimizing. Targets are not results; the full-suite numbers will appear on this page when each run completes, alongside these, not replacing them.
How we run it — and what we'll publish
The benchmark's own rules. The full 100-episode suite, scored with the benchmark's published judge prompts, sessions ingested in order, no task-specific tuning. The harness is an adapter over the same ingestion and answer paths our customers use — we benchmark the product, not a lab build.
Every failure, classified. For each miss we publish the cause: retrieval miss (the evidence was never surfaced), ranking loss (surfaced but out-ranked), or reasoning failure (surfaced, ranked, and wrongly synthesized). A score tells you where a system is; the failure taxonomy tells you whether its architecture can get better. Iteration 1's Cascade misses: 64 reasoning failures, 4 ranking losses, 2 retrieval misses. Iteration 2's: 77 reasoning failures, 5 ranking losses, 3 retrieval misses. In both runs retrieval is effectively solved — the evidence is almost always surfaced — and the remaining work is in the protocol that decides what to do with it. That is precisely why the iteration-2 regression is recoverable: nothing stopped being found; a rule started firing against the wrong field name.
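The three-way taxonomy above maps naturally onto a small classifier. The Python sketch below is one possible reading, assuming that "out-ranked" means the evidence was retrieved but fell below a fixed context cut-off; `classify_miss`, `context_size`, and `shares` are hypothetical names, not part of the published harness. The counts are the Cascade figures quoted above.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    RETRIEVAL_MISS = "retrieval miss"        # evidence never surfaced
    RANKING_LOSS = "ranking loss"            # surfaced but out-ranked
    REASONING_FAILURE = "reasoning failure"  # surfaced, ranked, wrongly synthesized


def classify_miss(evidence_id, ranked_ids, context_size):
    """Classify one failed question.

    ranked_ids is the retriever's output, best first; only the first
    context_size items are assumed to reach the answer model.
    """
    if evidence_id not in ranked_ids:
        return Failure.RETRIEVAL_MISS
    if ranked_ids.index(evidence_id) >= context_size:
        return Failure.RANKING_LOSS
    return Failure.REASONING_FAILURE


def shares(counts):
    """Turn raw failure counts into fractions of all misses."""
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


# Cascade miss counts as published above.
iteration_1 = Counter({Failure.REASONING_FAILURE: 64,
                       Failure.RANKING_LOSS: 4,
                       Failure.RETRIEVAL_MISS: 2})
iteration_2 = Counter({Failure.REASONING_FAILURE: 77,
                       Failure.RANKING_LOSS: 5,
                       Failure.RETRIEVAL_MISS: 3})
```

Applied to the published counts, `shares` puts reasoning failures at roughly 91% of Cascade misses in both runs and retrieval misses under 4%.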
The numbers we're proudest of and the ones we're not. A vendor that publishes only its wins should be presumed to have losses. Our results page will carry both, permanently, with the run methodology alongside.
Run artifacts
Every number above is recomputable from the raw results. No selection, no summarization — all 1,188 questions per iteration, passes and failures alike.
Iteration 1 · June 12, 2026
Complete results — Excel — every question: episode, task, the question, the gold answer, N71's answer, the judge's verdict, and the real/trivial classification. Three sheets: summary, all questions, per-episode.
Complete results — CSV · JSON — the same data for programmatic use.
Per-episode breakdown — CSV — per-task pass counts for each of the 100 episodes.
Iteration 2 · June 13, 2026 (membership capture + dependency recall)
Complete results — Excel — same schema; this is the run with Aggregation 0.29 and the Cascade/Deletion regression, every question included.
Complete results — CSV · JSON — the same data for programmatic use.
Per-episode breakdown — CSV — per-task pass counts for each of the 100 episodes.
Episodes and judge prompts are the benchmark's published versions (Jung et al., arXiv:2605.12477). Answer model: gpt-4.1-mini, the study's standard. Judge: gpt-4o. Internal identifiers and infrastructure diagnostics are scrubbed from the export; the benchmark substance — every question, answer, and verdict — is complete.
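For readers who want to recompute per-task scores from the CSV export, a minimal Python sketch follows. The column headers (`task`, `verdict`, `classification`) and their values (`pass`, `real`) are assumptions made for this example and should be checked against the header row of the actual file; the inline sample stands in for the download.

```python
import csv
import io
from collections import defaultdict


def task_pass_rates(csv_file):
    """Per-task share of questions that are real passes."""
    passed, total = defaultdict(int), defaultdict(int)
    for row in csv.DictReader(csv_file):
        total[row["task"]] += 1
        if row["verdict"] == "pass" and row["classification"] == "real":
            passed[row["task"]] += 1
    return {task: passed[task] / total[task] for task in total}


# Stand-in for the downloaded file; with the real export,
# pass open(path, newline="") instead of this StringIO.
sample = io.StringIO(
    "episode,task,verdict,classification\n"
    "1,Cascade,pass,real\n"
    "1,Cascade,pass,trivial\n"
    "2,Cascade,fail,real\n"
    "2,Cascade,pass,real\n"
    "1,Absence,fail,real\n"
    "2,Absence,pass,real\n"
)
rates = task_pass_rates(sample)
```

The trivial pass in the sample is counted in the total but not as a pass, mirroring the strict metric described under the results table.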
A note on reading memory benchmarks
Two things to check before believing any number in this category — including ours. First, who scored it: self-reported results on a benchmark the vendor selected deserve the same scrutiny as any other marketing claim; methodology should be published in enough detail to rerun. Second, the frontier-model baseline: on some tasks, a raw frontier model with the transcript in context already scores respectably — a memory system has to beat the baseline of simply not having a memory system, at a cost that makes sense. We hold ourselves to both checks.
Methodology questions, rerun requests, or benchmark suggestions: research@n71.ai