Benchmarks

We measure before we claim.

Capability — the frontier, from open models

Memory is one half of the thesis; capability is the other. With the right software layer, open models reach the frontier on the public benchmarks the labs report, and on several they pass it. We name our models: each tier pairs an openly licensed model (Kimi K3, GLM-5.2, or Qwen3-32B) with N71. Every number below is scored by the benchmark's official grader and set beside our own head-to-head runs of the frontier models.

Benchmark	Open model ⊕ N71	Frontier
GPQA Diamond · accuracy · full 198	94.4 — Kimi K3 ⊕ N71	GPT-5.6 Sol — 94.9 · Claude Fable 5 — 82.8
LiveCodeBench · pass@1 · full 175	84.6 — GLM-5.2 ⊕ N71 (raw: 70.3)	GPT-5.6 Sol — 82.9 · Claude Fable 5 — 93.1
Text-to-SQL · held-out conventions	93.3	frontier, cold — 46.7

Full sets, official graders, one graded answer per problem, no best-of-K. GPQA Diamond and LiveCodeBench are the complete 198-question and 175-problem sets. The frontier column comes from our own head-to-head runs of GPT-5.6 Sol and Claude Fable 5. Methodology is available under NDA: benchmarks@n71.ai
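
The scoring rule above (one graded answer per problem, no best-of-K) reduces to a plain accuracy over the full set. A minimal sketch, assuming a hypothetical list of per-problem pass/fail verdicts from the official grader; the function name and the synthetic verdicts are illustrative and are not N71's harness. The count of 187 is inferred from the rounded 94.4 on 198 problems, not reported in the table.

```python
from typing import Sequence


def single_attempt_accuracy(verdicts: Sequence[bool]) -> float:
    """Percentage of problems whose single graded answer was correct."""
    if not verdicts:
        raise ValueError("need at least one graded problem")
    return 100.0 * sum(verdicts) / len(verdicts)


# Synthetic verdicts for illustration only: 187 correct out of 198 problems.
verdicts = [True] * 187 + [False] * 11
print(round(single_attempt_accuracy(verdicts), 1))
```

Because each problem contributes exactly one verdict, there is no sampling parameter to tune: the score is the fraction of first answers that pass.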

Memory at long context — N71 vs Sakana Fugu

MRCRv2 asks a memory system to find a specific occurrence buried in a long, evolving conversation, at context lengths up to a million tokens. No model's window holds that. N71's memory core doesn't try to: it retrieves the relevant occurrences by topic (it never sees the answer), collapses a 1M-token haystack down to a few thousand tokens, and hands a 128k-window model a bounded slice it can actually answer.
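
The retrieve-then-answer idea can be sketched in a few lines. This is an illustration under stated assumptions, not N71's memory core: the word-match retrieval, the whitespace token count, and the names `retrieve_slice` and `count_tokens` are all hypothetical stand-ins.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real system would use the model's tokenizer."""
    return len(text.split())


def retrieve_slice(turns: List[str], topic: str, budget: int) -> List[str]:
    """Keep turns that mention every topic word, in original order, within a token budget."""
    topic_words = set(topic.lower().split())
    selected: List[str] = []
    used = 0
    for turn in turns:
        if not topic_words <= set(turn.lower().split()):
            continue  # turn does not mention every topic word
        cost = count_tokens(turn)
        if used + cost > budget:
            break  # stop before the slice exceeds the budget
        selected.append(turn)
        used += cost
    return selected


turns = [
    "write a poem about otters",
    "write an essay about tax law",
    "write a poem about rivers",
    "write a poem about otters in winter",
]
print(retrieve_slice(turns, topic="poem otters", budget=50))
```

Keeping the matches in their original order matters for this task: the downstream model still has to pick out the requested occurrence, so the slice must preserve which match came first.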

On the full official range — 2/4/8-needle across every length bin including 256k, 512k, and 1M:

Tier	MRCRv2 (full range)	Fugu Ultra	Fugu
Reason — GLM-5.2 ⊕ N71	99.1	93.6	86.6
Frontier — Kimi K3 ⊕ N71	99.8	93.6	86.6
Inquiry — Qwen3-32B ⊕ N71	91.6	93.6	86.6

The same models score 30.4 raw. The scores hold at long context: Reason is 98.9 above 512k, with no collapse past the model's native window, because N71 turns long-context recall into a bounded, retrievable question rather than cramming a million tokens into one model. That is the orchestration advantage, and it is exactly what a memory layer is for. (The Frontier figure of 99.8 covers the completed portion of a run that is still finishing; Reason and Inquiry are the full 2,400-item sweeps.)

Memory — the tasks that define organizational reality

Most memory systems can store facts. The MEME benchmark (KAIST AI · Tübingen · NAVER, 2026) — the first evaluation built around evolving organizational memory — showed that storage was never the hard part. Across 100 episodes and six tasks, practical-cost systems handle exact recall and aggregation reasonably. Then the benchmark asks two questions that organizations ask every day, and the field collapses.

The two tasks that matter

Cascade — when a fact changes, does everything that depends on it change too?
A vendor's contract is renegotiated. The budget line that referenced the old terms, the project plan built on the old timeline, the commitment made downstream — do they update, or does the system keep answering from a world that no longer exists? Field average at practical cost: 3%.
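
What the Cascade task asks for can be pictured as propagation over a dependency graph. A minimal sketch, assuming facts are stored with explicit "depends on" links; the graph, the fact names, and `cascade_stale` are hypothetical and do not describe how N71 or the benchmark represents memory.

```python
from collections import deque
from typing import Dict, List, Set


def cascade_stale(dependents: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every fact that transitively depends on the changed fact."""
    stale: Set[str] = set()
    queue = deque([changed])
    while queue:
        fact = queue.popleft()
        for child in dependents.get(fact, []):
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale


# Hypothetical links mirroring the vendor-contract example above.
dependents = {
    "vendor_contract": ["budget_line", "project_plan"],
    "project_plan": ["downstream_commitment"],
}
print(sorted(cascade_stale(dependents, "vendor_contract")))
```

The breadth-first walk reaches the downstream commitment even though it never references the contract directly, which is the part of the task that flat fact storage misses.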

Absence — does the system know what it doesn't know?
The most dangerous output of a memory system is not a wrong answer; it is a confident restatement of a stale fact. Absence measures whether a system can say "this is no longer supported" instead of fluently asserting yesterday. Field average at practical cost: 1%.
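
The behavior Absence rewards can be sketched as a lookup that checks whether a fact is still supported before answering. This is an illustrative toy, assuming a hypothetical `Fact` record with a `supported` flag; it is not the benchmark's grader or N71's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Fact:
    value: str
    supported: bool = True  # False once the fact has been invalidated


def answer(store: Dict[str, Fact], key: str) -> str:
    """Return the stored value only when it is still supported."""
    fact: Optional[Fact] = store.get(key)
    if fact is None:
        return "unknown"
    if not fact.supported:
        return "this is no longer supported"
    return fact.value


store = {"delivery_date": Fact("March 1")}
store["delivery_date"].supported = False  # e.g. the contract was renegotiated
print(answer(store, "delivery_date"))
```

A system that only stores values would return "March 1" here; the explicit flag is what lets it decline instead.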

These numbers are the benchmark authors', not ours, and they are the honest state of the category. Behind the marketing, almost every memory system in production today fails the two tasks that define organizational reality: things change, and knowing that you don't know is worth more than sounding sure.

Our results

We ran the full 100-episode suite with the benchmark's published episodes and judges.

System	Cascade	Absence	Overall
N71	0.628	0.42	0.574
Best published memory system	0.06	0.05	0.42
Frontier model — full transcript in context	0.05	0.35	0.36
Field average — six memory systems	0.03	0.01	0.24

N71 holds the standing best on each task shown. Baseline rows are the benchmark's published field.

On the two tasks that define organizational reality, N71 leads the field average by more than an order of magnitude: Cascade 0.628 against a field average of 0.03 (3%), more than twenty times the field, and Absence 0.42 against 0.01 (1%), forty-two times the field. N71's overall 0.574 is the highest of any memory system in the study, above the best published system (0.42) and both no-memory full-transcript baselines.
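
The multiples quoted above follow directly from the results table. A short check, using only the table's values expressed as fractions (so 3% is 0.03); the helper name `ratio` is just for illustration.

```python
def ratio(score: float, baseline: float) -> float:
    """How many times larger score is than baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return score / baseline


# Values copied from the results table (fractions, so 3% is 0.03).
print(round(ratio(0.628, 0.03), 1))  # Cascade: N71 vs field average
print(round(ratio(0.42, 0.01), 1))   # Absence: N71 vs field average
print(round(ratio(0.574, 0.42), 2))  # Overall: N71 vs best published system
```

The Cascade and Absence ratios are measured against the field average; the gap to the best published system's overall score is much smaller, which is why the two comparisons are reported separately.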

We publish our losses too. A vendor that publishes only its wins should be presumed to have losses. Our results carry both, permanently — the numbers we're proudest of and the ones we're not.

A note on reading memory benchmarks

Two things to check before believing any number in this category — including ours. First, who scored it: self-reported results on a benchmark the vendor selected deserve the same scrutiny as any other marketing claim. Second, the baseline: on some tasks a raw frontier model with the transcript in context already scores respectably — a memory system has to beat the baseline of simply not having a memory system, at a cost that makes sense. We hold ourselves to both checks.

Methodology, full per-model results, and validation access under NDA: sanad@n71.ai