The Evidence-Anchored Organizational Graph

N71 Technical Report TR-2026-01 · June 2026 · Draft for publication · v3
N71 Research · Abu Dhabi

Abstract

Structured organizational memory is commoditizing. Write-time knowledge graphs, provenance tracking, and temporal awareness — capabilities that differentiated systems a year ago — are converging into the baseline, and we expect that convergence to complete. This report therefore makes a narrower and harder claim than "we built a graph." It specifies the Evidence-Anchored Organizational Graph as a system defined not by its feature list but by five invariants — properties that hold for every object and every answer, without exception — and by the Grounded Answer Protocol built on them: a fail-closed answering discipline under which no claim leaves the system without surviving validation against verbatim sources, and insufficient evidence produces refusal rather than generation. A feature list tells you what a system has; an invariant tells you what you can build on. In its first published evaluation — the full 100-episode MEME benchmark, its published episodes and judges, end to end on our production pipeline — the system scores 0.512 overall, the highest of any memory system in the benchmark's own study and within 0.04 of a frontier-model reference at roughly seventy times baseline cost, with its widest margins on the two tasks the field cannot solve. We also report the regression an earlier protocol iteration produced on one of those tasks, analyze the propagation–refusal trade behind it, and specify the conditional-refusal mechanism the production protocol now runs. The substrate (five layers, four retrieval channels) is specified as the minimum machinery the invariants require. Consistent with our disclosure policy, we publish contracts and invariants; extraction internals and ranking parameters are not disclosed.

1. The Argument

The failure modes of similarity-based organizational memory are settled science: interference between semantically adjacent facts, temporal flatness, and unaccountability are properties of the embedding geometry itself [1], and impossibility results establish that no purely semantic system escapes them without external symbolic structure [2]. The industry heard the diagnosis. Knowledge graphs, write-time extraction, and temporal metadata are now appearing across the category, and within a product cycle every serious memory system will claim them.

So the discriminating question is no longer whether a system has structure. It is what the structure guarantees. A graph whose answers cannot be audited is a similarity index with extra steps; a temporal model that decorates facts with timestamps but lets a confident stale answer through has solved the easy half of the problem. Enterprises do not adopt memory systems because recall improves — they adopt them when the output can be trusted: trusted to be derived from real sources, trusted to be current, and — hardest — trusted to say "I don't know" when that is the true answer.

This report specifies trust as a set of invariants, then specifies the machinery that maintains them.

2. The Five Invariants

These hold for every object and every answer in the system. They are enforced by construction — by storage-layer constraints, typed contracts, and protocol gates — not by convention or review.

I1 — Provenance totality. Every object above the raw-capture layer has a finite derivation path terminating in immutable captured source. There are no orphan facts. The question "why does the system believe this?" always has an inspectable answer ending in a verbatim passage.

I2 — Bi-temporal completeness. Every fact carries two timestamps: when it occurred in the world and when the system observed it. Queries may bind either. "What happened Monday?" and "what did we know on Tuesday?" are different questions with different correct answers, and a system that cannot distinguish them will eventually answer one with the other.

I3 — Non-destructive evolution. No operation deletes history. Facts are superseded with pointers to what replaced them; entity merges preserve recoverable pre-merge state; corrections append, with an audited record of who changed what and why. The graph at any past moment is reconstructible.

I4 — Validated emission. No claim leaves the system that has not survived citation validation against its anchored sources. Claims that fail validation are stripped before the answer is returned, and the validation record ships with the response. An answer is not text; it is text plus the audit of itself.

I5 — Closed-world honesty. When retrieval cannot assemble sufficient support, the system refuses — explicitly, with a machine-readable reason — rather than generating a plausible response. The refusal path is a first-class output, engineered and evaluated with the same rigor as the answer path.

I1–I3 are substrate invariants; versions of them are appearing elsewhere in the category, and we regard them as the emerging price of admission. I4 and I5 are emission invariants, and to our knowledge no other production system enforces both. They are where this report's claims concentrate, because they are the difference between a memory that informs decisions and a memory that merely furnishes text.

3. The Substrate

The invariants require machinery. We specify it compactly; the architecture here is deliberately conventional, because the substrate is not where novelty should live.

Layer 1 — Raw events. Immutable capture: source, actor, type, payload, and the bi-temporal pair (occurred_at, observed_at) that I2 requires. Append-only; never edited by any downstream process, including the system's own agents. The terminus of every I1 derivation path.

Layer 2 — Evidence. Events parsed into documents and paragraph-level chunks. The chunk is the system's unit of citation: stable URI, position, section path, source events. Every claim at every layer above resolves to chunk identifiers; every chunk resolves to a verbatim passage with capture context. Anchoring — retrieving the exact source view for any handle — is a single primitive operation.

Layer 3 — The graph. An extraction pipeline lifts typed entities and relationships from evidence: people, accounts, products, projects, decisions, commitments, and an extensible relation vocabulary. Extraction is graph-aware — candidates resolve against existing entities before writing, over the full identifier surface (names, addresses, handles, internal IDs), with confidence-scored, transactional, I3-compliant merges. Typing is governed by a per-organization ontology under an explicit adoption gate, specified fully in TR-2026-03. Every entity and edge carries the evidence chunks that justify it; an entity with no surviving evidence is a defect.

Layer 4 — Temporal state. Entities carry typed state timelines, not current-state snapshots. Each transition is timestamped and causally linked to its triggering event, making point-in-time queries (I2, I3) and "why is this entity in this state?" (I1) ordinary operations. Above the timelines, the system computes lifecycle dynamics — duration-in-state against the historical norm for the entity's type, momentum, behavioral baselines, and deviations from them — so that a deal stalled past its cohort's median with activity gone quiet is a structurally different object from an active one, before anyone asks.

Layer 5 — Synthesis. Continuous processes read across layers 1–4 and produce Thoughts: typed proactive-intelligence objects, each carrying a synthesis chain extending I1 to the system's own conclusions. Synthesis producers pass a typed promotion contract — candidates are validated and promoted, held as provisional, rejected, or deprecated, with typed reasons — so that derived knowledge enters the graph through a gate, not a side door.

Retrieval. Four channels run in parallel — graph traversal, semantic similarity, lexical match, and a freshness channel surfacing material too recent to be fully indexed — each covering the others' blind spots. Fusion policy and parameters are not disclosed.

4. The Grounded Answer Protocol

The protocol is the system's emission discipline — the machinery of I4 and I5 — and the central contribution of this report. Its defining commitment: it fails closed. The most damaging output of any memory system is a confident answer its corpus cannot support, and the protocol makes that output structurally difficult rather than merely discouraged.

Stage 1 — Retrieve. Candidate evidence is gathered through the four channels, bounded by the caller's governance scope (TR-2026-02): results are filtered by authority at the source, not redacted after.

Stage 2 — Synthesize. An answer is composed strictly from retrieved evidence, with inline citations binding each claim to specific chunks. Because the substrate has already resolved meaning, identity, and time, the synthesis model's task is reduced to reading comprehension over structured context — which is precisely what permits accurate operation with small, economical, sovereign-hostable models. The architecture, not the model, carries the intelligence.

Stage 3 — Validate. Every citation is checked against its anchored source. Claims that fail are stripped before the answer returns. The validation record — citations checked, failures, claims removed — ships as machine-readable metadata with every response (I4).

The refusal path (I5). Insufficient support yields an explicit refusal carrying a machine-readable reason and recommended next actions — broaden the query, ingest the missing source — never a degradation into plausible generation. We evaluate refusal quality as seriously as answer quality: a system that refuses too rarely is dangerous, and one that refuses too often is useless; the calibration between them is a measured property, not an aspiration (§5).

Two consequences of the protocol are worth stating because they are unusual. First, the answer is evidence about itself: a downstream agent consuming an N71 answer can programmatically inspect what was validated, what was stripped, and why — and make its own trust decision. Second, the protocol composes: because every claim resolves to chunk handles (I1), an agent can take any sentence of any answer and drill to the verbatim source in one call. There is no dead end between an assertion and its proof.

5. Evaluation

We evaluate where the invariants make falsifiable predictions. The MEME benchmark [3] constructs 100 episodes of evolving multi-session history — facts introduced, changed, deleted, and bound by dependency rules, with gold answers computed from the dependency graph and grading validated against human annotators at 98.6% agreement — and isolates the two task classes on which practical-cost systems collectively fail: Cascade (when an upstream fact changes, do dependent facts update? — a direct test of I2/I3 at answer time) and Absence (does the system know when it no longer knows? — a direct test of I5), with field averages of 3% and 1% across the study's six memory systems. A system claiming these invariants should expect to be measured exactly there. In June 2026 we ran the full 100-episode suite — the benchmark's published episodes and judge prompts — end to end through N71's production pipeline.

System	ER	Agg	Trk	Del	Cas	Abs	Overall
N71 — production run, iteration 1 · June 12, 2026	0.99	0.12	0.50	0.56	0.55	0.35	0.512
MD-flat — best published memory system	0.94	0.45	0.77	0.25	0.06	0.05	0.42
gpt-4.1-mini — full transcript in context	1.00	0.27	0.69	0.45	0.03	0.04	0.36
Sonnet 4.6 — full transcript in context	0.50	0.21	0.58	0.39	0.05	0.35	0.32
Field average — six memory systems	0.62	0.23	0.35	0.17	0.03	0.01	0.24
MD-flat × Opus 4.7 — ≈70× cost, 20-episode subset	0.60	0.80	0.20	0.80	0.32	0.59	0.55

Tasks: Exact Recall, Aggregation, Tracking, Deletion, Cascade, Absence. Baseline rows from [3], Table 2 (all baselines share a single small answering model end to end; the study's full system set appears in the source). Bold marks the best result among practical-cost configurations. N71 figures are from the full 100-episode suite; the per-miss classification and complete per-question results are published with this report.

Reading the result. Overall accuracy of 0.512 is the highest of any memory system in the study — the prior best is 0.42 — and sits within 0.04 of the frontier-model reference (0.55 at roughly seventy times baseline cost, on a 20-episode subset) that the benchmark's authors describe as not deployable today. The margins concentrate where the invariants predict. Cascade reaches 0.55 against a 0.03 field average and a 0.06 prior best among memory systems — nine times the prior state of the art: under I2/I3, a superseding fact replaces the current value at answer time rather than competing with it as a semantically similar passage — the failure mechanism the benchmark's authors identify in every system they study. The per-miss classification confirms it: of 74 Cascade misses, 64 are reasoning-stage and only 2 are retrieval misses. Absence reaches 0.35 against a 0.01 field average — the best practical-cost result in the study — through the conditional-refusal mechanism analyzed below. Deletion (0.56, best in the study, against a 0.25 prior best) follows the supersession mechanism. Tracking (0.50) trails the study's best file agent (0.77): full version-chain retrieval is incomplete, and a bi-temporal substrate should dominate this task. Aggregation (0.12) is the weakest column and its mechanism is fully diagnosed: multi-part personal facts whose components are captured as graph relationships (memberships) rather than card attributes are invisible to the answer surface for list-everything questions; the capture-side fix is specified. Iteration 2 (June 13, 2026) shipped it and raised Aggregation to 0.29 — but the same release regressed Cascade (to 0.48) and Deletion (to 0.50) through a field-naming drift that fires a dependency rule against a field the answer no longer reads; the fault is recoverable and under repair, and the full iteration-2 table, including the regression, is published on the benchmarks page. Iteration targets, not excuses.

The propagation–refusal frontier. Absence is the one task on which our score moved backward during protocol iteration — dropping to 0.22 in the same revision that raised Cascade, Deletion, and Aggregation — and the mechanism is structural: we expect it to bind every system that attempts both tasks. Cascade and Absence share a trigger — an upstream fact changed — and differ in one property: whether a dependency rule determines the new value. Where a rule exists, the correct behavior is derivation; where none exists, the only correct behavior is refusal. A protocol tuned to chase dependencies and commit to derived answers gains the assertive tasks and begins answering questions whose correct answer is "this is no longer knowable." The study's own baselines exhibit the frontier's other end: its most cautious configuration posts a strong baseline Absence (0.35) alongside a near-floor Cascade (0.05). Blanket caution is cheap and blanket assertiveness is cheap; the problem worth solving is conditional refusal — derive when a rule determines the value, refuse when none does. In protocol terms it is a rule-existence check at emission: a dependent fact whose basis has changed is re-emitted as the rule's derived value when a rule exists, and as an explicit "no longer supported" marker — which the answer path surfaces as uncertainty, not as the stale value — when none does, with invalidation cascading through dependency chains so a fact two hops from the change is as honest as a fact one hop away. Under this mechanism, on the full suite, N71 holds the study's best Cascade (0.55) and its best practical-cost Absence (0.35) simultaneously — the combination the frontier makes hard — where the cautious baseline pairs its 0.35 with a Cascade of 0.05. A note on sample sensitivity, because it is a lesson in reading this category: across six-episode iteration runs Absence ranged from 0.22 to 0.89; the full-suite figure is 0.35. Numbers published without disclosed sample sizes should be read with that spread in mind. We treat I5 as unfinished — 0.35 is the field's best and still far from where calibrated refusal needs to be — and we publish the regression alongside the recovery because the argument of §1 demands both.

Methodology and discipline. Our run uses the benchmark's published episodes and judge prompts, executed through the same ingestion and answer paths our customers use; it is therefore not the controlled setting of [3], in which every baseline shares one answering model. The per-miss failure classification — retrieval miss, ranking loss, or reasoning failure — is published with the complete per-question results, because the classification, not the score, determines whether a failure is fixed in the channels, the substrate, or the protocol. Run artifacts are frozen for reproduction. We publish failures alongside successes. A vendor that publishes only wins should be presumed to have losses.

6. Limitations

Stated plainly. Extraction is performed by language models and inherits their errors; the mitigation is not perfect extraction but I1 plus I3 — every error is traceable and correctable without history loss. The freshness channel trades ranking quality for availability inside the indexing window. Refusal calibration is corpus-dependent and is weakest early in a deployment, for the same reason all of Layer 4 is: the system's distinctive properties compound with tenure, and a week-old workspace has invariants but not yet much for them to protect.

7. Disclosure

We publish invariants, contracts, and protocols — the claims we expect to be held to and measured on. Extraction prompt architecture, resolution scoring, confidence calibration, fusion weights, and synthesis triggering remain proprietary.

References

[1] Ray Barman, S., Starenky, A., Bodnar, S., Narasimhan, N., Gopinath, A. The Geometry of Forgetting: How High-Dimensional Embeddings Reproduce Human Memory Phenomena. arXiv:2604.06222, 2026.

[2] Ray Barman, S., Starenky, A., Bodnar, S., Narasimhan, N., Gopinath, A. The Price of Meaning: Impossibility Theorems for Semantic Memory Systems. arXiv:2603.27116, 2026.

[3] Jung, S., et al. MEME: Multi-entity and Evolving Memory Evaluation. KAIST AI · Tübingen · NAVER. arXiv:2605.12477, 2026.