External Epistemic Memory

Justified, persistent, auditable knowledge for LLMs

By Ben Thomasson · Source: ftl-reasons

What is EEM?

External Epistemic Memory (EEM) is knowledge that lives outside the model, carries its justifications with it, and lets you understand how the system knows what it knows.

EEM is defined by three load-bearing properties: it is external (outside model parameters), epistemic (justified beliefs with truth values), and memory (persistent semantic knowledge). Each property is necessary. Together they distinguish EEM from every other approach to LLM knowledge management.

Three Properties

External

Knowledge lives outside model parameters, in a separate substrate. It survives compaction, model swaps, and session boundaries. It is separable, copyable, shareable, inspectable, editable, and auditable. Of these six qualities, auditability is the most epistemically important: it makes "how do you know that?" answerable by traversing the justification chain.
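To make "justification chain traversal" concrete, here is a minimal sketch in Python. The `Belief` record, the `explain` function, and the example beliefs are all hypothetical; they are not the ftl-reasons schema. The sketch assumes each belief has a single antecedent list and that the graph has no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    """One node in a hypothetical belief store (not the ftl-reasons schema)."""
    text: str
    antecedents: list[str] = field(default_factory=list)  # empty means premise
    source: str = ""  # provenance for premises, e.g. a document name


def explain(store: dict[str, Belief], node_id: str, depth: int = 0) -> list[str]:
    """Walk the justification chain depth-first and return an indented trace."""
    belief = store[node_id]
    kind = "premise" if not belief.antecedents else "derived"
    suffix = f" [source: {belief.source}]" if belief.source else ""
    lines = [f"{'  ' * depth}{node_id} ({kind}): {belief.text}{suffix}"]
    for parent in belief.antecedents:
        lines.extend(explain(store, parent, depth + 1))
    return lines


# Illustrative data only.
store = {
    "p1": Belief("Service X listens on port 8080", source="config.yaml"),
    "p2": Belief("The firewall blocks port 8080", source="firewall.rules"),
    "c1": Belief("Service X is unreachable from outside", ["p1", "p2"]),
}

print("\n".join(explain(store, "c1")))
```

The trace is the answer to "how do you know that?": every derived line is followed by the beliefs it rests on, down to premises with a named source.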

Epistemic

Not just facts but justified beliefs with truth values (IN/OUT), retraction cascades, contradiction records (nogoods), and derivation depth. This is what distinguishes EEM from RAG, which is external semantic memory but not epistemic.

Memory

Persistent structured knowledge in Tulving's semantic memory category — not ephemeral context. The knowledge persists across sessions, across models, and across time.

What EEM Replaces

EEM vs RAG

RAG is external semantic memory but not epistemic. It retrieves content by similarity but has no justification chains, truth values, retraction cascades, or contradiction tracking. EEM adds the epistemic layer that RAG lacks.

EEM vs Context Windows

Conversation history and context windows are ephemeral — lost at session boundaries, destroyed by compaction. EEM persists across sessions and model swaps. Context compaction destroys justification networks (quantified across 33 measured compaction events).

EEM vs Parametric Knowledge

In-parameter knowledge has no audit trail. EEM makes "how do you know that?" answerable by justification chain traversal.

EEM vs Self-Assessed Confidence

LLM self-assessed confidence does not track accuracy. Measured on two models (March 2026): Sonnet r=0.135 (not significant at p<0.05) and Opus r=-0.045 (no correlation). The experiment ran 55 questions x 3 conditions x 2 models x 5 runs = 1,650 invocations. Answer and confidence come from the same process, which is the same structural flaw as human overconfidence (Kahneman). EEM replaces "am I sure?" with "is this justified?", shifting from unreliable confidence to auditable justification chains. Methodology and results.

How It Works

Truth Maintenance System (TMS)

EEM is built on Doyle's (1979) Truth Maintenance System architecture: SL justifications with antecedents, propagation cascades, retraction cascades, and an exogenous problem-solver slot. The TMS substrate is content-agnostic by design.
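The following toy sketch shows how SL justifications, propagation, and a retraction cascade fit together. It is an illustration under simplifying assumptions, not the ftl-reasons implementation: `TinyTMS` and its methods are hypothetical names, labels are recomputed from scratch rather than incrementally, and the out-list half of Doyle's SL justifications is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    premise: bool = False    # premises are IN until retracted
    retracted: bool = False
    # Each justification is one SL antecedent list.
    justifications: list[list[str]] = field(default_factory=list)


class TinyTMS:
    """Toy truth maintenance: IN/OUT labels computed from SL justifications."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add_premise(self, node_id: str, text: str) -> None:
        self.nodes[node_id] = Node(text, premise=True)

    def add_derived(self, node_id: str, text: str, antecedents: list[str]) -> None:
        node = self.nodes.setdefault(node_id, Node(text))
        node.justifications.append(list(antecedents))

    def retract(self, node_id: str) -> None:
        self.nodes[node_id].retracted = True

    def labels(self) -> dict[str, bool]:
        """Recompute every label; True means IN."""
        status = {nid: n.premise and not n.retracted for nid, n in self.nodes.items()}
        changed = True
        while changed:  # fixed point: labels only ever flip from OUT to IN
            changed = False
            for nid, node in self.nodes.items():
                if status[nid] or node.retracted:
                    continue
                if any(all(status.get(a, False) for a in just)
                       for just in node.justifications):
                    status[nid] = True
                    changed = True
        return status


tms = TinyTMS()
tms.add_premise("a", "observation A")
tms.add_premise("b", "observation B")
tms.add_derived("c", "follows from A and B", ["a", "b"])
tms.add_derived("d", "follows from C", ["c"])

before = tms.labels()   # every node is IN
tms.retract("a")
after = tms.labels()    # c and d fall with a; b is untouched
```

Because every label starts OUT and is only promoted by a fully supported justification, beliefs that support each other in a circle with no premise underneath stay OUT.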

Hybrid Architecture

The implementation is a hybrid TMS: symbolic TMS handles structure (justifications, propagation, cascades, backtracking, challenge/defend) while LLMs handle semantic operations (derive generates beliefs, review-beliefs critiques them, contradiction detection finds nogoods). Putting an LLM in the problem-solver slot fits Doyle's architecture, which keeps the problem solver outside the TMS.
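One way to picture the split is a symbolic core that accepts proposals from a pluggable proposer. In the sketch below, `run_derive`, `Proposal`, and `stub_llm` are hypothetical names, and the stub stands in for a real model call. The point is only that the semantic side proposes while the structural side decides what gets recorded.

```python
from typing import Callable

# A proposal is (node_id, text, antecedent_ids). The proposer fills the
# "problem solver" slot, which in the hybrid design would be an LLM call.
Proposal = tuple[str, str, list[str]]
Proposer = Callable[[dict[str, str]], list[Proposal]]


def run_derive(beliefs: dict[str, str], justifications: dict[str, list[str]],
               propose: Proposer) -> list[str]:
    """Ask the proposer for new beliefs; record only structurally valid ones."""
    accepted = []
    for node_id, text, antecedents in propose(dict(beliefs)):
        known = all(a in beliefs for a in antecedents)
        if node_id in beliefs or not antecedents or not known:
            continue  # reject duplicates and proposals citing unknown beliefs
        beliefs[node_id] = text
        justifications[node_id] = antecedents
        accepted.append(node_id)
    return accepted


def stub_llm(beliefs: dict[str, str]) -> list[Proposal]:
    """Stand-in for a model call: one grounded proposal, one ungrounded."""
    return [
        ("c1", "Service X is unreachable from outside", ["p1", "p2"]),
        ("c2", "Service Y is down", ["p9"]),  # p9 does not exist
    ]


beliefs = {"p1": "Service X listens on port 8080",
           "p2": "The firewall blocks port 8080"}
justifications: dict[str, list[str]] = {}
accepted = run_derive(beliefs, justifications, stub_llm)
```

Here only `c1` is recorded; `c2` is dropped because it cites a belief the store has never seen.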

Key Mechanisms

Derive-then-Review

The loop is: over-derive, let review catch the errors, and let retraction cascades propagate the corrections. Both roles overshoot: derive over-generates and review over-retracts. Working through candidate retractions is where insights hide. Each review round retracts 13-37% of derived beliefs, so the system finds and removes its own errors.
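A single review round can be sketched as follows. The function name `review_round` and the toy graph are hypothetical, each belief is assumed to have exactly one justification, and the rate is assumed to count cascaded retractions as well as directly flagged ones (the source does not say which it measures).

```python
def review_round(derived: dict[str, list[str]],
                 flagged: set[str]) -> tuple[set[str], float]:
    """Retract flagged beliefs plus everything that depended on them.

    `derived` maps each derived belief to its antecedents; `flagged` is the
    reviewer's verdict (an LLM or a human in the workflow described above).
    """
    out = set(flagged) & set(derived)
    changed = True
    while changed:  # cascade: a belief falls when any antecedent has fallen
        changed = False
        for node_id, antecedents in derived.items():
            if node_id not in out and any(a in out for a in antecedents):
                out.add(node_id)
                changed = True
    rate = len(out) / len(derived) if derived else 0.0
    return out, rate


# Toy graph, illustrative only: p1..p3 are premises, d1..d8 are derived.
derived = {
    "d1": ["p1"], "d2": ["p1", "p2"], "d3": ["d2"], "d4": ["p2"],
    "d5": ["d1"], "d6": ["d4"], "d7": ["p3"], "d8": ["d7"],
}
retracted, rate = review_round(derived, flagged={"d2"})
```

Flagging `d2` also removes `d3`, which was built on it, so one review verdict corrects two beliefs.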

Measured Results

98.5%
A/B grade across 3,853 questions with dual-path architecture (Claude Opus 4.6, May 2026). Zero D/F grades — eliminated the failure tail entirely. Methodology.
88% vs 33%
Expert-service with EEM (Claude Opus 4.6) scores 88% A-grade vs 33% for the agent-pipeline baseline on the same 50 Red Hat domain questions, and runs 15x faster. Three rubrics, six systems tested (May 2026). Methodology.
40+
Expert knowledge bases built, from 237 beliefs (aap-expert) to 12,731 beliefs (redhat-expert).

Model Compensation

EEM compensates for model size: Sonnet 4.6 + beliefs approximates Opus 4.6 without beliefs. Haiku 4.5 with dual-path achieves 94% A+B, close to Opus at 98%. Smaller models with EEM match larger models without it.

Expert Prompt Paradox

Telling an agent it is an expert reduces belief utilization. Beliefs alone outperform beliefs + expert prompt: Opus 4.6 100% vs 94.2%, Sonnet 4.6 94.2% vs 91.8% (March 2026, 50 questions). The humble generic prompt produces better results because the agent consults the knowledge base instead of trusting its "expertise." Methodology.

Self-Critique Failure

LLM revision based on self-critique makes answers worse: Sonnet 4.6 dropped from 87% to 60% accuracy when asked to revise based on self-assessed confidence (March 2026, 1,650 invocations). Self-critique fails because the same model that made the error evaluates the error. EEM externalizes the critic's judgments, replacing internal self-assessment with external structured tracking. Methodology.

Architecture

Dual-Path Retrieval

EEM is queried via dual-path retrieval: a TMS path (pre-computed beliefs) plus an FTS path (full-text search over source chunks), merged by a third pass. Each path stays within cognitive budget.
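A rough sketch of the two retrieval paths, kept deliberately naive. The function names, the word-overlap scoring, and the sample data are assumptions made for illustration; a real deployment would presumably use a proper full-text index rather than set intersection.

```python
def tms_path(beliefs: dict[str, str], in_nodes: set[str], query: str) -> list[str]:
    """Return IN beliefs that share at least one word with the query."""
    terms = set(query.lower().split())
    return [text for node_id, text in beliefs.items()
            if node_id in in_nodes and terms & set(text.lower().split())]


def fts_path(chunks: list[str], query: str, limit: int = 3) -> list[str]:
    """Rank source chunks by how many query words they contain."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c.lower().split())), c) for c in chunks]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [c for _, c in ranked[:limit]]


def retrieve(beliefs: dict[str, str], in_nodes: set[str],
             chunks: list[str], query: str) -> dict[str, list[str]]:
    """Run both paths separately and hand labelled results to the merge pass."""
    return {"beliefs": tms_path(beliefs, in_nodes, query),
            "chunks": fts_path(chunks, query)}


# Illustrative data only.
beliefs = {"b1": "port 8080 is blocked by the firewall",
           "b2": "port 9090 is open"}
chunks = ["firewall rules deny inbound port 8080", "backup runs nightly"]
result = retrieve(beliefs, {"b1"}, chunks, "firewall port 8080")
```

Only beliefs that are currently IN reach the answer, which is the practical difference between the TMS path and plain similarity search: retracted beliefs such as `b2` never surface.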

Cognitive Budget

Borrowed from graphics frame budgets: decompose work into focused passes (TMS pass, RAG pass, merge pass) each within the model's attention budget. Mixing beliefs and document chunks in a single prompt degrades performance (Opus 4.6 drops 95.5% to 86%). Three focused passes achieve 100%. Ablation results.
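The pass decomposition might look like the sketch below. `three_pass_answer` and the `ask` callable are hypothetical, and the prompts are placeholders. Whether the real merge pass sees only drafts or also the raw material is not stated in the source; this version assumes drafts only.

```python
from typing import Callable

Ask = Callable[[str], str]  # hypothetical model call: prompt in, answer out


def three_pass_answer(question: str, beliefs: list[str],
                      chunks: list[str], ask: Ask) -> str:
    """Answer from beliefs and from chunks separately, then merge the drafts."""
    belief_draft = ask("Answer using only these beliefs:\n"
                       + "\n".join(beliefs) + f"\n\nQuestion: {question}")
    chunk_draft = ask("Answer using only these source excerpts:\n"
                      + "\n".join(chunks) + f"\n\nQuestion: {question}")
    # The merge pass sees two short drafts, never the raw beliefs or chunks.
    return ask(f"Reconcile these two draft answers to: {question}\n"
               f"Draft A: {belief_draft}\nDraft B: {chunk_draft}")
```

Each call carries one kind of material, so no single prompt mixes beliefs with document chunks.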

Expert Pipeline

The stages are: chunk the source material, propose beliefs, have a human accept them, derive connections, review the derivations, and export. Value accrues at each stage. Derive produces new knowledge: connections the source doesn't make explicit.

Multi-Agent TMS

Import another agent's beliefs with SL justifications including agent:active as antecedent. A node is IN iff the agent is active AND the original belief is justified. Doyle-style truth maintenance across agents.
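The rule "IN iff the agent is active AND the original belief is justified" can be sketched with a gate node. Everything named here is an assumption for illustration: the `agent-b` identifier, the `@origin` marker that mirrors the remote label, and the helper functions. Imported nodes are given a single justification and the graph is assumed acyclic.

```python
def import_beliefs(agent: str, remote_labels: dict[str, bool]):
    """Mirror remote beliefs locally, each gated on '<agent>:active'.

    `remote_labels` maps a remote belief id to its IN/OUT status at the origin.
    Returns (justifications, premises_in) for the local store.
    """
    gate = f"{agent}:active"
    justifications: dict[str, list[str]] = {}
    premises_in = {gate}
    for remote_id, is_in_remotely in remote_labels.items():
        origin = f"{agent}:{remote_id}@origin"  # hypothetical marker node
        justifications[f"{agent}:{remote_id}"] = [gate, origin]
        if is_in_remotely:
            premises_in.add(origin)
    return justifications, premises_in


def label(node_id: str, justifications: dict[str, list[str]],
          premises_in: set[str]) -> bool:
    """True when the node is IN: a live premise or fully supported."""
    if node_id in premises_in:
        return True
    antecedents = justifications.get(node_id, [])
    return bool(antecedents) and all(
        label(a, justifications, premises_in) for a in antecedents)


justs, premises = import_beliefs("agent-b", {"x": True, "y": False})
x_while_active = label("agent-b:x", justs, premises)   # justified and active
y_while_active = label("agent-b:y", justs, premises)   # OUT at the origin
premises.discard("agent-b:active")                     # agent goes inactive
x_after_deactivation = label("agent-b:x", justs, premises)
```

Deactivating the agent is a single retraction of the gate node, and every imported belief goes OUT with it.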

Model Stacking

Model A generates candidates, and the TMS records them with provenance. Review (machine and human) critiques them. Model B then receives the validated beliefs and derives new ones, review critiques those derivations, and the cycle repeats. Each level is a full model pass with fresh context and a critique pipeline as quality gate.

For AI Agents

LLM agents use EEM by:

The agent does not need to be told it is an expert — the knowledge base speaks for itself.

Two CLIs

Both are available in the ftl-reasons repository.

beliefs
Structured markdown KB with provenance and manual maintenance. Simple, flat. Use for independent facts.
reasons
Full TMS with automatic propagation, cascades, backtracking, and LLM-driven operations. Use for justified conclusions with dependency chains.

Architecture Pattern

Use the reasons database for all structural operations (add, retract, derive, review). Export to beliefs.md for querying (fast, human-readable, grep-able). Keep both in sync via reasons export-markdown.

Getting Started

Install from the ftl-reasons repository, then:

  1. reasons init — creates reasons.db
  2. Add premises from observations: reasons add node-id "observation text"
  3. Add justified conclusions with --sl to link dependencies: reasons add conclusion "derived text" --sl premise-a,premise-b
  4. Use reasons derive to find connections the source doesn't make explicit
  5. Use reasons review-beliefs to audit — expect 13-37% retraction rate
  6. Retract when evidence changes: reasons retract node-id — cascades propagate automatically

Construction cost is O(chunks) + O(beliefs x rounds), but it is paid once and amortizes across all subsequent queries. Expensive to build, cheap to query at scale.
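The amortization argument can be written out as arithmetic. The unit costs and the example sizes below are placeholders chosen for easy numbers, not measurements from the project.

```python
def amortized_cost_per_query(chunks: int, beliefs: int, rounds: int, queries: int,
                             chunk_cost: float = 1.0, belief_cost: float = 1.0,
                             query_cost: float = 1.0) -> float:
    """One-time build cost spread over all queries, plus the per-query cost."""
    build = chunk_cost * chunks + belief_cost * beliefs * rounds
    return build / queries + query_cost


# Hypothetical sizes: 100 chunks, 50 beliefs, 2 derive/review rounds.
few = amortized_cost_per_query(100, 50, 2, queries=10)
many = amortized_cost_per_query(100, 50, 2, queries=200)
```

With these placeholder numbers the build cost is 200 units, so the per-query cost falls from 21 units at 10 queries to 2 units at 200 queries, approaching the bare query cost as volume grows.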

Glossary

TMS (Truth Maintenance System)
A system that tracks which beliefs are currently justified and automatically propagates changes when justifications change. Invented by Doyle (1979).
IN / OUT
A belief's current truth status. IN means at least one of its justifications holds, that is, every antecedent of that justification is IN. OUT means none of its justifications currently holds.
SL Justification (Support List)
A rule that says "believe X when all of A, B, C are IN." A belief can have multiple SL justifications — it stays IN if any one of them holds.
Retraction Cascade
When a belief goes OUT, everything that depended on it is automatically re-evaluated. Dependents whose justifications no longer hold also go OUT, transitively.
Nogood
A recorded contradiction — a set of beliefs that cannot all be true simultaneously. When detected, the system traces backward through justification chains to find the least-disruptive belief to retract.
Derive
An LLM operation that reads existing beliefs and proposes new ones with justification links. Generates knowledge the source material doesn't state explicitly.
Review
An LLM operation that critiques existing beliefs, proposing retractions for beliefs that are wrong, unsupported, or redundant. 13-37% retraction rate per round.
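As an illustration of the nogood entry above, the sketch below traces a contradiction back to its premises and picks one to retract. The function names are hypothetical, each belief is assumed to have one justification, and "least disruptive" is assumed to mean "fewest derived beliefs depend on it", with ties broken alphabetically. The source does not specify the actual criterion.

```python
def supporting_premises(node_id: str, justifications: dict[str, list[str]]) -> set[str]:
    """Follow antecedents backward until reaching nodes with no justification."""
    antecedents = justifications.get(node_id)
    if not antecedents:
        return {node_id}
    found: set[str] = set()
    for antecedent in antecedents:
        found |= supporting_premises(antecedent, justifications)
    return found


def dependents(premise: str, justifications: dict[str, list[str]]) -> set[str]:
    """Derived beliefs that would lose support if the premise were retracted."""
    return {n for n in justifications
            if premise in supporting_premises(n, justifications)}


def choose_retraction(nogood: list[str], justifications: dict[str, list[str]]) -> str:
    """Pick the premise under the nogood with the fewest dependents."""
    candidates: set[str] = set()
    for member in nogood:
        candidates |= supporting_premises(member, justifications)
    return min(sorted(candidates), key=lambda p: len(dependents(p, justifications)))


# Illustrative graph: c1 and c3 contradict each other.
justifications = {"c1": ["p1", "p2"], "c2": ["p1"], "c3": ["p3"], "c4": ["c2"]}
choice = choose_retraction(["c1", "c3"], justifications)
```

Here `p1` supports three derived beliefs while `p2` and `p3` support one each, so the sketch retracts `p2`, which removes `c1` and resolves the contradiction at the lowest cost.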

Theoretical Foundations