External Epistemic Memory (EEM)

What is EEM?

External Epistemic Memory (EEM) is knowledge that lives outside the model, carries its justifications with it, and lets you understand how the system knows what it knows.

EEM is defined by three load-bearing properties: it is external (outside model parameters), epistemic (justified beliefs with truth values), and memory (persistent semantic knowledge). Each property is necessary. Together they distinguish EEM from every other approach to LLM knowledge management.

Three Properties

External

Knowledge lives outside model parameters, in a separate substrate. It survives compaction, model swaps, and session boundaries. It is separable, copyable, shareable, inspectable, editable, and auditable. Of these six properties, auditability is the most epistemically important — it makes "how do you know that?" answerable by justification chain traversal.

Epistemic

The store holds not just facts but justified beliefs with truth values (IN/OUT), retraction cascades, contradiction records (nogoods), and derivation depth. This is what distinguishes EEM from RAG, which is external semantic memory but not epistemic.

Memory

Persistent structured knowledge in Tulving's semantic memory category — not ephemeral context. The knowledge persists across sessions, across models, and across time.

What EEM Replaces

EEM vs RAG

RAG is external semantic memory but not epistemic. It retrieves content by similarity but has no justification chains, truth values, retraction cascades, or contradiction tracking. EEM adds the epistemic layer that RAG lacks.

EEM vs Context Windows

Conversation history and context windows are ephemeral — lost at session boundaries, destroyed by compaction. EEM persists across sessions and model swaps. Context compaction destroys justification networks (quantified across 33 measured compaction events).

EEM vs Parametric Knowledge

In-parameter knowledge has no audit trail. EEM makes "how do you know that?" answerable by justification chain traversal.

EEM vs Self-Assessed Confidence

LLM self-assessed confidence does not track accuracy. Confirmed across 4 models (March 2026): Sonnet r=0.135 (not significant at p<0.05), Opus r=-0.045 (effectively zero, and slightly negative). The experiment covered 55 questions x 3 conditions x 2 models x 5 runs = 1,650 invocations. Answer and confidence come from the same process, which is the same structural flaw as human overconfidence (Kahneman). EEM replaces "am I sure?" with "is this justified?", shifting from unreliable confidence to auditable justification chains. Confidence experiment methodology.
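
As a rough illustration of the statistic quoted above, the sketch below computes a Pearson correlation between stated confidence and correctness. It assumes that r denotes a Pearson coefficient; the confidence and correctness values are toy numbers invented for the example and are not data from the experiment.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)  # undefined if either sample is constant


# Toy data only (hypothetical): stated confidence vs. whether the answer was correct.
confidence = [0.90, 0.80, 0.95, 0.70, 0.85, 0.90]
correct = [1, 0, 1, 1, 0, 1]
r = pearson_r(confidence, correct)

# Invocation count as stated in the text: questions x conditions x models x runs.
invocations = 55 * 3 * 2 * 5
print(f"r = {r:.3f}, invocations = {invocations}")
```

An r near zero means that knowing the stated confidence tells you almost nothing about whether the answer is right.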

How It Works

Belief Maintenance System (BMS)

EEM is built on Doyle's (1979) Belief Maintenance System architecture¹ (published under the name truth maintenance system, TMS): SL (support-list) justifications with antecedents, propagation cascades, retraction cascades, and an exogenous problem-solver slot. The BMS substrate is content-agnostic by design.
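
A minimal sketch of the idea, not the ftl-reasons implementation: each node carries one or more SL justifications, and a node is IN when some justification has all of its antecedents IN and every outlist node OUT. The `Network` and `Justification` names are hypothetical. The sketch assumes an acyclic network and recomputes labels on demand rather than propagating them incrementally, which yields the same IN/OUT labels for retraction cascades and restoration.

```python
from dataclasses import dataclass, field


@dataclass
class Justification:
    inlist: list[str]                                 # antecedents that must all be IN
    outlist: list[str] = field(default_factory=list)  # nodes that must all be OUT


class Network:
    """Toy belief network (hypothetical API): premises plus SL-justified nodes."""

    def __init__(self) -> None:
        self.premises: set[str] = set()
        self.justs: dict[str, list[Justification]] = {}

    def add_premise(self, node: str) -> None:
        self.premises.add(node)  # also used to restore a retracted premise

    def add(self, node: str, inlist, outlist=()) -> None:
        self.justs.setdefault(node, []).append(
            Justification(list(inlist), list(outlist))
        )

    def retract(self, node: str) -> None:
        self.premises.discard(node)

    def is_in(self, node: str) -> bool:
        """IN if asserted as a premise, or if ANY justification is valid."""
        if node in self.premises:
            return True
        return any(
            all(self.is_in(a) for a in j.inlist)
            and not any(self.is_in(o) for o in j.outlist)
            for j in self.justs.get(node, [])
        )


net = Network()
net.add_premise("p1")
net.add_premise("p2")
net.add("c1", ["p1", "p2"])        # needs both premises
net.add("c2", ["c1"])              # depends on c1 transitively
net.add("c3", ["c1"])              # c3 has two justifications...
net.add("c3", ["p1"])              # ...and survives if either holds

net.retract("p2")                  # cascade: c1 and c2 go OUT, c3 stays IN
after_retract = {n: net.is_in(n) for n in ("c1", "c2", "c3")}
net.add_premise("p2")              # restoration: dependents come back IN
after_restore = {n: net.is_in(n) for n in ("c1", "c2", "c3")}
print(after_retract, after_restore)
```

A production system would propagate label changes incrementally and handle cycles; recomputing on demand keeps the sketch short while producing the same labels on acyclic networks.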

Hybrid Architecture

The implementation is a hybrid BMS: the symbolic BMS handles structure (justifications, propagation, cascades, backtracking, challenge/defend) while LLMs handle semantic operations (derive generates beliefs, review-beliefs critiques them, contradiction detection finds nogoods). Doyle's architecture reserves a slot for an external problem solver, and the LLM fills that slot.

Key Mechanisms

SL Justification: a node is IN when all of its antecedents are IN and every node on its outlist is OUT. Multiple justifications are allowed, and the node stays IN if any one of them is valid. The outlist is what enables non-monotonic reasoning: a belief can hold only while some other belief is absent.
Retraction Cascade: when a node goes OUT, all dependents whose justifications become invalid also go OUT, automatically and transitively. Retract one belief and the network figures out what else falls.
Nogoods: a nogood is a set of nodes that cannot all be IN simultaneously. When one is detected, dependency-directed backtracking traces backward through justification chains and retracts the responsible premise with the fewest dependents (minimal disruption).
Challenge/Defend: dialectical argumentation. Challenging a node makes it go OUT; defending neutralizes the challenge. Multi-level chains are supported. Unlike retract, this preserves the original argument.
Restoration: when a retracted node comes back IN, its dependents are recomputed, so no manual rederivation is needed.
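
The nogood mechanism can be sketched as follows. This is an illustrative approximation, not the ftl-reasons algorithm: the function names and the data layout (a dict from node to alternative antecedent lists, plus a set of premises) are assumptions, the network is assumed acyclic, and "dependents" is counted simply as every derived node that has the premise somewhere in its support.

```python
def supporting_premises(node: str, justs: dict, premises: set) -> set:
    """Walk justification chains backward and collect the premises under a node."""
    if node in premises:
        return {node}
    found = set()
    for antecedents in justs.get(node, []):
        for antecedent in antecedents:
            found |= supporting_premises(antecedent, justs, premises)
    return found


def dependent_count(premise: str, justs: dict, premises: set) -> int:
    """Number of derived nodes with this premise somewhere in their support."""
    return sum(premise in supporting_premises(n, justs, premises) for n in justs)


def choose_retraction(nogood: set, justs: dict, premises: set) -> str:
    """Pick the premise under the nogood that has the fewest dependents."""
    candidates = set()
    for node in nogood:
        candidates |= supporting_premises(node, justs, premises)
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return min(sorted(candidates), key=lambda p: dependent_count(p, justs, premises))


# Hypothetical network: x rests on a and b, y on a, z on c, w on x.
premises = {"a", "b", "c"}
justs = {"x": [["a", "b"]], "y": [["a"]], "z": [["c"]], "w": [["x"]]}

# w and z contradict each other; c supports only z, so it is the cheapest to drop.
culprit = choose_retraction({"w", "z"}, justs, premises)
print(culprit)
```

Counting support this way over-estimates disruption for nodes that have an alternative justification; a fuller version would check which nodes actually go OUT after the retraction.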

Derive-then-Review

Derive deliberately over-generates; review then catches the errors, and retraction cascades propagate the corrections. Both roles overshoot: derive over-generates and review over-retracts. Working through candidate retractions is where insights hide. Between 13% and 37% of derived beliefs are retracted per review round, so the system finds and removes its own errors.

Measured Results

98.5%: A/B grade across 3,853 questions with dual-path architecture (Claude Opus 4.6, May 2026). Zero D/F grades — eliminated the failure tail entirely. Full-scale validation methodology.
88% vs 33%: Expert-service with EEM (Claude Opus 4.6) scores 88% A-grade vs an agent pipeline baseline of 33% on the same 50 Red Hat domain questions, and runs 15x faster. Three rubrics, six systems tested (May 2026). Three-way eval methodology.
40+: Expert knowledge bases built, from 237 beliefs (aap-expert) to 12,731 beliefs (redhat-expert).

Model Compensation

EEM compensates for model size: Sonnet 4.6 + beliefs approximates Opus 4.6 without beliefs. Haiku 4.5 with dual-path achieves 94% A+B, close to Opus at 98%. Smaller models with EEM can approximate larger models without it.

Expert Prompt Paradox

Telling an agent it is an expert reduces belief utilization. Beliefs alone outperform beliefs + expert prompt: Opus 4.6 100% vs 94.2%, Sonnet 4.6 94.2% vs 91.8% (March 2026, 50 questions). The humble generic prompt produces better results because the agent consults the knowledge base instead of trusting its "expertise." Expert prompt ablation methodology.

Self-Critique Failure

LLM revision based on self-critique makes answers worse: Sonnet 4.6 dropped from 87% to 60% accuracy when asked to revise based on self-assessed confidence (March 2026, 1,650 invocations). Self-critique fails because the same model that made the error evaluates the error. EEM externalizes the critic's judgments, replacing internal self-assessment with external structured tracking. Self-critique experiment methodology.

Architecture

Dual-Path Retrieval

EEM is queried via dual-path retrieval: a BMS path (pre-computed beliefs) plus an FTS (full-text search) path over source chunks, merged by a third pass. Each path stays within its cognitive budget.

Cognitive Budget

Borrowed from graphics frame budgets: decompose work into focused passes (BMS pass, RAG pass, merge pass), each within the model's attention budget. Mixing beliefs and document chunks in a single prompt degrades performance (Opus 4.6 drops from 95.5% to 86%). Three focused passes achieve 100%. Architectural ablation results.
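
The three-pass structure can be sketched as below. Everything here is an assumption for illustration: `LLM` is a hypothetical prompt-in, text-out callable, `keyword_search` is a crude stand-in for the real belief and chunk retrieval, and the prompts are placeholders. The point is only that beliefs and source chunks never share a prompt.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def keyword_search(query: str, items: list[str], limit: int = 5) -> list[str]:
    """Crude stand-in for retrieval: rank items by shared lowercase words."""
    words = set(query.lower().split())
    scored = [(len(words & set(item.lower().split())), item) for item in items]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [item for score, item in ranked if score > 0][:limit]


def dual_path_answer(question: str, beliefs: list[str], chunks: list[str], llm: LLM) -> str:
    # Pass 1 (BMS path): beliefs only.
    belief_hits = "\n".join(keyword_search(question, beliefs))
    bms_draft = llm(f"Answer from these beliefs only:\n{belief_hits}\n\nQuestion: {question}")

    # Pass 2 (FTS path): source chunks only.
    chunk_hits = "\n".join(keyword_search(question, chunks))
    fts_draft = llm(f"Answer from these source excerpts only:\n{chunk_hits}\n\nQuestion: {question}")

    # Pass 3 (merge): sees the two drafts, not the raw beliefs or chunks.
    return llm(
        "Merge the two drafts into one answer.\n"
        f"Draft A: {bms_draft}\nDraft B: {fts_draft}\n\nQuestion: {question}"
    )
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client and makes the pass boundaries easy to test with a stub.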

Expert Pipeline

The pipeline has six stages: chunk the source material, propose beliefs, have a human accept them, derive connections, review the derivations, and export. Value accrues at each stage. Derive produces new knowledge: connections the source doesn't make explicit.

Multi-Agent BMS

Import another agent's beliefs with SL justifications including agent:active as antecedent. A node is IN iff the agent is active AND the original belief is justified. Doyle-style truth maintenance across agents.
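
One way to picture the import, as a sketch under stated assumptions rather than the actual import code: every imported node is namespaced with the agent's name, and each of its justifications gains the `agent:active` antecedent. The `import_agent` helper, the `agent:node` naming for imported nodes, and the dict-based data layout are all hypothetical, and outlists are omitted for brevity.

```python
def is_in(node: str, justs: dict, premises: set) -> bool:
    """IN if asserted, or if any justification has all antecedents IN."""
    if node in premises:
        return True
    return any(
        all(is_in(a, justs, premises) for a in antecedents)
        for antecedents in justs.get(node, [])
    )


def import_agent(local_justs: dict, agent: str, remote_justs: dict, remote_premises: set) -> None:
    """Copy another agent's network, gating every node on '<agent>:active'."""
    active = f"{agent}:active"
    for premise in remote_premises:
        # A remote premise holds locally only while the agent is active.
        local_justs.setdefault(f"{agent}:{premise}", []).append([active])
    for node, alternatives in remote_justs.items():
        for antecedents in alternatives:
            gated = [active] + [f"{agent}:{a}" for a in antecedents]
            local_justs.setdefault(f"{agent}:{node}", []).append(gated)


# Hypothetical remote agent "bob" believes alert because of disk-full.
local_justs: dict = {}
import_agent(local_justs, "bob", {"alert": [["disk-full"]]}, {"disk-full"})

local_premises = {"bob:active"}
while_active = is_in("bob:alert", local_justs, local_premises)

local_premises.discard("bob:active")   # agent deactivated: imports go OUT
after_deactivation = is_in("bob:alert", local_justs, local_premises)
print(while_active, after_deactivation)
```

Deactivating one premise takes every belief imported from that agent OUT at once, and reactivating it brings them back without re-importing.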

Model Stacking

Model A generates candidates, the BMS records them with provenance, and review (machine and human) critiques them. Model B then receives the validated beliefs and derives new ones, review critiques those derivations, and the cycle repeats. Each level is a full model pass with fresh context and a critique pipeline as quality gate.

For AI Agents

LLM agents use EEM by:

Querying beliefs via reasons search / reasons show / reasons explain before answering
Citing node IDs for auditability
Running reasons derive to generate new beliefs from existing ones
Running reasons review-beliefs to self-audit
Recording nogoods with reasons nogood when contradictions appear

The agent does not need to be told it is an expert — the knowledge base speaks for itself.

Two CLIs

Both are available in the ftl-reasons repository.

beliefs: Structured markdown KB with provenance and manual maintenance. Simple, flat. Use for independent facts.
reasons: Full BMS with automatic propagation, cascades, backtracking, and LLM-driven operations. Use for justified conclusions with dependency chains.

Architecture Pattern

Use the reasons database for all structural operations (add, retract, derive, review). Export to beliefs.md for querying (fast, human-readable, grep-able). Keep both in sync via reasons export-markdown.

Getting Started

Install from the ftl-reasons repository, then:

reasons init — creates reasons.db
Add premises from observations: reasons add node-id "observation text"
Add justified conclusions with --sl to link dependencies: reasons add conclusion "derived text" --sl premise-a,premise-b
Use reasons derive to find connections the source doesn't make explicit
Use reasons review-beliefs to audit — expect 13-37% retraction rate
Retract when evidence changes: reasons retract node-id — cascades propagate automatically
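
The steps above can be scripted. The sketch below only assembles the command lines listed in this section and prints them; the node IDs and belief texts are placeholders, and whether any command needs further flags is not covered here. Setting `dry_run=False` would execute them, which assumes the reasons CLI is installed and on the PATH.

```python
import shlex
import subprocess

# Commands taken from the steps above; IDs and texts are placeholders.
steps = [
    ["reasons", "init"],
    ["reasons", "add", "premise-a", "first observation"],
    ["reasons", "add", "premise-b", "second observation"],
    ["reasons", "add", "conclusion", "derived text", "--sl", "premise-a,premise-b"],
    ["reasons", "derive"],
    ["reasons", "review-beliefs"],
    ["reasons", "retract", "premise-a"],
]


def run_steps(commands: list[list[str]], dry_run: bool = True) -> list[str]:
    """Return shell-quoted command lines; execute them only if dry_run is False."""
    lines = []
    for argv in commands:
        lines.append(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stops at the first failing command
    return lines


for line in run_steps(steps):
    print(line)
```

Passing arguments as lists rather than one shell string avoids quoting problems when belief text contains spaces or punctuation.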

Construction cost is O(chunks) + O(beliefs x rounds), paid once and amortized across all O(queries) lookups. The knowledge base is expensive to build and cheap to query at scale.

Glossary

BMS (Belief Maintenance System): A system that tracks which beliefs are currently justified and automatically propagates changes when justifications change. Based on Doyle (1979).
IN / OUT: A belief's current truth status. IN means at least one of its justifications holds. OUT means none of its justifications currently holds.
SL Justification (Support List): A rule that says "believe X when all of A, B, C are IN." A belief can have multiple SL justifications — it stays IN if any one of them holds.
Retraction Cascade: When a belief goes OUT, everything that depended on it is automatically re-evaluated. Dependents whose justifications no longer hold also go OUT, transitively.
Nogood: A recorded contradiction — a set of beliefs that cannot all be true simultaneously. When detected, the system traces backward through justification chains to find the least-disruptive belief to retract.
Derive: An LLM operation that reads existing beliefs and proposes new ones with justification links. Generates knowledge the source material doesn't state explicitly.
Review: An LLM operation that critiques existing beliefs, proposing retractions for beliefs that are wrong, unsupported, or redundant. 13-37% retraction rate per round.

Theoretical Foundations

Doyle (1979) — Belief Maintenance Systems with SL justifications, propagation, retraction cascades, and an exogenous problem-solver slot.
de Kleer (1986) — ATMS uses assumption-based environments and nogoods. BMS beats ATMS for EEM because revision matters more than multiple environments when the problem solver (LLM) produces 13-37% errors.
AGM (Alchourrón, Gärdenfors, Makinson 1985) — formal theory for rational belief revision. Entrenchment scoring in backtracking is a crude approximation of AGM.
McCarthy & Hayes (1969) — frame problem: what persists across state changes. Staleness checking addresses this by detecting when source files change under beliefs.