Adding long-term memory to an AI agent means giving it three persistent stores beyond the context window: episodic (recent run history), semantic (extracted facts), and procedural (learned skills). You write memories after each session, retrieve the top-k relevant ones on every turn, and expire stale entries on a schedule. Done right, this cuts token spend by roughly 90%, and on the LongMemEval benchmark the choice of memory framework alone separates 49% from 83%+ long-horizon accuracy. This guide walks through the implementation step by step, with code, prompts, and a Mem0 / Letta / pgvector comparison.

What are the three layers of memory in an AI agent?

An AI agent needs three long-term memory layers plus a short-term working memory. The taxonomy comes from Sumers et al.'s CoALA paper (2024), which adapted cognitive science models to LLM agents.

  • Episodic memory stores past events: "On 2026-04-12 the agent tried tool X and it failed with error Y."
  • Semantic memory stores facts: "User prefers Python over TypeScript. Project DB is Postgres 16."
  • Procedural memory stores how-to: prompts, tool definitions, and learned routines for repeating tasks.
  • Working memory is the current context window, what the agent is reasoning over right now.

Most production agents collapse these into two stores in practice: a transactional log (episodic) and a vector index (semantic). Procedural memory often lives in the system prompt or in a versioned skills file. According to Atlan's 2026 memory taxonomy guide, keeping the layers separated improves retrieval precision because each layer answers a different question type.
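
To make the taxonomy concrete, here is a minimal sketch of a memory record that tags each entry with its layer. The field names are this guide's conventions rather than a standard schema; the retrieve() example later assumes a record roughly like this.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

@dataclass
class Memory:
    kind: Literal["episodic", "semantic", "procedural"]  # which layer the entry belongs to
    content: str                        # the fact, event summary, or rule
    category: str = "general"           # e.g. preference, project, tool, person
    confidence: float = 1.0             # extractor's confidence, 0-1
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    expires_at: datetime | None = None  # TTL; the write path assigns this by category
    superseded_by: str | None = None    # id of a newer memory that contradicts this one
    id: str | None = None               # assigned by the store on insert
    pinned: bool = False                # always retrieved, regardless of similarity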

How do you add memory to an AI agent step by step?

Adding memory to a basic Claude Agent SDK agent takes five concrete steps. Each step maps to roughly 20-100 lines of code in a typical Python implementation; a minimal sketch of the core loop (steps 2-4) follows the list.

  1. Pick a backend. Postgres + pgvector for simplicity, Mem0 for a hosted memory layer, or Letta for full runtime memory management.
  2. Wrap your agent loop. On every turn, before calling Claude, run a retrieve() against your memory store using the user's latest message as the query.
  3. Inject memories into the system prompt. Format the top-k results as a bulleted list under a ## Memory section.
  4. Run an extraction call after each session. Pass the conversation transcript to a small LLM and ask it to emit atomic semantic facts plus an episode summary.
  5. Schedule expiry. A nightly cron deletes memories past their TTL, merges duplicates, and resolves contradictions.
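
Here is a minimal sketch of the per-turn loop (steps 2 and 3) plus the post-session hook (step 4), assuming the plain Anthropic Messages API rather than the full Agent SDK; retrieve() and extract_memories() are the helpers built later in this guide, and the model id is a placeholder.

import anthropic

client = anthropic.Anthropic()

def run_turn(user_message: str, user_id: str, history: list[dict]) -> str:
    # Step 2: retrieve relevant memories, using the latest message as the query
    memories = retrieve(user_message, user_id, k=8)

    # Step 3: inject the top-k results into the system prompt under a ## Memory section
    memory_block = "\n".join(f"- {m.content}" for m in memories)
    system = f"You are a helpful coding agent.\n\n## Memory\n{memory_block}"

    history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        system=system,
        messages=history,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply

# Step 4 runs once, after the session ends, on the full transcript
# (see the extraction prompt in the next section).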

Claude's managed Memory Store handles steps 2-5 automatically as a versioned text directory mounted into the session container. If you want full control, the client-side memory tool gives Claude read/write access to a /memories directory you persist yourself.

What prompts do you use to extract memories from a conversation?

The extraction prompt is the most important piece of a memory pipeline. It runs after each session and decides what gets stored. A weak prompt produces noise; a tight prompt produces high-signal facts.

Here is the pattern that works in production:

You are a memory extractor. From the transcript below, output JSON with three arrays:

1. "semantic": atomic facts about the user, project, or world.
   Format: {"fact": "...", "category": "preference|project|tool|person", "confidence": 0-1}

2. "episodic": notable events that should be retrievable later.
   Format: {"summary": "...", "timestamp": ISO8601, "outcome": "success|failure|neutral"}

3. "procedural": new rules or routines the agent should follow next time.
   Format: {"rule": "...", "trigger": "when X happens"}

Rules:
- Skip pleasantries, greetings, meta-talk.
- One atomic fact per entry, do not bundle.
- If a fact contradicts an existing memory (provided below), mark it "supersedes": <id>.
- Drop any fact with confidence below 0.6.

The Mem0 paper (arXiv:2504.19413) describes a similar two-stage pipeline (extract then update) and reports it reaches 91.6% on LoCoMo and 93.4% on LongMemEval while keeping retrieval under 7,000 tokens per call. Pass the existing top-20 most relevant memories into the extractor as context so it can detect duplicates and contradictions.
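
A hedged sketch of the post-session extraction call, assuming the prompt above is stored in EXTRACTION_PROMPT and the Anthropic Messages API; the model id and the Memory record come from earlier sketches and are placeholders, not fixed choices.

import json
import anthropic

client = anthropic.Anthropic()
EXTRACTOR_MODEL = "claude-haiku-4-5"  # placeholder: any small, cheap model works

def extract_memories(transcript: str, existing: list[Memory], user_id: str) -> dict:
    # Pass the top-20 most relevant existing memories so the extractor can emit
    # "supersedes" references instead of duplicates.
    existing_block = "\n".join(f"[{m.id}] {m.content}" for m in existing[:20])
    response = client.messages.create(
        model=EXTRACTOR_MODEL,
        max_tokens=2048,
        system=EXTRACTION_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Existing memories:\n{existing_block}\n\nTranscript:\n{transcript}",
        }],
    )
    extracted = json.loads(response.content[0].text)
    # Enforce the confidence rule from the prompt before writing anything
    extracted["semantic"] = [f for f in extracted["semantic"] if f["confidence"] >= 0.6]
    return extracted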

How do you retrieve memories on each agent turn?

Retrieval runs before every model call. The standard pipeline is: embed the user query, run a hybrid search (vector + metadata filters), rerank, and format the top-k into the system prompt.

def retrieve(query: str, user_id: str, k: int = 8) -> list[Memory]:
    """Return the k most relevant, non-expired memories for this user."""
    embedding = embed(query)  # e.g. text-embedding-3-small, 1536 dims
    # Assumes pgvector's Python adapter is registered so the list binds as a vector
    candidates = db.execute("""
        SELECT id, content, category, created_at,
               1 - (embedding <=> %s) AS similarity
        FROM memories
        WHERE user_id = %s AND expires_at > now()
        ORDER BY embedding <=> %s
        LIMIT 30
    """, (embedding, user_id, embedding)).fetchall()
    # Rerank with recency + category boost, then keep the top k
    return rerank(candidates, query)[:k]

Key tuning levers (a sketch of the rerank step follows the list):

  • k between 5 and 10. More than 10 starts hurting reasoning quality per Databricks' memory scaling research (2026).
  • Hybrid filter on category. A query about preferences should not surface project state.
  • Recency boost. Multiply similarity by exp(-age_days / 90) so old memories fade.
  • Always include pinned memories. User name, current project, hard constraints, regardless of similarity score.
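
Below is a sketch of the rerank() used above, folding in the recency decay and pinned memories from this list. It assumes candidate rows come back as dicts (e.g. psycopg's dict_row) and that a pinned flag is selected alongside the other columns; the decay constant is the illustrative 90-day default, not a tuned value.

import math
from datetime import datetime, timezone

def rerank(candidates: list[dict], query: str, decay_days: float = 90.0) -> list[dict]:
    # query is kept for signature parity; a cross-encoder reranker would actually use it
    now = datetime.now(timezone.utc)
    for c in candidates:
        age_days = max((now - c["created_at"]).days, 0)
        c["score"] = c["similarity"] * math.exp(-age_days / decay_days)  # old memories fade
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    # Pinned memories (user name, current project, hard constraints) always come first
    pinned = [c for c in ranked if c.get("pinned")]
    return pinned + [c for c in ranked if not c.get("pinned")]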

Should you use a vector DB or a relational DB for agent memory?

Use Postgres with pgvector if you are already on Postgres and have under 10 million memories. Move to a dedicated vector database above 20 million vectors or when vector workloads start contending with OLTP traffic.

The trade-off, per pgvector's GitHub repo and Zen van Riel's pgvector vs dedicated DB analysis (2026):

  • pgvector wins on ops simplicity. One database, one backup, one set of credentials. SQL joins between vector results and relational data work in a single query.
  • pgvector wins on transactional consistency. Updating a memory and its embedding happens in one transaction, no eventual-consistency bugs.
  • Dedicated vector DBs win on scale. Pinecone, Weaviate, and Qdrant scale horizontally and isolate vector workloads from your application database.
  • Performance breakpoint sits around 10-20M vectors depending on dimensionality and hardware.

For most agent products, you will hit the user count or compliance ceiling long before you hit the vector ceiling. Start with pgvector. If you outgrow it, the migration is mechanical.
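
If you go the pgvector route, the schema that the retrieve() example above assumes looks roughly like this. Table and column names are this guide's conventions and the DSN is a placeholder; the HNSW index uses pgvector's cosine operator class to match the <=> queries above.

import psycopg

DDL = (
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS memories (
        id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        user_id     text NOT NULL,
        kind        text NOT NULL,              -- episodic | semantic | procedural
        category    text NOT NULL DEFAULT 'general',
        content     text NOT NULL,
        embedding   vector(1536),               -- text-embedding-3-small dimension
        pinned      boolean NOT NULL DEFAULT false,
        created_at  timestamptz NOT NULL DEFAULT now(),
        expires_at  timestamptz NOT NULL DEFAULT now() + interval '90 days'  -- TTL, set per category at write time
    )
    """,
    # Approximate-nearest-neighbour index; vector_cosine_ops matches the <=> operator
    "CREATE INDEX IF NOT EXISTS memories_embedding_idx "
    "ON memories USING hnsw (embedding vector_cosine_ops)",
)

with psycopg.connect("postgresql://localhost/agent_memory") as conn:  # placeholder DSN
    for statement in DDL:
        conn.execute(statement)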

Mem0 vs Letta vs pgvector: which storage option should you use?

The three options sit at different points on the build-vs-buy curve. pgvector is build-your-own. Mem0 is a memory layer that plugs into any agent framework. Letta is a full agent runtime that owns memory management.

Key differences from the Vectorize Mem0 vs Letta comparison (2026) and Atlan's 2026 framework rankings:

  • Mem0 extracts memories passively. You call add() with a transcript; the pipeline decides what to store (sketched after this list). Narrow SDK, low lock-in.
  • Letta agents self-edit memory. The agent decides what is worth remembering and calls memory functions inside its reasoning loop. High lock-in, but state-of-the-art for agents that run autonomously for days.
  • pgvector is whatever you build. Maximum flexibility, maximum implementation cost.
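
For scale, the Mem0 integration surface is small. This sketch follows the mem0ai Python SDK's documented add()/search() quickstart; treat the exact signatures and return shapes as version-dependent and check the current docs.

from mem0 import Memory

m = Memory()  # self-hosted OSS; the hosted platform uses MemoryClient(api_key=...)

# After a session: hand over the transcript and let Mem0's pipeline decide what to store
m.add(
    [{"role": "user", "content": "I prefer Python over TypeScript; the project DB is Postgres 16"}],
    user_id="user-123",
)

# Before each turn: retrieve what is relevant to the latest message
results = m.search("which database does this project use?", user_id="user-123")
print(results)  # stored memories; exact shape varies by SDK version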

On the LongMemEval benchmark, the Mem0 paper reports 93.4% accuracy. An independent OMEGA benchmark (2026) measured Mem0 at 49% and Letta at ~83.2%. The gap suggests benchmark setup matters; test on your own data before committing.

Decision rule from n1n.ai's 2026 comparison: Mem0 for chatbots and personalization, Letta for long-running autonomous agents, pgvector when you need full control or already operate Postgres.

LongMemEval Accuracy: Memory Frameworks Compared
  • Mem0 (paper, self-reported): 93.4%
  • Letta (independent eval): 83.2%
  • Mem0 (independent eval): 49%
Source: Mem0 arXiv:2504.19413 + independent OMEGA benchmark (2026)

How much does long-term memory cost an AI agent in tokens?

Adding memory reduces total token spend roughly 90% on long conversations because you stop replaying the entire history into context.

From the Mem0 paper (arXiv:2504.19413, 2025):

  • Full-context approach: ~26,000 tokens per conversation turn.
  • Mem0 memory-based: ~1,800 tokens per turn.
  • Result: 90% token reduction, 91% lower p95 latency (1.44s vs 17.12s).

Oracle's AI Agent Memory benchmark (2025) reported a similar shape: per-request input held near 1,300 tokens with memory, while a flat-history baseline grew linearly to 13,900 tokens by the final turn, 9.5x more tokens per request.

Doing the math at Claude Sonnet 4.5 pricing ($3/M input tokens), a 1,000-turn agent session goes from ~$78 (full-context) to ~$5.40 (memory-based). At scale this is the difference between a viable product and one that bleeds margin.

The extraction step adds cost too, roughly 2,000-4,000 tokens per session for a small extractor model. That overhead is amortized across every future turn that retrieves the resulting memory.
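
The arithmetic as a quick sketch, using the figures quoted above (swap in your own prices and turn counts):

INPUT_PRICE_PER_M = 3.00  # $ per million input tokens (Claude Sonnet 4.5, cited above)
TURNS = 1_000

full_context = 26_000 * TURNS * INPUT_PRICE_PER_M / 1e6  # ~ $78.00
memory_based = 1_800 * TURNS * INPUT_PRICE_PER_M / 1e6    # ~ $5.40
extraction = 3_000 * INPUT_PRICE_PER_M / 1e6              # ~ $0.009 per session, amortized

print(f"full context: ${full_context:.2f}  with memory: ${memory_based + extraction:.2f}")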

Token Cost: Full-Context vs Memory-Based Agent
  • Full-context (no memory): ~26,000 tokens per turn
  • Mem0 (memory-based): ~1,800 tokens per turn
Source: Mem0 paper, arXiv:2504.19413 (2025)

How do you prevent stale memories from confusing your agent?

Stale memories are the #1 failure mode of production memory systems. A user changes their email, the old one stays in semantic memory, and the agent sends mail to the wrong address. Four mechanisms keep this from happening; a write-path sketch follows the list.

  1. Timestamp every memory and decay relevance. Multiply similarity scores by exp(-age_days / decay_constant). Use shorter decay for volatile categories (project state) and longer for stable ones (user identity).

  2. Detect contradictions at write time. When the extractor emits a new fact, fetch the top-5 semantically similar existing memories and ask the LLM: "Does the new fact contradict any of these? If yes, output the IDs to supersede." Mark superseded memories as deleted.

  3. Set TTLs by category. Sensible defaults:

    • Preferences: 90 days
    • Project state: 30 days
    • One-off events: 7 days
    • Identity facts: indefinite (until contradicted)

  4. Run weekly compaction. A scheduled job merges duplicate memories, deletes orphans, and re-embeds memories whose source text was edited. The State of AI Agent Memory 2026 report found teams that skip compaction see retrieval quality degrade ~15% per quarter as duplicates accumulate.
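
A sketch of the write path combining mechanisms 2 and 3: assign a TTL by category, then check the new fact against its nearest neighbours and expire anything it supersedes. check_contradiction() is a hypothetical helper wrapping the LLM prompt described in mechanism 2; db, embed(), and retrieve() are the assumed helpers from earlier sketches, and the TTL table mirrors the defaults above.

from datetime import datetime, timedelta, timezone

TTL_BY_CATEGORY = {
    "preference": timedelta(days=90),
    "project": timedelta(days=30),
    "event": timedelta(days=7),
    "identity": None,  # indefinite, until contradicted
}

def write_memory(fact: dict, user_id: str) -> None:
    # Mechanism 3: TTL by category (identity facts effectively never expire)
    ttl = TTL_BY_CATEGORY.get(fact["category"], timedelta(days=30))
    expires_at = (datetime.now(timezone.utc) + ttl) if ttl else datetime.max.replace(tzinfo=timezone.utc)

    # Mechanism 2: contradiction check against the 5 most similar existing memories
    similar = retrieve(fact["fact"], user_id, k=5)
    for mem_id in check_contradiction(fact, similar):  # small LLM call; returns ids to supersede
        db.execute("UPDATE memories SET expires_at = now() WHERE id = %s", (mem_id,))  # soft delete

    db.execute(
        "INSERT INTO memories (user_id, kind, category, content, embedding, expires_at) "
        "VALUES (%s, 'semantic', %s, %s, %s, %s)",
        (user_id, fact["category"], fact["fact"], embed(fact["fact"]), expires_at),
    )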

Letta's tiered architecture handles parts of this automatically by promoting frequently-accessed memories to core and demoting unused ones to archival. With pgvector or Mem0, you build it yourself.

| Feature | Postgres + pgvector | Mem0 | Letta (formerly MemGPT) |
|---|---|---|---|
| Architecture | Self-hosted SQL + vector index | Bolt-on memory layer (SDK) | Full agent runtime with OS-style memory |
| Memory model | Whatever you build | Passive extraction on add() | Agent self-edits core / archival / recall |
| LongMemEval score | DIY (depends on your code) | 93.4% (paper) / 49% (independent) | ~83.2% (independent) |
| Tokens per retrieval | Depends on query design | <7,000 (Mem0 paper) | Variable, agent-controlled |
| Best for | Teams already on Postgres, <10M vectors | Personalization, chatbots, fast integration | Long-running autonomous agents (days+) |
| Lock-in | Low (standard SQL) | Low (3 SDK call sites) | High (rebuild loop = 2-6 weeks) |
| License / pricing | Open source (PostgreSQL license) | Hosted + open-core (Apache-2.0) | Apache-2.0 (self-host free) |
| Switch cost | Days | Days | Weeks |