Adding long-term memory to an AI agent means giving it three persistent stores beyond the context window: episodic (recent run history), semantic (extracted facts), and procedural (learned skills). You write memories after each session, retrieve the top-k relevant ones on every turn, and expire stale entries on a schedule. Done right, this cuts token spend by roughly 90%, and on the LongMemEval benchmark the choice of memory framework alone separates 49% from 83%+ long-horizon accuracy. This guide walks through the implementation step by step, with code, prompts, and a Mem0 / Letta / pgvector comparison.

What are the three layers of memory in an AI agent?

An AI agent needs three long-term memory layers plus a short-term working memory. The taxonomy comes from Sumers et al.'s CoALA paper (2024), which adapted cognitive science models to LLM agents.

  • Episodic memory stores past events: "On 2026-04-12 the agent tried tool X and it failed with error Y."
  • Semantic memory stores facts: "User prefers Python over TypeScript. Project DB is Postgres 16."
  • Procedural memory stores how-to: prompts, tool definitions, and learned routines for repeating tasks.
  • Working memory is the current context window, what the agent is reasoning over right now.

Most production agents collapse these into two stores in practice: a transactional log (episodic) and a vector index (semantic). Procedural memory often lives in the system prompt or in a versioned skills file. According to Atlan's 2026 memory taxonomy guide, keeping the layers separated improves retrieval precision because each layer answers a different question type.
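
To make the taxonomy concrete, here is a minimal sketch of a memory record that tags each entry with its layer. The field names are this guide's conventions rather than a standard schema; the retrieve() example later assumes a record roughly like this.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Literal

@dataclass
class Memory:
    kind: Literal["episodic", "semantic", "procedural"]  # which layer the entry belongs to
    content: str                        # the fact, event summary, or rule
    category: str = "general"           # e.g. preference, project, tool, person
    confidence: float = 1.0             # extractor's confidence, 0-1
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    expires_at: datetime | None = None  # TTL; the write path assigns this by category
    superseded_by: str | None = None    # id of a newer memory that contradicts this one
    id: str | None = None               # assigned by the store on insert
    pinned: bool = False                # always retrieved, regardless of similarity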

How do you add memory to an AI agent step by step?

Adding memory to a basic Claude Agent SDK agent takes five concrete steps. Each step maps to roughly 20-100 lines of code in a typical Python implementation; a minimal sketch of the core loop (steps 2-4) follows the list.

  1. Pick a backend. Postgres + pgvector for simplicity, Mem0 for a hosted memory layer, or Letta for full runtime memory management.
  2. Wrap your agent loop. On every turn, before calling Claude, run a retrieve() against your memory store using the user's latest message as the query.
  3. Inject memories into the system prompt. Format the top-k results as a bulleted list under a ## Memory section.
  4. Run an extraction call after each session. Pass the conversation transcript to a small LLM and ask it to emit atomic semantic facts plus an episode summary.
  5. Schedule expiry. A nightly cron deletes memories past their TTL, merges duplicates, and resolves contradictions.
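
Here is a minimal sketch of the per-turn loop (steps 2 and 3) plus the post-session hook (step 4), assuming the plain Anthropic Messages API rather than the full Agent SDK; retrieve() and extract_memories() are the helpers built later in this guide, and the model id is a placeholder.

import anthropic

client = anthropic.Anthropic()

def run_turn(user_message: str, user_id: str, history: list[dict]) -> str:
    # Step 2: retrieve relevant memories, using the latest message as the query
    memories = retrieve(user_message, user_id, k=8)

    # Step 3: inject the top-k results into the system prompt under a ## Memory section
    memory_block = "\n".join(f"- {m.content}" for m in memories)
    system = f"You are a helpful coding agent.\n\n## Memory\n{memory_block}"

    history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        system=system,
        messages=history,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply

# Step 4 runs once, after the session ends, on the full transcript
# (see the extraction prompt in the next section).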

Claude's managed Memory Store handles steps 2-5 automatically as a versioned text directory mounted into the session container. If you want full control, the client-side memory tool gives Claude read/write access to a /memories directory you persist yourself.

What prompts do you use to extract memories from a conversation?

The extraction prompt is the most important piece of a memory pipeline. It runs after each session and decides what gets stored. A weak prompt produces noise; a tight prompt produces high-signal facts.

Here is the pattern that works in production:

You are a memory extractor. From the transcript below, output JSON with three arrays:

1. "semantic": atomic facts about the user, project, or world.
   Format: {"fact": "...", "category": "preference|project|tool|person", "confidence": 0-1}

2. "episodic": notable events that should be retrievable later.
   Format: {"summary": "...", "timestamp": ISO8601, "outcome": "success|failure|neutral"}

3. "procedural": new rules or routines the agent should follow next time.
   Format: {"rule": "...", "trigger": "when X happens"}

Rules:
- Skip pleasantries, greetings, meta-talk.
- One atomic fact per entry, do not bundle.
- If a fact contradicts an existing memory (provided below), mark it "supersedes": <id>.
- Drop any fact with confidence below 0.6.

The Mem0 paper (arXiv:2504.19413) describes a similar two-stage pipeline (extract then update) and reports it reaches 91.6% on LoCoMo and 93.4% on LongMemEval while keeping retrieval under 7,000 tokens per call. Pass the existing top-20 most relevant memories into the extractor as context so it can detect duplicates and contradictions.
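
A hedged sketch of the post-session extraction call, assuming the prompt above is stored in EXTRACTION_PROMPT and the Anthropic Messages API; the model id and the Memory record come from earlier sketches and are placeholders, not fixed choices.

import json
import anthropic

client = anthropic.Anthropic()
EXTRACTOR_MODEL = "claude-haiku-4-5"  # placeholder: any small, cheap model works

def extract_memories(transcript: str, existing: list[Memory], user_id: str) -> dict:
    # Pass the top-20 most relevant existing memories so the extractor can emit
    # "supersedes" references instead of duplicates.
    existing_block = "\n".join(f"[{m.id}] {m.content}" for m in existing[:20])
    response = client.messages.create(
        model=EXTRACTOR_MODEL,
        max_tokens=2048,
        system=EXTRACTION_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Existing memories:\n{existing_block}\n\nTranscript:\n{transcript}",
        }],
    )
    extracted = json.loads(response.content[0].text)
    # Enforce the confidence rule from the prompt before writing anything
    extracted["semantic"] = [f for f in extracted["semantic"] if f["confidence"] >= 0.6]
    return extracted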

How do you retrieve memories on each agent turn?

Retrieval runs before every model call. The standard pipeline is: embed the user query, run a hybrid search (vector + metadata filters), rerank, and format the top-k into the system prompt.

def retrieve(query: str, user_id: str, k: int = 8) -> list[Memory]:
    """Return the k most relevant, non-expired memories for this user."""
    embedding = embed(query)  # e.g. text-embedding-3-small, 1536 dims
    # Assumes pgvector's Python adapter is registered so the list binds as a vector
    candidates = db.execute("""
        SELECT id, content, category, created_at,
               1 - (embedding <=> %s) AS similarity
        FROM memories
        WHERE user_id = %s AND expires_at > now()
        ORDER BY embedding <=> %s
        LIMIT 30
    """, (embedding, user_id, embedding)).fetchall()
    # Rerank with recency + category boost, then keep the top k
    return rerank(candidates, query)[:k]

Key tuning levers (a sketch of the rerank step follows the list):

  • k between 5 and 10. More than 10 starts hurting reasoning quality per Databricks' memory scaling research (2026).
  • Hybrid filter on category. A query about preferences should not surface project state.
  • Recency boost. Multiply similarity by exp(-age_days / 90) so old memories fade.
  • Always include pinned memories. User name, current project, hard constraints, regardless of similarity score.
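
Below is a sketch of the rerank() used above, folding in the recency decay and pinned memories from this list. It assumes candidate rows come back as dicts (e.g. psycopg's dict_row) and that a pinned flag is selected alongside the other columns; the decay constant is the illustrative 90-day default, not a tuned value.

import math
from datetime import datetime, timezone

def rerank(candidates: list[dict], query: str, decay_days: float = 90.0) -> list[dict]:
    # query is kept for signature parity; a cross-encoder reranker would actually use it
    now = datetime.now(timezone.utc)
    for c in candidates:
        age_days = max((now - c["created_at"]).days, 0)
        c["score"] = c["similarity"] * math.exp(-age_days / decay_days)  # old memories fade
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    # Pinned memories (user name, current project, hard constraints) always come first
    pinned = [c for c in ranked if c.get("pinned")]
    return pinned + [c for c in ranked if not c.get("pinned")]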

Should you use a vector DB or a relational DB for agent memory?

Use Postgres with pgvector if you are already on Postgres and have under 10 million memories. Move to a dedicated vector database above 20 million vectors or when vector workloads start contending with OLTP traffic.

The trade-off, per pgvector's GitHub repo and Zen van Riel's pgvector vs dedicated DB analysis (2026):

  • pgvector wins on ops simplicity. One database, one backup, one set of credentials. SQL joins between vector results and relational data work in a single query.
  • pgvector wins on transactional consistency. Updating a memory and its embedding happens in one transaction, no eventual-consistency bugs.
  • Dedicated vector DBs win on scale. Pinecone, Weaviate, and Qdrant scale horizontally and isolate vector workloads from your application database.
  • Performance breakpoint sits around 10-20M vectors depending on dimensionality and hardware.

For most agent products, you will hit the user count or compliance ceiling long before you hit the vector ceiling. Start with pgvector. If you outgrow it, the migration is mechanical.
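
If you go the pgvector route, the schema that the retrieve() example above assumes looks roughly like this. Table and column names are this guide's conventions and the DSN is a placeholder; the HNSW index uses pgvector's cosine operator class to match the <=> queries above.

import psycopg

DDL = (
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS memories (
        id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        user_id     text NOT NULL,
        kind        text NOT NULL,              -- episodic | semantic | procedural
        category    text NOT NULL DEFAULT 'general',
        content     text NOT NULL,
        embedding   vector(1536),               -- text-embedding-3-small dimension
        pinned      boolean NOT NULL DEFAULT false,
        created_at  timestamptz NOT NULL DEFAULT now(),
        expires_at  timestamptz NOT NULL DEFAULT now() + interval '90 days'  -- TTL, set per category at write time
    )
    """,
    # Approximate-nearest-neighbour index; vector_cosine_ops matches the <=> operator
    "CREATE INDEX IF NOT EXISTS memories_embedding_idx "
    "ON memories USING hnsw (embedding vector_cosine_ops)",
)

with psycopg.connect("postgresql://localhost/agent_memory") as conn:  # placeholder DSN
    for statement in DDL:
        conn.execute(statement)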

Mem0 vs Letta vs pgvector: which storage option should you use?

The three options sit at different points on the build-vs-buy curve. pgvector is build-your-own. Mem0 is a memory layer that plugs into any agent framework. Letta is a full agent runtime that owns memory management.

Key differences from the Vectorize Mem0 vs Letta comparison (2026) and Atlan's 2026 framework rankings:

  • Mem0 extracts memories passively. You call add() with a transcript; the pipeline decides what to store (sketched after this list). Narrow SDK, low lock-in.
  • Letta agents self-edit memory. The agent decides what is worth remembering and calls memory functions inside its reasoning loop. High lock-in, but state-of-the-art for agents that run autonomously for days.
  • pgvector is whatever you build. Maximum flexibility, maximum implementation cost.
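
For scale, the Mem0 integration surface is small. This sketch follows the mem0ai Python SDK's documented add()/search() quickstart; treat the exact signatures and return shapes as version-dependent and check the current docs.

from mem0 import Memory

m = Memory()  # self-hosted OSS; the hosted platform uses MemoryClient(api_key=...)

# After a session: hand over the transcript and let Mem0's pipeline decide what to store
m.add(
    [{"role": "user", "content": "I prefer Python over TypeScript; the project DB is Postgres 16"}],
    user_id="user-123",
)

# Before each turn: retrieve what is relevant to the latest message
results = m.search("which database does this project use?", user_id="user-123")
print(results)  # stored memories; exact shape varies by SDK version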

On the LongMemEval benchmark, the Mem0 paper reports 93.4% accuracy. An independent OMEGA benchmark (2026) measured Mem0 at 49% and Letta at ~83.2%. The gap suggests benchmark setup matters; test on your own data before committing.

Decision rule from n1n.ai's 2026 comparison: Mem0 for chatbots and personalization, Letta for long-running autonomous agents, pgvector when you need full control or already operate Postgres.

LongMemEval Accuracy: Memory Frameworks Compared
  • Mem0 (paper, self-reported): 93.4%
  • Letta (independent eval): 83.2%
  • Mem0 (independent eval): 49%
Source: Mem0 arXiv:2504.19413 + independent OMEGA benchmark (2026)

How much does long-term memory cost an AI agent in tokens?

Adding memory reduces total token spend roughly 90% on long conversations because you stop replaying the entire history into context.

From the Mem0 paper (arXiv:2504.19413, 2025):

  • Full-context approach: ~26,000 tokens per conversation turn.
  • Mem0 memory-based: ~1,800 tokens per turn.
  • Result: 90% token reduction, 91% lower p95 latency (1.44s vs 17.12s).

Oracle's AI Agent Memory benchmark (2025) reported a similar shape: per-request input held near 1,300 tokens with memory, while a flat-history baseline grew linearly to 13,900 tokens by the final turn, 9.5x more tokens per request.

Doing the math at Claude Sonnet 4.5 pricing ($3/M input tokens), a 1,000-turn agent session goes from ~$78 (full-context) to ~$5.40 (memory-based). At scale this is the difference between a viable product and one that bleeds margin.

The extraction step adds cost too, roughly 2,000-4,000 tokens per session for a small extractor model. That overhead is amortized across every future turn that retrieves the resulting memory.
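
The arithmetic as a quick sketch, using the figures quoted above (swap in your own prices and turn counts):

INPUT_PRICE_PER_M = 3.00  # $ per million input tokens (Claude Sonnet 4.5, cited above)
TURNS = 1_000

full_context = 26_000 * TURNS * INPUT_PRICE_PER_M / 1e6  # ~ $78.00
memory_based = 1_800 * TURNS * INPUT_PRICE_PER_M / 1e6    # ~ $5.40
extraction = 3_000 * INPUT_PRICE_PER_M / 1e6              # ~ $0.009 per session, amortized

print(f"full context: ${full_context:.2f}  with memory: ${memory_based + extraction:.2f}")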

Token Cost: Full-Context vs Memory-Based Agent
  • Full-context (no memory): ~26,000 tokens per turn
  • Mem0 (memory-based): ~1,800 tokens per turn
Source: Mem0 paper, arXiv:2504.19413 (2025)

How do you prevent stale memories from confusing your agent?

Stale memories are the #1 failure mode of production memory systems. A user changes their email, the old one stays in semantic memory, and the agent sends mail to the wrong address. Four mechanisms keep this from happening; a write-path sketch follows the list.

  1. Timestamp every memory and decay relevance. Multiply similarity scores by exp(-age_days / decay_constant). Use shorter decay for volatile categories (project state) and longer for stable ones (user identity).

  2. Detect contradictions at write time. When the extractor emits a new fact, fetch the top-5 semantically similar existing memories and ask the LLM: "Does the new fact contradict any of these? If yes, output the IDs to supersede." Mark superseded memories as deleted.

  3. Set TTLs by category. Sensible defaults:

    • Preferences: 90 days
    • Project state: 30 days
    • One-off events: 7 days
    • Identity facts: indefinite (until contradicted)

  4. Run weekly compaction. A scheduled job merges duplicate memories, deletes orphans, and re-embeds memories whose source text was edited. The State of AI Agent Memory 2026 report found teams that skip compaction see retrieval quality degrade ~15% per quarter as duplicates accumulate.
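
A sketch of the write path combining mechanisms 2 and 3: assign a TTL by category, then check the new fact against its nearest neighbours and expire anything it supersedes. check_contradiction() is a hypothetical helper wrapping the LLM prompt described in mechanism 2; db, embed(), and retrieve() are the assumed helpers from earlier sketches, and the TTL table mirrors the defaults above.

from datetime import datetime, timedelta, timezone

TTL_BY_CATEGORY = {
    "preference": timedelta(days=90),
    "project": timedelta(days=30),
    "event": timedelta(days=7),
    "identity": None,  # indefinite, until contradicted
}

def write_memory(fact: dict, user_id: str) -> None:
    # Mechanism 3: TTL by category (identity facts effectively never expire)
    ttl = TTL_BY_CATEGORY.get(fact["category"], timedelta(days=30))
    expires_at = (datetime.now(timezone.utc) + ttl) if ttl else datetime.max.replace(tzinfo=timezone.utc)

    # Mechanism 2: contradiction check against the 5 most similar existing memories
    similar = retrieve(fact["fact"], user_id, k=5)
    for mem_id in check_contradiction(fact, similar):  # small LLM call; returns ids to supersede
        db.execute("UPDATE memories SET expires_at = now() WHERE id = %s", (mem_id,))  # soft delete

    db.execute(
        "INSERT INTO memories (user_id, kind, category, content, embedding, expires_at) "
        "VALUES (%s, 'semantic', %s, %s, %s, %s)",
        (user_id, fact["category"], fact["fact"], embed(fact["fact"]), expires_at),
    )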

Letta's tiered architecture handles parts of this automatically by promoting frequently-accessed memories to core and demoting unused ones to archival. With pgvector or Mem0, you build it yourself.

| Feature | Postgres + pgvector | Mem0 | Letta (formerly MemGPT) |
|---|---|---|---|
| Architecture | Self-hosted SQL + vector index | Bolt-on memory layer (SDK) | Full agent runtime with OS-style memory |
| Memory model | Whatever you build | Passive extraction on add() | Agent self-edits core / archival / recall |
| LongMemEval score | DIY (depends on your code) | 93.4% (paper) / 49% (independent) | ~83.2% (independent) |
| Tokens per retrieval | Depends on query design | <7,000 (Mem0 paper) | Variable, agent-controlled |
| Best for | Teams already on Postgres, <10M vectors | Personalization, chatbots, fast integration | Long-running autonomous agents (days+) |
| Lock-in | Low (standard SQL) | Low (3 SDK call sites) | High (rebuild loop = 2-6 weeks) |
| License / pricing | Open source (PostgreSQL license) | Hosted + open-core (Apache-2.0) | Apache-2.0 (self-host free) |
| Switch cost | Days | Days | Weeks |