alirezan.dev

© 2026 AliReza Noori. All rights reserved.

6 April 2026 · 11 min read

Agent Memory: 4 Types and When to Use Each

Message history, working memory, observational memory and semantic recall. What they are, how they work together and when to add each to your AI agent. Real code from a production Mastra app.

Tags: ai · agents · memory · mastra · rag · working-memory

Your AI agent is stateless. Every API call starts from zero with no memory of who it's talking to, what was said before or what it learned last week. Everything that makes an agent feel intelligent across conversations is built on top of this blank slate.

I'm building a production AI coaching platform using Mastra (a TypeScript agent framework), Deno and AWS. The agent is a personal trainer that remembers your injuries, tracks your progress and adjusts its coaching style over time. None of that is possible without memory.

There are four types of agent memory. Each solves a different problem, and you don't need all of them on day one. Here's what they are, when to use each and how they work together.

The Four Tiers at a Glance

| Tier | What it stores | Lifespan | Added cost per message |
|---|---|---|---|
| Message History | Raw conversation | Current session | Near zero (DB read/write) |
| Working Memory | Structured user profile | Forever (per user) | One extra LLM call |
| Observational Memory | Compressed conversation notes | Forever (grows slowly) | Background LLM calls |
| Semantic Recall | Vector embeddings of messages | Forever (grows with usage) | Embedding API call + vector search |

Tier 1: Message History

The problem: Without this, the agent can't even remember what you said two messages ago.

What it does: Stores every message (user, assistant and tool calls) in a database. On each turn, loads the last N messages into the LLM's context window.

```typescript
const memory = new Memory({
  options: {
    lastMessages: 20,
  },
});
```

That's it. The agent now sees the last 20 messages from the current conversation.

When it resets: When a new session starts. In my app, I create a new conversation thread after 30 minutes of inactivity. The old messages stay in the database, but the agent only loads messages from the current thread.

This is intentional. You don't want a workout conversation from Tuesday polluting a nutrition question on Thursday.

When to add it: Always. Every agent needs this. It's the baseline.

The concept is universal. Every agent framework does this: OpenAI's Assistants API, LangChain's ConversationBufferMemory or a simple messages[] array you manage yourself.
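If you're managing that `messages[]` array yourself, the sliding-window behaviour is only a few lines. A minimal sketch (the `MessageHistory` class is illustrative, not an API from Mastra or LangChain):

```typescript
// Minimal message history: append every turn, load the last N into context.
type ChatMessage = { role: "user" | "assistant"; content: string };

class MessageHistory {
  private messages: ChatMessage[] = [];

  append(msg: ChatMessage): void {
    this.messages.push(msg);
  }

  // The slice that would be placed in the LLM's context window.
  lastN(n: number): ChatMessage[] {
    return this.messages.slice(-n);
  }
}

const history = new MessageHistory();
for (let i = 1; i <= 25; i++) {
  history.append({ role: i % 2 ? "user" : "assistant", content: `message ${i}` });
}

const ctx = history.lastN(20);
console.log(ctx.length);     // 20
console.log(ctx[0].content); // "message 6"
```

Persist `messages` to a database keyed by thread ID and you also get the session-reset behaviour described above: a new thread simply starts with an empty array.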

Tier 2: Working Memory

The problem: Message history resets between sessions. The agent forgets who you are.

What it does: Maintains a structured profile, a notepad the agent reads at the start of every conversation and updates after each response.

Here's my trainer's working memory template:

```markdown
# User Profile

- Name:
- Age:
- Weight:
- Training experience (years):

# Current Program

- Split type:
- Training days per week:
- Injuries or limitations:

# Progress Tracking

- Key lift numbers (squat/bench/deadlift):
- Recent trend (gaining/losing/maintaining):

# Coaching Notes

- What's working well:
- What needs adjustment:
- Cues or approaches that resonate with this user:
```

The agent fills these in as it learns about the user through conversation. When someone says "I'm 33 and I've been training for about 3 years", the agent updates the profile.

How it works under the hood: every turn triggers two LLM calls. One generates the response; a second reviews the exchange and updates the profile.

That second call is the trade-off. It costs tokens and adds latency, but the agent never forgets important information about the user.
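The loop itself is easy to sketch. The two functions below are stubs standing in for the two LLM calls (none of this is Mastra's API); the point is the sequencing: generate the reply first, then revise the profile from the exchange.

```typescript
// Two-call working-memory loop; stubs stand in for the real LLM calls.
type Turn = { user: string; assistant: string };

// Call 1 (stub): draft the reply with the current profile as context.
function draftReply(profile: string, userMsg: string): string {
  return `[reply drafted with ${profile.length} chars of profile context]`;
}

// Call 2 (stub): a second LLM pass folds new facts into the profile.
function reviseProfile(profile: string, turn: Turn): string {
  return `${profile}\n- Noted this turn: "${turn.user}"`;
}

let profile = "# User Profile";
const userMsg = "I'm 33 and I've been training for about 3 years";

const assistant = draftReply(profile, userMsg);                 // call 1
profile = reviseProfile(profile, { user: userMsg, assistant }); // call 2

console.log(profile.includes("training for about 3 years")); // true
```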

The key insight: scope.

Working memory can be scoped to the user or to the conversation:

  • User-scoped (what I use): The profile persists across all conversations. New session starts, message history resets, but the agent still knows your name, injuries and lift numbers. This is what you want for any agent that builds a relationship.

  • Conversation-scoped: The profile resets with each new thread. Useful for task-specific scratchpads. A meal planning agent might track the plan-in-progress per conversation without carrying it into the next one.

```typescript
const memory = new Memory({
  options: {
    lastMessages: 20,
    workingMemory: {
      enabled: true,
      scope: "resource", // persists across all conversations
      template, // the markdown template above
    },
  },
});
```

When to add it: When your agent needs to remember users across sessions. Names, preferences, goals and medical history. Anything where "forgetting" would break the experience.

The concept is universal. Every serious agent app implements persistent user profiles. Mastra automates it with a second LLM call. LangChain offers EntityMemory. OpenAI Assistants have no built-in equivalent, so you'd manage it yourself. Some apps skip the LLM and use structured extraction instead.
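As a sketch of that last variant, deterministic extraction can be as simple as a few regexes over the incoming message (the field names and patterns here are invented for illustration; real extractors are usually schema-driven):

```typescript
// Structured extraction without an LLM: pull profile fields with regexes.
type ProfileUpdate = { age?: number; trainingYears?: number };

function extractProfileFields(message: string): ProfileUpdate {
  const update: ProfileUpdate = {};

  const age = message.match(/\bI'?m (\d{1,3})\b/i);
  if (age) update.age = Number(age[1]);

  const years = message.match(/training for (?:about )?(\d+) years?/i);
  if (years) update.trainingYears = Number(years[1]);

  return update;
}

const update = extractProfileFields(
  "I'm 33 and I've been training for about 3 years"
);
console.log(update); // { age: 33, trainingYears: 3 }
```

Cheaper and fully predictable, but brittle: it only catches phrasings you anticipated, which is exactly the gap the LLM-based update closes.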

Tier 3: Observational Memory

The problem: Over weeks and months, raw message history becomes noisy and expensive. Working memory captures facts (weight: 85kg), but not patterns (this user skips leg day when stressed).

What it does: Compresses old conversations into dense notes. Think of it as the margins of a coach's notebook: observations built up from patterns across many sessions.

After training someone for three months, a good coach has noticed things: "responds better to accountability than encouragement", "form breaks down on heavy squats past rep 6", "tends to rush back from injuries". The user never stated these explicitly. The coach observed them.

How it works:

An Observer watches conversations and creates concise, timestamped notes when the message history gets large (typically at 30,000+ tokens). The compression is typically 5-40x.

A Reflector kicks in when observations themselves grow too large. It consolidates them, identifies patterns and produces higher-level insights.

The result is three layers:

  1. Recent messages: exact conversation history for the current task
  2. Observations: compressed notes from past conversations
  3. Reflections: condensed observations when memory grows too long

```typescript
const memory = new Memory({
  options: {
    observationalMemory: true,
  },
});
```

One line of config. The framework handles the rest.

An important claim from Mastra's docs: Observational memory "replaces both working memory and message history" with "greater accuracy and lower cost than semantic recall." That's a bold statement. The idea is that instead of managing three separate systems (raw history + structured profile + vector search), you get one unified compressed context. Whether that holds in practice is something I'm still evaluating.

When to add it: When conversations get long enough that the context window fills up. When you notice the agent losing track of important details from earlier in the conversation. When you want the agent to recognise patterns it wouldn't spot from a single session.

The concept is universal. Compressing old context to fit more into the window. LangChain has ConversationSummaryBufferMemory for a similar idea. But Mastra's specific implementation with dual background agents, async buffering and automatic reflection is framework-specific.

Tier 4: Semantic Recall

The problem: The user asks "what did we discuss about my knee injury?" and that conversation happened three weeks ago, well outside the 20-message window. Working memory only stores current state, not history.

What it does: Converts every message into a vector embedding (an array of numbers representing its meaning) and stores it in a vector database. When a new message comes in, it searches for semantically similar past messages and injects them into context.

A concrete example:

Three weeks ago:

User: "I tweaked my knee doing walking lunges"

Agent: "Let's swap lunges for step-ups until that settles"

The knee healed. Working memory was updated to "Injuries: none current."

Today:

User: "Should I do lunges again?"

Without semantic recall, the agent checks working memory (no current injuries) and says "sure, go for it."

With semantic recall, the agent's message gets vectorised and compared against all stored messages. It finds the conversation about the knee and lunges from three weeks ago. Now it says "last time lunges bothered your knee, so let's start light and see how it feels."
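Under the hood, "semantically similar" is a nearest-neighbour search over vectors. A toy in-memory version using cosine similarity (a real setup uses a vector database and ~1536-dimensional embeddings; these three-dimensional vectors are hand-made stand-ins):

```typescript
// Toy semantic search: cosine similarity over stored message vectors.
type Stored = { text: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k stored messages most similar to the query vector.
function topK(query: number[], store: Stored[], k: number): Stored[] {
  return [...store]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
}

// Stand-in vectors; a real embedder produces these from the text.
const store: Stored[] = [
  { text: "I tweaked my knee doing walking lunges", vector: [0.9, 0.1, 0.0] },
  { text: "What should I eat before training?", vector: [0.0, 0.2, 0.9] },
  { text: "Let's swap lunges for step-ups", vector: [0.8, 0.3, 0.1] },
];

const queryVector = [0.85, 0.2, 0.05]; // "Should I do lunges again?"
const hits = topK(queryVector, store, 2);
console.log(hits.map((h) => h.text)); // both lunge messages outrank the nutrition one
```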

How it works:

```
Storing (after every response):
  New messages → embedding API → vectors → stored in vector DB

Retrieving (before every response):
  User's message → embedding API → vector → search vector DB
  → top 3 similar messages + surrounding context → inject into LLM context
```

```typescript
const memory = new Memory({
  vector: new PgVector({ connectionString }),
  embedder: new ModelRouterEmbeddingModel("openai/text-embedding-3-small"),
  options: {
    semanticRecall: {
      topK: 3, // retrieve 3 most relevant past messages
      messageRange: 2, // include 2 messages of surrounding context
      scope: "resource", // search across ALL conversations for this user
    },
  },
});
```

Three parameters control the behaviour:

  • topK: how many matching messages to retrieve. Start with 3.
  • messageRange: how many surrounding messages to include with each match. Without this, you'd get isolated messages with no context. With messageRange: 2, a match at position 15 in a thread loads messages 13-17.
  • scope: "resource" searches across all the user's conversations (the default). "thread" searches only the current one.
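The messageRange expansion is just an index window around each hit. A small sketch (the helper name is mine, not Mastra's):

```typescript
// Expand a matched message index into its surrounding context window.
function contextWindow<T>(thread: T[], matchIndex: number, range: number): T[] {
  const start = Math.max(0, matchIndex - range);
  const end = Math.min(thread.length, matchIndex + range + 1);
  return thread.slice(start, end);
}

const thread = Array.from({ length: 30 }, (_, i) => `message ${i}`);

// A match at position 15 with messageRange: 2 loads messages 13-17.
const win = contextWindow(thread, 15, 2);
console.log(win); // ["message 13", "message 14", "message 15", "message 16", "message 17"]
```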

The cost: Every message gets an embedding API call (~$0.02 per million tokens for OpenAI's text-embedding-3-small). Every incoming message triggers a vector search (50-200ms latency). For a coaching bot where responses take a few seconds anyway, this is invisible. For real-time voice, you'd skip it.
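To make that cost concrete, a back-of-envelope estimate (the per-token price is OpenAI's listed rate for text-embedding-3-small; the message size and volume are my assumptions):

```typescript
// Back-of-envelope: embedding cost for a month of conversation.
const pricePerMillionTokens = 0.02; // USD, text-embedding-3-small
const tokensPerMessage = 50;        // assumption: a short chat message
const messagesPerMonth = 10_000;    // assumption: total across all users

const monthlyCostUSD =
  ((tokensPerMessage * messagesPerMonth) / 1_000_000) * pricePerMillionTokens;

console.log(monthlyCostUSD.toFixed(4)); // "0.0100": about one cent
```

Ten thousand messages a month costs roughly a cent to embed; the vector search latency, not the price, is the real constraint.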

When to add it: When users start asking about past conversations. When working memory's current-state snapshot isn't enough. When you have a knowledge base (research papers and guidelines) you want the agent to reference.

The concept is universal. This is standard RAG (Retrieval-Augmented Generation) applied to conversation history. Every framework has it: LangChain's vector stores, LlamaIndex and OpenAI's file_search.

How They Work Together

Each tier adds depth that the others can't provide:

User: "My shoulder's feeling better, can we add overhead press back?"

| Tier | What it contributes |
|---|---|
| Message History | The agent sees the recent conversation context |
| Working Memory | "Injuries: right shoulder impingement, cleared for light pressing 2 weeks ago" |
| Observational Memory | "This user tends to rush back from injuries. Err on conservative side" |
| Semantic Recall | Retrieves past messages about the shoulder: the initial injury, physio updates and progression |

Without observational memory, the agent might just say "sure, go for it." Without semantic recall, it can't reference the specific history of the injury. Each tier fills a gap the others miss.

When to Add Each Tier

You don't need all four on day one. Here's the order I'd recommend:

Start here:

  1. Message History, non-negotiable. Add on day one.
  2. Working Memory, add as soon as your agent needs to remember users across sessions. For most apps, that's also day one.

Add when you feel the pain:

  3. Semantic Recall, add when users ask about past conversations, or when you have a knowledge base the agent should reference.
  4. Observational Memory, add when conversations get long enough that context windows fill up, or when you want the agent to spot patterns across sessions.

The complexity and cost increase with each tier. Message history is near-free. Working memory adds one extra LLM call per message. Semantic recall adds embedding costs and a vector database. Observational memory adds background LLM processing.

Start simple. Add complexity only when you feel the pain of not having it.

Beyond Conversations: RAG for Knowledge Bases

One final note. Semantic recall (Tier 4) isn't just for searching past conversations. The same infrastructure (chunk content, embed it, store vectors, search on query) works for any knowledge base.

For my trainer, the future use case is embedding exercise science content: research papers, programme design guidelines and injury management protocols. When a user asks "is it safe to squat with a herniated disc?", the agent retrieves relevant passages from curated sources rather than relying on what the LLM happens to know from training.
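The ingestion half of that pipeline starts with chunking. A minimal fixed-size chunker with overlap (the sizes are arbitrary choices; production chunkers usually split on semantic boundaries like headings or paragraphs):

```typescript
// Split a document into overlapping fixed-size chunks ready for embedding.
function chunkText(text: string, size = 200, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

const doc = "x".repeat(500); // stand-in for a research paper or guideline
const chunks = chunkText(doc);
console.log(chunks.length); // 3 chunks: [0,200), [150,350), [300,500)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk; each chunk is then embedded and stored exactly like a conversation message.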

The concept is called RAG (Retrieval-Augmented Generation), and semantic recall over conversations is just one application of it. The pattern is the same regardless of what you're searching over.


This is part of a series on building production AI agents. Next up: a broader look at every building block of an agent app, from models and prompts to tools and workflows.

Want to go deeper?

I'm building a hands-on course teaching you how to work effectively with AI coding tools. From first prompt to autonomous workflows.

Preview the Course
