alirezan.dev

© 2026 AliReza Noori. All rights reserved.

6 April 2026 · 11 min read

Agent Memory: 4 Types and When to Use Each

Message history, working memory, observational memory and semantic recall. What they are, how they work together and when to add each to your AI agent. Real code from a production Mastra app.

Tags: ai · agents · memory · mastra · rag · working-memory

Your AI agent is stateless. Every API call starts from zero with no memory of who it's talking to, what was said before or what it learned last week. Everything that makes an agent feel intelligent across conversations is built on top of this blank slate.

I'm building a production AI coaching platform using Mastra (a TypeScript agent framework), Deno and AWS. The agent is a personal trainer that remembers your injuries, tracks your progress and adjusts its coaching style over time. None of that is possible without memory.

There are four types of agent memory. Each solves a different problem, and you don't need all of them on day one. Here's what they are, when to use each and how they work together.

The Four Tiers at a Glance

| Tier | What it stores | Lifespan | Added cost per message |
|---|---|---|---|
| Message History | Raw conversation | Current session | Near zero (DB read/write) |
| Working Memory | Structured user profile | Forever (per user) | One extra LLM call |
| Observational Memory | Compressed conversation notes | Forever (grows slowly) | Background LLM calls |
| Semantic Recall | Vector embeddings of messages | Forever (grows with usage) | Embedding API call + vector search |

Tier 1: Message History

The problem: Without this, the agent can't even remember what you said two messages ago.

What it does: Stores every message (user, assistant and tool calls) in a database. On each turn, loads the last N messages into the LLM's context window.

```typescript
const memory = new Memory({
  options: {
    lastMessages: 20,
  },
});
```

That's it. The agent now sees the last 20 messages from the current conversation.

When it resets: When a new session starts. In my app, I create a new conversation thread after 30 minutes of inactivity. The old messages stay in the database, but the agent only loads messages from the current thread.

This is intentional. You don't want a workout conversation from Tuesday polluting a nutrition question on Thursday.

When to add it: Always. Every agent needs this. It's the baseline.

The concept is universal. Every agent framework does this: OpenAI's Assistants API, LangChain's ConversationBufferMemory or a simple messages[] array you manage yourself.
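If you're managing that `messages[]` array yourself, the sliding-window behaviour is only a few lines. A minimal sketch (the `MessageHistory` class is illustrative, not an API from Mastra or LangChain):

```typescript
// Minimal message history: append every turn, load the last N into context.
type ChatMessage = { role: "user" | "assistant"; content: string };

class MessageHistory {
  private messages: ChatMessage[] = [];

  append(msg: ChatMessage): void {
    this.messages.push(msg);
  }

  // The slice that would be placed in the LLM's context window.
  lastN(n: number): ChatMessage[] {
    return this.messages.slice(-n);
  }
}

const history = new MessageHistory();
for (let i = 1; i <= 25; i++) {
  history.append({ role: i % 2 ? "user" : "assistant", content: `message ${i}` });
}

const ctx = history.lastN(20);
console.log(ctx.length);     // 20
console.log(ctx[0].content); // "message 6"
```

Persist `messages` to a database keyed by thread ID and you also get the session-reset behaviour described above: a new thread simply starts with an empty array.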

Tier 2: Working Memory

The problem: Message history resets between sessions. The agent forgets who you are.

What it does: Maintains a structured profile, a notepad the agent reads at the start of every conversation and updates after each response.

Here's my trainer's working memory template:

```markdown
# User Profile

- Name:
- Age:
- Weight:
- Training experience (years):

# Current Program

- Split type:
- Training days per week:
- Injuries or limitations:

# Progress Tracking

- Key lift numbers (squat/bench/deadlift):
- Recent trend (gaining/losing/maintaining):

# Coaching Notes

- What's working well:
- What needs adjustment:
- Cues or approaches that resonate with this user:
```

The agent fills these in as it learns about the user through conversation. When someone says "I'm 33 and I've been training for about 3 years", the agent updates the profile.

How it works under the hood: every turn triggers two LLM calls. One generates the response; a second reviews the exchange and updates the profile.

That second call is the trade-off. It costs tokens and adds latency, but the agent never forgets important information about the user.
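The loop itself is easy to sketch. The two functions below are stubs standing in for the two LLM calls (none of this is Mastra's API); the point is the sequencing: generate the reply first, then revise the profile from the exchange.

```typescript
// Two-call working-memory loop; stubs stand in for the real LLM calls.
type Turn = { user: string; assistant: string };

// Call 1 (stub): draft the reply with the current profile as context.
function draftReply(profile: string, userMsg: string): string {
  return `[reply drafted with ${profile.length} chars of profile context]`;
}

// Call 2 (stub): a second LLM pass folds new facts into the profile.
function reviseProfile(profile: string, turn: Turn): string {
  return `${profile}\n- Noted this turn: "${turn.user}"`;
}

let profile = "# User Profile";
const userMsg = "I'm 33 and I've been training for about 3 years";

const assistant = draftReply(profile, userMsg);                 // call 1
profile = reviseProfile(profile, { user: userMsg, assistant }); // call 2

console.log(profile.includes("training for about 3 years")); // true
```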

The key insight: scope.

Working memory can be scoped to the user or to the conversation:

  • User-scoped (what I use): The profile persists across all conversations. New session starts, message history resets, but the agent still knows your name, injuries and lift numbers. This is what you want for any agent that builds a relationship.

  • Conversation-scoped: The profile resets with each new thread. Useful for task-specific scratchpads. A meal planning agent might track the plan-in-progress per conversation without carrying it into the next one.

```typescript
const memory = new Memory({
  options: {
    lastMessages: 20,
    workingMemory: {
      enabled: true,
      scope: "resource", // persists across all conversations
      template, // the markdown template above
    },
  },
});
```

When to add it: When your agent needs to remember users across sessions. Names, preferences, goals and medical history. Anything where "forgetting" would break the experience.

The concept is universal. Every serious agent app implements persistent user profiles. Mastra automates it with a second LLM call. LangChain offers EntityMemory. OpenAI Assistants have no built-in equivalent, so you'd manage it yourself. Some apps skip the LLM and use structured extraction instead.
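As a sketch of that last variant, deterministic extraction can be as simple as a few regexes over the incoming message (the field names and patterns here are invented for illustration; real extractors are usually schema-driven):

```typescript
// Structured extraction without an LLM: pull profile fields with regexes.
type ProfileUpdate = { age?: number; trainingYears?: number };

function extractProfileFields(message: string): ProfileUpdate {
  const update: ProfileUpdate = {};

  const age = message.match(/\bI'?m (\d{1,3})\b/i);
  if (age) update.age = Number(age[1]);

  const years = message.match(/training for (?:about )?(\d+) years?/i);
  if (years) update.trainingYears = Number(years[1]);

  return update;
}

const update = extractProfileFields(
  "I'm 33 and I've been training for about 3 years"
);
console.log(update); // { age: 33, trainingYears: 3 }
```

Cheaper and fully predictable, but brittle: it only catches phrasings you anticipated, which is exactly the gap the LLM-based update closes.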

Tier 3: Observational Memory

The problem: Over weeks and months, raw message history becomes noisy and expensive. Working memory captures facts (weight: 85kg), but not patterns (this user skips leg day when stressed).

What it does: Compresses old conversations into dense notes. Think of it as the margins of a coach's notebook: observations built up from patterns across many sessions.

After training someone for three months, a good coach has noticed things: "responds better to accountability than encouragement", "form breaks down on heavy squats past rep 6", "tends to rush back from injuries". The user never stated these explicitly. The coach observed them.

How it works:

An Observer watches conversations and creates concise, timestamped notes when the message history gets large (typically at 30,000+ tokens). The compression is typically 5-40x.

A Reflector kicks in when observations themselves grow too large. It consolidates them, identifies patterns and produces higher-level insights.

The result is three layers:

  1. Recent messages: exact conversation history for the current task
  2. Observations: compressed notes from past conversations
  3. Reflections: condensed observations when memory grows too long

```typescript
const memory = new Memory({
  options: {
    observationalMemory: true,
  },
});
```

One line of config. The framework handles the rest.

An important claim from Mastra's docs: Observational memory "replaces both working memory and message history" with "greater accuracy and lower cost than semantic recall." That's a bold statement. The idea is that instead of managing three separate systems (raw history + structured profile + vector search), you get one unified compressed context. Whether that holds in practice is something I'm still evaluating.

When to add it: When conversations get long enough that the context window fills up. When you notice the agent losing track of important details from earlier in the conversation. When you want the agent to recognise patterns it wouldn't spot from a single session.

The concept is universal. Compressing old context to fit more into the window. LangChain has ConversationSummaryBufferMemory for a similar idea. But Mastra's specific implementation with dual background agents, async buffering and automatic reflection is framework-specific.

Tier 4: Semantic Recall

The problem: The user asks "what did we discuss about my knee injury?" and that conversation happened three weeks ago, well outside the 20-message window. Working memory only stores current state, not history.

What it does: Converts every message into a vector embedding (an array of numbers representing its meaning) and stores it in a vector database. When a new message comes in, it searches for semantically similar past messages and injects them into context.

A concrete example:

Three weeks ago:

User: "I tweaked my knee doing walking lunges"

Agent: "Let's swap lunges for step-ups until that settles"

The knee healed. Working memory was updated to "Injuries: none current."

Today:

User: "Should I do lunges again?"

Without semantic recall, the agent checks working memory (no current injuries) and says "sure, go for it."

With semantic recall, the agent's message gets vectorised and compared against all stored messages. It finds the conversation about the knee and lunges from three weeks ago. Now it says "last time lunges bothered your knee, so let's start light and see how it feels."
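Under the hood, "semantically similar" is a nearest-neighbour search over vectors. A toy in-memory version using cosine similarity (a real setup uses a vector database and ~1536-dimensional embeddings; these three-dimensional vectors are hand-made stand-ins):

```typescript
// Toy semantic search: cosine similarity over stored message vectors.
type Stored = { text: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k stored messages most similar to the query vector.
function topK(query: number[], store: Stored[], k: number): Stored[] {
  return [...store]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
}

// Stand-in vectors; a real embedder produces these from the text.
const store: Stored[] = [
  { text: "I tweaked my knee doing walking lunges", vector: [0.9, 0.1, 0.0] },
  { text: "What should I eat before training?", vector: [0.0, 0.2, 0.9] },
  { text: "Let's swap lunges for step-ups", vector: [0.8, 0.3, 0.1] },
];

const queryVector = [0.85, 0.2, 0.05]; // "Should I do lunges again?"
const hits = topK(queryVector, store, 2);
console.log(hits.map((h) => h.text)); // both lunge messages outrank the nutrition one
```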

How it works:

```
Storing (after every response):
  New messages → embedding API → vectors → stored in vector DB

Retrieving (before every response):
  User's message → embedding API → vector → search vector DB
  → top 3 similar messages + surrounding context → inject into LLM context
```

```typescript
const memory = new Memory({
  vector: new PgVector({ connectionString }),
  embedder: new ModelRouterEmbeddingModel("openai/text-embedding-3-small"),
  options: {
    semanticRecall: {
      topK: 3, // retrieve 3 most relevant past messages
      messageRange: 2, // include 2 messages of surrounding context
      scope: "resource", // search across ALL conversations for this user
    },
  },
});
```

Three parameters control the behaviour:

  • topK: how many matching messages to retrieve. Start with 3.
  • messageRange: how many surrounding messages to include with each match. Without this, you'd get isolated messages with no context. With messageRange: 2, a match at position 15 in a thread loads messages 13-17.
  • scope: "resource" searches across all the user's conversations (the default). "thread" searches only the current one.
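The messageRange expansion is just an index window around each hit. A small sketch (the helper name is mine, not Mastra's):

```typescript
// Expand a matched message index into its surrounding context window.
function contextWindow<T>(thread: T[], matchIndex: number, range: number): T[] {
  const start = Math.max(0, matchIndex - range);
  const end = Math.min(thread.length, matchIndex + range + 1);
  return thread.slice(start, end);
}

const thread = Array.from({ length: 30 }, (_, i) => `message ${i}`);

// A match at position 15 with messageRange: 2 loads messages 13-17.
const win = contextWindow(thread, 15, 2);
console.log(win); // ["message 13", "message 14", "message 15", "message 16", "message 17"]
```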

The cost: Every message gets an embedding API call (~$0.02 per million tokens for OpenAI's text-embedding-3-small). Every incoming message triggers a vector search (50-200ms latency). For a coaching bot where responses take a few seconds anyway, this is invisible. For real-time voice, you'd skip it.
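To make that cost concrete, a back-of-envelope estimate (the per-token price is OpenAI's listed rate for text-embedding-3-small; the message size and volume are my assumptions):

```typescript
// Back-of-envelope: embedding cost for a month of conversation.
const pricePerMillionTokens = 0.02; // USD, text-embedding-3-small
const tokensPerMessage = 50;        // assumption: a short chat message
const messagesPerMonth = 10_000;    // assumption: total across all users

const monthlyCostUSD =
  ((tokensPerMessage * messagesPerMonth) / 1_000_000) * pricePerMillionTokens;

console.log(monthlyCostUSD.toFixed(4)); // "0.0100": about one cent
```

Ten thousand messages a month costs roughly a cent to embed; the vector search latency, not the price, is the real constraint.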

When to add it: When users start asking about past conversations. When working memory's current-state snapshot isn't enough. When you have a knowledge base (research papers and guidelines) you want the agent to reference.

The concept is universal. This is standard RAG (Retrieval-Augmented Generation) applied to conversation history. Every framework has it: LangChain's vector stores, LlamaIndex and OpenAI's file_search.

How They Work Together

Each tier adds depth that the others can't provide:

User: "My shoulder's feeling better, can we add overhead press back?"

| Tier | What it contributes |
|---|---|
| Message History | The agent sees the recent conversation context |
| Working Memory | "Injuries: right shoulder impingement, cleared for light pressing 2 weeks ago" |
| Observational Memory | "This user tends to rush back from injuries. Err on conservative side" |
| Semantic Recall | Retrieves past messages about the shoulder: the initial injury, physio updates and progression |

Without observational memory, the agent might just say "sure, go for it." Without semantic recall, it can't reference the specific history of the injury. Each tier fills a gap the others miss.

When to Add Each Tier

You don't need all four on day one. Here's the order I'd recommend:

Start here:

  1. Message History, non-negotiable. Add on day one.
  2. Working Memory, add as soon as your agent needs to remember users across sessions. For most apps, that's also day one.

Add when you feel the pain:

  3. Semantic Recall, add when users ask about past conversations, or when you have a knowledge base the agent should reference.
  4. Observational Memory, add when conversations get long enough that context windows fill up, or when you want the agent to spot patterns across sessions.

The complexity and cost increase with each tier. Message history is near-free. Working memory adds one extra LLM call per message. Semantic recall adds embedding costs and a vector database. Observational memory adds background LLM processing.

Start simple. Add complexity only when you feel the pain of not having it.

Beyond Conversations: RAG for Knowledge Bases

One final note. Semantic recall (Tier 4) isn't just for searching past conversations. The same infrastructure (chunk content, embed it, store vectors, search on query) works for any knowledge base.

For my trainer, the future use case is embedding exercise science content: research papers, programme design guidelines and injury management protocols. When a user asks "is it safe to squat with a herniated disc?", the agent retrieves relevant passages from curated sources rather than relying on what the LLM happens to know from training.
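The ingestion half of that pipeline starts with chunking. A minimal fixed-size chunker with overlap (the sizes are arbitrary choices; production chunkers usually split on semantic boundaries like headings or paragraphs):

```typescript
// Split a document into overlapping fixed-size chunks ready for embedding.
function chunkText(text: string, size = 200, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

const doc = "x".repeat(500); // stand-in for a research paper or guideline
const chunks = chunkText(doc);
console.log(chunks.length); // 3 chunks: [0,200), [150,350), [300,500)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk; each chunk is then embedded and stored exactly like a conversation message.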

The concept is called RAG (Retrieval-Augmented Generation), and semantic recall over conversations is just one application of it. The pattern is the same regardless of what you're searching over.


This is part of a series on building production AI agents. Next up: a broader look at every building block of an agent app, from models and prompts to tools and workflows.

Want to go deeper?

I'm building a hands-on course teaching you how to work effectively with AI coding tools. From first prompt to autonomous workflows.

Preview the Course
