Every skilled human conversation partner does something remarkable: they remember what was said five minutes ago, an hour ago, and sometimes years ago — and they know which memories actually matter right now. Building AI agents that can do the same is one of the most practical and rewarding challenges in agent engineering today.
If you’ve ever watched an AI assistant confidently forget everything a user said three exchanges back, you know exactly why this matters. And if you’ve ever tried to fix it by just stuffing the entire conversation history into the context window, you’ve probably met its nemesis: the token budget.
In this guide, we’ll walk through the core strategies for implementing multi-turn conversation memory in AI agents — from simple sliding windows to sophisticated hybrid retrieval systems — so your agents can hold intelligent, coherent long-form conversations without running out of tokens or budget.
## Why Multi-Turn Memory Is Non-Negotiable for Real-World Agents
Single-turn AI interactions are straightforward: user sends a message, agent replies, done. But production agents — customer support bots, coding assistants, tutors, research agents — operate across many turns, often over minutes or hours. Without memory, every turn feels like meeting the user for the first time.
Consider a tutoring agent helping a student learn Python. By turn 15, the student has revealed that they’re a visual learner, they struggle with recursion, and they’re preparing for a job interview next week. An agent with no memory treats turn 16 as a cold start. An agent with good memory uses all of that context to give a precisely tailored response.
The challenge: modern LLMs have finite context windows (even 128K-token models have limits), and large context windows cost more to fill. Naively appending every message to the prompt is a recipe for slow, expensive, eventually-broken agents.
The solution: strategic memory management — deciding what to keep, what to compress, and what to retrieve.
## The Four Core Memory Strategies
Before writing a single line of code, you need to understand the four primary strategies. Most production agents combine two or more.
### 1. Sliding Window (Recency Buffer)

The simplest approach: keep only the last N turns of conversation in the context.

**How it works:** You maintain a list of message objects. When the list exceeds your window size (say, 20 turns), you drop the oldest messages.

**Best for:** Conversational agents where recent context is all that matters — casual chat, simple Q&A bots.

**Limitation:** The agent has no memory of anything outside the window. If the user mentioned their name in turn 1 and you’re now on turn 25, that’s gone.
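Here is a minimal sketch of the idea, assuming Anthropic-style message dicts. A `collections.deque` with `maxlen` handles eviction automatically; the `SlidingWindowMemory` name and sizes are illustrative, not from any particular framework:

```python
from collections import deque

class SlidingWindowMemory:
    def __init__(self, window_size: int = 20):
        # Once full, appending silently drops the oldest message
        self.turns = deque(maxlen=window_size)

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})

    def context(self) -> list[dict]:
        return list(self.turns)

memory = SlidingWindowMemory(window_size=4)
for i in range(6):
    memory.add("user", f"message {i}")
# Only the 4 most recent messages survive; messages 0 and 1 are gone
```

Note that `window_size` here counts individual messages, not user/assistant pairs, so you would typically set it to twice your desired turn count.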
### 2. Hierarchical Summarization

Instead of dropping old turns, you summarize them into a compact representation and keep that summary in the prompt alongside recent turns.

**How it works:** Every N turns (or when you approach a token threshold), you call the LLM to summarize the conversation so far into a tight paragraph or structured JSON block. That summary replaces the raw history.

**Best for:** Long conversations where high-level context matters — therapy bots, extended tutoring sessions, project planning agents.

**Limitation:** Summarization loses detail. If a user gave a precise number or a nuanced statement in turn 8, the summary might smooth it over.
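The mechanics can be sketched without committing to a model provider. In this sketch, `summarize_fn` is a hypothetical callback standing in for the actual LLM summarization call:

```python
def compress_history(turns: list[dict], summary: str,
                     threshold: int, summarize_fn) -> tuple[list[dict], str]:
    """When the raw history exceeds `threshold` messages, fold the oldest
    half into the running summary and keep the recent half raw."""
    if len(turns) <= threshold:
        return turns, summary
    half = len(turns) // 2
    old_text = "\n".join(f"{t['role']}: {t['content']}" for t in turns[:half])
    # In production, summarize_fn would be an LLM call that merges the
    # prior summary with the old excerpt
    new_summary = summarize_fn(summary, old_text)
    return turns[half:], new_summary

# Demo with a trivial stand-in summarizer
turns = [{"role": "user", "content": f"msg {i}"} for i in range(8)]
turns, summary = compress_history(
    turns, "", threshold=6,
    summarize_fn=lambda s, t: f"{s} [summarized {t.count('msg')} msgs]".strip(),
)
```

The key design point is that the summarizer always receives the *prior* summary too, so context rolls forward instead of resetting each time.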
### 3. Vector-Based Retrieval (RAG for Conversation History)

Store all conversation turns as embeddings in a vector database. At each new turn, retrieve the most semantically relevant past turns and inject them into the prompt.

**How it works:** Every message is embedded and stored. When the agent receives a new message, you embed it and run a similarity search to find the top-K most relevant prior exchanges. Those retrieved exchanges go into the context alongside recent turns.

**Best for:** Agents with very long conversation histories (days or weeks), or where specific past facts need to be recalled on demand.

**Limitation:** Retrieval adds latency and infrastructure complexity. You also need to design your retrieval query carefully — the most semantically similar turn isn’t always the most useful one.
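To make the shape concrete without standing up a vector database, here is a toy sketch. The bag-of-words `embed` function is a deliberately crude stand-in for a real embedding model, but the embed-then-retrieve-top-K flow is the same:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real system would call an
    # embedding model and store dense vectors in a vector database.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, history: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    scored = sorted(history, key=lambda turn: cosine(q, embed(turn)), reverse=True)
    return scored[:k]

history = [
    "My name is Alex and I live in Berlin",
    "I prefer short answers",
    "Let's talk about recursion in Python",
]
top = retrieve("what city does the user live in", history, k=1)
# → the Berlin turn scores highest on word overlap
```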
### 4. Structured State / Entity Memory

Rather than preserving raw turns, you maintain a structured record of key facts extracted from the conversation — user name, preferences, stated goals, known constraints.

**How it works:** After each turn (or batch of turns), an extraction step updates a structured object (JSON, a database row, etc.) with facts. The agent reads this “state” at the start of each turn.

**Best for:** Goal-oriented agents where specific facts (user profile, task state) matter more than conversational flow.

**Limitation:** You have to design the schema. If a fact doesn’t fit your schema, it gets dropped.
## Building a Hybrid Memory Architecture: Step-by-Step
In practice, the most robust agents combine strategies. Here’s a concrete architecture — and the code to build it — using a sliding window for recency, rolling summarization for medium-term context, and structured entity extraction for persistent facts.
### Step 1: Define Your Memory Object
Start by modeling what memory looks like:
```python
from dataclasses import dataclass, field

@dataclass
class ConversationMemory:
    # Structured entity state — persists indefinitely
    entities: dict = field(default_factory=dict)

    # Rolling summary — updated periodically
    summary: str = ""

    # Recent turns — sliding window
    recent_turns: list = field(default_factory=list)

    # Configuration
    window_size: int = 10          # Max recent turns to keep raw
    summary_trigger: int = 8       # Summarize when window hits this many turns
    max_summary_tokens: int = 300  # Target summary length
```
### Step 2: Build the Context Assembler
This function takes the memory object and assembles the prompt context:
```python
def assemble_context(memory: ConversationMemory) -> list[dict]:
    messages = []

    # Layer 1: System prompt with entity state
    system_content = "You are a helpful assistant."
    if memory.entities:
        entity_str = "\n".join(f"- {k}: {v}" for k, v in memory.entities.items())
        system_content += f"\n\nKnown facts about the user:\n{entity_str}"
    messages.append({"role": "system", "content": system_content})

    # Layer 2: Rolling summary as a synthetic message
    if memory.summary:
        messages.append({
            "role": "system",
            "content": f"[Earlier conversation summary]: {memory.summary}"
        })

    # Layer 3: Recent raw turns
    messages.extend(memory.recent_turns)
    return messages
```
### Step 3: Implement the Update Pipeline
After each turn, update the memory:
```python
import json

import anthropic

client = anthropic.Anthropic()

def update_memory(memory: ConversationMemory, user_msg: str, assistant_msg: str):
    # Add new turns to recent window
    memory.recent_turns.append({"role": "user", "content": user_msg})
    memory.recent_turns.append({"role": "assistant", "content": assistant_msg})

    # Extract entities from latest exchange
    _extract_entities(memory, user_msg, assistant_msg)

    # Summarize if window is getting full (two messages per turn)
    if len(memory.recent_turns) >= memory.summary_trigger * 2:
        _roll_summary(memory)

def _roll_summary(memory: ConversationMemory):
    # Take the oldest half of the window and summarize it
    half = len(memory.recent_turns) // 2
    turns_to_summarize = memory.recent_turns[:half]
    memory.recent_turns = memory.recent_turns[half:]

    # Build a summarization prompt
    turns_text = "\n".join(
        f"{t['role'].upper()}: {t['content']}" for t in turns_to_summarize
    )
    prior_summary = f"Prior summary: {memory.summary}\n\n" if memory.summary else ""

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=memory.max_summary_tokens,
        messages=[{
            "role": "user",
            "content": (
                f"{prior_summary}"
                f"Summarize the following conversation excerpt in 2-3 sentences, "
                f"preserving key facts, decisions, and user preferences:\n\n{turns_text}"
            )
        }]
    )
    memory.summary = response.content[0].text

def _extract_entities(memory: ConversationMemory, user_msg: str, assistant_msg: str):
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use a fast/cheap model for extraction
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                f"Extract any concrete facts about the user from this exchange. "
                f"Return JSON only (e.g. {{\"name\": \"Alice\", \"goal\": \"learn Python\"}}). "
                f"Return {{}} if nothing new.\n\n"
                f"User: {user_msg}\nAssistant: {assistant_msg}"
            )
        }]
    )
    try:
        extracted = json.loads(response.content[0].text.strip())
        if isinstance(extracted, dict):  # Guard against non-object JSON
            memory.entities.update(extracted)
    except json.JSONDecodeError:
        pass  # Gracefully skip malformed extractions
```
### Step 4: Wire It Into Your Agent Loop
```python
def run_agent(memory: ConversationMemory, user_input: str) -> str:
    # Assemble context from memory layers
    messages = assemble_context(memory)

    # Append the new user message
    messages.append({"role": "user", "content": user_input})

    # The Anthropic Messages API takes system content via the `system`
    # parameter; role="system" entries are not accepted in `messages`
    system_text = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    chat_messages = [m for m in messages if m["role"] != "system"]

    # Call the model
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_text,
        messages=chat_messages
    )
    assistant_reply = response.content[0].text

    # Update memory with this exchange
    update_memory(memory, user_input, assistant_reply)
    return assistant_reply

# Example usage
memory = ConversationMemory()
reply1 = run_agent(memory, "Hi! I'm Alex and I'm learning Python for a data science interview.")
reply2 = run_agent(memory, "I keep getting confused by list comprehensions.")
reply3 = run_agent(memory, "Can you give me a practice problem?")

# By turn 3, the agent knows: the user is Alex, his goal is a data science
# interview, and he struggles with list comprehensions
```
## Token Budget Management: Staying Under Control
Even with smart memory strategies, you need to actively track token usage. Here’s a lightweight budget monitor you can drop into any agent:
```python
def estimate_tokens(messages: list[dict]) -> int:
    # Rough estimate: ~4 characters per token
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return total_chars // 4

def enforce_budget(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    while estimate_tokens(messages) > max_tokens and len(messages) > 2:
        # Drop the oldest user/assistant pair (skipping system messages)
        for i, msg in enumerate(messages):
            if msg["role"] == "user":
                if i + 1 < len(messages) and messages[i + 1]["role"] == "assistant":
                    messages.pop(i + 1)  # Remove the assistant response
                messages.pop(i)  # Remove the user message
                break
        else:
            break  # No droppable user message left — avoid an infinite loop
    return messages
```
For production systems, use the `usage` field returned by the Anthropic API to get exact token counts rather than estimates.
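A minimal tracker built on that field might look like this. The demo uses a stand-in object shaped like a Messages API response (`response.usage.input_tokens` / `output_tokens`), since no live call is made here:

```python
from types import SimpleNamespace

def track_usage(response, totals: dict) -> dict:
    # `response.usage` carries exact token counts from the Messages API
    totals["input_tokens"] = totals.get("input_tokens", 0) + response.usage.input_tokens
    totals["output_tokens"] = totals.get("output_tokens", 0) + response.usage.output_tokens
    return totals

# Demo with a stand-in object shaped like an API response
fake = SimpleNamespace(usage=SimpleNamespace(input_tokens=1200, output_tokens=350))
totals = track_usage(fake, {})
# → {"input_tokens": 1200, "output_tokens": 350}
```

Accumulating these per session lets you alert or summarize early, before a conversation blows past its budget.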
## Common Mistakes (and How to Avoid Them)
### Mistake 1: Summarizing Too Aggressively

If your summaries are too compressed, the agent loses nuance. A user’s offhand comment about hating verbose explanations is exactly the kind of detail a heavy summarizer drops — and it’s exactly the kind of detail that should inform every subsequent response.

**Fix:** Tune your `max_summary_tokens` upward and prompt the summarizer to explicitly preserve user preferences and stated constraints.
### Mistake 2: Using the Same Model for Memory Operations

Running expensive models (like Claude Opus) for entity extraction and summarization on every turn is a budget killer.

**Fix:** Use a fast, cheap model (Claude Haiku) for memory operations. Reserve your powerful model for the actual agent response.
### Mistake 3: Storing Everything as Raw Text

Embedding entire conversation turns for vector retrieval is noisy. Turn 7 might say “yeah” in response to a yes/no question — that’s not worth embedding and retrieving.

**Fix:** Pre-filter before embedding. Only embed turns that contain substantive information (longer than N characters, or flagged by an extraction step).
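A pre-filter can be as simple as the heuristic below; the character threshold and filler list are illustrative values you would tune for your own traffic:

```python
def is_substantive(turn: dict, min_chars: int = 40) -> bool:
    """Heuristic pre-filter: skip short acknowledgements before embedding."""
    content = turn.get("content", "").strip()
    # Bare acknowledgements carry no retrievable information
    fillers = {"yes", "no", "ok", "okay", "yeah", "thanks", "sure"}
    if content.lower().rstrip(".!") in fillers:
        return False
    return len(content) >= min_chars
```

Anything that fails the check still stays in the raw sliding window; it just never enters the vector store.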
### Mistake 4: Ignoring Memory Staleness

If an agent runs across multiple sessions (days or weeks), old entity extractions can become stale. A user’s goal, location, or knowledge level can change.

**Fix:** Add timestamps to entity extractions and build a staleness check into your context assembler. For facts older than a threshold, either drop them or flag them as potentially outdated in the prompt.
## Real-World Example: A Coding Tutor Agent
Let’s see how this plays out in a concrete scenario. Imagine a Python tutoring agent running on harnessengineering.academy:
- **Turn 1:** “I’m Maya. I’m new to Python and I learn best by doing, not reading.”
- **Entity extracted:** `{"name": "Maya", "learning_style": "hands-on", "level": "beginner"}`
- **Turns 2-9:** Maya works through variables, functions, and loops with the agent.
- **Summary generated:** “Maya is a beginner Python learner who prefers hands-on exercises. She has completed variables, functions, and loops, showing confidence with basic syntax but needing extra practice with loop logic.”
- **Turn 10:** “Can we work on something harder?”
- **Context assembled:** the system prompt includes Maya’s entity profile, plus the rolling summary, plus recent turns.
- **Agent responds appropriately:** “Great, Maya! Since you’ve got loops down, let’s tackle list comprehensions with some hands-on exercises…”
Without memory, turn 10 would require Maya to re-introduce herself and recap everything. With this architecture, the agent continues seamlessly — exactly like a skilled human tutor would.
## Choosing the Right Strategy for Your Use Case
| Use Case | Recommended Strategy |
|---|---|
| Simple chatbot, short sessions | Sliding window only |
| Customer support, medium sessions | Sliding window + summarization |
| Personal assistant, long sessions | Summarization + entity memory |
| Research agent, multi-day sessions | Vector retrieval + entity memory |
| Complex goal-oriented agent | Full hybrid (all four strategies) |
## What to Build Next
Once your multi-turn memory is working, the natural next step is cross-session persistence — storing memory to a database so the agent picks up where it left off, even after the user closes the browser and comes back the next day. That typically means:
- Serializing your `ConversationMemory` object to JSON and writing it to a datastore (PostgreSQL, Redis, DynamoDB).
- Loading it at the start of each session using a user/session ID.
- Adding a decay function to gradually deprioritize very old facts.
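As a sketch of the first two bullets: a minimal stand-in dataclass (mirroring Step 1’s `ConversationMemory`, minus the config fields) keeps the example self-contained. A real deployment would key the serialized blob by user or session ID in your datastore:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ConversationMemory:  # minimal mirror of the Step 1 definition
    entities: dict = field(default_factory=dict)
    summary: str = ""
    recent_turns: list = field(default_factory=list)

def save_memory(memory: ConversationMemory) -> str:
    # In production this string would go to PostgreSQL/Redis/DynamoDB
    return json.dumps(asdict(memory))

def load_memory(blob: str) -> ConversationMemory:
    return ConversationMemory(**json.loads(blob))

m = ConversationMemory(entities={"name": "Maya"}, summary="Beginner learner.")
restored = load_memory(save_memory(m))
# → restored is an equal copy, ready for the next session
```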
That’s the subject of our next tutorial — so bookmark this page and keep an eye on the Harness Engineering Academy feed.
## Ready to Level Up?
Multi-turn conversation memory is one of the core competencies tested in the Harness AI Agent Engineering Certification. If you want to prove your skills to employers and build a portfolio of real agent systems, our certification program walks you through memory management, tool use, multi-agent orchestration, and production deployment — with hands-on projects at every stage.
Explore the Certification Program →
And if you found this tutorial helpful, share it with a fellow learner. The AI agent engineering community grows stronger when we learn together.
Jamie Park is an educator and career coach at Harness Engineering Academy, specializing in LLM engineering and AI agent systems. She writes beginner-to-advanced tutorials focused on practical, production-ready techniques.