Implementing Multi-Turn Conversation Memory in AI Agents: Building Long-Context Awareness without Breaking Token Budgets

Every skilled human conversation partner does something remarkable: they remember what was said five minutes ago, an hour ago, and sometimes years ago — and they know which memories actually matter right now. Building AI agents that can do the same is one of the most practical and rewarding challenges in agent engineering today.

If you’ve ever watched an AI assistant confidently forget everything a user said three exchanges back, you know exactly why this matters. And if you’ve ever tried to fix it by just stuffing the entire conversation history into the context window, you’ve probably met its nemesis: the token budget.

In this guide, we’ll walk through the core strategies for implementing multi-turn conversation memory in AI agents — from simple sliding windows to sophisticated hybrid retrieval systems — so your agents can hold intelligent, coherent long-form conversations without running out of tokens or budget.


Why Multi-Turn Memory Is Non-Negotiable for Real-World Agents

Single-turn AI interactions are straightforward: user sends a message, agent replies, done. But production agents — customer support bots, coding assistants, tutors, research agents — operate across many turns, often over minutes or hours. Without memory, every turn feels like meeting the user for the first time.

Consider a tutoring agent helping a student learn Python. By turn 15, the student has revealed that they’re a visual learner, they struggle with recursion, and they’re preparing for a job interview next week. An agent with no memory treats turn 16 as a cold start. An agent with good memory uses all of that context to give a precisely tailored response.

The challenge: modern LLMs have finite context windows (even 128K-token models have limits), and large context windows cost more to fill. Naively appending every message to the prompt is a recipe for slow, expensive, eventually-broken agents.

The solution: strategic memory management — deciding what to keep, what to compress, and what to retrieve.


The Four Core Memory Strategies

Before writing a single line of code, you need to understand the four primary strategies. Most production agents combine two or more.

1. Sliding Window (Recency Buffer)

The simplest approach: keep only the last N turns of conversation in the context.

How it works: You maintain a list of message objects. When the list exceeds your window size (say, 20 turns), you drop the oldest messages.

Best for: Conversational agents where recent context is all that matters — casual chat, simple Q&A bots.

Limitation: The agent has no memory of anything outside the window. If the user mentioned their name in turn 1 and you’re now on turn 25, that’s gone.
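A minimal sketch of a sliding window uses a deque with a fixed maxlen so old messages fall off automatically (the class and parameter names here are illustrative, not a standard API):

```python
from collections import deque

class SlidingWindowMemory:
    """Keep only the most recent turns; anything older falls off the window."""

    def __init__(self, window_size: int = 20):
        # Two messages (user + assistant) per turn
        self.messages = deque(maxlen=window_size * 2)

    def add_turn(self, user_msg: str, assistant_msg: str) -> None:
        self.messages.append({"role": "user", "content": user_msg})
        self.messages.append({"role": "assistant", "content": assistant_msg})

    def context(self) -> list[dict]:
        return list(self.messages)
```

Because `deque(maxlen=...)` evicts from the left on append, there is no manual trimming logic to get wrong.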

2. Hierarchical Summarization

Instead of dropping old turns, you summarize them into a compact representation and keep that summary in the prompt alongside recent turns.

How it works: Every N turns (or when you approach a token threshold), you call the LLM to summarize the conversation so far into a tight paragraph or structured JSON block. That summary replaces the raw history.

Best for: Long conversations where high-level context matters — therapy bots, extended tutoring sessions, project planning agents.

Limitation: Summarization loses detail. If a user gave a precise number or a nuanced statement in turn 8, the summary might smooth it over.

3. Vector-Based Retrieval (RAG for Conversation History)

Store all conversation turns as embeddings in a vector database. At each new turn, retrieve the most semantically relevant past turns and inject them into the prompt.

How it works: Every message is embedded and stored. When the agent receives a new message, you embed it and run a similarity search to find the top-K most relevant prior exchanges. Those retrieved exchanges go into the context alongside recent turns.

Best for: Agents with very long conversation histories (days or weeks), or where specific past facts need to be recalled on demand.

Limitation: Retrieval adds latency and infrastructure complexity. You also need to design your retrieval query carefully — the most semantically similar turn isn’t always the most useful one.
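To make the mechanics concrete, here is a toy sketch of the store-and-retrieve loop. It substitutes a bag-of-words counter for a real embedding model and a plain list for a vector database, so it is only the shape of the flow; the `embed` and `TurnStore` names are illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TurnStore:
    """Stores every turn; retrieves the top-K most similar to a query."""

    def __init__(self):
        self.turns: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.turns.append((embed(text), text))

    def top_k(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.turns, key=lambda t: cosine(q, t[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

In production you would swap `embed` for a real embedding call and `TurnStore` for a vector database, but the add-on-every-turn, search-on-every-query pattern is the same.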

4. Structured State / Entity Memory

Rather than preserving raw turns, you maintain a structured record of key facts extracted from the conversation — user name, preferences, stated goals, known constraints.

How it works: After each turn (or batch of turns), an extraction step updates a structured object (JSON, a database row, etc.) with facts. The agent reads this “state” at the start of each turn.

Best for: Goal-oriented agents where specific facts (user profile, task state) matter more than conversational flow.

Limitation: You have to design the schema. If a fact doesn’t fit your schema, it gets dropped.


Building a Hybrid Memory Architecture: Step-by-Step

In practice, the most robust agents combine strategies. Here’s a concrete architecture — and the code to build it — using a sliding window for recency, rolling summarization for medium-term context, and structured entity extraction for persistent facts.

Step 1: Define Your Memory Object

Start by modeling what memory looks like:

from dataclasses import dataclass, field

@dataclass
class ConversationMemory:
    # Structured entity state — persists indefinitely
    entities: dict = field(default_factory=dict)

    # Rolling summary — updated periodically
    summary: str = ""

    # Recent turns — sliding window
    recent_turns: list = field(default_factory=list)

    # Configuration
    window_size: int = 10          # Max recent turns to keep raw
    summary_trigger: int = 8       # Summarize when window hits this
    max_summary_tokens: int = 300  # Target summary length

Step 2: Build the Context Assembler

This function takes the memory object and assembles the prompt context:

def assemble_context(memory: ConversationMemory) -> list[dict]:
    messages = []

    # Layer 1: System prompt with entity state
    system_content = "You are a helpful assistant."
    if memory.entities:
        entity_str = "\n".join(f"- {k}: {v}" for k, v in memory.entities.items())
        system_content += f"\n\nKnown facts about the user:\n{entity_str}"

    messages.append({"role": "system", "content": system_content})

    # Layer 2: Rolling summary as a synthetic message
    if memory.summary:
        messages.append({
            "role": "system",
            "content": f"[Earlier conversation summary]: {memory.summary}"
        })

    # Layer 3: Recent raw turns
    messages.extend(memory.recent_turns)

    return messages

Step 3: Implement the Update Pipeline

After each turn, update the memory:

import anthropic

client = anthropic.Anthropic()

def update_memory(memory: ConversationMemory, user_msg: str, assistant_msg: str):
    # Add new turns to recent window
    memory.recent_turns.append({"role": "user", "content": user_msg})
    memory.recent_turns.append({"role": "assistant", "content": assistant_msg})

    # Extract entities from latest exchange
    _extract_entities(memory, user_msg, assistant_msg)

    # Summarize if window is getting full (times 2: each turn adds two messages)
    if len(memory.recent_turns) >= memory.summary_trigger * 2:
        _roll_summary(memory)

def _roll_summary(memory: ConversationMemory):
    # Take the oldest half of the window and summarize it
    half = len(memory.recent_turns) // 2
    turns_to_summarize = memory.recent_turns[:half]
    memory.recent_turns = memory.recent_turns[half:]

    # Build a summarization prompt
    turns_text = "\n".join(
        f"{t['role'].upper()}: {t['content']}" for t in turns_to_summarize
    )
    prior_summary = f"Prior summary: {memory.summary}\n\n" if memory.summary else ""

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=memory.max_summary_tokens,
        messages=[{
            "role": "user",
            "content": (
                f"{prior_summary}"
                f"Summarize the following conversation excerpt in 2-3 sentences, "
                f"preserving key facts, decisions, and user preferences:\n\n{turns_text}"
            )
        }]
    )
    memory.summary = response.content[0].text

def _extract_entities(memory: ConversationMemory, user_msg: str, assistant_msg: str):
    import json

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use a fast/cheap model for extraction
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                f"Extract any concrete facts about the user from this exchange. "
                f"Return JSON only (e.g. {{\"name\": \"Alice\", \"goal\": \"learn Python\"}}). "
                f"Return {{}} if nothing new.\n\n"
                f"User: {user_msg}\nAssistant: {assistant_msg}"
            )
        }]
    )
    try:
        extracted = json.loads(response.content[0].text.strip())
        if isinstance(extracted, dict):
            memory.entities.update(extracted)  # Only merge well-formed fact dicts
    except json.JSONDecodeError:
        pass  # Gracefully skip malformed extractions

Step 4: Wire It Into Your Agent Loop

def run_agent(memory: ConversationMemory, user_input: str) -> str:
    # Assemble context from memory layers
    context = assemble_context(memory)

    # The Anthropic API takes the system prompt as a separate parameter
    # (messages may only use the "user" and "assistant" roles), so split
    # the system-role entries out of the assembled context
    system_prompt = "\n\n".join(m["content"] for m in context if m["role"] == "system")
    messages = [m for m in context if m["role"] != "system"]

    # Append the new user message
    messages.append({"role": "user", "content": user_input})

    # Call the model
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=messages
    )

    assistant_reply = response.content[0].text

    # Update memory with this exchange
    update_memory(memory, user_input, assistant_reply)

    return assistant_reply

# Example usage
memory = ConversationMemory()

reply1 = run_agent(memory, "Hi! I'm Alex and I'm learning Python for a data science interview.")
reply2 = run_agent(memory, "I keep getting confused by list comprehensions.")
reply3 = run_agent(memory, "Can you give me a practice problem?")

# By turn 3, the agent knows: user is Alex, goal is data science interview, struggles with list comprehensions

Token Budget Management: Staying Under Control

Even with smart memory strategies, you need to actively track token usage. Here’s a lightweight budget monitor you can drop into any agent:

def estimate_tokens(messages: list[dict]) -> int:
    # Rough estimate: ~4 characters per token
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return total_chars // 4

def enforce_budget(messages: list[dict], max_tokens: int = 8000) -> list[dict]:
    while estimate_tokens(messages) > max_tokens and len(messages) > 2:
        # Drop the oldest non-system message pair
        for i, msg in enumerate(messages):
            if msg["role"] == "user":
                if i + 1 < len(messages) and messages[i + 1]["role"] == "assistant":
                    messages.pop(i + 1)  # Remove assistant response
                messages.pop(i)          # Remove user message
                break
        else:
            break  # No droppable messages left; avoid looping forever
    return messages

For production systems, use the usage field returned by the Anthropic API to get exact token counts rather than estimates.


Common Mistakes (and How to Avoid Them)

Mistake 1: Summarizing Too Aggressively

If your summaries are too compressed, the agent loses nuance. A user’s offhand comment about hating verbose explanations is exactly the kind of detail a heavy summarizer drops — and it’s exactly the kind of detail that should inform every subsequent response.

Fix: Tune your max_summary_tokens upward and prompt the summarizer to explicitly preserve user preferences and stated constraints.

Mistake 2: Using the Same Model for Memory Operations

Running expensive models (like Claude Opus) for entity extraction and summarization on every turn is a budget killer.

Fix: Use a fast, cheap model (Claude Haiku) for memory operations. Reserve your powerful model for the actual agent response.

Mistake 3: Storing Everything as Raw Text

Embedding entire conversation turns for vector retrieval is noisy. Turn 7 might say “yeah” in response to a yes/no question — that’s not worth embedding and retrieving.

Fix: Pre-filter before embedding. Only embed turns that contain substantive information (longer than N characters, or flagged by an extraction step).
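The pre-filter can be as simple as a length check plus a stoplist of filler replies; the threshold and filler set below are illustrative, not tuned values:

```python
def is_substantive(text: str, min_chars: int = 40) -> bool:
    """Heuristic pre-filter: skip short acknowledgements before embedding."""
    fillers = {"yes", "no", "yeah", "ok", "okay", "thanks", "sure"}
    stripped = text.strip().lower().rstrip(".!?")
    return len(text.strip()) >= min_chars and stripped not in fillers
```

Run every turn through this gate before embedding it, and your retrieval index stays free of one-word noise.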

Mistake 4: Ignoring Memory Staleness

If an agent runs across multiple sessions (days or weeks), old entity extractions can become stale. A user’s goal, location, or knowledge level can change.

Fix: Add timestamps to entity extractions and build a staleness check into your context assembler. For facts older than a threshold, either drop them or flag them as potentially outdated in the prompt.


Real-World Example: A Coding Tutor Agent

Let’s see how this plays out in a concrete scenario. Imagine a Python tutoring agent running on harnessengineering.academy:

  • Turn 1: “I’m Maya. I’m new to Python and I learn best by doing, not reading.”
  • Entity extracted: {"name": "Maya", "learning_style": "hands-on", "level": "beginner"}

  • Turns 2-9: Maya works through variables, functions, and loops with the agent.

  • Summary generated: “Maya is a beginner Python learner who prefers hands-on exercises. She has completed variables, functions, and loops, showing confidence with basic syntax but needing extra practice with loop logic.”

  • Turn 10: “Can we work on something harder?”

  • Context assembled: System prompt includes Maya’s entity profile + the rolling summary + recent turns.
  • Agent responds appropriately: “Great, Maya! Since you’ve got loops down, let’s tackle list comprehensions with some hands-on exercises…”

Without memory, turn 10 would require Maya to re-introduce herself and recap everything. With this architecture, the agent continues seamlessly — exactly like a skilled human tutor would.


Choosing the Right Strategy for Your Use Case

  • Simple chatbot, short sessions: Sliding window only
  • Customer support, medium sessions: Sliding window + summarization
  • Personal assistant, long sessions: Summarization + entity memory
  • Research agent, multi-day sessions: Vector retrieval + entity memory
  • Complex goal-oriented agent: Full hybrid (all four strategies)

What to Build Next

Once your multi-turn memory is working, the natural next step is cross-session persistence — storing memory to a database so the agent picks up where it left off, even after the user closes the browser and comes back the next day. That typically means:

  1. Serializing your ConversationMemory object to JSON and writing it to a datastore (PostgreSQL, Redis, DynamoDB).
  2. Loading it at the start of each session using a user/session ID.
  3. Adding a decay function to gradually deprioritize very old facts.

That’s the subject of our next tutorial — so bookmark this page and keep an eye on the Harness Engineering Academy feed.


Ready to Level Up?

Multi-turn conversation memory is one of the core competencies tested in the Harness AI Agent Engineering Certification. If you want to prove your skills to employers and build a portfolio of real agent systems, our certification program walks you through memory management, tool use, multi-agent orchestration, and production deployment — with hands-on projects at every stage.

Explore the Certification Program →

And if you found this tutorial helpful, share it with a fellow learner. The AI agent engineering community grows stronger when we learn together.


Jamie Park is an educator and career coach at Harness Engineering Academy, specializing in LLM engineering and AI agent systems. She writes beginner-to-advanced tutorials focused on practical, production-ready techniques.
