Multi-Agent Orchestration with OpenClaw: A Comprehensive Workshop

Running a single AI agent is manageable. Running five agents that need to share context, hand off tasks, recover from failures, and stay within budget? That’s a different challenge entirely — and it’s exactly what multi-agent orchestration is designed to solve.

This workshop uses OpenClaw, a lightweight Python orchestration framework built specifically for teams learning to coordinate multiple agents without drowning in infrastructure complexity. By the end of this guide, you’ll have a working multi-agent pipeline that handles real-world research and summarization tasks, complete with role assignment, task routing, shared memory, and graceful error handling.

No prior orchestration experience required. You’ll need Python 3.11+ and familiarity with basic agent concepts (LLM calls, tool use). Let’s build something real.


Why Multi-Agent Orchestration Matters

Before you write a single line of code, it’s worth understanding why you’d split work across multiple agents instead of giving one agent a big prompt.

Single agents hit predictable limits:

  • Context window exhaustion — Long tasks overflow the context budget
  • Skill dilution — One agent trying to research, write, and fact-check simultaneously does all three worse than specialized agents doing one each
  • No parallelism — Sequential processing is slow when subtasks are independent
  • Fragile error recovery — A single failure kills the entire pipeline

Multi-agent systems solve each of these. A supervisor agent breaks work into subtasks. Worker agents execute those subtasks in parallel. A reviewer agent validates outputs before they leave the system. When a worker fails, the supervisor reroutes — the rest of the pipeline keeps running.

This is the core pattern OpenClaw makes approachable for beginners.


What Is OpenClaw?

OpenClaw is an open-source Python framework designed for structured multi-agent coordination. Its design philosophy prioritizes explicitness over magic: every routing decision, memory write, and agent invocation is logged and inspectable.

Key concepts you’ll use in this workshop:

  • Orchestrator — Central coordinator that owns the task graph
  • Agent — A named unit of work backed by an LLM + tools
  • TaskQueue — Ordered work list the orchestrator dispatches from
  • SharedMemory — Key-value store all agents can read and write
  • HarnessPolicy — Rules governing agent behavior (budget, retries, scope)

OpenClaw doesn’t enforce a specific LLM provider — you wire in OpenAI, Anthropic, or a local model through a simple adapter interface.
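
An adapter can be as small as a single method. Here's a minimal sketch of what such an interface might look like — note that the `ModelAdapter` protocol and `complete()` signature below are illustrative shapes, not OpenClaw's exact API:

```python
from typing import Protocol

class ModelAdapter(Protocol):
    """Anything with a complete() method can back an agent."""
    def complete(self, system: str, user: str) -> str: ...

class OpenAIAdapter:
    """Wraps an OpenAI client behind the adapter interface."""
    def __init__(self, client, model: str = "gpt-4o"):
        self.client = client
        self.model = model

    def complete(self, system: str, user: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
        return response.choices[0].message.content

class EchoAdapter:
    """Offline stand-in for tests -- returns the user prompt unchanged."""
    def complete(self, system: str, user: str) -> str:
        return user
```

The payoff of this shape is that agents never import a provider SDK directly — swap `OpenAIAdapter` for a local-model adapter (or `EchoAdapter` in tests) without touching agent code.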


Workshop Overview

We’ll build a Research Pipeline with four specialized agents:

  1. Planner — Breaks a research question into subtopics
  2. Researcher — Fetches and summarizes information per subtopic
  3. Critic — Flags gaps or contradictions in the research
  4. Writer — Synthesizes everything into a final report

This mirrors real production architectures used for competitive intelligence, due diligence, and automated content pipelines.


Part 1: Installation and Project Setup

Start by creating a project directory and installing dependencies:

mkdir openclaw-workshop && cd openclaw-workshop
python -m venv .venv && source .venv/bin/activate
pip install openclaw openai python-dotenv

Create a .env file for your API credentials:

# .env
OPENAI_API_KEY=sk-...
OPENCLAW_LOG_LEVEL=INFO

Initialize your project structure:

mkdir -p agents tools memory
touch main.py agents/__init__.py tools/__init__.py

Your final structure will look like:

openclaw-workshop/
├── agents/
│   ├── planner.py
│   ├── researcher.py
│   ├── critic.py
│   └── writer.py
├── tools/
│   └── search.py
├── memory/
├── main.py
└── .env

Part 2: Defining Agent Roles

The Planner Agent

The Planner receives the user’s research question and produces a list of subtopics. It’s the entry point of your pipeline.

# agents/planner.py
from openclaw import Agent, AgentConfig
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are a research planner. Given a research question,
break it into 3-5 focused subtopics that together provide comprehensive coverage.
Return a JSON object with a single key "subtopics" whose value is a list of
subtopic strings. No explanation — just the JSON."""

def create_planner() -> Agent:
    config = AgentConfig(
        name="planner",
        role="planning",
        max_tokens=512,
        temperature=0.3,
    )

    def run(task: str, memory: dict) -> dict:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Research question: {task}"},
            ],
            response_format={"type": "json_object"},
        )
        subtopics = response.choices[0].message.content
        return {"subtopics": subtopics, "original_question": task}

    return Agent(config=config, run_fn=run)

Notice that the Planner writes its output as structured data. This is a key discipline in multi-agent design: agents should hand off structured data, not prose, so downstream agents don’t need to parse freeform text.
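
One way to make that handoff contract explicit is a typed payload that downstream agents validate before trusting. A sketch of the idea — the `PlannerOutput` dataclass below is my own illustration, not an OpenClaw type:

```python
from dataclasses import dataclass

@dataclass
class PlannerOutput:
    """The contract the Planner promises to honor."""
    subtopics: list[str]
    original_question: str

    def validate(self) -> None:
        """Fail fast if the Planner broke its contract."""
        if not self.subtopics:
            raise ValueError("Planner returned no subtopics")
        if not all(isinstance(s, str) for s in self.subtopics):
            raise ValueError("Every subtopic must be a string")

# Downstream code constructs the dataclass instead of trusting a raw dict:
payload = PlannerOutput(
    subtopics=["agent contracts", "retry policies", "shared memory"],
    original_question="What makes multi-agent pipelines reliable?",
)
payload.validate()  # raises immediately on a malformed handoff
```

Catching a malformed handoff at the boundary is far cheaper than debugging a Writer that silently synthesized from garbage.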

The Researcher Agent

The Researcher receives a single subtopic and returns a focused summary. In production you’d plug in a real search API; here we’ll simulate it with a stub tool to keep the workshop self-contained.

# tools/search.py
def web_search(query: str) -> str:
    """Stub: replace with SerpAPI, Tavily, or Exa in production."""
    return (
        f"[Simulated search results for: {query}]\n"
        f"Key findings: This topic has significant recent developments in 2025-2026, "
        f"with practitioners reporting improved reliability when using structured "
        f"orchestration patterns. Major frameworks have converged on supervisor-worker "
        f"architectures as the default production pattern."
    )

# agents/researcher.py
from openclaw import Agent, AgentConfig
from openai import OpenAI
from tools.search import web_search

client = OpenAI()

SYSTEM_PROMPT = """You are a research specialist. Given a subtopic, search for
relevant information and produce a concise 150-200 word summary. Focus on
facts, specific examples, and cited patterns. Be direct."""

def create_researcher() -> Agent:
    config = AgentConfig(
        name="researcher",
        role="research",
        max_tokens=400,
        temperature=0.2,
    )

    def run(task: str, memory: dict) -> dict:
        search_results = web_search(task)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {
                    "role": "user",
                    "content": f"Subtopic: {task}\n\nSearch results:\n{search_results}",
                },
            ],
        )
        summary = response.choices[0].message.content
        return {"subtopic": task, "summary": summary}

    return Agent(config=config, run_fn=run)

The Critic Agent

The Critic reads all researcher outputs and identifies gaps, contradictions, or low-confidence claims. This is the quality gate before writing begins.

# agents/critic.py
from openclaw import Agent, AgentConfig
from openai import OpenAI
import json

client = OpenAI()

SYSTEM_PROMPT = """You are a critical reviewer of research summaries. Given a set of
subtopic summaries, identify: (1) logical contradictions, (2) missing important angles,
(3) unsupported claims. Return a JSON object with keys: 'issues' (list of strings)
and 'confidence' (float 0-1). If confidence < 0.7, flag for human review."""

def create_critic() -> Agent:
    config = AgentConfig(
        name="critic",
        role="review",
        max_tokens=512,
        temperature=0.1,
    )

    def run(task: str, memory: dict) -> dict:
        summaries = memory.get("research_summaries", [])
        combined = "\n\n".join(
            f"**{s['subtopic']}**\n{s['summary']}" for s in summaries
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Review these summaries:\n\n{combined}"},
            ],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)

    return Agent(config=config, run_fn=run)

The Writer Agent

The Writer synthesizes all summaries into a polished final report, informed by the Critic’s feedback.

# agents/writer.py
from openclaw import Agent, AgentConfig
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are a technical writer. Given research summaries and a critic's
feedback, write a well-structured report (400-600 words) with an executive summary,
key findings, and a conclusion. Use clear headings. Address any issues the critic raised."""

def create_writer() -> Agent:
    config = AgentConfig(
        name="writer",
        role="synthesis",
        max_tokens=1024,
        temperature=0.5,
    )

    def run(task: str, memory: dict) -> dict:
        summaries = memory.get("research_summaries", [])
        critique = memory.get("critique", {})
        combined = "\n\n".join(
            f"**{s['subtopic']}**\n{s['summary']}" for s in summaries
        )
        # "or" also covers an empty issues list, not just a missing key
        issues = "\n".join(critique.get("issues") or ["None identified"])

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {
                    "role": "user",
                    "content": (
                        f"Original question: {task}\n\n"
                        f"Research summaries:\n{combined}\n\n"
                        f"Critic's issues to address:\n{issues}"
                    ),
                },
            ],
        )
        return {"report": response.choices[0].message.content}

    return Agent(config=config, run_fn=run)

Part 3: Wiring the Orchestrator

Now that agents are defined, the Orchestrator connects them into a pipeline. This is where OpenClaw’s HarnessPolicy comes in — it governs retry behavior, budget limits, and escalation rules.

# main.py
import json
from dotenv import load_dotenv
from openclaw import Orchestrator, SharedMemory, HarnessPolicy, TaskQueue

from agents.planner import create_planner
from agents.researcher import create_researcher
from agents.critic import create_critic
from agents.writer import create_writer

load_dotenv()

def build_pipeline() -> Orchestrator:
    policy = HarnessPolicy(
        max_retries=2,
        retry_backoff_seconds=2,
        budget_limit_usd=1.00,       # Hard stop at $1 for this workshop
        escalate_on_low_confidence=True,
        confidence_threshold=0.7,
    )

    memory = SharedMemory()

    orchestrator = Orchestrator(
        name="research-pipeline",
        policy=policy,
        memory=memory,
    )

    orchestrator.register_agent("planner", create_planner())
    orchestrator.register_agent("researcher", create_researcher())
    orchestrator.register_agent("critic", create_critic())
    orchestrator.register_agent("writer", create_writer())

    return orchestrator


def run_research(question: str) -> str:
    orchestrator = build_pipeline()
    memory = orchestrator.memory

    # Step 1: Plan
    print("Step 1/4: Planning subtopics...")
    plan_result = orchestrator.invoke("planner", task=question)
    parsed = json.loads(plan_result["subtopics"])
    # json_object mode wraps the list in an object, e.g. {"subtopics": [...]}
    subtopics = parsed.get("subtopics", parsed) if isinstance(parsed, dict) else parsed
    print(f"  → {len(subtopics)} subtopics identified")

    # Step 2: Research (parallel dispatch)
    print("Step 2/4: Researching subtopics in parallel...")
    queue = TaskQueue(tasks=subtopics, agent="researcher")
    research_results = orchestrator.dispatch_parallel(queue)
    memory.write("research_summaries", research_results)
    print(f"  → {len(research_results)} summaries collected")

    # Step 3: Critique
    print("Step 3/4: Running quality review...")
    critique = orchestrator.invoke("critic", task=question)
    memory.write("critique", critique)
    confidence = critique.get("confidence", 1.0)
    print(f"  → Confidence score: {confidence:.2f}")

    if confidence < 0.7:
        print("  ⚠ Low confidence — flagging for human review")
        # In production, this would trigger a Slack alert or pause for review
        print("  Continuing for workshop purposes...")

    # Step 4: Write
    print("Step 4/4: Synthesizing final report...")
    write_result = orchestrator.invoke("writer", task=question)
    return write_result["report"]


if __name__ == "__main__":
    question = "What are the emerging best practices for AI agent harness engineering in 2026?"
    print(f"\nResearch question: {question}\n{'='*60}\n")
    report = run_research(question)
    print("\nFINAL REPORT\n" + "="*60)
    print(report)

Run it:

python main.py

You’ll see each step log to the console, with the final report printed at the end. The entire pipeline is observable, inspectable, and budget-capped.
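
The budget cap is worth understanding mechanically: the harness meters estimated cost per call and hard-stops the run once the limit is crossed. A toy sketch of that idea — this `BudgetMeter` and its flat per-token rate are illustrative, not OpenClaw internals:

```python
class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend crosses the configured limit."""

class BudgetMeter:
    """Accumulates per-call cost and refuses further work past the limit."""
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, tokens: int, usd_per_1k_tokens: float = 0.01) -> None:
        self.spent_usd += tokens / 1000 * usd_per_1k_tokens
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"Spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f} budget"
            )

meter = BudgetMeter(limit_usd=1.00)
meter.charge(tokens=50_000)  # $0.50 of the $1.00 cap -- fine
```

Real harnesses price input and output tokens separately and per model, but the control flow — meter, compare, raise — is the same.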


Part 4: Handling Failures Gracefully

Real pipelines fail. Researchers time out. LLMs return malformed JSON. APIs rate-limit you. OpenClaw’s HarnessPolicy handles retries automatically, but you should also build fallback logic at the orchestration level.

Add this helper to main.py:

from openclaw.exceptions import AgentFailureError, BudgetExceededError

def safe_invoke(orchestrator, agent_name: str, task: str, fallback=None):
    try:
        return orchestrator.invoke(agent_name, task=task)
    except AgentFailureError as e:
        print(f"  ✗ Agent '{agent_name}' failed after retries: {e}")
        return fallback or {}
    except BudgetExceededError:
        print("  ✗ Budget limit reached — stopping pipeline")
        raise

Replace bare orchestrator.invoke() calls with safe_invoke(). If a Researcher fails on one subtopic, the pipeline continues with the remaining results rather than crashing entirely.
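
The same principle applies to the parallel research step: collect whatever succeeded, record what didn't, and move on. Here's a framework-agnostic sketch of that pattern in plain Python (no OpenClaw types), using a simulated flaky worker:

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_with_fallback(fn, tasks):
    """Run fn over tasks in parallel; keep successes, report failures."""
    successes, failures = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(fn, task): task for task in tasks}
        for future, task in futures.items():
            try:
                successes.append(future.result())
            except Exception as exc:
                failures.append((task, exc))
    return successes, failures

def flaky_research(subtopic: str) -> dict:
    """Stand-in for a Researcher that sometimes times out."""
    if "fail" in subtopic:
        raise RuntimeError("simulated timeout")
    return {"subtopic": subtopic, "summary": f"notes on {subtopic}"}

ok, bad = dispatch_with_fallback(
    flaky_research, ["shared memory", "fail-case", "task routing"]
)
# ok holds the two completed summaries; bad holds the failed subtopic + error
```

A two-out-of-three research base still gives the Writer something to work with — and the Critic's confidence score will reflect the gap.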

This is the difference between a fragile demo and a production-ready system.


Part 5: Inspecting Shared Memory

One of OpenClaw’s most useful debugging features is its memory inspector. After a pipeline run, you can dump the full shared memory state:

# At the end of run_research()
print("\n--- Memory Snapshot ---")
for key, value in orchestrator.memory.snapshot().items():
    print(f"{key}: {str(value)[:120]}...")

This lets you trace exactly what each agent read and wrote. In production, you’d persist this snapshot to a database for post-run auditing — a core requirement for any regulated use case.
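
Persisting the snapshot takes only a few lines with the standard library. A sketch using SQLite — the plain dict passed in below stands in for whatever `orchestrator.memory.snapshot()` returns:

```python
import json
import sqlite3
import time

def persist_snapshot(db_path: str, run_id: str, snapshot: dict) -> None:
    """Store one memory snapshot per pipeline run for post-run auditing."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memory_snapshots "
        "(run_id TEXT, captured_at REAL, state TEXT)"
    )
    conn.execute(
        "INSERT INTO memory_snapshots VALUES (?, ?, ?)",
        (run_id, time.time(), json.dumps(snapshot, default=str)),
    )
    conn.commit()
    conn.close()

# Example: one auditable row per run (use a real file path in production)
persist_snapshot(":memory:", "run-001", {"critique": {"confidence": 0.82}})
```

Querying snapshots by `run_id` later lets you replay exactly what each agent saw — the audit trail regulated deployments require.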


What to Build Next

You’ve completed the core workshop. Here’s how to extend this into a production system:

Add real search tools — Swap the stub web_search() function for Tavily or Exa, both of which have Python SDKs designed for agent use.

Add streaming output — OpenClaw supports streaming agent outputs so users see progress in real-time rather than waiting for the full pipeline.

Persist memory to a database — Replace SharedMemory with RedisSharedMemory or PostgresSharedMemory (available as OpenClaw extensions) so pipeline state survives restarts.

Add human-in-the-loop checkpoints — Use orchestrator.pause_for_review() at the Critic step. When confidence is low, the pipeline pauses and sends a webhook notification before proceeding.

Deploy as an API — Wrap run_research() in a FastAPI endpoint and deploy to a VPS or cloud function. The stateless design means horizontal scaling is trivial.


Key Takeaways

After this workshop, you should be able to:

  • Define specialized agents with clear role boundaries and structured I/O
  • Wire agents into a pipeline using an Orchestrator with shared memory
  • Dispatch work in parallel using TaskQueues for independent subtasks
  • Enforce behavioral guardrails through HarnessPolicy (budget, retries, escalation)
  • Handle failures gracefully so one broken agent doesn’t kill the pipeline

The biggest shift in thinking from single-agent to multi-agent development is treating agents as services with contracts rather than clever prompts. The Planner promises to return a JSON list of subtopics. The Researcher promises to return a summary dict. The Critic promises to return confidence scores. When agents honor these contracts, the orchestrator can manage them reliably — and when they don’t, the harness catches it.

That’s harness engineering at its core.


Continue Learning

Have questions about this workshop? Drop them in the comments below. Every pipeline you build teaches you something the docs don’t — share what you find.


Kai Renner is a senior AI/ML engineering leader and the author behind harnessengineering.academy. He writes tutorials and career guides for engineers learning to build reliable, production-grade AI agent systems.
