Modern AI Agent Architectures: How Multi-Agent Systems Like OpenHands and Claude Flow Work
Modern AI systems use multi-agent architectures where a planner decomposes tasks and specialized agents (research, coding, testing, review) execute them in parallel. This approach improves scalability, modularity, and output quality compared to single-LLM workflows. We compare two real approaches: OpenHands (dynamic agent spawning) and Claude Flow (role-based orchestration).
The way AI systems solve complex tasks is changing.
Instead of a single model grinding through a long prompt, modern frameworks break work across multiple coordinated agents - a planner that delegates, workers that execute, and an integrator that merges results.
If you’ve built microservices or worked with task queues, the pattern will feel familiar. These are distributed systems, except the workers are LLMs.
I’ve been studying two approaches that represent different points on this spectrum: OpenHands (dynamic agent spawning) and Claude Flow (role-based agent orchestration). Here’s what I learned.
The Architecture at a Glance
Before diving into specifics, here’s the high-level pattern every multi-agent system follows:
[Diagram: Multi-Agent Architecture Overview - a planner decomposes the task into subtasks (e.g. Auth, Catalog, Checkout), parallel workers execute them, and an integrator merges results and runs verification]
This is the scatter-gather pattern: decompose, fan out to parallel workers, fan in to merge. The two frameworks below implement this differently - but the skeleton is the same.
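That skeleton can be sketched in a few lines of Python with asyncio. Here `plan` and `worker` are placeholders standing in for the planner LLM and worker agents, not any framework's real API:

```python
import asyncio

def plan(request: str) -> list[str]:
    # Decompose the request into independent subtasks (the fan-out set).
    return [f"{request}: {part}" for part in ("auth", "catalog", "checkout")]

async def worker(subtask: str) -> str:
    # In a real system this would be an LLM agent call.
    await asyncio.sleep(0)  # simulate async I/O
    return f"done({subtask})"

async def scatter_gather(request: str) -> list[str]:
    subtasks = plan(request)                 # decompose
    results = await asyncio.gather(          # fan out to parallel workers
        *(worker(t) for t in subtasks)
    )
    return results                           # fan in to merge

results = asyncio.run(scatter_gather("build e-commerce page"))
print(results)
```

The whole pattern is three steps: decompose, `asyncio.gather`, merge. Everything else in this article is elaboration on that loop.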
System-Level Component Map
Every production multi-agent system has these layers. The diagram below shows how data and control flow between them:
[Diagram: System Architecture - the Orchestration Layer (Planner: task graph + deps; Router: model selection; Scheduler: parallel dispatch) sits above the Agent Execution Layer (LLM agents handling reasoning + generation), which sits above the Infrastructure Layer]
Control flows top-down from orchestrator to agents. Data flows bidirectionally between agents and infrastructure. The orchestrator never touches tools directly — agents do.
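That layering rule can be made concrete. In this sketch (illustrative class names, not any real framework's API), only agents hold a toolbox; the orchestrator can dispatch work but has no tool access of its own:

```python
class Toolbox:
    """Holds the concrete tool permissions for one agent."""
    def __init__(self, allowed):
        self.allowed = set(allowed)

    def invoke(self, tool, *args):
        if tool not in self.allowed:
            raise PermissionError(f"tool not permitted: {tool}")
        return f"{tool} invoked"

class Agent:
    def __init__(self, name, toolbox):
        self.name, self.toolbox = name, toolbox

    def run(self, task):
        # Agents touch infrastructure through their toolbox...
        return self.toolbox.invoke("write_file", task)

class Orchestrator:
    # ...while the orchestrator holds no toolbox at all, by design.
    def __init__(self, agents):
        self.agents = agents

    def dispatch(self, task):
        return self.agents[0].run(task)

builder = Agent("builder", Toolbox({"write_file", "run_shell"}))
output = Orchestrator([builder]).dispatch("create ProductCard.tsx")
print(output)
```

Scoping tool permissions per agent is also a safety boundary: a research agent that can only read docs cannot accidentally run shell commands.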
OpenHands: Dynamic Sub-Agent Delegation
OpenHands uses a main planning agent that dynamically spawns short-lived sub-agents to handle pieces of a larger task. Think of it as a supervisor process forking workers.
The Components
Main Agent (Supervisor) - The orchestrator at the top. It receives the user request, holds the full context of the codebase, breaks the task into steps, and decides what to handle directly versus what to delegate.
Sub-Agents (Workers) - Ephemeral executors. Created dynamically via SDK calls like create_agent() or delegate(). Each one receives a focused prompt, does one thing, returns results, and terminates.
They can run on the same model or a cheaper one - cost optimization is built into the architecture.
Key Insight: Sub-agents don’t need full context. They receive only the minimum information required for their task - reducing token cost and improving output quality.
How It Flows
Say a user asks: “Build a simple e-commerce page.”
The main agent analyzes the request and breaks it down:
- Create UI components
- Build API endpoints
- Generate product seed data
- Write tests
Then it delegates. Each sub-agent gets minimal context - just enough to complete its task. They run in parallel, return their outputs, and terminate.
The main agent collects everything, merges the changes, runs tests, and produces the final result.
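The flow above can be sketched as follows. `run_sub_agent` stands in for an SDK call like create_agent() or delegate(), and the context keys are hypothetical - the point is that each worker receives only its slice of the context, never the whole repo:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class SubTask:
    name: str
    context: dict  # only the slice of context this worker needs

async def run_sub_agent(task: SubTask) -> str:
    # Stand-in for create_agent()/delegate(): a real call would send
    # task.context to an LLM and return its generated output.
    await asyncio.sleep(0)
    return f"{task.name}: ok ({len(task.context)} context key(s))"

async def delegate(full_context: dict) -> list[str]:
    # Each sub-agent gets a minimal slice of context.
    subtasks = [
        SubTask("ui", {"design_specs": full_context["design_specs"]}),
        SubTask("api", {"data_schema": full_context["data_schema"]}),
        SubTask("data", {"product_model": full_context["product_model"]}),
    ]
    results = await asyncio.gather(*(run_sub_agent(t) for t in subtasks))
    return list(results)  # the supervisor merges and tests these

ctx = {"design_specs": "...", "data_schema": "...",
       "product_model": "...", "entire_repo": "..."}
outputs = asyncio.run(delegate(ctx))
print(outputs)
```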
[Diagram: OpenHands dynamic sub-agent delegation - the main agent analyzes and decomposes the task, then fans out to four parallel sub-agents]
Real Example: E-Commerce Page Build
Let’s trace a concrete request through this system. A developer asks: “Build an e-commerce product page with a shopping cart.”
Here’s exactly what each agent receives and produces:
The main agent receives "Build an e-commerce product page with shopping cart" and decomposes it like this:

| Agent | Input | Model | Output |
|---|---|---|---|
| UI Agent | Design specs | Claude Sonnet | ProductCard.tsx, CartDrawer.tsx, ProductGrid.tsx |
| API Agent | Data schema | Claude Haiku | cart.ts (store), api/products.ts, api/cart.ts |
| Data Agent | Product model | Claude Haiku | seed/products.json (50 products) |
| Test Agent | All file paths | Claude Sonnet | __tests__/cart.test.ts, __tests__/products.test.ts |

The main agent then merges and verifies: it resolves import conflicts, runs npm test, checks that the build passes, and returns a working project.
Notice each sub-agent uses the minimum model needed. Data generation and simple API routes go to Haiku (cheap, fast). UI components and tests go to Sonnet (better reasoning). The planner uses the most capable model. This is how you optimize cost without sacrificing quality.
Why This Works
The power here is parallelism with isolation. Each sub-agent operates in a narrow scope, which means:
- Less token waste - small context windows per worker
- Fewer hallucinations - focused prompts produce more reliable outputs
- Faster execution - parallel, not sequential
- Cost control - workers can use cheaper models for straightforward tasks
It’s the same reason monoliths get broken into services once they grow past a certain size. Decomposition works.
Sequence: How a Request Flows Through OpenHands
Here’s the time-ordered sequence showing how the supervisor spawns, manages, and collects from sub-agents:
[Diagram: OpenHands request lifecycle (time →) - workers A-C run in parallel immediately after spawn; worker D starts later because it depends on A+B outputs; the supervisor blocks on collect(), then merges and verifies]
Key details in this sequence: the supervisor is not idle during worker execution - it monitors for early failures and can abort/respawn workers. Worker D (tests) has a dependency on A and B, so it starts later. This is a DAG, not a flat fan-out.
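A minimal sketch of that DAG-style dispatch with asyncio: worker D awaits only its own dependencies (A and B), so it starts the moment they finish rather than waiting for C. The worker names and dependency map are illustrative:

```python
import asyncio

# Dependencies: D needs A and B; A, B, and C have none.
DEPS = {"A": [], "B": [], "C": [], "D": ["A", "B"]}

async def run_worker(name, dep_tasks):
    await asyncio.gather(*dep_tasks)  # block only on this worker's deps
    await asyncio.sleep(0)            # stand-in for the agent's actual work
    return f"{name} done"

async def run_dag():
    tasks = {}
    for name in ("A", "B", "C", "D"):  # iterate in topological order
        tasks[name] = asyncio.create_task(
            run_worker(name, [tasks[d] for d in DEPS[name]])
        )
    return {name: await task for name, task in tasks.items()}

dag_results = asyncio.run(run_dag())
print(dag_results)
```

Because each worker blocks on its own dependency tasks rather than on a global barrier, the schedule falls out of the dependency map for free.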
Claude Flow: Role-Based Agent Orchestration
Claude Flow takes a different approach. Instead of spawning temporary agents on the fly, it orchestrates persistent specialized roles - each responsible for a distinct phase of reasoning.
The Roles
| Role | Responsibility |
|---|---|
| Planner Agent | Understands the problem, breaks it into subtasks, assigns work |
| Research Agent | Gathers external information, summarizes docs, retrieves knowledge |
| Builder Agent | Implements code, modifies files, generates architecture |
| Critic Agent | Reviews output for correctness, identifies errors, suggests improvements |
| Integrator Agent | Merges outputs, resolves conflicts, produces the final deliverable |
Agent Roles in Detail
Each role has distinct capabilities, tools, and model configurations:

- Planner - Understands the full request. Breaks it into ordered subtasks with dependencies. Decides which agent handles each piece. Holds the execution plan.
- Research - Searches documentation, APIs, and knowledge bases. Gathers context the Builder needs. Summarizes findings into structured briefs.
- Builder - Writes production code based on research and specs. Creates files, modifies existing code, generates architecture. Focuses purely on implementation.
- Critic - Reviews Builder output for bugs, security issues, and logic errors. Suggests improvements. Can reject and send work back for revision.
- Integrator - Merges all agent outputs into a cohesive deliverable. Resolves conflicts between agents. Ensures consistency across the final result.
How It Flows
A request enters the system and hits the Planner. The Planner routes subtasks to the appropriate role. Each agent processes its piece and passes results forward.
The key difference from OpenHands: these agents aren’t spawned and killed - they’re predefined reasoning stages. The architecture emphasizes role specialization over dynamic delegation. Each agent is tuned (via system prompts, tool access, or temperature settings) for its specific job.
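One way that role tuning might look in code - a sketch with illustrative field names and values, not Claude Flow's actual configuration API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    name: str
    model: str
    temperature: float
    system_prompt: str
    tools: tuple = ()

ROLES = {
    "planner": AgentRole("planner", "opus", 0.2,
        "Decompose the request into ordered subtasks with dependencies."),
    "builder": AgentRole("builder", "sonnet", 0.3,
        "Implement code exactly to spec.",
        tools=("write_file", "run_shell")),
    "critic": AgentRole("critic", "opus", 0.0,
        "Review for bugs and security issues; reject work that fails."),
}

print(ROLES["builder"].tools)
```

Note the temperature choices: the critic runs cold so reviews stay deterministic, while the builder gets slightly more freedom for implementation choices.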
[Diagram: Claude Flow role pipeline - the Planner decomposes the problem and routes subtasks; Research gathers info and retrieves knowledge; the Builder implements code and generates architecture; the Critic reviews for correctness; the Integrator merges outputs and resolves conflicts]
Artifact Pipeline: What Each Stage Produces
The key to understanding Claude Flow is tracking what artifacts each role produces and consumes. Each stage transforms its input into a specific deliverable:
Claude Flow: Artifact Pipeline (Input → Stage → Output)

| Stage | Model | Input | Output |
|---|---|---|---|
| Planner | Opus | user_request.txt | task_graph.json, dependency_map.json |
| Research | Sonnet | task_graph.json | research_brief.md, api_references.json |
| Builder | Sonnet | research_brief.md, task_graph.json | src/**/*.ts (code), schema.prisma |
| Critic | Opus | src/**/*.ts, test_results.log | review.md, fix_requests[] |
| Integrator | Opus | approved code + review | merged_repo/, build_log.txt ✓ |
Each stage produces typed artifacts that flow to the next. The Critic→Builder loop is the key quality gate - it runs until review passes or max retries are hit.
Notice the feedback loop between Critic and Builder. This is what makes Claude Flow’s sequential pipeline more rigorous than a flat fan-out: work is iteratively refined before it reaches the integrator.
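The Critic→Builder gate reduces to a bounded loop. In this sketch, build() and review() are trivial stand-ins for the real agents - a real implementation would prompt an LLM at each step:

```python
MAX_RETRIES = 3

def build(spec, feedback=None):
    # Stand-in Builder: a real call would prompt an LLM with the spec
    # plus the critic's feedback from the previous round.
    return spec if feedback is None else f"{spec} (fixed: {feedback})"

def review(code):
    # Stand-in Critic: returns None when the code passes review.
    return None if "fixed" in code else "missing input validation"

def build_with_review(spec):
    code = build(spec)
    for _ in range(MAX_RETRIES):
        feedback = review(code)
        if feedback is None:
            return code                # quality gate passed
        code = build(spec, feedback)   # revise with the critic's notes
    raise RuntimeError("review failed after max retries")

final = build_with_review("checkout handler")
print(final)
```

The loop is bounded on purpose: without a retry cap, a builder and critic that disagree can ping-pong forever.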
Comparing the Two Approaches
| Feature | OpenHands | Claude Flow |
|---|---|---|
| Agent creation | Dynamic, on-demand | Predefined roles |
| Agent lifespan | Short-lived (ephemeral) | Persistent across the workflow |
| Execution model | Parallel tasks | Sequential reasoning stages |
| Strengths | Speed, cost efficiency | Depth, structured review |
| Best for | Coding automation | Complex reasoning workflows |
Neither approach is strictly better. OpenHands optimizes for throughput - get many things done fast. Claude Flow optimizes for quality - route work through specialized stages that each add rigor.
Side-by-Side: How Each Handles the Same Task
| | OpenHands | Claude Flow |
|---|---|---|
| Execution | Parallel | Sequential pipeline |
| Workers | Ephemeral | Persistent roles |
| Optimized for | Speed | Quality |
In practice, production systems will likely blend both.
Real-World Example: Building an E-Commerce App
Let’s walk through a complete example to make this concrete. A developer asks an AI system: “Build an e-commerce app with product listings, user auth, and a checkout flow.”
First, here’s the task dependency graph (DAG) the planner would produce. Not all tasks can run in parallel - some have hard dependencies:
[Diagram: E-Commerce Task DAG - Tier 0: tasks T1-T4 run in parallel (Haiku ~2min, Sonnet ~4min, Haiku ~1min, Sonnet ~3min); Tier 1: T5 needs T1+T3, T6 needs T2 (sessions), T7 needs T4 (shell); Tier 2: T8 needs T5+T6+T7; Tier 3: T9 needs T8 (all tests pass)]
The planner builds this DAG before any agent starts. Tier 0 tasks run in parallel. Tier 1 tasks start as soon as their specific dependencies resolve - not when all of Tier 0 finishes. This is the critical optimization over naive sequential execution.
Here’s how each framework would then distribute this work:
E-Commerce Build: Task Distribution Across Agents
Planning Phase
Planner identifies 3 independent workstreams: UI/Frontend, Backend/API, and Testing/QA. Creates a dependency graph: Auth must complete before Checkout can reference user sessions.
Parallel Execution
UI Agent
- ProductCard component
- ProductGrid with filters
- CartDrawer with quantities
- CheckoutForm with validation
- LoginModal + SignupFlow
Backend Agent
- POST /api/auth/login
- POST /api/auth/register
- GET /api/products
- POST /api/cart/add
- POST /api/checkout
Test Agent
- Auth flow e2e tests
- Cart CRUD unit tests
- Checkout integration tests
- Product API load tests
- UI component snapshots
Review Phase
Critic agent reviews all outputs: catches SQL injection in auth endpoint, identifies missing CSRF protection on checkout, flags inconsistent error handling between API routes. Sends backend work back for revision.
Integration Phase
Integrator merges all files, resolves import paths, wires API calls to frontend components, runs full test suite. Final output: a working repo with npm run dev ready to go.
The key insight: a single LLM attempting this would lose context by the checkout phase. Multi-agent systems avoid this by keeping each worker’s context window focused on one piece of the problem.
The Bigger Pattern
Strip away the AI-specific details and what you have is a well-known distributed systems pattern:
- Plan - decompose work
- Delegate - assign to workers
- Execute - run independently
- Integrate - merge and verify
This is MapReduce. This is scatter-gather. This is a workflow engine.
[Diagram: The Universal Pattern - plan → delegate → execute → integrate]
The difference is that the workers are language models instead of deterministic functions. That introduces new failure modes - hallucination, drift, inconsistency - but the architectural response is the same: isolate, specialize, verify, integrate.
“Modern AI agent frameworks aren’t inventing new patterns - they’re applying battle-tested distributed systems thinking to LLM orchestration.”
When Things Go Wrong: Partial Failure and Recovery
The diagrams above show the happy path. In practice, agents fail. Code doesn’t compile. Tests don’t pass. An LLM hallucinates an import that doesn’t exist. What matters is how the system recovers.
Here’s a concrete failure scenario:
[Diagram: Failure recovery - the Builder agent produces broken code, triggering bounded retries and escalation]
Escalation policy (if Retry 2 also fails):
1. Upgrade model: re-run with Opus instead of Sonnet
2. Expand context: include related files the agent didn't originally see
3. Human-in-the-loop: pause execution, surface the error to the developer with a diff, ask for guidance
4. Abort & report: mark task as failed, log diagnostics, continue with remaining tasks
The retry loop is bounded, and each retry includes the previous error in the prompt so the agent doesn't repeat the same mistake. The recovery pattern follows a standard escalation ladder: retry with context → upgrade model → expand scope → human intervention → abort. Production systems typically cap this at max_retries=3 per agent, with exponential backoff (1s, 4s, 16s) between attempts to avoid rate limit issues.
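That escalation ladder can be sketched as a pair of nested loops. Here run_task is a stand-in that pretends the cheaper model always fails, purely to exercise the upgrade path; the model names and error text are illustrative:

```python
import time

MODELS = ["sonnet", "opus"]  # escalation path: cheaper model first

def run_task(task, model, error=None):
    # Stand-in agent call. A real call would include `error` in the
    # prompt so the agent doesn't repeat the same mistake.
    if model == "sonnet":
        raise RuntimeError("build failed: missing import")
    return f"{task} done on {model}"

def execute_with_escalation(task, max_retries=3, base_delay=1.0,
                            sleep=time.sleep):
    error = None
    for model in MODELS:                    # escalation ladder
        for attempt in range(max_retries):  # bounded retries per model
            try:
                return run_task(task, model, error)
            except RuntimeError as exc:
                error = str(exc)
                sleep(base_delay * 4 ** attempt)  # 1s, 4s, 16s backoff

    # Out of models: surface to a human, or abort and report.
    raise RuntimeError(f"task failed after escalation: {error}")

# sleep is injected so the demo skips the real backoff delays.
result = execute_with_escalation("auth module", sleep=lambda s: None)
print(result)
```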
Smart Model Routing: Cost vs. Capability
Not every subtask needs your most expensive model. A well-designed orchestrator routes tasks based on complexity, risk, and cost — the same way you’d choose between a senior engineer and a junior dev.
Here’s a routing policy table used in practice:
| Task Type | Model | Cost/1K tokens | Why |
|---|---|---|---|
| Seed data generation | Haiku | ~$0.00025 | Structured output, low reasoning needed |
| CRUD API endpoints | Haiku | ~$0.00025 | Template-based, well-defined patterns |
| UI component creation | Sonnet | ~$0.003 | Needs design sense, moderate reasoning |
| Auth / security logic | Sonnet | ~$0.003 | Higher stakes, needs careful implementation |
| Architecture planning | Opus | ~$0.015 | Complex decomposition, dependency analysis |
| Code review / critique | Opus | ~$0.015 | Deep analysis, needs to catch subtle bugs |
| Conflict resolution | Opus | ~$0.015 | Cross-module reasoning, integration logic |
The router makes this decision per-task, not per-session. A single workflow might use all three tiers:
Cost Optimization: Model Routing for E-Commerce Build

| Tier | Tasks | Tokens | Cost |
|---|---|---|---|
| Opus | Planning, code review, integration | ~30K | ~$0.45 |
| Sonnet | UI components, auth module, cart logic | ~40K | ~$0.12 |
| Haiku | Seed data, CRUD routes, schema gen | ~40K | ~$0.01 |

Total: ~$0.58 versus ~$1.65 if everything ran on Opus - a 65% cost reduction with no quality loss on routine tasks.
The routing decision can be simple — a mapping of task category to model tier — or learned from historical performance data. The key insight: most tokens in a multi-agent workflow are spent on routine work that doesn’t need your best model.
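A category-to-tier mapping of that kind fits in a dozen lines. The categories below mirror the routing policy table, and the prices are the same approximate per-1K-token figures; both are illustrative rather than official pricing:

```python
# Category -> tier mapping mirroring the routing policy table.
ROUTING = {
    "seed_data": "haiku",
    "crud_endpoint": "haiku",
    "ui_component": "sonnet",
    "auth_logic": "sonnet",
    "planning": "opus",
    "code_review": "opus",
    "conflict_resolution": "opus",
}

# Approximate cost per 1K tokens, from the table above.
COST_PER_1K = {"haiku": 0.00025, "sonnet": 0.003, "opus": 0.015}

def select_model(task_category):
    # Unknown categories fall back to the strongest model: overpaying
    # on a routine task is cheaper than failing a hard one.
    return ROUTING.get(task_category, "opus")

def estimate_cost(tasks):
    # tasks: list of (category, token_count) pairs
    return sum(COST_PER_1K[select_model(cat)] * toks / 1000
               for cat, toks in tasks)

workflow = [("planning", 10_000), ("ui_component", 15_000),
            ("seed_data", 20_000)]
cost = estimate_cost(workflow)
print(f"${cost:.3f}")
```

The fallback choice in select_model is the key design decision: when the router is unsure, it should err toward capability, not cost.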
Orchestration in Practice: Pseudo-Code
Here’s the orchestration loop that ties everything together. This is simplified, but it captures the real control flow of a production multi-agent system:
```python
async def orchestrate(user_request: str) -> Result:
    # 1. Plan: decompose into a task DAG
    task_graph = await planner.decompose(
        request=user_request,
        model="opus",  # planning needs the strongest model
    )

    # 2. Execute: run tasks respecting dependencies
    results = {}
    for tier in task_graph.tiers():
        # Tasks within a tier run in parallel
        tier_tasks = [
            execute_agent(
                task=task,
                model=router.select_model(task),  # Haiku/Sonnet/Opus
                context=gather_context(task, results),
                max_retries=3,
            )
            for task in tier.tasks
        ]
        tier_results = await asyncio.gather(*tier_tasks)

        # Check for failures
        for task, result in zip(tier.tasks, tier_results):
            if result.failed:
                result = await escalate(task, result)  # retry → upgrade → human
            results[task.id] = result

    # 3. Review: critic checks all outputs
    review = await critic.review(
        results=results,
        model="opus",  # review needs deep analysis
    )
    if review.has_fixes:
        # Loop: send fix requests back to builder
        for fix in review.fixes:
            results[fix.task_id] = await execute_agent(
                task=fix.revised_task,
                model="sonnet",
                context=fix.error_context + results[fix.task_id],
            )

    # 4. Integrate: merge everything into final output
    return await integrator.merge(
        results=results,
        model="opus",
    )
```
The key patterns in this code:
- Tiered execution - tasks run in parallel within tiers, sequentially across tiers
- Smart routing - router.select_model() picks the cheapest model that can handle the task
- Bounded retry - failures escalate through retry → model upgrade → human → abort
- Critic loop - review happens after all tasks complete, not inline with each task
This is ~50 lines but it captures the architecture of systems processing millions of agent tasks per day.
Core Components of Modern AI Agent Systems
Regardless of which framework you use, production AI agent systems share the same foundational layers:
| Layer | Purpose | Examples |
|---|---|---|
| Planner / Orchestrator | Decomposes tasks, manages execution order, handles dependencies | Task graphs, DAG schedulers |
| LLM Reasoning Engine | Core intelligence - understands instructions, generates outputs | Claude, GPT-4, Gemini |
| Tools & APIs | External capabilities agents can invoke | File system, terminal, web browser, databases |
| Memory Systems | Short-term (conversation) and long-term (persistent) context | Context windows, vector stores, session state |
| Retrieval (RAG) | Grounds agent responses in real data | Embeddings search, document retrieval, knowledge bases |
| Guardrails & Safety | Prevents harmful outputs, enforces constraints | Content filters, output validation, permission scoping |
| Observability & Evaluation | Monitors agent behavior, measures quality | Logging, tracing, automated evals, cost tracking |
These layers are common across frameworks like LangGraph, CrewAI, AutoGen, and OpenHands. The difference between frameworks is primarily in how they wire these layers together - not which layers they include.
Key insight: If you’re evaluating agent frameworks, don’t just compare features. Compare how they handle the hard problems: error recovery, context management, cost optimization, and human-in-the-loop checkpoints.
What This Means for Developers
If you’re building AI-powered tools, the single-agent-in-a-loop pattern will hit a ceiling fast. The path forward looks a lot like the path backend engineering already took:
- Decompose tasks instead of stuffing everything into one prompt
- Specialize agents instead of asking one model to do everything
- Run in parallel where tasks are independent
- Add review stages where correctness matters
- Use cheaper models for routine work, expensive ones for planning and judgment
The tooling is still early, but the architecture is clear. The best AI systems will be the ones that look most like well-designed distributed systems.
Further Reading & Frameworks
If you want to start building with multi-agent architectures, here are the frameworks worth exploring:
- OpenHands - Open-source platform for AI-powered software development agents
- LangGraph - Framework for building stateful, multi-agent applications with LLMs
- CrewAI - Role-based multi-agent orchestration framework
- AutoGen - Microsoft’s framework for building multi-agent conversational systems
Each takes a slightly different approach to the patterns described in this article - but they all converge on the same core idea: specialized agents, coordinated by a planner, are more capable than a single model working alone.
Originally published at aiagentlab.dev
Mohan
Software engineer writing about AI, distributed systems, and the craft of building great software.
Sai Rasmi
Co-author