What is the difference between a single-step LLM call and a multi-step agent?

A single-step call is stateless request/response — one prompt, one answer. A multi-step agent plans a task, calls tools, reads their output, decides the next action, and loops until done. State, memory, retries, and tool schemas exist to make that loop reliable in production.
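The toy loop below makes the difference concrete: a single-step call is the body of the loop run once. It is a sketch, not any SDK's API. The function scripted_model is a hypothetical stand-in for an LLM call that returns either a tool request or a final answer, and the tool name and the max_steps default are illustrative.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking a string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_db_version": lambda _arg: "PostgreSQL 16",
}

def scripted_model(history: list[dict]) -> dict:
    """Stand-in for an LLM call: requests one tool, then gives a final answer."""
    tool_results = [m["content"] for m in history if m["role"] == "tool"]
    if not tool_results:
        return {"type": "tool_call", "tool": "lookup_db_version", "args": ""}
    return {"type": "final", "content": f"We run {tool_results[0]}."}

def run_agent(task: str, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):                         # step budget guards against loops
        action = scripted_model(history)               # decide the next action
        if action["type"] == "final":
            return action["content"]                   # done
        result = TOOLS[action["tool"]](action["args"])  # call the tool
        history.append({"role": "tool", "content": result})  # read its output, loop
    raise RuntimeError(f"no final answer within {max_steps} steps")

print(run_agent("Which database do we use?"))  # prints: We run PostgreSQL 16.
```

State, memory, retries, and schemas all attach to this loop: history is the state, TOOLS is where schemas live, and the tool call is where retries go.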

Which SDK should I pick: Claude Agent SDK, OpenAI Assistants, or LangGraph?

Pick Claude Agent SDK for managed infrastructure with sandboxed tool execution and built-in memory. Pick OpenAI Assistants if you are already in the OpenAI ecosystem and want turn-key file search and code interpreter. Pick LangGraph if you need fine-grained graph control, multiple model providers, or complex branching.

How does the Claude Agent SDK handle retries?

The SDK supports pipeline-style multi-agent setups with a configurable retry policy: max_attempts, an exponential backoff strategy, and the set of errors that count as retryable. Sub-agents are retried under the same max_attempts setting, and checkpoints save progress between stages so a failed run can resume instead of starting over.

What is the memory tool in Claude Agent SDK?

It is a just-in-time retrieval store: the agent writes notes, plans, and intermediate results into memory, then reads them back when relevant — keeping the active context focused instead of carrying the full history forward. This is the main pattern for long-running workflows that outlive a single context window.

When should I add human-in-the-loop checkpoints?

Any irreversible action (send money, write to production DB, send email to customers, delete data, deploy code) should pause for human approval. Human-in-the-loop is not about distrust — it is about blast-radius control. Reversible actions can run unattended.

Agentic Dev: Building Production Multi-Step Agents 2026 — India Guide 2026

Claude Agent SDK, OpenAI Assistants, LangGraph — memory, retries, tool schemas, observability

Last updated: April 19, 2026

Most agent tutorials stop at "here is a ReAct loop in 30 lines." Production agents are different — they need to survive restarts, handle flaky tools, hold state across hours, and fail safely when something goes wrong. This guide walks through the five non-negotiables for production multi-step agents, with runnable code in the Claude Agent SDK, OpenAI Assistants, and LangGraph.

If you are new to agents, read AI agents tutorial 2026 and agentic AI workflows first. This guide assumes you have built a toy agent and hit the limits.

Key Takeaways

Production agents need five things: durable memory, retries with backoff, typed tool schemas, human-in-the-loop, observability.
The Claude Agent SDK provides managed infrastructure out of the box: sandboxes, state management, and checkpointing.
OpenAI Assistants is strongest for built-in file search and code interpreter, and weakest for custom orchestration.
LangGraph gives you graph control across providers; you write more code but own the runtime.
MCP is the tool layer, not the orchestration layer — use it inside any of the three.

The Five Non-Negotiables

+----------------------+
| 1. Durable memory    |  survives restarts, outlives context window
+----------------------+
| 2. Retries + backoff |  flaky tools, transient 429s, network blips
+----------------------+
| 3. Typed tool schemas|  strong contracts, validated inputs/outputs
+----------------------+
| 4. Human-in-the-loop |  pause for approval on irreversible actions
+----------------------+
| 5. Observability     |  traces, metrics, replay — not just logs
+----------------------+

Skip any one of these and you have a demo, not a system. Let's take them in order.

1. Durable Memory

The mistake: treating the LLM context window as memory. The fix: treat the context window as working memory, and push long-lived state to durable storage.

The Claude Agent SDK memory tool is the simplest path. The agent writes notes and reads them back on demand:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "memory_20260401", "name": "memory"}],
    messages=[
        {"role": "user", "content": "Start work on migration. "
                                    "Remember: we use PostgreSQL 16, "
                                    "EF Core 9, soft delete flag is 'IsDeleted'."}
    ],
)
# The model issues memory tool calls (view, create, str_replace, ...) that your
# application executes against storage you control; later turns read the
# constraints back instead of re-sending the full history.

OpenAI Assistants uses thread-scoped memory: every message you add to a thread is persisted automatically under that thread's ID:

from openai import OpenAI

client = OpenAI()

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="We use PostgreSQL 16 and soft delete via IsDeleted.",
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id="asst_abc123",
)
# Thread persists until you delete it.

LangGraph gives you checkpointers — a pluggable state backend (SQLite, Postgres, Redis):

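from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# builder is your StateGraph(AgentState), with nodes and edges already added.
with PostgresSaver.from_conn_string(
    "postgresql://localhost/agent_state"
) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)

    # State is checkpointed to Postgres after every step, keyed by thread_id.
    config = {"configurable": {"thread_id": "migration-42"}}
    graph.invoke(inputs, config)  # inputs: your initial AgentState

    # After a crash, restart the process and call graph.invoke(None, config)
    # with the same thread_id to resume from the last checkpoint.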

Rule of thumb: if your agent runs longer than a single HTTP request, you need durable memory. Thread-scoped memory works for chat; long-running workflows need a checkpointer or the memory tool.
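Without a managed memory tool, the underlying idea is small: long-lived facts go to storage that outlives the process, and the agent loads only the keys it needs into context. A minimal sketch using Python's built-in sqlite3; the NotesStore class and its method names are hypothetical, not part of any SDK.

```python
import sqlite3
from typing import Optional

class NotesStore:
    """Durable key-value notes for an agent (hypothetical helper, not an SDK class)."""

    def __init__(self, path: str = "agent_memory.db") -> None:
        # A file path survives restarts; ":memory:" is handy for tests.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS notes (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def write(self, key: str, value: str) -> None:
        # Upsert, so the agent revises a note instead of piling up duplicates.
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO notes (key, value) VALUES (?, ?) "
                "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
                (key, value),
            )

    def read(self, key: str) -> Optional[str]:
        row = self.conn.execute("SELECT value FROM notes WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def keys(self, prefix: str = "") -> list[str]:
        # Lets the agent list what it knows (e.g. "tickets/") before loading anything.
        rows = self.conn.execute(
            "SELECT key FROM notes WHERE key LIKE ? ORDER BY key", (prefix + "%",)
        ).fetchall()
        return [r[0] for r in rows]
```

Expose read, write, and keys to the agent as tools and it can reload constraints such as "PostgreSQL 16, EF Core 9, IsDeleted" after a restart instead of carrying them in every prompt.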

2. Retries with Exponential Backoff

Tools fail. APIs rate-limit. Networks blip. Your agent loop must retry.

Claude Agent SDK has built-in retry for sub-agents:

from claude_agent_sdk import ClaudeAgent, RetryPolicy

agent = ClaudeAgent(
    model="claude-opus-4-7",
    retry_policy=RetryPolicy(
        max_attempts=3,
        backoff="exponential",
        initial_delay_seconds=1.0,
        max_delay_seconds=30.0,
        retryable_errors=["rate_limit", "timeout", "tool_error"],
    ),
)

result = await agent.run(
    task="Generate and deploy the migration",
    max_steps=40,
    checkpoint_every=5,  # checkpoint after every 5 steps
)

OpenAI's Python client retries failed HTTP requests (connection errors, 429 rate limits, 5xx responses) with a short exponential backoff. Configure it on the client:

from openai import OpenAI

client = OpenAI(
    max_retries=3,
    timeout=30.0,
)
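# max_retries covers HTTP-level failures only. Tool call failures inside
# a run need explicit handling: wrap your tool in a retry, then report the
# result with client.beta.threads.runs.submit_tool_outputs(...).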

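LangGraph: retries around external calls are largely yours to own (recent versions can also attach a retry policy to individual nodes). The common pattern for a flaky tool uses tenacity: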

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    reraise=True,
)
def call_flaky_tool(args):
    # external_api is a placeholder for your HTTP client or vendor SDK.
    return external_api.fetch(args)

A production agent that makes 40 tool calls with no retries has roughly a 33% chance that at least one call fails, even when each call is 99% reliable (1 - 0.99^40 ≈ 0.33). Add retries.
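The arithmetic behind that number, and what three attempts per call buys you. This assumes every call fails independently with the same probability; real failures are often correlated (an outage fails every retry), so treat the retried figure as a best case.

```python
def run_failure_probability(calls: int, p_fail: float, attempts: int = 1) -> float:
    """P(at least one call in the run ultimately fails), assuming independent failures.

    A call only fails if all of its attempts fail, so its failure rate is p_fail ** attempts.
    """
    per_call_fail = p_fail ** attempts
    return 1 - (1 - per_call_fail) ** calls

print(f"{run_failure_probability(40, 0.01):.1%}")              # prints 33.1%
print(f"{run_failure_probability(40, 0.01, attempts=3):.4%}")  # prints 0.0040%
```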

3. Typed Tool Schemas

Agents that pass raw strings to tools fail unpredictably. Tools should declare typed inputs and validate them.

Claude Agent SDK / Anthropic tool use format:

tools = [
    {
        "name": "create_ticket",
        "description": "Create a Jira ticket in the given project.",
        "input_schema": {
            "type": "object",
            "properties": {
                "project_key": {"type": "string", "pattern": "^[A-Z]{2,10}$"},
                "title": {"type": "string", "minLength": 5, "maxLength": 200},
                "priority": {
                    "type": "string",
                    "enum": ["Low", "Medium", "High", "Critical"],
                },
            },
            "required": ["project_key", "title"],
        },
    }
]

OpenAI Assistants uses the same JSON Schema format under tools[].function.parameters.
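Because both providers take a JSON Schema object, one schema can serve both by moving it under a different key. A minimal conversion sketch, assuming the Anthropic-style dict shown above as input and OpenAI's function-tool shape ({"type": "function", "function": {...}}) as output; to_openai_tool is a hypothetical helper name.

```python
def to_openai_tool(anthropic_tool: dict) -> dict:
    """Re-wrap an Anthropic tool definition in OpenAI's function-tool shape.

    The JSON Schema itself is reused unchanged: Anthropic's input_schema
    becomes OpenAI's function.parameters.
    """
    return {
        "type": "function",
        "function": {
            "name": anthropic_tool["name"],
            "description": anthropic_tool.get("description", ""),
            "parameters": anthropic_tool["input_schema"],
        },
    }

create_ticket = {
    "name": "create_ticket",
    "description": "Create a Jira ticket in the given project.",
    "input_schema": {
        "type": "object",
        "properties": {"project_key": {"type": "string"}, "title": {"type": "string"}},
        "required": ["project_key", "title"],
    },
}
openai_tool = to_openai_tool(create_ticket)
```

Keeping one dict as the source of truth and deriving the other shape means a schema change lands in one place.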

MCP servers — the standard way to expose tools across providers — let you publish tool schemas once and consume them from any client. See MCP Servers Tutorial for the full pattern and What is MCP for the conceptual overview.

The discipline: write your schema once, validate every input with a library like pydantic or zod before the tool executes, and return typed errors back to the agent so it can self-correct.
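Here is what "validate, then return typed errors" can look like for the create_ticket schema above. To stay dependency-free this sketch hand-writes the checks that pydantic or zod would generate for you; the error shape (error, field, message) is an assumption, and any consistent structure the model can read works.

```python
import re
from typing import Any

PROJECT_KEY = re.compile(r"[A-Z]{2,10}")
PRIORITIES = {"Low", "Medium", "High", "Critical"}

def validate_create_ticket(args: dict[str, Any]) -> list[dict[str, str]]:
    """Check args against the create_ticket schema; an empty list means valid."""
    errors: list[dict[str, str]] = []

    def fail(field: str, message: str) -> None:
        errors.append({"error": "validation_error", "field": field, "message": message})

    key = args.get("project_key")
    if not isinstance(key, str) or not PROJECT_KEY.fullmatch(key):
        fail("project_key", "required; 2-10 uppercase letters, e.g. 'OPS'")
    title = args.get("title")
    if not isinstance(title, str) or not 5 <= len(title) <= 200:
        fail("title", "required; a string of 5-200 characters")
    priority = args.get("priority")
    if "priority" in args and (not isinstance(priority, str) or priority not in PRIORITIES):
        fail("priority", f"must be one of {sorted(PRIORITIES)}")
    return errors

def create_ticket_tool(args: dict[str, Any]) -> dict[str, Any]:
    errors = validate_create_ticket(args)
    if errors:
        # Sent back to the model as the tool result so it can fix its arguments.
        return {"ok": False, "errors": errors}
    # Valid input: call your real ticketing client here (not shown).
    return {"ok": True}
```

Returning {"ok": False, "errors": [...]} instead of raising keeps the loop alive: the model sees which field was wrong and why, and usually corrects itself on the next step.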

4. Human-in-the-Loop

Any irreversible action should pause for human approval. The pattern in practice:

from claude_agent_sdk import ClaudeAgent, ApprovalRequired

agent = ClaudeAgent(
    model="claude-opus-4-7",
    approval_policy={
        "send_email": ApprovalRequired.ALWAYS,
        "delete_record": ApprovalRequired.ALWAYS,
        "run_migration": ApprovalRequired.ALWAYS,
        "read_file": ApprovalRequired.NEVER,
        "write_code": ApprovalRequired.NEVER,
    },
)

# When the agent attempts an approval-required tool, the SDK
# emits an approval_request event. Your app routes this to 
# Slack, email, or a dashboard. The agent waits.

async for event in agent.run(task="Clean up stale accounts"):
    if event.type == "approval_request":
        decision = await slack_approval(event)
        await agent.resolve_approval(event.id, approved=decision)

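LangGraph has first-class interrupt support: pass node names to interrupt_before or interrupt_after when compiling. Interrupts need a checkpointer, because the paused state has to be saved somewhere:

# builder: your StateGraph(AgentState) with send_email and run_migration nodes;
# checkpointer: e.g. the PostgresSaver from section 1.
graph = builder.compile(
    checkpointer=checkpointer,
    interrupt_before=["send_email", "run_migration"],
)
config = {"configurable": {"thread_id": "cleanup-run-1"}}
graph.invoke(inputs, config)  # pauses before send_email / run_migration
graph.invoke(None, config)    # after a human approves, resume from the checkpoint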

Rule of thumb: if undoing the action takes more than 10 minutes, require approval.
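Without native approval events, the same rule can be enforced with a small gate in front of your tool dispatcher. A framework-agnostic asyncio sketch; IRREVERSIBLE_TOOLS, dispatch, and ask_human are hypothetical names, and in production ask_human would post to Slack or a dashboard and await the reply.

```python
import asyncio
from typing import Awaitable, Callable

Tool = Callable[[dict], Awaitable[str]]
Approver = Callable[[str, dict], Awaitable[bool]]

# Tools whose effects take more than ~10 minutes to undo (the rule of thumb above).
IRREVERSIBLE_TOOLS = {"send_email", "delete_record", "run_migration"}

async def dispatch(tool_name: str, args: dict, tools: dict[str, Tool], ask_human: Approver) -> str:
    """Run a tool, first pausing for human approval if it is irreversible."""
    if tool_name in IRREVERSIBLE_TOOLS:
        approved = await ask_human(tool_name, args)  # the run waits here for a human decision
        if not approved:
            # Return the rejection to the agent as a tool result so it can re-plan.
            return f"REJECTED: a human declined {tool_name}"
    return await tools[tool_name](args)

async def demo() -> list[str]:
    async def send_email(args: dict) -> str:
        return f"sent to {args['to']}"

    async def read_file(args: dict) -> str:
        return f"read {args['path']}"

    async def deny_all(tool_name: str, args: dict) -> bool:
        return False  # stand-in for a Slack or dashboard approval flow

    tools = {"send_email": send_email, "read_file": read_file}
    return [
        await dispatch("read_file", {"path": "notes.md"}, tools, deny_all),
        await dispatch("send_email", {"to": "customer@example.com"}, tools, deny_all),
    ]

print(asyncio.run(demo()))
# prints: ['read notes.md', 'REJECTED: a human declined send_email']
```

Reversible tools such as read_file pass straight through; only the listed tools ever wait on a person.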

5. Observability

Logs are not enough. You need traces, metrics, and the ability to replay a failed run.

Minimum production setup:

Structured logs: run_id, step_id, tool_name, input_size, output_size, duration_ms, and cost_usd for every step.
Distributed traces: one span per step, nested under a parent span per run. OpenTelemetry can instrument any of the three SDKs.
Metrics dashboard: p50/p95 step duration, success rate by tool, token usage per run.
Replay store: the full message log plus tool outputs, so you can replay a failed run step by step exactly as it happened.
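A minimal standard-library sketch of the logging and metrics items: one JSON log line per step with the fields listed above, and the dashboard numbers (p50/p95 duration, success rate by tool) computed from those records. The StepRecord dataclass and its ok field are assumptions added for illustration; in production the lines go to your log pipeline and the dashboard does the aggregation.

```python
import json
import logging
import statistics
from collections import defaultdict
from dataclasses import asdict, dataclass

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent")

@dataclass
class StepRecord:
    run_id: str
    step_id: int
    tool_name: str
    input_size: int
    output_size: int
    duration_ms: float
    cost_usd: float
    ok: bool  # assumption: not in the field list above, needed for success rate

def log_step(record: StepRecord) -> None:
    # One JSON object per line: easy to ship, filter by run_id, and replay.
    log.info(json.dumps(asdict(record)))

def summarize(records: list[StepRecord]) -> dict:
    """Dashboard numbers from step records (needs at least two records)."""
    # 99 cut points for n=100: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles([r.duration_ms for r in records], n=100, method="inclusive")
    outcomes: dict[str, list[bool]] = defaultdict(list)
    for r in records:
        outcomes[r.tool_name].append(r.ok)
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "success_rate": {tool: sum(oks) / len(oks) for tool, oks in outcomes.items()},
    }
```

The "inclusive" quantile method keeps p95 within the observed range; the default "exclusive" method can extrapolate past the slowest step on small samples.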

Example OpenTelemetry instrumentation:

from opentelemetry import trace

tracer = trace.get_tracer("agent")

async def run_step(step):
    with tracer.start_as_current_span("agent.step") as span:
        span.set_attribute("step.tool", step.tool_name)
        span.set_attribute("step.input_tokens", step.input_tokens)
        result = await step.execute()
        span.set_attribute("step.output_tokens", result.output_tokens)
        span.set_attribute("step.cost_usd", result.cost_usd)
        return result

Without this, debugging a failed long-running agent is guesswork.

Side-by-Side Comparison

| Feature | Claude Agent SDK | OpenAI Assistants | LangGraph |
|---------|-----------------|-------------------|-----------|
| Managed runtime | Yes | Yes | No (self-hosted) |
| Multi-provider | No (Claude only) | No (OpenAI only) | Yes |
| Memory | Native memory tool | Thread-scoped | Checkpointer |
| Retries built-in | Yes (exp. backoff) | Client-level (HTTP) | Node retry policy or DIY (tenacity) |
| Human-in-the-loop | Native approval events | Manual via run states | Native interrupts |
| Graph control | Linear pipelines | Linear runs | Full graph (branches, cycles) |
| Observability | OTel integration | Logs API | DIY with LangSmith |
| Cost control | Effort levels | Token limits | Per-node limits |
| Sandboxed tools | Yes | Yes (code interp.) | No |
| Best for | Claude-native prod | OpenAI ecosystem | Cross-provider, complex graphs |

A Real Multi-Step Example

Task: nightly pipeline that reads overnight support tickets, categorises them, drafts replies, flags escalations.

from claude_agent_sdk import ApprovalRequired, ClaudeAgent, MemoryTool, RetryPolicy

agent = ClaudeAgent(
    model="claude-opus-4-7",
    tools=[
        read_tickets_tool,       # MCP server to Zendesk
        categorise_ticket_tool,  # internal classifier
        draft_reply_tool,        # Claude-native
        MemoryTool(),
    ],
    retry_policy=RetryPolicy(max_attempts=3, backoff="exponential"),
    approval_policy={
        "send_reply": ApprovalRequired.ALWAYS,
        "escalate": ApprovalRequired.ALWAYS,
    },
)

result = await agent.run(
    task="""Process last night's tickets:
    1. Read new tickets from Zendesk (after 2026-04-18 22:00 IST).
    2. Categorise each as Bug / Feature / Support / Spam.
    3. Draft a reply for Support (<=200 words).
    4. Flag any ticket with 'refund', 'legal', or 'data breach' 
       for human review before any action.
    5. Save a daily summary to memory under key 'tickets/2026-04-19'.
    """,
    max_steps=200,
    checkpoint_every=10,
)

This run touches three domain tools plus memory, 40-80 tickets, and up to 200 steps. It persists memory, pauses for human approval on escalations, and checkpoints every 10 steps. If the process crashes at step 150, it resumes from the step-140 checkpoint. If a tool call fails, it is tried up to 3 times with exponential backoff. If a ticket mentions 'refund', 'legal', or 'data breach', the agent pauses and waits for a human.
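The resume arithmetic (crash at step 150, restart from 140) falls out of one rule: persist state every N completed steps and, on start-up, continue after the last saved step. A framework-agnostic sketch that checkpoints to a JSON file; run_with_checkpoints, the state layout, and do_step are hypothetical. A real system would write to a database, and because steps after the last checkpoint run again, side-effecting tools need to be idempotent.

```python
import json
import os
from typing import Callable

def run_with_checkpoints(
    path: str,
    total_steps: int,
    do_step: Callable[[int, dict], None],
    checkpoint_every: int = 10,
) -> dict:
    """Run steps 1..total_steps, saving state every checkpoint_every steps."""
    state = {"last_step": 0, "data": {}}  # data must stay JSON-serializable
    if os.path.exists(path):  # resume after a crash
        with open(path) as f:
            state = json.load(f)
    for step in range(state["last_step"] + 1, total_steps + 1):
        do_step(step, state["data"])  # may raise; work since the last checkpoint is redone
        if step % checkpoint_every == 0 or step == total_steps:
            state["last_step"] = step
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, path)  # atomic rename: never leaves a half-written checkpoint
    return state
```

With checkpoint_every=10, a failure during step 150 leaves last_step at 140, so the restarted process repeats steps 141 to 149 and carries on.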

That is a production agent. The five non-negotiables above are the plumbing that makes it one.

What NOT to Build

Agents for tasks a script can do. If the task is deterministic, write a script. Agents are for ambiguous, variable-shape work.
Agents that call themselves recursively without a step budget. Set max_steps. Infinite loops are a real failure mode.
Agents that touch production without approvals. Dev/staging first, always.
Agents without cost caps. Put a daily token budget in place; a misbehaving agent can burn $500 overnight.
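Both the step budget and the cost cap are counters checked on every loop iteration. A minimal sketch; the Budget class, its limits, and passing in a per-step cost are assumptions (SDKs report usage differently), and a true daily cap would persist spent_usd across runs, for example in the same durable store as your memory.

```python
from dataclasses import dataclass

class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits its step or cost limit."""

@dataclass
class Budget:
    max_steps: int
    max_cost_usd: float
    steps: int = 0
    spent_usd: float = 0.0

    def start_step(self) -> None:
        """Call before each step; refuses to start once either limit is reached."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step budget of {self.max_steps} reached")
        if self.spent_usd >= self.max_cost_usd:
            raise BudgetExceeded(f"cost cap of ${self.max_cost_usd:.2f} reached")
        self.steps += 1

    def record_cost(self, usd: float) -> None:
        """Call after each step with what that step cost."""
        self.spent_usd += usd

budget = Budget(max_steps=200, max_cost_usd=5.00)
steps_run = 0
try:
    while True:  # simulates an agent that never decides it is done
        budget.start_step()
        steps_run += 1
        budget.record_cost(0.25)  # hypothetical per-step cost
except BudgetExceeded as exc:
    print(f"stopped after {steps_run} steps: {exc}")
    # prints: stopped after 20 steps: cost cap of $5.00 reached
```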

Where to Go Next

MCP Servers Tutorial — expose your tools to every agent framework
Claude Code Skills & Superpowers — orchestration patterns in Claude Code
Cursor IDE Tutorial India — the IDE most agents are built from
GitHub Copilot Free Setup — for quick prototyping inside VS Code
AI-first workflow 2026 — where agents fit into the daily dev loop
Build with AI APIs — direct API usage if you want full control
