Agentic Dev: Building Production Multi-Step Agents 2026
Claude Agent SDK, OpenAI Assistants, LangGraph — memory, retries, tool schemas, observability
Last updated: April 19, 2026
Most agent tutorials stop at "here is a ReAct loop in 30 lines." Production agents are different — they need to survive restarts, handle flaky tools, hold state across hours, and fail safely when something goes wrong. This guide walks through the five non-negotiables for production multi-step agents, with runnable code in the Claude Agent SDK, OpenAI Assistants, and LangGraph.
If you are new to agents, read AI agents tutorial 2026 and agentic AI workflows first. This guide assumes you have built a toy agent and hit the limits.
Key Takeaways
- Production agents need five things: durable memory, retries with backoff, typed tool schemas, human-in-the-loop, and observability.
- Claude Agent SDK provides managed infrastructure (sandboxes, state management, checkpointing) out of the box.
- OpenAI Assistants is tightest for file search and code interpreter, weakest for custom orchestration.
- LangGraph gives you graph control across providers; you write more code but own the runtime.
- MCP is the tool layer, not the orchestration layer — use it inside any of the three.
The Five Non-Negotiables
+----------------------+
| 1. Durable memory | survives restarts, outlives context window
+----------------------+
| 2. Retries + backoff | flaky tools, transient 429s, network blips
+----------------------+
| 3. Typed tool schemas| strong contracts, validated inputs/outputs
+----------------------+
| 4. Human-in-the-loop | pause for approval on irreversible actions
+----------------------+
| 5. Observability | traces, metrics, replay — not just logs
+----------------------+
Skip any one of these and you have a demo, not a system. Let's take them in order.
1. Durable Memory
The mistake: treating the LLM context window as memory. The fix: treat the context window as working memory, and push long-lived state to durable storage.
The Claude Agent SDK memory tool is the simplest path. The agent writes notes and reads them back on demand:
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "memory_20260401", "name": "memory"}],
    messages=[
        {"role": "user", "content": "Start work on migration. "
                                    "Remember: we use PostgreSQL 16, "
                                    "EF Core 9, soft delete flag is 'IsDeleted'."}
    ],
)
# The agent stores the constraints in memory and retrieves them
# in subsequent turns without re-reading the full history.
OpenAI Assistants uses thread-scoped memory — every message is persisted to a thread_id automatically:
from openai import OpenAI

client = OpenAI()
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="We use PostgreSQL 16 and soft delete via IsDeleted.",
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id="asst_abc123",
)
# Thread persists until you delete it.
LangGraph gives you checkpointers — a pluggable state backend (SQLite, Postgres, Redis):
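from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# AgentState and the graph's nodes are assumed to be defined elsewhere.
builder = StateGraph(AgentState)
# from_conn_string is a context manager that opens and closes the connection.
with PostgresSaver.from_conn_string(
    "postgres://localhost/agent_state"
) as checkpointer:
    checkpointer.setup()  # create the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)
    # Every step checkpoints state to Postgres.
    # Restart the process and resume from the last checkpoint.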
Rule of thumb: if your agent runs longer than a single HTTP request, you need durable memory. Thread-scoped works for chat; checkpointer or memory tool is needed for workflows.
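To make the working-memory versus durable-storage split concrete, here is a minimal, framework-agnostic sketch using only the standard library. The class name `SqliteCheckpointStore`, the table layout, and the `run_id`/`step` keys are assumptions made for illustration; none of the three frameworks exposes this exact interface.

```python
import json
import sqlite3


class SqliteCheckpointStore:
    """Hypothetical minimal store: one JSON state blob per (run_id, step)."""

    def __init__(self, path=":memory:"):
        # Use a file path in practice so state survives a process restart.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "run_id TEXT NOT NULL, step INTEGER NOT NULL, state TEXT NOT NULL, "
            "PRIMARY KEY (run_id, step))"
        )

    def save(self, run_id, step, state):
        # Commit per checkpoint so a crash loses at most the current step.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (run_id, step, json.dumps(state)),
            )

    def latest(self, run_id):
        # Return (step, state) for the newest checkpoint, or None if absent.
        row = self.conn.execute(
            "SELECT step, state FROM checkpoints WHERE run_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (run_id,),
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))


store = SqliteCheckpointStore()
store.save("run-1", 1, {"notes": ["PostgreSQL 16"]})
store.save("run-1", 2, {"notes": ["PostgreSQL 16", "soft delete via IsDeleted"]})
step, state = store.latest("run-1")
```

On restart, the agent loop would call `latest(run_id)` and rebuild its prompt from the stored notes instead of replaying the whole conversation.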
2. Retries with Exponential Backoff
Tools fail. APIs rate-limit. Networks blip. Your agent loop must retry.
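Before looking at what each framework offers, it may help to see the underlying technique on its own. The sketch below is a standard-library version of capped exponential backoff with full jitter; the function name `retry_with_backoff`, its parameters, and the choice of retryable exception types are illustrative assumptions, not part of any SDK.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=3, initial_delay=1.0, max_delay=30.0,
                       retryable=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call fn(); on a retryable error wait and try again, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Delay ceiling doubles each attempt (1s, 2s, 4s, ...), capped.
            ceiling = min(max_delay, initial_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time up to the ceiling so many
            # clients retrying at once do not hit the API in lockstep.
            sleep(random.uniform(0, ceiling))


# Demo: a tool that fails twice with a transient error, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
```

Only errors listed in `retryable` are retried; anything else (for example a validation error) propagates immediately, which is usually what you want for non-transient failures.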
Claude Agent SDK has built-in retry for sub-agents:
from claude_agent_sdk import ClaudeAgent, RetryPolicy

agent = ClaudeAgent(
    model="claude-opus-4-7",
    retry_policy=RetryPolicy(
        max_attempts=3,
        backoff="exponential",
        initial_delay_seconds=1.0,
        max_delay_seconds=30.0,
        retryable_errors=["rate_limit", "timeout", "tool_error"],
    ),
)
result = await agent.run(
    task="Generate and deploy the migration",
    max_steps=40,
    checkpoint_every=5,  # checkpoint after every 5 steps
)
OpenAI Assistants retries at the SDK level. Configure it on the client:
from openai import OpenAI

client = OpenAI(
    max_retries=3,
    timeout=30.0,
)
# These settings cover HTTP calls to the API only. Tool call failures
# inside a run need explicit handling: wrap your tool in a retry helper
# before calling client.beta.threads.runs.submit_tool_outputs.
LangGraph — you own retries. The common pattern:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=30),
    reraise=True,  # raise the original exception after the last attempt
)
def call_flaky_tool(args):
    # external_api is a placeholder for your own tool client.
    return external_api.fetch(args)
A production agent that makes 40 tool calls with no retries has a ~33% chance of hitting at least one failure on a 99%-reliable tool (1 - 0.99^40 ≈ 0.33). Add retries.
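The compounding behind that estimate is easy to check. The helper below is a small sketch that assumes tool failures are independent and that each retry succeeds with the same probability as the first attempt; both are simplifications, and the function name is made up for this example.

```python
def run_failure_probability(calls, success_rate, attempts=1):
    """Chance that at least one of `calls` tool calls fails for good.

    A call fails for good only if all `attempts` tries fail, so the
    per-call failure rate is (1 - success_rate) ** attempts.
    """
    per_call_failure = (1 - success_rate) ** attempts
    return 1 - (1 - per_call_failure) ** calls


no_retry = run_failure_probability(calls=40, success_rate=0.99)
with_retry = run_failure_probability(calls=40, success_rate=0.99, attempts=3)
print(f"no retries: {no_retry:.1%}, three attempts: {with_retry:.4%}")
```

With three attempts the per-call failure rate drops from 0.01 to 0.01^3, which is why even a small retry budget changes the outcome for long runs.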
3. Typed Tool Schemas
Agents that pass raw strings to tools fail unpredictably. Tools should declare typed inputs and validate them.
Claude Agent SDK / Anthropic tool use format:
tools = [
    {
        "name": "create_ticket",
        "description": "Create a Jira ticket in the given project.",
        "input_schema": {
            "type": "object",
            "properties": {
                "project_key": {"type": "string", "pattern": "^[A-Z]{2,10}$"},
                "title": {"type": "string", "minLength": 5, "maxLength": 200},
                "priority": {
                    "type": "string",
                    "enum": ["Low", "Medium", "High", "Critical"],
                },
            },
            "required": ["project_key", "title"],
        },
    }
]
OpenAI Assistants uses the same JSON Schema format under tools[].function.parameters.
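Because both formats wrap the same JSON Schema, one definition can feed both. The converter below is a sketch: `to_openai_tool` is a hypothetical helper name, and the example tool is a trimmed copy of the `create_ticket` definition above.

```python
def to_openai_tool(anthropic_tool):
    """Re-wrap an Anthropic-style tool definition in OpenAI's function format."""
    return {
        "type": "function",
        "function": {
            "name": anthropic_tool["name"],
            "description": anthropic_tool.get("description", ""),
            # The JSON Schema itself is reused unchanged.
            "parameters": anthropic_tool["input_schema"],
        },
    }


create_ticket = {
    "name": "create_ticket",
    "description": "Create a Jira ticket in the given project.",
    "input_schema": {
        "type": "object",
        "properties": {
            "project_key": {"type": "string", "pattern": "^[A-Z]{2,10}$"},
            "title": {"type": "string", "minLength": 5, "maxLength": 200},
        },
        "required": ["project_key", "title"],
    },
}

openai_tool = to_openai_tool(create_ticket)
```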
MCP servers — the standard way to expose tools across providers — let you publish tool schemas once and consume them from any client. See MCP Servers Tutorial for the full pattern and What is MCP for the conceptual overview.
The discipline: write your schema once, validate every input with a library like pydantic or zod before the tool executes, and return typed errors back to the agent so it can self-correct.
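As a rough illustration of that discipline, the sketch below validates `create_ticket` arguments and returns structured errors instead of raising. It uses only the standard library as a stand-in for pydantic or zod, and the names `ToolInputError` and `validate_create_ticket` are assumptions for this example.

```python
import re
from dataclasses import dataclass

PRIORITIES = ("Low", "Medium", "High", "Critical")


@dataclass
class ToolInputError:
    """A typed error the agent can read and act on."""
    field: str
    message: str


def validate_create_ticket(args):
    """Return a list of ToolInputError; an empty list means the input is valid."""
    errors = []
    project_key = args.get("project_key")
    if not isinstance(project_key, str) or not re.fullmatch(r"[A-Z]{2,10}", project_key):
        errors.append(ToolInputError("project_key", "must be 2-10 uppercase letters"))
    title = args.get("title")
    if not isinstance(title, str) or not 5 <= len(title) <= 200:
        errors.append(ToolInputError("title", "must be a string of 5-200 characters"))
    priority = args.get("priority")  # optional field
    if priority is not None and priority not in PRIORITIES:
        errors.append(ToolInputError("priority", f"must be one of {PRIORITIES}"))
    return errors


good = validate_create_ticket({"project_key": "OPS", "title": "Login page returns 500"})
bad = validate_create_ticket({"project_key": "ops", "title": "Hi", "priority": "Urgent"})
```

Returning the error list as the tool result, rather than crashing the step, gives the model a concrete field and message to correct on its next attempt.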
4. Human-in-the-Loop
Any irreversible action should pause for human approval. The pattern in practice:
from claude_agent_sdk import ClaudeAgent, ApprovalRequired

agent = ClaudeAgent(
    model="claude-opus-4-7",
    approval_policy={
        "send_email": ApprovalRequired.ALWAYS,
        "delete_record": ApprovalRequired.ALWAYS,
        "run_migration": ApprovalRequired.ALWAYS,
        "read_file": ApprovalRequired.NEVER,
        "write_code": ApprovalRequired.NEVER,
    },
)
# When the agent attempts an approval-required tool, the SDK
# emits an approval_request event. Your app routes this to
# Slack, email, or a dashboard. The agent waits.
async for event in agent.run(task="Clean up stale accounts"):
    if event.type == "approval_request":
        decision = await slack_approval(event)
        await agent.resolve_approval(event.id, approved=decision)
LangGraph has first-class interrupt support via interrupt_before and interrupt_after on nodes:
graph = StateGraph(AgentState).compile(
    checkpointer=checkpointer,  # interrupts need a checkpointer to hold the paused state
    interrupt_before=["send_email", "run_migration"],
)
# The graph pauses before these nodes; your app resumes it
# after a human approves.
Rule of thumb: if undoing the action takes more than 10 minutes, require approval.
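Stripped of any framework, the approval pattern is a policy lookup in front of tool execution. The sketch below is a minimal synchronous version; `Approval`, `POLICY`, `execute_tool`, and the `approver` callback are hypothetical names, and a real system would route the approval request to Slack or a dashboard and wait asynchronously.

```python
from enum import Enum


class Approval(Enum):
    ALWAYS = "always"
    NEVER = "never"


POLICY = {
    "send_email": Approval.ALWAYS,
    "delete_record": Approval.ALWAYS,
    "read_file": Approval.NEVER,
}


def execute_tool(name, args, tools, approver, policy=POLICY):
    """Run a tool, asking the approver first when the policy requires it."""
    # Tools missing from the policy default to ALWAYS, so the gate fails closed.
    if policy.get(name, Approval.ALWAYS) is Approval.ALWAYS and not approver(name, args):
        return {"status": "rejected", "tool": name}
    return {"status": "ok", "tool": name, "result": tools[name](**args)}


sent = []

def send_email(to, body):
    sent.append((to, body))
    return "sent"

def read_file(path):
    return f"contents of {path}"

TOOLS = {"send_email": send_email, "read_file": read_file}
deny_all = lambda name, args: False

read_result = execute_tool("read_file", {"path": "notes.txt"}, TOOLS, deny_all)
email_result = execute_tool("send_email", {"to": "a@example.com", "body": "hi"}, TOOLS, deny_all)
```

Defaulting unknown tools to `ALWAYS` is a deliberate choice here: a newly added tool cannot perform an irreversible action until someone classifies it.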
5. Observability
Logs are not enough. You need traces, metrics, and the ability to replay a failed run.
Minimum production setup:
- Structured logs with run_id, step_id, tool_name, input_size, output_size, duration_ms, cost_usd per step.
- Distributed traces: one span per step, parent span per run. OpenTelemetry integrates with every SDK.
- Metrics dashboard: p50/p95 step duration, success rate by tool, token usage per run.
- Replay store: save the full message log so you can rerun a failed agent deterministically.
Example OpenTelemetry instrumentation:
from opentelemetry import trace

tracer = trace.get_tracer("agent")

async def run_step(step):
    # One span per step; attributes make the span searchable later.
    with tracer.start_as_current_span("agent.step") as span:
        span.set_attribute("step.tool", step.tool_name)
        span.set_attribute("step.input_tokens", step.input_tokens)
        result = await step.execute()
        span.set_attribute("step.output_tokens", result.output_tokens)
        span.set_attribute("step.cost_usd", result.cost_usd)
        return result
Without this, debugging a failed long-running agent is guesswork.
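The structured-log item from the minimum setup can be sketched with the standard library alone. The wrapper name `run_logged_step` and the `emit` parameter are assumptions for this example; `cost_usd` is left out because it depends on provider pricing that is not modelled here.

```python
import json
import time


def run_logged_step(run_id, step_id, tool_name, tool_fn, tool_input, emit=print):
    """Run one tool call and emit a single JSON log line describing it."""
    start = time.perf_counter()
    status, output = "error", None
    try:
        output = tool_fn(tool_input)
        status = "ok"
        return output
    finally:
        # Emitted on success and on failure, so failed steps are visible too.
        emit(json.dumps({
            "run_id": run_id,
            "step_id": step_id,
            "tool_name": tool_name,
            "status": status,
            "input_size": len(json.dumps(tool_input, default=str)),
            "output_size": 0 if output is None else len(json.dumps(output, default=str)),
            "duration_ms": round((time.perf_counter() - start) * 1000, 3),
        }))


lines = []  # stand-in for a log sink
result = run_logged_step(
    "run-42", 1, "word_count",
    lambda text: {"words": len(text.split())},
    "categorise overnight tickets",
    emit=lines.append,
)
record = json.loads(lines[0])
```

One JSON object per step keeps the logs queryable by `run_id` and `tool_name` without any parsing rules.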
Side-by-Side Comparison
| Feature | Claude Agent SDK | OpenAI Assistants | LangGraph |
|---------|-----------------|-------------------|-----------|
| Managed runtime | Yes | Yes | No (self-hosted) |
| Multi-provider | No (Claude only) | No (OpenAI only) | Yes |
| Memory tool | Native | Thread-scoped | Checkpointer |
| Retries built-in | Yes (exp. backoff) | Client-level | DIY (tenacity) |
| Human-in-the-loop | Native approval events | Manual via run states | Native interrupts |
| Graph control | Linear pipelines | Linear runs | Full graph (cycles allowed) |
| Observability | OTel integration | Logs API | DIY with LangSmith |
| Cost control | Effort levels | Token limits | Per-node limits |
| Sandboxed tools | Yes | Yes (code interp.) | No |
| Best for | Claude-native prod | OpenAI ecosystem | Cross-provider, complex graphs |
A Real Multi-Step Example
Task: nightly pipeline that reads overnight support tickets, categorises them, drafts replies, flags escalations.
from claude_agent_sdk import ApprovalRequired, ClaudeAgent, MemoryTool, RetryPolicy

agent = ClaudeAgent(
    model="claude-opus-4-7",
    tools=[
        read_tickets_tool,       # MCP server to Zendesk
        categorise_ticket_tool,  # internal classifier
        draft_reply_tool,        # Claude-native
        MemoryTool(),
    ],
    retry_policy=RetryPolicy(max_attempts=3, backoff="exponential"),
    # Same policy values as in the human-in-the-loop section.
    approval_policy={
        "send_reply": ApprovalRequired.ALWAYS,
        "escalate": ApprovalRequired.ALWAYS,
    },
)
result = await agent.run(
    task="""Process last night's tickets:
1. Read new tickets from Zendesk (after 2026-04-18 22:00 IST).
2. Categorise each as Bug / Feature / Support / Spam.
3. Draft a reply for Support (<=200 words).
4. Flag any ticket with 'refund', 'legal', or 'data breach'
for human review before any action.
5. Save a daily summary to memory under key 'tickets/2026-04-19'.
""",
    max_steps=200,
    checkpoint_every=10,
)
This run uses four tools, handles 40-80 tickets in up to 200 steps, persists memory, pauses for human approval on replies and escalations, and checkpoints every 10 steps. If the process crashes at step 147, it resumes from the checkpoint at step 140. If a tool call fails, it is tried up to 3 times with backoff. If a ticket mentions "refund," the run pauses and waits for a human.
That is a production agent. Everything above it is plumbing.
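The escalation rule in step 4 of the task is deterministic, so one option is to enforce it in plain code before the agent acts rather than relying on the prompt alone. The function below is a sketch under that assumption; `needs_human_review` is a hypothetical helper, and the matching rule (case-insensitive, word-start match) is a choice made for this example.

```python
import re

# Terms from step 4 of the task. \b anchors the start of a word so that
# "refunds" and "refunded" match; \s+ tolerates extra whitespace.
ESCALATION_PATTERN = re.compile(r"\b(refund|legal|data\s+breach)", re.IGNORECASE)


def needs_human_review(ticket_text):
    """Return True when a ticket mentions any escalation term."""
    return ESCALATION_PATTERN.search(ticket_text) is not None


tickets = [
    "Please REFUND my last order",
    "I think there was a data  breach on my account",
    "How do I reset my password?",
]
flagged = [t for t in tickets if needs_human_review(t)]
```

Running a check like this ahead of the model means a missed instruction in the prompt cannot cause a sensitive ticket to skip review.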
What NOT to Build
- Agents for tasks a script can do. If the task is deterministic, write a script. Agents are for ambiguous, variable-shape work.
- Agents that call themselves recursively without a step budget. Set max_steps. Infinite loops are a real failure mode.
- Agents that touch production without approvals. Dev/staging first, always.
- Agents without cost caps. Put a daily token budget in place; a misbehaving agent can burn $500 overnight.
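The step-budget and cost-cap items above can share one guard. The sketch below is a minimal version in which `RunBudget` and `BudgetExceeded` are hypothetical names and the limits are arbitrary; a real deployment would also persist the counters so a restart does not reset the budget.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run goes past its step or token budget."""


class RunBudget:
    def __init__(self, max_steps, max_tokens):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens):
        """Record one step and its token usage; raise once a limit is passed."""
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step budget of {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget of {self.max_tokens} exceeded")


budget = RunBudget(max_steps=3, max_tokens=1000)
stopped_reason = None
try:
    while True:  # stand-in for an agent loop that never decides to stop
        budget.charge(tokens=200)
except BudgetExceeded as exc:
    stopped_reason = str(exc)
```

Calling `charge` at the top of every loop iteration turns a runaway loop into a bounded, logged failure.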
Where to Go Next
- MCP Servers Tutorial — expose your tools to every agent framework
- Claude Code Skills & Superpowers — orchestration patterns in Claude Code
- Cursor IDE Tutorial India — the IDE most agents are built from
- GitHub Copilot Free Setup — for quick prototyping inside VS Code
- AI-first workflow 2026 — where agents fit into the daily dev loop
- Build with AI APIs — direct API usage if you want full control