The term "AI agent" has been diluted by marketing to the point where it describes everything from a chatbot with a system prompt to a fully autonomous multi-step reasoning system. For this discussion, we use a specific definition: an AI agent is a system that takes a goal, formulates a plan, executes actions using tools, observes results, and iterates until the goal is achieved or it determines the goal cannot be achieved.
This is distinct from a chain, where the sequence of steps is predefined. An agent decides its own control flow at runtime.
Building agents that work reliably in enterprise environments — where failures have real consequences, latency matters, and costs must be controlled — requires more engineering rigor than most tutorials suggest. This post covers the architecture, tooling, and operational concerns for production multi-agent systems using LangGraph.
Agents vs Chains: The Distinction That Matters
A chain is a fixed pipeline: input → step 1 → step 2 → step 3 → output. The steps are determined at development time. A RAG pipeline is a chain. A summarization pipeline is a chain.
An agent adds three capabilities:
- ✓Tool use: The ability to call external functions — APIs, databases, file systems, calculators — and incorporate the results.
- ✓Planning: The ability to decide which tools to call and in what order, based on the current goal and observations.
- ✓State management: The ability to maintain context across multiple steps, remember what has been tried, and track progress toward the goal.
These capabilities introduce non-determinism. The same input may produce different execution paths depending on tool results, model reasoning, and intermediate state. This makes agents powerful but harder to test, debug, and operate.
LangGraph: State Machines for AI Agents
LangGraph models agent workflows as directed graphs with typed state. Each node in the graph is a function that reads the current state, performs some computation (which may include an LLM call or tool invocation), and returns an updated state. Edges define transitions between nodes, and conditional edges allow the agent to branch based on state.
This is a meaningful improvement over the "loop until done" pattern used by basic ReAct agents. With LangGraph, the control flow is explicit and inspectable.
Core Concepts
State: A typed dictionary (usually a TypedDict or Pydantic model) that holds all information the agent needs across steps.
from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
import operator
class AgentState(TypedDict):
messages: Annotated[Sequence[BaseMessage], operator.add]
current_step: str
tool_results: dict
error_count: int
final_output: str | None
Nodes: Functions that take the state as input and return partial state updates.
def parse_intent(state: AgentState) -> dict:
# LLM call to classify user intent
messages = state["messages"]
response = llm.invoke(intent_prompt.format(messages=messages))
return {"current_step": response.content, "tool_results": {}}
Edges: Define the graph structure. Conditional edges allow branching.
from langgraph.graph import StateGraph, END
graph = StateGraph(AgentState)
graph.add_node("parse_intent", parse_intent)
graph.add_node("lookup_vendor", lookup_vendor)
graph.add_node("validate_budget", validate_budget)
graph.add_conditional_edges(
"parse_intent",
route_after_intent, # function that returns next node name
{
"vendor_lookup": "lookup_vendor",
"budget_check": "validate_budget",
"unknown": END,
}
)
Agent Architectures
ReAct (Reasoning + Acting)
The agent alternates between reasoning (thinking about what to do) and acting (calling a tool). After each tool result, the agent reasons again about what to do next.
ReAct is simple and works well for straightforward tool-use scenarios. Its weakness is that it can get stuck in loops — repeatedly trying the same failing action — and it does not plan ahead.
Plan-and-Execute
The agent first generates a complete plan (a list of steps), then executes each step sequentially. After execution, it can replan if results deviate from expectations.
This works better for complex tasks where the order of operations matters. The downside is that the initial plan may be wrong, and replanning is expensive (requires a full LLM call).
Reflexion
The agent executes a task, evaluates its own output, identifies what went wrong, and tries again with the self-critique as additional context. This is useful for tasks where quality is hard to get right on the first attempt — code generation, for example.
Multi-Agent Collaboration
Multiple specialized agents work together on a task. Each agent has a defined role and set of tools. A coordinator routes work between them.
This is the pattern that matters most for enterprise systems, and the one we will focus on.
Multi-Agent Patterns
Supervisor-Worker
A supervisor agent receives the user's request, breaks it into sub-tasks, delegates each sub-task to a specialized worker agent, collects results, and synthesizes a final response.
User → Supervisor → [Worker A, Worker B, Worker C] → Supervisor → Response
The supervisor handles planning and coordination. Workers handle execution. This is the most common pattern for enterprise applications because it maps naturally to organizational structures and keeps each agent's scope narrow.
Peer-to-Peer
Agents communicate directly with each other without a central coordinator. Each agent can invoke other agents as tools.
This is more flexible but harder to debug. Use it when the workflow is not hierarchical — for example, a negotiation between a buyer agent and a seller agent.
Hierarchical Delegation
Like supervisor-worker, but with multiple levels. A top-level agent delegates to mid-level coordinators, which delegate to workers. This is necessary when the task is too complex for a single supervisor to decompose.
Tool Design
Tools are the interface between the agent and the outside world. Poorly designed tools are the number one cause of agent failure.
Principles
- ✓Atomic: Each tool does one thing. "search_vendor_catalog" not "search_and_compare_vendors".
- ✓Deterministic: Same input produces same output (when possible). The agent should be able to predict what a tool will do.
- ✓Well-documented: The tool description is part of the agent's prompt. A vague description leads to incorrect tool selection.
- ✓Error-explicit: Tools should return structured errors, not raise exceptions. The agent needs to reason about failures.
from langchain_core.tools import tool
from pydantic import BaseModel, Field
class VendorSearchResult(BaseModel):
vendors: list[dict] = Field(description="List of matching vendors")
total_count: int = Field(description="Total number of matches")
error: str | None = Field(default=None, description="Error message if search failed")
@tool
def search_vendor_catalog(
query: str,
category: str | None = None,
max_results: int = 10,
) -> VendorSearchResult:
"""Search the vendor catalog for products matching the query.
Args:
query: Product description or name to search for.
category: Optional category filter (e.g., 'office_supplies', 'it_equipment').
max_results: Maximum number of results to return (default 10).
Returns:
VendorSearchResult with matching vendors and their products.
Check the error field — if non-null, the search failed and vendors list is empty.
"""
try:
results = vendor_db.search(query, category=category, limit=max_results)
return VendorSearchResult(vendors=results, total_count=len(results))
except VendorDBError as e:
return VendorSearchResult(vendors=[], total_count=0, error=str(e))
Tool Count
More tools means more decisions for the agent, which means more opportunities for errors. Keep the tool set minimal — under 10 tools per agent. If an agent needs more, consider splitting it into multiple agents.
State Management
Checkpointing
LangGraph supports automatic checkpointing at each node transition. This means you can resume a failed agent from the last successful step rather than restarting from scratch.
from langgraph.checkpoint.sqlite import SqliteSaver
memory = SqliteSaver.from_conn_string(":memory:")
app = graph.compile(checkpointer=memory)
# Invoke with thread_id for persistent state
config = {"configurable": {"thread_id": "procurement-123"}}
result = app.invoke(initial_state, config)
Time-Travel Debugging
With checkpoints, you can inspect the state at any point in the agent's execution history. This is invaluable for debugging:
# Get all checkpoints for a thread
history = list(app.get_state_history(config))
for state in history:
print(f"Step: {state.values['current_step']}")
print(f"Messages: {len(state.values['messages'])}")
print(f"Next: {state.next}")
print("---")
Human-in-the-Loop
For high-stakes decisions (approving a purchase order, escalating a support ticket), the agent should pause and wait for human confirmation. LangGraph supports this with interrupt nodes:
from langgraph.graph import StateGraph
graph = StateGraph(AgentState)
# ... add nodes ...
# The agent will pause before executing the approval node
app = graph.compile(
checkpointer=memory,
interrupt_before=["execute_approval"],
)
# Agent runs until it hits the interrupt
result = app.invoke(initial_state, config)
# Human reviews the state, then resumes
app.invoke(None, config) # continues from interrupt point
Memory
Short-Term Memory
The conversation history within a single session. In LangGraph, this is typically the messages field in the state. Keep it bounded — summarize or truncate old messages to stay within context limits.
Long-Term Memory
Information that persists across sessions. Two common approaches:
- ✓Vector store: Embed and store important facts, decisions, and user preferences. Retrieve relevant memories at the start of each session.
- ✓Structured summaries: After each session, generate a summary of key decisions and outcomes. Store these in a database.
def update_long_term_memory(state: AgentState, memory_store):
summary = llm.invoke(
f"Summarize the key decisions and outcomes from this interaction: "
f"{state['messages']}"
)
memory_store.add(
text=summary.content,
metadata={
"user_id": state.get("user_id"),
"timestamp": datetime.now().isoformat(),
"topic": state.get("current_step"),
}
)
Error Handling
Agents fail. Tools return errors, LLM calls time out, APIs rate-limit, and the model sometimes produces unparseable output. Production agents need robust error handling at every level.
Retry with Backoff
For transient failures (API timeouts, rate limits):
def resilient_node(state: AgentState) -> dict:
max_retries = 3
for attempt in range(max_retries):
try:
result = call_external_api(state["query"])
return {"tool_results": {"api": result}, "error_count": 0}
except (TimeoutError, RateLimitError) as e:
if attempt == max_retries - 1:
return {
"tool_results": {"api_error": str(e)},
"error_count": state["error_count"] + 1,
}
time.sleep(2 ** attempt)
Fallback Strategies
When a tool fails permanently:
- ✓Use an alternative tool: If the vendor catalog API is down, fall back to a cached version.
- ✓Ask the user: Present what the agent knows and ask the user to provide the missing information.
- ✓Graceful degradation: Complete the parts of the task that do not depend on the failed tool. Flag the incomplete parts.
Maximum Iteration Limits
Always set a hard limit on the number of steps an agent can take. An agent stuck in a loop will burn through API credits indefinitely.
def should_continue(state: AgentState) -> str:
if state["error_count"] >= 3:
return "error_exit"
if len(state["messages"]) > 50: # hard limit on iterations
return "max_iterations_exit"
if state["final_output"] is not None:
return "complete"
return "continue"
Observability
You cannot debug what you cannot see. Agent systems need observability at multiple levels.
LangSmith Tracing
LangSmith provides end-to-end tracing for LangChain and LangGraph applications. Every LLM call, tool invocation, and state transition is recorded with inputs, outputs, and latency.
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
os.environ["LANGCHAIN_PROJECT"] = "procurement-agent"
Custom Metrics
Track beyond what the tracing platform provides:
- ✓Token usage per node: Identify which nodes are expensive.
- ✓Tool success rate: Which tools fail most often?
- ✓Latency per node: Where is time being spent?
- ✓Completion rate: What percentage of requests complete successfully?
import time
from dataclasses import dataclass
@dataclass
class NodeMetrics:
node_name: str
latency_ms: float
tokens_used: int
tool_calls: int
success: bool
def instrumented_node(node_fn):
def wrapper(state: AgentState) -> dict:
start = time.perf_counter()
try:
result = node_fn(state)
metrics = NodeMetrics(
node_name=node_fn.__name__,
latency_ms=(time.perf_counter() - start) * 1000,
tokens_used=result.get("_tokens_used", 0),
tool_calls=result.get("_tool_calls", 0),
success=True,
)
except Exception as e:
metrics = NodeMetrics(
node_name=node_fn.__name__,
latency_ms=(time.perf_counter() - start) * 1000,
tokens_used=0, tool_calls=0, success=False,
)
raise
finally:
emit_metrics(metrics) # send to your metrics backend
return wrapper
Security
Sandboxing Tool Execution
If tools execute user-provided code or interact with external systems, they must be sandboxed. Run tool functions in containers or restricted subprocesses. Never let an agent execute arbitrary shell commands in the host environment.
Input Validation
Validate tool inputs before execution. An agent might produce malformed inputs — SQL injection in a database query tool, path traversal in a file access tool.
@tool
def query_purchase_orders(
department: str,
date_from: str,
date_to: str,
) -> list[dict]:
"""Query purchase orders by department and date range."""
# Input validation
allowed_departments = {"engineering", "marketing", "operations", "hr"}
if department.lower() not in allowed_departments:
return {"error": f"Invalid department. Allowed: {allowed_departments}"}
try:
datetime.fromisoformat(date_from)
datetime.fromisoformat(date_to)
except ValueError:
return {"error": "Invalid date format. Use ISO 8601 (YYYY-MM-DD)."}
# Use parameterized queries — never f-strings for SQL
return db.execute(
"SELECT * FROM purchase_orders WHERE dept = %s AND created BETWEEN %s AND %s",
(department, date_from, date_to),
)
Prompt Injection Defense
Agents that process user input are vulnerable to prompt injection — a user might craft input that overrides the system prompt. Defenses:
- ✓Separate user input from instructions: Use clear delimiters and instruct the model to treat user input as data.
- ✓Input sanitization: Strip or escape special characters that might be interpreted as prompt instructions.
- ✓Output validation: Check that the agent's actions are within its allowed scope before executing them.
Cost Control
Agent Budgets
Set a maximum token budget per agent execution. Track cumulative token usage across all LLM calls and terminate if the budget is exceeded.
class BudgetTracker:
def __init__(self, max_tokens: int = 50_000):
self.max_tokens = max_tokens
self.used_tokens = 0
def track(self, usage):
self.used_tokens += usage.total_tokens
if self.used_tokens > self.max_tokens:
raise BudgetExceededError(
f"Used {self.used_tokens} tokens, budget is {self.max_tokens}"
)
Model Routing
Not every LLM call in an agent needs the most capable model. Use a cheap, fast model (GPT-4o-mini, Claude Haiku) for planning, classification, and simple extraction. Reserve the expensive model (GPT-4o, Claude Sonnet) for complex reasoning and final answer generation.
from langchain_openai import ChatOpenAI
planner_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
executor_llm = ChatOpenAI(model="gpt-4o", temperature=0)
def parse_intent(state: AgentState) -> dict:
# Cheap model for intent classification
response = planner_llm.invoke(intent_prompt.format(messages=state["messages"]))
return {"current_step": response.content}
def generate_purchase_order(state: AgentState) -> dict:
# Expensive model for structured document generation
response = executor_llm.invoke(po_prompt.format(details=state["tool_results"]))
return {"final_output": response.content}
Case Study: Enterprise Procurement System
A manufacturing company needed to modernize their procurement process. Users were filling out paper forms, emailing them to a procurement officer, who manually searched vendor catalogs, checked budgets, and routed approvals. Average time from request to PO: 4.5 days.
The goal: allow users to describe their needs in natural language and have the system generate purchase orders, check vendor catalogs, validate budgets, and route for approval — all within minutes.
System Architecture
Stripe Systems designed and built a 4-agent system using LangGraph:
Agent 1 — IntentParser: Extracts structured procurement details from natural language input. Identifies product category, specifications, quantity, urgency, and department.
Agent 2 — VendorLookup: Searches the vendor catalog for matching products. Compares prices, delivery times, and vendor ratings. Returns ranked vendor options.
Agent 3 — BudgetValidator: Checks the requesting department's remaining budget. Validates against procurement policies (e.g., single-item limits, annual caps).
Agent 4 — ApprovalRouter: Determines the approval chain based on amount, department, and item category. Generates the PO document and routes it.
LangGraph Implementation
State Definition:
from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
import operator
class ProcurementState(TypedDict):
messages: Annotated[Sequence[BaseMessage], operator.add]
# Parsed intent
product_category: str | None
product_specs: dict | None
quantity: int | None
department: str | None
urgency: str | None
# Vendor results
vendor_options: list[dict]
selected_vendor: dict | None
# Budget
budget_remaining: float | None
budget_approved: bool
policy_violations: list[str]
# Approval
approval_chain: list[str]
po_document: str | None
# Control
current_step: str
error_count: int
error_messages: list[str]
Graph Structure:
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.sqlite import SqliteSaver
def build_procurement_graph():
graph = StateGraph(ProcurementState)
# Add agent nodes
graph.add_node("parse_intent", intent_parser_node)
graph.add_node("lookup_vendor", vendor_lookup_node)
graph.add_node("validate_budget", budget_validator_node)
graph.add_node("route_approval", approval_router_node)
graph.add_node("handle_error", error_handler_node)
graph.add_node("request_clarification", clarification_node)
# Entry point
graph.set_entry_point("parse_intent")
# Conditional routing after intent parsing
graph.add_conditional_edges(
"parse_intent",
route_after_parse,
{
"vendor_lookup": "lookup_vendor",
"needs_clarification": "request_clarification",
"error": "handle_error",
},
)
# Conditional routing after vendor lookup
graph.add_conditional_edges(
"lookup_vendor",
route_after_vendor,
{
"budget_check": "validate_budget",
"no_vendors": "request_clarification",
"error": "handle_error",
},
)
# Conditional routing after budget validation
graph.add_conditional_edges(
"validate_budget",
route_after_budget,
{
"approved": "route_approval",
"over_budget": "request_clarification",
"policy_violation": "request_clarification",
"error": "handle_error",
},
)
# Approval routing always terminates
graph.add_edge("route_approval", END)
graph.add_edge("handle_error", END)
graph.add_edge("request_clarification", END)
return graph.compile(
checkpointer=SqliteSaver.from_conn_string("procurement.db"),
interrupt_before=["route_approval"], # human approval before final PO
)
def route_after_parse(state: ProcurementState) -> str:
if state["error_count"] >= 3:
return "error"
if state["product_category"] is None or state["quantity"] is None:
return "needs_clarification"
return "vendor_lookup"
def route_after_vendor(state: ProcurementState) -> str:
if state["error_count"] >= 3:
return "error"
if not state["vendor_options"]:
return "no_vendors"
return "budget_check"
def route_after_budget(state: ProcurementState) -> str:
if state["error_count"] >= 3:
return "error"
if state["policy_violations"]:
return "policy_violation"
if not state["budget_approved"]:
return "over_budget"
return "approved"
Vendor Lookup Node with Fallback:
def vendor_lookup_node(state: ProcurementState) -> dict:
category = state["product_category"]
specs = state["product_specs"]
quantity = state["quantity"]
# Primary: Live vendor catalog API
try:
results = vendor_api.search(
category=category,
specs=specs,
min_quantity=quantity,
)
if results:
return {
"vendor_options": results,
"selected_vendor": results[0], # pre-select best match
"current_step": "vendor_lookup_complete",
}
except VendorAPIError as e:
logger.warning(f"Vendor API failed: {e}")
# Fallback: Cached catalog (updated nightly)
try:
cached_results = cached_catalog.search(
category=category,
specs=specs,
)
if cached_results:
return {
"vendor_options": cached_results,
"selected_vendor": cached_results[0],
"current_step": "vendor_lookup_complete_cached",
"error_messages": state["error_messages"] + [
"Live catalog unavailable. Using cached data (last updated: "
f"{cached_catalog.last_updated}). Prices may differ."
],
}
except Exception as e:
logger.error(f"Cached catalog also failed: {e}")
# Both failed
return {
"vendor_options": [],
"error_count": state["error_count"] + 1,
"error_messages": state["error_messages"] + [
"Unable to search vendor catalog. Both live and cached sources unavailable."
],
}
Sample Execution Trace
User input: "I need 50 ergonomic office chairs for the engineering floor. Budget is flexible but ideally under $500 per chair. We need them within 3 weeks."
[Step 1: parse_intent]
Input: User message
Output: category=office_furniture, specs={type: ergonomic_chair, features: [adjustable_height, lumbar_support]},
quantity=50, department=engineering, urgency=3_weeks, budget_hint=$500/unit
Tokens: 847 (gpt-4o-mini)
Latency: 420ms
[Step 2: lookup_vendor]
Input: category=office_furniture, specs=..., quantity=50
Tool call: vendor_api.search(...)
Output: 4 vendors found
- ErgoMax Pro Chair: $429/unit, delivery 2 weeks, rating 4.7
- ComfortElite Series: $489/unit, delivery 10 days, rating 4.5
- BasicErgo Model: $319/unit, delivery 3 weeks, rating 4.1
- PremiumPosture X1: $612/unit, delivery 1 week, rating 4.8
Selected: ErgoMax Pro Chair (best value within budget)
Tokens: 1,203 (gpt-4o)
Latency: 1,850ms (including API call)
[Step 3: validate_budget]
Input: department=engineering, amount=$21,450 (50 × $429)
Tool call: budget_api.check(department=engineering, amount=21450)
Output: budget_remaining=$45,200, policy_check=PASS, budget_approved=True
Tokens: 523 (gpt-4o-mini)
Latency: 380ms
[Step 4: INTERRUPT — human approval required]
PO generated, waiting for manager approval
Approval chain: [engineering_lead, procurement_officer]
[Step 5: route_approval] (after human approval)
PO #ENG-2026-0847 generated and submitted
Notification sent to: vendor, requesting department, finance
Tokens: 689 (gpt-4o)
Latency: 520ms
Total: 3,262 tokens, 3.17 seconds execution time
Results
After 3 months of deployment:
- ✓Average request-to-PO time: 12 minutes (down from 4.5 days), with most of that being human approval wait time
- ✓Agent execution time: under 5 seconds for 94% of requests
- ✓Successful completions: 87% (remaining 13% require human intervention due to ambiguous requests or out-of-catalog items)
- ✓Monthly LLM cost: $340 for ~1,200 procurement requests
- ✓Fallback to cached catalog: triggered 3% of the time (vendor API downtime)
- ✓Policy violations caught by BudgetValidator that would have been missed manually: 8 per month on average
The human-in-the-loop design was essential for adoption. Procurement officers trusted the system because they maintained approval authority. The system handled the tedious parts — catalog search, budget checking, form generation — while humans made the final decisions.
Ready to discuss your project?
Get in Touch →