Skip to main content
Stripe SystemsStripe Systems
AI/ML📅 February 28, 2026· 14 min read

Agentic AI in the Enterprise: Designing Multi-Agent Systems with LangGraph and Tool Orchestration

✍️
Stripe Systems Engineering

The term "AI agent" has been diluted by marketing to the point where it describes everything from a chatbot with a system prompt to a fully autonomous multi-step reasoning system. For this discussion, we use a specific definition: an AI agent is a system that takes a goal, formulates a plan, executes actions using tools, observes results, and iterates until the goal is achieved or it determines the goal cannot be achieved.

This is distinct from a chain, where the sequence of steps is predefined. An agent decides its own control flow at runtime.

Building agents that work reliably in enterprise environments — where failures have real consequences, latency matters, and costs must be controlled — requires more engineering rigor than most tutorials suggest. This post covers the architecture, tooling, and operational concerns for production multi-agent systems using LangGraph.

Agents vs Chains: The Distinction That Matters

A chain is a fixed pipeline: input → step 1 → step 2 → step 3 → output. The steps are determined at development time. A RAG pipeline is a chain. A summarization pipeline is a chain.

An agent adds three capabilities:

  1. Tool use: The ability to call external functions — APIs, databases, file systems, calculators — and incorporate the results.
  2. Planning: The ability to decide which tools to call and in what order, based on the current goal and observations.
  3. State management: The ability to maintain context across multiple steps, remember what has been tried, and track progress toward the goal.

These capabilities introduce non-determinism. The same input may produce different execution paths depending on tool results, model reasoning, and intermediate state. This makes agents powerful but harder to test, debug, and operate.

LangGraph: State Machines for AI Agents

LangGraph models agent workflows as directed graphs with typed state. Each node in the graph is a function that reads the current state, performs some computation (which may include an LLM call or tool invocation), and returns an updated state. Edges define transitions between nodes, and conditional edges allow the agent to branch based on state.

This is a meaningful improvement over the "loop until done" pattern used by basic ReAct agents. With LangGraph, the control flow is explicit and inspectable.

Core Concepts

State: A typed dictionary (usually a TypedDict or Pydantic model) that holds all information the agent needs across steps.

from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
import operator

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]
    current_step: str
    tool_results: dict
    error_count: int
    final_output: str | None

Nodes: Functions that take the state as input and return partial state updates.

def parse_intent(state: AgentState) -> dict:
    # LLM call to classify user intent
    messages = state["messages"]
    response = llm.invoke(intent_prompt.format(messages=messages))
    return {"current_step": response.content, "tool_results": {}}

Edges: Define the graph structure. Conditional edges allow branching.

from langgraph.graph import StateGraph, END

graph = StateGraph(AgentState)
graph.add_node("parse_intent", parse_intent)
graph.add_node("lookup_vendor", lookup_vendor)
graph.add_node("validate_budget", validate_budget)

graph.add_conditional_edges(
    "parse_intent",
    route_after_intent,  # function that returns next node name
    {
        "vendor_lookup": "lookup_vendor",
        "budget_check": "validate_budget",
        "unknown": END,
    }
)

Agent Architectures

ReAct (Reasoning + Acting)

The agent alternates between reasoning (thinking about what to do) and acting (calling a tool). After each tool result, the agent reasons again about what to do next.

ReAct is simple and works well for straightforward tool-use scenarios. Its weakness is that it can get stuck in loops — repeatedly trying the same failing action — and it does not plan ahead.

Plan-and-Execute

The agent first generates a complete plan (a list of steps), then executes each step sequentially. After execution, it can replan if results deviate from expectations.

This works better for complex tasks where the order of operations matters. The downside is that the initial plan may be wrong, and replanning is expensive (requires a full LLM call).

Reflexion

The agent executes a task, evaluates its own output, identifies what went wrong, and tries again with the self-critique as additional context. This is useful for tasks where quality is hard to get right on the first attempt — code generation, for example.

Multi-Agent Collaboration

Multiple specialized agents work together on a task. Each agent has a defined role and set of tools. A coordinator routes work between them.

This is the pattern that matters most for enterprise systems, and the one we will focus on.

Multi-Agent Patterns

Supervisor-Worker

A supervisor agent receives the user's request, breaks it into sub-tasks, delegates each sub-task to a specialized worker agent, collects results, and synthesizes a final response.

User → Supervisor → [Worker A, Worker B, Worker C] → Supervisor → Response

The supervisor handles planning and coordination. Workers handle execution. This is the most common pattern for enterprise applications because it maps naturally to organizational structures and keeps each agent's scope narrow.

Peer-to-Peer

Agents communicate directly with each other without a central coordinator. Each agent can invoke other agents as tools.

This is more flexible but harder to debug. Use it when the workflow is not hierarchical — for example, a negotiation between a buyer agent and a seller agent.

Hierarchical Delegation

Like supervisor-worker, but with multiple levels. A top-level agent delegates to mid-level coordinators, which delegate to workers. This is necessary when the task is too complex for a single supervisor to decompose.

Tool Design

Tools are the interface between the agent and the outside world. Poorly designed tools are the number one cause of agent failure.

Principles

  1. Atomic: Each tool does one thing. "search_vendor_catalog" not "search_and_compare_vendors".
  2. Deterministic: Same input produces same output (when possible). The agent should be able to predict what a tool will do.
  3. Well-documented: The tool description is part of the agent's prompt. A vague description leads to incorrect tool selection.
  4. Error-explicit: Tools should return structured errors, not raise exceptions. The agent needs to reason about failures.
from langchain_core.tools import tool
from pydantic import BaseModel, Field

class VendorSearchResult(BaseModel):
    vendors: list[dict] = Field(description="List of matching vendors")
    total_count: int = Field(description="Total number of matches")
    error: str | None = Field(default=None, description="Error message if search failed")

@tool
def search_vendor_catalog(
    query: str,
    category: str | None = None,
    max_results: int = 10,
) -> VendorSearchResult:
    """Search the vendor catalog for products matching the query.

    Args:
        query: Product description or name to search for.
        category: Optional category filter (e.g., 'office_supplies', 'it_equipment').
        max_results: Maximum number of results to return (default 10).

    Returns:
        VendorSearchResult with matching vendors and their products.
        Check the error field — if non-null, the search failed and vendors list is empty.
    """
    try:
        results = vendor_db.search(query, category=category, limit=max_results)
        return VendorSearchResult(vendors=results, total_count=len(results))
    except VendorDBError as e:
        return VendorSearchResult(vendors=[], total_count=0, error=str(e))

Tool Count

More tools means more decisions for the agent, which means more opportunities for errors. Keep the tool set minimal — under 10 tools per agent. If an agent needs more, consider splitting it into multiple agents.

State Management

Checkpointing

LangGraph supports automatic checkpointing at each node transition. This means you can resume a failed agent from the last successful step rather than restarting from scratch.

from langgraph.checkpoint.sqlite import SqliteSaver

memory = SqliteSaver.from_conn_string(":memory:")
app = graph.compile(checkpointer=memory)

# Invoke with thread_id for persistent state
config = {"configurable": {"thread_id": "procurement-123"}}
result = app.invoke(initial_state, config)

Time-Travel Debugging

With checkpoints, you can inspect the state at any point in the agent's execution history. This is invaluable for debugging:

# Get all checkpoints for a thread
history = list(app.get_state_history(config))
for state in history:
    print(f"Step: {state.values['current_step']}")
    print(f"Messages: {len(state.values['messages'])}")
    print(f"Next: {state.next}")
    print("---")

Human-in-the-Loop

For high-stakes decisions (approving a purchase order, escalating a support ticket), the agent should pause and wait for human confirmation. LangGraph supports this with interrupt nodes:

from langgraph.graph import StateGraph

graph = StateGraph(AgentState)
# ... add nodes ...

# The agent will pause before executing the approval node
app = graph.compile(
    checkpointer=memory,
    interrupt_before=["execute_approval"],
)

# Agent runs until it hits the interrupt
result = app.invoke(initial_state, config)
# Human reviews the state, then resumes
app.invoke(None, config)  # continues from interrupt point

Memory

Short-Term Memory

The conversation history within a single session. In LangGraph, this is typically the messages field in the state. Keep it bounded — summarize or truncate old messages to stay within context limits.

Long-Term Memory

Information that persists across sessions. Two common approaches:

  1. Vector store: Embed and store important facts, decisions, and user preferences. Retrieve relevant memories at the start of each session.
  2. Structured summaries: After each session, generate a summary of key decisions and outcomes. Store these in a database.
def update_long_term_memory(state: AgentState, memory_store):
    summary = llm.invoke(
        f"Summarize the key decisions and outcomes from this interaction: "
        f"{state['messages']}"
    )
    memory_store.add(
        text=summary.content,
        metadata={
            "user_id": state.get("user_id"),
            "timestamp": datetime.now().isoformat(),
            "topic": state.get("current_step"),
        }
    )

Error Handling

Agents fail. Tools return errors, LLM calls time out, APIs rate-limit, and the model sometimes produces unparseable output. Production agents need robust error handling at every level.

Retry with Backoff

For transient failures (API timeouts, rate limits):

def resilient_node(state: AgentState) -> dict:
    max_retries = 3
    for attempt in range(max_retries):
        try:
            result = call_external_api(state["query"])
            return {"tool_results": {"api": result}, "error_count": 0}
        except (TimeoutError, RateLimitError) as e:
            if attempt == max_retries - 1:
                return {
                    "tool_results": {"api_error": str(e)},
                    "error_count": state["error_count"] + 1,
                }
            time.sleep(2 ** attempt)

Fallback Strategies

When a tool fails permanently:

  1. Use an alternative tool: If the vendor catalog API is down, fall back to a cached version.
  2. Ask the user: Present what the agent knows and ask the user to provide the missing information.
  3. Graceful degradation: Complete the parts of the task that do not depend on the failed tool. Flag the incomplete parts.

Maximum Iteration Limits

Always set a hard limit on the number of steps an agent can take. An agent stuck in a loop will burn through API credits indefinitely.

def should_continue(state: AgentState) -> str:
    if state["error_count"] >= 3:
        return "error_exit"
    if len(state["messages"]) > 50:  # hard limit on iterations
        return "max_iterations_exit"
    if state["final_output"] is not None:
        return "complete"
    return "continue"

Observability

You cannot debug what you cannot see. Agent systems need observability at multiple levels.

LangSmith Tracing

LangSmith provides end-to-end tracing for LangChain and LangGraph applications. Every LLM call, tool invocation, and state transition is recorded with inputs, outputs, and latency.

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
os.environ["LANGCHAIN_PROJECT"] = "procurement-agent"

Custom Metrics

Track beyond what the tracing platform provides:

  • Token usage per node: Identify which nodes are expensive.
  • Tool success rate: Which tools fail most often?
  • Latency per node: Where is time being spent?
  • Completion rate: What percentage of requests complete successfully?
import time
from dataclasses import dataclass

@dataclass
class NodeMetrics:
    node_name: str
    latency_ms: float
    tokens_used: int
    tool_calls: int
    success: bool

def instrumented_node(node_fn):
    def wrapper(state: AgentState) -> dict:
        start = time.perf_counter()
        try:
            result = node_fn(state)
            metrics = NodeMetrics(
                node_name=node_fn.__name__,
                latency_ms=(time.perf_counter() - start) * 1000,
                tokens_used=result.get("_tokens_used", 0),
                tool_calls=result.get("_tool_calls", 0),
                success=True,
            )
        except Exception as e:
            metrics = NodeMetrics(
                node_name=node_fn.__name__,
                latency_ms=(time.perf_counter() - start) * 1000,
                tokens_used=0, tool_calls=0, success=False,
            )
            raise
        finally:
            emit_metrics(metrics)  # send to your metrics backend
    return wrapper

Security

Sandboxing Tool Execution

If tools execute user-provided code or interact with external systems, they must be sandboxed. Run tool functions in containers or restricted subprocesses. Never let an agent execute arbitrary shell commands in the host environment.

Input Validation

Validate tool inputs before execution. An agent might produce malformed inputs — SQL injection in a database query tool, path traversal in a file access tool.

@tool
def query_purchase_orders(
    department: str,
    date_from: str,
    date_to: str,
) -> list[dict]:
    """Query purchase orders by department and date range."""
    # Input validation
    allowed_departments = {"engineering", "marketing", "operations", "hr"}
    if department.lower() not in allowed_departments:
        return {"error": f"Invalid department. Allowed: {allowed_departments}"}

    try:
        datetime.fromisoformat(date_from)
        datetime.fromisoformat(date_to)
    except ValueError:
        return {"error": "Invalid date format. Use ISO 8601 (YYYY-MM-DD)."}

    # Use parameterized queries — never f-strings for SQL
    return db.execute(
        "SELECT * FROM purchase_orders WHERE dept = %s AND created BETWEEN %s AND %s",
        (department, date_from, date_to),
    )

Prompt Injection Defense

Agents that process user input are vulnerable to prompt injection — a user might craft input that overrides the system prompt. Defenses:

  1. Separate user input from instructions: Use clear delimiters and instruct the model to treat user input as data.
  2. Input sanitization: Strip or escape special characters that might be interpreted as prompt instructions.
  3. Output validation: Check that the agent's actions are within its allowed scope before executing them.

Cost Control

Agent Budgets

Set a maximum token budget per agent execution. Track cumulative token usage across all LLM calls and terminate if the budget is exceeded.

class BudgetTracker:
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.used_tokens = 0

    def track(self, usage):
        self.used_tokens += usage.total_tokens
        if self.used_tokens > self.max_tokens:
            raise BudgetExceededError(
                f"Used {self.used_tokens} tokens, budget is {self.max_tokens}"
            )

Model Routing

Not every LLM call in an agent needs the most capable model. Use a cheap, fast model (GPT-4o-mini, Claude Haiku) for planning, classification, and simple extraction. Reserve the expensive model (GPT-4o, Claude Sonnet) for complex reasoning and final answer generation.

from langchain_openai import ChatOpenAI

planner_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
executor_llm = ChatOpenAI(model="gpt-4o", temperature=0)

def parse_intent(state: AgentState) -> dict:
    # Cheap model for intent classification
    response = planner_llm.invoke(intent_prompt.format(messages=state["messages"]))
    return {"current_step": response.content}

def generate_purchase_order(state: AgentState) -> dict:
    # Expensive model for structured document generation
    response = executor_llm.invoke(po_prompt.format(details=state["tool_results"]))
    return {"final_output": response.content}

Case Study: Enterprise Procurement System

A manufacturing company needed to modernize their procurement process. Users were filling out paper forms, emailing them to a procurement officer, who manually searched vendor catalogs, checked budgets, and routed approvals. Average time from request to PO: 4.5 days.

The goal: allow users to describe their needs in natural language and have the system generate purchase orders, check vendor catalogs, validate budgets, and route for approval — all within minutes.

System Architecture

Stripe Systems designed and built a 4-agent system using LangGraph:

Agent 1 — IntentParser: Extracts structured procurement details from natural language input. Identifies product category, specifications, quantity, urgency, and department.

Agent 2 — VendorLookup: Searches the vendor catalog for matching products. Compares prices, delivery times, and vendor ratings. Returns ranked vendor options.

Agent 3 — BudgetValidator: Checks the requesting department's remaining budget. Validates against procurement policies (e.g., single-item limits, annual caps).

Agent 4 — ApprovalRouter: Determines the approval chain based on amount, department, and item category. Generates the PO document and routes it.

LangGraph Implementation

State Definition:

from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
import operator

class ProcurementState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]
    # Parsed intent
    product_category: str | None
    product_specs: dict | None
    quantity: int | None
    department: str | None
    urgency: str | None
    # Vendor results
    vendor_options: list[dict]
    selected_vendor: dict | None
    # Budget
    budget_remaining: float | None
    budget_approved: bool
    policy_violations: list[str]
    # Approval
    approval_chain: list[str]
    po_document: str | None
    # Control
    current_step: str
    error_count: int
    error_messages: list[str]

Graph Structure:

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.sqlite import SqliteSaver

def build_procurement_graph():
    graph = StateGraph(ProcurementState)

    # Add agent nodes
    graph.add_node("parse_intent", intent_parser_node)
    graph.add_node("lookup_vendor", vendor_lookup_node)
    graph.add_node("validate_budget", budget_validator_node)
    graph.add_node("route_approval", approval_router_node)
    graph.add_node("handle_error", error_handler_node)
    graph.add_node("request_clarification", clarification_node)

    # Entry point
    graph.set_entry_point("parse_intent")

    # Conditional routing after intent parsing
    graph.add_conditional_edges(
        "parse_intent",
        route_after_parse,
        {
            "vendor_lookup": "lookup_vendor",
            "needs_clarification": "request_clarification",
            "error": "handle_error",
        },
    )

    # Conditional routing after vendor lookup
    graph.add_conditional_edges(
        "lookup_vendor",
        route_after_vendor,
        {
            "budget_check": "validate_budget",
            "no_vendors": "request_clarification",
            "error": "handle_error",
        },
    )

    # Conditional routing after budget validation
    graph.add_conditional_edges(
        "validate_budget",
        route_after_budget,
        {
            "approved": "route_approval",
            "over_budget": "request_clarification",
            "policy_violation": "request_clarification",
            "error": "handle_error",
        },
    )

    # Approval routing always terminates
    graph.add_edge("route_approval", END)
    graph.add_edge("handle_error", END)
    graph.add_edge("request_clarification", END)

    return graph.compile(
        checkpointer=SqliteSaver.from_conn_string("procurement.db"),
        interrupt_before=["route_approval"],  # human approval before final PO
    )

def route_after_parse(state: ProcurementState) -> str:
    if state["error_count"] >= 3:
        return "error"
    if state["product_category"] is None or state["quantity"] is None:
        return "needs_clarification"
    return "vendor_lookup"

def route_after_vendor(state: ProcurementState) -> str:
    if state["error_count"] >= 3:
        return "error"
    if not state["vendor_options"]:
        return "no_vendors"
    return "budget_check"

def route_after_budget(state: ProcurementState) -> str:
    if state["error_count"] >= 3:
        return "error"
    if state["policy_violations"]:
        return "policy_violation"
    if not state["budget_approved"]:
        return "over_budget"
    return "approved"

Vendor Lookup Node with Fallback:

def vendor_lookup_node(state: ProcurementState) -> dict:
    category = state["product_category"]
    specs = state["product_specs"]
    quantity = state["quantity"]

    # Primary: Live vendor catalog API
    try:
        results = vendor_api.search(
            category=category,
            specs=specs,
            min_quantity=quantity,
        )
        if results:
            return {
                "vendor_options": results,
                "selected_vendor": results[0],  # pre-select best match
                "current_step": "vendor_lookup_complete",
            }
    except VendorAPIError as e:
        logger.warning(f"Vendor API failed: {e}")

    # Fallback: Cached catalog (updated nightly)
    try:
        cached_results = cached_catalog.search(
            category=category,
            specs=specs,
        )
        if cached_results:
            return {
                "vendor_options": cached_results,
                "selected_vendor": cached_results[0],
                "current_step": "vendor_lookup_complete_cached",
                "error_messages": state["error_messages"] + [
                    "Live catalog unavailable. Using cached data (last updated: "
                    f"{cached_catalog.last_updated}). Prices may differ."
                ],
            }
    except Exception as e:
        logger.error(f"Cached catalog also failed: {e}")

    # Both failed
    return {
        "vendor_options": [],
        "error_count": state["error_count"] + 1,
        "error_messages": state["error_messages"] + [
            "Unable to search vendor catalog. Both live and cached sources unavailable."
        ],
    }

Sample Execution Trace

User input: "I need 50 ergonomic office chairs for the engineering floor. Budget is flexible but ideally under $500 per chair. We need them within 3 weeks."

[Step 1: parse_intent] 
  Input: User message
  Output: category=office_furniture, specs={type: ergonomic_chair, features: [adjustable_height, lumbar_support]},
          quantity=50, department=engineering, urgency=3_weeks, budget_hint=$500/unit
  Tokens: 847 (gpt-4o-mini)
  Latency: 420ms

[Step 2: lookup_vendor]
  Input: category=office_furniture, specs=..., quantity=50
  Tool call: vendor_api.search(...)
  Output: 4 vendors found
    - ErgoMax Pro Chair: $429/unit, delivery 2 weeks, rating 4.7
    - ComfortElite Series: $489/unit, delivery 10 days, rating 4.5
    - BasicErgo Model: $319/unit, delivery 3 weeks, rating 4.1
    - PremiumPosture X1: $612/unit, delivery 1 week, rating 4.8
  Selected: ErgoMax Pro Chair (best value within budget)
  Tokens: 1,203 (gpt-4o)
  Latency: 1,850ms (including API call)

[Step 3: validate_budget]
  Input: department=engineering, amount=$21,450 (50 × $429)
  Tool call: budget_api.check(department=engineering, amount=21450)
  Output: budget_remaining=$45,200, policy_check=PASS, budget_approved=True
  Tokens: 523 (gpt-4o-mini)
  Latency: 380ms

[Step 4: INTERRUPT — human approval required]
  PO generated, waiting for manager approval
  Approval chain: [engineering_lead, procurement_officer]

[Step 5: route_approval] (after human approval)
  PO #ENG-2026-0847 generated and submitted
  Notification sent to: vendor, requesting department, finance
  Tokens: 689 (gpt-4o)
  Latency: 520ms

Total: 3,262 tokens, 3.17 seconds execution time

Results

After 3 months of deployment:

  • Average request-to-PO time: 12 minutes (down from 4.5 days), with most of that being human approval wait time
  • Agent execution time: under 5 seconds for 94% of requests
  • Successful completions: 87% (remaining 13% require human intervention due to ambiguous requests or out-of-catalog items)
  • Monthly LLM cost: $340 for ~1,200 procurement requests
  • Fallback to cached catalog: triggered 3% of the time (vendor API downtime)
  • Policy violations caught by BudgetValidator that would have been missed manually: 8 per month on average

The human-in-the-loop design was essential for adoption. Procurement officers trusted the system because they maintained approval authority. The system handled the tedious parts — catalog search, budget checking, form generation — while humans made the final decisions.

Ready to discuss your project?

Get in Touch →
← Back to Blog

More Articles