AI/ML📅 February 28, 2026· 14 min read

Agentic AI in the Enterprise: Designing Multi-Agent Systems with LangGraph and Tool Orchestration

✍️

Stripe Systems Engineering

The term "AI agent" has been diluted by marketing to the point where it describes everything from a chatbot with a system prompt to a fully autonomous multi-step reasoning system. For this discussion, we use a specific definition: an AI agent is a system that takes a goal, formulates a plan, executes actions using tools, observes results, and iterates until the goal is achieved or it determines the goal cannot be achieved.

This is distinct from a chain, where the sequence of steps is predefined. An agent decides its own control flow at runtime.

Building agents that work reliably in enterprise environments — where failures have real consequences, latency matters, and costs must be controlled — requires more engineering rigor than most tutorials suggest. This post covers the architecture, tooling, and operational concerns for production multi-agent systems using LangGraph.

Agents vs Chains: The Distinction That Matters

A chain is a fixed pipeline: input → step 1 → step 2 → step 3 → output. The steps are determined at development time. A RAG pipeline is a chain. A summarization pipeline is a chain.

An agent adds three capabilities:

✓Tool use: The ability to call external functions — APIs, databases, file systems, calculators — and incorporate the results.
✓Planning: The ability to decide which tools to call and in what order, based on the current goal and observations.
✓State management: The ability to maintain context across multiple steps, remember what has been tried, and track progress toward the goal.

These capabilities introduce non-determinism. The same input may produce different execution paths depending on tool results, model reasoning, and intermediate state. This makes agents powerful but harder to test, debug, and operate.

LangGraph: State Machines for AI Agents

LangGraph models agent workflows as directed graphs with typed state. Each node in the graph is a function that reads the current state, performs some computation (which may include an LLM call or tool invocation), and returns an updated state. Edges define transitions between nodes, and conditional edges allow the agent to branch based on state.

This is a meaningful improvement over the "loop until done" pattern used by basic ReAct agents. With LangGraph, the control flow is explicit and inspectable.

Core Concepts

State: A typed dictionary (usually a TypedDict or Pydantic model) that holds all information the agent needs across steps.

from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
import operator

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]
    current_step: str
    tool_results: dict
    error_count: int
    final_output: str | None

Nodes: Functions that take the state as input and return partial state updates.

def parse_intent(state: AgentState) -> dict:
    # LLM call to classify user intent
    messages = state["messages"]
    response = llm.invoke(intent_prompt.format(messages=messages))
    return {"current_step": response.content, "tool_results": {}}

Edges: Define the graph structure. Conditional edges allow branching.

from langgraph.graph import StateGraph, END

graph = StateGraph(AgentState)
graph.add_node("parse_intent", parse_intent)
graph.add_node("lookup_vendor", lookup_vendor)
graph.add_node("validate_budget", validate_budget)

graph.add_conditional_edges(
    "parse_intent",
    route_after_intent,  # function that returns next node name
    {
        "vendor_lookup": "lookup_vendor",
        "budget_check": "validate_budget",
        "unknown": END,
    }
)

Agent Architectures

ReAct (Reasoning + Acting)

The agent alternates between reasoning (thinking about what to do) and acting (calling a tool). After each tool result, the agent reasons again about what to do next.

ReAct is simple and works well for straightforward tool-use scenarios. Its weakness is that it can get stuck in loops — repeatedly trying the same failing action — and it does not plan ahead.

Plan-and-Execute

The agent first generates a complete plan (a list of steps), then executes each step sequentially. After execution, it can replan if results deviate from expectations.

This works better for complex tasks where the order of operations matters. The downside is that the initial plan may be wrong, and replanning is expensive (requires a full LLM call).

Reflexion

The agent executes a task, evaluates its own output, identifies what went wrong, and tries again with the self-critique as additional context. This is useful for tasks where quality is hard to get right on the first attempt — code generation, for example.

Multi-Agent Collaboration

Multiple specialized agents work together on a task. Each agent has a defined role and set of tools. A coordinator routes work between them.

This is the pattern that matters most for enterprise systems, and the one we will focus on.

Multi-Agent Patterns

Supervisor-Worker

A supervisor agent receives the user's request, breaks it into sub-tasks, delegates each sub-task to a specialized worker agent, collects results, and synthesizes a final response.

User → Supervisor → [Worker A, Worker B, Worker C] → Supervisor → Response

The supervisor handles planning and coordination. Workers handle execution. This is the most common pattern for enterprise applications because it maps naturally to organizational structures and keeps each agent's scope narrow.

Peer-to-Peer

Agents communicate directly with each other without a central coordinator. Each agent can invoke other agents as tools.

This is more flexible but harder to debug. Use it when the workflow is not hierarchical — for example, a negotiation between a buyer agent and a seller agent.

Hierarchical Delegation

Like supervisor-worker, but with multiple levels. A top-level agent delegates to mid-level coordinators, which delegate to workers. This is necessary when the task is too complex for a single supervisor to decompose.

Tool Design

Tools are the interface between the agent and the outside world. Poorly designed tools are the number one cause of agent failure.

Principles

✓Atomic: Each tool does one thing. "search_vendor_catalog" not "search_and_compare_vendors".
✓Deterministic: Same input produces same output (when possible). The agent should be able to predict what a tool will do.
✓Well-documented: The tool description is part of the agent's prompt. A vague description leads to incorrect tool selection.
✓Error-explicit: Tools should return structured errors, not raise exceptions. The agent needs to reason about failures.

from langchain_core.tools import tool
from pydantic import BaseModel, Field

class VendorSearchResult(BaseModel):
    vendors: list[dict] = Field(description="List of matching vendors")
    total_count: int = Field(description="Total number of matches")
    error: str | None = Field(default=None, description="Error message if search failed")

@tool
def search_vendor_catalog(
    query: str,
    category: str | None = None,
    max_results: int = 10,
) -> VendorSearchResult:
    """Search the vendor catalog for products matching the query.

    Args:
        query: Product description or name to search for.
        category: Optional category filter (e.g., 'office_supplies', 'it_equipment').
        max_results: Maximum number of results to return (default 10).

    Returns:
        VendorSearchResult with matching vendors and their products.
        Check the error field — if non-null, the search failed and vendors list is empty.
    """
    try:
        results = vendor_db.search(query, category=category, limit=max_results)
        return VendorSearchResult(vendors=results, total_count=len(results))
    except VendorDBError as e:
        return VendorSearchResult(vendors=[], total_count=0, error=str(e))

Tool Count

More tools means more decisions for the agent, which means more opportunities for errors. Keep the tool set minimal — under 10 tools per agent. If an agent needs more, consider splitting it into multiple agents.

State Management

Checkpointing

LangGraph supports automatic checkpointing at each node transition. This means you can resume a failed agent from the last successful step rather than restarting from scratch.

from langgraph.checkpoint.sqlite import SqliteSaver

memory = SqliteSaver.from_conn_string(":memory:")
app = graph.compile(checkpointer=memory)

# Invoke with thread_id for persistent state
config = {"configurable": {"thread_id": "procurement-123"}}
result = app.invoke(initial_state, config)

Time-Travel Debugging

With checkpoints, you can inspect the state at any point in the agent's execution history. This is invaluable for debugging:

# Get all checkpoints for a thread
history = list(app.get_state_history(config))
for state in history:
    print(f"Step: {state.values['current_step']}")
    print(f"Messages: {len(state.values['messages'])}")
    print(f"Next: {state.next}")
    print("---")

Human-in-the-Loop

For high-stakes decisions (approving a purchase order, escalating a support ticket), the agent should pause and wait for human confirmation. LangGraph supports this with interrupt nodes:

from langgraph.graph import StateGraph

graph = StateGraph(AgentState)
# ... add nodes ...

# The agent will pause before executing the approval node
app = graph.compile(
    checkpointer=memory,
    interrupt_before=["execute_approval"],
)

# Agent runs until it hits the interrupt
result = app.invoke(initial_state, config)
# Human reviews the state, then resumes
app.invoke(None, config)  # continues from interrupt point

Memory

Short-Term Memory

The conversation history within a single session. In LangGraph, this is typically the messages field in the state. Keep it bounded — summarize or truncate old messages to stay within context limits.

Long-Term Memory

Information that persists across sessions. Two common approaches:

✓Vector store: Embed and store important facts, decisions, and user preferences. Retrieve relevant memories at the start of each session.
✓Structured summaries: After each session, generate a summary of key decisions and outcomes. Store these in a database.

def update_long_term_memory(state: AgentState, memory_store):
    summary = llm.invoke(
        f"Summarize the key decisions and outcomes from this interaction: "
        f"{state['messages']}"
    )
    memory_store.add(
        text=summary.content,
        metadata={
            "user_id": state.get("user_id"),
            "timestamp": datetime.now().isoformat(),
            "topic": state.get("current_step"),
        }
    )

Error Handling

Agents fail. Tools return errors, LLM calls time out, APIs rate-limit, and the model sometimes produces unparseable output. Production agents need robust error handling at every level.

Retry with Backoff

For transient failures (API timeouts, rate limits):

def resilient_node(state: AgentState) -> dict:
    max_retries = 3
    for attempt in range(max_retries):
        try:
            result = call_external_api(state["query"])
            return {"tool_results": {"api": result}, "error_count": 0}
        except (TimeoutError, RateLimitError) as e:
            if attempt == max_retries - 1:
                return {
                    "tool_results": {"api_error": str(e)},
                    "error_count": state["error_count"] + 1,
                }
            time.sleep(2 ** attempt)

Fallback Strategies

When a tool fails permanently:

✓Use an alternative tool: If the vendor catalog API is down, fall back to a cached version.
✓Ask the user: Present what the agent knows and ask the user to provide the missing information.
✓Graceful degradation: Complete the parts of the task that do not depend on the failed tool. Flag the incomplete parts.

Maximum Iteration Limits

Always set a hard limit on the number of steps an agent can take. An agent stuck in a loop will burn through API credits indefinitely.

def should_continue(state: AgentState) -> str:
    if state["error_count"] >= 3:
        return "error_exit"
    if len(state["messages"]) > 50:  # hard limit on iterations
        return "max_iterations_exit"
    if state["final_output"] is not None:
        return "complete"
    return "continue"

Observability

You cannot debug what you cannot see. Agent systems need observability at multiple levels.

LangSmith Tracing

LangSmith provides end-to-end tracing for LangChain and LangGraph applications. Every LLM call, tool invocation, and state transition is recorded with inputs, outputs, and latency.

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
os.environ["LANGCHAIN_PROJECT"] = "procurement-agent"

Custom Metrics

Track beyond what the tracing platform provides:

✓Token usage per node: Identify which nodes are expensive.
✓Tool success rate: Which tools fail most often?
✓Latency per node: Where is time being spent?
✓Completion rate: What percentage of requests complete successfully?

import time
from dataclasses import dataclass

@dataclass
class NodeMetrics:
    node_name: str
    latency_ms: float
    tokens_used: int
    tool_calls: int
    success: bool

def instrumented_node(node_fn):
    def wrapper(state: AgentState) -> dict:
        start = time.perf_counter()
        try:
            result = node_fn(state)
            metrics = NodeMetrics(
                node_name=node_fn.__name__,
                latency_ms=(time.perf_counter() - start) * 1000,
                tokens_used=result.get("_tokens_used", 0),
                tool_calls=result.get("_tool_calls", 0),
                success=True,
            )
        except Exception as e:
            metrics = NodeMetrics(
                node_name=node_fn.__name__,
                latency_ms=(time.perf_counter() - start) * 1000,
                tokens_used=0, tool_calls=0, success=False,
            )
            raise
        finally:
            emit_metrics(metrics)  # send to your metrics backend
    return wrapper

Security

Sandboxing Tool Execution

If tools execute user-provided code or interact with external systems, they must be sandboxed. Run tool functions in containers or restricted subprocesses. Never let an agent execute arbitrary shell commands in the host environment.

Input Validation

Validate tool inputs before execution. An agent might produce malformed inputs — SQL injection in a database query tool, path traversal in a file access tool.

@tool
def query_purchase_orders(
    department: str,
    date_from: str,
    date_to: str,
) -> list[dict]:
    """Query purchase orders by department and date range."""
    # Input validation
    allowed_departments = {"engineering", "marketing", "operations", "hr"}
    if department.lower() not in allowed_departments:
        return {"error": f"Invalid department. Allowed: {allowed_departments}"}

    try:
        datetime.fromisoformat(date_from)
        datetime.fromisoformat(date_to)
    except ValueError:
        return {"error": "Invalid date format. Use ISO 8601 (YYYY-MM-DD)."}

    # Use parameterized queries — never f-strings for SQL
    return db.execute(
        "SELECT * FROM purchase_orders WHERE dept = %s AND created BETWEEN %s AND %s",
        (department, date_from, date_to),
    )

Prompt Injection Defense

Agents that process user input are vulnerable to prompt injection — a user might craft input that overrides the system prompt. Defenses:

✓Separate user input from instructions: Use clear delimiters and instruct the model to treat user input as data.
✓Input sanitization: Strip or escape special characters that might be interpreted as prompt instructions.
✓Output validation: Check that the agent's actions are within its allowed scope before executing them.

Cost Control

Agent Budgets

Set a maximum token budget per agent execution. Track cumulative token usage across all LLM calls and terminate if the budget is exceeded.

class BudgetTracker:
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.used_tokens = 0

    def track(self, usage):
        self.used_tokens += usage.total_tokens
        if self.used_tokens > self.max_tokens:
            raise BudgetExceededError(
                f"Used {self.used_tokens} tokens, budget is {self.max_tokens}"
            )

Model Routing

Not every LLM call in an agent needs the most capable model. Use a cheap, fast model (GPT-4o-mini, Claude Haiku) for planning, classification, and simple extraction. Reserve the expensive model (GPT-4o, Claude Sonnet) for complex reasoning and final answer generation.

from langchain_openai import ChatOpenAI

planner_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
executor_llm = ChatOpenAI(model="gpt-4o", temperature=0)

def parse_intent(state: AgentState) -> dict:
    # Cheap model for intent classification
    response = planner_llm.invoke(intent_prompt.format(messages=state["messages"]))
    return {"current_step": response.content}

def generate_purchase_order(state: AgentState) -> dict:
    # Expensive model for structured document generation
    response = executor_llm.invoke(po_prompt.format(details=state["tool_results"]))
    return {"final_output": response.content}

Case Study: Enterprise Procurement System

A manufacturing company needed to modernize their procurement process. Users were filling out paper forms, emailing them to a procurement officer, who manually searched vendor catalogs, checked budgets, and routed approvals. Average time from request to PO: 4.5 days.

The goal: allow users to describe their needs in natural language and have the system generate purchase orders, check vendor catalogs, validate budgets, and route for approval — all within minutes.

System Architecture

Stripe Systems designed and built a 4-agent system using LangGraph:

Agent 1 — IntentParser: Extracts structured procurement details from natural language input. Identifies product category, specifications, quantity, urgency, and department.

Agent 2 — VendorLookup: Searches the vendor catalog for matching products. Compares prices, delivery times, and vendor ratings. Returns ranked vendor options.

Agent 3 — BudgetValidator: Checks the requesting department's remaining budget. Validates against procurement policies (e.g., single-item limits, annual caps).

Agent 4 — ApprovalRouter: Determines the approval chain based on amount, department, and item category. Generates the PO document and routes it.

LangGraph Implementation

State Definition:

from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
import operator

class ProcurementState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]
    # Parsed intent
    product_category: str | None
    product_specs: dict | None
    quantity: int | None
    department: str | None
    urgency: str | None
    # Vendor results
    vendor_options: list[dict]
    selected_vendor: dict | None
    # Budget
    budget_remaining: float | None
    budget_approved: bool
    policy_violations: list[str]
    # Approval
    approval_chain: list[str]
    po_document: str | None
    # Control
    current_step: str
    error_count: int
    error_messages: list[str]

Graph Structure:

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.sqlite import SqliteSaver

def build_procurement_graph():
    graph = StateGraph(ProcurementState)

    # Add agent nodes
    graph.add_node("parse_intent", intent_parser_node)
    graph.add_node("lookup_vendor", vendor_lookup_node)
    graph.add_node("validate_budget", budget_validator_node)
    graph.add_node("route_approval", approval_router_node)
    graph.add_node("handle_error", error_handler_node)
    graph.add_node("request_clarification", clarification_node)

    # Entry point
    graph.set_entry_point("parse_intent")

    # Conditional routing after intent parsing
    graph.add_conditional_edges(
        "parse_intent",
        route_after_parse,
        {
            "vendor_lookup": "lookup_vendor",
            "needs_clarification": "request_clarification",
            "error": "handle_error",
        },
    )

    # Conditional routing after vendor lookup
    graph.add_conditional_edges(
        "lookup_vendor",
        route_after_vendor,
        {
            "budget_check": "validate_budget",
            "no_vendors": "request_clarification",
            "error": "handle_error",
        },
    )

    # Conditional routing after budget validation
    graph.add_conditional_edges(
        "validate_budget",
        route_after_budget,
        {
            "approved": "route_approval",
            "over_budget": "request_clarification",
            "policy_violation": "request_clarification",
            "error": "handle_error",
        },
    )

    # Approval routing always terminates
    graph.add_edge("route_approval", END)
    graph.add_edge("handle_error", END)
    graph.add_edge("request_clarification", END)

    return graph.compile(
        checkpointer=SqliteSaver.from_conn_string("procurement.db"),
        interrupt_before=["route_approval"],  # human approval before final PO
    )

def route_after_parse(state: ProcurementState) -> str:
    if state["error_count"] >= 3:
        return "error"
    if state["product_category"] is None or state["quantity"] is None:
        return "needs_clarification"
    return "vendor_lookup"

def route_after_vendor(state: ProcurementState) -> str:
    if state["error_count"] >= 3:
        return "error"
    if not state["vendor_options"]:
        return "no_vendors"
    return "budget_check"

def route_after_budget(state: ProcurementState) -> str:
    if state["error_count"] >= 3:
        return "error"
    if state["policy_violations"]:
        return "policy_violation"
    if not state["budget_approved"]:
        return "over_budget"
    return "approved"

Vendor Lookup Node with Fallback:

def vendor_lookup_node(state: ProcurementState) -> dict:
    category = state["product_category"]
    specs = state["product_specs"]
    quantity = state["quantity"]

    # Primary: Live vendor catalog API
    try:
        results = vendor_api.search(
            category=category,
            specs=specs,
            min_quantity=quantity,
        )
        if results:
            return {
                "vendor_options": results,
                "selected_vendor": results[0],  # pre-select best match
                "current_step": "vendor_lookup_complete",
            }
    except VendorAPIError as e:
        logger.warning(f"Vendor API failed: {e}")

    # Fallback: Cached catalog (updated nightly)
    try:
        cached_results = cached_catalog.search(
            category=category,
            specs=specs,
        )
        if cached_results:
            return {
                "vendor_options": cached_results,
                "selected_vendor": cached_results[0],
                "current_step": "vendor_lookup_complete_cached",
                "error_messages": state["error_messages"] + [
                    "Live catalog unavailable. Using cached data (last updated: "
                    f"{cached_catalog.last_updated}). Prices may differ."
                ],
            }
    except Exception as e:
        logger.error(f"Cached catalog also failed: {e}")

    # Both failed
    return {
        "vendor_options": [],
        "error_count": state["error_count"] + 1,
        "error_messages": state["error_messages"] + [
            "Unable to search vendor catalog. Both live and cached sources unavailable."
        ],
    }

Sample Execution Trace

User input: "I need 50 ergonomic office chairs for the engineering floor. Budget is flexible but ideally under $500 per chair. We need them within 3 weeks."

[Step 1: parse_intent] 
  Input: User message
  Output: category=office_furniture, specs={type: ergonomic_chair, features: [adjustable_height, lumbar_support]},
          quantity=50, department=engineering, urgency=3_weeks, budget_hint=$500/unit
  Tokens: 847 (gpt-4o-mini)
  Latency: 420ms

[Step 2: lookup_vendor]
  Input: category=office_furniture, specs=..., quantity=50
  Tool call: vendor_api.search(...)
  Output: 4 vendors found
    - ErgoMax Pro Chair: $429/unit, delivery 2 weeks, rating 4.7
    - ComfortElite Series: $489/unit, delivery 10 days, rating 4.5
    - BasicErgo Model: $319/unit, delivery 3 weeks, rating 4.1
    - PremiumPosture X1: $612/unit, delivery 1 week, rating 4.8
  Selected: ErgoMax Pro Chair (best value within budget)
  Tokens: 1,203 (gpt-4o)
  Latency: 1,850ms (including API call)

[Step 3: validate_budget]
  Input: department=engineering, amount=$21,450 (50 × $429)
  Tool call: budget_api.check(department=engineering, amount=21450)
  Output: budget_remaining=$45,200, policy_check=PASS, budget_approved=True
  Tokens: 523 (gpt-4o-mini)
  Latency: 380ms

[Step 4: INTERRUPT — human approval required]
  PO generated, waiting for manager approval
  Approval chain: [engineering_lead, procurement_officer]

[Step 5: route_approval] (after human approval)
  PO #ENG-2026-0847 generated and submitted
  Notification sent to: vendor, requesting department, finance
  Tokens: 689 (gpt-4o)
  Latency: 520ms

Total: 3,262 tokens, 3.17 seconds execution time

Results

After 3 months of deployment:

✓Average request-to-PO time: 12 minutes (down from 4.5 days), with most of that being human approval wait time
✓Agent execution time: under 5 seconds for 94% of requests
✓Successful completions: 87% (remaining 13% require human intervention due to ambiguous requests or out-of-catalog items)
✓Monthly LLM cost: $340 for ~1,200 procurement requests
✓Fallback to cached catalog: triggered 3% of the time (vendor API downtime)
✓Policy violations caught by BudgetValidator that would have been missed manually: 8 per month on average

The human-in-the-loop design was essential for adoption. Procurement officers trusted the system because they maintained approval authority. The system handled the tedious parts — catalog search, budget checking, form generation — while humans made the final decisions.

Ready to discuss your project?

Get in Touch →

Related Services from Stripe Systems

Stripe Systems helps teams put the patterns covered in this article into production.

AI/ML Solutions

Machine learning models and AI integrations grounded in measurable business outcomes.

Learn more →

← Back to Blog

Software DevelopmentFebruary 10, 2026

Agile vs Waterfall — Choosing the Right Methodology for Your Project

The methodology debate in software development is older than most of the frameworks we argue about on the internet. Waterfall has been declared dead roughly once per year since the Agile Manifesto ...

Engineering CultureMarch 5, 2026

AI-Assisted Code Review at Scale: How We Cut Review Cycle Time by 60% Without Sacrificing Architecture Standards

Code review is the most important quality gate in a software team, and it is also the most common bottleneck. Every team has the same problem: senior engineers are the reviewers, they have their ow...

Engineering CultureFebruary 5, 2026

The AI-Augmented SDLC: How We've Embedded AI at Every Phase — From Requirements to Deployment

The phrase "AI-augmented SDLC" gets thrown around loosely. Vendors pitch it as "AI writes your code." That is not what it means in practice. What it actually means: at every phase of the developmen...

Quality AssuranceMarch 15, 2026

How AI Is Transforming Automated Testing — Unit Tests, Code Coverage, and E2E Integration

AI-assisted testing has moved from research papers into daily engineering workflows. Tools powered by large language models can generate test scaffolds, detect visual regressions, predict flaky tes...

AI/MLMarch 19, 2026

AI Code Review Agents: How We Built a Custom Pipeline That Catches Architecture Violations, Not Just Bugs

Generic AI code review tools are good at catching syntax errors, unused variables, and simple bugs. They are poor at catching architecture violations — the kind of issues that compound over months ...

Engineering CultureMarch 20, 2026

How Our Engineering Team Uses AI Tools Daily to Ship Faster, Catch More Bugs, and Write Better Code — A Practitioner's Honest Breakdown

AI tools are not magic. They do not replace engineers, they do not understand your codebase, and they will confidently generate code that compiles but violates your business rules. What they do — w...

Backend DevelopmentJanuary 15, 2026

API Gateway Patterns: BFF vs Aggregator vs Direct — Choosing for Your Stack

Every team building on microservices eventually hits the same question: how should clients talk to your backend? The answer is some form of API gateway — but which pattern you choose has lasting co...

Cloud ComputingFebruary 24, 2026

AWS Lambda Cold Starts — Root Causes, Benchmarks, and 7 Proven Mitigation Strategies

Every engineer who has operated a Lambda-based production service has encountered the cold start problem. The function responds in 12 milliseconds on the second invocation but takes 3.8 seconds on ...

Cloud ComputingFebruary 15, 2026

AWS vs Azure vs GCP for Startups in 2026 — An Honest Cost and Capability Breakdown

Most cloud comparison articles recycle the same vague advice: "AWS has the most services, Azure integrates with Microsoft, GCP is good for data." That is not useful when you are a startup founder s...

Mobile DevelopmentMarch 1, 2026

Choosing the Right Mobile Development Approach: Native vs Cross-Platform

One of the first and most important decisions in any mobile app project is choosing between native and cross-platform development. Each approach has distinct advantages, and the right choice depend...

DevOpsMarch 7, 2026

Building a Production-Grade CI/CD Pipeline for a Monorepo (GitHub Actions + Docker + Kubernetes)

Monorepos consolidate multiple services, shared libraries, and frontend applications into a single repository. This brings benefits — atomic cross-service changes, shared tooling, simplified depend...

Backend DevelopmentJanuary 29, 2026

Clean Architecture in .NET 8 — Structuring Enterprise Apps That Scale Without Rot

Software architecture is not about choosing the right framework. It is about deciding which parts of a system should be easy to change and which should be stable — then enforcing that decision stru...

Mobile DevelopmentJanuary 6, 2026

CLEAN Architecture in Flutter — BLoC vs Riverpod for State Management

Flutter gives you a rendering engine and a widget tree. It does not give you an architecture. That gap is where most projects accumulate the technical debt that slows them down six months after lau...

DevOpsFebruary 28, 2026

How DevOps and DevSecOps Integrate Into Enterprise Product Development From Day One

Most enterprise teams treat DevOps as something to bolt on after the application takes shape. Security gets deferred even further — relegated to a penetration test two weeks before launch. This seq...

DevOpsJanuary 23, 2026

Docker Image Hardening for Production — Distroless, Non-Root Users, and Layer Optimization

A default Docker image built from `node:18` or `python:3.11` ships with hundreds of packages you do not need in production — compilers, package managers, shells, debug utilities. Each unnecessary p...

Backend DevelopmentJanuary 18, 2026

Event-Driven Architecture with Kafka, NestJS, and Outbox Pattern — A Production Walkthrough

Most backend systems start as synchronous request-response services. A client sends a request, the server processes it, and returns a result. This model is simple to reason about, easy to debug, an...

Cloud ComputingMarch 5, 2026

FinOps in Practice: How We Cut a Client's AWS Bill by 40% Without Touching Their Codebase

Most organizations overspend on AWS by 25–35%. Not because their engineers are careless, but because cloud billing is structurally opaque. Pricing varies by region, instance family, tenancy, paymen...

Mobile DevelopmentJanuary 10, 2026

Flutter vs React Native in 2026 — A Deep Technical Comparison for Enterprise Apps

Cross-platform mobile development has converged on two serious contenders: Flutter and React Native. Both are production-ready for enterprise applications, but they make fundamentally different arc...

DevOpsMarch 13, 2026

GitOps with ArgoCD and Terraform: The Infrastructure Deployment Workflow That Eliminates Drift

Infrastructure drift — the divergence between what is declared in code and what is actually running — is the root cause of a large class of production incidents. GitOps addresses this by making Git...

DevSecOpsFebruary 18, 2026

Infrastructure as Code Security: Detecting Misconfigurations with Checkov and OPA Before Deployment

Cloud misconfigurations remain the most common cause of cloud security incidents. The 2024 Verizon Data Breach Investigations Report attributes 74% of cloud breaches to misconfiguration or misuse, ...

Backend DevelopmentFebruary 10, 2026

Java Virtual Threads (Project Loom) vs Node.js — Concurrency Models Compared for Backend Engineers

Backend concurrency is not a solved problem. It is a set of trade-offs that shift with every workload profile. Java 21 introduced virtual threads — lightweight threads managed by the JVM rather tha...

DevOpsJanuary 25, 2026

Kubernetes Multi-Tenancy Patterns — Namespace Isolation vs Virtual Clusters vs Separate Clusters

Multi-tenancy in Kubernetes is not a single problem — it is a spectrum of isolation requirements that vary based on trust boundaries, compliance mandates, and operational capacity. This post examin...

AI/MLJanuary 18, 2026

LLM Cost Optimization at Scale — Prompt Caching, Model Routing, and Batch Inference in Production

LLM API costs follow a simple formula: tokens consumed × price per token. At low volume, this is negligible. At production scale, it becomes a significant line item. A system processing 1 million r...

Frontend DevelopmentMarch 2, 2026

Micro-Frontend Architecture at Scale: Module Federation with React and Webpack 5

The pitch for micro-frontends is compelling: split a monolithic frontend into independently deployable units owned by autonomous teams. The reality is more nuanced. Module Federation, introduced in...

Software DevelopmentJanuary 9, 2026

Microservices vs Monolith — Making the Right Architecture Decision

The architecture decision between microservices and a monolith is not a technology choice — it is an organizational one. The right answer depends on your team size, your domain maturity, your opera...

Cloud ComputingMarch 22, 2026

Multi-Cloud Architecture: Avoiding Vendor Lock-in Without Sacrificing Performance

Multi-cloud is one of the most oversold ideas in infrastructure. The pitch is simple: run workloads across AWS, GCP, and Azure to avoid vendor lock-in, improve resilience, and negotiate better pric...

Backend DevelopmentFebruary 21, 2026

NestJS Microservices with gRPC — Architecture Patterns for High-Throughput APIs

REST and GraphQL dominate client-facing APIs for good reason: browser support, tooling maturity, and developer familiarity. But for service-to-service communication inside a cluster, gRPC offers me...

Staff AugmentationFebruary 27, 2026

Why an Offshore Development Centre (ODC) Beats a Distributed Freelance Model — And How Stripe Systems Sets One Up

Engineering leaders who need to extend capacity beyond their core team face a fundamental choice between two models: hire individual freelancers through marketplace platforms, or establish a dedica...

Frontend DevelopmentFebruary 4, 2026

Building Offline-First PWAs with Next.js, Service Workers, and IndexedDB

Most web applications treat offline support as an afterthought — a "no internet" screen with a sad dinosaur. Offline-first flips this: the app is designed to work without a network connection, and ...

Staff AugmentationFebruary 1, 2026

Beyond Cost Arbitrage: How Stripe Systems' Offshore Teams Deliver Senior-Level Architecture, Not Just Execution

The offshore development industry has a reputation problem, and it is largely self-inflicted. For two decades, the dominant sales pitch was cost arbitrage: "Get the same work done for 60% less." Th...

Staff AugmentationFebruary 10, 2026

How to Onboard an Augmented Team Without Losing Velocity — A 90-Day Playbook for Engineering Leads

The single biggest risk in staff augmentation is not cost, quality, or attrition. It is the velocity dip during onboarding. A team that goes from signing a contract to productive output in 4 weeks ...

Staff AugmentationMarch 15, 2026

Onshore vs Offshore vs Nearshore Augmentation — A Decision Framework for CTOs Beyond Just Cost

Most engineering leaders approach the onshore-vs-offshore decision with a spreadsheet containing hourly rates and a vague sense of "risk." That is insufficient. The actual decision involves at leas...

AI/MLMarch 10, 2026

Building Production-Ready RAG Pipelines — Chunking Strategies, Vector DBs, and Evaluation Frameworks

Retrieval-Augmented Generation (RAG) has become the default architecture for building LLM-powered applications over proprietary data. The core idea is straightforward: instead of fine-tuning a lang...

Engineering CultureMarch 25, 2026

Prompt Engineering for Software Teams: The Internal Playbook We Built to Maximize Developer Output with LLMs

Every developer on your team uses LLMs differently. One engineer writes "make me a login page" and gets generic boilerplate. Another writes a structured prompt with framework constraints, authentic...

Staff AugmentationJanuary 5, 2026

The Real ROI of Offshore vs Nearshore vs Onshore Augmentation — A Data-Driven Cost-Benefit Framework for Engineering Leaders

Every year, engineering leaders evaluate staff augmentation options by comparing hourly rates on a spreadsheet. Offshore at $40–55/hr, nearshore at $65–85/hr, onshore at $130–180/hr. The math looks...

Frontend DevelopmentMarch 16, 2026

Server Components vs Client Components in Next.js 14 — When to Use Which (And Why Most Teams Get It Wrong)

Most teams adopt the Next.js App Router and immediately add `"use client"` to every component that does anything interactive. Within a week, they've recreated a fully client-rendered SPA with extra...

Staff AugmentationFebruary 13, 2026

Setting Up an ODC in India: Legal, Compliance, HR, and Infrastructure — What CTOs and Founders Actually Need to Know

If you are a CTO or founder evaluating India for an Offshore Development Centre (ODC), you have probably encountered two types of advice: breathless marketing from outsourcing firms promising effor...

DevSecOpsMarch 10, 2026

Shifting Security Left: Integrating SAST, DAST, and Secret Scanning into Your CI/CD Pipeline

"Shift left" means running security checks earlier in the development lifecycle — during coding and code review rather than after deployment. The economic argument is straightforward: a vulnerabili...

DevSecOpsFebruary 20, 2026

SOC 2 Type II for Engineering Teams — What Developers Actually Need to Build and Change

SOC 2 Type II audits examine whether your security controls work consistently over a defined observation period — typically 6 to 12 months. Unlike Type I, which captures a point-in-time snapshot, T...

TechnologyJanuary 12, 2026

Staff Augmentation — A Practical Guide for Engineering Leaders

Staff augmentation is a staffing model where external engineers join your team on a contract basis, working under your technical leadership and within your existing processes. Unlike project outsou...

Frontend DevelopmentJanuary 26, 2026

State Management Showdown: Zustand vs Redux Toolkit vs Jotai for Large React Codebases

React 19 shipped server components, and with them came a reasonable question: do we still need client-side state management libraries? The answer is yes, but the reasoning has shifted. Server compo...

Software DevelopmentJanuary 3, 2026

Why Test-Driven Development Is Non-Negotiable in Our Engineering Process

Most teams agree that automated tests are valuable. Far fewer teams write the tests *before* the implementation. The gap between those two positions is where the majority of preventable defects live.

DevOpsFebruary 15, 2026

Terraform at Scale: Remote State, Workspaces, and Module Versioning for Multi-Team Environments

Terraform works well for a single team managing a handful of resources. It does not work well when five teams share a single state file containing 200+ resources. This post covers the specific prob...

Software DevelopmentMarch 15, 2026

Why Custom Software Development Matters for Growing Businesses

In today's competitive landscape, growing businesses face a critical decision: should they rely on off-the-shelf software or invest in custom-built solutions? While pre-built tools offer quick depl...

DevSecOpsJanuary 21, 2026

Zero-Trust API Security — mTLS, JWT Validation, and Rate Limiting in a Kubernetes-Native Stack

Zero-trust networking operates on a simple principle: no request is trusted based on its network origin. A request from inside your VPC receives the same scrutiny as a request from the public inter...

Cloud ComputingFebruary 7, 2026

Building a Zero-Trust Network on GCP with VPC Service Controls and Identity-Aware Proxy

Traditional network security operates on a simple assumption: traffic inside the firewall is trusted, traffic outside is not. This model fails in cloud environments for three reasons. First, there ...

Staff AugmentationApril 28, 2026

2026 Global Software Engineering Rate Benchmark — India vs US vs UK vs LATAM vs Eastern Europe

Most "offshoring rate" guides float a single dollar number per country and call it analysis. That number is almost always wrong — because it conflates raw salary with the fully-loaded cost of empl...

DevOpsApril 28, 2026

DevOps Maturity Benchmarks: What Top 1% Engineering Teams Do Differently in 2026

Most engineering organisations think they have a DevOps problem. They do not. They have a DevOps *belief* problem — they believe their CI/CD pipeline, weekly deploys, and a Datadog dashboard amou...

AI/ML📅 February 28, 2026· 14 min read

Agentic AI in the Enterprise: Designing Multi-Agent Systems with LangGraph and Tool Orchestration

✍️

Stripe Systems Engineering

This is distinct from a chain, where the sequence of steps is predefined. An agent decides its own control flow at runtime.

Agents vs Chains: The Distinction That Matters

A chain is a fixed pipeline: input → step 1 → step 2 → step 3 → output. The steps are determined at development time. A RAG pipeline is a chain. A summarization pipeline is a chain.

An agent adds three capabilities:

✓Tool use: The ability to call external functions — APIs, databases, file systems, calculators — and incorporate the results.
✓Planning: The ability to decide which tools to call and in what order, based on the current goal and observations.
✓State management: The ability to maintain context across multiple steps, remember what has been tried, and track progress toward the goal.

LangGraph: State Machines for AI Agents

This is a meaningful improvement over the "loop until done" pattern used by basic ReAct agents. With LangGraph, the control flow is explicit and inspectable.

Core Concepts

State: A typed dictionary (usually a TypedDict or Pydantic model) that holds all information the agent needs across steps.

from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
import operator

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]
    current_step: str
    tool_results: dict
    error_count: int
    final_output: str | None

Nodes: Functions that take the state as input and return partial state updates.

def parse_intent(state: AgentState) -> dict:
    # LLM call to classify user intent
    messages = state["messages"]
    response = llm.invoke(intent_prompt.format(messages=messages))
    return {"current_step": response.content, "tool_results": {}}

Edges: Define the graph structure. Conditional edges allow branching.

from langgraph.graph import StateGraph, END

graph = StateGraph(AgentState)
graph.add_node("parse_intent", parse_intent)
graph.add_node("lookup_vendor", lookup_vendor)
graph.add_node("validate_budget", validate_budget)

graph.add_conditional_edges(
    "parse_intent",
    route_after_intent,  # function that returns next node name
    {
        "vendor_lookup": "lookup_vendor",
        "budget_check": "validate_budget",
        "unknown": END,
    }
)

Agent Architectures

ReAct (Reasoning + Acting)

The agent alternates between reasoning (thinking about what to do) and acting (calling a tool). After each tool result, the agent reasons again about what to do next.

ReAct is simple and works well for straightforward tool-use scenarios. Its weakness is that it can get stuck in loops — repeatedly trying the same failing action — and it does not plan ahead.

Plan-and-Execute

The agent first generates a complete plan (a list of steps), then executes each step sequentially. After execution, it can replan if results deviate from expectations.

This works better for complex tasks where the order of operations matters. The downside is that the initial plan may be wrong, and replanning is expensive (requires a full LLM call).

Reflexion

Multi-Agent Collaboration

Multiple specialized agents work together on a task. Each agent has a defined role and set of tools. A coordinator routes work between them.

This is the pattern that matters most for enterprise systems, and the one we will focus on.

Multi-Agent Patterns

Supervisor-Worker

A supervisor agent receives the user's request, breaks it into sub-tasks, delegates each sub-task to a specialized worker agent, collects results, and synthesizes a final response.

User → Supervisor → [Worker A, Worker B, Worker C] → Supervisor → Response

Peer-to-Peer

Agents communicate directly with each other without a central coordinator. Each agent can invoke other agents as tools.

This is more flexible but harder to debug. Use it when the workflow is not hierarchical — for example, a negotiation between a buyer agent and a seller agent.

Hierarchical Delegation

Tool Design

Tools are the interface between the agent and the outside world. Poorly designed tools are the number one cause of agent failure.

Principles

✓Atomic: Each tool does one thing. "search_vendor_catalog" not "search_and_compare_vendors".
✓Deterministic: Same input produces same output (when possible). The agent should be able to predict what a tool will do.
✓Well-documented: The tool description is part of the agent's prompt. A vague description leads to incorrect tool selection.
✓Error-explicit: Tools should return structured errors, not raise exceptions. The agent needs to reason about failures.

from langchain_core.tools import tool
from pydantic import BaseModel, Field

class VendorSearchResult(BaseModel):
    vendors: list[dict] = Field(description="List of matching vendors")
    total_count: int = Field(description="Total number of matches")
    error: str | None = Field(default=None, description="Error message if search failed")

@tool
def search_vendor_catalog(
    query: str,
    category: str | None = None,
    max_results: int = 10,
) -> VendorSearchResult:
    """Search the vendor catalog for products matching the query.

    Args:
        query: Product description or name to search for.
        category: Optional category filter (e.g., 'office_supplies', 'it_equipment').
        max_results: Maximum number of results to return (default 10).

    Returns:
        VendorSearchResult with matching vendors and their products.
        Check the error field — if non-null, the search failed and vendors list is empty.
    """
    try:
        results = vendor_db.search(query, category=category, limit=max_results)
        return VendorSearchResult(vendors=results, total_count=len(results))
    except VendorDBError as e:
        return VendorSearchResult(vendors=[], total_count=0, error=str(e))

Tool Count

State Management

Checkpointing

LangGraph supports automatic checkpointing at each node transition. This means you can resume a failed agent from the last successful step rather than restarting from scratch.

from langgraph.checkpoint.sqlite import SqliteSaver

memory = SqliteSaver.from_conn_string(":memory:")
app = graph.compile(checkpointer=memory)

# Invoke with thread_id for persistent state
config = {"configurable": {"thread_id": "procurement-123"}}
result = app.invoke(initial_state, config)

Time-Travel Debugging

With checkpoints, you can inspect the state at any point in the agent's execution history. This is invaluable for debugging:

# Get all checkpoints for a thread
history = list(app.get_state_history(config))
for state in history:
    print(f"Step: {state.values['current_step']}")
    print(f"Messages: {len(state.values['messages'])}")
    print(f"Next: {state.next}")
    print("---")

Human-in-the-Loop

For high-stakes decisions (approving a purchase order, escalating a support ticket), the agent should pause and wait for human confirmation. LangGraph supports this with interrupt nodes:

from langgraph.graph import StateGraph

graph = StateGraph(AgentState)
# ... add nodes ...

# The agent will pause before executing the approval node
app = graph.compile(
    checkpointer=memory,
    interrupt_before=["execute_approval"],
)

# Agent runs until it hits the interrupt
result = app.invoke(initial_state, config)
# Human reviews the state, then resumes
app.invoke(None, config)  # continues from interrupt point

Memory

Short-Term Memory

Long-Term Memory

Information that persists across sessions. Two common approaches:

✓Vector store: Embed and store important facts, decisions, and user preferences. Retrieve relevant memories at the start of each session.
✓Structured summaries: After each session, generate a summary of key decisions and outcomes. Store these in a database.

def update_long_term_memory(state: AgentState, memory_store):
    summary = llm.invoke(
        f"Summarize the key decisions and outcomes from this interaction: "
        f"{state['messages']}"
    )
    memory_store.add(
        text=summary.content,
        metadata={
            "user_id": state.get("user_id"),
            "timestamp": datetime.now().isoformat(),
            "topic": state.get("current_step"),
        }
    )

Error Handling

Agents fail. Tools return errors, LLM calls time out, APIs rate-limit, and the model sometimes produces unparseable output. Production agents need robust error handling at every level.

Retry with Backoff

For transient failures (API timeouts, rate limits):

def resilient_node(state: AgentState) -> dict:
    max_retries = 3
    for attempt in range(max_retries):
        try:
            result = call_external_api(state["query"])
            return {"tool_results": {"api": result}, "error_count": 0}
        except (TimeoutError, RateLimitError) as e:
            if attempt == max_retries - 1:
                return {
                    "tool_results": {"api_error": str(e)},
                    "error_count": state["error_count"] + 1,
                }
            time.sleep(2 ** attempt)

Fallback Strategies

When a tool fails permanently:

✓Use an alternative tool: If the vendor catalog API is down, fall back to a cached version.
✓Ask the user: Present what the agent knows and ask the user to provide the missing information.
✓Graceful degradation: Complete the parts of the task that do not depend on the failed tool. Flag the incomplete parts.

Maximum Iteration Limits

Always set a hard limit on the number of steps an agent can take. An agent stuck in a loop will burn through API credits indefinitely.

def should_continue(state: AgentState) -> str:
    if state["error_count"] >= 3:
        return "error_exit"
    if len(state["messages"]) > 50:  # hard limit on iterations
        return "max_iterations_exit"
    if state["final_output"] is not None:
        return "complete"
    return "continue"

Observability

You cannot debug what you cannot see. Agent systems need observability at multiple levels.

LangSmith Tracing

LangSmith provides end-to-end tracing for LangChain and LangGraph applications. Every LLM call, tool invocation, and state transition is recorded with inputs, outputs, and latency.

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
os.environ["LANGCHAIN_PROJECT"] = "procurement-agent"

Custom Metrics

Track beyond what the tracing platform provides:

✓Token usage per node: Identify which nodes are expensive.
✓Tool success rate: Which tools fail most often?
✓Latency per node: Where is time being spent?
✓Completion rate: What percentage of requests complete successfully?

import time
from dataclasses import dataclass

@dataclass
class NodeMetrics:
    node_name: str
    latency_ms: float
    tokens_used: int
    tool_calls: int
    success: bool

def instrumented_node(node_fn):
    def wrapper(state: AgentState) -> dict:
        start = time.perf_counter()
        try:
            result = node_fn(state)
            metrics = NodeMetrics(
                node_name=node_fn.__name__,
                latency_ms=(time.perf_counter() - start) * 1000,
                tokens_used=result.get("_tokens_used", 0),
                tool_calls=result.get("_tool_calls", 0),
                success=True,
            )
        except Exception as e:
            metrics = NodeMetrics(
                node_name=node_fn.__name__,
                latency_ms=(time.perf_counter() - start) * 1000,
                tokens_used=0, tool_calls=0, success=False,
            )
            raise
        finally:
            emit_metrics(metrics)  # send to your metrics backend
    return wrapper

Security

Sandboxing Tool Execution

Input Validation

Validate tool inputs before execution. An agent might produce malformed inputs — SQL injection in a database query tool, path traversal in a file access tool.

@tool
def query_purchase_orders(
    department: str,
    date_from: str,
    date_to: str,
) -> list[dict]:
    """Query purchase orders by department and date range."""
    # Input validation
    allowed_departments = {"engineering", "marketing", "operations", "hr"}
    if department.lower() not in allowed_departments:
        return {"error": f"Invalid department. Allowed: {allowed_departments}"}

    try:
        datetime.fromisoformat(date_from)
        datetime.fromisoformat(date_to)
    except ValueError:
        return {"error": "Invalid date format. Use ISO 8601 (YYYY-MM-DD)."}

    # Use parameterized queries — never f-strings for SQL
    return db.execute(
        "SELECT * FROM purchase_orders WHERE dept = %s AND created BETWEEN %s AND %s",
        (department, date_from, date_to),
    )

Prompt Injection Defense

Agents that process user input are vulnerable to prompt injection — a user might craft input that overrides the system prompt. Defenses:

✓Separate user input from instructions: Use clear delimiters and instruct the model to treat user input as data.
✓Input sanitization: Strip or escape special characters that might be interpreted as prompt instructions.
✓Output validation: Check that the agent's actions are within its allowed scope before executing them.

Cost Control

Agent Budgets

Set a maximum token budget per agent execution. Track cumulative token usage across all LLM calls and terminate if the budget is exceeded.

class BudgetTracker:
    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.used_tokens = 0

    def track(self, usage):
        self.used_tokens += usage.total_tokens
        if self.used_tokens > self.max_tokens:
            raise BudgetExceededError(
                f"Used {self.used_tokens} tokens, budget is {self.max_tokens}"
            )

Model Routing

from langchain_openai import ChatOpenAI

planner_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
executor_llm = ChatOpenAI(model="gpt-4o", temperature=0)

def parse_intent(state: AgentState) -> dict:
    # Cheap model for intent classification
    response = planner_llm.invoke(intent_prompt.format(messages=state["messages"]))
    return {"current_step": response.content}

def generate_purchase_order(state: AgentState) -> dict:
    # Expensive model for structured document generation
    response = executor_llm.invoke(po_prompt.format(details=state["tool_results"]))
    return {"final_output": response.content}

Case Study: Enterprise Procurement System

The goal: allow users to describe their needs in natural language and have the system generate purchase orders, check vendor catalogs, validate budgets, and route for approval — all within minutes.

System Architecture

Stripe Systems designed and built a 4-agent system using LangGraph:

Agent 1 — IntentParser: Extracts structured procurement details from natural language input. Identifies product category, specifications, quantity, urgency, and department.

Agent 2 — VendorLookup: Searches the vendor catalog for matching products. Compares prices, delivery times, and vendor ratings. Returns ranked vendor options.

Agent 3 — BudgetValidator: Checks the requesting department's remaining budget. Validates against procurement policies (e.g., single-item limits, annual caps).

Agent 4 — ApprovalRouter: Determines the approval chain based on amount, department, and item category. Generates the PO document and routes it.

LangGraph Implementation

State Definition:

from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
import operator

class ProcurementState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]
    # Parsed intent
    product_category: str | None
    product_specs: dict | None
    quantity: int | None
    department: str | None
    urgency: str | None
    # Vendor results
    vendor_options: list[dict]
    selected_vendor: dict | None
    # Budget
    budget_remaining: float | None
    budget_approved: bool
    policy_violations: list[str]
    # Approval
    approval_chain: list[str]
    po_document: str | None
    # Control
    current_step: str
    error_count: int
    error_messages: list[str]

Graph Structure:

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.sqlite import SqliteSaver

def build_procurement_graph():
    graph = StateGraph(ProcurementState)

    # Add agent nodes
    graph.add_node("parse_intent", intent_parser_node)
    graph.add_node("lookup_vendor", vendor_lookup_node)
    graph.add_node("validate_budget", budget_validator_node)
    graph.add_node("route_approval", approval_router_node)
    graph.add_node("handle_error", error_handler_node)
    graph.add_node("request_clarification", clarification_node)

    # Entry point
    graph.set_entry_point("parse_intent")

    # Conditional routing after intent parsing
    graph.add_conditional_edges(
        "parse_intent",
        route_after_parse,
        {
            "vendor_lookup": "lookup_vendor",
            "needs_clarification": "request_clarification",
            "error": "handle_error",
        },
    )

    # Conditional routing after vendor lookup
    graph.add_conditional_edges(
        "lookup_vendor",
        route_after_vendor,
        {
            "budget_check": "validate_budget",
            "no_vendors": "request_clarification",
            "error": "handle_error",
        },
    )

    # Conditional routing after budget validation
    graph.add_conditional_edges(
        "validate_budget",
        route_after_budget,
        {
            "approved": "route_approval",
            "over_budget": "request_clarification",
            "policy_violation": "request_clarification",
            "error": "handle_error",
        },
    )

    # Approval routing always terminates
    graph.add_edge("route_approval", END)
    graph.add_edge("handle_error", END)
    graph.add_edge("request_clarification", END)

    return graph.compile(
        checkpointer=SqliteSaver.from_conn_string("procurement.db"),
        interrupt_before=["route_approval"],  # human approval before final PO
    )

def route_after_parse(state: ProcurementState) -> str:
    if state["error_count"] >= 3:
        return "error"
    if state["product_category"] is None or state["quantity"] is None:
        return "needs_clarification"
    return "vendor_lookup"

def route_after_vendor(state: ProcurementState) -> str:
    if state["error_count"] >= 3:
        return "error"
    if not state["vendor_options"]:
        return "no_vendors"
    return "budget_check"

def route_after_budget(state: ProcurementState) -> str:
    if state["error_count"] >= 3:
        return "error"
    if state["policy_violations"]:
        return "policy_violation"
    if not state["budget_approved"]:
        return "over_budget"
    return "approved"

Vendor Lookup Node with Fallback:

def vendor_lookup_node(state: ProcurementState) -> dict:
    category = state["product_category"]
    specs = state["product_specs"]
    quantity = state["quantity"]

    # Primary: Live vendor catalog API
    try:
        results = vendor_api.search(
            category=category,
            specs=specs,
            min_quantity=quantity,
        )
        if results:
            return {
                "vendor_options": results,
                "selected_vendor": results[0],  # pre-select best match
                "current_step": "vendor_lookup_complete",
            }
    except VendorAPIError as e:
        logger.warning(f"Vendor API failed: {e}")

    # Fallback: Cached catalog (updated nightly)
    try:
        cached_results = cached_catalog.search(
            category=category,
            specs=specs,
        )
        if cached_results:
            return {
                "vendor_options": cached_results,
                "selected_vendor": cached_results[0],
                "current_step": "vendor_lookup_complete_cached",
                "error_messages": state["error_messages"] + [
                    "Live catalog unavailable. Using cached data (last updated: "
                    f"{cached_catalog.last_updated}). Prices may differ."
                ],
            }
    except Exception as e:
        logger.error(f"Cached catalog also failed: {e}")

    # Both failed
    return {
        "vendor_options": [],
        "error_count": state["error_count"] + 1,
        "error_messages": state["error_messages"] + [
            "Unable to search vendor catalog. Both live and cached sources unavailable."
        ],
    }

Sample Execution Trace

User input: "I need 50 ergonomic office chairs for the engineering floor. Budget is flexible but ideally under $500 per chair. We need them within 3 weeks."

[Step 1: parse_intent] 
  Input: User message
  Output: category=office_furniture, specs={type: ergonomic_chair, features: [adjustable_height, lumbar_support]},
          quantity=50, department=engineering, urgency=3_weeks, budget_hint=$500/unit
  Tokens: 847 (gpt-4o-mini)
  Latency: 420ms

[Step 2: lookup_vendor]
  Input: category=office_furniture, specs=..., quantity=50
  Tool call: vendor_api.search(...)
  Output: 4 vendors found
    - ErgoMax Pro Chair: $429/unit, delivery 2 weeks, rating 4.7
    - ComfortElite Series: $489/unit, delivery 10 days, rating 4.5
    - BasicErgo Model: $319/unit, delivery 3 weeks, rating 4.1
    - PremiumPosture X1: $612/unit, delivery 1 week, rating 4.8
  Selected: ErgoMax Pro Chair (best value within budget)
  Tokens: 1,203 (gpt-4o)
  Latency: 1,850ms (including API call)

[Step 3: validate_budget]
  Input: department=engineering, amount=$21,450 (50 × $429)
  Tool call: budget_api.check(department=engineering, amount=21450)
  Output: budget_remaining=$45,200, policy_check=PASS, budget_approved=True
  Tokens: 523 (gpt-4o-mini)
  Latency: 380ms

[Step 4: INTERRUPT — human approval required]
  PO generated, waiting for manager approval
  Approval chain: [engineering_lead, procurement_officer]

[Step 5: route_approval] (after human approval)
  PO #ENG-2026-0847 generated and submitted
  Notification sent to: vendor, requesting department, finance
  Tokens: 689 (gpt-4o)
  Latency: 520ms

Total: 3,262 tokens, 3.17 seconds execution time

Results

After 3 months of deployment:

✓Average request-to-PO time: 12 minutes (down from 4.5 days), with most of that being human approval wait time
✓Agent execution time: under 5 seconds for 94% of requests
✓Successful completions: 87% (remaining 13% require human intervention due to ambiguous requests or out-of-catalog items)
✓Monthly LLM cost: $340 for ~1,200 procurement requests
✓Fallback to cached catalog: triggered 3% of the time (vendor API downtime)
✓Policy violations caught by BudgetValidator that would have been missed manually: 8 per month on average

Ready to discuss your project?

Get in Touch →

Related Services from Stripe Systems

Stripe Systems helps teams put the patterns covered in this article into production.

AI/ML Solutions

Machine learning models and AI integrations grounded in measurable business outcomes.

Learn more →

← Back to Blog

Software DevelopmentFebruary 10, 2026

Agile vs Waterfall — Choosing the Right Methodology for Your Project

Engineering CultureMarch 5, 2026

AI-Assisted Code Review at Scale: How We Cut Review Cycle Time by 60% Without Sacrificing Architecture Standards

Engineering CultureFebruary 5, 2026

The AI-Augmented SDLC: How We've Embedded AI at Every Phase — From Requirements to Deployment

Quality AssuranceMarch 15, 2026

How AI Is Transforming Automated Testing — Unit Tests, Code Coverage, and E2E Integration

AI/MLMarch 19, 2026

AI Code Review Agents: How We Built a Custom Pipeline That Catches Architecture Violations, Not Just Bugs

Engineering CultureMarch 20, 2026

How Our Engineering Team Uses AI Tools Daily to Ship Faster, Catch More Bugs, and Write Better Code — A Practitioner's Honest Breakdown

Backend DevelopmentJanuary 15, 2026

API Gateway Patterns: BFF vs Aggregator vs Direct — Choosing for Your Stack

Cloud ComputingFebruary 24, 2026

AWS Lambda Cold Starts — Root Causes, Benchmarks, and 7 Proven Mitigation Strategies

Cloud ComputingFebruary 15, 2026

AWS vs Azure vs GCP for Startups in 2026 — An Honest Cost and Capability Breakdown

Mobile DevelopmentMarch 1, 2026

Choosing the Right Mobile Development Approach: Native vs Cross-Platform

DevOpsMarch 7, 2026

Building a Production-Grade CI/CD Pipeline for a Monorepo (GitHub Actions + Docker + Kubernetes)

Backend DevelopmentJanuary 29, 2026

Clean Architecture in .NET 8 — Structuring Enterprise Apps That Scale Without Rot

Mobile DevelopmentJanuary 6, 2026

CLEAN Architecture in Flutter — BLoC vs Riverpod for State Management

DevOpsFebruary 28, 2026

How DevOps and DevSecOps Integrate Into Enterprise Product Development From Day One

DevOpsJanuary 23, 2026

Docker Image Hardening for Production — Distroless, Non-Root Users, and Layer Optimization

Backend DevelopmentJanuary 18, 2026

Event-Driven Architecture with Kafka, NestJS, and Outbox Pattern — A Production Walkthrough

Cloud ComputingMarch 5, 2026

FinOps in Practice: How We Cut a Client's AWS Bill by 40% Without Touching Their Codebase

Mobile DevelopmentJanuary 10, 2026

Flutter vs React Native in 2026 — A Deep Technical Comparison for Enterprise Apps

DevOpsMarch 13, 2026

GitOps with ArgoCD and Terraform: The Infrastructure Deployment Workflow That Eliminates Drift

DevSecOpsFebruary 18, 2026

Infrastructure as Code Security: Detecting Misconfigurations with Checkov and OPA Before Deployment

Backend DevelopmentFebruary 10, 2026

Java Virtual Threads (Project Loom) vs Node.js — Concurrency Models Compared for Backend Engineers

DevOpsJanuary 25, 2026

Kubernetes Multi-Tenancy Patterns — Namespace Isolation vs Virtual Clusters vs Separate Clusters

AI/MLJanuary 18, 2026

LLM Cost Optimization at Scale — Prompt Caching, Model Routing, and Batch Inference in Production

Frontend DevelopmentMarch 2, 2026

Micro-Frontend Architecture at Scale: Module Federation with React and Webpack 5

Software DevelopmentJanuary 9, 2026

Microservices vs Monolith — Making the Right Architecture Decision

Cloud ComputingMarch 22, 2026

Multi-Cloud Architecture: Avoiding Vendor Lock-in Without Sacrificing Performance

Backend DevelopmentFebruary 21, 2026

NestJS Microservices with gRPC — Architecture Patterns for High-Throughput APIs

Staff AugmentationFebruary 27, 2026

Why an Offshore Development Centre (ODC) Beats a Distributed Freelance Model — And How Stripe Systems Sets One Up

Frontend DevelopmentFebruary 4, 2026

Building Offline-First PWAs with Next.js, Service Workers, and IndexedDB

Staff AugmentationFebruary 1, 2026

Beyond Cost Arbitrage: How Stripe Systems' Offshore Teams Deliver Senior-Level Architecture, Not Just Execution

Staff AugmentationFebruary 10, 2026

How to Onboard an Augmented Team Without Losing Velocity — A 90-Day Playbook for Engineering Leads

Staff AugmentationMarch 15, 2026

Onshore vs Offshore vs Nearshore Augmentation — A Decision Framework for CTOs Beyond Just Cost

AI/MLMarch 10, 2026

Building Production-Ready RAG Pipelines — Chunking Strategies, Vector DBs, and Evaluation Frameworks

Engineering CultureMarch 25, 2026

Prompt Engineering for Software Teams: The Internal Playbook We Built to Maximize Developer Output with LLMs

Staff AugmentationJanuary 5, 2026

The Real ROI of Offshore vs Nearshore vs Onshore Augmentation — A Data-Driven Cost-Benefit Framework for Engineering Leaders

Frontend DevelopmentMarch 16, 2026

Server Components vs Client Components in Next.js 14 — When to Use Which (And Why Most Teams Get It Wrong)

Staff AugmentationFebruary 13, 2026

Setting Up an ODC in India: Legal, Compliance, HR, and Infrastructure — What CTOs and Founders Actually Need to Know

DevSecOpsMarch 10, 2026

Shifting Security Left: Integrating SAST, DAST, and Secret Scanning into Your CI/CD Pipeline

DevSecOpsFebruary 20, 2026

SOC 2 Type II for Engineering Teams — What Developers Actually Need to Build and Change

TechnologyJanuary 12, 2026

Staff Augmentation — A Practical Guide for Engineering Leaders

Frontend DevelopmentJanuary 26, 2026

State Management Showdown: Zustand vs Redux Toolkit vs Jotai for Large React Codebases

Software DevelopmentJanuary 3, 2026

Why Test-Driven Development Is Non-Negotiable in Our Engineering Process

Most teams agree that automated tests are valuable. Far fewer teams write the tests *before* the implementation. The gap between those two positions is where the majority of preventable defects live.

DevOpsFebruary 15, 2026

Terraform at Scale: Remote State, Workspaces, and Module Versioning for Multi-Team Environments

Software DevelopmentMarch 15, 2026

Why Custom Software Development Matters for Growing Businesses

DevSecOpsJanuary 21, 2026

Zero-Trust API Security — mTLS, JWT Validation, and Rate Limiting in a Kubernetes-Native Stack

Cloud ComputingFebruary 7, 2026

Building a Zero-Trust Network on GCP with VPC Service Controls and Identity-Aware Proxy

Staff AugmentationApril 28, 2026

2026 Global Software Engineering Rate Benchmark — India vs US vs UK vs LATAM vs Eastern Europe

DevOpsApril 28, 2026

Agentic AI in the Enterprise: Designing Multi-Agent Systems with LangGraph and Tool Orchestration

Agents vs Chains: The Distinction That Matters

LangGraph: State Machines for AI Agents

Core Concepts

Agent Architectures

ReAct (Reasoning + Acting)

Plan-and-Execute

Reflexion

Multi-Agent Collaboration

Multi-Agent Patterns

Supervisor-Worker

Peer-to-Peer

Hierarchical Delegation

Tool Design

Principles

Tool Count

State Management

Checkpointing

Time-Travel Debugging

Human-in-the-Loop

Memory

Short-Term Memory

Long-Term Memory

Error Handling

Retry with Backoff

Fallback Strategies

Maximum Iteration Limits

Observability

LangSmith Tracing

Custom Metrics

Security

Sandboxing Tool Execution

Input Validation

Prompt Injection Defense

Cost Control

Agent Budgets

Model Routing

Case Study: Enterprise Procurement System

System Architecture

LangGraph Implementation

Sample Execution Trace

Results

Related Services from Stripe Systems

AI/ML Solutions

More Articles

Agile vs Waterfall — Choosing the Right Methodology for Your Project

AI-Assisted Code Review at Scale: How We Cut Review Cycle Time by 60% Without Sacrificing Architecture Standards

The AI-Augmented SDLC: How We've Embedded AI at Every Phase — From Requirements to Deployment

How AI Is Transforming Automated Testing — Unit Tests, Code Coverage, and E2E Integration

AI Code Review Agents: How We Built a Custom Pipeline That Catches Architecture Violations, Not Just Bugs

How Our Engineering Team Uses AI Tools Daily to Ship Faster, Catch More Bugs, and Write Better Code — A Practitioner's Honest Breakdown

API Gateway Patterns: BFF vs Aggregator vs Direct — Choosing for Your Stack

AWS Lambda Cold Starts — Root Causes, Benchmarks, and 7 Proven Mitigation Strategies

AWS vs Azure vs GCP for Startups in 2026 — An Honest Cost and Capability Breakdown

Choosing the Right Mobile Development Approach: Native vs Cross-Platform

Building a Production-Grade CI/CD Pipeline for a Monorepo (GitHub Actions + Docker + Kubernetes)

Clean Architecture in .NET 8 — Structuring Enterprise Apps That Scale Without Rot

CLEAN Architecture in Flutter — BLoC vs Riverpod for State Management

How DevOps and DevSecOps Integrate Into Enterprise Product Development From Day One

Docker Image Hardening for Production — Distroless, Non-Root Users, and Layer Optimization

Event-Driven Architecture with Kafka, NestJS, and Outbox Pattern — A Production Walkthrough

FinOps in Practice: How We Cut a Client's AWS Bill by 40% Without Touching Their Codebase

Flutter vs React Native in 2026 — A Deep Technical Comparison for Enterprise Apps

GitOps with ArgoCD and Terraform: The Infrastructure Deployment Workflow That Eliminates Drift

Infrastructure as Code Security: Detecting Misconfigurations with Checkov and OPA Before Deployment

Java Virtual Threads (Project Loom) vs Node.js — Concurrency Models Compared for Backend Engineers

Kubernetes Multi-Tenancy Patterns — Namespace Isolation vs Virtual Clusters vs Separate Clusters

LLM Cost Optimization at Scale — Prompt Caching, Model Routing, and Batch Inference in Production

Micro-Frontend Architecture at Scale: Module Federation with React and Webpack 5

Microservices vs Monolith — Making the Right Architecture Decision

Multi-Cloud Architecture: Avoiding Vendor Lock-in Without Sacrificing Performance

NestJS Microservices with gRPC — Architecture Patterns for High-Throughput APIs

Why an Offshore Development Centre (ODC) Beats a Distributed Freelance Model — And How Stripe Systems Sets One Up

Building Offline-First PWAs with Next.js, Service Workers, and IndexedDB

Beyond Cost Arbitrage: How Stripe Systems' Offshore Teams Deliver Senior-Level Architecture, Not Just Execution

How to Onboard an Augmented Team Without Losing Velocity — A 90-Day Playbook for Engineering Leads

Onshore vs Offshore vs Nearshore Augmentation — A Decision Framework for CTOs Beyond Just Cost

Building Production-Ready RAG Pipelines — Chunking Strategies, Vector DBs, and Evaluation Frameworks

Prompt Engineering for Software Teams: The Internal Playbook We Built to Maximize Developer Output with LLMs

The Real ROI of Offshore vs Nearshore vs Onshore Augmentation — A Data-Driven Cost-Benefit Framework for Engineering Leaders