Stripe Systems
AI/ML · January 18, 2026 · 14 min read

LLM Cost Optimization at Scale — Prompt Caching, Model Routing, and Batch Inference in Production

Stripe Systems Engineering

LLM API costs follow a simple formula: tokens consumed × price per token. At low volume, this is negligible. At production scale, it becomes a significant line item. A system processing 1 million requests per day at an average of 1,000 tokens per request (input + output) on GPT-4o costs roughly $7,500/day — $225,000/month. Even GPT-4o-mini at the same volume runs $450/day.

Most teams discover this problem after launch, when the invoices arrive. The good news is that there are systematic, engineering-driven approaches to reduce LLM costs by 50-80% without degrading output quality. This post covers the major strategies: caching, model routing, batch inference, prompt optimization, token budgeting, self-hosting economics, and the evaluation framework needed to ensure cost cuts do not break things.

The Cost Anatomy of an LLM Request

Before optimizing, understand where the money goes:

Total cost = (input_tokens × input_price) + (output_tokens × output_price)

For GPT-4o (as of early 2026):

  • Input: $2.50 per 1M tokens
  • Output: $10.00 per 1M tokens

For GPT-4o-mini:

  • Input: $0.15 per 1M tokens
  • Output: $0.60 per 1M tokens

Output tokens are 4× more expensive than input tokens on GPT-4o. This means reducing output length has a disproportionate impact on cost. A verbose 500-token response costs 4× more than a concise 125-token response on the output side alone.
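
As a sanity check, the formula above can be wrapped in a small helper. The function name and the token split in the example are illustrative, not from any SDK; the prices are the ones listed above.

```python
# Per-1M-token prices from the list above
PRICES_PER_1M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request: tokens consumed x price per token."""
    p = PRICES_PER_1M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

An output-heavy 300-in/700-out request on GPT-4o comes to about $0.0078, or roughly $7,750/day at 1M requests, in line with the estimate above.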

Strategy 1: Prompt Caching

Many LLM applications see the same or very similar questions repeatedly. Customer support systems, FAQ bots, documentation assistants — the query distribution follows a power law where a small number of queries account for a large share of traffic.

Exact-Match Caching

The simplest form: hash the full prompt (system + user message) and cache the response. If the same prompt appears again, return the cached response without calling the LLM.

import hashlib
import json

from openai import OpenAI
from redis import Redis

openai_client = OpenAI()
redis = Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 3600  # 1 hour

def cached_llm_call(messages: list[dict], model: str, **kwargs) -> str:
    cache_key = hashlib.sha256(
        json.dumps({"messages": messages, "model": model}, sort_keys=True).encode()
    ).hexdigest()

    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)

    response = openai_client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    result = response.choices[0].message.content

    redis.setex(cache_key, CACHE_TTL, json.dumps(result))
    return result

Exact-match caching has a low hit rate for conversational applications (every conversation is unique) but works well for structured queries — the same product lookup, the same policy question phrased identically.

Semantic Caching

Two users asking "What's the return policy?" and "How do I return an item?" should get the same cached response. Semantic caching uses embedding similarity instead of exact matching:

  1. Embed the incoming query
  2. Search for similar queries in the cache (cosine similarity > threshold)
  3. If a match is found, return the cached response
  4. If not, call the LLM, cache the response with the query embedding

import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92):
        self.threshold = similarity_threshold
        self.cache: list[dict] = []  # in production, use a vector DB

    def _embed(self, text: str) -> list[float]:
        response = client.embeddings.create(
            model="text-embedding-3-small", input=text
        )
        return response.data[0].embedding

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> str | None:
        query_embedding = self._embed(query)
        best_match = None
        best_score = 0.0

        for entry in self.cache:
            score = self._cosine_similarity(query_embedding, entry["embedding"])
            if score > best_score:
                best_score = score
                best_match = entry

        if best_match and best_score >= self.threshold:
            return best_match["response"]
        return None

    def put(self, query: str, response: str):
        embedding = self._embed(query)
        self.cache.append({
            "query": query,
            "embedding": embedding,
            "response": response,
        })

The similarity threshold is critical. Too low (0.85) and you return irrelevant cached responses. Too high (0.98) and the hit rate drops to near-zero. Start at 0.92 and tune based on quality evaluations.

Cache Invalidation

Cached responses go stale when the underlying data changes. Strategies:

  • TTL-based: Simple, predictable. Set TTL based on how frequently your data changes.
  • Event-based: Invalidate cache entries when relevant documents are updated. Requires tracking which source documents informed each cached response.
  • Versioned: Include a data version in the cache key. When data updates, increment the version and old entries naturally expire.
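
The versioned strategy is the easiest to get right. A minimal sketch, where `DATA_VERSION` and the function name are illustrative:

```python
import hashlib
import json

DATA_VERSION = 7  # bump whenever the underlying data is re-indexed

def versioned_cache_key(messages: list[dict], model: str,
                        data_version: int = DATA_VERSION) -> str:
    """Cache key that changes automatically when the data version changes."""
    payload = {"messages": messages, "model": model, "v": data_version}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```

Bumping `DATA_VERSION` orphans every old entry at once; the stale keys then age out through the normal TTL.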

Strategy 2: Model Routing

Not every request requires GPT-4o. A simple greeting ("Hi, how can I help?") does not need the same model as a complex multi-step reasoning task. Model routing sends each request to the most cost-effective model that can handle it.

Classification-Based Routing

Train a lightweight classifier (or use a small LLM) to categorize incoming requests by complexity:

from enum import Enum

from openai import OpenAI

openai_client = OpenAI()

class Complexity(Enum):
    SIMPLE = "simple"      # greetings, FAQs, simple lookups
    MODERATE = "moderate"  # multi-step reasoning, summarization
    COMPLEX = "complex"    # analysis, code generation, nuanced decisions

MODEL_MAP = {
    Complexity.SIMPLE: "gpt-4o-mini",
    Complexity.MODERATE: "gpt-4o-mini",
    Complexity.COMPLEX: "gpt-4o",
}

PRICE_MAP = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},  # per 1K tokens
    "gpt-4o": {"input": 0.0025, "output": 0.01},
}

def classify_complexity(query: str) -> Complexity:
    # Use a fine-tuned classifier or a cheap LLM call
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify the complexity of this customer support query. "
                "Respond with exactly one word: simple, moderate, or complex.\n"
                "simple: greetings, FAQs, status checks\n"
                "moderate: how-to questions, comparisons, multi-step requests\n"
                "complex: complaints needing investigation, technical debugging, "
                "policy exceptions"
            )},
            {"role": "user", "content": query},
        ],
        max_tokens=10,
    )
    label = response.choices[0].message.content.strip().lower()
    try:
        return Complexity(label)
    except ValueError:
        # Unparseable label: fail safe by routing to the strong model
        return Complexity.COMPLEX

def route_request(query: str, messages: list[dict]) -> str:
    complexity = classify_complexity(query)
    model = MODEL_MAP[complexity]
    response = openai_client.chat.completions.create(
        model=model, messages=messages
    )
    return response.choices[0].message.content

The router itself costs tokens (the classification call), so it must be cheap. GPT-4o-mini with max_tokens=10 costs a fraction of a cent per classification.

Confidence-Based Fallback

A more sophisticated approach: always try the cheap model first. If its confidence is low (measured by logprobs or a self-assessment), escalate to the expensive model.

def route_with_fallback(messages: list[dict]) -> str:
    # Try cheap model first
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages + [
            {"role": "system", "content": (
                "After your response, rate your confidence on a scale of 1-5 "
                "where 5 means you are certain your answer is correct and complete. "
                "Format: [CONFIDENCE: N]"
            )}
        ],
    )
    content = response.choices[0].message.content

    # Extract confidence
    confidence = extract_confidence(content)  # parse [CONFIDENCE: N]

    if confidence >= 4:
        return strip_confidence_tag(content)

    # Low confidence — escalate to expensive model
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    return response.choices[0].message.content

This approach is self-correcting: the expensive model only runs when needed. In practice, 60-80% of requests can be handled by the cheap model.
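
The `extract_confidence` and `strip_confidence_tag` helpers above are left undefined; one possible implementation, assuming the exact `[CONFIDENCE: N]` tag format the system prompt requests:

```python
import re

_CONF_RE = re.compile(r"\[CONFIDENCE:\s*(\d)\]")

def extract_confidence(content: str, default: int = 0) -> int:
    """Parse the 1-5 self-assessment; a missing tag counts as low confidence."""
    match = _CONF_RE.search(content)
    return int(match.group(1)) if match else default

def strip_confidence_tag(content: str) -> str:
    """Remove the tag so users never see it."""
    return _CONF_RE.sub("", content).strip()
```

Treating a missing tag as low confidence means malformed responses escalate rather than ship unverified.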

Strategy 3: Batch Inference

If your application can tolerate latency (email processing, nightly report generation, bulk classification), batch inference offers significant savings.

OpenAI Batch API

OpenAI offers a 50% discount for batch requests with 24-hour turnaround:

import json

# Prepare batch file. Assumes `client` is an OpenAI() instance and `tickets`
# is an iterable of objects with `id` and `text` attributes.
requests = []
for ticket in tickets:
    requests.append({
    requests.append({
        "custom_id": f"ticket-{ticket.id}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": CLASSIFICATION_PROMPT},
                {"role": "user", "content": ticket.text},
            ],
            "max_tokens": 100,
        },
    })

# Write JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
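
Submitting is only half the job; the batch must be polled and its output file parsed. A sketch, where the polling loop is schematic and the output-file shape follows OpenAI's documented batch JSONL format:

```python
import json

def parse_batch_results(jsonl_text: str) -> dict[str, str]:
    """Map each custom_id to its completion text from a batch output file."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        body = row["response"]["body"]
        results[row["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

# Schematic polling loop (`client` and `batch_job` come from the submission above):
#   job = client.batches.retrieve(batch_job.id)
#   if job.status == "completed":
#       text = client.files.content(job.output_file_id).text
#       results = parse_batch_results(text)
```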

Async Processing Queues

For requests that need faster turnaround but can still be batched, use a queue-based architecture:

# Producer: enqueue requests
import asyncio
import json
import time

import redis

r = redis.Redis()

def enqueue_llm_request(request_id: str, messages: list[dict]):
    r.lpush("llm_queue", json.dumps({
        "id": request_id,
        "messages": messages,
        "enqueued_at": time.time(),
    }))

# Consumer: process in batches
def process_batch(batch_size: int = 20, max_wait_seconds: int = 5):
    batch = []
    deadline = time.time() + max_wait_seconds

    while len(batch) < batch_size and time.time() < deadline:
        item = r.brpop("llm_queue", timeout=1)
        if item:
            batch.append(json.loads(item[1]))

    if not batch:
        return

    # Process batch concurrently with asyncio
    results = asyncio.run(process_concurrent(batch))
    for request_id, result in results:
        r.set(f"llm_result:{request_id}", json.dumps(result), ex=3600)

Batching amortizes overhead and allows you to use rate limits more efficiently.
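
The `process_concurrent` helper referenced above can be sketched with `asyncio.gather`. Here the LLM call is injected as `call_fn` (with `AsyncOpenAI`, that would wrap `chat.completions.create`) so the fan-out logic stays independent of any one client:

```python
import asyncio

async def process_concurrent(batch: list[dict], call_fn,
                             max_concurrency: int = 10):
    """Fan a batch of queued requests out to `call_fn`, capping in-flight calls."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(item: dict):
        async with sem:  # respect provider rate limits
            result = await call_fn(item["messages"])
            return item["id"], result

    # gather preserves input order
    return await asyncio.gather(*(one(item) for item in batch))
```

The semaphore caps concurrency so a large batch does not trip per-minute rate limits.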

Strategy 4: Prompt Optimization

Shorter prompts cost less. This is obvious but underappreciated. Many production prompts contain redundant instructions, excessive examples, and verbose formatting that can be reduced without affecting quality.

Reduce Few-Shot Examples

Few-shot examples are expensive — each example consumes input tokens on every request. Reduce the number of examples to the minimum needed for consistent output:

# Before: 5 examples (in real prompts, often hundreds of tokens)
PROMPT_V1 = """Classify the sentiment of this review.

Example 1: "Great product!" → positive
Example 2: "Terrible experience." → negative
Example 3: "It's okay." → neutral
Example 4: "Absolutely love it!" → positive
Example 5: "Would not recommend." → negative

Review: {review}
Sentiment:"""

# After: 1 example per class
PROMPT_V2 = """Classify the sentiment as positive, negative, or neutral.

Examples:
"Great product!" → positive
"Terrible experience." → negative
"It's okay." → neutral

Review: {review}
Sentiment:"""

Run an evaluation to verify that reducing examples does not degrade accuracy. Often, 1-2 examples per class is sufficient for well-defined tasks.

Use Structured Output

Instead of asking the model to generate free-form text and then parsing it, request structured JSON output. This reduces output tokens and eliminates parsing errors:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={"type": "json_object"},
    max_tokens=150,  # JSON is typically more concise
)

Instruction Compression

Review your system prompts for redundancy. LLMs do not need verbose, human-friendly instructions:

# Before (87 tokens):
"""You are a helpful customer support assistant for our company.
When a customer asks a question, you should look at the provided context
and answer their question based on that context. If you cannot find
the answer in the context, please let the customer know that you
don't have that information available."""

# After (42 tokens):
"""Answer the customer's question using only the provided context.
If the context lacks the answer, say you don't have that information."""

Both produce equivalent behavior. The second saves 45 tokens per request — at 1M requests/day on GPT-4o, that is $112/day in input token savings alone.
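
The arithmetic, spelled out with the figures from the paragraph above:

```python
# An 87-token system prompt trimmed to 42 tokens
tokens_saved = 87 - 42                    # 45 tokens per request
requests_per_day = 1_000_000
gpt4o_input_price = 2.50 / 1_000_000      # $2.50 per 1M input tokens

daily_savings = tokens_saved * requests_per_day * gpt4o_input_price
# daily_savings is about 112.5 dollars/day, matching the figure above
```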

Strategy 5: Token Budgeting

Set max_tokens Limits

Always set max_tokens to a reasonable limit for your use case. Without it, the model might generate a 2,000-token response when you only need 100 tokens.

# Classification: max 10 tokens
response = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages, max_tokens=10
)

# Short answer: max 150 tokens
response = client.chat.completions.create(
    model="gpt-4o-mini", messages=messages, max_tokens=150
)

# Detailed explanation: max 500 tokens
response = client.chat.completions.create(
    model="gpt-4o", messages=messages, max_tokens=500
)

Stop Sequences

Use stop sequences to terminate generation early when the model has produced the needed output:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    stop=["\n\n", "---"],  # stop at paragraph break or separator
)

Strategy 6: Fine-Tuning vs Prompting

A fine-tuned GPT-4o-mini can often match GPT-4o quality on a specific task at a fraction of the cost. The economics:

| Approach | Per-request cost (1K tokens) | Quality (task-specific) |
|---|---|---|
| GPT-4o with 5-shot prompt | $0.0075 | High |
| GPT-4o-mini with 5-shot prompt | $0.00045 | Moderate |
| Fine-tuned GPT-4o-mini (0-shot) | $0.00024 | High (on trained task) |

Fine-tuning costs: ~$25 for a small training set (500 examples), one-time. If you are making more than 100K requests/month on a well-defined task, fine-tuning almost always pays for itself within the first month.

The catch: fine-tuning is only effective for well-defined, consistent tasks. It does not help for open-ended reasoning or novel queries.
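
The breakeven point falls out of the per-request costs in the table. A sketch (the function name is illustrative):

```python
import math

def finetune_breakeven_requests(training_cost: float,
                                prompted_cost: float,
                                finetuned_cost: float) -> int:
    """Requests needed before fine-tuning pays for itself."""
    return math.ceil(training_cost / (prompted_cost - finetuned_cost))

# GPT-4o-mini 5-shot vs fine-tuned, per-request costs from the table:
# finetune_breakeven_requests(25.0, 0.00045, 0.00024) → 119048 requests,
# consistent with the ~100K requests/month rule of thumb above.
```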

Strategy 7: Self-Hosted Models

Running open-source models on your own infrastructure eliminates per-token costs entirely. The question is whether the infrastructure cost is lower than the API cost.

Cost Breakeven Analysis

Running Llama 3.1 70B with vLLM, tensor-parallel across four NVIDIA A100 GPUs:

  • Cloud GPU cost: ~$2.50/hour (AWS p4d.24xlarge, amortized)
  • Throughput: ~30 requests/second with vLLM
  • Monthly cost: ~$1,800/month
  • Equivalent API cost: 30 req/s × 86,400 s/day × 30 days × $0.0003/req = $23,328/month

At 30 requests/second sustained throughput, self-hosting is roughly 13× cheaper. But the breakeven depends on your actual utilization. If you only process 1 request/second, the GPU still costs $1,800/month while the API would cost only $777/month.
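
A small helper makes the breakeven explicit. The names are illustrative; plug in your own GPU and API figures:

```python
SECONDS_PER_MONTH = 86_400 * 30

def self_host_breakeven_rps(monthly_gpu_cost: float,
                            api_cost_per_request: float) -> float:
    """Sustained requests/second above which self-hosting beats the API."""
    return monthly_gpu_cost / (SECONDS_PER_MONTH * api_cost_per_request)

# With the figures above: self_host_breakeven_rps(1800, 0.0003) ≈ 2.3 req/s,
# so a GPU running at 30 req/s sustained is comfortably past breakeven.
```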

vLLM Deployment

# Start vLLM server
# vllm serve meta-llama/Llama-3.1-70B-Instruct \
#   --tensor-parallel-size 4 \
#   --max-model-len 8192 \
#   --gpu-memory-utilization 0.9

# Use OpenAI-compatible API
from openai import OpenAI

local_client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

response = local_client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=messages,
)

Self-hosting adds operational complexity: GPU procurement, model updates, monitoring, failover. The decision should be based on a realistic assessment of your team's infrastructure capabilities.

Monitoring and Cost Allocation

You cannot optimize what you do not measure. Track token usage at multiple levels:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class LLMUsageRecord:
    timestamp: datetime
    model: str
    feature: str        # which product feature triggered this call
    team: str           # which team owns this feature
    input_tokens: int
    output_tokens: int
    cached: bool
    cost_usd: float
    latency_ms: float

def log_usage(response, feature: str, team: str, latency_ms: float, cached: bool = False):
    # calculate_cost and metrics_backend are application-specific helpers
    usage = response.usage
    model = response.model
    cost = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)

    record = LLMUsageRecord(
        timestamp=datetime.utcnow(),
        model=model,
        feature=feature,
        team=team,
        input_tokens=usage.prompt_tokens,
        output_tokens=usage.completion_tokens,
        cached=cached,
        cost_usd=cost,
        latency_ms=latency_ms,  # measure around the API call; the SDK does not expose this reliably
    )
    metrics_backend.emit(record)

Build dashboards that show:

  • Daily/weekly/monthly spend by feature and team
  • Cost per request by model and feature
  • Cache hit rates
  • Model routing distribution
  • Token usage trends
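
With records in that shape, per-feature and per-team rollups are a simple fold. A sketch; `spend_by` is illustrative and works on any object carrying the right attributes:

```python
from collections import defaultdict

def spend_by(records, key: str) -> dict[str, float]:
    """Sum cost_usd grouped by an attribute such as 'feature' or 'team'."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[getattr(r, key)] += r.cost_usd
    return dict(totals)
```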

Evaluation: Ensuring Quality Survives Cost Cuts

Every cost optimization carries the risk of degrading quality. You must measure quality before and after each change.

A/B Testing Framework

Route a percentage of traffic to the optimized path and compare quality metrics:

import random

def handle_request(messages: list[dict], request_id: str) -> str:
    if random.random() < 0.1:  # 10% to control group
        response = call_llm(messages, model="gpt-4o")  # original path
        log_experiment(request_id, group="control", response=response)
    else:
        response = optimized_route(messages)  # optimized path
        log_experiment(request_id, group="treatment", response=response)
    return response

Quality Metrics

For each optimization, define measurable quality criteria:

  • Accuracy: Does the response correctly answer the question? (Evaluated by LLM-as-judge or human review on a sample.)
  • Completeness: Does the response cover all aspects of the question?
  • Relevance: Is the response focused on the question without unnecessary information?
  • Format compliance: Does the response follow the expected structure?

Run these evaluations on a held-out set of 200+ request-response pairs. Compare scores between the original and optimized paths. Only deploy optimizations that maintain quality scores within 5% of the baseline.
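
The 5% gate can be encoded as a one-line check (illustrative name; the tolerance is relative to the baseline):

```python
def within_quality_gate(baseline: float, candidate: float,
                        tolerance: float = 0.05) -> bool:
    """True if the candidate score is within `tolerance` (relative) of baseline."""
    return candidate >= baseline * (1 - tolerance)

# within_quality_gate(4.4, 4.2)  → True   (a 4.5% drop passes)
# within_quality_gate(4.4, 4.1)  → False  (a 6.8% drop fails)
```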

Case Study: Customer Support Automation

A SaaS company operating a customer support automation platform processed 50,000 tickets per day using GPT-4o for classification, response generation, and escalation decisions. Monthly LLM spend had reached $38,000 and was projected to grow 40% quarter-over-quarter as ticket volume increased.

Stripe Systems was engaged to reduce costs without degrading customer satisfaction scores (CSAT) or resolution accuracy.

Baseline Analysis

The team started by instrumenting every LLM call to understand the cost distribution:

| Feature | Daily Requests | Avg Tokens | Model | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|
| Ticket classification | 50,000 | 480 | GPT-4o | $180 | $5,400 |
| Response generation | 42,000 | 1,200 | GPT-4o | $630 | $18,900 |
| Escalation decision | 15,000 | 350 | GPT-4o | $79 | $2,370 |
| Sentiment analysis | 50,000 | 280 | GPT-4o | $105 | $3,150 |
| Knowledge base search | 38,000 | 850 | GPT-4o | $242 | $7,260 |
| Total | | | | $1,236 | $37,080 |

Optimization 1: Semantic Caching

Many support tickets are near-duplicates. "How do I reset my password?" appears dozens of times daily with slight variations. The team implemented semantic caching with a similarity threshold of 0.93 on the response generation pipeline.

Implementation details:

  • Cache store: Redis with vector search (RediSearch module)
  • Embedding model: text-embedding-3-small (cheap, fast)
  • Cache TTL: 24 hours (knowledge base updates daily)
  • Scope: applied to response generation and knowledge base search only (classification and escalation need per-ticket precision)

Results:

  • Cache hit rate: 34% on response generation, 41% on knowledge base search
  • Monthly savings: $12,100
  • Quality impact: CSAT scores unchanged (cached responses are identical to original responses for semantically equivalent queries)

Optimization 2: Model Routing

Not every ticket needs GPT-4o. Password resets, account status inquiries, and simple how-to questions are well within GPT-4o-mini's capabilities.

Router implementation:

  • A fine-tuned GPT-4o-mini classifier categorizes tickets into simple/moderate/complex
  • Simple and moderate tickets (70% of volume) route to GPT-4o-mini
  • Complex tickets (30% of volume) route to GPT-4o
  • Router cost: ~$45/month (negligible)

# Router training data: 2,000 labeled tickets
# Features: ticket text, category, customer tier
# Labels: simple, moderate, complex

# Routing rules:
#   simple → gpt-4o-mini (password resets, status checks, FAQ)
#   moderate → gpt-4o-mini (how-to, feature questions, billing)
#   complex → gpt-4o (complaints, bugs, multi-issue, escalations)

Results:

  • 70% of response generation shifted to GPT-4o-mini
  • Monthly savings: $8,800
  • Quality impact: CSAT for simple/moderate tickets dropped 0.3 points (from 4.4 to 4.1 on a 5-point scale); blended across all traffic, the overall CSAT change stayed within the acceptable 5% threshold

Optimization 3: Prompt Optimization

The existing prompts were verbose, with redundant instructions and excessive few-shot examples. The team systematically shortened them:

  • Ticket classification prompt: 580 tokens → 340 tokens
  • Response generation prompt: 920 tokens → 580 tokens
  • Escalation decision prompt: 410 tokens → 260 tokens
  • System prompts consolidated, redundant safety instructions deduplicated

Results:

  • Average tokens per request: 340 → 210 (across all features)
  • Monthly savings: $4,100
  • Quality impact: no measurable change in any metric

Combined Results

| Metric | Before | After | Change |
|---|---|---|---|
| Monthly LLM spend | $38,000 | $13,000 | -66% |
| Semantic cache hit rate | 0% | 34% | — |
| GPT-4o-mini usage | 0% | 70% | — |
| Avg tokens per request | 340 | 210 | -38% |
| CSAT score | 4.4 | 4.2 | -4.5% |
| Resolution accuracy | 91% | 89.5% | -1.6% |
| Avg response latency | 2.1s | 1.4s | -33% |

The total monthly savings of $25,000 came with a minor quality tradeoff: a 0.2 point CSAT decrease and a 1.5 percentage point resolution accuracy decrease, both within the pre-agreed acceptable thresholds. Response latency actually improved because cached responses are instant and GPT-4o-mini is faster than GPT-4o.

Projected Savings at Scale

With ticket volume projected to grow 40% per quarter, the cost optimization infrastructure scales linearly. Without optimization, the projected monthly spend at 100K tickets/day would have been $76,000. With the optimizations in place, it is projected at $26,000, a difference of $50,000/month.

The engineering effort to build and validate these optimizations took 6 weeks. The return on investment was measured in days, not months. The key lesson: LLM cost optimization is not about using cheaper models — it is about using the right model for each request, eliminating redundant computation, and measuring quality to ensure you are not trading accuracy for savings blindly.
