Stripe Systems
AI/ML · March 10, 2026 · 14 min read

Building Production-Ready RAG Pipelines — Chunking Strategies, Vector DBs, and Evaluation Frameworks

By Stripe Systems Engineering

Retrieval-Augmented Generation (RAG) has become the default architecture for building LLM-powered applications over proprietary data. The core idea is straightforward: instead of fine-tuning a language model on your data (expensive, brittle, and hard to update), you retrieve relevant context at query time and inject it into the prompt. The model generates answers grounded in your documents rather than relying solely on its parametric knowledge.

This sounds simple. In practice, building a RAG pipeline that performs reliably in production — with consistent retrieval accuracy, acceptable latency, and manageable cost — requires careful engineering across every component. This post covers the end-to-end pipeline: document ingestion, chunking strategies, embedding selection, vector databases, hybrid search, re-ranking, prompt design, hallucination detection, and evaluation frameworks.

Why RAG Over Fine-Tuning

Fine-tuning has its place, but for most enterprise use cases RAG is the better starting point:

  • Data freshness: RAG uses the latest version of your documents. Fine-tuning requires retraining when data changes.
  • Attribution: RAG can cite specific source documents. Fine-tuned models cannot reliably tell you where their knowledge comes from.
  • Cost: Fine-tuning GPT-4-class models is expensive and slow. RAG works with any base model.
  • Separation of concerns: Your retrieval pipeline and generation model are independent components that can be improved separately.

Fine-tuning makes sense when you need the model to learn a specific style, follow domain-specific reasoning patterns, or when your retrieval corpus is too large for context windows. For knowledge retrieval over enterprise documents — contracts, policies, technical documentation, support tickets — RAG is almost always the right choice.

Document Ingestion Pipeline

Before anything touches a vector database, raw documents need to be parsed, cleaned, and structured. This is where most pipelines quietly fail.

Parsing

Different document formats require different parsers:

  • PDF: Use pymupdf (fitz) or pdfplumber for text extraction. For scanned PDFs, you need OCR — pytesseract or a cloud OCR API. Layout-aware parsing matters: tables, headers, and columns need to be reconstructed correctly.
  • DOCX: python-docx handles text extraction well. Watch for embedded tables, footnotes, and tracked changes.
  • HTML: BeautifulSoup with aggressive tag stripping. Remove navigation, footers, sidebars — keep only the content body.
  • Markdown/Plain text: Usually straightforward, but handle front matter, code blocks, and metadata separately.

Cleaning

Raw extracted text is messy. A typical cleaning pipeline:

  1. Remove duplicate whitespace and normalize line endings
  2. Strip headers/footers that repeat on every page (common in PDFs)
  3. Handle hyphenation at line breaks
  4. Normalize Unicode characters
  5. Extract and preserve metadata: document title, author, creation date, section headers
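Steps 1-4 above can be sketched in a few lines (header/footer stripping and metadata extraction are format-specific, so they are omitted here):

```python
import re
import unicodedata

def clean_extracted_text(text: str) -> str:
    """Basic cleanup for raw extracted text (a minimal sketch)."""
    # Normalize Unicode (ligatures, smart quotes -> canonical forms)
    text = unicodedata.normalize("NFKC", text)
    # Normalize line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Re-join words hyphenated across line breaks ("docu-\nment" -> "document")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs; cap blank lines at one paragraph break
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```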

Metadata Extraction

Metadata is often more valuable than people realize. Attaching document title, section headers, page numbers, and document type to each chunk enables filtered retrieval later. A query about "vacation policy" should be able to filter for HR documents before running vector similarity.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DocumentChunk:
    text: str
    doc_id: str
    doc_title: str
    section_header: Optional[str]
    page_number: Optional[int]
    chunk_index: int
    token_count: int
    metadata: dict

Chunking Strategies

Chunking is the most underrated component of a RAG pipeline. The way you split documents into chunks directly determines retrieval quality. There is no universally optimal strategy — it depends on your documents, queries, and embedding model.

Fixed-Size Chunking

Split text every N tokens (or characters) with an overlap of M tokens.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,  # or a token counter
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)

Pros: Simple, predictable chunk sizes, easy to reason about. Cons: Splits mid-sentence, mid-paragraph, or mid-thought. Semantic boundaries are ignored.

Sentence-Based Chunking

Split on sentence boundaries, then group sentences until you hit a token limit.

Pros: Preserves complete sentences. Cons: Sentence detection is imperfect (abbreviations, legal citations). Grouping still creates arbitrary boundaries.
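The grouping logic can be sketched as follows, using a naive regex sentence splitter and whitespace word counts as a token proxy (a real pipeline would use a proper sentence tokenizer and the embedding model's own tokenizer):

```python
import re

def sentence_chunks(text: str, max_tokens: int = 256) -> list[str]:
    """Group whole sentences into chunks of at most max_tokens (approximate)."""
    # Naive split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude proxy for token count
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```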

Recursive Character Splitting

This is the most commonly used strategy in LangChain. It tries to split on paragraph boundaries first, then sentences, then words, recursively falling back to smaller separators.

The separator hierarchy matters:

separators = [
    "\n\n",   # Paragraph breaks (best)
    "\n",     # Line breaks
    ". ",     # Sentence endings
    ", ",     # Clause breaks
    " ",      # Word boundaries
    ""        # Character level (last resort)
]

This is a good default. It preserves semantic structure better than fixed-size chunking while keeping chunk sizes consistent.

Semantic Chunking

Group text by semantic similarity rather than syntactic boundaries. Compute embeddings for each sentence, then split where the cosine similarity between adjacent sentences drops below a threshold.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85
)
chunks = splitter.create_documents([document_text])

Pros: Chunks are semantically coherent — each chunk is about one topic. Cons: Slower (requires embedding every sentence), variable chunk sizes (some very small, some very large), and the threshold needs tuning per document type.

Chunk Size vs Retrieval Accuracy

Smaller chunks are more precise — they are more likely to match a specific query. Larger chunks provide more context — the LLM gets a fuller picture. The tradeoff is real and must be tuned empirically for your use case.

General guidelines from experimentation:

| Chunk Size (tokens) | Retrieval Precision | Context Completeness | Best For |
|---|---|---|---|
| 128 | High | Low | FAQ, short factual queries |
| 256 | Good | Moderate | General Q&A |
| 512 | Moderate | Good | Complex questions needing context |
| 1024 | Lower | High | Summarization, multi-hop reasoning |

The overlap between chunks matters too. Too little overlap and you lose context at boundaries. Too much and you waste storage and compute. 10-15% of chunk size is a reasonable starting point.

Embedding Models

The embedding model converts text into dense vectors for similarity search. Your choice here directly affects retrieval quality and cost.

Proprietary Options

  • OpenAI text-embedding-ada-002: The workhorse. 1536 dimensions, $0.0001/1K tokens. Good general-purpose quality, but not the best on all benchmarks.
  • OpenAI text-embedding-3-small/large: Newer models with configurable dimensions. Better quality than ada-002 at similar cost.
  • Cohere embed-v3: Strong multilingual support. Offers separate input types for documents vs queries (improves retrieval). Competitive with OpenAI on MTEB benchmarks.

Open-Source Options

  • BGE (BAAI General Embedding): Consistently near the top of MTEB leaderboard. BGE-large-en-v1.5 is a solid choice.
  • E5: From Microsoft. E5-mistral-7b-instruct sits near the top of the MTEB leaderboard among open-source models, but at 7B parameters it is expensive to serve.
  • GTE: From Alibaba. Good quality, reasonable size.

Running open-source models locally eliminates per-token costs but requires GPU infrastructure. For a corpus of 100K documents, embedding cost with OpenAI is typically under $50 — not a significant concern. The cost becomes relevant when you are re-embedding frequently or processing millions of documents.

The real decision factor is quality on your domain. Run a retrieval evaluation on a sample of your data with multiple embedding models before committing.
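One lightweight way to run that comparison is a recall@k harness over a labeled set of (query, relevant chunk) pairs. A sketch, where the embedding matrices come from whichever model is being trialed:

```python
import numpy as np

def recall_at_k(
    query_vecs: np.ndarray,   # (n_queries, dim)
    chunk_vecs: np.ndarray,   # (n_chunks, dim)
    relevant_ids: list[int],  # index of the relevant chunk for each query
    k: int = 5,
) -> float:
    """Fraction of queries whose relevant chunk appears in the top-k by cosine."""
    # Normalize rows so a dot product equals cosine similarity
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = q @ c.T                       # (n_queries, n_chunks)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [rel in topk[i] for i, rel in enumerate(relevant_ids)]
    return sum(hits) / len(hits)
```

Embed the same sample corpus and query set with each candidate model, compare recall@5 (and recall@20 if you re-rank), and pick the winner on your data rather than on a public leaderboard.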

Vector Databases

Once you have embeddings, you need somewhere to store and query them. The vector database landscape is crowded. Here is an honest comparison:

Pinecone

Fully managed, serverless option available. Good for teams that do not want to manage infrastructure. Scales well. Metadata filtering is solid. Downside: vendor lock-in, and costs can escalate at scale.

Weaviate

Open-source, self-hostable. Supports hybrid search natively (vector + BM25). Has a built-in module system for embedding generation. Good for teams comfortable running their own infrastructure.

Qdrant

Open-source, written in Rust. Fast. Good filtering capabilities, supports multi-vector search. Lower memory footprint than some alternatives. The API is clean and well-documented.

pgvector

A PostgreSQL extension that adds vector similarity search to your existing Postgres database. If you already run PostgreSQL, this is the lowest-friction option. No new infrastructure, no new operational burden. Performance is acceptable for millions of vectors with proper indexing (HNSW or IVFFlat).

CREATE EXTENSION vector;

CREATE TABLE document_chunks (
    id SERIAL PRIMARY KEY,
    doc_id TEXT NOT NULL,
    chunk_text TEXT NOT NULL,
    embedding vector(1536),
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

When each fits: Use pgvector if you already have PostgreSQL and your corpus is under 10M vectors. Use Qdrant or Weaviate if you need more advanced features or higher scale. Use Pinecone if you want fully managed and your budget allows it.

Hybrid Search

Pure vector search finds semantically similar documents but can miss exact keyword matches. A query for "ISO 27001 compliance" might retrieve documents about "information security standards" (semantically similar) but miss a document that literally mentions "ISO 27001" in a different context.

Hybrid search combines vector similarity with BM25 keyword search. The standard fusion method is Reciprocal Rank Fusion (RRF):

def reciprocal_rank_fusion(
    vector_results: list[dict],
    keyword_results: list[dict],
    k: int = 60,
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> list[dict]:
    scores = {}

    for rank, doc in enumerate(vector_results):
        doc_id = doc["id"]
        scores[doc_id] = scores.get(doc_id, 0) + vector_weight / (k + rank + 1)

    for rank, doc in enumerate(keyword_results):
        doc_id = doc["id"]
        scores[doc_id] = scores.get(doc_id, 0) + keyword_weight / (k + rank + 1)

    fused = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"id": doc_id, "score": score} for doc_id, score in fused]

The weights between vector and keyword results need tuning. For technical documentation with lots of jargon and specific terms, increase keyword weight. For conversational queries, increase vector weight.

Retrieval Augmentation: Re-Ranking

Initial retrieval (whether vector, keyword, or hybrid) returns a rough set of candidates. Re-ranking refines this set using a more expensive but more accurate model.

Cross-encoder re-rankers score each (query, document) pair independently, considering the full interaction between query and document tokens. This is more accurate than bi-encoder similarity but too slow for searching the full corpus — hence the two-stage approach.

from cohere import Client

co = Client(api_key="your-key")

def rerank_results(query: str, documents: list[str], top_n: int = 5):
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
    )
    return [
        {"index": r.index, "score": r.relevance_score}
        for r in response.results
    ]

Re-ranking typically improves retrieval precision by 10-25% over vector search alone. The cost is an additional API call and 100-300ms of latency. For most applications, this tradeoff is worth it.

Contextual Compression

After re-ranking, you can further improve context quality by compressing retrieved chunks — extracting only the sentences relevant to the query. This reduces token usage and focuses the LLM on the most pertinent information.
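A minimal sketch of extractive compression, using plain lexical overlap as the relevance test (production systems typically use an embedding- or LLM-based extractor, such as LangChain's contextual compression retrievers):

```python
import re

def compress_chunk(query: str, chunk: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least min_overlap words with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        s for s in sentences
        if len(query_words & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)
```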

Prompt Engineering for RAG

The prompt template determines how the LLM uses retrieved context. A well-structured RAG prompt:

RAG_SYSTEM_PROMPT = """You are a helpful assistant that answers questions
based on the provided context. Follow these rules strictly:

1. Only use information from the provided context to answer.
2. If the context does not contain enough information, say so explicitly.
3. Cite your sources using [Source N] notation.
4. Do not make up information or use knowledge not in the context.
5. If multiple sources conflict, note the discrepancy.

Context:
{context}

Each context chunk is labeled with its source document and section.
"""

def format_context(chunks: list[DocumentChunk]) -> str:
    formatted = []
    for i, chunk in enumerate(chunks):
        source_label = f"[Source {i+1}: {chunk.doc_title}"
        if chunk.section_header:
            source_label += f" > {chunk.section_header}"
        source_label += "]"
        formatted.append(f"{source_label}\n{chunk.text}")
    return "\n\n---\n\n".join(formatted)

Key decisions:

  • Number of chunks: 3-7 is typical. More context helps recall but increases cost and can confuse the model.
  • Context ordering: Place the most relevant chunks first. LLMs attend more to the beginning and end of the context.
  • Citation format: Explicit citation instructions improve source attribution significantly.

Hallucination Detection

RAG reduces hallucination compared to vanilla LLM usage, but does not eliminate it. The model can still fabricate information, misinterpret context, or combine information from multiple sources in misleading ways.

Faithfulness Scoring

Compare the generated answer against the retrieved context. For each claim in the answer, check whether it is supported by the context:

  1. Decompose the answer into individual claims
  2. For each claim, check if any retrieved chunk supports it
  3. Faithfulness = (supported claims) / (total claims)

This can be automated using an LLM-as-judge approach — a separate LLM call that evaluates whether each claim is grounded in the context.
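The scoring step can be wired up with a pluggable judge. A sketch, where `judge` is any callable that returns True when a chunk supports a claim — an LLM call in practice, a stub in tests:

```python
from typing import Callable

def faithfulness_score(
    claims: list[str],
    contexts: list[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of claims supported by at least one retrieved chunk."""
    if not claims:
        return 1.0  # no claims means nothing to contradict
    supported = sum(
        1 for claim in claims
        if any(judge(claim, ctx) for ctx in contexts)
    )
    return supported / len(claims)
```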

Source Attribution Verification

If the answer cites "[Source 2]", verify that the cited claim actually appears in Source 2. This catches cases where the model attributes a claim to the wrong source.
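The mechanical part — parsing `[Source N]` citations and checking each citing sentence against the right chunk — can be sketched with lexical overlap standing in for the semantic check (an LLM judge would replace the overlap test in practice):

```python
import re

def verify_citations(answer: str, sources: dict[int, str]) -> list[tuple[int, bool]]:
    """For each [Source N] citation, check whether the citing sentence shares
    at least two words with source N (a crude lexical proxy)."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        cited = [int(n) for n in re.findall(r"\[Source (\d+)\]", sentence)]
        if not cited:
            continue
        # Strip the citation markers before comparing words
        stripped = re.sub(r"\[Source \d+\]", "", sentence)
        claim_words = set(re.findall(r"\w+", stripped.lower()))
        for n in cited:
            source_words = set(re.findall(r"\w+", sources.get(n, "").lower()))
            results.append((n, len(claim_words & source_words) >= 2))
    return results
```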

Evaluation Framework: RAGAS

RAGAS (Retrieval Augmented Generation Assessment) provides a structured evaluation framework with four core metrics:

  • Context Precision: Of the retrieved chunks, how many are relevant? Measures retrieval quality.
  • Context Recall: Of the relevant information in the corpus, how much was retrieved? Measures retrieval completeness.
  • Answer Relevancy: Does the generated answer address the question? Measures generation quality.
  • Faithfulness: Is the answer grounded in the retrieved context? Measures hallucination.

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    answer_relevancy,
    faithfulness,
)
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,  # list of lists
    "ground_truth": ground_truth_answers,
})

results = evaluate(
    dataset=eval_dataset,
    metrics=[
        context_precision,
        context_recall,
        answer_relevancy,
        faithfulness,
    ],
)
print(results)

Building a good evaluation dataset is the hard part. You need at least 100-200 question-answer pairs with ground truth. These should be created by domain experts, not generated by an LLM.

Production Concerns

Caching Embeddings

Do not re-embed documents that have not changed. Maintain a hash of each document's content and only re-embed when the hash changes.

import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def should_reembed(doc_id: str, new_text: str, existing_hash: str) -> bool:
    return content_hash(new_text) != existing_hash

Incremental Indexing

For large corpora, full re-indexing is expensive. Track document modification timestamps and only process changed documents. Use soft deletes in the vector store — mark old chunks as inactive and add new ones, then periodically clean up.
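The bookkeeping can be sketched by combining the content-hash check above with soft deletes. Here a plain dict of version records stands in for the vector store's metadata:

```python
import hashlib

def sync_document(index: dict[str, list[dict]], doc_id: str, text: str) -> str:
    """Incrementally sync one document into a toy versioned index.
    Returns 'unchanged', 'updated', or 'new'."""
    new_hash = hashlib.sha256(text.encode()).hexdigest()
    versions = index.setdefault(doc_id, [])
    active = [v for v in versions if v["active"]]
    if active and active[-1]["hash"] == new_hash:
        return "unchanged"            # content unchanged: skip re-embedding
    for v in active:
        v["active"] = False           # soft-delete stale chunks; clean up later
    versions.append({"hash": new_hash, "active": True})
    return "updated" if active else "new"
```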

Versioning Vector Indices

When you change chunking strategy, embedding model, or metadata schema, you need to re-index everything. Use versioned collection names (documents_v3, documents_v4) and run the old and new indices in parallel during migration. Route a percentage of traffic to the new index, compare metrics, and cut over when confident.
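The traffic split is usually done deterministically per user, so each user sees consistent results throughout the migration. A sketch (the collection names follow the versioning convention above):

```python
import hashlib

def route_collection(user_id: str, new_pct: int,
                     old: str = "documents_v3", new: str = "documents_v4") -> str:
    """Deterministically route new_pct% of users to the new index."""
    # Hash the user id into a stable bucket in [0, 100)
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return new if bucket < new_pct else old
```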

Latency Budget

A typical production RAG pipeline has the following latency profile:

| Component | Latency |
|---|---|
| Query embedding | 50-100ms |
| Vector search | 10-50ms |
| BM25 search | 5-20ms |
| RRF fusion | <5ms |
| Re-ranking | 100-300ms |
| LLM generation | 500-3000ms |
| Total | 700-3500ms |

Re-ranking and LLM generation dominate. If latency is critical, stream the LLM response so users see tokens early, and run the vector and BM25 searches concurrently. Re-ranking depends on the fused retrieval results, so it cannot overlap with retrieval itself.
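Overlapping the two retrieval calls is straightforward with asyncio. A sketch with stub search functions standing in for the real database and BM25 queries:

```python
import asyncio

async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # stands in for a ~50ms embedding + ANN query
    return ["chunk_a", "chunk_b"]

async def keyword_search(query: str) -> list[str]:
    await asyncio.sleep(0.02)  # stands in for a ~20ms BM25 query
    return ["chunk_b", "chunk_c"]

async def retrieve(query: str) -> list[str]:
    # Run both searches concurrently; latency ~= the slower of the two
    vec, kw = await asyncio.gather(vector_search(query), keyword_search(query))
    # Deduplicate while preserving order (RRF fusion would happen here)
    return list(dict.fromkeys(vec + kw))

# results = asyncio.run(retrieve("indemnification clause"))
```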

Case Study: Legal Document Knowledge Base

A legal technology company needed to build an internal knowledge base over 50,000+ legal documents — contracts, case summaries, regulatory filings, and internal memos. Lawyers needed to query this corpus in natural language and get accurate, cited answers.

Initial Approach (Naive RAG)

The first version used fixed 1000-token chunks, OpenAI ada-002 embeddings, Pinecone for storage, and a basic prompt template. Retrieval accuracy (measured as whether the correct source document appeared in the top-5 results) was 62%. Lawyers reported frequent irrelevant results and occasional hallucinated legal citations — a serious problem in a legal context.

Optimized Pipeline Built by Stripe Systems

The engineering team at Stripe Systems rebuilt the pipeline with the following changes:

Chunking: Switched to recursive character splitting with 512-token chunks and 50-token overlap. Legal documents have a clear structure (sections, subsections, clauses), so the splitter was configured to respect section boundaries as primary separators.

Chunking Strategy Comparison Results:

| Chunk Strategy | Chunk Size | Overlap | Retrieval F1 | Avg Relevance |
|---|---|---|---|---|
| Fixed-size | 1000 tokens | 0 | 0.58 | 0.61 |
| Fixed-size | 512 tokens | 50 | 0.67 | 0.69 |
| Fixed-size | 256 tokens | 25 | 0.64 | 0.72 |
| Sentence-based | ~400 tokens | 0 | 0.63 | 0.70 |
| Recursive | 512 tokens | 50 | 0.74 | 0.76 |
| Semantic | Variable | N/A | 0.71 | 0.78 |

Recursive chunking at 512 tokens with 50-token overlap gave the best balance of retrieval F1 and relevance.

Vector Storage: pgvector, co-located with the existing PostgreSQL database. This eliminated the need for a separate vector database service, simplified backups, and allowed joins between vector search results and document metadata stored in relational tables.

Hybrid Search: Combined pgvector cosine similarity with pg_trgm-based keyword search, fused with RRF. Legal documents contain specific terms (statute numbers, case citations) where exact match is critical.

Re-ranking: Cohere rerank-english-v3.0 applied to the top 20 hybrid search results, returning the top 5.

Pipeline Code:

from langchain_community.vectorstores import PGVector
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
import cohere

CONNECTION_STRING = "postgresql://user:pass@localhost:5432/legaldb"
COLLECTION_NAME = "legal_docs_v2"

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n\nSECTION",    # Legal section breaks
        "\n\nArticle",     # Article breaks
        "\n\n",            # Paragraph breaks
        "\n",              # Line breaks
        ". ",              # Sentence endings
        " ",
        "",
    ],
)

vectorstore = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings,
)

# Hybrid retrieval with re-ranking
def retrieve_and_rerank(query: str, top_k: int = 5) -> list[dict]:
    # Stage 1: Vector search (top 20). similarity_search_with_score returns
    # (Document, score) tuples; normalize them to the {"id": ...} dicts that
    # reciprocal_rank_fusion expects (assumes each chunk's id is in metadata)
    vector_results = [
        {"id": doc.metadata["chunk_id"], "score": score}
        for doc, score in vectorstore.similarity_search_with_score(query, k=20)
    ]

    # Stage 2: Keyword search (top 20) — custom pg_trgm implementation
    keyword_results = bm25_search(query, k=20)

    # Stage 3: Reciprocal rank fusion
    fused = reciprocal_rank_fusion(
        vector_results, keyword_results,
        vector_weight=0.6, keyword_weight=0.4  # higher keyword weight for legal
    )

    # Stage 4: Re-rank the fused top 20 with Cohere
    co = cohere.Client(api_key="...")
    candidates = [get_chunk_text(d["id"]) for d in fused[:20]]
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_k,
    )

    return [
        {"text": candidates[r.index], "score": r.relevance_score}
        for r in reranked.results
    ]

Evaluation Results (RAGAS Metrics)

Evaluated on a set of 200 question-answer pairs created by senior lawyers:

| Metric | Naive RAG | Optimized Pipeline |
|---|---|---|
| Context Precision | 0.54 | 0.88 |
| Context Recall | 0.61 | 0.89 |
| Answer Relevancy | 0.67 | 0.92 |
| Faithfulness | 0.72 | 0.95 |
| Overall Retrieval Accuracy | 62% | 91% |

The biggest improvement came from hybrid search (keyword matching for legal citations) and re-ranking (filtering out superficially similar but irrelevant chunks). The faithfulness score improvement from 0.72 to 0.95 meant that hallucinated legal citations — the most dangerous failure mode — dropped from roughly 1 in 4 answers to 1 in 20.

Production Metrics After 6 Months

  • Average query latency: 1.8 seconds (including streaming)
  • Daily queries: ~2,000
  • Embedding storage: 23 GB in pgvector (HNSW index)
  • Monthly infrastructure cost: ~$400 (PostgreSQL instance + OpenAI embeddings + Cohere re-ranking)
  • User satisfaction (internal survey): 4.2/5, up from 2.8/5 with the naive pipeline

The key takeaway: RAG pipeline quality is determined by the engineering of each component — parsing, chunking, retrieval, re-ranking — not by which LLM you use for generation. A well-built retrieval pipeline with a modest model will outperform a poorly built pipeline with the most expensive model available.
