Stripe Systems
AI/ML · March 10, 2026 · 14 min read

Building Production-Ready RAG Pipelines — Chunking Strategies, Vector DBs, and Evaluation Frameworks

By Stripe Systems Engineering

Retrieval-Augmented Generation (RAG) has become the default architecture for building LLM-powered applications over proprietary data. The core idea is straightforward: instead of fine-tuning a language model on your data (expensive, brittle, and hard to update), you retrieve relevant context at query time and inject it into the prompt. The model generates answers grounded in your documents rather than relying solely on its parametric knowledge.

This sounds simple. In practice, building a RAG pipeline that performs reliably in production — with consistent retrieval accuracy, acceptable latency, and manageable cost — requires careful engineering across every component. This post covers the end-to-end pipeline: document ingestion, chunking strategies, embedding selection, vector databases, hybrid search, re-ranking, prompt design, hallucination detection, and evaluation frameworks.

Why RAG Over Fine-Tuning

Fine-tuning has its place, but for most enterprise use cases RAG is the better starting point:

  • Data freshness: RAG uses the latest version of your documents. Fine-tuning requires retraining when data changes.
  • Attribution: RAG can cite specific source documents. Fine-tuned models cannot reliably tell you where their knowledge comes from.
  • Cost: Fine-tuning GPT-4-class models is expensive and slow. RAG works with any base model.
  • Separation of concerns: Your retrieval pipeline and generation model are independent components that can be improved separately.

Fine-tuning makes sense when you need the model to learn a specific style, follow domain-specific reasoning patterns, or when your retrieval corpus is too large for context windows. For knowledge retrieval over enterprise documents — contracts, policies, technical documentation, support tickets — RAG is almost always the right choice.

Document Ingestion Pipeline

Before anything touches a vector database, raw documents need to be parsed, cleaned, and structured. This is where most pipelines quietly fail.

Parsing

Different document formats require different parsers:

  • PDF: Use pymupdf (fitz) or pdfplumber for text extraction. For scanned PDFs, you need OCR — pytesseract or a cloud OCR API. Layout-aware parsing matters: tables, headers, and columns need to be reconstructed correctly.
  • DOCX: python-docx handles text extraction well. Watch for embedded tables, footnotes, and tracked changes.
  • HTML: BeautifulSoup with aggressive tag stripping. Remove navigation, footers, sidebars — keep only the content body.
  • Markdown/Plain text: Usually straightforward, but handle front matter, code blocks, and metadata separately.

Cleaning

Raw extracted text is messy. A typical cleaning pipeline:

  1. Remove duplicate whitespace and normalize line endings
  2. Strip headers/footers that repeat on every page (common in PDFs)
  3. Handle hyphenation at line breaks
  4. Normalize Unicode characters
  5. Extract and preserve metadata: document title, author, creation date, section headers
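Steps 1-4 above can be sketched in a few lines (header/footer stripping and metadata extraction are format-specific, so they are omitted here):

```python
import re
import unicodedata

def clean_extracted_text(text: str) -> str:
    """Basic cleanup for raw extracted text (a minimal sketch)."""
    # Normalize Unicode (ligatures, smart quotes -> canonical forms)
    text = unicodedata.normalize("NFKC", text)
    # Normalize line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Re-join words hyphenated across line breaks ("docu-\nment" -> "document")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs; cap blank lines at one paragraph break
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```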

Metadata Extraction

Metadata is often more valuable than people realize. Attaching document title, section headers, page numbers, and document type to each chunk enables filtered retrieval later. A query about "vacation policy" should be able to filter for HR documents before running vector similarity.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DocumentChunk:
    text: str
    doc_id: str
    doc_title: str
    section_header: Optional[str]
    page_number: Optional[int]
    chunk_index: int
    token_count: int
    metadata: dict

Chunking Strategies

Chunking is the most underrated component of a RAG pipeline. The way you split documents into chunks directly determines retrieval quality. There is no universally optimal strategy — it depends on your documents, queries, and embedding model.

Fixed-Size Chunking

Split text every N tokens (or characters) with an overlap of M tokens.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,  # or a token counter
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)

Pros: Simple, predictable chunk sizes, easy to reason about. Cons: Splits mid-sentence, mid-paragraph, or mid-thought. Semantic boundaries are ignored.

Sentence-Based Chunking

Split on sentence boundaries, then group sentences until you hit a token limit.

Pros: Preserves complete sentences. Cons: Sentence detection is imperfect (abbreviations, legal citations). Grouping still creates arbitrary boundaries.
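The grouping logic can be sketched as follows, using a naive regex sentence splitter and whitespace word counts as a token proxy (a real pipeline would use a proper sentence tokenizer and the embedding model's own tokenizer):

```python
import re

def sentence_chunks(text: str, max_tokens: int = 256) -> list[str]:
    """Group whole sentences into chunks of at most max_tokens (approximate)."""
    # Naive split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude proxy for token count
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```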

Recursive Character Splitting

This is the most commonly used strategy in LangChain. It tries to split on paragraph boundaries first, then sentences, then words, recursively falling back to smaller separators.

The separator hierarchy matters:

separators = [
    "\n\n",   # Paragraph breaks (best)
    "\n",     # Line breaks
    ". ",     # Sentence endings
    ", ",     # Clause breaks
    " ",      # Word boundaries
    ""        # Character level (last resort)
]

This is a good default. It preserves semantic structure better than fixed-size chunking while keeping chunk sizes consistent.

Semantic Chunking

Group text by semantic similarity rather than syntactic boundaries. Compute embeddings for each sentence, then split where the cosine similarity between adjacent sentences drops below a threshold.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85
)
chunks = splitter.create_documents([document_text])

Pros: Chunks are semantically coherent — each chunk is about one topic. Cons: Slower (requires embedding every sentence), variable chunk sizes (some very small, some very large), and the threshold needs tuning per document type.

Chunk Size vs Retrieval Accuracy

Smaller chunks are more precise — they are more likely to match a specific query. Larger chunks provide more context — the LLM gets a fuller picture. The tradeoff is real and must be tuned empirically for your use case.

General guidelines from experimentation:

| Chunk Size (tokens) | Retrieval Precision | Context Completeness | Best For |
|---|---|---|---|
| 128 | High | Low | FAQ, short factual queries |
| 256 | Good | Moderate | General Q&A |
| 512 | Moderate | Good | Complex questions needing context |
| 1024 | Lower | High | Summarization, multi-hop reasoning |

The overlap between chunks matters too. Too little overlap and you lose context at boundaries. Too much and you waste storage and compute. 10-15% of chunk size is a reasonable starting point.

Embedding Models

The embedding model converts text into dense vectors for similarity search. Your choice here directly affects retrieval quality and cost.

Proprietary Options

  • OpenAI text-embedding-ada-002: The workhorse. 1536 dimensions, $0.0001/1K tokens. Good general-purpose quality, but not the best on all benchmarks.
  • OpenAI text-embedding-3-small/large: Newer models with configurable dimensions. Better quality than ada-002 at similar cost.
  • Cohere embed-v3: Strong multilingual support. Offers separate input types for documents vs queries (improves retrieval). Competitive with OpenAI on MTEB benchmarks.

Open-Source Options

  • BGE (BAAI General Embedding): Consistently near the top of MTEB leaderboard. BGE-large-en-v1.5 is a solid choice.
  • E5: From Microsoft. E5-mistral-7b-instruct sits near the top of the MTEB leaderboard among open-source models, but at 7B parameters it is expensive to serve.
  • GTE: From Alibaba. Good quality, reasonable size.

Running open-source models locally eliminates per-token costs but requires GPU infrastructure. For a corpus of 100K documents, embedding cost with OpenAI is typically under $50 — not a significant concern. The cost becomes relevant when you are re-embedding frequently or processing millions of documents.

The real decision factor is quality on your domain. Run a retrieval evaluation on a sample of your data with multiple embedding models before committing.
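One lightweight way to run that comparison is a recall@k harness over a labeled set of (query, relevant chunk) pairs. A sketch, where the embedding matrices come from whichever model is being trialed:

```python
import numpy as np

def recall_at_k(
    query_vecs: np.ndarray,   # (n_queries, dim)
    chunk_vecs: np.ndarray,   # (n_chunks, dim)
    relevant_ids: list[int],  # index of the relevant chunk for each query
    k: int = 5,
) -> float:
    """Fraction of queries whose relevant chunk appears in the top-k by cosine."""
    # Normalize rows so a dot product equals cosine similarity
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = q @ c.T                       # (n_queries, n_chunks)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [rel in topk[i] for i, rel in enumerate(relevant_ids)]
    return sum(hits) / len(hits)
```

Embed the same sample corpus and query set with each candidate model, compare recall@5 (and recall@20 if you re-rank), and pick the winner on your data rather than on a public leaderboard.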

Vector Databases

Once you have embeddings, you need somewhere to store and query them. The vector database landscape is crowded. Here is an honest comparison:

Pinecone

Fully managed, serverless option available. Good for teams that do not want to manage infrastructure. Scales well. Metadata filtering is solid. Downside: vendor lock-in, and costs can escalate at scale.

Weaviate

Open-source, self-hostable. Supports hybrid search natively (vector + BM25). Has a built-in module system for embedding generation. Good for teams comfortable running their own infrastructure.

Qdrant

Open-source, written in Rust. Fast. Good filtering capabilities, supports multi-vector search. Lower memory footprint than some alternatives. The API is clean and well-documented.

pgvector

A PostgreSQL extension that adds vector similarity search to your existing Postgres database. If you already run PostgreSQL, this is the lowest-friction option. No new infrastructure, no new operational burden. Performance is acceptable for millions of vectors with proper indexing (HNSW or IVFFlat).

CREATE EXTENSION vector;

CREATE TABLE document_chunks (
    id SERIAL PRIMARY KEY,
    doc_id TEXT NOT NULL,
    chunk_text TEXT NOT NULL,
    embedding vector(1536),
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

When each fits: Use pgvector if you already have PostgreSQL and your corpus is under 10M vectors. Use Qdrant or Weaviate if you need more advanced features or higher scale. Use Pinecone if you want fully managed and your budget allows it.

Hybrid Search

Pure vector search finds semantically similar documents but can miss exact keyword matches. A query for "ISO 27001 compliance" might retrieve documents about "information security standards" (semantically similar) but miss a document that literally mentions "ISO 27001" in a different context.

Hybrid search combines vector similarity with BM25 keyword search. The standard fusion method is Reciprocal Rank Fusion (RRF):

def reciprocal_rank_fusion(
    vector_results: list[dict],
    keyword_results: list[dict],
    k: int = 60,
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> list[dict]:
    scores = {}

    for rank, doc in enumerate(vector_results):
        doc_id = doc["id"]
        scores[doc_id] = scores.get(doc_id, 0) + vector_weight / (k + rank + 1)

    for rank, doc in enumerate(keyword_results):
        doc_id = doc["id"]
        scores[doc_id] = scores.get(doc_id, 0) + keyword_weight / (k + rank + 1)

    fused = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"id": doc_id, "score": score} for doc_id, score in fused]

The weights between vector and keyword results need tuning. For technical documentation with lots of jargon and specific terms, increase keyword weight. For conversational queries, increase vector weight.

Retrieval Augmentation: Re-Ranking

Initial retrieval (whether vector, keyword, or hybrid) returns a rough set of candidates. Re-ranking refines this set using a more expensive but more accurate model.

Cross-encoder re-rankers score each (query, document) pair independently, considering the full interaction between query and document tokens. This is more accurate than bi-encoder similarity but too slow for searching the full corpus — hence the two-stage approach.

from cohere import Client

co = Client(api_key="your-key")

def rerank_results(query: str, documents: list[str], top_n: int = 5):
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
    )
    return [
        {"index": r.index, "score": r.relevance_score}
        for r in response.results
    ]

Re-ranking typically improves retrieval precision by 10-25% over vector search alone. The cost is an additional API call and 100-300ms of latency. For most applications, this tradeoff is worth it.

Contextual Compression

After re-ranking, you can further improve context quality by compressing retrieved chunks — extracting only the sentences relevant to the query. This reduces token usage and focuses the LLM on the most pertinent information.
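A minimal sketch of extractive compression, using plain lexical overlap as the relevance test (production systems typically use an embedding- or LLM-based extractor, such as LangChain's contextual compression retrievers):

```python
import re

def compress_chunk(query: str, chunk: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least min_overlap words with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        s for s in sentences
        if len(query_words & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)
```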

Prompt Engineering for RAG

The prompt template determines how the LLM uses retrieved context. A well-structured RAG prompt:

RAG_SYSTEM_PROMPT = """You are a helpful assistant that answers questions
based on the provided context. Follow these rules strictly:

1. Only use information from the provided context to answer.
2. If the context does not contain enough information, say so explicitly.
3. Cite your sources using [Source N] notation.
4. Do not make up information or use knowledge not in the context.
5. If multiple sources conflict, note the discrepancy.

Context:
{context}

Each context chunk is labeled with its source document and section.
"""

def format_context(chunks: list[DocumentChunk]) -> str:
    formatted = []
    for i, chunk in enumerate(chunks):
        source_label = f"[Source {i+1}: {chunk.doc_title}"
        if chunk.section_header:
            source_label += f" > {chunk.section_header}"
        source_label += "]"
        formatted.append(f"{source_label}\n{chunk.text}")
    return "\n\n---\n\n".join(formatted)

Key decisions:

  • Number of chunks: 3-7 is typical. More context helps recall but increases cost and can confuse the model.
  • Context ordering: Place the most relevant chunks first. LLMs attend more to the beginning and end of the context.
  • Citation format: Explicit citation instructions improve source attribution significantly.

Hallucination Detection

RAG reduces hallucination compared to vanilla LLM usage, but does not eliminate it. The model can still fabricate information, misinterpret context, or combine information from multiple sources in misleading ways.

Faithfulness Scoring

Compare the generated answer against the retrieved context. For each claim in the answer, check whether it is supported by the context:

  1. Decompose the answer into individual claims
  2. For each claim, check if any retrieved chunk supports it
  3. Faithfulness = (supported claims) / (total claims)

This can be automated using an LLM-as-judge approach — a separate LLM call that evaluates whether each claim is grounded in the context.
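The scoring step can be wired up with a pluggable judge. A sketch, where `judge` is any callable that returns True when a chunk supports a claim — an LLM call in practice, a stub in tests:

```python
from typing import Callable

def faithfulness_score(
    claims: list[str],
    contexts: list[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of claims supported by at least one retrieved chunk."""
    if not claims:
        return 1.0  # no claims means nothing to contradict
    supported = sum(
        1 for claim in claims
        if any(judge(claim, ctx) for ctx in contexts)
    )
    return supported / len(claims)
```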

Source Attribution Verification

If the answer cites "[Source 2]", verify that the cited claim actually appears in Source 2. This catches cases where the model attributes a claim to the wrong source.
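The mechanical part — parsing `[Source N]` citations and checking each citing sentence against the right chunk — can be sketched with lexical overlap standing in for the semantic check (an LLM judge would replace the overlap test in practice):

```python
import re

def verify_citations(answer: str, sources: dict[int, str]) -> list[tuple[int, bool]]:
    """For each [Source N] citation, check whether the citing sentence shares
    at least two words with source N (a crude lexical proxy)."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        cited = [int(n) for n in re.findall(r"\[Source (\d+)\]", sentence)]
        if not cited:
            continue
        # Strip the citation markers before comparing words
        stripped = re.sub(r"\[Source \d+\]", "", sentence)
        claim_words = set(re.findall(r"\w+", stripped.lower()))
        for n in cited:
            source_words = set(re.findall(r"\w+", sources.get(n, "").lower()))
            results.append((n, len(claim_words & source_words) >= 2))
    return results
```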

Evaluation Framework: RAGAS

RAGAS (Retrieval Augmented Generation Assessment) provides a structured evaluation framework with four core metrics:

  • Context Precision: Of the retrieved chunks, how many are relevant? Measures retrieval quality.
  • Context Recall: Of the relevant information in the corpus, how much was retrieved? Measures retrieval completeness.
  • Answer Relevancy: Does the generated answer address the question? Measures generation quality.
  • Faithfulness: Is the answer grounded in the retrieved context? Measures hallucination.

from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    answer_relevancy,
    faithfulness,
)
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,  # list of lists
    "ground_truth": ground_truth_answers,
})

results = evaluate(
    dataset=eval_dataset,
    metrics=[
        context_precision,
        context_recall,
        answer_relevancy,
        faithfulness,
    ],
)
print(results)

Building a good evaluation dataset is the hard part. You need at least 100-200 question-answer pairs with ground truth. These should be created by domain experts, not generated by an LLM.

Production Concerns

Caching Embeddings

Do not re-embed documents that have not changed. Maintain a hash of each document's content and only re-embed when the hash changes.

import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def should_reembed(doc_id: str, new_text: str, existing_hash: str) -> bool:
    return content_hash(new_text) != existing_hash

Incremental Indexing

For large corpora, full re-indexing is expensive. Track document modification timestamps and only process changed documents. Use soft deletes in the vector store — mark old chunks as inactive and add new ones, then periodically clean up.
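The bookkeeping can be sketched by combining the content-hash check above with soft deletes. Here a plain dict of version records stands in for the vector store's metadata:

```python
import hashlib

def sync_document(index: dict[str, list[dict]], doc_id: str, text: str) -> str:
    """Incrementally sync one document into a toy versioned index.
    Returns 'unchanged', 'updated', or 'new'."""
    new_hash = hashlib.sha256(text.encode()).hexdigest()
    versions = index.setdefault(doc_id, [])
    active = [v for v in versions if v["active"]]
    if active and active[-1]["hash"] == new_hash:
        return "unchanged"            # content unchanged: skip re-embedding
    for v in active:
        v["active"] = False           # soft-delete stale chunks; clean up later
    versions.append({"hash": new_hash, "active": True})
    return "updated" if active else "new"
```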

Versioning Vector Indices

When you change chunking strategy, embedding model, or metadata schema, you need to re-index everything. Use versioned collection names (documents_v3, documents_v4) and run the old and new indices in parallel during migration. Route a percentage of traffic to the new index, compare metrics, and cut over when confident.
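The traffic split is usually done deterministically per user, so each user sees consistent results throughout the migration. A sketch (the collection names follow the versioning convention above):

```python
import hashlib

def route_collection(user_id: str, new_pct: int,
                     old: str = "documents_v3", new: str = "documents_v4") -> str:
    """Deterministically route new_pct% of users to the new index."""
    # Hash the user id into a stable bucket in [0, 100)
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return new if bucket < new_pct else old
```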

Latency Budget

A typical production RAG pipeline has the following latency profile:

| Component | Latency |
|---|---|
| Query embedding | 50-100ms |
| Vector search | 10-50ms |
| BM25 search | 5-20ms |
| RRF fusion | <5ms |
| Re-ranking | 100-300ms |
| LLM generation | 500-3000ms |
| Total | 700-3500ms |

Re-ranking and LLM generation dominate. If latency is critical, stream the LLM response so users see tokens early, and run the vector and BM25 searches concurrently. Re-ranking depends on the fused retrieval results, so it cannot overlap with retrieval itself.
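Overlapping the two retrieval calls is straightforward with asyncio. A sketch with stub search functions standing in for the real database and BM25 queries:

```python
import asyncio

async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # stands in for a ~50ms embedding + ANN query
    return ["chunk_a", "chunk_b"]

async def keyword_search(query: str) -> list[str]:
    await asyncio.sleep(0.02)  # stands in for a ~20ms BM25 query
    return ["chunk_b", "chunk_c"]

async def retrieve(query: str) -> list[str]:
    # Run both searches concurrently; latency ~= the slower of the two
    vec, kw = await asyncio.gather(vector_search(query), keyword_search(query))
    # Deduplicate while preserving order (RRF fusion would happen here)
    return list(dict.fromkeys(vec + kw))

# results = asyncio.run(retrieve("indemnification clause"))
```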

Case Study: Legal Document Knowledge Base

A legal technology company needed to build an internal knowledge base over 50,000+ legal documents — contracts, case summaries, regulatory filings, and internal memos. Lawyers needed to query this corpus in natural language and get accurate, cited answers.

Initial Approach (Naive RAG)

The first version used fixed 1000-token chunks, OpenAI ada-002 embeddings, Pinecone for storage, and a basic prompt template. Retrieval accuracy (measured as whether the correct source document appeared in the top-5 results) was 62%. Lawyers reported frequent irrelevant results and occasional hallucinated legal citations — a serious problem in a legal context.

Optimized Pipeline Built by Stripe Systems

The engineering team at Stripe Systems rebuilt the pipeline with the following changes:

Chunking: Switched to recursive character splitting with 512-token chunks and 50-token overlap. Legal documents have a clear structure (sections, subsections, clauses), so the splitter was configured to respect section boundaries as primary separators.

Chunking Strategy Comparison Results:

| Chunk Strategy | Chunk Size | Overlap | Retrieval F1 | Avg Relevance |
|---|---|---|---|---|
| Fixed-size | 1000 tokens | 0 | 0.58 | 0.61 |
| Fixed-size | 512 tokens | 50 | 0.67 | 0.69 |
| Fixed-size | 256 tokens | 25 | 0.64 | 0.72 |
| Sentence-based | ~400 tokens | 0 | 0.63 | 0.70 |
| Recursive | 512 tokens | 50 | 0.74 | 0.76 |
| Semantic | Variable | N/A | 0.71 | 0.78 |

Recursive chunking at 512 tokens with 50-token overlap gave the best balance of retrieval F1 and relevance.

Vector Storage: pgvector, co-located with the existing PostgreSQL database. This eliminated the need for a separate vector database service, simplified backups, and allowed joins between vector search results and document metadata stored in relational tables.

Hybrid Search: Combined pgvector cosine similarity with pg_trgm-based keyword search, fused with RRF. Legal documents contain specific terms (statute numbers, case citations) where exact match is critical.

Re-ranking: Cohere rerank-english-v3.0 applied to the top 20 hybrid search results, returning the top 5.

Pipeline Code:

from langchain_community.vectorstores import PGVector
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
import cohere

CONNECTION_STRING = "postgresql://user:pass@localhost:5432/legaldb"
COLLECTION_NAME = "legal_docs_v2"

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n\nSECTION",    # Legal section breaks
        "\n\nArticle",     # Article breaks
        "\n\n",            # Paragraph breaks
        "\n",              # Line breaks
        ". ",              # Sentence endings
        " ",
        "",
    ],
)

vectorstore = PGVector(
    connection_string=CONNECTION_STRING,
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings,
)

# Hybrid retrieval with re-ranking
def retrieve_and_rerank(query: str, top_k: int = 5) -> list[dict]:
    # Stage 1: Vector search (top 20). similarity_search_with_score returns
    # (Document, score) tuples; normalize them to the {"id": ...} dicts that
    # reciprocal_rank_fusion expects (assumes each chunk's id is in metadata)
    vector_results = [
        {"id": doc.metadata["chunk_id"], "score": score}
        for doc, score in vectorstore.similarity_search_with_score(query, k=20)
    ]

    # Stage 2: Keyword search (top 20) — custom pg_trgm implementation
    keyword_results = bm25_search(query, k=20)

    # Stage 3: Reciprocal rank fusion
    fused = reciprocal_rank_fusion(
        vector_results, keyword_results,
        vector_weight=0.6, keyword_weight=0.4  # higher keyword weight for legal
    )

    # Stage 4: Re-rank the fused top 20 with Cohere
    co = cohere.Client(api_key="...")
    candidates = [get_chunk_text(d["id"]) for d in fused[:20]]
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=candidates,
        top_n=top_k,
    )

    return [
        {"text": candidates[r.index], "score": r.relevance_score}
        for r in reranked.results
    ]

Evaluation Results (RAGAS Metrics)

Evaluated on a set of 200 question-answer pairs created by senior lawyers:

| Metric | Naive RAG | Optimized Pipeline |
|---|---|---|
| Context Precision | 0.54 | 0.88 |
| Context Recall | 0.61 | 0.89 |
| Answer Relevancy | 0.67 | 0.92 |
| Faithfulness | 0.72 | 0.95 |
| Overall Retrieval Accuracy | 62% | 91% |

The biggest improvement came from hybrid search (keyword matching for legal citations) and re-ranking (filtering out superficially similar but irrelevant chunks). The faithfulness score improvement from 0.72 to 0.95 meant that hallucinated legal citations — the most dangerous failure mode — dropped from roughly 1 in 4 answers to 1 in 20.

Production Metrics After 6 Months

  • Average query latency: 1.8 seconds (including streaming)
  • Daily queries: ~2,000
  • Embedding storage: 23 GB in pgvector (HNSW index)
  • Monthly infrastructure cost: ~$400 (PostgreSQL instance + OpenAI embeddings + Cohere re-ranking)
  • User satisfaction (internal survey): 4.2/5, up from 2.8/5 with the naive pipeline

The key takeaway: RAG pipeline quality is determined by the engineering of each component — parsing, chunking, retrieval, re-ranking — not by which LLM you use for generation. A well-built retrieval pipeline with a modest model will outperform a poorly built pipeline with the most expensive model available.
