Stripe Systems
AI/ML · March 19, 2026 · 14 min read

AI Code Review Agents: How We Built a Custom Pipeline That Catches Architecture Violations, Not Just Bugs

Stripe Systems Engineering

Generic AI code review tools are good at catching syntax errors, unused variables, and simple bugs. They are poor at catching architecture violations — the kind of issues that compound over months and make codebases progressively harder to work with. A service that directly imports a repository (bypassing the use-case layer), a controller that contains business logic, a missing DTO that leaks database schema to the API layer — these are the issues that matter most in mature codebases and that generic tools miss almost entirely.

The reason is straightforward: generic tools have no knowledge of your project's architecture. They do not know that your team uses Clean Architecture with specific layering rules. They do not know that repositories should only be accessed through use cases. They do not know your naming conventions, your dependency injection patterns, or the decisions documented in your ADRs.

This post describes how to build a custom AI code review pipeline that encodes your project's architecture rules and enforces them automatically on every pull request. We cover the technical architecture, prompt engineering, multi-pass review strategy, GitHub integration, false positive management, and the metrics that demonstrate whether it is working.

Why Generic AI Review Falls Short

We evaluated three commercial AI code review tools on a NestJS monorepo with Clean Architecture. We created 20 test PRs containing known architecture violations and measured detection rates:

| Violation Type | Tool A | Tool B | Tool C |
| --- | --- | --- | --- |
| Service imports repository directly | 1/5 | 0/5 | 0/5 |
| Controller contains business logic | 0/5 | 1/5 | 0/5 |
| Missing DTO (entity returned from controller) | 0/5 | 0/5 | 0/5 |
| Wrong dependency direction (domain → infrastructure) | 0/5 | 0/5 | 0/5 |
| Missing interface for external service | 2/5 | 1/5 | 0/5 |
| **Total Detection Rate** | 3/25 (12%) | 2/25 (8%) | 0/25 (0%) |

All three tools caught linting issues, potential null pointer errors, and suggested minor style improvements. None consistently caught architecture violations because they lack the project context to identify them.

Architecture of the Review Pipeline

The pipeline has five stages:

PR Webhook → Diff Extraction → Multi-Pass Analysis → Structured Output → GitHub Comment

PR Webhook Handler

A webhook listener receives GitHub PR events and triggers the review pipeline:

import asyncio
import hashlib
import hmac
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"]
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]

@app.post("/webhook/pr")
async def handle_pr_webhook(request: Request):
    # Verify webhook signature
    signature = request.headers.get("X-Hub-Signature-256", "")
    body = await request.body()

    expected = "sha256=" + hmac.new(
        WEBHOOK_SECRET.encode(), body, hashlib.sha256
    ).hexdigest()

    if not hmac.compare_digest(signature, expected):
        raise HTTPException(status_code=401, detail="Invalid signature")

    payload = await request.json()

    # Only review on PR open and synchronize (new commits)
    if payload.get("action") not in ("opened", "synchronize"):
        return {"status": "skipped"}

    pr_number = payload["pull_request"]["number"]
    repo_full_name = payload["repository"]["full_name"]

    # Run the review asynchronously; a production handler should keep a
    # strong reference to the task so it is not garbage-collected mid-review
    review_task = asyncio.create_task(
        run_review_pipeline(repo_full_name, pr_number)
    )

    return {"status": "review_started", "pr": pr_number}

Diff Extraction

We only review changed files — reviewing the entire codebase on every PR is wasteful and introduces noise. The GitHub API provides the diff:

import httpx

async def get_pr_diff(repo: str, pr_number: int) -> list[dict]:
    """Fetch changed files with their diffs from GitHub."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://api.github.com/repos/{repo}/pulls/{pr_number}/files",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3+json",
            },
        )
        response.raise_for_status()

    files = []
    for f in response.json():
        if f["status"] in ("added", "modified") and f["filename"].endswith(
            (".ts", ".tsx", ".js")
        ):
            files.append({
                "filename": f["filename"],
                "status": f["status"],
                "patch": f.get("patch", ""),
                "additions": f["additions"],
                "deletions": f["deletions"],
            })
    return files

Context Assembly

The diff alone is insufficient — the LLM needs project context to identify architecture violations. We assemble a context package that includes:

  1. Architecture rules: Extracted from ADRs and coding standards documents
  2. File structure: The directory layout indicating architectural layers
  3. Related files: Imports and dependencies of the changed files

ARCHITECTURE_CONTEXT = """
## Project Architecture: Clean Architecture (NestJS)

### Layer Rules (STRICT — violations must be flagged):
1. Controllers (src/*/controllers/) → ONLY import from Use Cases and DTOs
2. Use Cases (src/*/use-cases/) → ONLY import from Domain entities and Repository interfaces
3. Domain (src/*/domain/) → MUST NOT import from any other layer
4. Repositories (src/*/repositories/) → Implement interfaces defined in Domain
5. Infrastructure (src/*/infrastructure/) → Can import from Domain interfaces only

### Dependency Direction:
Controllers → Use Cases → Domain ← Repositories ← Infrastructure
(arrows show allowed import direction)

### Naming Conventions:
- Controllers: *.controller.ts
- Use Cases: *.use-case.ts (one public method: execute())
- DTOs: *.dto.ts (request and response separate)
- Entities: *.entity.ts (no decorators from infrastructure)
- Repositories: *.repository.ts (implement interface from domain)

### Common Violations to Watch For:
- Service/Controller importing directly from *.repository.ts (must go through use-case)
- Controller returning an entity instead of a response DTO
- Business logic in controllers (if/else on business rules, calculations, validations beyond input format)
- Domain entity importing from TypeORM, Prisma, or other ORM decorators
- Use case importing from infrastructure (e.g., importing HttpService, specific DB client)
- Missing interface: a use case depending on a concrete repository instead of an interface
"""

This context is included in the system prompt for the architecture review pass. It is the project-specific knowledge that makes this system effective where generic tools fail.
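Items 1 and 2 are static text; item 3 (related files) needs a small amount of tooling. A minimal sketch, assuming relative imports follow standard TypeScript syntax (the helper name is illustrative, not part of the pipeline above):

```python
import re

# Matches relative imports, e.g.: import { X } from '../domain/x.entity';
# Scoped packages (@nestjs/...) and bare package imports are ignored.
IMPORT_RE = re.compile(r"import\s+.*?from\s+['\"](\.{1,2}/[^'\"]+)['\"]")

def extract_related_paths(source: str) -> list[str]:
    """Return the relative import paths referenced by a TypeScript file,
    so the files they resolve to can be pulled into the review context."""
    return IMPORT_RE.findall(source)
```

The resolved paths can then be fetched through the GitHub contents API and appended to the prompt, subject to the token budget discussed later.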

Multi-Pass Review Strategy

A single LLM pass trying to catch everything produces noisy results. Instead, we use three specialized passes, each with a focused objective:

Pass 1: Lightweight Static Checks (No LLM)

Fast, deterministic checks using regex and AST parsing. These are cheap and produce zero false positives:

import re
from typing import NamedTuple

class StaticFinding(NamedTuple):
    file: str
    line: int
    rule: str
    message: str
    severity: str

def run_static_checks(files: list[dict]) -> list[StaticFinding]:
    findings = []

    for f in files:
        filename = f["filename"]
        lines = f["patch"].split("\n")

        for i, line in enumerate(lines):
            if not line.startswith("+"):
                continue
            code = line[1:]  # strip the + prefix
            line_number = extract_line_number(f["patch"], i)

            # Rule: Controllers must not import from repositories
            if "/controllers/" in filename:
                if re.search(r"from\s+['\"].*\.repository['\"]", code):
                    findings.append(StaticFinding(
                        file=filename,
                        line=line_number,
                        rule="ARCH-001",
                        message="Controller imports directly from a repository. Use a use-case instead.",
                        severity="error",
                    ))

            # Rule: Domain entities must not import ORM decorators
            if "/domain/" in filename:
                if re.search(r"from\s+['\"](@nestjs\/typeorm|typeorm|@prisma|prisma)['\"]", code):
                    findings.append(StaticFinding(
                        file=filename,
                        line=line_number,
                        rule="ARCH-002",
                        message="Domain entity imports from ORM library. Domain must be infrastructure-agnostic.",
                        severity="error",
                    ))

            # Rule: console.log in production code
            if filename.startswith("src/") and ".spec." not in filename:
                if re.search(r"console\.(log|debug|info)\(", code):
                    findings.append(StaticFinding(
                        file=filename,
                        line=line_number,
                        rule="CODE-001",
                        message="console.log in production code. Use the Logger service instead.",
                        severity="warning",
                    ))

    return findings

Pass 1 runs in milliseconds and catches the most obvious violations. It serves as a fast filter — if Pass 1 catches an issue, there is no need to spend LLM tokens on it.
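The `extract_line_number` helper used in the static checks is not shown above. A minimal sketch: it maps an index into the patch's lines to a line number in the new version of the file by walking the `@@` hunk headers.

```python
import re

def extract_line_number(patch: str, patch_index: int) -> int:
    """Map an index into the patch's lines to a line number in the
    new version of the file, using @@ -a,b +c,d @@ hunk headers."""
    new_line = 0
    for i, line in enumerate(patch.split("\n")):
        header = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", line)
        if header:
            # the next new-file line in this hunk will be c
            new_line = int(header.group(1)) - 1
            continue
        if line.startswith("-"):
            continue  # deletion: exists only in the old file
        new_line += 1  # context or '+' line: advances the new file
        if i == patch_index:
            return new_line
    return new_line
```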

Pass 2: Architecture and Logic Review (LLM)

The main review pass. The LLM analyzes the diff with full architecture context:

REVIEW_PROMPT = """You are a senior software architect reviewing a pull request
for a NestJS application that follows Clean Architecture.

{architecture_context}

## Files Changed in This PR:
{file_list}

## Diffs:
{diffs}

## Your Task:
Review the code changes for architecture violations, logic errors, and design issues.
Focus ONLY on issues that matter — do not comment on formatting, naming style preferences,
or minor issues that a linter would catch.

For each issue found, respond with a JSON object whose "issues" key holds an array:
```json
{{
  "issues": [
    {{
      "file": "src/orders/controllers/order.controller.ts",
      "line": 42,
      "severity": "error|warning|info",
      "category": "architecture|logic|security|design",
      "rule": "ARCH-XXX or descriptive label",
      "message": "Clear description of the issue",
      "suggestion": "How to fix it (be specific, reference the correct layer/pattern)"
    }}
  ]
}}
```

Rules:

  • Only flag issues you are confident about. If you are unsure, do not include it.
  • "error" severity: architecture violations, bugs, security issues
  • "warning" severity: design concerns, potential issues, missing abstractions
  • "info" severity: suggestions for improvement (use sparingly)
  • If there are no issues, return {{"issues": []}}
"""

import json

# openai_client is an AsyncOpenAI instance configured at startup
async def run_architecture_review(files: list[dict]) -> list[dict]:
    file_list = "\n".join(f"- {f['filename']} ({f['status']})" for f in files)
    diffs = "\n\n".join(
        f"### {f['filename']}\n```diff\n{f['patch']}\n```" for f in files
    )

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": REVIEW_PROMPT.format(
                    architecture_context=ARCHITECTURE_CONTEXT,
                    file_list=file_list,
                    diffs=diffs,
                ),
            },
        ],
        response_format={"type": "json_object"},
        temperature=0.1,  # low temperature for consistent reviews
        max_tokens=2000,
    )

    return json.loads(response.choices[0].message.content).get("issues", [])

Pass 3: Security-Focused Scan (LLM)

A separate pass focused exclusively on security concerns. Using a different prompt for security review avoids the "dilution effect" where the model tries to catch everything and catches nothing well:
SECURITY_PROMPT = """You are a security engineer reviewing code changes.
Focus exclusively on security vulnerabilities:

- SQL injection, NoSQL injection
- XSS (reflected, stored, DOM-based)
- Authentication/authorization bypasses
- Insecure direct object references
- Missing input validation on user-facing endpoints
- Hardcoded secrets, API keys, credentials
- Path traversal
- Insecure deserialization

Do NOT flag:
- Code style, architecture, or design issues
- Performance concerns
- Missing error handling (unless it leads to information disclosure)

Respond with a JSON object whose "issues" key holds an array. Use severity "error"
for confirmed vulnerabilities, "warning" for potential issues that need investigation.

{diffs}
"""

Separating security into its own pass improves detection rates. In our evaluation, a combined prompt caught 60% of injected security issues, while the dedicated security pass caught 85%.
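The wrapper for this pass is symmetric with the architecture pass. A sketch, assuming the same `openai_client` and the same object-with-"issues" response shape (`format_security_input` is an illustrative helper):

```python
import json

def format_security_input(files: list[dict]) -> str:
    """Concatenate per-file diffs in the shape {diffs} expects."""
    return "\n\n".join(f"### {f['filename']}\n{f['patch']}" for f in files)

async def run_security_review(files: list[dict]) -> list[dict]:
    # openai_client is the same AsyncOpenAI instance used by Pass 2;
    # JSON mode returns an object, so unwrap the "issues" array.
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": SECURITY_PROMPT.format(diffs=format_security_input(files)),
        }],
        response_format={"type": "json_object"},
        temperature=0.1,
        max_tokens=1500,
    )
    return json.loads(response.choices[0].message.content).get("issues", [])
```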

Structured Output and GitHub Integration

Merging Results from All Passes

async def run_review_pipeline(repo: str, pr_number: int):
    # Fetch changed files
    files = await get_pr_diff(repo, pr_number)

    if not files:
        return  # no reviewable files changed

    # Run all three passes
    static_findings = run_static_checks(files)

    # Run both LLM passes concurrently
    arch_findings, security_findings = await asyncio.gather(
        run_architecture_review(files),
        run_security_review(files),
    )

    # Merge and deduplicate
    all_findings = merge_findings(static_findings, arch_findings, security_findings)

    # Filter out known false positives
    filtered = apply_false_positive_filter(all_findings)

    # Post to GitHub
    if filtered:
        await post_review_comments(repo, pr_number, filtered)
    else:
        await post_approval(repo, pr_number)

Posting Review Comments

async def post_review_comments(repo: str, pr_number: int, findings: list[dict]):
    # Get the latest commit SHA for the PR
    pr_info = await get_pr_info(repo, pr_number)
    commit_sha = pr_info["head"]["sha"]

    comments = []
    for finding in findings:
        severity_emoji = {
            "error": "🔴",
            "warning": "🟡",
            "info": "💡"
        }.get(finding["severity"], "")

        body = (
            f"{severity_emoji} **{finding['rule']}** ({finding['category']})\n\n"
            f"{finding['message']}\n\n"
        )
        if finding.get("suggestion"):
            body += f"**Suggestion:** {finding['suggestion']}\n\n"
        body += (
            f"<sub>Severity: {finding['severity']} | "
            f"[Not a real issue? Click here to report false positive]"
            f"({FEEDBACK_URL}?finding={finding['rule']}&file={finding['file']})"
            f"</sub>"
        )

        comments.append({
            "path": finding["file"],
            "line": finding["line"],
            "body": body,
        })

    # Post as a PR review
    async with httpx.AsyncClient() as client:
        await client.post(
            f"https://api.github.com/repos/{repo}/pulls/{pr_number}/reviews",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3+json",
            },
            json={
                "commit_id": commit_sha,
                "body": f"AI Architecture Review: {len(findings)} issues found",
                "event": "COMMENT",
                "comments": comments,
            },
        )

False Positive Management

False positives erode developer trust faster than anything else. A review tool that flags non-issues gets ignored, then disabled. Managing false positives is not optional — it is a core system requirement.

Feedback Loop

Every review comment includes a "not a real issue" link. When a developer clicks it, the finding is logged:

import hashlib

from dataclasses import dataclass
from datetime import datetime

@dataclass
class FalsePositiveReport:
    rule: str
    file: str
    finding_message: str
    reporter: str
    timestamp: datetime
    context: str  # the code that was flagged

# Store in a database
def record_false_positive(report: FalsePositiveReport):
    db.insert("false_positives", {
        "rule": report.rule,
        "file_pattern": extract_layer(report.file),  # e.g., "controllers"
        # stable across processes (builtin hash() is salted per run)
        "message_hash": hashlib.sha256(report.finding_message.encode()).hexdigest(),
        "reporter": report.reporter,
        "timestamp": report.timestamp.isoformat(),
        "context": report.context,
    })

Prompt Improvement Cycle

Weekly, review the false positive reports:

  1. Pattern analysis: Are most false positives from one rule? That rule's prompt needs refinement.
  2. Context gaps: Did the LLM flag something because it lacked context? Add that context to the architecture description.
  3. Ambiguous rules: Is the rule legitimately ambiguous? Add examples to the prompt showing what is and is not a violation.

# Example: Refining the prompt based on false positives
# Original rule: "Controllers must not contain business logic"
# False positives: Input validation in controllers flagged as "business logic"

# Refined rule:
REFINED_RULES = """
Controllers must not contain business logic.
Business logic includes: conditional branching on domain rules, calculations,
state transitions, applying business policies.
NOT business logic (do not flag): input format validation (checking required fields,
type coercion, format validation), authentication/authorization checks
(guard decorators), response mapping (entity → DTO), pagination parameter handling.
"""

Suppression Mechanism

For known false positive patterns that are hard to fix in the prompt, add explicit suppressions:

SUPPRESSED_PATTERNS = [
    {
        "rule": "ARCH-001",
        "file_pattern": r".*\.spec\.ts$",
        "reason": "Test files may import from any layer",
    },
    {
        "rule": "ARCH-003",
        "file_pattern": r"src/shared/.*",
        "reason": "Shared module is exempt from strict layering",
    },
]

def apply_false_positive_filter(findings: list[dict]) -> list[dict]:
    filtered = []
    for finding in findings:
        suppressed = False
        for pattern in SUPPRESSED_PATTERNS:
            if (finding["rule"] == pattern["rule"] and
                re.match(pattern["file_pattern"], finding["file"])):
                suppressed = True
                break
        if not suppressed:
            filtered.append(finding)
    return filtered

Cost Management

AI code review runs on every PR. At 15-20 PRs/day with an average of 5 changed files per PR, costs can add up.

Diff-Only Context

Only send the diff, not entire files. This typically reduces input tokens by 70-80%.

Caching Repeated Patterns

If a developer pushes multiple commits to the same PR, only review the new changes. Cache the review results for files that have not changed since the last review.
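This cache can be sketched as an in-memory dict keyed on a hash of each file's patch; a byte-identical diff on a later push is never re-reviewed. All names here are illustrative:

```python
import hashlib

# patch-hash -> findings from the last review of that exact diff
_review_cache: dict[str, list[dict]] = {}

def cache_key(f: dict) -> str:
    return hashlib.sha256(f"{f['filename']}:{f['patch']}".encode()).hexdigest()

def split_cached(files: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition into (findings already cached, files still needing review)."""
    cached_findings: list[dict] = []
    to_review: list[dict] = []
    for f in files:
        hits = _review_cache.get(cache_key(f))
        if hits is not None:
            cached_findings.extend(hits)
        else:
            to_review.append(f)
    return cached_findings, to_review

def store_results(files: list[dict], findings: list[dict]) -> None:
    """Record this round's findings so unchanged files skip the next pass."""
    by_file: dict[str, list[dict]] = {f["filename"]: [] for f in files}
    for finding in findings:
        by_file.setdefault(finding["file"], []).append(finding)
    for f in files:
        _review_cache[cache_key(f)] = by_file[f["filename"]]
```

A production version would live in Redis or a database keyed by PR, but the logic is the same.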

Token Budget per PR

Set a maximum token budget per review. If a PR changes 50 files, review the most important ones (based on file path — controllers, use cases, domain files first) and skip configuration files, test utilities, and auto-generated code.

MAX_TOKENS_PER_REVIEW = 8000  # input tokens

def prioritize_files(files: list[dict]) -> list[dict]:
    priority_order = [
        r".*/controllers/.*",
        r".*/use-cases/.*",
        r".*/domain/.*",
        r".*/repositories/.*",
        r".*/services/.*",
    ]

    def file_priority(f):
        for i, pattern in enumerate(priority_order):
            if re.match(pattern, f["filename"]):
                return i
        return len(priority_order)

    sorted_files = sorted(files, key=file_priority)

    # Include files until we hit the token budget
    selected = []
    token_count = 0
    for f in sorted_files:
        file_tokens = estimate_tokens(f["patch"])
        if token_count + file_tokens > MAX_TOKENS_PER_REVIEW:
            break
        selected.append(f)
        token_count += file_tokens

    return selected
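The `estimate_tokens` helper referenced above can be as crude as a characters-per-token heuristic; budget enforcement does not need exact counts (swap in tiktoken if precision matters):

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: roughly 4 characters per token for code-heavy
    English text. Close enough for enforcing a token budget."""
    return len(text) // 4 + 1
```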

Measuring Effectiveness

Without metrics, you cannot know if the review agent is helping or just adding noise. Track these:

Defects Caught Pre-Merge

The primary metric. Count the number of review comments that resulted in code changes (the developer agreed and fixed the issue). Exclude false positives and ignored comments.
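One way to compute this from the feedback data, assuming each logged comment carries an `outcome` field set by the feedback tooling ("fixed", "ignored", or "false_positive"; the helper is illustrative):

```python
def acceptance_rate(comments: list[dict]) -> float:
    """Fraction of actionable AI comments that led to a code change.
    False positives are excluded from the denominator entirely."""
    actionable = [c for c in comments if c["outcome"] != "false_positive"]
    if not actionable:
        return 0.0
    fixed = sum(1 for c in actionable if c["outcome"] == "fixed")
    return fixed / len(actionable)
```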

Review Cycle Time

Time from PR creation to first review comment. The AI reviewer should respond within minutes, not hours. Compare with the team's average human review time.

Architecture Violations Per Sprint

Track the number of architecture violations that reach the main branch (caught by manual audits or integration tests). This should decrease over time as developers learn from the AI reviews.

Developer Satisfaction

Survey developers quarterly. Ask:

  • Does the AI reviewer catch useful issues? (1-5)
  • Is the false positive rate acceptable? (1-5)
  • Do you trust the AI reviewer's suggestions? (1-5)

A tool that developers hate will be disabled regardless of its objective effectiveness.

Case Study: NestJS Clean Architecture Team

A team of 12 developers working on a NestJS monorepo implementing a B2B order management system. The codebase used Clean Architecture with strict layering, but architecture violations had been creeping in over 8 months of rapid development.

The Problem

A manual architecture audit found:

  • 23 instances of services importing repositories directly (bypassing use cases)
  • 11 controllers containing business logic (conditional pricing rules, discount calculations)
  • 8 API endpoints returning raw entities instead of DTOs
  • 5 domain entities decorated with TypeORM decorators

Average PR review cycle time was 4.2 hours. Architecture violations caught in review: approximately 3 per sprint (out of an estimated 12+ that were introduced).

The Solution

Stripe Systems built the 3-pass review agent described in this post, configured with the team's specific architecture rules, ADRs, and coding standards.

Pass 1 (Static): 6 regex rules covering the most common violations — direct repository imports in controllers, ORM decorators in domain entities, console.log in production code, missing DTO patterns.

Pass 2 (Architecture LLM): GPT-4o with the team's full architecture context, including examples of correct and incorrect patterns from their codebase.

Pass 3 (Security): GPT-4o focused scan on authentication decorators, input validation, and SQL/NoSQL injection patterns.

Sample Review Output

On a PR that added a new pricing endpoint:

File: src/orders/controllers/pricing.controller.ts

+ @Get('calculate')
+ async calculatePrice(@Query() query: CalculatePriceQuery) {
+   const product = await this.productRepository.findById(query.productId);
+   let price = product.basePrice;
+   if (query.quantity > 100) {
+     price = price * 0.9; // 10% bulk discount
+   }
+   if (product.category === 'electronics') {
+     price = price * 1.18; // GST
+   }
+   return { price, product };
+ }

Review comments posted:

🔴 ARCH-001 (architecture)

Controller imports and uses productRepository directly. In Clean Architecture, controllers should delegate to use cases. Create a CalculatePriceUseCase that encapsulates this logic.

Suggestion: Move the pricing logic to src/orders/use-cases/calculate-price.use-case.ts and inject it into the controller.

🔴 ARCH-004 (architecture)

Controller contains business logic: bulk discount calculation and GST application. This logic belongs in the domain layer or a use case, not in a controller.

Suggestion: Create a PricingService in the domain layer that handles discount rules and tax calculations.

🟡 ARCH-005 (architecture)

Endpoint returns the raw product entity. Create a CalculatePriceResponseDto to control the API surface and prevent leaking internal entity structure.

Suggestion: Define a response DTO in src/orders/dtos/calculate-price-response.dto.ts.

🟡 CODE-003 (design)

Discount rules (10% for quantity > 100) and tax rates (18% GST for electronics) are hardcoded. These should be configurable or pulled from a rules table.

Suggestion: Consider a pricing rules configuration or database table.

Metrics After 3 Months

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| PR review cycle time | 4.2 hours | 1.8 hours | -57% |
| Architecture violations per sprint | ~12 introduced, ~3 caught | ~3 introduced, ~2 caught | -75% introduced |
| Architecture violations reaching main branch | ~9 per sprint | ~1 per sprint | -89% |
| False positive rate | N/A | 14% (month 1) → 6% (month 3) | Improving |
| Developer satisfaction (1-5) | N/A | 3.8 (month 1) → 4.3 (month 3) | Improving |
| Monthly review cost | $0 (human time only) | $180 (LLM API costs) | n/a |
| Human reviewer time per PR | 35 minutes | 18 minutes | -49% |

The most significant result was not the cycle time reduction — it was the drop in violations introduced per sprint from ~12 to ~3. Developers started self-correcting before submitting PRs because they knew the AI reviewer would catch violations. The review agent became a teaching tool, not just a gate.

The false positive rate started at 14% in the first month and dropped to 6% by month 3 through the feedback loop. The remaining false positives were edge cases in the shared module and test files, most of which were handled by suppression rules.

What Did Not Work

  • Reviewing auto-generated files: Prisma migrations and auto-generated type files produced nothing but noise. We added an exclusion list.
  • Reviewing large refactoring PRs: PRs with 40+ files exceeded the token budget and the model struggled with the volume of changes. We added a recommendation to split large PRs.
  • Security pass on non-web code: The security pass flagged false positives on internal utility functions that never handle user input. We scoped it to controller and middleware files only.

The overall conclusion: a custom AI code review system works when it encodes specific, well-defined architecture rules and when there is a functioning feedback loop to manage false positives. The LLM is not doing the hard part — defining and maintaining the architecture rules is. The LLM is a flexible pattern matcher that applies those rules more consistently than human reviewers who are tired, busy, or unfamiliar with the codebase.
