Stripe Systems
AI/ML · March 19, 2026 · 14 min read

AI Code Review Agents: How We Built a Custom Pipeline That Catches Architecture Violations, Not Just Bugs

Stripe Systems Engineering

Generic AI code review tools are good at catching syntax errors, unused variables, and simple bugs. They are poor at catching architecture violations — the kind of issues that compound over months and make codebases progressively harder to work with. A service that directly imports a repository (bypassing the use-case layer), a controller that contains business logic, a missing DTO that leaks database schema to the API layer — these are the issues that matter most in mature codebases and that generic tools miss almost entirely.

The reason is straightforward: generic tools have no knowledge of your project's architecture. They do not know that your team uses Clean Architecture with specific layering rules. They do not know that repositories should only be accessed through use cases. They do not know your naming conventions, your dependency injection patterns, or the decisions documented in your ADRs.

This post describes how to build a custom AI code review pipeline that encodes your project's architecture rules and enforces them automatically on every pull request. We cover the technical architecture, prompt engineering, multi-pass review strategy, GitHub integration, false positive management, and the metrics that demonstrate whether it is working.

Why Generic AI Review Falls Short

We evaluated three commercial AI code review tools on a NestJS monorepo with Clean Architecture. We created 20 test PRs containing known architecture violations and measured detection rates:

| Violation Type | Tool A | Tool B | Tool C |
| --- | --- | --- | --- |
| Service imports repository directly | 1/5 | 0/5 | 0/5 |
| Controller contains business logic | 0/5 | 1/5 | 0/5 |
| Missing DTO (entity returned from controller) | 0/5 | 0/5 | 0/5 |
| Wrong dependency direction (domain → infrastructure) | 0/5 | 0/5 | 0/5 |
| Missing interface for external service | 2/5 | 1/5 | 0/5 |
| **Total Detection Rate** | 3/25 (12%) | 2/25 (8%) | 0/25 (0%) |

All three tools caught linting issues, potential null pointer errors, and suggested minor style improvements. None consistently caught architecture violations because they lack the project context to identify them.

Architecture of the Review Pipeline

The pipeline has five stages:

PR Webhook → Diff Extraction → Multi-Pass Analysis → Structured Output → GitHub Comment

PR Webhook Handler

A webhook listener receives GitHub PR events and triggers the review pipeline:

import asyncio
import hashlib
import hmac
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"]
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]

@app.post("/webhook/pr")
async def handle_pr_webhook(request: Request):
    # Verify webhook signature
    signature = request.headers.get("X-Hub-Signature-256", "")
    body = await request.body()

    expected = "sha256=" + hmac.new(
        WEBHOOK_SECRET.encode(), body, hashlib.sha256
    ).hexdigest()

    if not hmac.compare_digest(signature, expected):
        raise HTTPException(status_code=401, detail="Invalid signature")

    payload = await request.json()

    # Only review on PR open and synchronize (new commits)
    if payload.get("action") not in ("opened", "synchronize"):
        return {"status": "skipped"}

    pr_number = payload["pull_request"]["number"]
    repo_full_name = payload["repository"]["full_name"]

    # Run the review asynchronously; a production handler should keep a
    # strong reference to the task so it is not garbage-collected mid-review
    review_task = asyncio.create_task(
        run_review_pipeline(repo_full_name, pr_number)
    )

    return {"status": "review_started", "pr": pr_number}

Diff Extraction

We only review changed files — reviewing the entire codebase on every PR is wasteful and introduces noise. The GitHub API provides the diff:

import httpx

async def get_pr_diff(repo: str, pr_number: int) -> list[dict]:
    """Fetch changed files with their diffs from GitHub."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://api.github.com/repos/{repo}/pulls/{pr_number}/files",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3+json",
            },
        )
        response.raise_for_status()

    files = []
    for f in response.json():
        if f["status"] in ("added", "modified") and f["filename"].endswith(
            (".ts", ".tsx", ".js")
        ):
            files.append({
                "filename": f["filename"],
                "status": f["status"],
                "patch": f.get("patch", ""),
                "additions": f["additions"],
                "deletions": f["deletions"],
            })
    return files

Context Assembly

The diff alone is insufficient — the LLM needs project context to identify architecture violations. We assemble a context package that includes:

  1. Architecture rules: Extracted from ADRs and coding standards documents
  2. File structure: The directory layout indicating architectural layers
  3. Related files: Imports and dependencies of the changed files

ARCHITECTURE_CONTEXT = """
## Project Architecture: Clean Architecture (NestJS)

### Layer Rules (STRICT — violations must be flagged):
1. Controllers (src/*/controllers/) → ONLY import from Use Cases and DTOs
2. Use Cases (src/*/use-cases/) → ONLY import from Domain entities and Repository interfaces
3. Domain (src/*/domain/) → MUST NOT import from any other layer
4. Repositories (src/*/repositories/) → Implement interfaces defined in Domain
5. Infrastructure (src/*/infrastructure/) → Can import from Domain interfaces only

### Dependency Direction:
Controllers → Use Cases → Domain ← Repositories ← Infrastructure
(arrows show allowed import direction)

### Naming Conventions:
- Controllers: *.controller.ts
- Use Cases: *.use-case.ts (one public method: execute())
- DTOs: *.dto.ts (request and response separate)
- Entities: *.entity.ts (no decorators from infrastructure)
- Repositories: *.repository.ts (implement interface from domain)

### Common Violations to Watch For:
- Service/Controller importing directly from *.repository.ts (must go through use-case)
- Controller returning an entity instead of a response DTO
- Business logic in controllers (if/else on business rules, calculations, validations beyond input format)
- Domain entity importing from TypeORM, Prisma, or other ORM decorators
- Use case importing from infrastructure (e.g., importing HttpService, specific DB client)
- Missing interface: a use case depending on a concrete repository instead of an interface
"""

This context is included in the system prompt for the architecture review pass. It is the project-specific knowledge that makes this system effective where generic tools fail.
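Items 1 and 2 are static text; item 3 (related files) needs a small amount of tooling. A minimal sketch, assuming relative imports follow standard TypeScript syntax (the helper name is illustrative, not part of the pipeline above):

```python
import re

# Matches relative imports, e.g.: import { X } from '../domain/x.entity';
# Scoped packages (@nestjs/...) and bare package imports are ignored.
IMPORT_RE = re.compile(r"import\s+.*?from\s+['\"](\.{1,2}/[^'\"]+)['\"]")

def extract_related_paths(source: str) -> list[str]:
    """Return the relative import paths referenced by a TypeScript file,
    so the files they resolve to can be pulled into the review context."""
    return IMPORT_RE.findall(source)
```

The resolved paths can then be fetched through the GitHub contents API and appended to the prompt, subject to the token budget discussed later.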

Multi-Pass Review Strategy

A single LLM pass trying to catch everything produces noisy results. Instead, we use three specialized passes, each with a focused objective:

Pass 1: Lightweight Static Checks (No LLM)

Fast, deterministic checks using regex and AST parsing. These are cheap and produce zero false positives:

import re
from typing import NamedTuple

class StaticFinding(NamedTuple):
    file: str
    line: int
    rule: str
    message: str
    severity: str

def run_static_checks(files: list[dict]) -> list[StaticFinding]:
    findings = []

    for f in files:
        filename = f["filename"]
        lines = f["patch"].split("\n")

        for i, line in enumerate(lines):
            if not line.startswith("+"):
                continue
            code = line[1:]  # strip the + prefix
            line_number = extract_line_number(f["patch"], i)

            # Rule: Controllers must not import from repositories
            if "/controllers/" in filename:
                if re.search(r"from\s+['\"].*\.repository['\"]", code):
                    findings.append(StaticFinding(
                        file=filename,
                        line=line_number,
                        rule="ARCH-001",
                        message="Controller imports directly from a repository. Use a use-case instead.",
                        severity="error",
                    ))

            # Rule: Domain entities must not import ORM decorators
            if "/domain/" in filename:
                if re.search(r"from\s+['\"](@nestjs\/typeorm|typeorm|@prisma|prisma)['\"]", code):
                    findings.append(StaticFinding(
                        file=filename,
                        line=line_number,
                        rule="ARCH-002",
                        message="Domain entity imports from ORM library. Domain must be infrastructure-agnostic.",
                        severity="error",
                    ))

            # Rule: console.log in production code
            if filename.startswith("src/") and ".spec." not in filename:
                if re.search(r"console\.(log|debug|info)\(", code):
                    findings.append(StaticFinding(
                        file=filename,
                        line=line_number,
                        rule="CODE-001",
                        message="console.log in production code. Use the Logger service instead.",
                        severity="warning",
                    ))

    return findings

Pass 1 runs in milliseconds and catches the most obvious violations. It serves as a fast filter — if Pass 1 catches an issue, there is no need to spend LLM tokens on it.
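The `extract_line_number` helper used in the static checks is not shown above. A minimal sketch: it maps an index into the patch's lines to a line number in the new version of the file by walking the `@@` hunk headers.

```python
import re

def extract_line_number(patch: str, patch_index: int) -> int:
    """Map an index into the patch's lines to a line number in the
    new version of the file, using @@ -a,b +c,d @@ hunk headers."""
    new_line = 0
    for i, line in enumerate(patch.split("\n")):
        header = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", line)
        if header:
            # the next new-file line in this hunk will be c
            new_line = int(header.group(1)) - 1
            continue
        if line.startswith("-"):
            continue  # deletion: exists only in the old file
        new_line += 1  # context or '+' line: advances the new file
        if i == patch_index:
            return new_line
    return new_line
```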

Pass 2: Architecture and Logic Review (LLM)

The main review pass. The LLM analyzes the diff with full architecture context:

REVIEW_PROMPT = """You are a senior software architect reviewing a pull request
for a NestJS application that follows Clean Architecture.

{architecture_context}

## Files Changed in This PR:
{file_list}

## Diffs:
{diffs}

## Your Task:
Review the code changes for architecture violations, logic errors, and design issues.
Focus ONLY on issues that matter — do not comment on formatting, naming style preferences,
or minor issues that a linter would catch.

For each issue found, respond with a JSON object whose "issues" key holds an array:
```json
{{
  "issues": [
    {{
      "file": "src/orders/controllers/order.controller.ts",
      "line": 42,
      "severity": "error|warning|info",
      "category": "architecture|logic|security|design",
      "rule": "ARCH-XXX or descriptive label",
      "message": "Clear description of the issue",
      "suggestion": "How to fix it (be specific, reference the correct layer/pattern)"
    }}
  ]
}}
```

Rules:

  • Only flag issues you are confident about. If you are unsure, do not include it.
  • "error" severity: architecture violations, bugs, security issues
  • "warning" severity: design concerns, potential issues, missing abstractions
  • "info" severity: suggestions for improvement (use sparingly)
  • If there are no issues, return {{"issues": []}}
"""

import json

# openai_client is an AsyncOpenAI instance configured at startup
async def run_architecture_review(files: list[dict]) -> list[dict]:
    file_list = "\n".join(f"- {f['filename']} ({f['status']})" for f in files)
    diffs = "\n\n".join(
        f"### {f['filename']}\n```diff\n{f['patch']}\n```" for f in files
    )

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": REVIEW_PROMPT.format(
                    architecture_context=ARCHITECTURE_CONTEXT,
                    file_list=file_list,
                    diffs=diffs,
                ),
            },
        ],
        response_format={"type": "json_object"},
        temperature=0.1,  # low temperature for consistent reviews
        max_tokens=2000,
    )

    return json.loads(response.choices[0].message.content).get("issues", [])

Pass 3: Security-Focused Scan (LLM)

A separate pass focused exclusively on security concerns. Using a different prompt for security review avoids the "dilution effect" where the model tries to catch everything and catches nothing well:
SECURITY_PROMPT = """You are a security engineer reviewing code changes.
Focus exclusively on security vulnerabilities:

- SQL injection, NoSQL injection
- XSS (reflected, stored, DOM-based)
- Authentication/authorization bypasses
- Insecure direct object references
- Missing input validation on user-facing endpoints
- Hardcoded secrets, API keys, credentials
- Path traversal
- Insecure deserialization

Do NOT flag:
- Code style, architecture, or design issues
- Performance concerns
- Missing error handling (unless it leads to information disclosure)

Respond with a JSON object whose "issues" key holds an array. Use severity "error"
for confirmed vulnerabilities, "warning" for potential issues that need investigation.

{diffs}
"""

Separating security into its own pass improves detection rates. In our evaluation, a combined prompt caught 60% of injected security issues, while the dedicated security pass caught 85%.
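The wrapper for this pass is symmetric with the architecture pass. A sketch, assuming the same `openai_client` and the same object-with-"issues" response shape (`format_security_input` is an illustrative helper):

```python
import json

def format_security_input(files: list[dict]) -> str:
    """Concatenate per-file diffs in the shape {diffs} expects."""
    return "\n\n".join(f"### {f['filename']}\n{f['patch']}" for f in files)

async def run_security_review(files: list[dict]) -> list[dict]:
    # openai_client is the same AsyncOpenAI instance used by Pass 2;
    # JSON mode returns an object, so unwrap the "issues" array.
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": SECURITY_PROMPT.format(diffs=format_security_input(files)),
        }],
        response_format={"type": "json_object"},
        temperature=0.1,
        max_tokens=1500,
    )
    return json.loads(response.choices[0].message.content).get("issues", [])
```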

Structured Output and GitHub Integration

Merging Results from All Passes

async def run_review_pipeline(repo: str, pr_number: int):
    # Fetch changed files
    files = await get_pr_diff(repo, pr_number)

    if not files:
        return  # no reviewable files changed

    # Run all three passes
    static_findings = run_static_checks(files)

    # Run both LLM passes concurrently
    arch_findings, security_findings = await asyncio.gather(
        run_architecture_review(files),
        run_security_review(files),
    )

    # Merge and deduplicate
    all_findings = merge_findings(static_findings, arch_findings, security_findings)

    # Filter out known false positives
    filtered = apply_false_positive_filter(all_findings)

    # Post to GitHub
    if filtered:
        await post_review_comments(repo, pr_number, filtered)
    else:
        await post_approval(repo, pr_number)

Posting Review Comments

async def post_review_comments(repo: str, pr_number: int, findings: list[dict]):
    # Get the latest commit SHA for the PR
    pr_info = await get_pr_info(repo, pr_number)
    commit_sha = pr_info["head"]["sha"]

    comments = []
    for finding in findings:
        severity_emoji = {
            "error": "🔴",
            "warning": "🟡",
            "info": "💡"
        }.get(finding["severity"], "")

        body = (
            f"{severity_emoji} **{finding['rule']}** ({finding['category']})\n\n"
            f"{finding['message']}\n\n"
        )
        if finding.get("suggestion"):
            body += f"**Suggestion:** {finding['suggestion']}\n\n"
        body += (
            f"<sub>Severity: {finding['severity']} | "
            f"[Not a real issue? Click here to report false positive]"
            f"({FEEDBACK_URL}?finding={finding['rule']}&file={finding['file']})"
            f"</sub>"
        )

        comments.append({
            "path": finding["file"],
            "line": finding["line"],
            "body": body,
        })

    # Post as a PR review
    async with httpx.AsyncClient() as client:
        await client.post(
            f"https://api.github.com/repos/{repo}/pulls/{pr_number}/reviews",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3+json",
            },
            json={
                "commit_id": commit_sha,
                "body": f"AI Architecture Review: {len(findings)} issues found",
                "event": "COMMENT",
                "comments": comments,
            },
        )

False Positive Management

False positives erode developer trust faster than anything else. A review tool that flags non-issues gets ignored, then disabled. Managing false positives is not optional — it is a core system requirement.

Feedback Loop

Every review comment includes a "not a real issue" link. When a developer clicks it, the finding is logged:

import hashlib

from dataclasses import dataclass
from datetime import datetime

@dataclass
class FalsePositiveReport:
    rule: str
    file: str
    finding_message: str
    reporter: str
    timestamp: datetime
    context: str  # the code that was flagged

# Store in a database
def record_false_positive(report: FalsePositiveReport):
    db.insert("false_positives", {
        "rule": report.rule,
        "file_pattern": extract_layer(report.file),  # e.g., "controllers"
        # stable across processes (builtin hash() is salted per run)
        "message_hash": hashlib.sha256(report.finding_message.encode()).hexdigest(),
        "reporter": report.reporter,
        "timestamp": report.timestamp.isoformat(),
        "context": report.context,
    })

Prompt Improvement Cycle

Weekly, review the false positive reports:

  1. Pattern analysis: Are most false positives from one rule? That rule's prompt needs refinement.
  2. Context gaps: Did the LLM flag something because it lacked context? Add that context to the architecture description.
  3. Ambiguous rules: Is the rule legitimately ambiguous? Add examples to the prompt showing what is and is not a violation.

# Example: Refining the prompt based on false positives
# Original rule: "Controllers must not contain business logic"
# False positives: Input validation in controllers flagged as "business logic"

# Refined rule:
REFINED_RULES = """
Controllers must not contain business logic.
Business logic includes: conditional branching on domain rules, calculations,
state transitions, applying business policies.
NOT business logic (do not flag): input format validation (checking required fields,
type coercion, format validation), authentication/authorization checks
(guard decorators), response mapping (entity → DTO), pagination parameter handling.
"""

Suppression Mechanism

For known false positive patterns that are hard to fix in the prompt, add explicit suppressions:

SUPPRESSED_PATTERNS = [
    {
        "rule": "ARCH-001",
        "file_pattern": r".*\.spec\.ts$",
        "reason": "Test files may import from any layer",
    },
    {
        "rule": "ARCH-003",
        "file_pattern": r"src/shared/.*",
        "reason": "Shared module is exempt from strict layering",
    },
]

def apply_false_positive_filter(findings: list[dict]) -> list[dict]:
    filtered = []
    for finding in findings:
        suppressed = False
        for pattern in SUPPRESSED_PATTERNS:
            if (finding["rule"] == pattern["rule"] and
                re.match(pattern["file_pattern"], finding["file"])):
                suppressed = True
                break
        if not suppressed:
            filtered.append(finding)
    return filtered

Cost Management

AI code review runs on every PR. At 15-20 PRs/day with an average of 5 changed files per PR, costs can add up.

Diff-Only Context

Only send the diff, not entire files. This typically reduces input tokens by 70-80%.

Caching Repeated Patterns

If a developer pushes multiple commits to the same PR, only review the new changes. Cache the review results for files that have not changed since the last review.
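This cache can be sketched as an in-memory dict keyed on a hash of each file's patch; a byte-identical diff on a later push is never re-reviewed. All names here are illustrative:

```python
import hashlib

# patch-hash -> findings from the last review of that exact diff
_review_cache: dict[str, list[dict]] = {}

def cache_key(f: dict) -> str:
    return hashlib.sha256(f"{f['filename']}:{f['patch']}".encode()).hexdigest()

def split_cached(files: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition into (findings already cached, files still needing review)."""
    cached_findings: list[dict] = []
    to_review: list[dict] = []
    for f in files:
        hits = _review_cache.get(cache_key(f))
        if hits is not None:
            cached_findings.extend(hits)
        else:
            to_review.append(f)
    return cached_findings, to_review

def store_results(files: list[dict], findings: list[dict]) -> None:
    """Record this round's findings so unchanged files skip the next pass."""
    by_file: dict[str, list[dict]] = {f["filename"]: [] for f in files}
    for finding in findings:
        by_file.setdefault(finding["file"], []).append(finding)
    for f in files:
        _review_cache[cache_key(f)] = by_file[f["filename"]]
```

A production version would live in Redis or a database keyed by PR, but the logic is the same.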

Token Budget per PR

Set a maximum token budget per review. If a PR changes 50 files, review the most important ones (based on file path — controllers, use cases, domain files first) and skip configuration files, test utilities, and auto-generated code.

MAX_TOKENS_PER_REVIEW = 8000  # input tokens

def prioritize_files(files: list[dict]) -> list[dict]:
    priority_order = [
        r".*/controllers/.*",
        r".*/use-cases/.*",
        r".*/domain/.*",
        r".*/repositories/.*",
        r".*/services/.*",
    ]

    def file_priority(f):
        for i, pattern in enumerate(priority_order):
            if re.match(pattern, f["filename"]):
                return i
        return len(priority_order)

    sorted_files = sorted(files, key=file_priority)

    # Include files until we hit the token budget
    selected = []
    token_count = 0
    for f in sorted_files:
        file_tokens = estimate_tokens(f["patch"])
        if token_count + file_tokens > MAX_TOKENS_PER_REVIEW:
            break
        selected.append(f)
        token_count += file_tokens

    return selected
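The `estimate_tokens` helper referenced above can be as crude as a characters-per-token heuristic; budget enforcement does not need exact counts (swap in tiktoken if precision matters):

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: roughly 4 characters per token for code-heavy
    English text. Close enough for enforcing a token budget."""
    return len(text) // 4 + 1
```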

Measuring Effectiveness

Without metrics, you cannot know if the review agent is helping or just adding noise. Track these:

Defects Caught Pre-Merge

The primary metric. Count the number of review comments that resulted in code changes (the developer agreed and fixed the issue). Exclude false positives and ignored comments.
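One way to compute this from the feedback data, assuming each logged comment carries an `outcome` field set by the feedback tooling ("fixed", "ignored", or "false_positive"; the helper is illustrative):

```python
def acceptance_rate(comments: list[dict]) -> float:
    """Fraction of actionable AI comments that led to a code change.
    False positives are excluded from the denominator entirely."""
    actionable = [c for c in comments if c["outcome"] != "false_positive"]
    if not actionable:
        return 0.0
    fixed = sum(1 for c in actionable if c["outcome"] == "fixed")
    return fixed / len(actionable)
```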

Review Cycle Time

Time from PR creation to first review comment. The AI reviewer should respond within minutes, not hours. Compare with the team's average human review time.

Architecture Violations Per Sprint

Track the number of architecture violations that reach the main branch (caught by manual audits or integration tests). This should decrease over time as developers learn from the AI reviews.

Developer Satisfaction

Survey developers quarterly. Ask:

  • Does the AI reviewer catch useful issues? (1-5)
  • Is the false positive rate acceptable? (1-5)
  • Do you trust the AI reviewer's suggestions? (1-5)

A tool that developers hate will be disabled regardless of its objective effectiveness.

Case Study: NestJS Clean Architecture Team

A team of 12 developers working on a NestJS monorepo implementing a B2B order management system. The codebase used Clean Architecture with strict layering, but architecture violations had been creeping in over 8 months of rapid development.

The Problem

A manual architecture audit found:

  • 23 instances of services importing repositories directly (bypassing use cases)
  • 11 controllers containing business logic (conditional pricing rules, discount calculations)
  • 8 API endpoints returning raw entities instead of DTOs
  • 5 domain entities decorated with TypeORM decorators

Average PR review cycle time was 4.2 hours. Architecture violations caught in review: approximately 3 per sprint (out of an estimated 12+ that were introduced).

The Solution

Stripe Systems built the 3-pass review agent described in this post, configured with the team's specific architecture rules, ADRs, and coding standards.

Pass 1 (Static): 6 regex rules covering the most common violations — direct repository imports in controllers, ORM decorators in domain entities, console.log in production code, missing DTO patterns.

Pass 2 (Architecture LLM): GPT-4o with the team's full architecture context, including examples of correct and incorrect patterns from their codebase.

Pass 3 (Security): GPT-4o focused scan on authentication decorators, input validation, and SQL/NoSQL injection patterns.

Sample Review Output

On a PR that added a new pricing endpoint:

File: src/orders/controllers/pricing.controller.ts

+ @Get('calculate')
+ async calculatePrice(@Query() query: CalculatePriceQuery) {
+   const product = await this.productRepository.findById(query.productId);
+   let price = product.basePrice;
+   if (query.quantity > 100) {
+     price = price * 0.9; // 10% bulk discount
+   }
+   if (product.category === 'electronics') {
+     price = price * 1.18; // GST
+   }
+   return { price, product };
+ }

Review comments posted:

🔴 ARCH-001 (architecture)

Controller imports and uses productRepository directly. In Clean Architecture, controllers should delegate to use cases. Create a CalculatePriceUseCase that encapsulates this logic.

Suggestion: Move the pricing logic to src/orders/use-cases/calculate-price.use-case.ts and inject it into the controller.

🔴 ARCH-004 (architecture)

Controller contains business logic: bulk discount calculation and GST application. This logic belongs in the domain layer or a use case, not in a controller.

Suggestion: Create a PricingService in the domain layer that handles discount rules and tax calculations.

🟡 ARCH-005 (architecture)

Endpoint returns the raw product entity. Create a CalculatePriceResponseDto to control the API surface and prevent leaking internal entity structure.

Suggestion: Define a response DTO in src/orders/dtos/calculate-price-response.dto.ts.

🟡 CODE-003 (design)

Discount rules (10% for quantity > 100) and tax rates (18% GST for electronics) are hardcoded. These should be configurable or pulled from a rules table.

Suggestion: Consider a pricing rules configuration or database table.

Metrics After 3 Months

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| PR review cycle time | 4.2 hours | 1.8 hours | -57% |
| Architecture violations per sprint | ~12 introduced, ~3 caught | ~3 introduced, ~2 caught | -75% introduced |
| Architecture violations reaching main branch | ~9 per sprint | ~1 per sprint | -89% |
| False positive rate | N/A | 14% (month 1) → 6% (month 3) | Improving |
| Developer satisfaction (1-5) | N/A | 3.8 (month 1) → 4.3 (month 3) | Improving |
| Monthly review cost | $0 (human time only) | $180 (LLM API costs) | n/a |
| Human reviewer time per PR | 35 minutes | 18 minutes | -49% |

The most significant result was not the cycle time reduction — it was the drop in violations introduced per sprint from ~12 to ~3. Developers started self-correcting before submitting PRs because they knew the AI reviewer would catch violations. The review agent became a teaching tool, not just a gate.

The false positive rate started at 14% in the first month and dropped to 6% by month 3 through the feedback loop. The remaining false positives were edge cases in the shared module and test files, most of which were handled by suppression rules.

What Did Not Work

  • Reviewing auto-generated files: Prisma migrations and auto-generated type files produced nothing but noise. We added an exclusion list.
  • Reviewing large refactoring PRs: PRs with 40+ files exceeded the token budget and the model struggled with the volume of changes. We added a recommendation to split large PRs.
  • Security pass on non-web code: The security pass flagged false positives on internal utility functions that never handle user input. We scoped it to controller and middleware files only.

The overall conclusion: a custom AI code review system works when it encodes specific, well-defined architecture rules and when there is a functioning feedback loop to manage false positives. The LLM is not doing the hard part — defining and maintaining the architecture rules is. The LLM is a flexible pattern matcher that applies those rules more consistently than human reviewers who are tired, busy, or unfamiliar with the codebase.
