Generic AI code review tools are good at catching syntax errors, unused variables, and simple bugs. They are poor at catching architecture violations — the kind of issues that compound over months and make codebases progressively harder to work with. A service that directly imports a repository (bypassing the use-case layer), a controller that contains business logic, a missing DTO that leaks database schema to the API layer — these are the issues that matter most in mature codebases and that generic tools miss almost entirely.
The reason is straightforward: generic tools have no knowledge of your project's architecture. They do not know that your team uses Clean Architecture with specific layering rules. They do not know that repositories should only be accessed through use cases. They do not know your naming conventions, your dependency injection patterns, or the decisions documented in your ADRs.
This post describes how to build a custom AI code review pipeline that encodes your project's architecture rules and enforces them automatically on every pull request. We cover the technical architecture, prompt engineering, multi-pass review strategy, GitHub integration, false positive management, and the metrics that demonstrate whether it is working.
## Why Generic AI Review Falls Short
We evaluated three commercial AI code review tools on a NestJS monorepo with Clean Architecture. We created 20 test PRs containing 25 known architecture violations (five instances of each of five violation types) and measured detection rates:
| Violation Type | Tool A | Tool B | Tool C |
|---|---|---|---|
| Service imports repository directly | 1/5 | 0/5 | 0/5 |
| Controller contains business logic | 0/5 | 1/5 | 0/5 |
| Missing DTO (entity returned from controller) | 0/5 | 0/5 | 0/5 |
| Wrong dependency direction (domain → infrastructure) | 0/5 | 0/5 | 0/5 |
| Missing interface for external service | 2/5 | 1/5 | 0/5 |
| Total Detection Rate | 3/25 (12%) | 2/25 (8%) | 0/25 (0%) |
All three tools caught linting issues, potential null pointer errors, and suggested minor style improvements. None consistently caught architecture violations because they lack the project context to identify them.
## Architecture of the Review Pipeline

The pipeline has five stages:

PR Webhook → Diff Extraction → Multi-Pass Analysis → Structured Output → GitHub Comment
### PR Webhook Handler

A webhook listener receives GitHub PR events and triggers the review pipeline:

```python
import asyncio
import hashlib
import hmac
import os

from fastapi import FastAPI, Request, HTTPException

app = FastAPI()

WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"]
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]

@app.post("/webhook/pr")
async def handle_pr_webhook(request: Request):
    # Verify the webhook signature before trusting the payload
    # (default to "" so a missing header fails the comparison, not the call)
    signature = request.headers.get("X-Hub-Signature-256", "")
    body = await request.body()
    expected = "sha256=" + hmac.new(
        WEBHOOK_SECRET.encode(), body, hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise HTTPException(status_code=401, detail="Invalid signature")

    payload = await request.json()
    # Only review on PR open and synchronize (new commits)
    if payload.get("action") not in ("opened", "synchronize"):
        return {"status": "skipped"}

    pr_number = payload["pull_request"]["number"]
    repo_full_name = payload["repository"]["full_name"]

    # Run the review asynchronously; keep a reference so the task
    # is not garbage-collected before it completes
    review_task = asyncio.create_task(
        run_review_pipeline(repo_full_name, pr_number)
    )

    return {"status": "review_started", "pr": pr_number}
```
### Diff Extraction

We only review changed files — reviewing the entire codebase on every PR is wasteful and introduces noise. The GitHub API provides the diff:

```python
import httpx

async def get_pr_diff(repo: str, pr_number: int) -> list[dict]:
    """Fetch changed files with their diffs from GitHub."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://api.github.com/repos/{repo}/pulls/{pr_number}/files",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3+json",
            },
        )
        response.raise_for_status()

    files = []
    for f in response.json():
        if f["status"] in ("added", "modified") and f["filename"].endswith(
            (".ts", ".tsx", ".js")
        ):
            files.append({
                "filename": f["filename"],
                "status": f["status"],
                "patch": f.get("patch", ""),
                "additions": f["additions"],
                "deletions": f["deletions"],
            })
    return files
```
### Context Assembly

The diff alone is insufficient — the LLM needs project context to identify architecture violations. We assemble a context package that includes:

- **Architecture rules:** extracted from ADRs and coding standards documents
- **File structure:** the directory layout indicating architectural layers
- **Related files:** imports and dependencies of the changed files
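One way to gather the related-files component is to scan the added lines of the diff for import specifiers and pull in the matching source files. A minimal sketch (the function name and the decision to look only at added lines are our own assumptions, not part of the pipeline as described):

```python
import re

# Matches TypeScript import specifiers: import { X } from './path/to/module';
IMPORT_RE = re.compile(r"from\s+['\"]([^'\"]+)['\"]")

def collect_related_imports(files: list[dict]) -> set[str]:
    """Collect module specifiers imported by the added lines of a PR diff."""
    specifiers = set()
    for f in files:
        for line in f.get("patch", "").split("\n"):
            if not line.startswith("+"):
                continue  # only consider code the PR adds
            match = IMPORT_RE.search(line)
            if match:
                specifiers.add(match.group(1))
    return specifiers
```

Each specifier can then be resolved to a file on disk and its contents (or just its exported signatures) appended to the context package.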
```python
ARCHITECTURE_CONTEXT = """
## Project Architecture: Clean Architecture (NestJS)

### Layer Rules (STRICT — violations must be flagged):
1. Controllers (src/*/controllers/) → ONLY import from Use Cases and DTOs
2. Use Cases (src/*/use-cases/) → ONLY import from Domain entities and Repository interfaces
3. Domain (src/*/domain/) → MUST NOT import from any other layer
4. Repositories (src/*/repositories/) → Implement interfaces defined in Domain
5. Infrastructure (src/*/infrastructure/) → Can import from Domain interfaces only

### Dependency Direction:
Controllers → Use Cases → Domain ← Repositories ← Infrastructure
(arrows show allowed import direction)

### Naming Conventions:
- Controllers: *.controller.ts
- Use Cases: *.use-case.ts (one public method: execute())
- DTOs: *.dto.ts (request and response separate)
- Entities: *.entity.ts (no decorators from infrastructure)
- Repositories: *.repository.ts (implement interface from domain)

### Common Violations to Watch For:
- Service/Controller importing directly from *.repository.ts (must go through use-case)
- Controller returning an entity instead of a response DTO
- Business logic in controllers (if/else on business rules, calculations, validations beyond input format)
- Domain entity importing from TypeORM, Prisma, or other ORM decorators
- Use case importing from infrastructure (e.g., importing HttpService, specific DB client)
- Missing interface: a use case depending on a concrete repository instead of an interface
"""
```
This context is included in the system prompt for the architecture review pass. It is the project-specific knowledge that makes this system effective where generic tools fail.
## Multi-Pass Review Strategy
A single LLM pass trying to catch everything produces noisy results. Instead, we use three specialized passes, each with a focused objective:
### Pass 1: Lightweight Static Checks (No LLM)

Fast, deterministic checks using regex and AST parsing. These are cheap and produce virtually no false positives:
```python
import re
from typing import NamedTuple

class StaticFinding(NamedTuple):
    file: str
    line: int
    rule: str
    message: str
    severity: str

def run_static_checks(files: list[dict]) -> list[StaticFinding]:
    findings = []
    for f in files:
        filename = f["filename"]
        lines = f["patch"].split("\n")
        for i, line in enumerate(lines):
            if not line.startswith("+"):
                continue
            code = line[1:]  # strip the + prefix
            line_number = extract_line_number(f["patch"], i)

            # Rule: Controllers must not import from repositories
            if "/controllers/" in filename:
                if re.search(r"from\s+['\"].*\.repository['\"]", code):
                    findings.append(StaticFinding(
                        file=filename,
                        line=line_number,
                        rule="ARCH-001",
                        message="Controller imports directly from a repository. Use a use-case instead.",
                        severity="error",
                    ))

            # Rule: Domain entities must not import ORM decorators
            if "/domain/" in filename:
                if re.search(r"from\s+['\"](@nestjs/typeorm|typeorm|@prisma/client)['\"]", code):
                    findings.append(StaticFinding(
                        file=filename,
                        line=line_number,
                        rule="ARCH-002",
                        message="Domain entity imports from ORM library. Domain must be infrastructure-agnostic.",
                        severity="error",
                    ))

            # Rule: console.log in production code
            if "/src/" in filename and ".spec." not in filename:
                if re.search(r"console\.(log|debug|info)\(", code):
                    findings.append(StaticFinding(
                        file=filename,
                        line=line_number,
                        rule="CODE-001",
                        message="console.log in production code. Use the Logger service instead.",
                        severity="warning",
                    ))
    return findings
```
Pass 1 runs in milliseconds and catches the most obvious violations. It serves as a fast filter — if Pass 1 catches an issue, there is no need to spend LLM tokens on it.
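The `extract_line_number` helper referenced above maps an index within the patch text to a line number in the new version of the file by walking the unified-diff hunk headers. A sketch of one way to implement it (this implementation is our own; the post does not show it):

```python
import re

# Unified-diff hunk header: @@ -old_start,old_count +new_start,new_count @@
HUNK_RE = re.compile(r"@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

def extract_line_number(patch: str, target_index: int) -> int:
    """Map an index into the patch's lines to a new-file line number."""
    new_line = 0
    for i, line in enumerate(patch.split("\n")):
        header = HUNK_RE.match(line)
        if header:
            # The next added/context line is the hunk's new_start
            new_line = int(header.group(1)) - 1
            continue
        if line.startswith("-"):
            continue  # deletions do not advance the new-file counter
        new_line += 1
        if i == target_index:
            return new_line
    return new_line
```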
### Pass 2: Architecture and Logic Review (LLM)
The main review pass. The LLM analyzes the diff with full architecture context:
````python
REVIEW_PROMPT = """You are a senior software architect reviewing a pull request
for a NestJS application that follows Clean Architecture.

{architecture_context}

## Files Changed in This PR:
{file_list}

## Diffs:
{diffs}

## Your Task:
Review the code changes for architecture violations, logic errors, and design issues.
Focus ONLY on issues that matter — do not comment on formatting, naming style preferences,
or minor issues that a linter would catch.

For each issue found, respond with a JSON object containing an "issues" array:
```json
{{
  "issues": [
    {{
      "file": "src/orders/controllers/order.controller.ts",
      "line": 42,
      "severity": "error|warning|info",
      "category": "architecture|logic|security|design",
      "rule": "ARCH-XXX or descriptive label",
      "message": "Clear description of the issue",
      "suggestion": "How to fix it (be specific, reference the correct layer/pattern)"
    }}
  ]
}}
```

Rules:
- Only flag issues you are confident about. If you are unsure, do not include it.
- "error" severity: architecture violations, bugs, security issues
- "warning" severity: design concerns, potential issues, missing abstractions
- "info" severity: suggestions for improvement (use sparingly)
- If there are no issues, return {{"issues": []}}
"""
````
````python
import json

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def run_architecture_review(files: list[dict]) -> list[dict]:
    file_list = "\n".join(f"- {f['filename']} ({f['status']})" for f in files)
    diffs = "\n\n".join(
        f"### {f['filename']}\n```diff\n{f['patch']}\n```" for f in files
    )
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": REVIEW_PROMPT.format(
                    architecture_context=ARCHITECTURE_CONTEXT,
                    file_list=file_list,
                    diffs=diffs,
                ),
            },
        ],
        response_format={"type": "json_object"},
        temperature=0.1,  # low temperature for consistent reviews
        max_tokens=2000,
    )
    data = json.loads(response.choices[0].message.content)
    # json_object mode guarantees an object; the findings live under the
    # "issues" key requested in the prompt
    return data.get("issues", []) if isinstance(data, dict) else data
````
### Pass 3: Security-Focused Scan (LLM)
A separate pass focused exclusively on security concerns. Using a different prompt for security review avoids the "dilution effect" where the model tries to catch everything and catches nothing well:
```python
SECURITY_PROMPT = """You are a security engineer reviewing code changes.
Focus exclusively on security vulnerabilities:
- SQL injection, NoSQL injection
- XSS (reflected, stored, DOM-based)
- Authentication/authorization bypasses
- Insecure direct object references
- Missing input validation on user-facing endpoints
- Hardcoded secrets, API keys, credentials
- Path traversal
- Insecure deserialization

Do NOT flag:
- Code style, architecture, or design issues
- Performance concerns
- Missing error handling (unless it leads to information disclosure)

Respond with a JSON array. Use severity "error" for confirmed vulnerabilities,
"warning" for potential issues that need investigation.

{diffs}
"""
```
Separating security into its own pass improves detection rates. In our evaluation, a combined prompt caught 60% of injected security issues, while the dedicated security pass caught 85%.
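The security pass is invoked the same way as the architecture pass, with `SECURITY_PROMPT` as the system message. One practical detail: this prompt asks for a JSON array, while `response_format={"type": "json_object"}` forces the model to emit an object, so some responses arrive wrapped under a key. A small normalizer keeps the pipeline robust to both shapes (a sketch; which wrapper keys appear in practice is an assumption):

```python
import json

def parse_findings(raw: str) -> list[dict]:
    """Normalize an LLM review response to a flat list of findings.

    Accepts either a bare JSON array or an object that wraps the array
    under some key ("issues", "findings", ...).
    """
    data = json.loads(raw)
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        for value in data.values():
            if isinstance(value, list):
                return value  # first list-valued field wins
    return []
```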
## Structured Output and GitHub Integration

### Merging Results from All Passes
```python
async def run_review_pipeline(repo: str, pr_number: int):
    # Fetch changed files
    files = await get_pr_diff(repo, pr_number)
    if not files:
        return  # no reviewable files changed

    # Pass 1: fast static checks
    static_findings = run_static_checks(files)

    # Passes 2 and 3: run the LLM reviews concurrently
    arch_findings, security_findings = await asyncio.gather(
        run_architecture_review(files),
        run_security_review(files),
    )

    # Merge and deduplicate
    all_findings = merge_findings(static_findings, arch_findings, security_findings)

    # Filter out known false positives
    filtered = apply_false_positive_filter(all_findings)

    # Post to GitHub
    if filtered:
        await post_review_comments(repo, pr_number, filtered)
    else:
        await post_approval(repo, pr_number)
```
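The `merge_findings` helper is not shown in the pipeline above; a minimal sketch that normalizes the Pass 1 NamedTuples to dicts and deduplicates on (file, line, rule). The dedup key is our assumption:

```python
def merge_findings(*finding_groups) -> list[dict]:
    """Merge findings from all review passes, dropping exact duplicates."""
    merged = []
    seen = set()
    for group in finding_groups:
        for finding in group:
            if hasattr(finding, "_asdict"):
                finding = finding._asdict()  # StaticFinding NamedTuple -> dict
            key = (finding.get("file"), finding.get("line"), finding.get("rule"))
            if key not in seen:
                seen.add(key)
                merged.append(finding)
    return merged
```

Static findings come first so that, on a collision, the deterministic Pass 1 result wins over the LLM's version of the same issue.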
### Posting Review Comments

```python
async def post_review_comments(repo: str, pr_number: int, findings: list[dict]):
    # Get the latest commit SHA for the PR
    pr_info = await get_pr_info(repo, pr_number)
    commit_sha = pr_info["head"]["sha"]

    comments = []
    for finding in findings:
        severity_emoji = {
            "error": "🔴",
            "warning": "🟡",
            "info": "💡",
        }.get(finding["severity"], "")
        body = (
            f"{severity_emoji} **{finding['rule']}** ({finding['category']})\n\n"
            f"{finding['message']}\n\n"
        )
        if finding.get("suggestion"):
            body += f"**Suggestion:** {finding['suggestion']}\n\n"
        # FEEDBACK_URL points at the false-positive reporting endpoint
        body += (
            f"<sub>Severity: {finding['severity']} | "
            f"[Not a real issue? Click here to report false positive]"
            f"({FEEDBACK_URL}?finding={finding['rule']}&file={finding['file']})"
            f"</sub>"
        )
        comments.append({
            "path": finding["file"],
            "line": finding["line"],
            "body": body,
        })

    # Post as a PR review
    async with httpx.AsyncClient() as client:
        await client.post(
            f"https://api.github.com/repos/{repo}/pulls/{pr_number}/reviews",
            headers={
                "Authorization": f"Bearer {GITHUB_TOKEN}",
                "Accept": "application/vnd.github.v3+json",
            },
            json={
                "commit_id": commit_sha,
                "body": f"AI Architecture Review: {len(findings)} issues found",
                "event": "COMMENT",
                "comments": comments,
            },
        )
```
## False Positive Management
False positives erode developer trust faster than anything else. A review tool that flags non-issues gets ignored, then disabled. Managing false positives is not optional — it is a core system requirement.
### Feedback Loop
Every review comment includes a "not a real issue" link. When a developer clicks it, the finding is logged:
```python
import hashlib
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FalsePositiveReport:
    rule: str
    file: str
    finding_message: str
    reporter: str
    timestamp: datetime
    context: str  # the code that was flagged

# Store in a database
def record_false_positive(report: FalsePositiveReport):
    db.insert("false_positives", {
        "rule": report.rule,
        "file_pattern": extract_layer(report.file),  # e.g., "controllers"
        # sha256, not hash(): Python's built-in hash is salted per process
        "message_hash": hashlib.sha256(report.finding_message.encode()).hexdigest(),
        "reporter": report.reporter,
        "timestamp": report.timestamp.isoformat(),
        "context": report.context,
    })
```
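A simple aggregation over the stored reports surfaces which rules generate the most noise. A sketch, assuming the `rule` values have been read back out of the false_positives table:

```python
from collections import Counter

def false_positive_hotspots(rules: list[str], top_n: int = 5) -> list[tuple[str, int]]:
    """Rank rules by how often they were reported as false positives."""
    return Counter(rules).most_common(top_n)
```

The top entries of this ranking are the natural agenda for the weekly prompt review.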
### Prompt Improvement Cycle

Weekly, review the false positive reports:

- **Pattern analysis:** Are most false positives from one rule? That rule's prompt needs refinement.
- **Context gaps:** Did the LLM flag something because it lacked context? Add that context to the architecture description.
- **Ambiguous rules:** Is the rule legitimately ambiguous? Add examples to the prompt showing what is and is not a violation.
```python
# Example: refining the prompt based on false positives
#
# Original rule: "Controllers must not contain business logic"
# False positives: input validation in controllers flagged as "business logic"
#
# Refined rule:
REFINED_RULES = """
Controllers must not contain business logic.
Business logic includes: conditional branching on domain rules, calculations,
state transitions, applying business policies.
NOT business logic (do not flag): input format validation (checking required fields,
type coercion, format validation), authentication/authorization checks
(guard decorators), response mapping (entity → DTO), pagination parameter handling.
"""
```
### Suppression Mechanism

For known false positive patterns that are hard to fix in the prompt, add explicit suppressions:

```python
SUPPRESSED_PATTERNS = [
    {
        "rule": "ARCH-001",
        "file_pattern": r".*\.spec\.ts$",
        "reason": "Test files may import from any layer",
    },
    {
        "rule": "ARCH-003",
        "file_pattern": r"src/shared/.*",
        "reason": "Shared module is exempt from strict layering",
    },
]

def apply_false_positive_filter(findings: list[dict]) -> list[dict]:
    filtered = []
    for finding in findings:
        suppressed = any(
            finding["rule"] == pattern["rule"]
            and re.match(pattern["file_pattern"], finding["file"])
            for pattern in SUPPRESSED_PATTERNS
        )
        if not suppressed:
            filtered.append(finding)
    return filtered
```
## Cost Management
AI code review runs on every PR. At 15-20 PRs/day with an average of 5 changed files per PR, costs can add up.
### Diff-Only Context
Only send the diff, not entire files. This typically reduces input tokens by 70-80%.
### Caching Repeated Patterns
If a developer pushes multiple commits to the same PR, only review the new changes. Cache the review results for files that have not changed since the last review.
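A sketch of that cache, keyed on the file path plus a hash of its patch so that a re-push only re-reviews files whose diff actually changed. The in-memory dict and the function names are our own; a real deployment would back this with Redis or a database:

```python
import hashlib

_review_cache: dict[str, list] = {}

def cache_key(filename: str, patch: str) -> str:
    digest = hashlib.sha256(patch.encode()).hexdigest()[:16]
    return f"{filename}:{digest}"

def files_needing_review(files: list[dict]) -> list[dict]:
    """Return only the files whose current diff has no cached review."""
    return [
        f for f in files
        if cache_key(f["filename"], f.get("patch", "")) not in _review_cache
    ]

def store_review_results(files: list[dict], findings: list[dict]) -> None:
    """Cache each file's findings under its current diff hash."""
    for f in files:
        key = cache_key(f["filename"], f.get("patch", ""))
        _review_cache[key] = [x for x in findings if x.get("file") == f["filename"]]
```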
### Token Budget per PR
Set a maximum token budget per review. If a PR changes 50 files, review the most important ones (based on file path — controllers, use cases, domain files first) and skip configuration files, test utilities, and auto-generated code.
```python
MAX_TOKENS_PER_REVIEW = 8000  # input tokens

def prioritize_files(files: list[dict]) -> list[dict]:
    priority_order = [
        r".*/controllers/.*",
        r".*/use-cases/.*",
        r".*/domain/.*",
        r".*/repositories/.*",
        r".*/services/.*",
    ]

    def file_priority(f):
        for i, pattern in enumerate(priority_order):
            if re.match(pattern, f["filename"]):
                return i
        return len(priority_order)

    sorted_files = sorted(files, key=file_priority)

    # Include files until we hit the token budget
    selected = []
    token_count = 0
    for f in sorted_files:
        file_tokens = estimate_tokens(f["patch"])
        if token_count + file_tokens > MAX_TOKENS_PER_REVIEW:
            break
        selected.append(f)
        token_count += file_tokens
    return selected
```
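The `estimate_tokens` helper above can be as simple as a character-count heuristic; roughly four characters per token is a common rule of thumb for English text and code. This sketch trades accuracy for zero dependencies (swap in a real tokenizer such as tiktoken if the budget must be exact):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English/code."""
    return max(1, len(text) // 4)
```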
## Measuring Effectiveness
Without metrics, you cannot know if the review agent is helping or just adding noise. Track these:
### Defects Caught Pre-Merge
The primary metric. Count the number of review comments that resulted in code changes (the developer agreed and fixed the issue). Exclude false positives and ignored comments.
### Review Cycle Time
Time from PR creation to first review comment. The AI reviewer should respond within minutes, not hours. Compare with the team's average human review time.
### Architecture Violations Per Sprint
Track the number of architecture violations that reach the main branch (caught by manual audits or integration tests). This should decrease over time as developers learn from the AI reviews.
### Developer Satisfaction
Survey developers quarterly. Ask:
- Does the AI reviewer catch useful issues? (1-5)
- Is the false positive rate acceptable? (1-5)
- Do you trust the AI reviewer's suggestions? (1-5)
A tool that developers hate will be disabled regardless of its objective effectiveness.
## Case Study: NestJS Clean Architecture Team
A team of 12 developers working on a NestJS monorepo implementing a B2B order management system. The codebase used Clean Architecture with strict layering, but architecture violations had been creeping in over 8 months of rapid development.
### The Problem
A manual architecture audit found:
- 23 instances of services importing repositories directly (bypassing use cases)
- 11 controllers containing business logic (conditional pricing rules, discount calculations)
- 8 API endpoints returning raw entities instead of DTOs
- 5 domain entities decorated with TypeORM decorators
Average PR review cycle time was 4.2 hours. Architecture violations caught in review: approximately 3 per sprint (out of an estimated 12+ that were introduced).
### The Solution
Stripe Systems built the 3-pass review agent described in this post, configured with the team's specific architecture rules, ADRs, and coding standards.
**Pass 1 (Static):** 6 regex rules covering the most common violations — direct repository imports in controllers, ORM decorators in domain entities, console.log in production code, missing DTO patterns.

**Pass 2 (Architecture LLM):** GPT-4o with the team's full architecture context, including examples of correct and incorrect patterns from their codebase.

**Pass 3 (Security):** a focused GPT-4o scan covering authentication decorators, input validation, and SQL/NoSQL injection patterns.
### Sample Review Output
On a PR that added a new pricing endpoint:
File: `src/orders/controllers/pricing.controller.ts`

```diff
+ @Get('calculate')
+ async calculatePrice(@Query() query: CalculatePriceQuery) {
+   const product = await this.productRepository.findById(query.productId);
+   let price = product.basePrice;
+   if (query.quantity > 100) {
+     price = price * 0.9; // 10% bulk discount
+   }
+   if (product.category === 'electronics') {
+     price = price * 1.18; // GST
+   }
+   return { price, product };
+ }
```
Review comments posted:
🔴 **ARCH-001** (architecture)

Controller imports and uses `productRepository` directly. In Clean Architecture, controllers should delegate to use cases. Create a `CalculatePriceUseCase` that encapsulates this logic.

**Suggestion:** Move the pricing logic to `src/orders/use-cases/calculate-price.use-case.ts` and inject it into the controller.

🔴 **ARCH-004** (architecture)

Controller contains business logic: bulk discount calculation and GST application. This logic belongs in the domain layer or a use case, not in a controller.

**Suggestion:** Create a `PricingService` in the domain layer that handles discount rules and tax calculations.

🟡 **ARCH-005** (architecture)

Endpoint returns the raw `product` entity. Create a `CalculatePriceResponseDto` to control the API surface and prevent leaking internal entity structure.

**Suggestion:** Define a response DTO in `src/orders/dtos/calculate-price-response.dto.ts`.

🟡 **CODE-003** (design)

Discount rules (10% for quantity > 100) and tax rates (18% GST for electronics) are hardcoded. These should be configurable or pulled from a rules table.

**Suggestion:** Consider a pricing rules configuration or database table.
### Metrics After 3 Months
| Metric | Before | After | Change |
|---|---|---|---|
| PR review cycle time | 4.2 hours | 1.8 hours | -57% |
| Architecture violations per sprint | ~12 introduced, ~3 caught | ~3 introduced, ~2 caught | -75% introduced |
| Architecture violations reaching main branch | ~9 per sprint | ~1 per sprint | -89% |
| False positive rate | N/A | 14% (month 1) → 6% (month 3) | Improving |
| Developer satisfaction (1-5) | N/A | 3.8 (month 1) → 4.3 (month 3) | Improving |
| Monthly review cost | $0 (human time only) | $180 (LLM API costs) | — |
| Human reviewer time per PR | 35 minutes | 18 minutes | -49% |
The most significant result was not the cycle time reduction — it was the drop in violations introduced per sprint from ~12 to ~3. Developers started self-correcting before submitting PRs because they knew the AI reviewer would catch violations. The review agent became a teaching tool, not just a gate.
The false positive rate started at 14% in the first month and dropped to 6% by month 3 through the feedback loop. The remaining false positives were edge cases in the shared module and test files, most of which were handled by suppression rules.
### What Did Not Work
- **Reviewing auto-generated files:** Prisma migrations and auto-generated type files produced nothing but noise. We added an exclusion list.
- **Reviewing large refactoring PRs:** PRs with 40+ files exceeded the token budget and the model struggled with the volume of changes. We added a recommendation to split large PRs.
- **Security pass on non-web code:** The security pass flagged false positives on internal utility functions that never handle user input. We scoped it to controller and middleware files only.
The overall conclusion: a custom AI code review system works when it encodes specific, well-defined architecture rules and when there is a functioning feedback loop to manage false positives. The LLM is not doing the hard part — defining and maintaining the architecture rules is. The LLM is a flexible pattern matcher that applies those rules more consistently than human reviewers who are tired, busy, or unfamiliar with the codebase.