Engineering Culture · March 25, 2026 · 20 min read

Prompt Engineering for Software Teams: The Internal Playbook We Built to Maximize Developer Output with LLMs

โœ๏ธ
Stripe Systems Engineering

Every developer on your team uses LLMs differently. One engineer writes "make me a login page" and gets generic boilerplate. Another writes a structured prompt with framework constraints, authentication requirements, and test expectations, and gets usable code in one shot. The difference is not the model. It is the prompt.

At Stripe Systems, we spent three months formalizing our prompt engineering practices into an internal playbook. Not because prompt engineering is rocket science, but because consistency matters. When every engineer uses their own prompting style, you get inconsistent output quality, wasted time on prompt iteration, and no shared learning about what works and what does not.

This post shares the core of that playbook: six prompt patterns we use daily, common anti-patterns to avoid, guidance on model and temperature selection, and our approach to prompt versioning. We also include five real prompt templates from our internal library, anonymized but functionally identical to what our team uses.

Why Software Teams Need a Prompt Playbook

Three reasons:

Consistency. A shared set of prompt templates means every engineer gets similar-quality output for common tasks. The senior engineer's test generation prompt should produce the same quality results when a junior engineer uses it.

Knowledge sharing. When one engineer discovers that adding "rank hypotheses by probability" to a debugging prompt dramatically improves output quality, that discovery should propagate to the team. Without a shared playbook, it stays in one person's head.

Onboarding. New team members are effective with AI tools from day one if they have a library of proven prompts. Without it, they spend weeks discovering through trial and error what the team already knows.

Prompt Anatomy

Before the patterns, let us establish vocabulary. Every effective prompt has these components:

┌──────────────────────────────────────────────────┐
│ SYSTEM PROMPT (role + behavioral constraints)    │
├──────────────────────────────────────────────────┤
│ CONTEXT INJECTION (code, docs, architecture)     │
├──────────────────────────────────────────────────┤
│ TASK SPECIFICATION (what to do, clearly)         │
├──────────────────────────────────────────────────┤
│ OUTPUT FORMAT (structure, length, format)        │
├──────────────────────────────────────────────────┤
│ CONSTRAINTS (what to avoid, boundaries)          │
└──────────────────────────────────────────────────┘

System prompt sets the persona and behavioral rules. "You are a senior TypeScript engineer who follows NestJS conventions. Do not suggest deprecated APIs."

Context injection provides the specific code, documentation, or architecture information the model needs. This is where most prompts fail: insufficient context produces generic output.

Task specification is the actual instruction. "Review this code for potential null pointer exceptions" is a task specification. It should be specific, actionable, and unambiguous.

Output format tells the model how to structure its response. "Return a JSON array of objects with fields: line_number, issue, severity, suggestion" eliminates the need to parse prose.

Constraints define what the model should not do. "Do not suggest rewriting the function. Only identify issues in the existing code." Constraints prevent the model from going off on tangents.

Not every prompt needs all five components. Quick questions need only a task specification. Complex code generation needs all five.
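
Treated as data, the anatomy is just five strings concatenated in a fixed order, with empty sections dropped. A minimal sketch in TypeScript (the field names and `buildPrompt` helper are our own illustration, not part of any SDK):

```typescript
// Hypothetical prompt-assembly helper: the five anatomy components,
// joined in the order shown in the diagram above. Empty sections are
// skipped, so a quick question can be just a task specification.
interface PromptParts {
  system?: string;       // role + behavioral constraints
  context?: string;      // code, docs, architecture
  task: string;          // the actual instruction (always required)
  outputFormat?: string; // structure, length, format
  constraints?: string;  // what to avoid, boundaries
}

function buildPrompt(p: PromptParts): string {
  const sections: Array<[string, string | undefined]> = [
    ["System", p.system],
    ["Context", p.context],
    ["Task", p.task],
    ["Output Format", p.outputFormat],
    ["Constraints", p.constraints],
  ];
  return sections
    .filter(([, body]) => body !== undefined && body.trim() !== "")
    .map(([label, body]) => `[${label}]\n${body!.trim()}`)
    .join("\n\n");
}

// A quick question needs only the task:
const quick = buildPrompt({ task: "What does Promise.allSettled return?" });

// Complex code generation uses all five components:
const full = buildPrompt({
  system: "You are a senior TypeScript engineer. Do not suggest deprecated APIs.",
  context: "interface User { id: string; email: string }",
  task: "Generate a validation function for the User interface.",
  outputFormat: "Single file, complete and runnable, all imports included.",
  constraints: "Use the project's existing validation library, not hand-rolled checks.",
});
```

The same ordering every time also makes prompts diffable, which matters once they live in version control.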

Pattern 1: Code Generation Prompts

Code generation is the most common use of LLMs for developers, and it is where prompt quality matters most. A vague prompt produces code that compiles but does not fit your architecture. A structured prompt produces code that slots into your codebase with minimal modification.

Template Structure

[System] You are a senior {language} developer working in a {framework}
codebase. Follow these project conventions:
- {convention 1}
- {convention 2}
- {convention 3}

[Context]
Existing code that this new code must integrate with:

{paste relevant existing code: interfaces, types, related services}


Database schema (if relevant):

{paste relevant table definitions}


[Task] Generate a {component type} that {does what}.

Requirements:
1. {specific requirement}
2. {specific requirement}
3. {specific requirement}

[Output Format]
- Single file, complete and runnable
- Include all imports
- Include JSDoc comments for public methods
- Do not include test code in this file

[Constraints]
- Use {specific library} for {specific purpose}, not {alternative}
- Error handling must use {project's error handling pattern}
- Do not use any deprecated APIs

Why This Works

The model receives enough context to generate code that fits your specific codebase, not generic code for a generic project. The conventions section prevents style violations. The constraints prevent the model from making common substitutions (e.g., using Axios when your project uses the built-in fetch wrapper).

Common Failure Mode

Omitting the existing code context. If you ask for "a user service in NestJS" without showing the model your existing service patterns, base classes, and error handling approach, you get a service that works in isolation but does not match your codebase.

Pattern 2: Debugging Prompts

Debugging prompts are the second most common use case, and they benefit enormously from structure. An unstructured "why doesn't this work?" produces generic suggestions. A structured debugging prompt produces ranked hypotheses with specific verification steps.

Template Structure

[System] You are debugging a {language}/{framework} application.
Provide hypotheses ranked by probability. For each hypothesis,
include a specific test to confirm or rule it out.

[Context]
Error/Stack trace:

{paste full error output}


Relevant code:

{paste the function where the error occurs and its dependencies}


Environment:
- Runtime: {Node 20 / Python 3.11 / etc.}
- Framework: {NestJS 10 / Django 4.2 / etc.}
- Database: {PostgreSQL 15 / MongoDB 7 / etc.}
- OS: {Linux / macOS} in {Docker / native}

Expected behavior: {what should happen}
Actual behavior: {what actually happens}

What I've already checked:
- {hypothesis 1: eliminated because...}
- {hypothesis 2: eliminated because...}

[Task] Provide your top 5 hypotheses ranked by probability.
For each, explain why it could cause this specific error
and give me a concrete command or code change to verify it.

[Constraints]
- Do not suggest solutions I've already eliminated.
- Focus on this specific error, not general best practices.

Why This Works

The "what I've already checked" section is critical. Without it, the model's first three suggestions are usually things you already tried. Including eliminated hypotheses forces the model to go deeper.

The "specific test to confirm" requirement prevents vague suggestions like "check your database connection." Instead, you get: "Run SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction' to check for connection leaks. If the count exceeds your pool size (default 10), you have a connection leak."

Pattern 3: Architecture Review Prompts

Architecture review prompts are used during design phases and when evaluating significant changes to existing systems. The key is providing enough context about constraints and quality attributes.

Template Structure

[System] You are a senior software architect reviewing a design
document. Provide specific, actionable feedback. Do not restate
what the document already says. Focus on risks, gaps, and
unconsidered scenarios.

[Context]
Design document:

{paste full design doc or relevant sections}


System constraints:
- Team size: {N engineers}
- Deployment: {Kubernetes / VM / serverless}
- Scale: {requests/sec, data volume, user count}
- Compliance: {GDPR / PCI-DSS / HIPAA / none}
- Existing infrastructure: {list key existing systems}

Quality attributes prioritized (in order):
1. {e.g., Reliability: 99.9% uptime required}
2. {e.g., Maintainability: team rotates every 6 months}
3. {e.g., Performance: p99 < 200ms for API calls}

[Task] Review this design against the stated quality attributes.
For each concern you identify:
1. State the concern clearly
2. Explain the specific failure scenario
3. Rate severity (critical/major/minor)
4. Suggest a mitigation with tradeoffs

[Constraints]
- Do not suggest alternative architectures unless the current
  design has a critical flaw.
- Focus on the gaps in this design, not general architecture advice.
- Limit to 10 most important concerns.

Why This Works

The quality attributes provide evaluation criteria. Without them, the model produces generic architecture advice ("consider using a message queue"). With them, the model evaluates the design against specific requirements ("your design has a synchronous dependency between the order service and inventory service, which means an inventory service outage will cause order placement failures โ€” this conflicts with your 99.9% uptime requirement").

Pattern 4: Test Generation Prompts

Test generation prompts produce the highest ROI of any prompt pattern because test code is highly structured and repetitive, making it an ideal target for AI generation.

Template Structure

[System] You are a senior QA engineer writing tests in {test framework}
for a {language}/{framework} application. Write thorough, readable tests.

[Context]
Function/class under test:

{paste the full function/class with type definitions}


Dependencies (mocked in tests):

{paste interfaces/types of dependencies}


Existing test patterns in this project:

{paste one example test from the project to establish style}


[Task] Generate test cases covering:
1. Happy path (standard input → expected output)
2. Edge cases: {list specific edge cases to cover}
3. Error cases: {list specific error scenarios}
4. Boundary values: {list boundaries}

[Output Format]
- Use {Jest / Mocha / pytest / etc.} syntax
- Use {project's mocking library} for mocks
- Group tests with describe/it blocks matching project convention
- Include setup and teardown where needed
- One assertion per test where practical

[Constraints]
- Do not test private methods directly
- Do not test implementation details (e.g., "it should call
  repository.save once"); test behavior
- Use realistic test data, not "test123" or "foo/bar"

Why This Works

The existing test pattern example is critical. It teaches the model your project's testing conventions: how you structure describe blocks, how you name tests, how you set up mocks, and what assertion style you use. Without this, the model generates tests in its own style, and you spend time reformatting.

The constraint about not testing implementation details prevents a common AI failure mode: generating tests that assert on internal method calls rather than observable behavior. These tests are brittle and break on refactoring.

Pattern 5: Documentation Prompts

Documentation prompts convert code into readable documentation. They are straightforward but benefit from specifying the audience and the documentation standard.

Template Structure

[System] You are a technical writer creating documentation for
a {language} API. Write for an audience of: {junior developers /
senior developers / external API consumers}.

[Context]
Code to document:

{paste controller/service/module code}


Existing documentation style (follow this):

{paste one example from existing docs}


[Task] Generate documentation covering:
1. Overview: what this module/API does, in 2-3 sentences
2. For each endpoint/method:
   - Purpose
   - Parameters with types and validation rules
   - Return value with type
   - Error cases with HTTP status codes
   - Example request and response
3. Authentication requirements
4. Rate limiting (if applicable)

[Output Format]
- Markdown
- Use code blocks for examples
- Include curl examples for API endpoints

[Constraints]
- Do not describe how the code works internally;
  describe what it does from the consumer's perspective.
- Do not invent features that are not in the code.
- If behavior is ambiguous from the code, flag it as
  "[NEEDS CLARIFICATION]" rather than guessing.

Why This Works

The "[NEEDS CLARIFICATION]" constraint is important. Without it, the model confidently documents behavior it is guessing about. With this constraint, ambiguous areas are flagged for human attention rather than documented incorrectly.

Pattern 6: Code Review Prompts

Code review prompts are used in automated review pipelines and by individual engineers seeking pre-review feedback.

Template Structure

[System] You are a senior code reviewer. Your review should be:
- Specific (reference line numbers and variable names)
- Actionable (say what to change, not just what's wrong)
- Prioritized (critical issues first)
- Respectful (critique the code, not the author)

[Context]
Diff to review:

{paste git diff}


Project coding standards:
- {standard 1: e.g., "All database access goes through repository classes"}
- {standard 2: e.g., "Services must not import from controllers"}
- {standard 3: e.g., "All public methods must have JSDoc comments"}
- {standard 4: e.g., "Error responses must use the ApiError class"}

Architecture rules:
- {rule 1: e.g., "No direct HTTP calls from service layer โ€” use adapters"}
- {rule 2: e.g., "Feature modules must not depend on other feature modules"}

[Task] Review this diff for:
1. Bugs (null checks, error handling, logic errors)
2. Coding standard violations
3. Architecture rule violations
4. Test coverage gaps (missing tests for new/changed code)
5. Security concerns

[Output Format]
For each issue:
- File and line reference
- Category (bug/standard/architecture/test/security)
- Severity (critical/major/minor)
- Description
- Suggested fix

[Constraints]
- Do not comment on style/formatting (linter handles this)
- Do not suggest refactoring unrelated to the diff
- Limit to 15 most important issues
- If no issues found, say so; do not manufacture concerns

Why This Works

The coding standards and architecture rules section is what makes this prompt production-useful rather than generic. By encoding your project's specific rules, the AI review catches violations that a generic review would miss.

The "do not manufacture concerns" constraint prevents the model from generating a review comment on every PR. Some PRs are clean, and the review should say so.
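
Because the output format pins down categories and severity levels, review output can be consumed programmatically. A sketch of the shape this prompt asks for, with a sorter that surfaces critical findings first (the type names and `prioritize` helper are illustrative, not something we ship):

```typescript
// Illustrative types mirroring the [Output Format] section of the
// review prompt: one record per issue, sortable by severity.
type Category = "bug" | "standard" | "architecture" | "test" | "security";
type Severity = "critical" | "major" | "minor";

interface ReviewIssue {
  file: string;
  line: number;
  category: Category;
  severity: Severity;
  description: string;
  suggestedFix: string;
}

// Lower rank sorts first, so critical issues lead the review.
const severityRank: Record<Severity, number> = { critical: 0, major: 1, minor: 2 };

function prioritize(issues: ReviewIssue[]): ReviewIssue[] {
  return [...issues].sort(
    (a, b) => severityRank[a.severity] - severityRank[b.severity],
  );
}

const sorted = prioritize([
  { file: "order.service.ts", line: 42, category: "standard", severity: "minor",
    description: "Missing JSDoc on public method", suggestedFix: "Add JSDoc" },
  { file: "order.service.ts", line: 18, category: "bug", severity: "critical",
    description: "Repository result not null-checked", suggestedFix: "Handle the null case" },
]);
```

Asking the model for this structure directly (rather than prose) is what makes the template usable in an automated pipeline as well as by individual engineers.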

Anti-Patterns

Common prompt mistakes we have observed and corrected:

Vague prompts. "Write me a service" → generates generic code. "Write a NestJS service for order processing that validates stock availability before creating an order, uses the existing InventoryRepository interface, and throws an InsufficientStockError if any line item is unavailable" → generates usable code.

Missing context. Asking the model to review code without providing the interfaces it implements, the types it uses, or the patterns the project follows. The model fills in the gaps with assumptions, and those assumptions are usually wrong.

Asking for too much at once. "Write a complete user management module with registration, login, password reset, profile management, role management, and an admin dashboard" → produces shallow implementations of everything. Break it into focused requests: one for registration, one for login, etc.

Not specifying output format. Without format constraints, the model wraps code in explanatory prose, uses different indentation, or generates code with inline comments explaining every line. Specify: "Output only the code, no explanation. Use 2-space indentation. Minimal comments."

Ignoring the system prompt. Many developers skip the system prompt entirely. The system prompt establishes the persona, conventions, and constraints that apply to all subsequent interactions in the conversation. It is the highest-leverage part of the prompt.

Temperature and Model Selection

Temperature controls randomness in the model's output. For software engineering tasks, the right temperature depends on the task type:

Temperature 0.0 - 0.3 (deterministic): Use for code generation, test writing, bug fixing, and any task where there is one correct answer. Low temperature produces consistent, predictable output. You want the same prompt to produce the same code every time.

Temperature 0.4 - 0.6 (balanced): Use for documentation writing, code review, and refactoring suggestions. These tasks benefit from some variation โ€” the model might phrase a documentation section differently or suggest an alternative refactoring approach.

Temperature 0.7 - 1.0 (creative): Use for brainstorming, architecture exploration, and naming. "Suggest 10 names for this microservice" benefits from high temperature. "Write a database migration" does not.
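
The bands above can be encoded as a small lookup so tooling applies them consistently. The task names and values come from this section; the helper itself is our own sketch, not a model-provider recommendation:

```typescript
// Default temperature per task type, following the bands above.
// These are team conventions baked into tooling, so individual
// engineers do not have to remember the numbers.
type TaskType =
  | "code-generation" | "test-writing" | "bug-fixing"        // deterministic
  | "documentation" | "code-review" | "refactoring"          // balanced
  | "brainstorming" | "architecture-exploration" | "naming"; // creative

const defaults: Record<TaskType, number> = {
  "code-generation": 0.2,  // one correct answer: keep output predictable
  "test-writing": 0.2,
  "bug-fixing": 0.2,
  "documentation": 0.5,    // some variation in phrasing is useful
  "code-review": 0.5,
  "refactoring": 0.5,
  "brainstorming": 0.8,    // diversity of suggestions is the point
  "architecture-exploration": 0.8,
  "naming": 0.8,
};

function defaultTemperature(task: TaskType): number {
  return defaults[task];
}
```

Using a typed map means adding a new task type without a temperature is a compile error, not a silently wrong default.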

Model selection: For code generation and review, use the most capable model available (GPT-4-class or Claude Sonnet/Opus). For quick lookups, syntax questions, and simple transformations, smaller/faster models (GPT-4o-mini, Claude Haiku) are sufficient and cheaper. The cost difference is significant at scale: a team of 8 engineers making 50+ API calls per day per person will notice the bill.

Prompt Versioning

Effective prompts are intellectual property. They encode team knowledge about what works for your specific tech stack, coding standards, and project patterns. Losing them is expensive.

Our approach to prompt versioning:

Shared repository. We maintain a directory in our internal knowledge base with prompt templates organized by use case:

prompts/
  code-generation/
    nestjs-service.md
    flutter-widget.md
    react-component.md
  testing/
    unit-test-nestjs.md
    widget-test-flutter.md
    e2e-test-playwright.md
  review/
    pr-review-backend.md
    pr-review-frontend.md
    pr-review-mobile.md
  debugging/
    backend-debug.md
    frontend-debug.md
  documentation/
    api-docs.md
    readme-section.md
    changelog.md
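
A small helper keeps tooling aligned with this layout. The directory names are the ones above; the `templatePath` function is a hypothetical convenience, not something the knowledge base provides:

```typescript
// Resolve a template's path within the prompts/ tree shown above.
// Pure string manipulation, no filesystem access, so it behaves the
// same whether the library lives in Git or a knowledge base export.
const PROMPT_ROOT = "prompts";

function templatePath(useCase: string, name: string): string {
  // Normalize to the kebab-case file naming used in the tree.
  const slug = (s: string) => s.trim().toLowerCase().replace(/\s+/g, "-");
  return `${PROMPT_ROOT}/${slug(useCase)}/${slug(name)}.md`;
}

const p = templatePath("code-generation", "NestJS Service");
// p === "prompts/code-generation/nestjs-service.md"
```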

Versioning with Git. Prompts are version-controlled like code. When someone improves a prompt, they commit the change with a description of what improved and why.

Tagging by use case and quality. Each prompt has a header with metadata:

# NestJS Service Generation Prompt
- **Version:** 3.2
- **Last Updated:** 2026-12-15
- **Author:** [engineer name]
- **Quality Rating:** 4.2/5 (based on 47 uses)
- **Best Model:** Claude Sonnet or GPT-4
- **Temperature:** 0.2
- **Notes:** v3.2 added the "existing patterns" context section,
  which improved output consistency significantly

Quality tracking. After using a prompt, engineers rate the output quality (1-5 scale) with optional notes. We review ratings monthly and improve prompts that consistently score below 3.5.
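
The monthly review step can be sketched as code. The data shapes below are illustrative (our real tracking lives in the knowledge base, not in a script), but the rule is the one stated above: flag any template whose average rating falls below 3.5:

```typescript
// Sketch of the monthly quality review: collect per-use ratings
// (1-5) and flag templates whose average falls below the threshold.
interface Rating {
  template: string;
  score: 1 | 2 | 3 | 4 | 5;
  note?: string; // optional free-text note from the engineer
}

function templatesNeedingWork(ratings: Rating[], threshold = 3.5): string[] {
  const sums = new Map<string, { total: number; count: number }>();
  for (const r of ratings) {
    const s = sums.get(r.template) ?? { total: 0, count: 0 };
    s.total += r.score;
    s.count += 1;
    sums.set(r.template, s);
  }
  return [...sums.entries()]
    .filter(([, s]) => s.total / s.count < threshold)
    .map(([name]) => name)
    .sort();
}

const flagged = templatesNeedingWork([
  { template: "flutter-widget-test", score: 3 },
  { template: "flutter-widget-test", score: 3, note: "BLoC states emitted wrong" },
  { template: "nestjs-service", score: 5 },
  { template: "nestjs-service", score: 4 },
]);
// flagged === ["flutter-widget-test"]  (average 3.0 < 3.5; nestjs-service averages 4.5)
```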

Training New Team Members

Prompt engineering is part of our onboarding at Stripe Systems. New engineers complete a structured sequence during their first two weeks:

Week 1: Use the prompt library for daily tasks. No modifications, just use the templates as-is. This builds familiarity with what is available and what quality to expect.

Week 2: Modify prompts for specific tasks. The engineer encounters situations where templates do not quite fit, and they learn to adjust context, constraints, and output format.

Ongoing: Contribute improvements back to the prompt library. When an engineer discovers a prompt modification that consistently improves output, they submit it as a PR to the prompt repository.

The goal is not to make every engineer a prompt engineering expert. It is to make every engineer effective at the 80% of prompt use cases that are covered by templates, and competent at adapting prompts for the remaining 20%.

Case Study: The Stripe Systems Internal Prompt Library

Here are five real prompt templates from our internal library, anonymized but functionally identical. For each, we include the template, a filled example, and the team's quality rating.

Template 1: NestJS Service Scaffolding

Quality Rating: 4.3/5 (62 uses, team average)

You are a senior TypeScript developer working in a NestJS 10 application
with TypeORM 0.3.x and PostgreSQL. Follow these project conventions:
- Services are injectable classes with constructor-injected dependencies
- All database queries go through TypeORM repositories
- Use class-validator for DTO validation
- Use custom exceptions extending HttpException for error responses
- All amounts are stored in paise (integer), never rupees (float)
- Date fields use ISO 8601 format and are stored as timestamptz

Existing repository interface this service will use:
```typescript
@Injectable()
export class InvoiceRepository {
  constructor(
    @InjectRepository(Invoice)
    private readonly repo: Repository<Invoice>,
  ) {}

  // Method bodies elided in the prompt; the signatures are what the model needs.
  async findById(id: string): Promise<Invoice | null>;
  async findByClientId(clientId: string, pagination: PaginationDto): Promise<[Invoice[], number]>;
  async save(invoice: Invoice): Promise<Invoice>;
  async updateStatus(id: string, status: InvoiceStatus): Promise<void>;
}
```

Existing error classes:

```typescript
export class EntityNotFoundError extends HttpException { /* ... */ }
export class BusinessRuleViolationError extends HttpException { /* ... */ }
export class ValidationError extends HttpException { /* ... */ }
```

Generate an InvoiceService with these methods:

1. createInvoice(dto: CreateInvoiceDto): validates line items, calculates total with GST (18%), creates invoice in DRAFT status
2. submitInvoice(id: string): transitions from DRAFT to SUBMITTED, validates all required fields are present
3. getInvoice(id: string): fetches by ID, throws EntityNotFoundError if not found
4. listClientInvoices(clientId: string, pagination: PaginationDto): paginated list for a client

Output the complete service file with all imports. Include JSDoc for each public method. Do not include tests.


**Team notes:** "Works well for standard CRUD services. For services with complex business logic, use this to generate the structure and then rewrite the logic methods manually. The GST calculation pattern is consistently correct because we specify the rate and storage format (paise)."

### Template 2: Flutter Widget Test Generation

**Quality Rating: 3.8/5** (38 uses, team average)

You are a senior Flutter developer writing widget tests using flutter_test and mocktail. Follow these conventions:

  • โœ“Use pumpWidget with MaterialApp wrapper for all widget tests
  • โœ“Use mocktail for mocking (not mockito)
  • โœ“Test user-visible behavior, not implementation details
  • โœ“Use find.text(), find.byType(), and find.byKey() for finding widgets
  • โœ“Test loading states, error states, and success states

Widget under test:

{paste widget code}

Dependencies to mock:

{paste bloc/cubit/provider interfaces}

Example test from this project (follow this style):

{paste one existing test}

Generate widget tests covering:

1. Initial render: loading state shows CircularProgressIndicator
2. Success state: data displayed correctly
3. Error state: error message shown with retry button
4. User interaction: tap on item triggers expected navigation/action
5. Empty state: appropriate message shown when list is empty

Use realistic test data. Group tests with group() blocks. Include setUp and tearDown for mock registration.


**Team notes:** "Rating is 3.8 because the AI sometimes struggles with BLoC state management testing: it generates tests that emit states incorrectly. Works well for simple widgets but needs more human adjustment for complex stateful widgets. Adding the example test from the project improved output consistency from 3.2 to 3.8."

### Template 3: PR Review with Architecture Rules

**Quality Rating: 4.1/5** (94 uses, team average)

Review this pull request diff against our project standards.

Architecture Rules:

1. Feature modules must not import from other feature modules directly. Cross-feature communication goes through shared services or events.
2. Controllers handle HTTP concerns only (request parsing, response formatting, status codes). Business logic belongs in services.
3. Services must not use Request or Response objects from the HTTP layer.
4. All database access goes through repository classes. No raw queries in services.
5. External API calls go through adapter classes in the /adapters directory. Services depend on adapter interfaces, not implementations.
6. DTOs for request/response are separate from database entities. No entity objects in API responses.

Coding Standards:

1. All public methods have JSDoc comments with @param and @returns
2. Error handling uses custom exception classes, not generic Error
3. Async functions that call repositories must handle the null case
4. All new endpoints have corresponding test files
5. Environment-specific values come from ConfigService, not process.env

Diff:

{paste diff}

For each issue found:

  • โœ“Quote the specific code (file:line)
  • โœ“Category: architecture | standard | bug | security | test-gap
  • โœ“Severity: critical | major | minor
  • โœ“What is wrong
  • โœ“How to fix it

If the diff is clean, say "No issues found" โ€” do not invent problems.


**Team notes:** "This is our most-used prompt. The architecture rules section is the key differentiator: without it, the AI gives generic review feedback. With it, it catches real architecture violations. We update the rules section whenever we add or modify an architecture decision. False positive rate dropped from 20% to 7% after we added the 'do not invent problems' constraint."

### Template 4: SQL Query Optimization

**Quality Rating: 4.0/5** (27 uses, team average)

You are a PostgreSQL performance expert. Analyze this query and suggest optimizations.

Query:

{paste the slow query}

EXPLAIN ANALYZE output:

{paste EXPLAIN ANALYZE results}

Table definitions:

{paste CREATE TABLE statements for involved tables}

Current indexes:

{paste relevant index definitions}

Data characteristics:

  • โœ“Table sizes: {e.g., orders: 2.3M rows, line_items: 8.7M rows}
  • โœ“Query frequency: {e.g., runs 500 times/hour}
  • โœ“Acceptable latency: {e.g., p99 < 50ms}

Provide:

1. Analysis of the current execution plan: where is time spent?
2. Specific optimizations ranked by expected impact:
   - Query rewrites (with the rewritten query)
   - Index suggestions (with CREATE INDEX statements)
   - Schema changes (if justified by the performance requirement)
3. For each suggestion, estimate the improvement and any tradeoffs (e.g., an index speeds up reads but slows writes)

Do not suggest: upgrading hardware, increasing memory, or generic advice like "add more indexes." Be specific.


**Team notes:** "Works well when you provide the EXPLAIN ANALYZE output; without it, the suggestions are generic. The model is good at identifying missing indexes and suggesting query rewrites for common patterns (correlated subqueries → JOINs, OR conditions → UNION). For complex queries involving CTEs or window functions, quality drops: the model sometimes suggests 'optimizations' that change semantics."

### Template 5: Incident RCA Analysis

**Quality Rating: 3.9/5** (15 uses, team average)

Assist with root cause analysis for a production incident.

Incident Summary:

  • โœ“Service: {service name}
  • โœ“Start time: {timestamp}
  • โœ“End time: {timestamp}
  • โœ“Impact: {what users experienced}
  • โœ“Severity: {P1/P2/P3}

Timeline of events (from monitoring and logs): {paste chronological timeline}

Recent changes (deployments, config changes, infrastructure): {list changes in the 24 hours before the incident}

Metrics during the incident:

  • โœ“Error rate: {before} โ†’ {during}
  • โœ“Latency p99: {before} โ†’ {during}
  • โœ“CPU/Memory: {before} โ†’ {during}
  • โœ“Database: {connection pool, query latency, etc.}

Relevant log entries:

{paste key log entries with timestamps}

Provide:

1. Most likely root cause with supporting evidence from the timeline
2. Contributing factors (conditions that made the incident worse)
3. Why existing monitoring/alerting did not catch it sooner
4. Specific remediation actions:
   - Immediate (prevent recurrence)
   - Short-term (improve detection)
   - Long-term (systemic improvement)
5. Timeline reconstruction: correlate events across the timeline to show the cascade of failures

Be specific. Reference timestamps and log entries. Do not suggest generic improvements unrelated to this incident.


**Team notes:** "Most useful for correlating events across a complex timeline. The model is good at spotting that event A at 14:02 caused event B at 14:03 when you have twenty events to sort through. Less useful for novel failure modes โ€” it tends to attribute root causes to the most recent deployment even when the deployment is unrelated. We use this to draft the RCA document, then the on-call engineer reviews and corrects the analysis."

## Measuring Prompt Library Effectiveness

We track three metrics for the prompt library:

**Usage rate.** How often each template is used per week. Templates with low usage are either too niche (acceptable) or not useful (need improvement). We review usage monthly.

**Quality rating trend.** Each template's average rating over time. A declining rating signals that the template has not kept up with changes in our tech stack or conventions. A rising rating means recent improvements are working.

**Time-to-useful-output.** How many prompt iterations it takes to get usable output. A good template produces usable output in one iteration for 70%+ of uses. If engineers consistently need to re-prompt, the template needs improvement.

Current numbers for our top-5 templates (as of the publication date):

| Template | Uses/Week | Avg Rating | One-Shot Success Rate |
|---|---|---|---|
| NestJS Service | 8.2 | 4.3 | 78% |
| Flutter Widget Test | 5.1 | 3.8 | 62% |
| PR Review | 12.4 | 4.1 | 85% |
| SQL Optimization | 3.6 | 4.0 | 71% |
| Incident RCA | 1.8 | 3.9 | 65% |
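
The 70% one-shot target from above can be checked mechanically against these numbers. The data is copied from the table; the check itself is a sketch, not part of our reporting pipeline:

```typescript
// One-shot success target for a good template is 70%+ (see above).
// Metric values below are taken directly from the table.
interface TemplateMetrics {
  name: string;
  usesPerWeek: number;
  avgRating: number;
  oneShotRate: number; // fraction, 0..1
}

const metrics: TemplateMetrics[] = [
  { name: "NestJS Service",      usesPerWeek: 8.2,  avgRating: 4.3, oneShotRate: 0.78 },
  { name: "Flutter Widget Test", usesPerWeek: 5.1,  avgRating: 3.8, oneShotRate: 0.62 },
  { name: "PR Review",           usesPerWeek: 12.4, avgRating: 4.1, oneShotRate: 0.85 },
  { name: "SQL Optimization",    usesPerWeek: 3.6,  avgRating: 4.0, oneShotRate: 0.71 },
  { name: "Incident RCA",        usesPerWeek: 1.8,  avgRating: 3.9, oneShotRate: 0.65 },
];

function belowOneShotTarget(ms: TemplateMetrics[], target = 0.7): string[] {
  return ms.filter((m) => m.oneShotRate < target).map((m) => m.name);
}

const needsIteration = belowOneShotTarget(metrics);
// needsIteration === ["Flutter Widget Test", "Incident RCA"]
```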

The PR review template has the highest one-shot success rate because its output is advisory (a list of issues to consider), not generative (code that must compile and work). Generative templates inherently have lower one-shot rates because the output has more constraints to satisfy.

## Conclusion

Prompt engineering for software teams is not about crafting the perfect prompt. It is about building a shared library of good-enough prompts that consistently produce useful output, and improving them over time based on measured feedback.

The playbook we have built at Stripe Systems is not complex. It is six patterns, a shared repository, a rating system, and a monthly review process. The investment is small, maybe 2-3 hours per month for maintenance, and the return is measured in consistent output quality, faster onboarding, and shared learning across the team.

If your team is using LLMs without a shared prompt library, you are leaving value on the table. Not because individual engineers cannot write good prompts, but because their discoveries die with their chat history instead of becoming team knowledge.

Start with the six patterns in this post, adapt them to your tech stack, and iterate from there. The templates are meant to be modified, not followed rigidly. The structure matters more than the specific wording โ€” as long as you provide system context, inject relevant code, specify the task clearly, define the output format, and set constraints, the output quality will be consistently higher than unstructured prompting.
