Engineering Culture · March 25, 2026 · 20 min read

Prompt Engineering for Software Teams: The Internal Playbook We Built to Maximize Developer Output with LLMs

โœ๏ธ
Stripe Systems Engineering

Every developer on your team uses LLMs differently. One engineer writes "make me a login page" and gets generic boilerplate. Another writes a structured prompt with framework constraints, authentication requirements, and test expectations, and gets usable code in one shot. The difference is not the model. It is the prompt.

At Stripe Systems, we spent three months formalizing our prompt engineering practices into an internal playbook. Not because prompt engineering is rocket science, but because consistency matters. When every engineer uses their own prompting style, you get inconsistent output quality, wasted time on prompt iteration, and no shared learning about what works and what does not.

This post shares the core of that playbook: six prompt patterns we use daily, common anti-patterns to avoid, guidance on model and temperature selection, and our approach to prompt versioning. We also include five real prompt templates from our internal library, anonymized but functionally identical to what our team uses.

Why Software Teams Need a Prompt Playbook

Three reasons:

Consistency. A shared set of prompt templates means every engineer gets similar-quality output for common tasks. The senior engineer's test generation prompt should produce the same quality results when a junior engineer uses it.

Knowledge sharing. When one engineer discovers that adding "rank hypotheses by probability" to a debugging prompt dramatically improves output quality, that discovery should propagate to the team. Without a shared playbook, it stays in one person's head.

Onboarding. New team members are effective with AI tools from day one if they have a library of proven prompts. Without it, they spend weeks discovering through trial and error what the team already knows.

Prompt Anatomy

Before the patterns, let us establish vocabulary. Every effective prompt has these components:

┌──────────────────────────────────────────────────┐
│ SYSTEM PROMPT (role + behavioral constraints)    │
├──────────────────────────────────────────────────┤
│ CONTEXT INJECTION (code, docs, architecture)     │
├──────────────────────────────────────────────────┤
│ TASK SPECIFICATION (what to do, clearly)         │
├──────────────────────────────────────────────────┤
│ OUTPUT FORMAT (structure, length, format)        │
├──────────────────────────────────────────────────┤
│ CONSTRAINTS (what to avoid, boundaries)          │
└──────────────────────────────────────────────────┘

System prompt sets the persona and behavioral rules. "You are a senior TypeScript engineer who follows NestJS conventions. Do not suggest deprecated APIs."

Context injection provides the specific code, documentation, or architecture information the model needs. This is where most prompts fail: insufficient context produces generic output.

Task specification is the actual instruction. "Review this code for potential null pointer exceptions" is a task specification. It should be specific, actionable, and unambiguous.

Output format tells the model how to structure its response. "Return a JSON array of objects with fields: line_number, issue, severity, suggestion" eliminates the need to parse prose.

Constraints define what the model should not do. "Do not suggest rewriting the function. Only identify issues in the existing code." Constraints prevent the model from going off on tangents.

Not every prompt needs all five components. Quick questions need only a task specification. Complex code generation needs all five.
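
Treated as data, the anatomy is just five strings concatenated in a fixed order, with empty sections dropped. A minimal sketch in TypeScript (the field names and `buildPrompt` helper are our own illustration, not part of any SDK):

```typescript
// Hypothetical prompt-assembly helper: the five anatomy components,
// joined in the order shown in the diagram above. Empty sections are
// skipped, so a quick question can be just a task specification.
interface PromptParts {
  system?: string;       // role + behavioral constraints
  context?: string;      // code, docs, architecture
  task: string;          // the actual instruction (always required)
  outputFormat?: string; // structure, length, format
  constraints?: string;  // what to avoid, boundaries
}

function buildPrompt(p: PromptParts): string {
  const sections: Array<[string, string | undefined]> = [
    ["System", p.system],
    ["Context", p.context],
    ["Task", p.task],
    ["Output Format", p.outputFormat],
    ["Constraints", p.constraints],
  ];
  return sections
    .filter(([, body]) => body !== undefined && body.trim() !== "")
    .map(([label, body]) => `[${label}]\n${body!.trim()}`)
    .join("\n\n");
}

// A quick question needs only the task:
const quick = buildPrompt({ task: "What does Promise.allSettled return?" });

// Complex code generation uses all five components:
const full = buildPrompt({
  system: "You are a senior TypeScript engineer. Do not suggest deprecated APIs.",
  context: "interface User { id: string; email: string }",
  task: "Generate a validation function for the User interface.",
  outputFormat: "Single file, complete and runnable, all imports included.",
  constraints: "Use the project's existing validation library, not hand-rolled checks.",
});
```

The same ordering every time also makes prompts diffable, which matters once they live in version control.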

Pattern 1: Code Generation Prompts

Code generation is the most common use of LLMs for developers, and it is where prompt quality matters most. A vague prompt produces code that compiles but does not fit your architecture. A structured prompt produces code that slots into your codebase with minimal modification.

Template Structure

[System] You are a senior {language} developer working in a {framework}
codebase. Follow these project conventions:
- {convention 1}
- {convention 2}
- {convention 3}

[Context]
Existing code that this new code must integrate with:

{paste relevant existing code: interfaces, types, related services}


Database schema (if relevant):

{paste relevant table definitions}


[Task] Generate a {component type} that {does what}.

Requirements:
1. {specific requirement}
2. {specific requirement}
3. {specific requirement}

[Output Format]
- Single file, complete and runnable
- Include all imports
- Include JSDoc comments for public methods
- Do not include test code in this file

[Constraints]
- Use {specific library} for {specific purpose}, not {alternative}
- Error handling must use {project's error handling pattern}
- Do not use any deprecated APIs

Why This Works

The model receives enough context to generate code that fits your specific codebase, not generic code for a generic project. The conventions section prevents style violations. The constraints prevent the model from making common substitutions (e.g., using Axios when your project uses the built-in fetch wrapper).

Common Failure Mode

Omitting the existing code context. If you ask for "a user service in NestJS" without showing the model your existing service patterns, base classes, and error handling approach, you get a service that works in isolation but does not match your codebase.

Pattern 2: Debugging Prompts

Debugging prompts are the second most common use case, and they benefit enormously from structure. An unstructured "why doesn't this work?" produces generic suggestions. A structured debugging prompt produces ranked hypotheses with specific verification steps.

Template Structure

[System] You are debugging a {language}/{framework} application.
Provide hypotheses ranked by probability. For each hypothesis,
include a specific test to confirm or rule it out.

[Context]
Error/Stack trace:

{paste full error output}


Relevant code:

{paste the function where the error occurs and its dependencies}


Environment:
- Runtime: {Node 20 / Python 3.11 / etc.}
- Framework: {NestJS 10 / Django 4.2 / etc.}
- Database: {PostgreSQL 15 / MongoDB 7 / etc.}
- OS: {Linux / macOS} in {Docker / native}

Expected behavior: {what should happen}
Actual behavior: {what actually happens}

What I've already checked:
- {hypothesis 1: eliminated because...}
- {hypothesis 2: eliminated because...}

[Task] Provide your top 5 hypotheses ranked by probability.
For each, explain why it could cause this specific error
and give me a concrete command or code change to verify it.

[Constraints]
- Do not suggest solutions I've already eliminated.
- Focus on this specific error, not general best practices.

Why This Works

The "what I've already checked" section is critical. Without it, the model's first three suggestions are usually things you already tried. Including eliminated hypotheses forces the model to go deeper.

The "specific test to confirm" requirement prevents vague suggestions like "check your database connection." Instead, you get: "Run SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction' to check for connection leaks. If the count exceeds your pool size (default 10), you have a connection leak."

Pattern 3: Architecture Review Prompts

Architecture review prompts are used during design phases and when evaluating significant changes to existing systems. The key is providing enough context about constraints and quality attributes.

Template Structure

[System] You are a senior software architect reviewing a design
document. Provide specific, actionable feedback. Do not restate
what the document already says. Focus on risks, gaps, and
unconsidered scenarios.

[Context]
Design document:

{paste full design doc or relevant sections}


System constraints:
- Team size: {N engineers}
- Deployment: {Kubernetes / VM / serverless}
- Scale: {requests/sec, data volume, user count}
- Compliance: {GDPR / PCI-DSS / HIPAA / none}
- Existing infrastructure: {list key existing systems}

Quality attributes prioritized (in order):
1. {e.g., Reliability: 99.9% uptime required}
2. {e.g., Maintainability: team rotates every 6 months}
3. {e.g., Performance: p99 < 200ms for API calls}

[Task] Review this design against the stated quality attributes.
For each concern you identify:
1. State the concern clearly
2. Explain the specific failure scenario
3. Rate severity (critical/major/minor)
4. Suggest a mitigation with tradeoffs

[Constraints]
- Do not suggest alternative architectures unless the current
  design has a critical flaw.
- Focus on the gaps in this design, not general architecture advice.
- Limit to 10 most important concerns.

Why This Works

The quality attributes provide evaluation criteria. Without them, the model produces generic architecture advice ("consider using a message queue"). With them, the model evaluates the design against specific requirements ("your design has a synchronous dependency between the order service and inventory service, which means an inventory service outage will cause order placement failures โ€” this conflicts with your 99.9% uptime requirement").

Pattern 4: Test Generation Prompts

Test generation prompts produce the highest ROI of any prompt pattern because test code is highly structured and repetitive, making it an ideal target for AI generation.

Template Structure

[System] You are a senior QA engineer writing tests in {test framework}
for a {language}/{framework} application. Write thorough, readable tests.

[Context]
Function/class under test:

{paste the full function/class with type definitions}


Dependencies (mocked in tests):

{paste interfaces/types of dependencies}


Existing test patterns in this project:

{paste one example test from the project to establish style}


[Task] Generate test cases covering:
1. Happy path (standard input → expected output)
2. Edge cases: {list specific edge cases to cover}
3. Error cases: {list specific error scenarios}
4. Boundary values: {list boundaries}

[Output Format]
- Use {Jest / Mocha / pytest / etc.} syntax
- Use {project's mocking library} for mocks
- Group tests with describe/it blocks matching project convention
- Include setup and teardown where needed
- One assertion per test where practical

[Constraints]
- Do not test private methods directly
- Do not test implementation details (e.g., "it should call
  repository.save once"); test behavior
- Use realistic test data, not "test123" or "foo/bar"

Why This Works

The existing test pattern example is critical. It teaches the model your project's testing conventions: how you structure describe blocks, how you name tests, how you set up mocks, and what assertion style you use. Without this, the model generates tests in its own style, and you spend time reformatting.

The constraint about not testing implementation details prevents a common AI failure mode: generating tests that assert on internal method calls rather than observable behavior. These tests are brittle and break on refactoring.

Pattern 5: Documentation Prompts

Documentation prompts convert code into readable documentation. They are straightforward but benefit from specifying the audience and the documentation standard.

Template Structure

[System] You are a technical writer creating documentation for
a {language} API. Write for an audience of: {junior developers /
senior developers / external API consumers}.

[Context]
Code to document:

{paste controller/service/module code}


Existing documentation style (follow this):

{paste one example from existing docs}


[Task] Generate documentation covering:
1. Overview: what this module/API does, in 2-3 sentences
2. For each endpoint/method:
   - Purpose
   - Parameters with types and validation rules
   - Return value with type
   - Error cases with HTTP status codes
   - Example request and response
3. Authentication requirements
4. Rate limiting (if applicable)

[Output Format]
- Markdown
- Use code blocks for examples
- Include curl examples for API endpoints

[Constraints]
- Do not describe how the code works internally;
  describe what it does from the consumer's perspective.
- Do not invent features that are not in the code.
- If behavior is ambiguous from the code, flag it as
  "[NEEDS CLARIFICATION]" rather than guessing.

Why This Works

The "[NEEDS CLARIFICATION]" constraint is important. Without it, the model confidently documents behavior it is guessing about. With this constraint, ambiguous areas are flagged for human attention rather than documented incorrectly.

Pattern 6: Code Review Prompts

Code review prompts are used in automated review pipelines and by individual engineers seeking pre-review feedback.

Template Structure

[System] You are a senior code reviewer. Your review should be:
- Specific (reference line numbers and variable names)
- Actionable (say what to change, not just what's wrong)
- Prioritized (critical issues first)
- Respectful (critique the code, not the author)

[Context]
Diff to review:

{paste git diff}


Project coding standards:
- {standard 1: e.g., "All database access goes through repository classes"}
- {standard 2: e.g., "Services must not import from controllers"}
- {standard 3: e.g., "All public methods must have JSDoc comments"}
- {standard 4: e.g., "Error responses must use the ApiError class"}

Architecture rules:
- {rule 1: e.g., "No direct HTTP calls from service layer โ€” use adapters"}
- {rule 2: e.g., "Feature modules must not depend on other feature modules"}

[Task] Review this diff for:
1. Bugs (null checks, error handling, logic errors)
2. Coding standard violations
3. Architecture rule violations
4. Test coverage gaps (missing tests for new/changed code)
5. Security concerns

[Output Format]
For each issue:
- File and line reference
- Category (bug/standard/architecture/test/security)
- Severity (critical/major/minor)
- Description
- Suggested fix

[Constraints]
- Do not comment on style/formatting (linter handles this)
- Do not suggest refactoring unrelated to the diff
- Limit to 15 most important issues
- If no issues found, say so; do not manufacture concerns

Why This Works

The coding standards and architecture rules section is what makes this prompt production-useful rather than generic. By encoding your project's specific rules, the AI review catches violations that a generic review would miss.

The "do not manufacture concerns" constraint prevents the model from generating a review comment on every PR. Some PRs are clean, and the review should say so.
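
Because the output format pins down categories and severity levels, review output can be consumed programmatically. A sketch of the shape this prompt asks for, with a sorter that surfaces critical findings first (the type names and `prioritize` helper are illustrative, not something we ship):

```typescript
// Illustrative types mirroring the [Output Format] section of the
// review prompt: one record per issue, sortable by severity.
type Category = "bug" | "standard" | "architecture" | "test" | "security";
type Severity = "critical" | "major" | "minor";

interface ReviewIssue {
  file: string;
  line: number;
  category: Category;
  severity: Severity;
  description: string;
  suggestedFix: string;
}

// Lower rank sorts first, so critical issues lead the review.
const severityRank: Record<Severity, number> = { critical: 0, major: 1, minor: 2 };

function prioritize(issues: ReviewIssue[]): ReviewIssue[] {
  return [...issues].sort(
    (a, b) => severityRank[a.severity] - severityRank[b.severity],
  );
}

const sorted = prioritize([
  { file: "order.service.ts", line: 42, category: "standard", severity: "minor",
    description: "Missing JSDoc on public method", suggestedFix: "Add JSDoc" },
  { file: "order.service.ts", line: 18, category: "bug", severity: "critical",
    description: "Repository result not null-checked", suggestedFix: "Handle the null case" },
]);
```

Asking the model for this structure directly (rather than prose) is what makes the template usable in an automated pipeline as well as by individual engineers.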

Anti-Patterns

Common prompt mistakes we have observed and corrected:

Vague prompts. "Write me a service" → generates generic code. "Write a NestJS service for order processing that validates stock availability before creating an order, uses the existing InventoryRepository interface, and throws an InsufficientStockError if any line item is unavailable" → generates usable code.

Missing context. Asking the model to review code without providing the interfaces it implements, the types it uses, or the patterns the project follows. The model fills in the gaps with assumptions, and those assumptions are usually wrong.

Asking for too much at once. "Write a complete user management module with registration, login, password reset, profile management, role management, and an admin dashboard" → produces shallow implementations of everything. Break it into focused requests: one for registration, one for login, etc.

Not specifying output format. Without format constraints, the model wraps code in explanatory prose, uses different indentation, or generates code with inline comments explaining every line. Specify: "Output only the code, no explanation. Use 2-space indentation. Minimal comments."

Ignoring the system prompt. Many developers skip the system prompt entirely. The system prompt establishes the persona, conventions, and constraints that apply to all subsequent interactions in the conversation. It is the highest-leverage part of the prompt.

Temperature and Model Selection

Temperature controls randomness in the model's output. For software engineering tasks, the right temperature depends on the task type:

Temperature 0.0 - 0.3 (deterministic): Use for code generation, test writing, bug fixing, and any task where there is one correct answer. Low temperature produces consistent, predictable output. You want the same prompt to produce the same code every time.

Temperature 0.4 - 0.6 (balanced): Use for documentation writing, code review, and refactoring suggestions. These tasks benefit from some variation โ€” the model might phrase a documentation section differently or suggest an alternative refactoring approach.

Temperature 0.7 - 1.0 (creative): Use for brainstorming, architecture exploration, and naming. "Suggest 10 names for this microservice" benefits from high temperature. "Write a database migration" does not.
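
The bands above can be encoded as a small lookup so tooling applies them consistently. The task names and values come from this section; the helper itself is our own sketch, not a model-provider recommendation:

```typescript
// Default temperature per task type, following the bands above.
// These are team conventions baked into tooling, so individual
// engineers do not have to remember the numbers.
type TaskType =
  | "code-generation" | "test-writing" | "bug-fixing"        // deterministic
  | "documentation" | "code-review" | "refactoring"          // balanced
  | "brainstorming" | "architecture-exploration" | "naming"; // creative

const defaults: Record<TaskType, number> = {
  "code-generation": 0.2,  // one correct answer: keep output predictable
  "test-writing": 0.2,
  "bug-fixing": 0.2,
  "documentation": 0.5,    // some variation in phrasing is useful
  "code-review": 0.5,
  "refactoring": 0.5,
  "brainstorming": 0.8,    // diversity of suggestions is the point
  "architecture-exploration": 0.8,
  "naming": 0.8,
};

function defaultTemperature(task: TaskType): number {
  return defaults[task];
}
```

Using a typed map means adding a new task type without a temperature is a compile error, not a silently wrong default.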

Model selection: For code generation and review, use the most capable model available (GPT-4-class or Claude Sonnet/Opus). For quick lookups, syntax questions, and simple transformations, smaller/faster models (GPT-4o-mini, Claude Haiku) are sufficient and cheaper. The cost difference is significant at scale: a team of 8 engineers making 50+ API calls per day per person will notice the bill.

Prompt Versioning

Effective prompts are intellectual property. They encode team knowledge about what works for your specific tech stack, coding standards, and project patterns. Losing them is expensive.

Our approach to prompt versioning:

Shared repository. We maintain a directory in our internal knowledge base with prompt templates organized by use case:

prompts/
  code-generation/
    nestjs-service.md
    flutter-widget.md
    react-component.md
  testing/
    unit-test-nestjs.md
    widget-test-flutter.md
    e2e-test-playwright.md
  review/
    pr-review-backend.md
    pr-review-frontend.md
    pr-review-mobile.md
  debugging/
    backend-debug.md
    frontend-debug.md
  documentation/
    api-docs.md
    readme-section.md
    changelog.md
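
A small helper keeps tooling aligned with this layout. The directory names are the ones above; the `templatePath` function is a hypothetical convenience, not something the knowledge base provides:

```typescript
// Resolve a template's path within the prompts/ tree shown above.
// Pure string manipulation, no filesystem access, so it behaves the
// same whether the library lives in Git or a knowledge base export.
const PROMPT_ROOT = "prompts";

function templatePath(useCase: string, name: string): string {
  // Normalize to the kebab-case file naming used in the tree.
  const slug = (s: string) => s.trim().toLowerCase().replace(/\s+/g, "-");
  return `${PROMPT_ROOT}/${slug(useCase)}/${slug(name)}.md`;
}

const p = templatePath("code-generation", "NestJS Service");
// p === "prompts/code-generation/nestjs-service.md"
```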

Versioning with Git. Prompts are version-controlled like code. When someone improves a prompt, they commit the change with a description of what improved and why.

Tagging by use case and quality. Each prompt has a header with metadata:

# NestJS Service Generation Prompt
- **Version:** 3.2
- **Last Updated:** 2026-12-15
- **Author:** [engineer name]
- **Quality Rating:** 4.2/5 (based on 47 uses)
- **Best Model:** Claude Sonnet or GPT-4
- **Temperature:** 0.2
- **Notes:** v3.2 added the "existing patterns" context section,
  which improved output consistency significantly

Quality tracking. After using a prompt, engineers rate the output quality (1-5 scale) with optional notes. We review ratings monthly and improve prompts that consistently score below 3.5.
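
The monthly review step can be sketched as code. The data shapes below are illustrative (our real tracking lives in the knowledge base, not in a script), but the rule is the one stated above: flag any template whose average rating falls below 3.5:

```typescript
// Sketch of the monthly quality review: collect per-use ratings
// (1-5) and flag templates whose average falls below the threshold.
interface Rating {
  template: string;
  score: 1 | 2 | 3 | 4 | 5;
  note?: string; // optional free-text note from the engineer
}

function templatesNeedingWork(ratings: Rating[], threshold = 3.5): string[] {
  const sums = new Map<string, { total: number; count: number }>();
  for (const r of ratings) {
    const s = sums.get(r.template) ?? { total: 0, count: 0 };
    s.total += r.score;
    s.count += 1;
    sums.set(r.template, s);
  }
  return [...sums.entries()]
    .filter(([, s]) => s.total / s.count < threshold)
    .map(([name]) => name)
    .sort();
}

const flagged = templatesNeedingWork([
  { template: "flutter-widget-test", score: 3 },
  { template: "flutter-widget-test", score: 3, note: "BLoC states emitted wrong" },
  { template: "nestjs-service", score: 5 },
  { template: "nestjs-service", score: 4 },
]);
// flagged === ["flutter-widget-test"]  (average 3.0 < 3.5; nestjs-service averages 4.5)
```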

Training New Team Members

Prompt engineering is part of our onboarding at Stripe Systems. New engineers complete a structured sequence during their first two weeks:

Week 1: Use the prompt library for daily tasks. No modifications, just use the templates as-is. This builds familiarity with what is available and what quality to expect.

Week 2: Modify prompts for specific tasks. The engineer encounters situations where templates do not quite fit, and they learn to adjust context, constraints, and output format.

Ongoing: Contribute improvements back to the prompt library. When an engineer discovers a prompt modification that consistently improves output, they submit it as a PR to the prompt repository.

The goal is not to make every engineer a prompt engineering expert. It is to make every engineer effective at the 80% of prompt use cases that are covered by templates, and competent at adapting prompts for the remaining 20%.

Case Study: The Stripe Systems Internal Prompt Library

Here are five real prompt templates from our internal library, anonymized but functionally identical. For each, we include the template, a filled example, and the team's quality rating.

Template 1: NestJS Service Scaffolding

Quality Rating: 4.3/5 (62 uses, team average)

You are a senior TypeScript developer working in a NestJS 10 application
with TypeORM 0.3.x and PostgreSQL. Follow these project conventions:
- Services are injectable classes with constructor-injected dependencies
- All database queries go through TypeORM repositories
- Use class-validator for DTO validation
- Use custom exceptions extending HttpException for error responses
- All amounts are stored in paise (integer), never rupees (float)
- Date fields use ISO 8601 format and are stored as timestamptz

Existing repository interface this service will use:
```typescript
@Injectable()
export class InvoiceRepository {
  constructor(
    @InjectRepository(Invoice)
    private readonly repo: Repository<Invoice>,
  ) {}

  // Method bodies elided in the prompt; the signatures are what the model needs.
  async findById(id: string): Promise<Invoice | null>;
  async findByClientId(clientId: string, pagination: PaginationDto): Promise<[Invoice[], number]>;
  async save(invoice: Invoice): Promise<Invoice>;
  async updateStatus(id: string, status: InvoiceStatus): Promise<void>;
}
```

Existing error classes:

```typescript
export class EntityNotFoundError extends HttpException { /* ... */ }
export class BusinessRuleViolationError extends HttpException { /* ... */ }
export class ValidationError extends HttpException { /* ... */ }
```

Generate an InvoiceService with these methods:

1. createInvoice(dto: CreateInvoiceDto): validates line items, calculates total with GST (18%), creates invoice in DRAFT status
2. submitInvoice(id: string): transitions from DRAFT to SUBMITTED, validates all required fields are present
3. getInvoice(id: string): fetches by ID, throws EntityNotFoundError if not found
4. listClientInvoices(clientId: string, pagination: PaginationDto): paginated list for a client

Output the complete service file with all imports. Include JSDoc for each public method. Do not include tests.


**Team notes:** "Works well for standard CRUD services. For services with complex business logic, use this to generate the structure and then rewrite the logic methods manually. The GST calculation pattern is consistently correct because we specify the rate and storage format (paise)."

### Template 2: Flutter Widget Test Generation

**Quality Rating: 3.8/5** (38 uses, team average)

You are a senior Flutter developer writing widget tests using flutter_test and mocktail. Follow these conventions:

  • โœ“Use pumpWidget with MaterialApp wrapper for all widget tests
  • โœ“Use mocktail for mocking (not mockito)
  • โœ“Test user-visible behavior, not implementation details
  • โœ“Use find.text(), find.byType(), and find.byKey() for finding widgets
  • โœ“Test loading states, error states, and success states

Widget under test:

{paste widget code}

Dependencies to mock:

{paste bloc/cubit/provider interfaces}

Example test from this project (follow this style):

{paste one existing test}

Generate widget tests covering:

1. Initial render: loading state shows CircularProgressIndicator
2. Success state: data displayed correctly
3. Error state: error message shown with retry button
4. User interaction: tap on item triggers expected navigation/action
5. Empty state: appropriate message shown when list is empty

Use realistic test data. Group tests with group() blocks. Include setUp and tearDown for mock registration.


**Team notes:** "Rating is 3.8 because the AI sometimes struggles with BLoC state management testing: it generates tests that emit states incorrectly. Works well for simple widgets but needs more human adjustment for complex stateful widgets. Adding the example test from the project improved output consistency from 3.2 to 3.8."

### Template 3: PR Review with Architecture Rules

**Quality Rating: 4.1/5** (94 uses, team average)

Review this pull request diff against our project standards.

Architecture Rules:

1. Feature modules must not import from other feature modules directly. Cross-feature communication goes through shared services or events.
2. Controllers handle HTTP concerns only (request parsing, response formatting, status codes). Business logic belongs in services.
3. Services must not use Request or Response objects from the HTTP layer.
4. All database access goes through repository classes. No raw queries in services.
5. External API calls go through adapter classes in the /adapters directory. Services depend on adapter interfaces, not implementations.
6. DTOs for request/response are separate from database entities. No entity objects in API responses.

Coding Standards:

1. All public methods have JSDoc comments with @param and @returns
2. Error handling uses custom exception classes, not generic Error
3. Async functions that call repositories must handle the null case
4. All new endpoints have corresponding test files
5. Environment-specific values come from ConfigService, not process.env

Diff:

{paste diff}

For each issue found:

  • โœ“Quote the specific code (file:line)
  • โœ“Category: architecture | standard | bug | security | test-gap
  • โœ“Severity: critical | major | minor
  • โœ“What is wrong
  • โœ“How to fix it

If the diff is clean, say "No issues found" โ€” do not invent problems.


**Team notes:** "This is our most-used prompt. The architecture rules section is the key differentiator: without it, the AI gives generic review feedback. With it, it catches real architecture violations. We update the rules section whenever we add or modify an architecture decision. False positive rate dropped from 20% to 7% after we added the 'do not invent problems' constraint."

### Template 4: SQL Query Optimization

**Quality Rating: 4.0/5** (27 uses, team average)

You are a PostgreSQL performance expert. Analyze this query and suggest optimizations.

Query:

{paste the slow query}

EXPLAIN ANALYZE output:

{paste EXPLAIN ANALYZE results}

Table definitions:

{paste CREATE TABLE statements for involved tables}

Current indexes:

{paste relevant index definitions}

Data characteristics:

  • โœ“Table sizes: {e.g., orders: 2.3M rows, line_items: 8.7M rows}
  • โœ“Query frequency: {e.g., runs 500 times/hour}
  • โœ“Acceptable latency: {e.g., p99 < 50ms}

Provide:

1. Analysis of the current execution plan: where is time spent?
2. Specific optimizations ranked by expected impact:
   - Query rewrites (with the rewritten query)
   - Index suggestions (with CREATE INDEX statements)
   - Schema changes (if justified by the performance requirement)
3. For each suggestion, estimate the improvement and any tradeoffs (e.g., an index speeds up reads but slows writes)

Do not suggest: upgrading hardware, increasing memory, or generic advice like "add more indexes." Be specific.


**Team notes:** "Works well when you provide the EXPLAIN ANALYZE output; without it, the suggestions are generic. The model is good at identifying missing indexes and suggesting query rewrites for common patterns (correlated subqueries → JOINs, OR conditions → UNION). For complex queries involving CTEs or window functions, quality drops: the model sometimes suggests 'optimizations' that change semantics."

### Template 5: Incident RCA Analysis

**Quality Rating: 3.9/5** (15 uses, team average)

Assist with root cause analysis for a production incident.

Incident Summary:

  • โœ“Service: {service name}
  • โœ“Start time: {timestamp}
  • โœ“End time: {timestamp}
  • โœ“Impact: {what users experienced}
  • โœ“Severity: {P1/P2/P3}

Timeline of events (from monitoring and logs): {paste chronological timeline}

Recent changes (deployments, config changes, infrastructure): {list changes in the 24 hours before the incident}

Metrics during the incident:

  • โœ“Error rate: {before} โ†’ {during}
  • โœ“Latency p99: {before} โ†’ {during}
  • โœ“CPU/Memory: {before} โ†’ {during}
  • โœ“Database: {connection pool, query latency, etc.}

Relevant log entries:

{paste key log entries with timestamps}

Provide:

1. Most likely root cause with supporting evidence from the timeline
2. Contributing factors (conditions that made the incident worse)
3. Why existing monitoring/alerting did not catch it sooner
4. Specific remediation actions:
   - Immediate (prevent recurrence)
   - Short-term (improve detection)
   - Long-term (systemic improvement)
5. Timeline reconstruction: correlate events across the timeline to show the cascade of failures

Be specific. Reference timestamps and log entries. Do not suggest generic improvements unrelated to this incident.


**Team notes:** "Most useful for correlating events across a complex timeline. The model is good at spotting that event A at 14:02 caused event B at 14:03 when you have twenty events to sort through. Less useful for novel failure modes โ€” it tends to attribute root causes to the most recent deployment even when the deployment is unrelated. We use this to draft the RCA document, then the on-call engineer reviews and corrects the analysis."

## Measuring Prompt Library Effectiveness

We track three metrics for the prompt library:

**Usage rate.** How often each template is used per week. Templates with low usage are either too niche (acceptable) or not useful (need improvement). We review usage monthly.

**Quality rating trend.** Each template's average rating over time. A declining rating signals that the template has not kept up with changes in our tech stack or conventions. A rising rating means recent improvements are working.

**Time-to-useful-output.** How many prompt iterations it takes to get usable output. A good template produces usable output in one iteration for 70%+ of uses. If engineers consistently need to re-prompt, the template needs improvement.

Current numbers for our top-5 templates (as of the publication date):

| Template | Uses/Week | Avg Rating | One-Shot Success Rate |
|---|---|---|---|
| NestJS Service | 8.2 | 4.3 | 78% |
| Flutter Widget Test | 5.1 | 3.8 | 62% |
| PR Review | 12.4 | 4.1 | 85% |
| SQL Optimization | 3.6 | 4.0 | 71% |
| Incident RCA | 1.8 | 3.9 | 65% |
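
The 70% one-shot target from above can be checked mechanically against these numbers. The data is copied from the table; the check itself is a sketch, not part of our reporting pipeline:

```typescript
// One-shot success target for a good template is 70%+ (see above).
// Metric values below are taken directly from the table.
interface TemplateMetrics {
  name: string;
  usesPerWeek: number;
  avgRating: number;
  oneShotRate: number; // fraction, 0..1
}

const metrics: TemplateMetrics[] = [
  { name: "NestJS Service",      usesPerWeek: 8.2,  avgRating: 4.3, oneShotRate: 0.78 },
  { name: "Flutter Widget Test", usesPerWeek: 5.1,  avgRating: 3.8, oneShotRate: 0.62 },
  { name: "PR Review",           usesPerWeek: 12.4, avgRating: 4.1, oneShotRate: 0.85 },
  { name: "SQL Optimization",    usesPerWeek: 3.6,  avgRating: 4.0, oneShotRate: 0.71 },
  { name: "Incident RCA",        usesPerWeek: 1.8,  avgRating: 3.9, oneShotRate: 0.65 },
];

function belowOneShotTarget(ms: TemplateMetrics[], target = 0.7): string[] {
  return ms.filter((m) => m.oneShotRate < target).map((m) => m.name);
}

const needsIteration = belowOneShotTarget(metrics);
// needsIteration === ["Flutter Widget Test", "Incident RCA"]
```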

The PR review template has the highest one-shot success rate because its output is advisory (a list of issues to consider), not generative (code that must compile and work). Generative templates inherently have lower one-shot rates because the output has more constraints to satisfy.

## Conclusion

Prompt engineering for software teams is not about crafting the perfect prompt. It is about building a shared library of good-enough prompts that consistently produce useful output, and improving them over time based on measured feedback.

The playbook we have built at Stripe Systems is not complex. It is six patterns, a shared repository, a rating system, and a monthly review process. The investment is small, maybe 2-3 hours per month for maintenance, and the return is measured in consistent output quality, faster onboarding, and shared learning across the team.

If your team is using LLMs without a shared prompt library, you are leaving value on the table. Not because individual engineers cannot write good prompts, but because their discoveries die with their chat history instead of becoming team knowledge.

Start with the six patterns in this post, adapt them to your tech stack, and iterate from there. The templates are meant to be modified, not followed rigidly. The structure matters more than the specific wording โ€” as long as you provide system context, inject relevant code, specify the task clearly, define the output format, and set constraints, the output quality will be consistently higher than unstructured prompting.
