Stripe Systems
Engineering Culture · February 5, 2026 · 19 min read

The AI-Augmented SDLC: How We've Embedded AI at Every Phase — From Requirements to Deployment

By Stripe Systems Engineering

The phrase "AI-augmented SDLC" gets thrown around loosely. Vendors pitch it as "AI writes your code." That is not what it means in practice. What it actually means: at every phase of the development lifecycle — requirements through deployment — there are specific, bounded tasks where an LLM can reduce time, catch errors, or generate useful first drafts that a human then refines.

This post walks through each SDLC phase, describes exactly how we use AI tools at Stripe Systems, what works, what does not, and how we measure the impact. We also include a detailed case study from a real project where AI was deliberately used at every phase.

The operating principle behind everything that follows: AI suggests, human decides. Never the other way around.

Phase 1: Requirements

Requirements analysis is where projects succeed or fail, and it is where AI has a surprisingly useful role — not in writing requirements, but in stress-testing them.

What We Do

When a product manager or client delivers a requirements document, the assigned engineer runs it through a structured AI review before estimation:

Review these software requirements for completeness.
Identify:
1. Ambiguous statements that could be interpreted multiple ways
2. Missing error handling requirements
3. Edge cases not addressed
4. Implicit assumptions about user behavior
5. Missing non-functional requirements (performance, security, availability)
6. Contradictions between requirements

Requirements:
[paste full requirements document]

The AI typically identifies 5-15 items worth discussing, of which 3-5 are genuinely useful catches that would otherwise have surfaced only during development or testing.

Acceptance Criteria Generation

After requirements are clarified, we use AI to generate initial acceptance criteria:

Given this user story, generate acceptance criteria in Given/When/Then format:

User Story: As a warehouse manager, I can generate a stock
discrepancy report comparing physical count against system
inventory, so that I can identify and investigate mismatches.

Consider: happy path, error cases, boundary conditions,
data validation, and permissions.

The AI generates 8-12 acceptance criteria. The PM and engineer review them, keep about 70% as-is, modify 20%, and add a further ~10% that require domain knowledge the AI does not have.

Measured Impact

  • Requirements review time: reduced from 2 hours to 45 minutes (engineer still reviews, but starts from AI-identified issues rather than a blank slate)
  • Requirement gaps found before development: increased by approximately 30% (measured by tracking change requests during development — fewer mid-sprint scope clarifications)
  • Acceptance criteria coverage: AI-generated criteria catch error handling and boundary conditions that PMs frequently omit

Phase 2: Design

Design is the phase where AI's role is most nuanced. It is useful as a critique partner, not as a decision maker.

Architecture Review

When an engineer drafts a design document, we use AI for a structured critique:

Review this system design document. Evaluate against these
quality attributes:
- Scalability: can it handle 10x current load?
- Reliability: what are the single points of failure?
- Maintainability: are the boundaries between components clean?
- Operability: can this be deployed, monitored, and debugged?

Provide specific concerns, not general advice. For each concern,
explain the failure scenario and suggest a mitigation.

Design Document:
[paste design doc]

This does not replace a design review meeting. What it does is surface obvious issues before the meeting, so the review discussion focuses on the genuinely hard tradeoffs rather than "you forgot to mention what happens when the database is unavailable."

ADR Drafting

Architecture Decision Records document why a particular approach was chosen. AI generates the initial draft:

Draft an ADR for the following decision:

Context: We need to implement real-time notifications for
our order tracking system. Current options evaluated:
WebSockets, Server-Sent Events, and polling.

Decision: Server-Sent Events (SSE)

Write an ADR covering: context, decision, consequences
(positive and negative), and alternatives considered with
reasons for rejection.

The engineer edits for accuracy — AI gets the generic tradeoffs right but misses project-specific constraints (e.g., "our load balancer does not support sticky sessions, which affects WebSocket scaling").

Sequence Diagram Generation

We describe interactions in natural language and ask AI to generate PlantUML or Mermaid diagrams:

Generate a Mermaid sequence diagram for this flow:

1. Mobile app sends order placement request to API Gateway
2. API Gateway validates JWT and forwards to Order Service
3. Order Service creates order in database
4. Order Service publishes OrderCreated event to message queue
5. Inventory Service consumes event, reserves stock
6. Payment Service consumes event, initiates payment capture
7. If payment succeeds, Notification Service sends confirmation
8. If payment fails, Inventory Service releases stock reservation

The output requires minor formatting adjustments but is structurally correct about 85% of the time. This saves 20-30 minutes per diagram versus manual construction.
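For the eight-step flow above, the kind of diagram we end up with after light edits looks roughly like this (a hand-reconstructed sketch, not actual tool output; participant names are our own shorthand):

```mermaid
sequenceDiagram
    participant App as Mobile App
    participant GW as API Gateway
    participant OS as Order Service
    participant MQ as Message Queue
    participant IS as Inventory Service
    participant PS as Payment Service
    participant NS as Notification Service

    App->>GW: Place order
    GW->>OS: Validate JWT, forward request
    OS->>OS: Create order in database
    OS->>MQ: Publish OrderCreated
    MQ->>IS: OrderCreated
    IS->>IS: Reserve stock
    MQ->>PS: OrderCreated
    PS->>PS: Initiate payment capture
    alt payment succeeds
        PS->>NS: Payment succeeded
        NS->>App: Send confirmation
    else payment fails
        PS->>IS: Payment failed
        IS->>IS: Release stock reservation
    end
```

The success/failure branching (steps 7-8) is exactly the part AI tends to oversimplify, so it is worth checking the `alt`/`else` block by hand.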

Measured Impact

  • Design review meeting efficiency: improved (fewer obvious issues raised in meetings)
  • ADR creation time: reduced from 1.5 hours to 30 minutes
  • Diagram creation time: reduced from 30 minutes to 10 minutes
  • Design quality: no measurable change (AI does not improve design quality, it improves the documentation of design decisions)

Phase 3: Development

This is the phase most people associate with "AI in software development," and it is where the tools are most mature.

Code Completion and Generation

GitHub Copilot provides inline suggestions as engineers write code. The patterns where it is most effective:

  • CRUD operations: Given an entity definition, Copilot generates the repository, service, and controller layers with high accuracy.
  • API integration: When importing a well-known library (Stripe API, AWS SDK, Razorpay), Copilot suggests correct initialization, authentication, and common operations.
  • Data transformation: Mapping between DTOs, entities, and API responses — Copilot handles these with minimal correction.
  • Standard algorithms: Sorting, filtering, pagination, search — Copilot generates correct implementations for common patterns.

Where it consistently fails: any code that requires understanding your specific business rules, domain model, or architectural constraints.
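To make the "standard algorithms" category concrete, a pagination helper like the one below is the kind of pattern a Copilot-style completion reliably gets right from a signature and a comment. This is an illustrative sketch, not our production code; the human's job is checking the boundary behavior.

```typescript
interface Page<T> {
  items: T[];
  page: number; // 1-based page index, clamped into range
  pageSize: number;
  totalItems: number;
  totalPages: number;
}

// Slice an already-filtered array into a page. The off-by-one decisions
// (1-based pages, clamping out-of-range requests) are where review focuses.
function paginate<T>(items: T[], page: number, pageSize: number): Page<T> {
  const totalItems = items.length;
  const totalPages = Math.max(1, Math.ceil(totalItems / pageSize));
  const clamped = Math.min(Math.max(1, page), totalPages);
  const start = (clamped - 1) * pageSize;
  return {
    items: items.slice(start, start + pageSize),
    page: clamped,
    pageSize,
    totalItems,
    totalPages,
  };
}
```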

Refactoring Suggestions

When an engineer identifies code that needs refactoring, they use AI to explore approaches:

Refactor this function to separate concerns. Currently it:
1. Validates input
2. Queries the database
3. Applies business rules
4. Formats the response
5. Sends a notification

Break it into single-responsibility functions following
the NestJS service pattern. Preserve all behavior.

[paste function]

The AI generates a restructured version that the engineer evaluates, tests, and adjusts. This is faster than manually extracting methods because the AI handles the mechanical work of splitting the function, adjusting parameters, and maintaining the call chain.
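A minimal sketch of what the "after" looks like for the five responsibilities listed in the prompt. All names here are hypothetical, and the I/O is simplified to synchronous callbacks for readability; a real NestJS service would use injected async dependencies.

```typescript
interface Order { id: string; amountPaise: number }

// 1. Validate input
function validateOrderId(id: string): void {
  if (!/^ord_[a-z0-9]+$/.test(id)) throw new Error(`invalid order id: ${id}`);
}

// 3. Apply business rules (the database query in step 2 is injected below)
function applyBusinessRules(order: Order): Order {
  if (order.amountPaise < 100) throw new Error("amount below minimum");
  return order;
}

// 4. Format the response
function formatResponse(order: Order): { orderId: string; amountInr: number } {
  return { orderId: order.id, amountInr: order.amountPaise / 100 };
}

// The orchestrator preserves the original behavior and call order:
// validate -> query -> rules -> format -> notify.
function getOrder(
  id: string,
  fetchOrder: (id: string) => Order,   // step 2, injected
  notify: (o: Order) => void,          // step 5, injected
): { orderId: string; amountInr: number } {
  validateOrderId(id);
  const order = applyBusinessRules(fetchOrder(id));
  const response = formatResponse(order);
  notify(order);
  return response;
}
```

The mechanical part the AI handles well is exactly this: extracting the functions and re-threading the parameters without changing the call order.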

Measured Impact

  • Code writing speed: approximately 15-20% faster (measured by time from empty file to passing tests for standard modules)
  • Boilerplate reduction: ~60% of boilerplate code is AI-generated with minor edits
  • Complex logic: no measurable speed improvement (AI suggestions are mostly rejected in complex business logic sections)

Phase 4: Testing

Testing is where AI provides the highest ROI relative to effort. The reason: test code follows predictable patterns, and the "creative" part (identifying what to test) is where human judgment still dominates.

Test Case Generation from Requirements

We maintain a lightweight requirements traceability matrix. AI generates test cases from acceptance criteria:

Given these acceptance criteria for the order placement feature,
generate test cases in this format:

Test ID | Description | Preconditions | Steps | Expected Result | Priority

Acceptance Criteria:
1. Given a valid cart with items in stock, when the user places
   an order, then an order is created with status PENDING
2. Given a cart with an out-of-stock item, when the user places
   an order, then the order is rejected with a clear error message
3. [additional criteria...]

Include positive, negative, boundary, and error handling test cases.

This generates the test plan structure. The QA engineer adds domain-specific scenarios that the AI misses — typically concurrency scenarios, integration edge cases, and business rule combinations.

Test Data Generation

AI generates realistic test data that respects constraints:

Generate 20 test records for an Indian e-commerce user table with:
- name: realistic Indian names
- email: valid format, different providers
- phone: valid Indian mobile numbers (10 digits, starting with 6-9)
- address: realistic Indian addresses with PIN codes
- created_at: dates within the last 2 years
- Include edge cases: very long names, special characters,
  addresses with unicode characters

Output as a TypeScript array of objects.

This replaces manually crafting test data, which is tedious and often lacks the variety needed to catch formatting and validation issues.
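In practice the AI emits a literal array, but the constraints it must respect can be captured in a small deterministic generator. This sketch is illustrative (the field set and names are our own, not a real schema) and uses a seeded PRNG so test data is reproducible:

```typescript
interface TestUser {
  name: string;
  email: string;
  phone: string;   // 10 digits, first digit 6-9 (Indian mobile format)
  pinCode: string; // 6-digit Indian PIN code
}

function makeTestUsers(count: number, seed = 1): TestUser[] {
  const firstNames = ["Aarav", "Priya", "Rohan", "Ananya", "Kavya"];
  const providers = ["gmail.com", "yahoo.in", "outlook.com"];
  // Lehmer PRNG: deterministic for a given seed, so fixtures are stable
  let state = seed;
  const next = () => (state = (state * 48271) % 2147483647);
  return Array.from({ length: count }, (_, i) => {
    const name = firstNames[next() % firstNames.length];
    return {
      name,
      email: `${name.toLowerCase()}${i}@${providers[next() % providers.length]}`,
      // first digit 6-9, then nine pseudo-random digits
      phone: `${6 + (next() % 4)}${String(next()).padStart(9, "0").slice(0, 9)}`,
      pinCode: String(110001 + (next() % 700000)),
    };
  });
}
```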

Flaky Test Analysis

When a test fails intermittently, we provide the test code, recent failure logs, and timing information to AI:

This test fails approximately 15% of the time in CI but
passes consistently locally. Analyze for potential flakiness causes:

Test code: [paste]
Recent failure logs (3 runs): [paste]
CI environment: GitHub Actions, Ubuntu 22.04, Node 20
Local environment: macOS, Node 20

Common flakiness causes to check: timing dependencies, shared state,
network calls, file system operations, date/time sensitivity,
random ordering, resource contention.

AI correctly identifies the flakiness cause about 50% of the time — usually timing issues, shared state between tests, or date-dependent assertions. When it misses, the structured analysis still helps the engineer narrow down the investigation.
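The most common fix we apply for the date-sensitivity class of flakiness is injecting the clock rather than reading it inside the code under test. A minimal sketch (hypothetical function, not from a real module):

```typescript
// A function that accepts "now" as a parameter can be tested at any fixed
// instant; one that calls new Date() internally passes or fails depending
// on when CI happens to run it.
function isCouponExpired(expiresAt: Date, now: Date = new Date()): boolean {
  return expiresAt.getTime() <= now.getTime();
}
```

Production callers use the default argument; tests pin `now` and become deterministic.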

Measured Impact

  • Test case coverage: AI-generated test cases find approximately 15% more edge cases than manual test planning alone
  • Test writing time: reduced by 35-40% (AI generates the scaffold, engineer writes assertions)
  • Flaky test resolution: average resolution time reduced from 2 hours to 1.2 hours
  • Test data preparation: reduced from 30 minutes to 5 minutes per test suite

Phase 5: Code Review

Code review is a known bottleneck: senior engineers review PRs for multiple team members, and their time is a scarce resource. AI helps by handling the mechanical portion of review.

AI Pre-Review

Before a human reviewer sees a PR, an automated pipeline runs an AI review:

  1. Extract the diff and identify changed files
  2. Gather context: related files, architecture documentation, coding standards
  3. Submit to LLM with a structured review prompt
  4. Post findings as GitHub review comments

The AI checks for: style consistency, common bug patterns (missing null checks, unhandled promise rejections, missing error cases in switch statements), test coverage for changed code, and documentation completeness.
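Step 1 of the pipeline is plain string processing. A minimal sketch of extracting changed file paths from a unified diff is below; the actual LLM call in step 3 goes through whatever client the team has approved, so it is elided here:

```typescript
// Pull changed file paths out of `git diff` output by reading the
// "+++ b/<path>" headers. Deleted files appear as "+++ /dev/null"
// and are skipped.
function changedFiles(unifiedDiff: string): string[] {
  const files: string[] = [];
  for (const line of unifiedDiff.split("\n")) {
    if (line.startsWith("+++ b/")) files.push(line.slice("+++ b/".length));
  }
  return files;
}
```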

PR Summarization

AI generates a structured summary of the PR for the reviewer:

## Summary
This PR adds a stock discrepancy report generator to the inventory module.

## Key Changes
- New `DiscrepancyReportService` with methods for comparing
  physical count against system inventory
- New `GET /api/inventory/discrepancy-report` endpoint
- Database migration adding `physical_count_records` table
- 18 new test cases covering matching, mismatches, and edge cases

## Risk Areas
- The discrepancy calculation uses floating-point arithmetic for
  quantities — consider using integer units instead
- No rate limiting on the report generation endpoint, which runs
  a heavy database query

Human reviewers consistently report that this summary helps them understand the intent of the PR before diving into the diff, reducing review time.
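The floating-point concern in the sample summary is representative of the risks worth surfacing early. In IEEE-754 doubles, `0.1 + 0.2` is not exactly `0.3`, so a discrepancy check on float quantities can report spurious mismatches; storing quantities in integer base units makes equality exact. A sketch (hypothetical functions, illustrating the pattern rather than the client's code):

```typescript
// Float quantities: subtraction accumulates representation error.
function discrepancyFloat(physicalKg: number, systemKg: number): number {
  return physicalKg - systemKg;
}

// Integer base units (e.g. grams instead of kilograms): exact arithmetic,
// so "discrepancy === 0" means what it says.
function discrepancyGrams(physicalG: number, systemG: number): number {
  return physicalG - systemG;
}
```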

Measured Impact

  • PR review cycle time: reduced by 40% (AI handles mechanical checks, human focuses on logic and architecture)
  • Defects found in review: increased by 25% (AI catches patterns human reviewers miss when fatigued)
  • Senior engineer review time per PR: reduced from 45 minutes to 20 minutes
  • False positive rate: 8% after tuning (started at 22%, reduced through prompt refinement and feedback loops)

Phase 6: Deployment

AI's role in deployment is narrower but still valuable: summarizing changes, assisting with rollback decisions, and generating release notes.

Deployment Summary Generation

Before each deployment, AI generates a summary from the git log:

Given these commits since the last release, generate a deployment
summary covering:
1. New features
2. Bug fixes
3. Database migrations (flag these prominently)
4. Configuration changes
5. Dependency updates
6. Risk assessment (high/medium/low) with reasoning

Commits:
[paste git log --oneline since last tag]

This summary goes into the deployment ticket and is referenced during the deployment call. It ensures everyone involved understands what is being deployed.

Rollback Decision Support

When a deployment shows problems, the on-call engineer feeds error logs into AI for analysis:

We deployed version 2.14.0 thirty minutes ago. Error rates have
increased from 0.1% to 2.3%. Here are the error logs from the
last 15 minutes:

[paste logs]

Here are the changes in this deployment:
[paste deployment summary]

Analyze: which change is most likely causing the errors?
Should we roll back the entire deployment or can we
feature-flag a specific change?

This is advisory only — the on-call engineer makes the rollback decision. But having a structured analysis of likely causes reduces the panic-driven "roll back everything" reaction and sometimes enables a targeted fix instead.

Measured Impact

  • Deployment summary preparation: reduced from 30 minutes to 5 minutes
  • Rollback decision time: anecdotally faster (insufficient data for statistical measurement)
  • Release note accuracy: improved (AI catches changes that engineers forget to mention)

Phase 7: Monitoring and Incident Response

Post-deployment, AI assists with log analysis and root cause analysis (RCA).

Pattern Matching Across Services

When an issue spans multiple services, AI helps correlate logs:

These are error logs from three services in the last 30 minutes,
all related to order processing failures. Correlate by
request ID and identify the root cause:

Order Service logs: [paste]
Payment Service logs: [paste]
Inventory Service logs: [paste]

Timeline the events and identify where the chain breaks.

AI generates a timeline that correlates events across services by request ID, identifying the first failure point. This saves 15-20 minutes of manual log correlation during incidents.
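The correlation itself is simple enough to sketch. Given structured log entries, the "where does the chain break" question reduces to sorting one request's events by timestamp and finding the first error (field names here are illustrative, not our actual log schema):

```typescript
interface LogEntry {
  service: string;
  requestId: string;
  timestamp: string; // ISO-8601, so lexicographic order is chronological
  level: "info" | "error";
  message: string;
}

// Merge logs from several services into one per-request timeline and
// return the earliest error entry for a given request.
function firstFailure(logs: LogEntry[], requestId: string): LogEntry | undefined {
  return logs
    .filter((e) => e.requestId === requestId)
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp))
    .find((e) => e.level === "error");
}
```

What the AI adds on top of this mechanical step is the narrative: explaining why the first failure plausibly caused the downstream ones.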

Anomaly Narrative Generation

When monitoring dashboards show anomalies, AI generates human-readable narratives for the team:

Our monitoring shows these anomalies in the last hour:
- API response time p99: increased from 200ms to 850ms
- Database connection pool: utilization went from 40% to 92%
- Memory usage on service pods: increased from 60% to 85%
- Error rate: increased from 0.1% to 0.8%

No deployments in the last 6 hours. Generate a narrative
explaining the likely relationship between these metrics
and possible root causes.

AI produces: "The database connection pool saturation (40% → 92%) is likely the primary issue. High pool utilization increases query wait times, which cascades into API response time increases. The memory increase may be caused by requests queuing in memory while waiting for database connections. Possible causes: a slow query holding connections, a connection leak, or an external load increase. Check: recent query performance, connection pool metrics, and traffic volume."

This is exactly the kind of structured reasoning that is useful during an incident, and it takes the AI seconds versus the minutes it takes a human to write it out.

Phase 8: Documentation

Documentation is the perennially neglected task in software development. AI reduces the friction enough that it actually gets done.

API Documentation from Code

We generate OpenAPI documentation from NestJS controllers and then use AI to add descriptions, examples, and error documentation:

Given this NestJS controller with Swagger decorators, generate
comprehensive API documentation in markdown including:
- Endpoint description and purpose
- Request parameters with types and validation rules
- Request body examples
- Response body examples for success and each error case
- Authentication requirements
- Rate limiting information
- curl examples

Controller: [paste]

The AI generates documentation that is structurally complete. The engineer adds business context ("this endpoint is typically called after the user confirms their cart") and corrects any response format inaccuracies.

Changelog Generation

At the end of each sprint, AI generates the changelog:

Generate a changelog from these PR descriptions.
Group by: Features, Bug Fixes, Performance, Infrastructure.
Use clear, non-technical language suitable for stakeholders.

PR descriptions:
[paste consolidated PR descriptions]

This saves the tech lead 30-45 minutes per sprint and produces more consistent formatting.

Measured Impact

  • API documentation creation: reduced from 3 hours to 45 minutes per module
  • Changelog generation: reduced from 45 minutes to 10 minutes per sprint
  • Documentation coverage: increased (lower friction means more modules get documented)

Risks and Mitigations

Honest adoption requires acknowledging the risks:

Over-reliance. Engineers who rely heavily on AI completion may lose proficiency in writing code from scratch. We mitigate this with periodic "no-AI" coding sessions and by ensuring junior engineers complete their first three months without AI tools, building foundational skills first.

Skill atrophy. If AI writes all the tests, engineers may lose the ability to identify what needs testing. We mitigate this by requiring engineers to specify test scenarios before AI generates the test code.

Hallucinated requirements. AI-generated acceptance criteria sometimes include requirements that sound reasonable but are not actually needed. Every AI-generated requirement goes through PM review before entering the backlog.

False confidence in AI-generated tests. Tests that pass are not necessarily correct. An AI might generate a test that checks expect(result).toBeDefined() instead of checking the actual business-critical property. We review AI-generated tests with the same rigor as AI-generated code.
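The weak-assertion failure mode is easy to demonstrate. In the sketch below (a hypothetical discount calculation), the weak check passes against a buggy implementation while the meaningful check catches it:

```typescript
function applyDiscount(totalPaise: number, percent: number): number {
  return Math.round(totalPaise * (1 - percent / 100));
}

// Weak: the moral equivalent of expect(result).toBeDefined().
// Passes for any non-undefined value, including a wrong one.
function weakCheck(result: unknown): boolean {
  return result !== undefined;
}

// Meaningful: pins the business property that matters,
// here 20% off 10000 paise.
function meaningfulCheck(result: number): boolean {
  return result === 8000;
}
```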

Context leakage. AI tools that send code to external APIs create data privacy risks. We enforce strict governance: enterprise-tier tools only, no proprietary code in public interfaces, quarterly audits.

The Human-in-the-Loop Principle

Everything described in this post follows a single rule: AI suggests, human decides.

AI generates acceptance criteria — the PM approves them. AI generates code — the engineer reviews it. AI flags a potential bug in review — the reviewer evaluates whether it is real. AI suggests a rollback — the on-call engineer makes the call.

This is not a philosophical position. It is a practical one: AI tools are not reliable enough to make decisions autonomously. Their error rate is too high and their understanding of context is too shallow. They are excellent research assistants and draft generators. They are not decision makers.

Case Study: E-Commerce Checkout Redesign

To ground this in practice, here is a detailed breakdown of a real project at Stripe Systems where AI was deliberately used at every SDLC phase.

Project Context

Project: Complete redesign of the checkout flow for an e-commerce client. The existing checkout was a single-page form with high cart abandonment (68%). The redesign introduced a multi-step checkout with address validation, real-time shipping calculation, coupon application, and multiple payment methods (UPI, card, net banking, wallet).

Team: 4 engineers (2 backend, 1 frontend, 1 full-stack), 1 PM, 1 QA engineer.

Timeline: 12-week estimate, 10-week actual delivery.

Phase-by-Phase Breakdown

Requirements (Week 1)

The PM delivered a requirements document covering the new checkout flow. AI review identified three edge cases the PM had missed:

  1. What happens when a coupon expires between cart creation and checkout completion (user adds coupon, takes 20 minutes to enter payment details, coupon has a 15-minute expiry)?
  2. How does the system handle address validation failures for addresses in newly created PIN codes not yet in the validation database?
  3. What is the behavior when the user's selected payment method (e.g., a specific bank's net banking) is temporarily unavailable?

These were added to the requirements before development started. Without AI review, edge case #1 would likely have been discovered during QA testing (week 8-9), requiring a design change late in the project.

AI also generated 34 acceptance criteria, of which 24 were kept as-is, 7 were modified, and 3 were discarded as unnecessary.

Time spent on requirements: 1.5 weeks. Estimated without AI: 2 weeks.

Design (Week 2)

The full-stack engineer drafted the system design document. AI critique identified two architectural concerns:

  1. The proposed design had the frontend calling the payment gateway directly. AI noted this creates a CORS dependency and makes it harder to add payment orchestration logic later. Recommendation: route through backend. The team agreed and adjusted the design.
  2. The shipping calculation was designed as a synchronous call during checkout. AI suggested making it asynchronous with a cached result, since shipping rates change infrequently. The team adopted this, which simplified the checkout flow.

AI generated Mermaid sequence diagrams for the checkout flow (3 diagrams covering happy path, payment failure, and address validation failure). The engineer adjusted the error handling flows, which AI had oversimplified.

Design time: 1 week. Estimated without AI: 1.5 weeks (the AI-identified architecture issues would have caused rework later).

Development (Weeks 3-7)

AI contribution during development varied by task type:

| Task Type | AI Contribution | Human Contribution |
|---|---|---|
| API scaffolding (NestJS) | ~70% of code generated | Engineer adjusted validation, error handling |
| Checkout UI components (React) | ~50% of component structure | Engineer wrote all state management, UX logic |
| Payment integration (Razorpay) | ~60% of integration code | Engineer wrote error handling, retry logic |
| Address validation service | ~30% (uncommon API) | Engineer wrote most of the validation logic |
| Database migrations | ~80% (straightforward schema) | Engineer adjusted indexes and constraints |
| Business logic (pricing, coupons) | ~10% | Engineer wrote almost everything |

Total development time was comparable to the non-AI estimate for this phase (5 weeks vs. estimated 5 weeks). The savings came from faster scaffolding and CRUD operations, but were offset by the time spent on complex business logic, where AI was not helpful.

Testing (Weeks 7-8.5)

This is where AI had the largest impact on the project timeline.

The QA engineer used AI to generate test cases from the acceptance criteria. AI generated 156 test cases total. After review:

  • 104 (67%) were usable as-is or with minor modification
  • 19 (12%) needed significant human correction (wrong expected behavior, missing preconditions)
  • 14 (9%) were duplicates or trivially similar to other tests
  • 19 (12%) were discarded as irrelevant or testing implementation details rather than behavior

The QA engineer added 38 additional test cases covering scenarios AI missed: concurrency issues (two users applying the same limited-use coupon simultaneously), integration edge cases (payment gateway timeout during 3D Secure authentication), and business rule combinations (coupon + loyalty points + partial payment with wallet).

Engineers used AI for unit test generation:

// Example: AI-generated test for coupon validation service
describe('CouponValidationService', () => {
  it('should reject expired coupon gracefully', async () => {
    const expiredCoupon = createMockCoupon({
      code: 'SAVE20',
      expiresAt: subHours(new Date(), 1),
      discountPercent: 20,
    });
    mockCouponRepo.findByCode.mockResolvedValue(expiredCoupon);

    const result = await service.validateCoupon('SAVE20', mockCart);

    expect(result.valid).toBe(false);
    expect(result.reason).toBe('COUPON_EXPIRED');
    expect(result.expiredAt).toEqual(expiredCoupon.expiresAt);
  });

  // Engineer added this test — AI missed the race condition scenario
  it('should handle coupon expiring between validation and application', async () => {
    const coupon = createMockCoupon({
      code: 'FLASH10',
      expiresAt: addMinutes(new Date(), 1),
      discountPercent: 10,
    });
    mockCouponRepo.findByCode.mockResolvedValue(coupon);

    // Simulate time passing during checkout. Modern Jest fake timers
    // (jest.useFakeTimers() in the suite setup) also fake Date, so this
    // advances the clock past the coupon expiry.
    jest.advanceTimersByTime(120_000); // 2 minutes

    await expect(service.applyCoupon('FLASH10', mockCart))
      .rejects.toThrow(CouponExpiredDuringCheckoutError);
  });
});

Testing time: 1.5 weeks. Estimated without AI: 2.5 weeks. The time saving came primarily from test case generation and test data preparation.

Code Review (Throughout)

AI pre-review ran on every PR throughout the project. Key statistics:

  • Total PRs: 47
  • AI review comments generated: 183
  • Useful comments (rated 👍 by developers): 134 (73%)
  • False positives (rated 👎): 15 (8%)
  • Neutral/obvious (no reaction): 34 (19%)

The most valuable AI review finding: in a PR implementing the payment retry logic, the AI flagged that the retry delay was using setTimeout without clearing the timeout on component unmount (React). This would have caused a state update on an unmounted component if the user navigated away during a retry. The bug was subtle and would have been difficult to reproduce in testing.

Senior engineer review time per PR averaged 18 minutes, down from 30 minutes in previous projects of similar complexity. The senior engineers consistently reported that AI pre-review eliminated the need to comment on style, formatting, and common patterns, allowing them to focus on business logic and architecture.

Deployment (Week 9)

AI generated the deployment checklist and rollback plan from the project documentation:

## Deployment Checklist
1. Run database migrations (3 new tables, 2 altered tables)
   - Rollback: reverse migrations available
2. Deploy backend services (Order, Payment, Checkout)
   - Rollback: revert to previous container image
3. Deploy frontend build
   - Rollback: revert to previous CDN build
4. Enable feature flag for new checkout flow (10% traffic)
5. Monitor error rates, conversion rate, and checkout completion
   rate for 30 minutes
6. If metrics are stable, increase to 50%, then 100%

## Risk Areas
- Database migration adds a NOT NULL column with a default
  value — verify backfill completes before deployment
- Payment gateway integration uses a new API version —
  verify sandbox testing passed

The deployment was executed with a gradual rollout. AI-generated monitoring queries helped the team track checkout conversion rate in real-time during the rollout.

Documentation (Week 10)

AI generated API documentation for all 12 new endpoints, the checkout flow architecture document, and the operations runbook. The team spent one week on review and refinement — estimated two weeks without AI.

Project Summary

| Metric | With AI | Without AI (est.) | Delta |
|---|---|---|---|
| Total duration | 10 weeks | 12 weeks | -2 weeks |
| Test cases generated | 194 total | ~120 (manual) | +62% |
| Defects found in review | 134 AI + human | ~90 (human only) | +49% |
| Documentation coverage | Complete | Partial (time pressure) | Significant |
| Requirements gaps found early | 3 major | 0-1 (typically) | Notable |
The two-week time saving came primarily from testing (1 week saved) and documentation (0.5 weeks saved), with smaller contributions from requirements (0.5 weeks) and scaffolding (scattered throughout development). The core development work — business logic, integration, and complex UI — took roughly the same time with or without AI.

Practical Recommendations

For teams considering a similar approach:

  1. Start with testing and documentation. These are the lowest-risk, highest-ROI areas for AI adoption. The output is easy to verify and the time savings are immediate.

  2. Establish governance before adoption. Decide which tools are approved, what data can be sent to which services, and document it. Retroactively adding governance is painful.

  3. Measure before and after. Without baseline metrics, you cannot tell whether AI is helping. Track sprint velocity, cycle time, defect rates, and review times for at least two months before introducing AI tools.

  4. Train the team on prompt engineering. Bad prompts produce bad output. A 2-hour workshop on structured prompting pays for itself within a week.

  5. Maintain the human-in-the-loop. AI-generated output must be reviewed by a human before it enters the codebase, the requirements, or the production environment. This is not optional.

The AI-augmented SDLC is not about replacing human judgment. It is about reducing the time humans spend on tasks that do not require judgment, so they can spend more time on the tasks that do.
