Stripe Systems
Backend Development · January 18, 2026 · 18 min read

Event-Driven Architecture with Kafka, NestJS, and Outbox Pattern — A Production Walkthrough

Stripe Systems Engineering

Most backend systems start as synchronous request-response services. A client sends a request, the server processes it, and returns a result. This model is simple to reason about, easy to debug, and well-supported by every framework.

But it breaks down when services need to coordinate work without blocking each other, when you need to replay historical state changes, or when a downstream service being unavailable shouldn't prevent an upstream service from completing its job. That's when event-driven architecture earns its place.

This post is a production-focused walkthrough. We'll cover the theory briefly, then spend most of our time on implementation details — the outbox pattern, idempotent consumers, dead letter queues, and the operational concerns that matter at scale.

Event-Driven vs. Request-Driven — When Events Are the Right Abstraction

In a request-driven system, Service A calls Service B and waits for a response. The coupling is temporal (A blocks until B responds), behavioral (A knows B's API contract), and availability-dependent (if B is down, A fails).

In an event-driven system, Service A publishes an event describing what happened. Service B, C, and D consume it on their own schedule. The coupling shifts: A doesn't know who consumes its events, doesn't wait for processing, and doesn't fail if consumers are temporarily unavailable.

Temporal decoupling is the primary benefit. When Service A publishes an OrderPlaced event, it doesn't care whether the inventory service processes it in 50ms or 5 minutes. The event sits in the broker until consumers are ready.

Eventual consistency is the primary tradeoff. After publishing OrderPlaced, there's a window where the order exists but inventory hasn't been decremented. Your system must tolerate this. If your business logic requires immediate consistency — a bank transfer where both accounts must reflect the change atomically — events are the wrong abstraction for that specific operation.

Not everything should be an event. Queries that need a synchronous response, operations requiring strong consistency, and simple CRUD that doesn't trigger downstream work — these are better served by direct API calls. The decision should be driven by whether temporal decoupling provides real value, not by architectural fashion.

Apache Kafka Fundamentals

Kafka is a distributed commit log. Messages are appended to topics, which are split into partitions. Each partition is an ordered, immutable sequence of records. Ordering is guaranteed only within a single partition, not across partitions.

Consumer groups are Kafka's scaling mechanism. Each partition in a topic is assigned to exactly one consumer within a group. If you have 12 partitions and 4 consumers in a group, each consumer handles 3 partitions. If a consumer dies, Kafka rebalances — reassigning its partitions to the remaining consumers. This is why partition count sets the upper bound on parallelism: 12 partitions means at most 12 concurrent consumers in a single group.
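To make the arithmetic concrete, here is a small sketch of round-robin assignment. assignPartitions is a hypothetical helper, not a client API; real Kafka clients use pluggable assignors (range, round-robin, cooperative sticky), but the counting works out the same:

```typescript
// Sketch: how 12 partitions distribute over a consumer group.
// assignPartitions is illustrative only — brokers perform the real assignment.
function assignPartitions(
  partitionCount: number,
  consumerIds: string[],
): Map<string, number[]> {
  const assignment = new Map<string, number[]>();
  for (const id of consumerIds) assignment.set(id, []);
  for (let p = 0; p < partitionCount; p++) {
    // Round-robin: partition p goes to consumer p mod groupSize
    assignment.get(consumerIds[p % consumerIds.length])!.push(p);
  }
  return assignment;
}

const fourConsumers = assignPartitions(12, ['c1', 'c2', 'c3', 'c4']);
// Each consumer handles 3 partitions, e.g. c1 → [0, 4, 8]
const afterRebalance = assignPartitions(12, ['c1', 'c2', 'c3']);
// After one consumer dies and the group rebalances: 4 partitions each
```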

Offsets track each consumer's position in a partition. When a consumer processes a message, it commits the offset to Kafka (or to an external store). On restart, it resumes from the last committed offset. The choice between auto.commit and manual offset management has significant implications for at-least-once vs. at-most-once delivery semantics.

Retention is time-based or size-based. A 7-day retention policy means consumers have a week to process messages before they're deleted. For event sourcing workloads, you can set retention to infinite (retention.ms=-1), turning Kafka into a permanent event store.

Key configuration decisions you'll make early:

  • Partition count: Overprovisioning is safer than underprovisioning. You can increase partitions but never decrease them, and increasing them breaks key-based ordering guarantees for existing data.
  • Replication factor: 3 is standard for production. This tolerates 1 broker failure without data loss.
  • min.insync.replicas: Set to 2 with a replication factor of 3. Combined with acks=all on the producer, this ensures a write is acknowledged only after 2 replicas have it.
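As a sketch, here is a topic specification reflecting these choices. The object shape mirrors the input of kafkajs's admin.createTopics(); the topic name and 7-day retention are illustrative:

```typescript
// Sketch: topic spec encoding the guidance above (names are illustrative).
const ordersTopic = {
  topic: 'orders.events',
  numPartitions: 12,     // sets the upper bound on consumer parallelism
  replicationFactor: 3,  // tolerates one broker failure without data loss
  configEntries: [
    // With acks=all on the producer, writes are acknowledged by 2 replicas
    { name: 'min.insync.replicas', value: '2' },
    // 7-day time-based retention; set to '-1' for an event-sourcing topic
    { name: 'retention.ms', value: String(7 * 24 * 60 * 60 * 1000) },
  ],
};
// e.g. await admin.createTopics({ topics: [ordersTopic] });
```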

NestJS Kafka Integration

NestJS provides Kafka support through @nestjs/microservices backed by KafkaJS. Here's a production-ready module setup:

// kafka.module.ts
import { Module } from '@nestjs/common';
import { ClientsModule, Transport } from '@nestjs/microservices';

@Module({
  imports: [
    ClientsModule.register([
      {
        name: 'KAFKA_SERVICE',
        transport: Transport.KAFKA,
        options: {
          client: {
            clientId: 'order-service',
            brokers: process.env.KAFKA_BROKERS?.split(',') ?? ['localhost:9092'],
            ssl: process.env.KAFKA_SSL === 'true',
            sasl: process.env.KAFKA_SASL_USERNAME
              ? {
                  mechanism: 'scram-sha-256',
                  username: process.env.KAFKA_SASL_USERNAME,
                  password: process.env.KAFKA_SASL_PASSWORD ?? '',
                }
              : undefined,
          },
          producer: {
            allowAutoTopicCreation: false,
            idempotent: true,
          },
          consumer: {
            groupId: 'order-service-group',
            sessionTimeout: 30000,
            heartbeatInterval: 10000,
            maxWaitTimeInMs: 100,
          },
        },
      },
    ]),
  ],
  exports: [ClientsModule],
})
export class KafkaModule {}

Note idempotent: true on the producer — this enables Kafka's idempotent producer, which deduplicates messages caused by retries at the broker level. The overhead is negligible, so enable it by default.

Consumer setup uses NestJS decorators:

// order-events.controller.ts
import { Controller } from '@nestjs/common';
import { EventPattern, Payload, Ctx, KafkaContext } from '@nestjs/microservices';

@Controller()
export class OrderEventsController {
  @EventPattern('orders.events')
  async handleOrderEvent(
    @Payload() event: OrderEvent,
    @Ctx() context: KafkaContext,
  ) {
    const { offset } = context.getMessage();
    const partition = context.getPartition();
    const topic = context.getTopic();

    try {
      await this.processEvent(event);
      // Manual commit after successful processing
      // (requires disabling auto-commit in the consumer run options)
      await context.getConsumer().commitOffsets([
        { topic, partition, offset: (Number(offset) + 1).toString() },
      ]);
    } catch (error) {
      // Don't commit — message will be redelivered
      throw error;
    }
  }
}

Event Schema Design

Poorly designed event schemas create coupling that's worse than direct API calls — because the coupling is implicit and discovered at runtime.

The CloudEvents specification provides a standard envelope:

{
  "specversion": "1.0",
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "source": "/services/order-service",
  "type": "com.example.order.placed",
  "datacontenttype": "application/json",
  "time": "2025-08-30T14:30:00Z",
  "subject": "order-12345",
  "data": {
    "orderId": "order-12345",
    "customerId": "cust-789",
    "items": [
      { "sku": "WIDGET-001", "quantity": 2, "unitPrice": 29.99 }
    ],
    "totalAmount": 59.98,
    "currency": "USD"
  }
}

The metadata fields (id, source, type, time) are essential for routing, deduplication, and debugging. The data field contains your domain-specific payload.
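A minimal envelope builder illustrating these fields. makeCloudEvent is a hypothetical helper, not part of any CloudEvents SDK:

```typescript
import { randomUUID } from 'node:crypto';

// Sketch: CloudEvents 1.0 envelope shape (field names per the spec).
interface CloudEvent<T> {
  specversion: '1.0';
  id: string;
  source: string;
  type: string;
  datacontenttype: string;
  time: string;
  subject?: string;
  data: T;
}

function makeCloudEvent<T>(
  source: string,
  type: string,
  data: T,
  subject?: string,
): CloudEvent<T> {
  return {
    specversion: '1.0',
    id: randomUUID(), // unique per event — this is the deduplication key
    source,
    type,
    datacontenttype: 'application/json',
    time: new Date().toISOString(),
    subject,
    data,
  };
}

const evt = makeCloudEvent(
  '/services/order-service',
  'com.example.order.placed',
  { orderId: 'order-12345' },
  'order-12345',
);
```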

Schema evolution is where things get painful without a plan. A schema registry (Confluent Schema Registry or Apicurio) with Avro or Protobuf enforces compatibility rules:

  • Backward compatible: New schema can read data written by the old schema. You can add optional fields, but not required ones.
  • Forward compatible: Old schema can read data written by the new schema. You can remove optional fields.
  • Full compatibility: Both directions. The safest choice — only add or remove optional fields with defaults.

With Avro in a schema registry:

{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.example.orders",
  "fields": [
    { "name": "orderId", "type": "string" },
    { "name": "customerId", "type": "string" },
    { "name": "totalAmount", "type": "double" },
    { "name": "currency", "type": "string", "default": "USD" },
    { "name": "placedAt", "type": { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "metadata", "type": ["null", "string"], "default": null }
  ]
}

The metadata field was added later with a null default — backward compatible. Existing consumers ignore it; new consumers can read it.
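That rule can be encoded in a few lines. This is a deliberately simplified sketch (real schema registries implement the full Avro resolution algorithm); it covers only the "new field without a default breaks backward compatibility" case:

```typescript
// Sketch: simplified backward-compatibility check for Avro-style records.
interface AvroField { name: string; type: unknown; default?: unknown }
interface AvroRecord { name: string; fields: AvroField[] }

function isBackwardCompatible(oldSchema: AvroRecord, newSchema: AvroRecord): boolean {
  const oldNames = new Set(oldSchema.fields.map((f) => f.name));
  return newSchema.fields
    .filter((f) => !oldNames.has(f.name)) // fields added in the new schema...
    .every((f) => 'default' in f);        // ...must carry a default value
}

const v1: AvroRecord = {
  name: 'OrderPlaced',
  fields: [{ name: 'orderId', type: 'string' }],
};
const v2: AvroRecord = {
  name: 'OrderPlaced',
  fields: [
    { name: 'orderId', type: 'string' },
    { name: 'metadata', type: ['null', 'string'], default: null }, // optional: OK
  ],
};
const v3: AvroRecord = {
  name: 'OrderPlaced',
  fields: [
    { name: 'orderId', type: 'string' },
    { name: 'channel', type: 'string' }, // required, no default: breaks old data
  ],
};
```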

The Transactional Outbox Pattern

Here's the problem: your service needs to update a database row AND publish a Kafka event. These are two different systems. If the database write succeeds but the Kafka publish fails (network blip, broker down), your database says "order created" but no event was emitted. Downstream services never learn about the order.

You can't wrap them in a single transaction because Kafka isn't a relational database. You could try to publish first and write second, but that inverts the problem. This is the dual-write problem, and it has no solution without changing the approach.

The outbox pattern solves it by writing both the business data and the event to the same database in a single transaction:

CREATE TABLE outbox_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id VARCHAR(255) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,
    retry_count INT NOT NULL DEFAULT 0
);

-- PostgreSQL has no inline INDEX clause in CREATE TABLE; create it separately
CREATE INDEX idx_outbox_unpublished
    ON outbox_events (created_at)
    WHERE published_at IS NULL;

The partial index WHERE published_at IS NULL is critical — it ensures the polling query only scans unpublished rows, not the entire table.

Your service code becomes:

// order.service.ts
async createOrder(dto: CreateOrderDto): Promise<Order> {
  return this.dataSource.transaction(async (manager) => {
    // 1. Write business data
    const order = manager.create(Order, {
      customerId: dto.customerId,
      items: dto.items,
      totalAmount: dto.totalAmount,
      status: OrderStatus.PLACED,
    });
    await manager.save(order);

    // 2. Write outbox event in the SAME transaction
    const outboxEvent = manager.create(OutboxEvent, {
      aggregateType: 'Order',
      aggregateId: order.id,
      eventType: 'order.placed',
      payload: {
        specversion: '1.0',
        id: randomUUID(),
        source: '/services/order-service',
        type: 'com.example.order.placed',
        time: new Date().toISOString(),
        data: {
          orderId: order.id,
          customerId: order.customerId,
          items: order.items,
          totalAmount: order.totalAmount,
        },
      },
    });
    await manager.save(outboxEvent);

    return order;
  });
}

Both writes succeed or both fail. Atomicity is guaranteed by the database transaction. The event in the outbox table is now a promise that "this event will eventually be published to Kafka."

Polling Publisher — The Simple Relay

The polling publisher is a background process that periodically reads unpublished events from the outbox table and publishes them to Kafka:

// outbox-relay.service.ts
import { Inject, Injectable, Logger } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';
import { InjectRepository } from '@nestjs/typeorm';
import { Repository, IsNull } from 'typeorm';
import { ClientKafka } from '@nestjs/microservices';
import { lastValueFrom } from 'rxjs';

@Injectable()
export class OutboxRelayService {
  private readonly logger = new Logger(OutboxRelayService.name);
  private isProcessing = false;

  constructor(
    @InjectRepository(OutboxEvent)
    private readonly outboxRepo: Repository<OutboxEvent>,
    @Inject('KAFKA_SERVICE')
    private readonly kafkaClient: ClientKafka,
  ) {}

  @Cron(CronExpression.EVERY_5_SECONDS)
  async publishPendingEvents() {
    if (this.isProcessing) return; // Prevent overlapping runs
    this.isProcessing = true;

    try {
      const events = await this.outboxRepo.find({
        where: { publishedAt: IsNull() },
        order: { createdAt: 'ASC' },
        take: 100, // Batch size
      });

      for (const event of events) {
        try {
          const topic = this.resolveTopicName(event.aggregateType);
          // emit() returns an Observable; convert it to a Promise so a
          // publish failure is observed before the event is marked published
          await lastValueFrom(
            this.kafkaClient.emit(topic, {
              key: event.aggregateId,
              value: JSON.stringify(event.payload),
              headers: {
                'event-type': event.eventType,
                'event-id': event.id,
              },
            }),
          );

          await this.outboxRepo.update(event.id, {
            publishedAt: new Date(),
          });
        } catch (error) {
          this.logger.error(
            `Failed to publish event ${event.id}: ${error.message}`,
          );
          await this.outboxRepo.increment(
            { id: event.id },
            'retryCount',
            1,
          );
        }
      }
    } finally {
      this.isProcessing = false;
    }
  }

  private resolveTopicName(aggregateType: string): string {
    const topicMap: Record<string, string> = {
      Order: 'orders.events',
      Payment: 'payments.events',
      Inventory: 'inventory.events',
    };
    return topicMap[aggregateType] ?? `${aggregateType.toLowerCase()}.events`;
  }
}

Polling interval tradeoffs: A 1-second interval gives low latency but generates constant database load. A 30-second interval reduces load but adds delivery latency. For most systems, 5 seconds is a reasonable default. If you need sub-second delivery, use CDC instead.

The polling publisher has a known limitation: it introduces at-least-once delivery. If the publisher crashes after sending to Kafka but before marking the event as published, the event will be re-sent on the next poll. Your consumers must handle duplicates — which brings us to idempotency.

CDC with Debezium — The WAL-Based Alternative

Instead of polling the outbox table, you can use Change Data Capture (CDC) to stream changes from the database's write-ahead log (WAL) directly to Kafka. Debezium is the standard tool for this.

Debezium runs as a Kafka Connect connector. It reads PostgreSQL's logical replication stream, captures every INSERT to the outbox table, and publishes it to a Kafka topic. No polling, no added database load from repeated queries.

A Debezium connector configuration for the outbox pattern:

{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-primary",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets/db-password.txt}",
    "database.dbname": "orders_db",
    "topic.prefix": "cdc",
    "table.include.list": "public.outbox_events",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.field.event.id": "id",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.type": "event_type",
    "transforms.outbox.table.field.event.payload": "payload",
    "transforms.outbox.route.topic.replacement": "${routedByValue}.events",
    "transforms.outbox.table.fields.additional.placement": "aggregate_type:header:aggregateType",
    "plugin.name": "pgoutput",
    "slot.name": "outbox_slot",
    "publication.name": "outbox_publication"
  }
}

The EventRouter SMT (Single Message Transform) is key — it extracts the event payload from the outbox row and routes it to the correct topic based on the aggregate_type column.

When to use CDC vs. polling:

| Factor | Polling | CDC (Debezium) |
| --- | --- | --- |
| Latency | Seconds (depends on interval) | Sub-second |
| Database load | Repeated queries | Minimal (reads WAL) |
| Operational complexity | Low | High (Kafka Connect cluster, slot management) |
| Debugging | Simple (query the table) | Harder (WAL offsets, slot monitoring) |
| Infrastructure | Just your app + DB | Kafka Connect + Debezium |

Start with polling. Move to CDC when polling latency or database load becomes a bottleneck. Many systems never need CDC.

Idempotent Consumers

Kafka guarantees at-least-once delivery by default. Messages can be delivered more than once — broker retries, consumer rebalances, and the outbox relay's at-least-once semantics all contribute to duplicates.

Your consumers must be idempotent: processing the same event twice produces the same result.

// idempotent-consumer.service.ts
@Injectable()
export class IdempotentConsumerService {
  constructor(
    @InjectRepository(ProcessedEvent)
    private readonly processedEventRepo: Repository<ProcessedEvent>,
    private readonly dataSource: DataSource,
  ) {}

  async processEvent<T>(
    eventId: string,
    handler: (queryRunner: QueryRunner) => Promise<T>,
  ): Promise<T | null> {
    const queryRunner = this.dataSource.createQueryRunner();
    await queryRunner.connect();
    await queryRunner.startTransaction();

    try {
      // Check if already processed (with row-level lock to prevent races)
      const existing = await queryRunner.manager.findOne(ProcessedEvent, {
        where: { eventId },
        lock: { mode: 'pessimistic_write' },
      });

      if (existing) {
        await queryRunner.rollbackTransaction();
        return null; // Already processed
      }

      // Execute business logic
      const result = await handler(queryRunner);

      // Record that we processed this event
      await queryRunner.manager.save(ProcessedEvent, {
        eventId,
        processedAt: new Date(),
      });

      await queryRunner.commitTransaction();
      return result;
    } catch (error) {
      await queryRunner.rollbackTransaction();
      throw error;
    } finally {
      await queryRunner.release();
    }
  }
}

The processed event check and the business logic run in the same transaction. This ensures that either both the deduplication record and the business state change are committed, or neither is.

Storage for idempotency keys:

  • Database (same as business data): Strongest guarantee — the dedup check and business write are in the same transaction. Use this for critical operations.
  • Redis: Lower latency for the dedup lookup, but the check and business write are no longer atomic. Acceptable for operations where occasional duplicate processing is tolerable (e.g., sending a notification twice is annoying but not catastrophic).

Set a TTL on your idempotency records. Events older than your Kafka retention period can't be redelivered, so there's no need to keep their dedup keys forever. If your retention is 7 days, a 14-day TTL on idempotency records gives you a comfortable buffer.
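The cutoff arithmetic, as a sketch (dedupCutoff and the processed_events table name are illustrative):

```typescript
// Sketch: compute the purge cutoff for idempotency records.
// Records older than (Kafka retention + buffer) can never be redelivered.
const DAY_MS = 24 * 60 * 60 * 1000;

function dedupCutoff(now: Date, kafkaRetentionDays: number, bufferDays: number): Date {
  return new Date(now.getTime() - (kafkaRetentionDays + bufferDays) * DAY_MS);
}

// 7-day retention + 7-day buffer = the 14-day TTL from the text
const cutoff = dedupCutoff(new Date('2026-01-18T00:00:00Z'), 7, 7);
// e.g. DELETE FROM processed_events WHERE processed_at < $1  -- $1 = cutoff
```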

Dead Letter Queues

Some messages can't be processed regardless of how many times you retry. Malformed payloads, schema mismatches, or bugs in consumer logic produce "poison messages" that block the partition. A dead letter queue (DLQ) moves these aside so the consumer can continue.

// dlq.service.ts
@Injectable()
export class DeadLetterQueueService {
  private readonly MAX_RETRIES = 3;
  private readonly BACKOFF_BASE_MS = 1000;

  constructor(
    @Inject('KAFKA_SERVICE')
    private readonly kafkaClient: ClientKafka,
  ) {}

  async handleWithRetry(
    event: any,
    context: KafkaContext,
    handler: () => Promise<void>,
  ): Promise<void> {
    const retryCount = this.getRetryCount(context);

    if (retryCount >= this.MAX_RETRIES) {
      await this.sendToDlq(event, context, 'Max retries exceeded');
      return;
    }

    try {
      await handler();
    } catch (error) {
      if (this.isRetryable(error)) {
        const backoffMs = this.BACKOFF_BASE_MS * Math.pow(2, retryCount);
        await this.sleep(backoffMs);

        // Publish to retry topic with incremented count
        // (lastValueFrom turns the emit() Observable into an awaitable Promise)
        await lastValueFrom(
          this.kafkaClient.emit(`${context.getTopic()}.retry`, {
            key: context.getMessage().key,
            value: context.getMessage().value,
            headers: {
              ...context.getMessage().headers,
              'x-retry-count': (retryCount + 1).toString(),
              'x-original-topic': context.getTopic(),
              'x-error-message': error.message,
            },
          }),
        );
      } else {
        await this.sendToDlq(event, context, error.message);
      }
    }
  }

  private async sendToDlq(
    event: any,
    context: KafkaContext,
    reason: string,
  ): Promise<void> {
    const dlqTopic = `${context.getTopic()}.dlq`;
    await lastValueFrom(
      this.kafkaClient.emit(dlqTopic, {
        key: context.getMessage().key,
        value: context.getMessage().value,
        headers: {
          ...context.getMessage().headers,
          'x-dlq-reason': reason,
          'x-dlq-timestamp': new Date().toISOString(),
          'x-original-topic': context.getTopic(),
          'x-original-partition': context.getPartition().toString(),
          'x-original-offset': context.getMessage().offset,
        },
      }),
    );
  }

  private getRetryCount(context: KafkaContext): number {
    const header = context.getMessage().headers?.['x-retry-count'];
    return header ? parseInt(header.toString(), 10) : 0;
  }

  private isRetryable(error: any): boolean {
    // Network errors, timeouts, and transient DB errors are retryable
    // Validation errors, deserialization errors are not
    const nonRetryable = ['ValidationError', 'SyntaxError', 'SchemaError'];
    return !nonRetryable.includes(error.constructor.name);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

Always alert on DLQ messages. A message in the DLQ represents data that was not processed — in a financial system, that could be a payment that was charged but not recorded. Set up alerts when DLQ topic offsets advance.

Event Ordering Guarantees

Kafka guarantees ordering within a partition. To ensure all events for an entity are ordered, use the entity's ID as the partition key. All events with the same key hash to the same partition and are therefore ordered.

// Partition key selection examples
await this.kafkaClient.emit('orders.events', {
  key: order.id,           // All events for this order are ordered
  value: JSON.stringify(event),
});

await this.kafkaClient.emit('payments.events', {
  key: payment.orderId,    // All payments for an order are ordered
  value: JSON.stringify(event),
});

// Anti-pattern: using customer ID for high-volume customers
// creates a "hot partition" — one partition gets disproportionate traffic
await this.kafkaClient.emit('orders.events', {
  key: order.customerId,   // Avoid: enterprise customers create hot partitions
  value: JSON.stringify(event),
});
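The invariant behind key-based ordering is that partition selection is a deterministic function of the key. A sketch, with djb2 standing in for Kafka's actual murmur2 partitioner:

```typescript
// Sketch: same key → same partition, every time.
// djb2 is illustrative; Kafka's default partitioner uses murmur2 mod numPartitions.
function partitionFor(key: string, numPartitions: number): number {
  let h = 5381;
  for (const ch of key) h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  return h % numPartitions;
}

const p1 = partitionFor('order-12345', 12);
const p2 = partitionFor('order-12345', 12);
// p1 === p2: every event for order-12345 lands on one partition, so it stays ordered
```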

Exactly-once semantics (EOS) with Kafka transactions ensure that a consume-transform-produce cycle is atomic. The consumer reads a message, processes it, produces output messages, and commits the consumer offset — all in a single transaction:

const producer = kafka.producer({
  idempotent: true,
  transactionalId: 'inventory-service-txn',
  maxInFlightRequests: 1,
});
await producer.connect(); // connect before opening a transaction

const transaction = await producer.transaction();

try {
  await transaction.send({
    topic: 'inventory.events',
    messages: [{ key: itemId, value: JSON.stringify(inventoryUpdated) }],
  });
  await transaction.sendOffsets({
    consumerGroupId: 'inventory-service-group',
    topics: [{ topic: 'orders.events', partitions: [{ partition, offset }] }],
  });
  await transaction.commit();
} catch (error) {
  await transaction.abort();
  throw error;
}

EOS adds latency (roughly 50-100ms per transaction) and reduces throughput. Use it when duplicate downstream effects are unacceptable.

Monitoring

A Kafka-based system that isn't monitored will eventually lose messages silently.

Consumer lag is the single most important metric. It measures the difference between the latest offset in a partition and the consumer's committed offset. Rising lag means consumers can't keep up.
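The definition, as a sketch. consumerLag is an illustrative helper; in practice the log-end offset comes from the broker and the committed offset from the consumer group metadata:

```typescript
// Sketch: lag per partition = log-end offset minus committed offset.
// committedOffset is the NEXT offset to read, so lag is a plain difference.
function consumerLag(logEndOffset: bigint, committedOffset: bigint): bigint {
  return logEndOffset - committedOffset;
}

// Latest offset 10_142, consumer committed through 10_000 → 142 messages behind
const lag = consumerLag(10_142n, 10_000n);
```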

Key Prometheus metrics to expose:

# Consumer lag per partition
kafka_consumer_lag{topic="orders.events", partition="0", group="inventory-service"} 142

# Messages consumed (a counter; derive the per-second rate with rate() in PromQL)
kafka_consumer_messages_total{topic="orders.events", group="inventory-service"}

# Processing duration (p99)
kafka_consumer_processing_duration_seconds{quantile="0.99", topic="orders.events"} 0.045

# DLQ messages (should be near zero)
kafka_dlq_messages_total{original_topic="orders.events"} 3

# Outbox table unpublished count (polling relay)
outbox_unpublished_events_count{service="order-service"} 7

# Producer send latency
kafka_producer_send_duration_seconds{quantile="0.99", topic="orders.events"} 0.012

Alerting thresholds:

  • Consumer lag > 10,000 for 5 minutes → Warning
  • Consumer lag > 100,000 for 5 minutes → Critical
  • DLQ messages > 0 → Investigate immediately
  • Outbox unpublished count > 1,000 → Relay is stuck or slow
  • Consumer processing p99 > 1s → Consumer is bottlenecked
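A sketch of the lag threshold mapping. The "for 5 minutes" duration belongs in the alert rule itself, not in code; names here are illustrative:

```typescript
// Sketch: classify consumer lag using the thresholds above.
type Severity = 'ok' | 'warning' | 'critical';

function lagSeverity(lag: number): Severity {
  if (lag > 100_000) return 'critical'; // page someone
  if (lag > 10_000) return 'warning';   // investigate soon
  return 'ok';
}
```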

Use Burrow or Kafka's built-in kafka-consumer-groups.sh for lag monitoring. For Prometheus-based stacks, the kafka_exporter project exposes broker and consumer group metrics.

Case Study: Order Management System at Scale

This section describes a system that Stripe Systems built for an e-commerce client processing 100K+ orders per day across five microservices: Order, Payment, Inventory, Notification, and Fulfillment.

The Problem

The original architecture used synchronous HTTP calls. When a customer placed an order, the Order Service called the Payment Service, then the Inventory Service, then the Notification Service — sequentially, in the same request cycle. If the Inventory Service timed out after Payment had already charged the customer, the system entered an inconsistent state.

The most damaging bug: orders were placed and payments collected, but inventory was not decremented. This caused 2-3% overselling — customers received "item shipped" emails followed by "sorry, out of stock" emails days later. The root cause was a dual-write problem: the Order Service committed the order to its database and then made an HTTP call to Inventory. When the HTTP call failed, the order existed but inventory was unchanged.

The Solution

We migrated to an event-driven architecture with the outbox pattern. The Order Service writes the order and an outbox event in a single PostgreSQL transaction. A polling relay publishes the event to Kafka. Downstream services consume events independently.

Outbox table DDL (PostgreSQL):

CREATE TABLE outbox_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(100) NOT NULL,
    aggregate_id VARCHAR(100) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    partition_key VARCHAR(255) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,
    retry_count INT NOT NULL DEFAULT 0
);

-- Partial index: only unpublished events are scanned by the relay
CREATE INDEX idx_outbox_unpublished
    ON outbox_events (created_at ASC)
    WHERE published_at IS NULL;

-- Cleanup: published events older than 7 days
-- Runs via pg_cron daily
-- DELETE FROM outbox_events WHERE published_at < NOW() - INTERVAL '7 days';

Kafka topic design:

| Topic | Partitions | Retention | Partition Key |
| --- | --- | --- | --- |
| orders.events | 12 | 7 days | orderId |
| payments.events | 6 | 14 days | orderId |
| inventory.events | 12 | 7 days | skuId |
| notifications.events | 6 | 3 days | customerId |
| fulfillment.events | 6 | 14 days | orderId |
| *.retry (per topic) | 3 | 3 days | same as source |
| *.dlq (per topic) | 1 | 30 days | same as source |

The orders.events topic uses 12 partitions — enough to support 12 concurrent consumers in each downstream service's consumer group. DLQ topics use 1 partition (low throughput by design) with 30-day retention for forensic analysis.

Inventory service consumer (NestJS):

// inventory-consumer.controller.ts
@Controller()
export class InventoryConsumerController {
  private readonly logger = new Logger(InventoryConsumerController.name);

  constructor(
    private readonly idempotentConsumer: IdempotentConsumerService,
    private readonly inventoryService: InventoryService,
    private readonly dlqService: DeadLetterQueueService,
  ) {}

  @EventPattern('orders.events')
  async handleOrderEvent(
    @Payload() event: CloudEvent<OrderEventData>,
    @Ctx() context: KafkaContext,
  ) {
    if (event.type !== 'com.example.order.placed') {
      return; // Ignore events we don't care about
    }

    await this.dlqService.handleWithRetry(event, context, async () => {
      const result = await this.idempotentConsumer.processEvent(
        event.id,
        async (queryRunner) => {
          // Decrement inventory for each item in the order
          for (const item of event.data.items) {
            const inventory = await queryRunner.manager.findOne(
              InventoryItem,
              {
                where: { sku: item.sku },
                lock: { mode: 'pessimistic_write' },
              },
            );

            if (!inventory) {
              throw new ValidationError(`Unknown SKU: ${item.sku}`);
            }

            if (inventory.availableQuantity < item.quantity) {
              // Publish compensation event instead of throwing
              await this.publishInsufficientStockEvent(
                event.data.orderId,
                item.sku,
                inventory.availableQuantity,
                item.quantity,
              );
              return { success: false, reason: 'insufficient_stock' };
            }

            inventory.availableQuantity -= item.quantity;
            inventory.reservedQuantity += item.quantity;
            await queryRunner.manager.save(inventory);
          }

          // Write outbox event for downstream services
          await queryRunner.manager.save(OutboxEvent, {
            aggregateType: 'Inventory',
            aggregateId: event.data.orderId,
            eventType: 'inventory.reserved',
            partitionKey: event.data.orderId,
            payload: {
              specversion: '1.0',
              id: randomUUID(),
              source: '/services/inventory-service',
              type: 'com.example.inventory.reserved',
              time: new Date().toISOString(),
              data: {
                orderId: event.data.orderId,
                items: event.data.items.map((i) => ({
                  sku: i.sku,
                  quantity: i.quantity,
                })),
              },
            },
          });

          return { success: true };
        },
      );

      if (result === null) {
        this.logger.debug(`Event ${event.id} already processed, skipping`);
      }
    });
  }
}

Key implementation details in this consumer:

  1. Idempotent processing: The processEvent wrapper checks for duplicate event IDs before executing business logic.
  2. Pessimistic locking: SELECT ... FOR UPDATE prevents concurrent inventory decrements from overselling.
  3. Compensation over rejection: When stock is insufficient, we publish an inventory.insufficient_stock event instead of silently dropping the message. The Order Service consumes this to cancel the order and trigger a refund.
  4. Outbox for downstream events: The inventory service itself uses the outbox pattern to publish inventory.reserved events consumed by the Fulfillment service.

Event flow:

Customer places order
    → Order Service: INSERT order + outbox event (single txn)
    → Outbox Relay: polls outbox → publishes to orders.events
    → Kafka: orders.events (partition key: orderId)
        ├── Payment Service: charges payment → publishes payment.completed
        ├── Inventory Service: reserves stock → publishes inventory.reserved
        └── Notification Service: sends order confirmation email
    → Kafka: inventory.events
        └── Fulfillment Service: initiates shipping when both
            payment.completed AND inventory.reserved received

Results

After deploying this architecture to production, we measured results over a 90-day period:

| Metric | Before (HTTP) | After (Outbox + Kafka) |
| --- | --- | --- |
| Overselling rate | 2-3% | 0.001% (1 in 100K) |
| Order placement p99 latency | 1,200ms | 180ms |
| System availability during downstream outages | Cascading failures | Order Service unaffected |
| Failed order recovery | Manual intervention | Automatic via retries + DLQ |
| Daily DLQ messages | N/A | ~5 (schema issues, investigated same day) |

The p99 latency dropped from 1,200ms to 180ms because the Order Service no longer waits for downstream HTTP calls. It writes to PostgreSQL and returns. The remaining 0.001% overselling comes from race conditions in concurrent inventory updates under extreme load — addressed by tuning the pessimistic lock wait timeout.

The DLQ averages about 5 messages per day out of 100K+ orders. These are typically caused by malformed payloads from partner API integrations and are triaged within the same business day.

Takeaways

Event-driven architecture isn't a universal improvement over request-response. It trades immediate consistency for temporal decoupling and resilience. The transactional outbox pattern eliminates the dual-write problem that causes data inconsistencies between services. Idempotent consumers and dead letter queues are not optional — they're structural requirements for correctness in an at-least-once delivery system.

Start simple: PostgreSQL outbox with a polling relay. Add CDC with Debezium when polling latency becomes a problem. Monitor consumer lag before anything else. And don't make something an event unless temporal decoupling provides concrete value.
