Stripe Systems
Backend Development · January 18, 2026 · 18 min read

Event-Driven Architecture with Kafka, NestJS, and Outbox Pattern — A Production Walkthrough

Stripe Systems Engineering

Most backend systems start as synchronous request-response services. A client sends a request, the server processes it, and returns a result. This model is simple to reason about, easy to debug, and well-supported by every framework.

But it breaks down when services need to coordinate work without blocking each other, when you need to replay historical state changes, or when a downstream service being unavailable shouldn't prevent an upstream service from completing its job. That's when event-driven architecture earns its place.

This post is a production-focused walkthrough. We'll cover the theory briefly, then spend most of our time on implementation details — the outbox pattern, idempotent consumers, dead letter queues, and the operational concerns that matter at scale.

Event-Driven vs. Request-Driven — When Events Are the Right Abstraction

In a request-driven system, Service A calls Service B and waits for a response. The coupling is temporal (A blocks until B responds), behavioral (A knows B's API contract), and availability-dependent (if B is down, A fails).

In an event-driven system, Service A publishes an event describing what happened. Service B, C, and D consume it on their own schedule. The coupling shifts: A doesn't know who consumes its events, doesn't wait for processing, and doesn't fail if consumers are temporarily unavailable.

Temporal decoupling is the primary benefit. When Service A publishes an OrderPlaced event, it doesn't care whether the inventory service processes it in 50ms or 5 minutes. The event sits in the broker until consumers are ready.

Eventual consistency is the primary tradeoff. After publishing OrderPlaced, there's a window where the order exists but inventory hasn't been decremented. Your system must tolerate this. If your business logic requires immediate consistency — a bank transfer where both accounts must reflect the change atomically — events are the wrong abstraction for that specific operation.

Not everything should be an event. Queries that need a synchronous response, operations requiring strong consistency, and simple CRUD that doesn't trigger downstream work — these are better served by direct API calls. The decision should be driven by whether temporal decoupling provides real value, not by architectural fashion.

Apache Kafka Fundamentals

Kafka is a distributed commit log. Messages are appended to topics, which are split into partitions. Each partition is an ordered, immutable sequence of records. Ordering is guaranteed only within a single partition, not across partitions.

Consumer groups are Kafka's scaling mechanism. Each partition in a topic is assigned to exactly one consumer within a group. If you have 12 partitions and 4 consumers in a group, each consumer handles 3 partitions. If a consumer dies, Kafka rebalances — reassigning its partitions to the remaining consumers. This is why partition count sets the upper bound on parallelism: 12 partitions means at most 12 concurrent consumers in a single group.
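To make the arithmetic concrete, here is a small sketch of round-robin assignment. assignPartitions is a hypothetical helper, not a client API; real Kafka clients use pluggable assignors (range, round-robin, cooperative sticky), but the counting works out the same:

```typescript
// Sketch: how 12 partitions distribute over a consumer group.
// assignPartitions is illustrative only — brokers perform the real assignment.
function assignPartitions(
  partitionCount: number,
  consumerIds: string[],
): Map<string, number[]> {
  const assignment = new Map<string, number[]>();
  for (const id of consumerIds) assignment.set(id, []);
  for (let p = 0; p < partitionCount; p++) {
    // Round-robin: partition p goes to consumer p mod groupSize
    assignment.get(consumerIds[p % consumerIds.length])!.push(p);
  }
  return assignment;
}

const fourConsumers = assignPartitions(12, ['c1', 'c2', 'c3', 'c4']);
// Each consumer handles 3 partitions, e.g. c1 → [0, 4, 8]
const afterRebalance = assignPartitions(12, ['c1', 'c2', 'c3']);
// After one consumer dies and the group rebalances: 4 partitions each
```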

Offsets track each consumer's position in a partition. When a consumer processes a message, it commits the offset to Kafka (or to an external store). On restart, it resumes from the last committed offset. The choice between auto.commit and manual offset management has significant implications for at-least-once vs. at-most-once delivery semantics.

Retention is time-based or size-based. A 7-day retention policy means consumers have a week to process messages before they're deleted. For event sourcing workloads, you can set retention to infinite (retention.ms=-1), turning Kafka into a permanent event store.

Key configuration decisions you'll make early:

  • Partition count: Overprovisioning is safer than underprovisioning. You can increase partitions but never decrease them, and increasing them breaks key-based ordering guarantees for existing data.
  • Replication factor: 3 is standard for production. This tolerates 1 broker failure without data loss.
  • min.insync.replicas: Set to 2 with a replication factor of 3. Combined with acks=all on the producer, this ensures a write is acknowledged only after 2 replicas have it.
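As a sketch, here is a topic specification reflecting these choices. The object shape mirrors the input of kafkajs's admin.createTopics(); the topic name and 7-day retention are illustrative:

```typescript
// Sketch: topic spec encoding the guidance above (names are illustrative).
const ordersTopic = {
  topic: 'orders.events',
  numPartitions: 12,     // sets the upper bound on consumer parallelism
  replicationFactor: 3,  // tolerates one broker failure without data loss
  configEntries: [
    // With acks=all on the producer, writes are acknowledged by 2 replicas
    { name: 'min.insync.replicas', value: '2' },
    // 7-day time-based retention; set to '-1' for an event-sourcing topic
    { name: 'retention.ms', value: String(7 * 24 * 60 * 60 * 1000) },
  ],
};
// e.g. await admin.createTopics({ topics: [ordersTopic] });
```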

NestJS Kafka Integration

NestJS provides Kafka support through @nestjs/microservices backed by KafkaJS. Here's a production-ready module setup:

// kafka.module.ts
import { Module } from '@nestjs/common';
import { ClientsModule, Transport } from '@nestjs/microservices';

@Module({
  imports: [
    ClientsModule.register([
      {
        name: 'KAFKA_SERVICE',
        transport: Transport.KAFKA,
        options: {
          client: {
            clientId: 'order-service',
            brokers: process.env.KAFKA_BROKERS?.split(',') ?? ['localhost:9092'],
            ssl: process.env.KAFKA_SSL === 'true',
            sasl: process.env.KAFKA_SASL_USERNAME
              ? {
                  mechanism: 'scram-sha-256',
                  username: process.env.KAFKA_SASL_USERNAME,
                  password: process.env.KAFKA_SASL_PASSWORD ?? '',
                }
              : undefined,
          },
          producer: {
            allowAutoTopicCreation: false,
            idempotent: true,
          },
          consumer: {
            groupId: 'order-service-group',
            sessionTimeout: 30000,
            heartbeatInterval: 10000,
            maxWaitTimeInMs: 100,
          },
        },
      },
    ]),
  ],
  exports: [ClientsModule],
})
export class KafkaModule {}

Note idempotent: true on the producer — this enables Kafka's idempotent producer, which deduplicates messages caused by retries at the broker level. The overhead is negligible, so enable it by default.

Consumer setup uses NestJS decorators:

// order-events.controller.ts
import { Controller } from '@nestjs/common';
import { EventPattern, Payload, Ctx, KafkaContext } from '@nestjs/microservices';

@Controller()
export class OrderEventsController {
  @EventPattern('orders.events')
  async handleOrderEvent(
    @Payload() event: OrderEvent,
    @Ctx() context: KafkaContext,
  ) {
    const { offset } = context.getMessage();
    const partition = context.getPartition();
    const topic = context.getTopic();

    try {
      await this.processEvent(event);
      // Manual commit after successful processing
      // (requires disabling auto-commit in the consumer run options)
      await context.getConsumer().commitOffsets([
        { topic, partition, offset: (Number(offset) + 1).toString() },
      ]);
    } catch (error) {
      // Don't commit — message will be redelivered
      throw error;
    }
  }
}

Event Schema Design

Poorly designed event schemas create coupling that's worse than direct API calls — because the coupling is implicit and discovered at runtime.

The CloudEvents specification provides a standard envelope:

{
  "specversion": "1.0",
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "source": "/services/order-service",
  "type": "com.example.order.placed",
  "datacontenttype": "application/json",
  "time": "2025-08-30T14:30:00Z",
  "subject": "order-12345",
  "data": {
    "orderId": "order-12345",
    "customerId": "cust-789",
    "items": [
      { "sku": "WIDGET-001", "quantity": 2, "unitPrice": 29.99 }
    ],
    "totalAmount": 59.98,
    "currency": "USD"
  }
}

The metadata fields (id, source, type, time) are essential for routing, deduplication, and debugging. The data field contains your domain-specific payload.
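A minimal envelope builder illustrating these fields. makeCloudEvent is a hypothetical helper, not part of any CloudEvents SDK:

```typescript
import { randomUUID } from 'node:crypto';

// Sketch: CloudEvents 1.0 envelope shape (field names per the spec).
interface CloudEvent<T> {
  specversion: '1.0';
  id: string;
  source: string;
  type: string;
  datacontenttype: string;
  time: string;
  subject?: string;
  data: T;
}

function makeCloudEvent<T>(
  source: string,
  type: string,
  data: T,
  subject?: string,
): CloudEvent<T> {
  return {
    specversion: '1.0',
    id: randomUUID(), // unique per event — this is the deduplication key
    source,
    type,
    datacontenttype: 'application/json',
    time: new Date().toISOString(),
    subject,
    data,
  };
}

const evt = makeCloudEvent(
  '/services/order-service',
  'com.example.order.placed',
  { orderId: 'order-12345' },
  'order-12345',
);
```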

Schema evolution is where things get painful without a plan. A schema registry (Confluent Schema Registry or Apicurio) with Avro or Protobuf enforces compatibility rules:

  • Backward compatible: New schema can read data written by the old schema. You can add optional fields, but not required ones.
  • Forward compatible: Old schema can read data written by the new schema. You can remove optional fields.
  • Full compatibility: Both directions. The safest choice — only add or remove optional fields with defaults.

With Avro in a schema registry:

{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.example.orders",
  "fields": [
    { "name": "orderId", "type": "string" },
    { "name": "customerId", "type": "string" },
    { "name": "totalAmount", "type": "double" },
    { "name": "currency", "type": "string", "default": "USD" },
    { "name": "placedAt", "type": { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "metadata", "type": ["null", "string"], "default": null }
  ]
}

The metadata field was added later with a null default — backward compatible. Existing consumers ignore it; new consumers can read it.
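That rule can be encoded in a few lines. This is a deliberately simplified sketch (real schema registries implement the full Avro resolution algorithm); it covers only the "new field without a default breaks backward compatibility" case:

```typescript
// Sketch: simplified backward-compatibility check for Avro-style records.
interface AvroField { name: string; type: unknown; default?: unknown }
interface AvroRecord { name: string; fields: AvroField[] }

function isBackwardCompatible(oldSchema: AvroRecord, newSchema: AvroRecord): boolean {
  const oldNames = new Set(oldSchema.fields.map((f) => f.name));
  return newSchema.fields
    .filter((f) => !oldNames.has(f.name)) // fields added in the new schema...
    .every((f) => 'default' in f);        // ...must carry a default value
}

const v1: AvroRecord = {
  name: 'OrderPlaced',
  fields: [{ name: 'orderId', type: 'string' }],
};
const v2: AvroRecord = {
  name: 'OrderPlaced',
  fields: [
    { name: 'orderId', type: 'string' },
    { name: 'metadata', type: ['null', 'string'], default: null }, // optional: OK
  ],
};
const v3: AvroRecord = {
  name: 'OrderPlaced',
  fields: [
    { name: 'orderId', type: 'string' },
    { name: 'channel', type: 'string' }, // required, no default: breaks old data
  ],
};
```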

The Transactional Outbox Pattern

Here's the problem: your service needs to update a database row AND publish a Kafka event. These are two different systems. If the database write succeeds but the Kafka publish fails (network blip, broker down), your database says "order created" but no event was emitted. Downstream services never learn about the order.

You can't wrap them in a single transaction because Kafka isn't a relational database. You could try to publish first and write second, but that inverts the problem. This is the dual-write problem, and it has no solution without changing the approach.

The outbox pattern solves it by writing both the business data and the event to the same database in a single transaction:

CREATE TABLE outbox_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id VARCHAR(255) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,
    retry_count INT NOT NULL DEFAULT 0
);

-- PostgreSQL has no inline INDEX clause in CREATE TABLE; create it separately
CREATE INDEX idx_outbox_unpublished
    ON outbox_events (created_at)
    WHERE published_at IS NULL;

The partial index WHERE published_at IS NULL is critical — it ensures the polling query only scans unpublished rows, not the entire table.

Your service code becomes:

// order.service.ts
async createOrder(dto: CreateOrderDto): Promise<Order> {
  return this.dataSource.transaction(async (manager) => {
    // 1. Write business data
    const order = manager.create(Order, {
      customerId: dto.customerId,
      items: dto.items,
      totalAmount: dto.totalAmount,
      status: OrderStatus.PLACED,
    });
    await manager.save(order);

    // 2. Write outbox event in the SAME transaction
    const outboxEvent = manager.create(OutboxEvent, {
      aggregateType: 'Order',
      aggregateId: order.id,
      eventType: 'order.placed',
      payload: {
        specversion: '1.0',
        id: randomUUID(),
        source: '/services/order-service',
        type: 'com.example.order.placed',
        time: new Date().toISOString(),
        data: {
          orderId: order.id,
          customerId: order.customerId,
          items: order.items,
          totalAmount: order.totalAmount,
        },
      },
    });
    await manager.save(outboxEvent);

    return order;
  });
}

Both writes succeed or both fail. Atomicity is guaranteed by the database transaction. The event in the outbox table is now a promise that "this event will eventually be published to Kafka."

Polling Publisher — The Simple Relay

The polling publisher is a background process that periodically reads unpublished events from the outbox table and publishes them to Kafka:

// outbox-relay.service.ts
import { Inject, Injectable, Logger } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';
import { InjectRepository } from '@nestjs/typeorm';
import { Repository, IsNull } from 'typeorm';
import { ClientKafka } from '@nestjs/microservices';
import { lastValueFrom } from 'rxjs';

@Injectable()
export class OutboxRelayService {
  private readonly logger = new Logger(OutboxRelayService.name);
  private isProcessing = false;

  constructor(
    @InjectRepository(OutboxEvent)
    private readonly outboxRepo: Repository<OutboxEvent>,
    @Inject('KAFKA_SERVICE')
    private readonly kafkaClient: ClientKafka,
  ) {}

  @Cron(CronExpression.EVERY_5_SECONDS)
  async publishPendingEvents() {
    if (this.isProcessing) return; // Prevent overlapping runs
    this.isProcessing = true;

    try {
      const events = await this.outboxRepo.find({
        where: { publishedAt: IsNull() },
        order: { createdAt: 'ASC' },
        take: 100, // Batch size
      });

      for (const event of events) {
        try {
          const topic = this.resolveTopicName(event.aggregateType);
          // emit() returns an Observable; convert it to a Promise so a
          // publish failure is observed before the event is marked published
          await lastValueFrom(
            this.kafkaClient.emit(topic, {
              key: event.aggregateId,
              value: JSON.stringify(event.payload),
              headers: {
                'event-type': event.eventType,
                'event-id': event.id,
              },
            }),
          );

          await this.outboxRepo.update(event.id, {
            publishedAt: new Date(),
          });
        } catch (error) {
          this.logger.error(
            `Failed to publish event ${event.id}: ${error.message}`,
          );
          await this.outboxRepo.increment(
            { id: event.id },
            'retryCount',
            1,
          );
        }
      }
    } finally {
      this.isProcessing = false;
    }
  }

  private resolveTopicName(aggregateType: string): string {
    const topicMap: Record<string, string> = {
      Order: 'orders.events',
      Payment: 'payments.events',
      Inventory: 'inventory.events',
    };
    return topicMap[aggregateType] ?? `${aggregateType.toLowerCase()}.events`;
  }
}

Polling interval tradeoffs: A 1-second interval gives low latency but generates constant database load. A 30-second interval reduces load but adds delivery latency. For most systems, 5 seconds is a reasonable default. If you need sub-second delivery, use CDC instead.

The polling publisher has a known limitation: it introduces at-least-once delivery. If the publisher crashes after sending to Kafka but before marking the event as published, the event will be re-sent on the next poll. Your consumers must handle duplicates — which brings us to idempotency.

CDC with Debezium — The WAL-Based Alternative

Instead of polling the outbox table, you can use Change Data Capture (CDC) to stream changes from the database's write-ahead log (WAL) directly to Kafka. Debezium is the standard tool for this.

Debezium runs as a Kafka Connect connector. It reads PostgreSQL's logical replication stream, captures every INSERT to the outbox table, and publishes it to a Kafka topic. No polling, no added database load from repeated queries.

A Debezium connector configuration for the outbox pattern:

{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-primary",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "${file:/secrets/db-password.txt}",
    "database.dbname": "orders_db",
    "topic.prefix": "cdc",
    "table.include.list": "public.outbox_events",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.field.event.id": "id",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.type": "event_type",
    "transforms.outbox.table.field.event.payload": "payload",
    "transforms.outbox.route.topic.replacement": "${routedByValue}.events",
    "transforms.outbox.table.fields.additional.placement": "aggregate_type:header:aggregateType",
    "plugin.name": "pgoutput",
    "slot.name": "outbox_slot",
    "publication.name": "outbox_publication"
  }
}

The EventRouter SMT (Single Message Transform) is key — it extracts the event payload from the outbox row and routes it to the correct topic based on the aggregate_type column.

When to use CDC vs. polling:

| Factor | Polling | CDC (Debezium) |
| --- | --- | --- |
| Latency | Seconds (depends on interval) | Sub-second |
| Database load | Repeated queries | Minimal (reads WAL) |
| Operational complexity | Low | High (Kafka Connect cluster, slot management) |
| Debugging | Simple (query the table) | Harder (WAL offsets, slot monitoring) |
| Infrastructure | Just your app + DB | Kafka Connect + Debezium |

Start with polling. Move to CDC when polling latency or database load becomes a bottleneck. Many systems never need CDC.

Idempotent Consumers

Kafka guarantees at-least-once delivery by default. Messages can be delivered more than once — broker retries, consumer rebalances, and the outbox relay's at-least-once semantics all contribute to duplicates.

Your consumers must be idempotent: processing the same event twice produces the same result.

// idempotent-consumer.service.ts
@Injectable()
export class IdempotentConsumerService {
  constructor(
    @InjectRepository(ProcessedEvent)
    private readonly processedEventRepo: Repository<ProcessedEvent>,
    private readonly dataSource: DataSource,
  ) {}

  async processEvent<T>(
    eventId: string,
    handler: (queryRunner: QueryRunner) => Promise<T>,
  ): Promise<T | null> {
    const queryRunner = this.dataSource.createQueryRunner();
    await queryRunner.connect();
    await queryRunner.startTransaction();

    try {
      // Check if already processed (with row-level lock to prevent races)
      const existing = await queryRunner.manager.findOne(ProcessedEvent, {
        where: { eventId },
        lock: { mode: 'pessimistic_write' },
      });

      if (existing) {
        await queryRunner.rollbackTransaction();
        return null; // Already processed
      }

      // Execute business logic
      const result = await handler(queryRunner);

      // Record that we processed this event
      await queryRunner.manager.save(ProcessedEvent, {
        eventId,
        processedAt: new Date(),
      });

      await queryRunner.commitTransaction();
      return result;
    } catch (error) {
      await queryRunner.rollbackTransaction();
      throw error;
    } finally {
      await queryRunner.release();
    }
  }
}

The processed event check and the business logic run in the same transaction. This ensures that either both the deduplication record and the business state change are committed, or neither is.

Storage for idempotency keys:

  • Database (same as business data): Strongest guarantee — the dedup check and business write are in the same transaction. Use this for critical operations.
  • Redis: Lower latency for the dedup lookup, but the check and business write are no longer atomic. Acceptable for operations where occasional duplicate processing is tolerable (e.g., sending a notification twice is annoying but not catastrophic).

Set a TTL on your idempotency records. Events older than your Kafka retention period can't be redelivered, so there's no need to keep their dedup keys forever. If your retention is 7 days, a 14-day TTL on idempotency records gives you a comfortable buffer.
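The cutoff arithmetic, as a sketch (dedupCutoff and the processed_events table name are illustrative):

```typescript
// Sketch: compute the purge cutoff for idempotency records.
// Records older than (Kafka retention + buffer) can never be redelivered.
const DAY_MS = 24 * 60 * 60 * 1000;

function dedupCutoff(now: Date, kafkaRetentionDays: number, bufferDays: number): Date {
  return new Date(now.getTime() - (kafkaRetentionDays + bufferDays) * DAY_MS);
}

// 7-day retention + 7-day buffer = the 14-day TTL from the text
const cutoff = dedupCutoff(new Date('2026-01-18T00:00:00Z'), 7, 7);
// e.g. DELETE FROM processed_events WHERE processed_at < $1  -- $1 = cutoff
```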

Dead Letter Queues

Some messages can't be processed regardless of how many times you retry. Malformed payloads, schema mismatches, or bugs in consumer logic produce "poison messages" that block the partition. A dead letter queue (DLQ) moves these aside so the consumer can continue.

// dlq.service.ts
@Injectable()
export class DeadLetterQueueService {
  private readonly MAX_RETRIES = 3;
  private readonly BACKOFF_BASE_MS = 1000;

  constructor(
    @Inject('KAFKA_SERVICE')
    private readonly kafkaClient: ClientKafka,
  ) {}

  async handleWithRetry(
    event: any,
    context: KafkaContext,
    handler: () => Promise<void>,
  ): Promise<void> {
    const retryCount = this.getRetryCount(context);

    if (retryCount >= this.MAX_RETRIES) {
      await this.sendToDlq(event, context, 'Max retries exceeded');
      return;
    }

    try {
      await handler();
    } catch (error) {
      if (this.isRetryable(error)) {
        const backoffMs = this.BACKOFF_BASE_MS * Math.pow(2, retryCount);
        await this.sleep(backoffMs);

        // Publish to retry topic with incremented count
        // (lastValueFrom turns the emit() Observable into an awaitable Promise)
        await lastValueFrom(
          this.kafkaClient.emit(`${context.getTopic()}.retry`, {
            key: context.getMessage().key,
            value: context.getMessage().value,
            headers: {
              ...context.getMessage().headers,
              'x-retry-count': (retryCount + 1).toString(),
              'x-original-topic': context.getTopic(),
              'x-error-message': error.message,
            },
          }),
        );
      } else {
        await this.sendToDlq(event, context, error.message);
      }
    }
  }

  private async sendToDlq(
    event: any,
    context: KafkaContext,
    reason: string,
  ): Promise<void> {
    const dlqTopic = `${context.getTopic()}.dlq`;
    await lastValueFrom(
      this.kafkaClient.emit(dlqTopic, {
        key: context.getMessage().key,
        value: context.getMessage().value,
        headers: {
          ...context.getMessage().headers,
          'x-dlq-reason': reason,
          'x-dlq-timestamp': new Date().toISOString(),
          'x-original-topic': context.getTopic(),
          'x-original-partition': context.getPartition().toString(),
          'x-original-offset': context.getMessage().offset,
        },
      }),
    );
  }

  private getRetryCount(context: KafkaContext): number {
    const header = context.getMessage().headers?.['x-retry-count'];
    return header ? parseInt(header.toString(), 10) : 0;
  }

  private isRetryable(error: any): boolean {
    // Network errors, timeouts, and transient DB errors are retryable
    // Validation errors, deserialization errors are not
    const nonRetryable = ['ValidationError', 'SyntaxError', 'SchemaError'];
    return !nonRetryable.includes(error.constructor.name);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

Always alert on DLQ messages. A message in the DLQ represents data that was not processed — in a financial system, that could be a payment that was charged but not recorded. Set up alerts when DLQ topic offsets advance.

Event Ordering Guarantees

Kafka guarantees ordering within a partition. To ensure all events for an entity are ordered, use the entity's ID as the partition key. All events with the same key hash to the same partition and are therefore ordered.

// Partition key selection examples
await this.kafkaClient.emit('orders.events', {
  key: order.id,           // All events for this order are ordered
  value: JSON.stringify(event),
});

await this.kafkaClient.emit('payments.events', {
  key: payment.orderId,    // All payments for an order are ordered
  value: JSON.stringify(event),
});

// Anti-pattern: using customer ID for high-volume customers
// creates a "hot partition" — one partition gets disproportionate traffic
await this.kafkaClient.emit('orders.events', {
  key: order.customerId,   // Avoid: enterprise customers create hot partitions
  value: JSON.stringify(event),
});
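The invariant behind key-based ordering is that partition selection is a deterministic function of the key. A sketch, with djb2 standing in for Kafka's actual murmur2 partitioner:

```typescript
// Sketch: same key → same partition, every time.
// djb2 is illustrative; Kafka's default partitioner uses murmur2 mod numPartitions.
function partitionFor(key: string, numPartitions: number): number {
  let h = 5381;
  for (const ch of key) h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  return h % numPartitions;
}

const p1 = partitionFor('order-12345', 12);
const p2 = partitionFor('order-12345', 12);
// p1 === p2: every event for order-12345 lands on one partition, so it stays ordered
```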

Exactly-once semantics (EOS) with Kafka transactions ensure that a consume-transform-produce cycle is atomic. The consumer reads a message, processes it, produces output messages, and commits the consumer offset — all in a single transaction:

const producer = kafka.producer({
  idempotent: true,
  transactionalId: 'inventory-service-txn',
  maxInFlightRequests: 1,
});
await producer.connect(); // connect before opening a transaction

const transaction = await producer.transaction();

try {
  await transaction.send({
    topic: 'inventory.events',
    messages: [{ key: itemId, value: JSON.stringify(inventoryUpdated) }],
  });
  await transaction.sendOffsets({
    consumerGroupId: 'inventory-service-group',
    topics: [{ topic: 'orders.events', partitions: [{ partition, offset }] }],
  });
  await transaction.commit();
} catch (error) {
  await transaction.abort();
  throw error;
}

EOS adds latency (roughly 50-100ms per transaction) and reduces throughput. Use it when duplicate downstream effects are unacceptable.

Monitoring

A Kafka-based system that isn't monitored will eventually lose messages silently.

Consumer lag is the single most important metric. It measures the difference between the latest offset in a partition and the consumer's committed offset. Rising lag means consumers can't keep up.
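The definition, as a sketch. consumerLag is an illustrative helper; in practice the log-end offset comes from the broker and the committed offset from the consumer group metadata:

```typescript
// Sketch: lag per partition = log-end offset minus committed offset.
// committedOffset is the NEXT offset to read, so lag is a plain difference.
function consumerLag(logEndOffset: bigint, committedOffset: bigint): bigint {
  return logEndOffset - committedOffset;
}

// Latest offset 10_142, consumer committed through 10_000 → 142 messages behind
const lag = consumerLag(10_142n, 10_000n);
```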

Key Prometheus metrics to expose:

# Consumer lag per partition
kafka_consumer_lag{topic="orders.events", partition="0", group="inventory-service"} 142

# Messages consumed (a counter; derive the per-second rate with rate() in PromQL)
kafka_consumer_messages_total{topic="orders.events", group="inventory-service"}

# Processing duration (p99)
kafka_consumer_processing_duration_seconds{quantile="0.99", topic="orders.events"} 0.045

# DLQ messages (should be near zero)
kafka_dlq_messages_total{original_topic="orders.events"} 3

# Outbox table unpublished count (polling relay)
outbox_unpublished_events_count{service="order-service"} 7

# Producer send latency
kafka_producer_send_duration_seconds{quantile="0.99", topic="orders.events"} 0.012

Alerting thresholds:

  • Consumer lag > 10,000 for 5 minutes → Warning
  • Consumer lag > 100,000 for 5 minutes → Critical
  • DLQ messages > 0 → Investigate immediately
  • Outbox unpublished count > 1,000 → Relay is stuck or slow
  • Consumer processing p99 > 1s → Consumer is bottlenecked
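A sketch of the lag threshold mapping. The "for 5 minutes" duration belongs in the alert rule itself, not in code; names here are illustrative:

```typescript
// Sketch: classify consumer lag using the thresholds above.
type Severity = 'ok' | 'warning' | 'critical';

function lagSeverity(lag: number): Severity {
  if (lag > 100_000) return 'critical'; // page someone
  if (lag > 10_000) return 'warning';   // investigate soon
  return 'ok';
}
```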

Use Burrow or Kafka's built-in kafka-consumer-groups.sh for lag monitoring. For Prometheus-based stacks, the kafka_exporter project exposes broker and consumer group metrics.

Case Study: Order Management System at Scale

This section describes a system that Stripe Systems built for an e-commerce client processing 100K+ orders per day across five microservices: Order, Payment, Inventory, Notification, and Fulfillment.

The Problem

The original architecture used synchronous HTTP calls. When a customer placed an order, the Order Service called the Payment Service, then the Inventory Service, then the Notification Service — sequentially, in the same request cycle. If the Inventory Service timed out after Payment had already charged the customer, the system entered an inconsistent state.

The most damaging bug: orders were placed and payments collected, but inventory was not decremented. This caused 2-3% overselling — customers received "item shipped" emails followed by "sorry, out of stock" emails days later. The root cause was a dual-write problem: the Order Service committed the order to its database and then made an HTTP call to Inventory. When the HTTP call failed, the order existed but inventory was unchanged.

The Solution

We migrated to an event-driven architecture with the outbox pattern. The Order Service writes the order and an outbox event in a single PostgreSQL transaction. A polling relay publishes the event to Kafka. Downstream services consume events independently.

Outbox table DDL (PostgreSQL):

CREATE TABLE outbox_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(100) NOT NULL,
    aggregate_id VARCHAR(100) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    partition_key VARCHAR(255) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ,
    retry_count INT NOT NULL DEFAULT 0
);

-- Partial index: only unpublished events are scanned by the relay
CREATE INDEX idx_outbox_unpublished
    ON outbox_events (created_at ASC)
    WHERE published_at IS NULL;

-- Cleanup: published events older than 7 days
-- Runs via pg_cron daily
-- DELETE FROM outbox_events WHERE published_at < NOW() - INTERVAL '7 days';

Kafka topic design:

| Topic | Partitions | Retention | Partition Key |
| --- | --- | --- | --- |
| orders.events | 12 | 7 days | orderId |
| payments.events | 6 | 14 days | orderId |
| inventory.events | 12 | 7 days | skuId |
| notifications.events | 6 | 3 days | customerId |
| fulfillment.events | 6 | 14 days | orderId |
| *.retry (per topic) | 3 | 3 days | same as source |
| *.dlq (per topic) | 1 | 30 days | same as source |

The orders.events topic uses 12 partitions — enough to support 12 concurrent consumers in each downstream service's consumer group. DLQ topics use 1 partition (low throughput by design) with 30-day retention for forensic analysis.

Inventory service consumer (NestJS):

// inventory-consumer.controller.ts
@Controller()
export class InventoryConsumerController {
  private readonly logger = new Logger(InventoryConsumerController.name);

  constructor(
    private readonly idempotentConsumer: IdempotentConsumerService,
    private readonly inventoryService: InventoryService,
    private readonly dlqService: DeadLetterQueueService,
  ) {}

  @EventPattern('orders.events')
  async handleOrderEvent(
    @Payload() event: CloudEvent<OrderEventData>,
    @Ctx() context: KafkaContext,
  ) {
    if (event.type !== 'com.example.order.placed') {
      return; // Ignore events we don't care about
    }

    await this.dlqService.handleWithRetry(event, context, async () => {
      const result = await this.idempotentConsumer.processEvent(
        event.id,
        async (queryRunner) => {
          // Decrement inventory for each item in the order
          for (const item of event.data.items) {
            const inventory = await queryRunner.manager.findOne(
              InventoryItem,
              {
                where: { sku: item.sku },
                lock: { mode: 'pessimistic_write' },
              },
            );

            if (!inventory) {
              throw new ValidationError(`Unknown SKU: ${item.sku}`);
            }

            if (inventory.availableQuantity < item.quantity) {
              // Publish compensation event instead of throwing
              await this.publishInsufficientStockEvent(
                event.data.orderId,
                item.sku,
                inventory.availableQuantity,
                item.quantity,
              );
              return { success: false, reason: 'insufficient_stock' };
            }

            inventory.availableQuantity -= item.quantity;
            inventory.reservedQuantity += item.quantity;
            await queryRunner.manager.save(inventory);
          }

          // Write outbox event for downstream services
          await queryRunner.manager.save(OutboxEvent, {
            aggregateType: 'Inventory',
            aggregateId: event.data.orderId,
            eventType: 'inventory.reserved',
            partitionKey: event.data.orderId,
            payload: {
              specversion: '1.0',
              id: randomUUID(),
              source: '/services/inventory-service',
              type: 'com.example.inventory.reserved',
              time: new Date().toISOString(),
              data: {
                orderId: event.data.orderId,
                items: event.data.items.map((i) => ({
                  sku: i.sku,
                  quantity: i.quantity,
                })),
              },
            },
          });

          return { success: true };
        },
      );

      if (result === null) {
        this.logger.debug(`Event ${event.id} already processed, skipping`);
      }
    });
  }
}

Key implementation details in this consumer:

  1. Idempotent processing: The processEvent wrapper checks for duplicate event IDs before executing business logic.
  2. Pessimistic locking: SELECT ... FOR UPDATE prevents concurrent inventory decrements from overselling.
  3. Compensation over rejection: When stock is insufficient, we publish an inventory.insufficient_stock event instead of silently dropping the message. The Order Service consumes this to cancel the order and trigger a refund.
  4. Outbox for downstream events: The inventory service itself uses the outbox pattern to publish inventory.reserved events consumed by the Fulfillment service.

Event flow:

Customer places order
    → Order Service: INSERT order + outbox event (single txn)
    → Outbox Relay: polls outbox → publishes to orders.events
    → Kafka: orders.events (partition key: orderId)
        ├── Payment Service: charges payment → publishes payment.completed
        ├── Inventory Service: reserves stock → publishes inventory.reserved
        └── Notification Service: sends order confirmation email
    → Kafka: inventory.events
        └── Fulfillment Service: initiates shipping when both
            payment.completed AND inventory.reserved received

Results

After deploying this architecture to production, we measured results over a 90-day period:

| Metric | Before (HTTP) | After (Outbox + Kafka) |
| --- | --- | --- |
| Overselling rate | 2-3% | 0.001% (1 in 100K) |
| Order placement p99 latency | 1,200ms | 180ms |
| System availability during downstream outages | Cascading failures | Order Service unaffected |
| Failed order recovery | Manual intervention | Automatic via retries + DLQ |
| Daily DLQ messages | N/A | ~5 (schema issues, investigated same day) |

The p99 latency dropped from 1,200ms to 180ms because the Order Service no longer waits for downstream HTTP calls. It writes to PostgreSQL and returns. The remaining 0.001% overselling comes from race conditions in concurrent inventory updates under extreme load — addressed by tuning the pessimistic lock wait timeout.

The DLQ averages about 5 messages per day out of 100K+ orders. These are typically caused by malformed payloads from partner API integrations and are triaged within the same business day.

Takeaways

Event-driven architecture isn't a universal improvement over request-response. It trades immediate consistency for temporal decoupling and resilience. The transactional outbox pattern eliminates the dual-write problem that causes data inconsistencies between services. Idempotent consumers and dead letter queues are not optional — they're structural requirements for correctness in an at-least-once delivery system.

Start simple: PostgreSQL outbox with a polling relay. Add CDC with Debezium when polling latency becomes a problem. Monitor consumer lag before anything else. And don't make something an event unless temporal decoupling provides concrete value.
