Stripe Systems
DevOps · February 15, 2026 · 12 min read

Terraform at Scale: Remote State, Workspaces, and Module Versioning for Multi-Team Environments

โœ๏ธ
Stripe Systems Engineering

Terraform works well for a single team managing a handful of resources. It does not work well when five teams share a single state file containing 200+ resources. This post covers the specific problems that emerge at scale and the patterns that address them.

What Is in Terraform State and Why It Matters

Terraform state (terraform.tfstate) is a JSON file that maps declared resources in .tf files to real infrastructure. It contains:

  • Resource IDs: The AWS ARN, GCP resource name, or Azure resource ID for every managed resource.
  • Attribute values: Every attribute of every resource, including sensitive values such as database passwords and API keys. Marking a value as sensitive redacts it from CLI output but does not remove it from state.
  • Dependency graph: The implicit and explicit dependencies between resources.
  • Metadata: Provider versions, Terraform version, serial number (for conflict detection).
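That metadata is visible in the state file's top-level keys. A minimal, fabricated sketch (a real file also carries a lineage UUID and per-resource attribute maps):

```shell
# Fabricated minimal terraform.tfstate showing the top-level metadata fields.
cat > example.tfstate <<'EOF'
{
  "version": 4,
  "terraform_version": "1.6.4",
  "serial": 42,
  "resources": []
}
EOF

# The serial increments on every state write; backends compare it to detect
# conflicting concurrent writes.
grep '"serial"' example.tfstate
```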

When state breaks, Terraform cannot determine what exists in the cloud. Common failure modes:

  • State corruption: Manual editing, interrupted applies, or backend failures can corrupt the JSON structure.
  • Stuck state locks: A crashed terraform apply leaves the DynamoDB lock in place. No one can apply until it is manually released.
  • State bloat: 200+ resources means terraform plan must refresh every resource, taking minutes per run.
  • Secret exposure: State contains plaintext secrets unless the backend encrypts at rest.
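On the secret-exposure point, note that Terraform's sensitive flag only redacts display; the plaintext still lands in state, which is why backend encryption at rest matters. A sketch (the resource reference is an assumption for illustration):

```hcl
# "sensitive" hides the value from plan/apply output and `terraform output`,
# but the plaintext is still written into terraform.tfstate.
# Backend encryption (e.g., S3 + KMS) is the actual protection.
output "db_password" {
  value     = aws_db_instance.this.password
  sensitive = true
}
```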

Remote State Backends

S3 + DynamoDB (AWS)

The most common backend for AWS-centric teams:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "networking/terraform.tfstate"
    region         = "ap-south-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
    kms_key_id     = "arn:aws:kms:ap-south-1:123456789012:key/abc-123"
  }
}

Provision the backend infrastructure itself (a bootstrapping problem: these resources are created manually or with a separate root module):

# bootstrap/main.tf
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform.arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

resource "aws_kms_key" "terraform" {
  description             = "KMS key for Terraform state encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

Backend Tradeoffs

| Backend         | State Locking        | Encryption       | Cost                 | Managed |
|-----------------|----------------------|------------------|----------------------|---------|
| S3 + DynamoDB   | Yes (DynamoDB)       | Yes (KMS)        | ~$1/mo               | No      |
| Terraform Cloud | Yes (built-in)       | Yes (built-in)   | Free (up to 5 users) | Yes     |
| Azure Storage   | Yes (blob lease)     | Yes (SSE)        | ~$1/mo               | No      |
| GCS             | Yes (built-in)       | Yes (CMEK)       | ~$1/mo               | No      |
| PostgreSQL      | Yes (advisory locks) | Depends on setup | Variable             | No      |

Terraform Cloud adds workspace management, Sentinel policies, and a UI, but it introduces a dependency on HashiCorp's SaaS platform. For teams that want to own their backend, S3 + DynamoDB is the standard choice.
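For comparison, the non-AWS backends in the table need similarly little configuration. A sketch of the GCS equivalent (bucket name is hypothetical; locking is built into the backend, so there is no lock-table resource to provision):

```hcl
terraform {
  backend "gcs" {
    bucket = "company-terraform-state"  # hypothetical bucket; enable object versioning on it
    prefix = "networking"               # state stored at networking/default.tfstate
  }
}
```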

Handling Stuck Locks

When terraform apply crashes mid-execution, the DynamoDB lock persists:

# Check who holds the lock
aws dynamodb get-item \
  --table-name terraform-locks \
  --key '{"LockID": {"S": "company-terraform-state/networking/terraform.tfstate"}}' \
  --query 'Item.Info.S' --output text | jq .

# Force unlock (use the Lock ID from terraform error message)
terraform force-unlock LOCK_ID

# Or manually delete from DynamoDB (last resort)
aws dynamodb delete-item \
  --table-name terraform-locks \
  --key '{"LockID": {"S": "company-terraform-state/networking/terraform.tfstate"}}'

Always investigate why the lock was stuck before force-unlocking. If an apply is still running (e.g., on a CI runner), force-unlocking can cause concurrent applies and state corruption.

State Segmentation

The single most impactful decision for Terraform at scale is how you split state files.

One State File for Everything (Anti-Pattern)

When all resources are in one state:

  • terraform plan refreshes every resource. With 200+ resources, this takes 5-15 minutes.
  • A typo in a networking change blocks a developer trying to deploy an application change.
  • State lock contention: only one person can plan/apply at a time.
  • Blast radius: a bad apply can affect networking, databases, and compute simultaneously.

Segmentation Strategies

By layer (recommended starting point):

infrastructure/
├── networking/          # VPC, subnets, route tables, NAT gateways
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── backend.tf      # key = "networking/terraform.tfstate"
├── data/                # RDS, ElastiCache, S3 buckets
│   ├── main.tf
│   └── backend.tf      # key = "data/terraform.tfstate"
├── compute/             # EKS, ECS, EC2, ASGs
│   ├── main.tf
│   └── backend.tf      # key = "compute/terraform.tfstate"
├── dns/                 # Route53 zones and records
│   ├── main.tf
│   └── backend.tf      # key = "dns/terraform.tfstate"
└── iam/                 # IAM roles, policies, users
    ├── main.tf
    └── backend.tf       # key = "iam/terraform.tfstate"

By service (for larger organizations):

infrastructure/
├── shared/
│   ├── networking/
│   └── iam/
├── payment-service/
│   ├── rds.tf
│   ├── ecs.tf
│   ├── s3.tf
│   └── backend.tf      # key = "services/payment/terraform.tfstate"
├── user-service/
│   └── ...
└── order-service/
    └── ...

By layer AND service (for large-scale, many-team environments):

infrastructure/
├── platform/            # Shared infrastructure
│   ├── networking/
│   ├── eks/
│   └── monitoring/
└── services/
    ├── payment/
    │   ├── data/       # Payment service's RDS, ElastiCache
    │   └── compute/    # Payment service's ECS tasks
    └── order/
        ├── data/
        └── compute/

Cross-State References

When resources in one state need values from another (e.g., compute needs VPC ID from networking), use terraform_remote_state or data sources:

# compute/main.tf
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "ap-south-1"
  }
}

module "ecs_cluster" {
  source = "../modules/ecs"

  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}

The dependency direction matters: compute reads from networking, never the reverse. This creates a clear layering: networking → data → compute → application.
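For the read above to work, the networking root module must export those values as outputs; terraform_remote_state can only see a state's outputs, not arbitrary resource attributes. A sketch (resource names are assumptions):

```hcl
# networking/outputs.tf
# Only declared outputs are visible to terraform_remote_state readers.
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
```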

Workspaces: When They Help and When They Hurt

Terraform workspaces create named state files within the same backend configuration:

terraform workspace new staging
terraform workspace new production
terraform workspace select staging
terraform plan  # Plans against staging state

When Workspaces Work

  • Same configuration, different parameters: A module that provisions an identical environment with different variable values (instance sizes, counts).
  • Short-lived environments: Feature branch environments that are created and destroyed frequently.
# Using workspace name to vary configuration
locals {
  env_config = {
    staging = {
      instance_type = "t3.medium"
      rds_class     = "db.t3.medium"
      min_capacity  = 1
    }
    production = {
      instance_type = "m6i.xlarge"
      rds_class     = "db.r6g.xlarge"
      min_capacity  = 3
    }
  }
  config = local.env_config[terraform.workspace]
}

When Workspaces Hurt

  • Different configurations per environment: Production has a WAF, staging does not. Production has multi-AZ RDS, staging does not. These differences accumulate into a tangle of count = terraform.workspace == "production" ? 1 : 0 conditionals.
  • Different providers per environment: Production is in ap-south-1, DR is in us-east-1. Provider configuration is fixed per root module.
  • Different access controls: If the staging team should not be able to terraform plan against production, workspaces provide no isolation; the same backend config accesses all workspaces.

For environments with structural differences, use separate root modules (one per environment) that call shared modules. Workspaces are for identical structures with different parameters.
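The separate-root-module layout looks like this in practice: each environment directory carries its own backend and provider configuration and calls the same shared modules with environment-specific values (paths, module name, and variables here are illustrative):

```hcl
# environments/production/main.tf
terraform {
  backend "s3" {
    bucket = "company-terraform-state"
    key    = "environments/production/terraform.tfstate"
    region = "ap-south-1"
  }
}

module "app" {
  source        = "../../modules/app"  # shared module, identical for every environment
  instance_type = "m6i.xlarge"         # structural differences live in variables,
  enable_waf    = true                 # not in workspace conditionals
}
```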

Module Design

Input/Output Contracts

A well-designed module has explicit inputs, explicit outputs, and no internal assumptions about naming:

# modules/rds/variables.tf
variable "identifier" {
  description = "RDS instance identifier"
  type        = string
  validation {
    condition     = can(regex("^[a-z][a-z0-9-]*$", var.identifier))
    error_message = "Identifier must start with a letter and contain only lowercase letters, numbers, and hyphens."
  }
}

variable "engine_version" {
  description = "PostgreSQL engine version"
  type        = string
  default     = "15.4"
}

variable "instance_class" {
  description = "RDS instance class"
  type        = string
}

variable "allocated_storage" {
  description = "Storage in GB"
  type        = number
  default     = 20
}

variable "vpc_id" {
  description = "VPC ID for security group"
  type        = string
}

variable "subnet_ids" {
  description = "Subnet IDs for DB subnet group"
  type        = list(string)
}

variable "allowed_cidr_blocks" {
  description = "CIDR blocks allowed to connect"
  type        = list(string)
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}
# modules/rds/outputs.tf
output "endpoint" {
  description = "RDS instance endpoint (host:port)"
  value       = aws_db_instance.this.endpoint
}

output "address" {
  description = "RDS instance hostname"
  value       = aws_db_instance.this.address
}

output "port" {
  description = "RDS instance port"
  value       = aws_db_instance.this.port
}

output "security_group_id" {
  description = "Security group ID attached to the instance"
  value       = aws_security_group.rds.id
}

Semantic Versioning for Modules

Tag modules with semantic versions and pin consumers to specific versions:

# Consuming a versioned module from a private registry
module "payment_db" {
  source  = "app.terraform.io/company/rds/aws"
  version = "~> 2.1.0"  # Accept 2.1.x patches, not 2.2.0

  identifier          = "payment-db"
  engine_version      = "15.4"
  instance_class      = "db.r6g.xlarge"
  allocated_storage   = 100
  vpc_id              = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids          = data.terraform_remote_state.networking.outputs.database_subnet_ids
  allowed_cidr_blocks = ["10.0.0.0/8"]
  tags                = local.common_tags
}

Or from a Git repository with tags:

module "payment_db" {
  source = "git::https://github.com/company/terraform-modules.git//rds?ref=rds-v2.1.3"

  identifier    = "payment-db"
  # ...
}

Private Module Registry

For teams not using Terraform Cloud, a Git-based module registry with tagged releases works well:

terraform-modules/
├── rds/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   ├── versions.tf
│   └── README.md
├── ecs/
├── vpc/
└── s3/
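The versions.tf in each module pins the Terraform and provider versions the module was tested against, so consumers fail fast on an incompatible toolchain rather than at apply time. A sketch (the specific version bounds are assumptions):

```hcl
# modules/rds/versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0"  # tested range; bump deliberately in a new module release
    }
  }
}
```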

Release workflow:

# Tag a module release
git tag rds-v2.1.3
git push origin rds-v2.1.3

CI/CD Integration with Atlantis

Plan on PR, Apply on Merge

# atlantis.yaml
version: 3
projects:
  - name: networking
    dir: infrastructure/networking
    terraform_version: v1.6.4
    autoplan:
      when_modified:
        - "*.tf"
        - "*.tfvars"
        - "../modules/vpc/**"
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: payment-data
    dir: infrastructure/services/payment/data
    terraform_version: v1.6.4
    autoplan:
      when_modified:
        - "*.tf"
        - "*.tfvars"
        - "../../../modules/rds/**"
      enabled: true
    apply_requirements:
      - approved
      - mergeable
    workflow: with-policy-check

workflows:
  with-policy-check:
    plan:
      steps:
        - init
        - plan
        - run: terraform show -json $PLANFILE > tfplan.json
        - run: |
            conftest test tfplan.json \
              --policy ../../../policies/ \
              --no-color
    apply:
      steps:
        - apply

Sentinel / Conftest Policies

Prevent unsafe changes before they reach production:

# policies/rds.rego
package main

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_db_instance"
  resource.change.after.publicly_accessible == true
  msg := sprintf("RDS instance %s must not be publicly accessible", [resource.address])
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_db_instance"
  resource.change.after.storage_encrypted == false
  msg := sprintf("RDS instance %s must have encryption enabled", [resource.address])
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_db_instance"
  not resource.change.after.deletion_protection
  msg := sprintf("RDS instance %s must have deletion protection enabled", [resource.address])
}

Refactoring Terraform

Moved Blocks (Terraform 1.1+)

Rename a resource without destroying and recreating it:

# Before: resource was named "main"
# After: renamed to "primary"

moved {
  from = aws_db_instance.main
  to   = aws_db_instance.primary
}

resource "aws_db_instance" "primary" {
  # ... same configuration
}

State Move for Cross-Module Refactoring

When extracting resources into a module:

# Move a resource from root module into a child module
terraform state mv \
  'aws_db_instance.payment_db' \
  'module.payment_db.aws_db_instance.this'

# Move a resource between state files
terraform state mv \
  -state=monolith/terraform.tfstate \
  -state-out=services/payment/data/terraform.tfstate \
  'aws_db_instance.payment_db' \
  'aws_db_instance.payment_db'

Import Existing Resources

Bring unmanaged resources under Terraform control:

# Terraform 1.5+ import blocks
import {
  to = aws_s3_bucket.legacy_uploads
  id = "legacy-uploads-bucket"
}

resource "aws_s3_bucket" "legacy_uploads" {
  bucket = "legacy-uploads-bucket"
  # Run terraform plan to see what attributes need to be declared
}
# Generate configuration from existing resource
terraform plan -generate-config-out=generated.tf
# Review generated.tf, clean up, and move to appropriate file

Common Anti-Patterns

Everything in one state file: Already discussed. Split by layer or service.

Hardcoded values: Use variables and data sources. Hardcoded AMI IDs, account numbers, and region names create silent failures when moving between environments.

# Anti-pattern
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"  # What is this? Is it current?
  instance_type = "t3.medium"
  subnet_id     = "subnet-0bb1c79de3EXAMPLE"  # Hardcoded subnet
}

# Better
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}

No module versioning: Modules consumed via relative path (source = "../modules/rds") are always "latest." A change to the module immediately affects all consumers. Use Git tags or a registry for stable versioning.

No state backup: Enable S3 versioning. Without it, a corrupted state file is unrecoverable.

Case Study: Splitting a Monolithic State File

A 5-team organization had accumulated 200+ AWS resources in a single Terraform state file over 2 years. The symptoms:

  • terraform plan took 12-15 minutes (refreshing every resource)
  • State lock conflicts happened 2-3 times per week (multiple teams trying to plan simultaneously)
  • One team's networking change blocked another team's application deployment for hours
  • The state file was 8MB of JSON; merge conflicts occurred during manual recovery attempts
  • No one on the team fully understood all 200+ resources

Approach

The Stripe Systems infrastructure team performed the state split over 4 weeks:

Week 1: Inventory and planning. Ran terraform state list to catalog all 214 resources. Categorized each resource into one of 15 state files:

| State File         | Resources | Teams                |
|--------------------|-----------|----------------------|
| networking/vpc     | 23        | Platform             |
| networking/dns     | 12        | Platform             |
| data/rds           | 18        | Platform + App teams |
| data/elasticache   | 8         | Platform             |
| data/s3            | 31        | All teams            |
| compute/eks        | 15        | Platform             |
| compute/ecs        | 22        | App teams            |
| iam/roles          | 28        | Platform             |
| services/payment   | 14        | Payment team         |
| services/order     | 11        | Order team           |
| services/user      | 9         | User team            |
| services/search    | 8         | Search team          |
| services/analytics | 7         | Analytics team       |
| monitoring         | 5         | Platform             |
| cicd               | 3         | Platform             |

Week 2: Module extraction. Created reusable modules for VPC, RDS, ECS, and S3 patterns. Tagged each module with an initial v1.0.0.

Week 3: State moves. This is the critical and dangerous phase. For each state file:

# 1. Create the new backend configuration
cd infrastructure/networking/vpc
cat > backend.tf <<EOF
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "networking/vpc/terraform.tfstate"
    region         = "ap-south-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
EOF

# 2. Write the Terraform configuration using the new module
cat > main.tf <<EOF
module "vpc" {
  source = "../../modules/vpc"
  # ... parameters
}
EOF

# 3. Move resources from monolithic state to new state
terraform state mv \
  -state=../../monolith/terraform.tfstate \
  -state-out=terraform.tfstate \
  'aws_vpc.main' \
  'module.vpc.aws_vpc.this'

terraform state mv \
  -state=../../monolith/terraform.tfstate \
  -state-out=terraform.tfstate \
  'aws_subnet.private[0]' \
  'module.vpc.aws_subnet.private[0]'

# ... repeat for each resource

# 4. Initialize with new backend and push state
terraform init
terraform state push terraform.tfstate

# 5. Verify with a plan (should show no changes)
terraform plan
# "No changes. Your infrastructure matches the configuration."

Each state move was performed during a maintenance window with the DynamoDB lock held to prevent concurrent modifications.

Week 4: CI/CD setup. Configured Atlantis with per-directory project configuration. Each state file became an independent Atlantis project with its own plan/apply lifecycle.

Directory Structure (Final)

infrastructure/
├── modules/
│   ├── vpc/            # v1.0.0
│   ├── rds/            # v1.0.0
│   ├── ecs/            # v1.0.0
│   └── s3/             # v1.0.0
├── networking/
│   ├── vpc/
│   └── dns/
├── data/
│   ├── rds/
│   ├── elasticache/
│   └── s3/
├── compute/
│   ├── eks/
│   └── ecs/
├── iam/
├── services/
│   ├── payment/
│   ├── order/
│   ├── user/
│   ├── search/
│   └── analytics/
├── monitoring/
├── cicd/
├── atlantis.yaml
└── policies/
    ├── rds.rego
    └── security.rego

Results

| Metric                             | Before            | After           | Improvement |
|------------------------------------|-------------------|-----------------|-------------|
| Plan time (networking)             | 12 min            | 45 sec          | 94%         |
| Plan time (single service)         | 12 min            | 30 sec          | 96%         |
| Plan time (worst case, all states) | 12 min            | 1 min 50 sec    | 85%         |
| State lock conflicts/week          | 2-3               | 0               | 100%        |
| Teams blocked by other teams       | Daily             | Never           | 100%        |
| Blast radius of bad apply          | All 214 resources | 8-31 resources  | 85-96%      |

The apply time improvement followed the same pattern: smaller state files mean fewer resources to reconcile. The 5 teams could now plan and apply independently, and the Conftest policies prevented unsafe changes before they reached terraform apply.

The critical lesson: state segmentation is not just a performance optimization. It is an organizational boundary that enables team autonomy. When the payment team can plan and apply their infrastructure without waiting for the networking team's lock to release, deployment frequency increases and coordination overhead drops.
