Stripe Systems
Cloud Computing · March 5, 2026 · 18 min read

FinOps in Practice: How We Cut a Client's AWS Bill by 40% Without Touching Their Codebase

By Stripe Systems Engineering

Most organizations overspend on AWS by 25–35%. Not because their engineers are careless, but because cloud billing is structurally opaque. Pricing varies by region, instance family, tenancy, payment model, and data path — and none of that complexity is visible from application code.

This post documents a real FinOps engagement where we reduced a client's monthly AWS bill from $47,000 to $28,000. No application code was modified. Every change was at the infrastructure, configuration, and governance layer. We will walk through each optimization with the exact CLI commands, JSON policies, and cost math we used.

The FinOps Framework: Inform → Optimize → Operate

FinOps is not a one-time audit. It is a continuous practice structured around three phases:

Inform — Establish visibility into who is spending what, and where. This means tagging, cost allocation, and reporting. You cannot optimize what you cannot attribute.

Optimize — Act on the data. Right-size instances, purchase commitments, restructure storage tiers, eliminate waste. Each action has a specific cost-benefit tradeoff that must be calculated.

Operate — Build governance and automation so costs do not drift back. Budgets, alerts, service control policies, and a weekly review cadence. The goal is to make cost efficiency a sustained organizational behavior, not a quarterly panic.

Most teams skip the Inform phase and jump straight to "let's buy Reserved Instances." That is a mistake. Without accurate attribution, you are guessing.

Phase 1 — Visibility: Tagging, CUR, and Cost Explorer

Tagging Strategy

Every resource must carry four mandatory tags. Without them, cost data is noise.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceRequiredTags",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "s3:CreateBucket",
        "elasticloadbalancing:CreateLoadBalancer"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/environment": "true",
          "aws:RequestTag/team": "true",
          "aws:RequestTag/service": "true",
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}

The four tags:

  • environment: production, staging, development, sandbox
  • team: The engineering team that owns this resource (e.g., payments, platform, data)
  • service: The application or microservice this resource belongs to (e.g., api-gateway, order-processor)
  • cost-center: Maps to the business unit responsible for the budget

Deploy this as a Service Control Policy (SCP) at the organizational level so it applies to all accounts.
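Before the SCP lands, it helps to audit what already exists. A minimal sketch of the compliance check — the tag keys mirror the policy above, while the `missing_tags` helper is illustrative; its input shape matches the tag lists returned by APIs such as EC2 describe-instances:

```python
# Mandatory tags enforced by the SCP above
REQUIRED_TAGS = {"environment", "team", "service", "cost-center"}

def missing_tags(resource_tags):
    """Return the mandatory tag keys absent (or empty) on a resource.

    resource_tags: a list of {"Key": ..., "Value": ...} dicts, the shape
    returned by APIs such as EC2 describe-instances.
    """
    present = {t["Key"] for t in resource_tags if t.get("Value")}
    return REQUIRED_TAGS - present

# An instance tagged with only two of the four mandatory keys
tags = [
    {"Key": "environment", "Value": "production"},
    {"Key": "team", "Value": "payments"},
]
print(sorted(missing_tags(tags)))  # ['cost-center', 'service']
```

Run a helper like this across every account before enabling the deny policy, so existing untagged resources are fixed rather than silently stranded.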

Cost and Usage Reports (CUR) with Athena

CUR provides line-item billing data. Setting up Athena queries over CUR is the foundation for any serious cost analysis.

Enable CUR delivery to S3:

aws cur put-report-definition \
  --report-definition '{
    "ReportName": "daily-cur",
    "TimeUnit": "DAILY",
    "Format": "Parquet",
    "Compression": "Parquet",
    "S3Bucket": "company-billing-cur",
    "S3Prefix": "cur-data",
    "S3Region": "us-east-1",
    "AdditionalSchemaElements": ["RESOURCES", "SPLIT_COST_ALLOCATION_DATA"],
    "RefreshClosedReports": true,
    "ReportVersioning": "OVERWRITE_REPORT"
  }'

Once the CUR data lands in S3, create an Athena table over it (AWS delivers a ready-made CloudFormation template alongside the CUR files when Athena integration is enabled, or you can point a Glue crawler at the prefix). Then query it:

-- Top 10 most expensive services by team
SELECT
  line_item_product_code AS service,
  resource_tags_user_team AS team,
  SUM(line_item_unblended_cost) AS total_cost
FROM cur_database.cur_table
WHERE month = '3' AND year = '2026'
GROUP BY 1, 2
ORDER BY total_cost DESC
LIMIT 10;

-- Daily spend trend for EC2
SELECT
  line_item_usage_start_date AS usage_date,
  SUM(line_item_unblended_cost) AS daily_cost
FROM cur_database.cur_table
WHERE line_item_product_code = 'AmazonEC2'
  AND month = '3' AND year = '2026'
GROUP BY 1
ORDER BY 1;

Cost Explorer Filters

Cost Explorer is useful for quick analysis without writing SQL. The filters we used most:

  • Group by: Tag → service — reveals which microservice costs the most
  • Filter: Usage Type = NatGateway-Bytes — isolates NAT Gateway data processing charges (often a hidden cost driver)
  • Filter: Purchase Option = On-Demand — shows only resources without any discount commitment, the primary optimization target
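The same group-by-tag breakdown is available programmatically through the Cost Explorer API. A sketch of the request payload — `boto3.client("ce").get_cost_and_usage` accepts this shape; the builder function and date range are illustrative:

```python
def cost_by_service_tag(start, end):
    """Build a Cost Explorer get_cost_and_usage request that groups
    unblended spend by the 'service' cost-allocation tag."""
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": "service"}],
    }

params = cost_by_service_tag("2026-03-01", "2026-04-01")
# Pass to boto3: boto3.client("ce").get_cost_and_usage(**params)
print(params["GroupBy"])  # [{'Type': 'TAG', 'Key': 'service'}]
```

Swap the `GroupBy` key for `team` or `cost-center` to produce the other attribution views.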

Right-Sizing EC2 Instances

Right-sizing is the highest-ROI optimization in nearly every engagement. Most instances are provisioned for peak load that rarely occurs.

Identifying Candidates with CloudWatch

Pull 14 days of average CPU utilization for all instances:

# Get CPU utilization for a specific instance over 14 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --start-time 2026-02-19T00:00:00Z \
  --end-time 2026-03-05T00:00:00Z \
  --period 86400 \
  --statistics Average Maximum \
  --output table

For memory utilization (requires CloudWatch Agent):

aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --start-time 2026-02-19T00:00:00Z \
  --end-time 2026-03-05T00:00:00Z \
  --period 86400 \
  --statistics Average Maximum \
  --output table

Decision Thresholds

Our criteria:

  • Average CPU < 20% over 14 days AND peak CPU < 50%: right-size candidate — downsize by one instance class
  • Average CPU < 10% over 14 days AND peak CPU < 30%: aggressive right-size candidate — downsize by two instance classes or consider Graviton
  • Average memory < 30% over 14 days: memory over-provisioned — switch to a compute-optimized family or smaller instance
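The thresholds above translate directly into a classification function. A sketch — the function name and labels are invented for illustration:

```python
def rightsize_action(avg_cpu, peak_cpu, avg_mem=None):
    """Classify an instance against the 14-day utilization thresholds.

    Percentages are 0-100; avg_mem requires the CloudWatch Agent and is
    checked only when the CPU rules do not already fire.
    """
    if avg_cpu < 10 and peak_cpu < 30:
        return "downsize two classes (or Graviton)"
    if avg_cpu < 20 and peak_cpu < 50:
        return "downsize one class"
    if avg_mem is not None and avg_mem < 30:
        return "memory over-provisioned"
    return "correctly sized"

print(rightsize_action(12, 45))  # downsize one class
print(rightsize_action(8, 25))   # downsize two classes (or Graviton)
```

Feeding the CloudWatch output from the commands above through a function like this turns the audit into a repeatable script rather than a spreadsheet exercise.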

Example Downsizes

| Original Instance | Avg CPU | Avg Memory | New Instance | Monthly Savings |
|---|---|---|---|---|
| m5.2xlarge (8 vCPU, 32 GB) | 12% | 22% | m5.xlarge (4 vCPU, 16 GB) | $140 |
| r5.xlarge (4 vCPU, 32 GB) | 8% | 18% | m7g.large (2 vCPU, 8 GB) | $180 |
| c5.4xlarge (16 vCPU, 32 GB) | 15% | 35% | c5.2xlarge (8 vCPU, 16 GB) | $200 |

The Graviton (m7g, c7g, r7g) instances deserve special attention: they are approximately 20% cheaper than equivalent x86 instances and deliver equal or better performance for most workloads. If your application runs on Linux and does not depend on x86-specific binaries, Graviton is free savings.

To list all running instances with their types for a bulk audit:

aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{
    ID:InstanceId,
    Type:InstanceType,
    Name:Tags[?Key==`Name`]|[0].Value,
    LaunchTime:LaunchTime
  }' \
  --output table

Reserved Instances vs. Savings Plans

The Math

Suppose an m5.xlarge in us-east-1 costs:

  • On-demand: $0.192/hr → $140.16/month
  • 1-year No Upfront Savings Plan: $0.121/hr → $88.33/month (37% discount)
  • 1-year All Upfront Savings Plan: $0.115/hr → $83.95/month (40% discount)
  • 1-year No Upfront Reserved Instance (standard): $0.120/hr → $87.60/month (37% discount)

Break-Even Analysis

For a 1-year No Upfront Compute Savings Plan at $0.121/hr:

Savings per hour = $0.192 - $0.121 = $0.071
Commitment per hour = $0.121 (you pay this regardless of usage)

Break-even utilization = commitment / on-demand rate
                       = $0.121 / $0.192
                       = 63%

If you are running the workload at least 63% of the time, the Savings Plan saves money. For production workloads running 24/7, this is trivially met.
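The arithmetic is worth scripting so it can be rerun against current pricing; the rates below are the ones quoted above:

```python
ON_DEMAND = 0.192  # m5.xlarge us-east-1, $/hr
SP_RATE = 0.121    # 1-year No Upfront Compute Savings Plan, $/hr

# Fraction of hours the workload must run for the commitment to pay off
break_even = SP_RATE / ON_DEMAND

# Monthly saving for a 24/7 workload, at 730 hours/month
monthly_saving = (ON_DEMAND - SP_RATE) * 730

print(f"break-even utilization: {break_even:.0%}")                 # 63%
print(f"saving at full utilization: ${monthly_saving:.2f}/month")  # $51.83/month
```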

Why Savings Plans Beat Reserved Instances

Reserved Instances are locked to a specific instance family, region, and tenancy. Savings Plans (specifically Compute Savings Plans) apply to any instance family, any region, any OS, and even Fargate and Lambda. If you right-size an instance or migrate to Graviton after purchasing an RI, the RI may no longer apply. A Compute Savings Plan automatically follows you.

The only scenario where RIs still make sense is if you need capacity reservation (On-Demand Capacity Reservations paired with RIs). For pure cost optimization, Savings Plans are the correct instrument.

Coverage recommendation: commit to Savings Plans for 60–70% of your steady-state compute. Leave the remaining 30–40% as on-demand or Spot to absorb variability.

Spot Instances for Non-Critical Workloads

Spot instances offer 60–90% discounts but can be reclaimed with two minutes' notice. They are appropriate for batch processing, CI/CD runners, stateless web workers behind a load balancer, and data pipelines.

Spot Fleet Configuration

Diversify across instance types and Availability Zones to reduce interruption probability:

{
  "SpotFleetRequestConfig": {
    "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
    "TargetCapacity": 10,
    "SpotPrice": "0.10",
    "TerminateInstancesWithExpiration": true,
    "Type": "maintain",
    "AllocationStrategy": "capacityOptimized",
    "LaunchSpecifications": [
      {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m5.xlarge",
        "SubnetId": "subnet-aaa111,subnet-bbb222,subnet-ccc333",
        "WeightedCapacity": 1
      },
      {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m5a.xlarge",
        "SubnetId": "subnet-aaa111,subnet-bbb222,subnet-ccc333",
        "WeightedCapacity": 1
      },
      {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m6i.xlarge",
        "SubnetId": "subnet-aaa111,subnet-bbb222,subnet-ccc333",
        "WeightedCapacity": 1
      },
      {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m7g.xlarge",
        "SubnetId": "subnet-aaa111,subnet-bbb222,subnet-ccc333",
        "WeightedCapacity": 1
      }
    ]
  }
}

Key settings:

  • AllocationStrategy: capacityOptimized — selects pools with the most available capacity, reducing interruption rates. This is preferred over lowestPrice for production workloads.
  • Four instance types across three AZs — gives the fleet 12 capacity pools to draw from. The more pools, the lower the probability of simultaneous interruption.

Graceful Shutdown Handling

Every Spot instance should poll the instance metadata service for the interruption notice:

#!/bin/bash
# /usr/local/bin/spot-interrupt-handler.sh
# Poll every 5 seconds from a loop in a systemd service
# (cron cannot schedule sub-minute intervals)

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# The instance-action endpoint returns 404 until an interruption is scheduled
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action)

if [ "$HTTP_CODE" = "200" ]; then
  echo "Spot interruption notice received. Draining..."
  # Deregister from load balancer target group
  INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/instance-id)

  aws elbv2 deregister-targets \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123 \
    --targets Id="$INSTANCE_ID"

  # Allow in-flight requests to complete
  sleep 30

  # Stop application gracefully
  systemctl stop my-application
fi

Storage Optimization

S3 Lifecycle Policies

Most S3 buckets accumulate data indefinitely. A lifecycle policy moves aging objects to cheaper tiers automatically:

{
  "Rules": [
    {
      "ID": "ArchiveLogs",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 730
      }
    },
    {
      "ID": "CleanupIncompleteUploads",
      "Status": "Enabled",
      "Filter": {
        "Prefix": ""
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}

Apply it:

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-application-logs \
  --lifecycle-configuration file://lifecycle-policy.json

Pricing comparison (us-east-1, per GB/month):

| Tier | Cost (per GB/month) | Retrieval Cost | Use Case |
|---|---|---|---|
| S3 Standard | $0.023 | None | Active data |
| S3 Standard-IA | $0.0125 | $0.01/GB | Accessed < 1x/month |
| S3 Glacier Instant Retrieval | $0.004 | $0.03/GB | Quarterly access |
| S3 Glacier Flexible Retrieval | $0.0036 | $0.01/GB (5–12 hrs) | Annual audits |
| S3 Deep Archive | $0.00099 | $0.02/GB (12–48 hrs) | Compliance retention |

Moving logs from Standard to Glacier Flexible Retrieval saves approximately $19/TB/month — roughly $78/month for 4 TB, before retrieval charges (negligible for logs that are almost never read).
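Worked out in code (decimal TB, retrieval charges excluded; the helper is illustrative):

```python
STANDARD = 0.023       # $/GB-month
GLACIER_FLEX = 0.0036  # $/GB-month

def monthly_savings(tb, from_price, to_price, gb_per_tb=1000):
    """Monthly storage saving from moving `tb` (decimal) terabytes
    between tiers; retrieval charges are excluded."""
    return tb * gb_per_tb * (from_price - to_price)

print(round(monthly_savings(1, STANDARD, GLACIER_FLEX), 2))  # 19.4
print(round(monthly_savings(4, STANDARD, GLACIER_FLEX), 2))  # 77.6
```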

EBS Volume Migration: gp2 → gp3

gp3 volumes are 20% cheaper than gp2 at baseline and provide 3,000 IOPS and 125 MB/s throughput included (gp2 provides 3 IOPS/GB with burst credits, which means a 100 GB gp2 volume only gets 300 baseline IOPS).

Migrate in place with zero downtime:

# Find all gp2 volumes
aws ec2 describe-volumes \
  --filters "Name=volume-type,Values=gp2" \
  --query 'Volumes[].{ID:VolumeId,Size:Size,State:State,AZ:AvailabilityZone}' \
  --output table

# Modify a volume from gp2 to gp3
aws ec2 modify-volume \
  --volume-id vol-0abc123def456789a \
  --volume-type gp3 \
  --iops 3000 \
  --throughput 125

For a 500 GB volume: gp2 costs $0.10/GB/month = $50. gp3 costs $0.08/GB/month = $40. That is $10/month per volume, and it adds up fast across 40 volumes.
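Scaled across a fleet — the prices are the gp2/gp3 baseline rates above, and the uniform 500 GB volume size is an assumption for illustration:

```python
GP2 = 0.10  # $/GB-month
GP3 = 0.08  # $/GB-month baseline

def gp3_migration_saving(volume_sizes_gb):
    """Total monthly saving from migrating gp2 volumes to gp3
    (baseline storage price only; extra provisioned IOPS excluded)."""
    return sum(size * (GP2 - GP3) for size in volume_sizes_gb)

# 40 volumes, assumed 500 GB each
print(round(gp3_migration_saving([500] * 40), 2))  # 400.0
```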

Snapshot Cleanup

Orphaned snapshots accumulate when instances are terminated but their snapshots remain:

# List all snapshots owned by this account
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<=`2025-12-01`].{
    ID:SnapshotId,
    Size:VolumeSize,
    Created:StartTime,
    VolumeId:VolumeId,
    Description:Description
  }' \
  --output table

# Find snapshots whose source volume no longer exists
for snap_id in $(aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[].SnapshotId' --output text); do

  vol_id=$(aws ec2 describe-snapshots --snapshot-ids "$snap_id" \
    --query 'Snapshots[0].VolumeId' --output text)

  if [ "$vol_id" != "None" ] && [ "$vol_id" != "vol-ffffffff" ]; then
    state=$(aws ec2 describe-volumes --volume-ids "$vol_id" \
      --query 'Volumes[0].State' --output text 2>/dev/null)
    if [ $? -ne 0 ]; then
      echo "ORPHANED: $snap_id (volume $vol_id no longer exists)"
    fi
  fi
done

# Delete a confirmed orphaned snapshot
aws ec2 delete-snapshot --snapshot-id snap-0abc123def456789a

Database Cost Optimization

Aurora I/O-Optimized vs. Standard

Aurora offers two pricing models:

Standard: Lower instance cost, pay per I/O operation ($0.20 per million I/O requests)

I/O-Optimized: 30% higher instance cost, zero I/O charges

The crossover math:

Standard monthly cost = instance_cost + (io_requests × $0.20 / 1,000,000)
IO-Optimized monthly cost = instance_cost × 1.30

Break-even: instance_cost × 0.30 = io_requests × $0.20 / 1,000,000

For an r6g.xlarge ($0.52/hr = $379.60/month):
Break-even I/O = ($379.60 × 0.30) / $0.20 × 1,000,000
               = $113.88 / $0.20 × 1,000,000
               = 569,400,000 I/O requests/month
               ≈ 570 million I/O requests/month

If your Aurora cluster processes more than 570 million I/O operations per month, I/O-Optimized is cheaper. Check your current I/O in CloudWatch:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name VolumeReadIOPs \
  --dimensions Name=DBClusterIdentifier,Value=my-aurora-cluster \
  --start-time 2026-02-01T00:00:00Z \
  --end-time 2026-03-01T00:00:00Z \
  --period 2592000 \
  --statistics Sum \
  --output json

Add VolumeWriteIOPs to the total. If the sum exceeds the break-even threshold, switch to I/O-Optimized. The switch is a single API call with no downtime.
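The crossover calculation, using the r6g.xlarge figures above:

```python
INSTANCE_HOURLY = 0.52  # Aurora r6g.xlarge, $/hr
IO_PRICE = 0.20         # $ per 1M I/O requests (Standard pricing)
PREMIUM = 0.30          # I/O-Optimized instance-price uplift

monthly_instance = INSTANCE_HOURLY * 730  # ≈ $379.60

# I/O-Optimized wins once the I/O bill exceeds the 30% instance uplift
break_even_io = monthly_instance * PREMIUM / IO_PRICE * 1_000_000

print(f"{break_even_io:,.0f} I/O requests/month")  # 569,400,000 I/O requests/month
```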

RDS Right-Sizing and Read Replica Consolidation

Apply the same CPU/memory analysis as EC2. Additionally, check for read replicas that exist but receive minimal traffic:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=my-read-replica-01 \
  --start-time 2026-02-19T00:00:00Z \
  --end-time 2026-03-05T00:00:00Z \
  --period 86400 \
  --statistics Average Maximum \
  --output table

If a read replica averages fewer than 5 connections over 14 days, it is likely serving a query that could be routed to the primary or to another replica. Consolidating from 3 read replicas to 1 saves two full instance costs.

NAT Gateway: The Silent Budget Killer

A single NAT Gateway costs $0.045/hr (about $33/month) plus $0.045/GB of processed data. For a workload pushing 20 TB/month through NAT (common when services pull from S3, ECR, or DynamoDB), the data processing charge alone is $900/month per NAT Gateway.

If each of two NAT Gateways (one per AZ for high availability) carries that volume, that is $1,800/month in data processing charges — for traffic that could be free.
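A quick model of the total NAT bill — rates from above; the two-gateway, 18 TB/month profile matches the case study later in this post:

```python
NAT_HOURLY = 0.045  # $/hr per NAT Gateway
NAT_PER_GB = 0.045  # $/GB of processed data

def nat_monthly_cost(gateways, tb_processed_total, hours=730):
    """Monthly NAT bill: hourly charge per gateway plus data processing
    on the total (decimal) TB pushed through them."""
    return gateways * NAT_HOURLY * hours + tb_processed_total * 1000 * NAT_PER_GB

# Two gateways carrying 18 TB/month in total
print(round(nat_monthly_cost(2, 18), 2))  # 875.7
```

Note how lopsided the bill is: the hourly charge is about $66 of that total, and data processing is the other $810 — which is exactly the portion VPC endpoints eliminate.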

The Fix: VPC Endpoints

Gateway endpoints (S3 and DynamoDB) are free. No hourly charge, no data processing charge.

# Create a gateway endpoint for S3
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc123 rtb-0def456

# Create a gateway endpoint for DynamoDB
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids rtb-0abc123 rtb-0def456

Interface endpoints (ECR, STS, CloudWatch, Secrets Manager, etc.) cost $0.01/hr per AZ plus $0.01/GB processed. Still far cheaper than NAT Gateway for AWS service traffic.

# Create an interface endpoint for ECR (Docker pulls)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids subnet-aaa111 subnet-bbb222 \
  --security-group-ids sg-0abc123def456789a \
  --private-dns-enabled

# Also create endpoints for ECR API and S3 (ECR uses S3 for layers)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-aaa111 subnet-bbb222 \
  --security-group-ids sg-0abc123def456789a \
  --private-dns-enabled

# STS endpoint (used by IAM role assumption, very chatty)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.sts \
  --subnet-ids subnet-aaa111 subnet-bbb222 \
  --security-group-ids sg-0abc123def456789a \
  --private-dns-enabled

To determine how much traffic is going through your NAT Gateway, check Cost Explorer with usage type filter NatGateway-Bytes or query CloudWatch:

aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-0abc123def456789a \
  --start-time 2026-02-01T00:00:00Z \
  --end-time 2026-03-01T00:00:00Z \
  --period 2592000 \
  --statistics Sum \
  --output json

Data Transfer Optimization

CloudFront for Origin Offload

Data transfer from EC2/ALB to the internet costs $0.09/GB. Data transfer from CloudFront is $0.085/GB at the first tier and drops with volume. But the real savings come from cache hits — a cached response at CloudFront does not generate an origin fetch, so you pay only the CloudFront edge cost with zero origin data transfer.

CloudFront configuration for an ALB origin with aggressive caching:

{
  "Origins": {
    "Items": [
      {
        "Id": "alb-origin",
        "DomainName": "internal-api-alb-123456.us-east-1.elb.amazonaws.com",
        "CustomOriginConfig": {
          "HTTPPort": 80,
          "HTTPSPort": 443,
          "OriginProtocolPolicy": "https-only",
          "OriginReadTimeout": 30,
          "OriginKeepaliveTimeout": 5
        }
      }
    ],
    "Quantity": 1
  },
  "DefaultCacheBehavior": {
    "TargetOriginId": "alb-origin",
    "ViewerProtocolPolicy": "redirect-to-https",
    "CachePolicyId": "658327ea-f89d-4fab-a63d-7e88639e58f6",
    "Compress": true,
    "AllowedMethods": ["GET", "HEAD", "OPTIONS"],
    "CachedMethods": ["GET", "HEAD"]
  }
}

The CachePolicyId above is the AWS managed CachingOptimized policy. For API responses, create a custom cache policy with appropriate TTLs based on your data freshness requirements.

For static assets (JS, CSS, images), set Cache-Control: public, max-age=31536000, immutable at the origin. A well-configured CloudFront distribution achieves 85–95% cache hit ratios for static content, meaning only 5–15% of requests reach your origin.
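To estimate the transfer-cost side, note that origin-to-CloudFront transfer is free for AWS origins, so the cache hit ratio mostly determines origin load rather than transfer dollars. A sketch with the rates from this section (the functions and the 10 TB figure are illustrative):

```python
ORIGIN_EGRESS = 0.09  # $/GB, EC2/ALB direct to the internet
CF_EGRESS = 0.085     # $/GB, CloudFront first pricing tier

def direct_egress_cost(gb):
    return gb * ORIGIN_EGRESS

def cloudfront_egress_cost(gb):
    # Origin-to-CloudFront transfer is free for AWS origins, so every
    # byte bills once at the edge rate regardless of hit ratio.
    return gb * CF_EGRESS

def origin_request_fraction(hit_ratio):
    """Share of viewer requests that still reach the origin."""
    return 1 - hit_ratio

# 10 TB/month of responses at a 90% cache hit ratio
print(round(direct_egress_cost(10_000), 2))      # 900.0
print(round(cloudfront_egress_cost(10_000), 2))  # 850.0
print(round(origin_request_fraction(0.90), 2))   # 0.1
```

The edge-rate discount is modest; the larger operational win is that only a tenth of the requests reach the origin, which often lets you shrink the origin fleet itself.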

VPC Endpoints for AWS Service Traffic

As covered in the NAT Gateway section, VPC endpoints eliminate data transfer charges for traffic to AWS services. This is data transfer optimization — the traffic still occurs, but the cost drops to zero (for gateway endpoints) or near-zero (for interface endpoints).

Finding and Eliminating Unused Resources

Unused resources are pure waste. These commands identify them:

Unattached EBS Volumes

aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[].{
    ID:VolumeId,
    Size:Size,
    Type:VolumeType,
    Created:CreateTime,
    AZ:AvailabilityZone
  }' \
  --output table

An available status means the volume is not attached to any instance. If it has been available for more than 7 days, it is almost certainly waste.

Idle Load Balancers

# Find ALBs with zero requests in the last 14 days
for alb_arn in $(aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[].LoadBalancerArn' --output text); do

  alb_name=$(echo "$alb_arn" | awk -F'/' '{print $(NF-1)"/"$NF}')

  request_count=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name RequestCount \
    --dimensions Name=LoadBalancer,Value="app/$alb_name" \
    --start-time 2026-02-19T00:00:00Z \
    --end-time 2026-03-05T00:00:00Z \
    --period 1209600 \
    --statistics Sum \
    --query 'Datapoints[0].Sum' \
    --output text 2>/dev/null)

  if [ "$request_count" = "None" ] || [ "$request_count" = "0.0" ]; then
    echo "IDLE ALB: $alb_arn"
  fi
done

Each idle ALB costs approximately $16.20/month (hourly charge) plus LCU charges. With zero traffic, it is still $16.20/month of waste.

Unused Elastic IPs

aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==null].{
    IP:PublicIp,
    AllocationId:AllocationId,
    Tags:Tags
  }' \
  --output table

Unassociated Elastic IPs cost $3.60/month each (as of the 2024 pricing change where AWS began charging for all public IPv4 addresses). Eight unused EIPs = $28.80/month.

Governance: Budgets, SCPs, and Automation

AWS Budgets with Alerts

aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "monthly-total",
    "BudgetLimit": {
      "Amount": "30000",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {},
    "CostTypes": {
      "IncludeTax": true,
      "IncludeSubscription": true,
      "UseBlended": false
    }
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "[email protected]"
        },
        {
          "SubscriptionType": "SNS",
          "Address": "arn:aws:sns:us-east-1:123456789012:budget-alerts"
        }
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "[email protected]"
        }
      ]
    }
  ]'

Two alert thresholds: actual spend exceeding 80% of budget, and forecasted spend exceeding 100%. The forecasted alert is critical — it triggers before you overspend, giving you time to react.

Service Control Policies to Prevent Expensive Mistakes

This SCP prevents anyone from launching instances larger than 4xlarge or using expensive instance families in non-production accounts:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyLargeInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "ForAnyValue:StringLike": {
          "ec2:InstanceType": [
            "*.8xlarge",
            "*.12xlarge",
            "*.16xlarge",
            "*.24xlarge",
            "*.metal",
            "p3.*",
            "p4d.*",
            "p5.*",
            "g5.*",
            "inf1.*",
            "inf2.*"
          ]
        }
      }
    },
    {
      "Sid": "DenyExpensiveRDS",
      "Effect": "Deny",
      "Action": "rds:CreateDBInstance",
      "Resource": "*",
      "Condition": {
        "ForAnyValue:StringLike": {
          "rds:DatabaseClass": [
            "db.r5.8xlarge",
            "db.r5.12xlarge",
            "db.r5.16xlarge",
            "db.r5.24xlarge",
            "db.r6g.8xlarge",
            "db.r6g.12xlarge",
            "db.r6g.16xlarge"
          ]
        }
      }
    }
  ]
}

Attach this SCP to the organizational unit containing development and staging accounts. Production accounts may need larger instances, so scope accordingly.

Automated Dev Environment Shutdown

Development and staging environments do not need to run 24/7. Shutting them down outside business hours (e.g., 7 PM to 7 AM IST, plus weekends) cuts their compute cost by roughly 64% — the instances run only 60 of the 168 hours in a week.
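The savings fraction follows directly from the schedule (60 running hours out of 168 per week, so just over 64% of hours are saved):

```python
def scheduled_savings(hours_per_weekday=12, weekdays=5):
    """Fraction of compute cost saved by running instances only during
    business hours (default: 7 AM-7 PM on weekdays, off on weekends)."""
    running = hours_per_weekday * weekdays
    return 1 - running / (24 * 7)

print(f"{scheduled_savings():.1%}")  # 64.3%
```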

Use AWS Instance Scheduler or a simple EventBridge rule with a Lambda function. The tag-based approach works well:

# Tag instances that should be auto-stopped
aws ec2 create-tags \
  --resources i-0abc123def456789 \
  --tags Key=auto-shutdown,Value=true

# EventBridge rule (cron: 7 PM IST = 1:30 PM UTC)
aws events put-rule \
  --name "stop-dev-instances" \
  --schedule-expression "cron(30 13 ? * MON-FRI *)" \
  --state ENABLED

# EventBridge rule for start (7 AM IST = 1:30 AM UTC)
aws events put-rule \
  --name "start-dev-instances" \
  --schedule-expression "cron(30 1 ? * MON-FRI *)" \
  --state ENABLED

Continuous Optimization: The Operating Model

Cost optimization is not a project with an end date. It is an ongoing practice.

Weekly Review Cadence

Every Monday, the platform team reviews:

  1. Previous week spend vs. budget — using Cost Explorer's weekly view
  2. Top 5 cost changes — which services or tags had the largest absolute increase
  3. Anomaly alerts — AWS Cost Anomaly Detection flags unexpected spend patterns
  4. New resources — any resources created in the past week without required tags (use AWS Config rules to detect this)
  5. Savings Plan utilization — if utilization drops below 90%, something changed in the workload

AWS Cost Anomaly Detection

aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "service-level-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "cost-anomaly-alerts",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/monitor-id"],
    "Subscribers": [
      {
        "Address": "[email protected]",
        "Type": "EMAIL"
      }
    ],
    "Frequency": "DAILY",
    "ThresholdExpression": {
      "Dimensions": {
        "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
        "Values": ["100"],
        "MatchOptions": ["GREATER_THAN_OR_EQUAL"]
      }
    }
  }'

This triggers an alert whenever an anomaly with an impact of $100 or more is detected. Adjust the threshold based on your total spend — for a $28K/month account, $100 is roughly 0.35%, which is a reasonable sensitivity.

Team Accountability

Each cost-center tag maps to a team lead who receives a weekly cost report for their tag. This creates visibility and ownership without requiring every engineer to understand AWS billing. The platform team owns the overall budget and the governance framework. Individual teams own their cost trajectories.

Case Study: $47K → $28K in 8 Weeks

A mid-stage SaaS company approached Stripe Systems with a straightforward problem: their AWS bill had grown from $20K to $47K over 18 months without a proportional increase in traffic or customers. The infrastructure had accumulated organic waste — instances sized for launch-day traffic projections that never materialized, default storage configurations never revisited, and NAT Gateways processing terabytes of internal AWS service traffic.

Discovery (Weeks 1–2)

We deployed the tagging strategy described above and analyzed 3 months of CUR data. The findings:

  • 14 of 38 EC2 instances had average CPU below 15%
  • 40 EBS volumes were still gp2 (the account was created before gp3 became the default)
  • 12 EBS volumes were unattached, totaling 1.2 TB
  • 2 TB of EBS snapshots belonged to volumes that had been terminated months earlier
  • 4 TB of application logs sat in S3 Standard with no lifecycle policy
  • 2 NAT Gateways processed 18 TB/month, of which 14 TB was traffic to S3 and DynamoDB
  • 3 ALBs had received zero requests in 30+ days
  • 8 Elastic IPs were allocated but unassociated
  • Zero Savings Plans or Reserved Instances were in place — 100% on-demand pricing

Implementation (Weeks 3–8)

| Optimization | Action | Monthly Savings |
|---|---|---|
| EC2 right-sizing | Downsized 14 instances (11 reduced one class, 3 migrated to Graviton m7g) | $6,000 |
| Compute Savings Plans | Purchased 1-year No Upfront Compute Savings Plans at 60% coverage of steady-state compute | $5,000 |
| S3 lifecycle policies | Moved 4 TB of logs >30 days old to Glacier, enabled IA transition at 30 days | $3,000 |
| NAT Gateway elimination | Replaced 2 NAT Gateways with gateway endpoints for S3/DynamoDB and interface endpoints for ECR, STS, CloudWatch Logs | $2,000 |
| EBS optimization | Migrated 40 volumes from gp2 to gp3, deleted 2 TB of orphaned snapshots and 12 unattached volumes | $2,000 |
| Unused resource cleanup | Terminated 3 idle ALBs, released 8 unassociated Elastic IPs, removed associated security groups and target groups | $1,000 |
| **Total** | | **$19,000/month** |

New monthly spend: $28,000 — a 40% reduction.

Governance Framework Deployed

To prevent cost regression, we implemented:

  1. Mandatory tagging SCP — resources without environment, team, service, and cost-center tags cannot be created
  2. AWS Budgets — $30,000 monthly budget with alerts at 80% actual and 100% forecasted
  3. Instance size SCP — dev/staging accounts cannot launch instances larger than 4xlarge or GPU instances
  4. Weekly review — Monday 30-minute meeting reviewing Cost Explorer dashboard, anomaly alerts, and tag compliance
  5. Dev environment scheduling — all non-production instances tagged auto-shutdown=true stop at 7 PM IST and start at 7 AM IST on weekdays
  6. Cost Anomaly Detection — daily monitoring with $100 impact threshold

The Stripe Systems engineering team conducted this engagement over 8 weeks with a two-person team. The infrastructure changes required no application downtime and no code modifications. Six months later, the client's bill has remained between $27K and $30K, validating that the governance framework is holding.

Key Takeaways

  1. Tag first, optimize second. Without accurate cost attribution, every optimization decision is based on incomplete data.

  2. NAT Gateway charges are the most commonly overlooked cost. Any workload communicating with S3, DynamoDB, ECR, or other AWS services should use VPC endpoints.

  3. gp3 is strictly better than gp2. There is no reason to run gp2 volumes. The migration is zero-downtime and takes one API call per volume.

  4. Savings Plans over Reserved Instances. Compute Savings Plans provide comparable discounts with far more flexibility. Commit to 60–70% of your baseline, not 100%.

  5. Right-sizing is not a one-time event. Workloads change. The instance that was correctly sized 6 months ago may be 3x over-provisioned today. Automate the detection.

  6. Governance prevents drift. SCPs, budgets, automated schedules, and a weekly review cadence are what separate a one-time cost cut from sustained efficiency.

Cloud cost optimization is an engineering discipline, not a procurement exercise. The tools are mature, the data is available, and the process is repeatable. The 40% reduction documented here is not unusual — it is typical of what a structured FinOps practice finds in accounts that have never been systematically reviewed.
