Stripe Systems
Cloud Computing · March 5, 2026 · 18 min read

FinOps in Practice: How We Cut a Client's AWS Bill by 40% Without Touching Their Codebase

By Stripe Systems Engineering

Most organizations overspend on AWS by 25–35%. Not because their engineers are careless, but because cloud billing is structurally opaque. Pricing varies by region, instance family, tenancy, payment model, and data path — and none of that complexity is visible from application code.

This post documents a real FinOps engagement where we reduced a client's monthly AWS bill from $47,000 to $28,000. No application code was modified. Every change was at the infrastructure, configuration, and governance layer. We will walk through each optimization with the exact CLI commands, JSON policies, and cost math we used.

The FinOps Framework: Inform → Optimize → Operate

FinOps is not a one-time audit. It is a continuous practice structured around three phases:

Inform — Establish visibility into who is spending what, and where. This means tagging, cost allocation, and reporting. You cannot optimize what you cannot attribute.

Optimize — Act on the data. Right-size instances, purchase commitments, restructure storage tiers, eliminate waste. Each action has a specific cost-benefit tradeoff that must be calculated.

Operate — Build governance and automation so costs do not drift back. Budgets, alerts, service control policies, and a weekly review cadence. The goal is to make cost efficiency a sustained organizational behavior, not a quarterly panic.

Most teams skip the Inform phase and jump straight to "let's buy Reserved Instances." That is a mistake. Without accurate attribution, you are guessing.

Phase 1 — Visibility: Tagging, CUR, and Cost Explorer

Tagging Strategy

Every resource must carry four mandatory tags. Without them, cost data is noise.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceRequiredTags",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "s3:CreateBucket",
        "elasticloadbalancing:CreateLoadBalancer"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/environment": "true",
          "aws:RequestTag/team": "true",
          "aws:RequestTag/service": "true",
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}

The four tags:

  • environment: production, staging, development, sandbox
  • team: The engineering team that owns this resource (e.g., payments, platform, data)
  • service: The application or microservice this resource belongs to (e.g., api-gateway, order-processor)
  • cost-center: Maps to the business unit responsible for the budget

Deploy this as a Service Control Policy (SCP) at the organizational level so it applies to all accounts.
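Before the SCP lands, it helps to audit what already exists. A minimal sketch of the compliance check — the tag keys mirror the policy above, while the `missing_tags` helper is illustrative; its input shape matches the tag lists returned by APIs such as EC2 describe-instances:

```python
# Mandatory tags enforced by the SCP above
REQUIRED_TAGS = {"environment", "team", "service", "cost-center"}

def missing_tags(resource_tags):
    """Return the mandatory tag keys absent (or empty) on a resource.

    resource_tags: a list of {"Key": ..., "Value": ...} dicts, the shape
    returned by APIs such as EC2 describe-instances.
    """
    present = {t["Key"] for t in resource_tags if t.get("Value")}
    return REQUIRED_TAGS - present

# An instance tagged with only two of the four mandatory keys
tags = [
    {"Key": "environment", "Value": "production"},
    {"Key": "team", "Value": "payments"},
]
print(sorted(missing_tags(tags)))  # ['cost-center', 'service']
```

Run a helper like this across every account before enabling the deny policy, so existing untagged resources are fixed rather than silently stranded.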

Cost and Usage Reports (CUR) with Athena

CUR provides line-item billing data. Setting up Athena queries over CUR is the foundation for any serious cost analysis.

Enable CUR delivery to S3:

aws cur put-report-definition \
  --report-definition '{
    "ReportName": "daily-cur",
    "TimeUnit": "DAILY",
    "Format": "Parquet",
    "Compression": "Parquet",
    "S3Bucket": "company-billing-cur",
    "S3Prefix": "cur-data",
    "S3Region": "us-east-1",
    "AdditionalSchemaElements": ["RESOURCES", "SPLIT_COST_ALLOCATION_DATA"],
    "RefreshClosedReports": true,
    "ReportVersioning": "OVERWRITE_REPORT"
  }'

Once the CUR data lands in S3, create an Athena table over it (AWS delivers a ready-made CloudFormation template alongside the CUR files when Athena integration is enabled, or you can point a Glue crawler at the prefix). Then query it:

-- Top 10 most expensive services by team
SELECT
  line_item_product_code AS service,
  resource_tags_user_team AS team,
  SUM(line_item_unblended_cost) AS total_cost
FROM cur_database.cur_table
WHERE month = '3' AND year = '2026'
GROUP BY 1, 2
ORDER BY total_cost DESC
LIMIT 10;

-- Daily spend trend for EC2
SELECT
  line_item_usage_start_date AS usage_date,
  SUM(line_item_unblended_cost) AS daily_cost
FROM cur_database.cur_table
WHERE line_item_product_code = 'AmazonEC2'
  AND month = '3' AND year = '2026'
GROUP BY 1
ORDER BY 1;

Cost Explorer Filters

Cost Explorer is useful for quick analysis without writing SQL. The filters we used most:

  • Group by: Tag → service — reveals which microservice costs the most
  • Filter: Usage Type = NatGateway-Bytes — isolates NAT Gateway data processing charges (often a hidden cost driver)
  • Filter: Purchase Option = On-Demand — shows only resources without any discount commitment, the primary optimization target
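The same group-by-tag breakdown is available programmatically through the Cost Explorer API. A sketch of the request payload — `boto3.client("ce").get_cost_and_usage` accepts this shape; the builder function and date range are illustrative:

```python
def cost_by_service_tag(start, end):
    """Build a Cost Explorer get_cost_and_usage request that groups
    unblended spend by the 'service' cost-allocation tag."""
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": "service"}],
    }

params = cost_by_service_tag("2026-03-01", "2026-04-01")
# Pass to boto3: boto3.client("ce").get_cost_and_usage(**params)
print(params["GroupBy"])  # [{'Type': 'TAG', 'Key': 'service'}]
```

Swap the `GroupBy` key for `team` or `cost-center` to produce the other attribution views.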

Right-Sizing EC2 Instances

Right-sizing is the highest-ROI optimization in nearly every engagement. Most instances are provisioned for peak load that rarely occurs.

Identifying Candidates with CloudWatch

Pull 14 days of average CPU utilization for all instances:

# Get CPU utilization for a specific instance over 14 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --start-time 2026-02-19T00:00:00Z \
  --end-time 2026-03-05T00:00:00Z \
  --period 86400 \
  --statistics Average Maximum \
  --output table

For memory utilization (requires CloudWatch Agent):

aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --start-time 2026-02-19T00:00:00Z \
  --end-time 2026-03-05T00:00:00Z \
  --period 86400 \
  --statistics Average Maximum \
  --output table

Decision Thresholds

Our criteria:

  • Average CPU < 20% over 14 days AND peak CPU < 50%: right-size candidate — downsize by one instance class
  • Average CPU < 10% over 14 days AND peak CPU < 30%: aggressive right-size candidate — downsize by two instance classes or consider Graviton
  • Average memory < 30% over 14 days: memory over-provisioned — switch to a compute-optimized family or smaller instance
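The thresholds above translate directly into a classification function. A sketch — the function name and labels are invented for illustration:

```python
def rightsize_action(avg_cpu, peak_cpu, avg_mem=None):
    """Classify an instance against the 14-day utilization thresholds.

    Percentages are 0-100; avg_mem requires the CloudWatch Agent and is
    checked only when the CPU rules do not already fire.
    """
    if avg_cpu < 10 and peak_cpu < 30:
        return "downsize two classes (or Graviton)"
    if avg_cpu < 20 and peak_cpu < 50:
        return "downsize one class"
    if avg_mem is not None and avg_mem < 30:
        return "memory over-provisioned"
    return "correctly sized"

print(rightsize_action(12, 45))  # downsize one class
print(rightsize_action(8, 25))   # downsize two classes (or Graviton)
```

Feeding the CloudWatch output from the commands above through a function like this turns the audit into a repeatable script rather than a spreadsheet exercise.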

Example Downsizes

| Original Instance | Avg CPU | Avg Memory | New Instance | Monthly Savings |
|---|---|---|---|---|
| m5.2xlarge (8 vCPU, 32 GB) | 12% | 22% | m5.xlarge (4 vCPU, 16 GB) | $140 |
| r5.xlarge (4 vCPU, 32 GB) | 8% | 18% | m7g.large (2 vCPU, 8 GB) | $180 |
| c5.4xlarge (16 vCPU, 32 GB) | 15% | 35% | c5.2xlarge (8 vCPU, 16 GB) | $200 |

The Graviton (m7g, c7g, r7g) instances deserve special attention: they are approximately 20% cheaper than equivalent x86 instances and deliver equal or better performance for most workloads. If your application runs on Linux and does not depend on x86-specific binaries, Graviton is free savings.

To list all running instances with their types for a bulk audit:

aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{
    ID:InstanceId,
    Type:InstanceType,
    Name:Tags[?Key==`Name`]|[0].Value,
    LaunchTime:LaunchTime
  }' \
  --output table

Reserved Instances vs. Savings Plans

The Math

Suppose an m5.xlarge in us-east-1 costs:

  • On-demand: $0.192/hr → $140.16/month
  • 1-year No Upfront Savings Plan: $0.121/hr → $88.33/month (37% discount)
  • 1-year All Upfront Savings Plan: $0.115/hr → $83.95/month (40% discount)
  • 1-year No Upfront Reserved Instance (standard): $0.120/hr → $87.60/month (37% discount)

Break-Even Analysis

For a 1-year No Upfront Compute Savings Plan at $0.121/hr:

Savings per hour = $0.192 - $0.121 = $0.071
Commitment per hour = $0.121 (you pay this regardless of usage)

Break-even utilization = commitment / on-demand rate
                       = $0.121 / $0.192
                       = 63%

If you are running the workload at least 63% of the time, the Savings Plan saves money. For production workloads running 24/7, this is trivially met.
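The arithmetic is worth scripting so it can be rerun against current pricing; the rates below are the ones quoted above:

```python
ON_DEMAND = 0.192  # m5.xlarge us-east-1, $/hr
SP_RATE = 0.121    # 1-year No Upfront Compute Savings Plan, $/hr

# Fraction of hours the workload must run for the commitment to pay off
break_even = SP_RATE / ON_DEMAND

# Monthly saving for a 24/7 workload, at 730 hours/month
monthly_saving = (ON_DEMAND - SP_RATE) * 730

print(f"break-even utilization: {break_even:.0%}")                 # 63%
print(f"saving at full utilization: ${monthly_saving:.2f}/month")  # $51.83/month
```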

Why Savings Plans Beat Reserved Instances

Reserved Instances are locked to a specific instance family, region, and tenancy. Savings Plans (specifically Compute Savings Plans) apply to any instance family, any region, any OS, and even Fargate and Lambda. If you right-size an instance or migrate to Graviton after purchasing an RI, the RI may no longer apply. A Compute Savings Plan automatically follows you.

The only scenario where RIs still make sense is if you need capacity reservation (On-Demand Capacity Reservations paired with RIs). For pure cost optimization, Savings Plans are the correct instrument.

Coverage recommendation: commit to Savings Plans for 60–70% of your steady-state compute. Leave the remaining 30–40% as on-demand or Spot to absorb variability.

Spot Instances for Non-Critical Workloads

Spot instances offer 60–90% discounts but can be reclaimed with two minutes' notice. They are appropriate for batch processing, CI/CD runners, stateless web workers behind a load balancer, and data pipelines.

Spot Fleet Configuration

Diversify across instance types and Availability Zones to reduce interruption probability:

{
  "SpotFleetRequestConfig": {
    "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
    "TargetCapacity": 10,
    "SpotPrice": "0.10",
    "TerminateInstancesWithExpiration": true,
    "Type": "maintain",
    "AllocationStrategy": "capacityOptimized",
    "LaunchSpecifications": [
      {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m5.xlarge",
        "SubnetId": "subnet-aaa111,subnet-bbb222,subnet-ccc333",
        "WeightedCapacity": 1
      },
      {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m5a.xlarge",
        "SubnetId": "subnet-aaa111,subnet-bbb222,subnet-ccc333",
        "WeightedCapacity": 1
      },
      {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m6i.xlarge",
        "SubnetId": "subnet-aaa111,subnet-bbb222,subnet-ccc333",
        "WeightedCapacity": 1
      },
      {
        "ImageId": "ami-0abcdef1234567890",
        "InstanceType": "m7g.xlarge",
        "SubnetId": "subnet-aaa111,subnet-bbb222,subnet-ccc333",
        "WeightedCapacity": 1
      }
    ]
  }
}

Key settings:

  • AllocationStrategy: capacityOptimized — selects pools with the most available capacity, reducing interruption rates. This is preferred over lowestPrice for production workloads.
  • Four instance types across three AZs — gives the fleet 12 capacity pools to draw from. The more pools, the lower the probability of simultaneous interruption.

Graceful Shutdown Handling

Every Spot instance should poll the instance metadata service for the interruption notice:

#!/bin/bash
# /usr/local/bin/spot-interrupt-handler.sh
# Poll every 5 seconds from a loop in a systemd service
# (cron cannot schedule sub-minute intervals)

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# The instance-action endpoint returns 404 until an interruption is scheduled
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action)

if [ "$HTTP_CODE" = "200" ]; then
  echo "Spot interruption notice received. Draining..."
  # Deregister from load balancer target group
  INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/instance-id)

  aws elbv2 deregister-targets \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123 \
    --targets Id="$INSTANCE_ID"

  # Allow in-flight requests to complete
  sleep 30

  # Stop application gracefully
  systemctl stop my-application
fi

Storage Optimization

S3 Lifecycle Policies

Most S3 buckets accumulate data indefinitely. A lifecycle policy moves aging objects to cheaper tiers automatically:

{
  "Rules": [
    {
      "ID": "ArchiveLogs",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 730
      }
    },
    {
      "ID": "CleanupIncompleteUploads",
      "Status": "Enabled",
      "Filter": {
        "Prefix": ""
      },
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}

Apply it:

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-application-logs \
  --lifecycle-configuration file://lifecycle-policy.json

Pricing comparison (us-east-1, per GB/month):

| Tier | Cost (per GB/month) | Retrieval Cost | Use Case |
|---|---|---|---|
| S3 Standard | $0.023 | None | Active data |
| S3 Standard-IA | $0.0125 | $0.01/GB | Accessed < 1x/month |
| S3 Glacier Instant Retrieval | $0.004 | $0.03/GB | Quarterly access |
| S3 Glacier Flexible Retrieval | $0.0036 | $0.01/GB (5–12 hrs) | Annual audits |
| S3 Deep Archive | $0.00099 | $0.02/GB (12–48 hrs) | Compliance retention |

Moving logs from Standard to Glacier Flexible Retrieval saves approximately $19/TB/month — roughly $78/month for 4 TB, before retrieval charges (negligible for logs that are almost never read).
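Worked out in code (decimal TB, retrieval charges excluded; the helper is illustrative):

```python
STANDARD = 0.023       # $/GB-month
GLACIER_FLEX = 0.0036  # $/GB-month

def monthly_savings(tb, from_price, to_price, gb_per_tb=1000):
    """Monthly storage saving from moving `tb` (decimal) terabytes
    between tiers; retrieval charges are excluded."""
    return tb * gb_per_tb * (from_price - to_price)

print(round(monthly_savings(1, STANDARD, GLACIER_FLEX), 2))  # 19.4
print(round(monthly_savings(4, STANDARD, GLACIER_FLEX), 2))  # 77.6
```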

EBS Volume Migration: gp2 → gp3

gp3 volumes are 20% cheaper than gp2 at baseline and provide 3,000 IOPS and 125 MB/s throughput included (gp2 provides 3 IOPS/GB with burst credits, which means a 100 GB gp2 volume only gets 300 baseline IOPS).

Migrate in place with zero downtime:

# Find all gp2 volumes
aws ec2 describe-volumes \
  --filters "Name=volume-type,Values=gp2" \
  --query 'Volumes[].{ID:VolumeId,Size:Size,State:State,AZ:AvailabilityZone}' \
  --output table

# Modify a volume from gp2 to gp3
aws ec2 modify-volume \
  --volume-id vol-0abc123def456789a \
  --volume-type gp3 \
  --iops 3000 \
  --throughput 125

For a 500 GB volume: gp2 costs $0.10/GB/month = $50. gp3 costs $0.08/GB/month = $40. That is $10/month per volume, and it adds up fast across 40 volumes.
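Scaled across a fleet — the prices are the gp2/gp3 baseline rates above, and the uniform 500 GB volume size is an assumption for illustration:

```python
GP2 = 0.10  # $/GB-month
GP3 = 0.08  # $/GB-month baseline

def gp3_migration_saving(volume_sizes_gb):
    """Total monthly saving from migrating gp2 volumes to gp3
    (baseline storage price only; extra provisioned IOPS excluded)."""
    return sum(size * (GP2 - GP3) for size in volume_sizes_gb)

# 40 volumes, assumed 500 GB each
print(round(gp3_migration_saving([500] * 40), 2))  # 400.0
```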

Snapshot Cleanup

Orphaned snapshots accumulate when instances are terminated but their snapshots remain:

# List all snapshots owned by this account
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<=`2025-12-01`].{
    ID:SnapshotId,
    Size:VolumeSize,
    Created:StartTime,
    VolumeId:VolumeId,
    Description:Description
  }' \
  --output table

# Find snapshots whose source volume no longer exists
for snap_id in $(aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[].SnapshotId' --output text); do

  vol_id=$(aws ec2 describe-snapshots --snapshot-ids "$snap_id" \
    --query 'Snapshots[0].VolumeId' --output text)

  if [ "$vol_id" != "None" ] && [ "$vol_id" != "vol-ffffffff" ]; then
    state=$(aws ec2 describe-volumes --volume-ids "$vol_id" \
      --query 'Volumes[0].State' --output text 2>/dev/null)
    if [ $? -ne 0 ]; then
      echo "ORPHANED: $snap_id (volume $vol_id no longer exists)"
    fi
  fi
done

# Delete a confirmed orphaned snapshot
aws ec2 delete-snapshot --snapshot-id snap-0abc123def456789a

Database Cost Optimization

Aurora I/O-Optimized vs. Standard

Aurora offers two pricing models:

Standard: Lower instance cost, pay per I/O operation ($0.20 per million I/O requests)

I/O-Optimized: 30% higher instance cost, zero I/O charges

The crossover math:

Standard monthly cost = instance_cost + (io_requests × $0.20 / 1,000,000)
IO-Optimized monthly cost = instance_cost × 1.30

Break-even: instance_cost × 0.30 = io_requests × $0.20 / 1,000,000

For an r6g.xlarge ($0.52/hr = $379.60/month):
Break-even I/O = ($379.60 × 0.30) / $0.20 × 1,000,000
               = $113.88 / $0.20 × 1,000,000
               = 569,400,000 I/O requests/month
               ≈ 570 million I/O requests/month

If your Aurora cluster processes more than 570 million I/O operations per month, I/O-Optimized is cheaper. Check your current I/O in CloudWatch:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name VolumeReadIOPs \
  --dimensions Name=DBClusterIdentifier,Value=my-aurora-cluster \
  --start-time 2026-02-01T00:00:00Z \
  --end-time 2026-03-01T00:00:00Z \
  --period 2592000 \
  --statistics Sum \
  --output json

Add VolumeWriteIOPs to the total. If the sum exceeds the break-even threshold, switch to I/O-Optimized. The switch is a single API call with no downtime.
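The crossover calculation, using the r6g.xlarge figures above:

```python
INSTANCE_HOURLY = 0.52  # Aurora r6g.xlarge, $/hr
IO_PRICE = 0.20         # $ per 1M I/O requests (Standard pricing)
PREMIUM = 0.30          # I/O-Optimized instance-price uplift

monthly_instance = INSTANCE_HOURLY * 730  # ≈ $379.60

# I/O-Optimized wins once the I/O bill exceeds the 30% instance uplift
break_even_io = monthly_instance * PREMIUM / IO_PRICE * 1_000_000

print(f"{break_even_io:,.0f} I/O requests/month")  # 569,400,000 I/O requests/month
```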

RDS Right-Sizing and Read Replica Consolidation

Apply the same CPU/memory analysis as EC2. Additionally, check for read replicas that exist but receive minimal traffic:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=my-read-replica-01 \
  --start-time 2026-02-19T00:00:00Z \
  --end-time 2026-03-05T00:00:00Z \
  --period 86400 \
  --statistics Average Maximum \
  --output table

If a read replica averages fewer than 5 connections over 14 days, it is likely serving a query that could be routed to the primary or to another replica. Consolidating from 3 read replicas to 1 saves two full instance costs.

NAT Gateway: The Silent Budget Killer

A single NAT Gateway costs $0.045/hr (about $33/month) plus $0.045/GB of processed data. For a workload pushing 20 TB/month through NAT (common when services pull from S3, ECR, or DynamoDB), the data processing charge alone is $900/month per NAT Gateway.

If each of two NAT Gateways (one per AZ for high availability) carries that volume, that is $1,800/month in data processing charges — for traffic that could be free.
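A quick model of the total NAT bill — rates from above; the two-gateway, 18 TB/month profile matches the case study later in this post:

```python
NAT_HOURLY = 0.045  # $/hr per NAT Gateway
NAT_PER_GB = 0.045  # $/GB of processed data

def nat_monthly_cost(gateways, tb_processed_total, hours=730):
    """Monthly NAT bill: hourly charge per gateway plus data processing
    on the total (decimal) TB pushed through them."""
    return gateways * NAT_HOURLY * hours + tb_processed_total * 1000 * NAT_PER_GB

# Two gateways carrying 18 TB/month in total
print(round(nat_monthly_cost(2, 18), 2))  # 875.7
```

Note how lopsided the bill is: the hourly charge is about $66 of that total, and data processing is the other $810 — which is exactly the portion VPC endpoints eliminate.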

The Fix: VPC Endpoints

Gateway endpoints (S3 and DynamoDB) are free. No hourly charge, no data processing charge.

# Create a gateway endpoint for S3
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc123 rtb-0def456

# Create a gateway endpoint for DynamoDB
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids rtb-0abc123 rtb-0def456

Interface endpoints (ECR, STS, CloudWatch, Secrets Manager, etc.) cost $0.01/hr per AZ plus $0.01/GB processed. Still far cheaper than NAT Gateway for AWS service traffic.

# Create an interface endpoint for ECR (Docker pulls)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids subnet-aaa111 subnet-bbb222 \
  --security-group-ids sg-0abc123def456789a \
  --private-dns-enabled

# Also create endpoints for ECR API and S3 (ECR uses S3 for layers)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-aaa111 subnet-bbb222 \
  --security-group-ids sg-0abc123def456789a \
  --private-dns-enabled

# STS endpoint (used by IAM role assumption, very chatty)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def456789a \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.sts \
  --subnet-ids subnet-aaa111 subnet-bbb222 \
  --security-group-ids sg-0abc123def456789a \
  --private-dns-enabled

To determine how much traffic is going through your NAT Gateway, check Cost Explorer with usage type filter NatGateway-Bytes or query CloudWatch:

aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-0abc123def456789a \
  --start-time 2026-02-01T00:00:00Z \
  --end-time 2026-03-01T00:00:00Z \
  --period 2592000 \
  --statistics Sum \
  --output json

Data Transfer Optimization

CloudFront for Origin Offload

Data transfer from EC2/ALB to the internet costs $0.09/GB. Data transfer from CloudFront is $0.085/GB at the first tier and drops with volume. But the real savings come from cache hits — a cached response at CloudFront does not generate an origin fetch, so you pay only the CloudFront edge cost with zero origin data transfer.

CloudFront configuration for an ALB origin with aggressive caching:

{
  "Origins": {
    "Items": [
      {
        "Id": "alb-origin",
        "DomainName": "internal-api-alb-123456.us-east-1.elb.amazonaws.com",
        "CustomOriginConfig": {
          "HTTPPort": 80,
          "HTTPSPort": 443,
          "OriginProtocolPolicy": "https-only",
          "OriginReadTimeout": 30,
          "OriginKeepaliveTimeout": 5
        }
      }
    ],
    "Quantity": 1
  },
  "DefaultCacheBehavior": {
    "TargetOriginId": "alb-origin",
    "ViewerProtocolPolicy": "redirect-to-https",
    "CachePolicyId": "658327ea-f89d-4fab-a63d-7e88639e58f6",
    "Compress": true,
    "AllowedMethods": ["GET", "HEAD", "OPTIONS"],
    "CachedMethods": ["GET", "HEAD"]
  }
}

The CachePolicyId above is the AWS managed CachingOptimized policy. For API responses, create a custom cache policy with appropriate TTLs based on your data freshness requirements.

For static assets (JS, CSS, images), set Cache-Control: public, max-age=31536000, immutable at the origin. A well-configured CloudFront distribution achieves 85–95% cache hit ratios for static content, meaning only 5–15% of requests reach your origin.
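To estimate the transfer-cost side, note that origin-to-CloudFront transfer is free for AWS origins, so the cache hit ratio mostly determines origin load rather than transfer dollars. A sketch with the rates from this section (the functions and the 10 TB figure are illustrative):

```python
ORIGIN_EGRESS = 0.09  # $/GB, EC2/ALB direct to the internet
CF_EGRESS = 0.085     # $/GB, CloudFront first pricing tier

def direct_egress_cost(gb):
    return gb * ORIGIN_EGRESS

def cloudfront_egress_cost(gb):
    # Origin-to-CloudFront transfer is free for AWS origins, so every
    # byte bills once at the edge rate regardless of hit ratio.
    return gb * CF_EGRESS

def origin_request_fraction(hit_ratio):
    """Share of viewer requests that still reach the origin."""
    return 1 - hit_ratio

# 10 TB/month of responses at a 90% cache hit ratio
print(round(direct_egress_cost(10_000), 2))      # 900.0
print(round(cloudfront_egress_cost(10_000), 2))  # 850.0
print(round(origin_request_fraction(0.90), 2))   # 0.1
```

The edge-rate discount is modest; the larger operational win is that only a tenth of the requests reach the origin, which often lets you shrink the origin fleet itself.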

VPC Endpoints for AWS Service Traffic

As covered in the NAT Gateway section, VPC endpoints eliminate data transfer charges for traffic to AWS services. This is data transfer optimization — the traffic still occurs, but the cost drops to zero (for gateway endpoints) or near-zero (for interface endpoints).

Finding and Eliminating Unused Resources

Unused resources are pure waste. These commands identify them:

Unattached EBS Volumes

aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[].{
    ID:VolumeId,
    Size:Size,
    Type:VolumeType,
    Created:CreateTime,
    AZ:AvailabilityZone
  }' \
  --output table

An available status means the volume is not attached to any instance. If it has been available for more than 7 days, it is almost certainly waste.

Idle Load Balancers

# Find ALBs with zero requests in the last 14 days
for alb_arn in $(aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[].LoadBalancerArn' --output text); do

  alb_name=$(echo "$alb_arn" | awk -F'/' '{print $(NF-1)"/"$NF}')

  request_count=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name RequestCount \
    --dimensions Name=LoadBalancer,Value="app/$alb_name" \
    --start-time 2026-02-19T00:00:00Z \
    --end-time 2026-03-05T00:00:00Z \
    --period 1209600 \
    --statistics Sum \
    --query 'Datapoints[0].Sum' \
    --output text 2>/dev/null)

  if [ "$request_count" = "None" ] || [ "$request_count" = "0.0" ]; then
    echo "IDLE ALB: $alb_arn"
  fi
done

Each idle ALB costs approximately $16.20/month (hourly charge) plus LCU charges. With zero traffic, it is still $16.20/month of waste.

Unused Elastic IPs

aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==null].{
    IP:PublicIp,
    AllocationId:AllocationId,
    Tags:Tags
  }' \
  --output table

Unassociated Elastic IPs cost $3.60/month each (as of the 2024 pricing change where AWS began charging for all public IPv4 addresses). Eight unused EIPs = $28.80/month.

Governance: Budgets, SCPs, and Automation

AWS Budgets with Alerts

aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "monthly-total",
    "BudgetLimit": {
      "Amount": "30000",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {},
    "CostTypes": {
      "IncludeTax": true,
      "IncludeSubscription": true,
      "UseBlended": false
    }
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "[email protected]"
        },
        {
          "SubscriptionType": "SNS",
          "Address": "arn:aws:sns:us-east-1:123456789012:budget-alerts"
        }
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "[email protected]"
        }
      ]
    }
  ]'

Two alert thresholds: actual spend exceeding 80% of budget, and forecasted spend exceeding 100%. The forecasted alert is critical — it triggers before you overspend, giving you time to react.

Service Control Policies to Prevent Expensive Mistakes

This SCP prevents anyone from launching instances larger than 4xlarge or using expensive instance families in non-production accounts:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyLargeInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "ForAnyValue:StringLike": {
          "ec2:InstanceType": [
            "*.8xlarge",
            "*.12xlarge",
            "*.16xlarge",
            "*.24xlarge",
            "*.metal",
            "p3.*",
            "p4d.*",
            "p5.*",
            "g5.*",
            "inf1.*",
            "inf2.*"
          ]
        }
      }
    },
    {
      "Sid": "DenyExpensiveRDS",
      "Effect": "Deny",
      "Action": "rds:CreateDBInstance",
      "Resource": "*",
      "Condition": {
        "ForAnyValue:StringLike": {
          "rds:DatabaseClass": [
            "db.r5.8xlarge",
            "db.r5.12xlarge",
            "db.r5.16xlarge",
            "db.r5.24xlarge",
            "db.r6g.8xlarge",
            "db.r6g.12xlarge",
            "db.r6g.16xlarge"
          ]
        }
      }
    }
  ]
}

Attach this SCP to the organizational unit containing development and staging accounts. Production accounts may need larger instances, so scope accordingly.

Automated Dev Environment Shutdown

Development and staging environments do not need to run 24/7. Shutting them down outside business hours (e.g., 7 PM to 7 AM IST, plus weekends) cuts their compute cost by roughly 64% — the instances run only 60 of the 168 hours in a week.
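The savings fraction follows directly from the schedule (60 running hours out of 168 per week, so just over 64% of hours are saved):

```python
def scheduled_savings(hours_per_weekday=12, weekdays=5):
    """Fraction of compute cost saved by running instances only during
    business hours (default: 7 AM-7 PM on weekdays, off on weekends)."""
    running = hours_per_weekday * weekdays
    return 1 - running / (24 * 7)

print(f"{scheduled_savings():.1%}")  # 64.3%
```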

Use AWS Instance Scheduler or a simple EventBridge rule with a Lambda function. The tag-based approach works well:

# Tag instances that should be auto-stopped
aws ec2 create-tags \
  --resources i-0abc123def456789 \
  --tags Key=auto-shutdown,Value=true

# EventBridge rule (cron: 7 PM IST = 1:30 PM UTC)
aws events put-rule \
  --name "stop-dev-instances" \
  --schedule-expression "cron(30 13 ? * MON-FRI *)" \
  --state ENABLED

# EventBridge rule for start (7 AM IST = 1:30 AM UTC)
aws events put-rule \
  --name "start-dev-instances" \
  --schedule-expression "cron(30 1 ? * MON-FRI *)" \
  --state ENABLED

Continuous Optimization: The Operating Model

Cost optimization is not a project with an end date. It is an ongoing practice.

Weekly Review Cadence

Every Monday, the platform team reviews:

  1. Previous week spend vs. budget — using Cost Explorer's weekly view
  2. Top 5 cost changes — which services or tags had the largest absolute increase
  3. Anomaly alerts — AWS Cost Anomaly Detection flags unexpected spend patterns
  4. New resources — any resources created in the past week without required tags (use AWS Config rules to detect this)
  5. Savings Plan utilization — if utilization drops below 90%, something changed in the workload

AWS Cost Anomaly Detection

aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "service-level-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "cost-anomaly-alerts",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/monitor-id"],
    "Subscribers": [
      {
        "Address": "[email protected]",
        "Type": "EMAIL"
      }
    ],
    "Frequency": "DAILY",
    "ThresholdExpression": {
      "Dimensions": {
        "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
        "Values": ["100"],
        "MatchOptions": ["GREATER_THAN_OR_EQUAL"]
      }
    }
  }'

This triggers an alert whenever an anomaly with an impact of $100 or more is detected. Adjust the threshold based on your total spend — for a $28K/month account, $100 is roughly 0.35%, which is a reasonable sensitivity.

Team Accountability

Each cost-center tag maps to a team lead who receives a weekly cost report for their tag. This creates visibility and ownership without requiring every engineer to understand AWS billing. The platform team owns the overall budget and the governance framework. Individual teams own their cost trajectories.

Case Study: $47K → $28K in 8 Weeks

A mid-stage SaaS company approached Stripe Systems with a straightforward problem: their AWS bill had grown from $20K to $47K over 18 months without a proportional increase in traffic or customers. The infrastructure had accumulated organic waste — instances sized for launch-day traffic projections that never materialized, default storage configurations never revisited, and NAT Gateways processing terabytes of internal AWS service traffic.

Discovery (Weeks 1–2)

We deployed the tagging strategy described above and analyzed 3 months of CUR data. The findings:

  • 14 of 38 EC2 instances had average CPU below 15%
  • 40 EBS volumes were still gp2 (the account was created before gp3 became the default)
  • 12 EBS volumes were unattached, totaling 1.2 TB
  • 2 TB of EBS snapshots belonged to volumes that had been terminated months earlier
  • 4 TB of application logs sat in S3 Standard with no lifecycle policy
  • 2 NAT Gateways processed 18 TB/month, of which 14 TB was traffic to S3 and DynamoDB
  • 3 ALBs had received zero requests in 30+ days
  • 8 Elastic IPs were allocated but unassociated
  • Zero Savings Plans or Reserved Instances were in place — 100% on-demand pricing

Implementation (Weeks 3–8)

| Optimization | Action | Monthly Savings |
|---|---|---|
| EC2 right-sizing | Downsized 14 instances (11 reduced one class, 3 migrated to Graviton m7g) | $6,000 |
| Compute Savings Plans | Purchased 1-year No Upfront Compute Savings Plans at 60% coverage of steady-state compute | $5,000 |
| S3 lifecycle policies | Moved 4 TB of logs >30 days old to Glacier, enabled IA transition at 30 days | $3,000 |
| NAT Gateway elimination | Replaced 2 NAT Gateways with gateway endpoints for S3/DynamoDB and interface endpoints for ECR, STS, CloudWatch Logs | $2,000 |
| EBS optimization | Migrated 40 volumes from gp2 to gp3, deleted 2 TB of orphaned snapshots and 12 unattached volumes | $2,000 |
| Unused resource cleanup | Terminated 3 idle ALBs, released 8 unassociated Elastic IPs, removed associated security groups and target groups | $1,000 |
| **Total** | | **$19,000/month** |

New monthly spend: $28,000 — a 40% reduction.

Governance Framework Deployed

To prevent cost regression, we implemented:

  1. Mandatory tagging SCP — resources without environment, team, service, and cost-center tags cannot be created
  2. AWS Budgets — $30,000 monthly budget with alerts at 80% actual and 100% forecasted
  3. Instance size SCP — dev/staging accounts cannot launch instances larger than 4xlarge or GPU instances
  4. Weekly review — Monday 30-minute meeting reviewing Cost Explorer dashboard, anomaly alerts, and tag compliance
  5. Dev environment scheduling — all non-production instances tagged auto-shutdown=true stop at 7 PM IST and start at 7 AM IST on weekdays
  6. Cost Anomaly Detection — daily monitoring with $100 impact threshold

The Stripe Systems engineering team conducted this engagement over 8 weeks with a two-person team. The infrastructure changes required no application downtime and no code modifications. Six months later, the client's bill has remained between $27K and $30K, validating that the governance framework is holding.

Key Takeaways

  1. Tag first, optimize second. Without accurate cost attribution, every optimization decision is based on incomplete data.

  2. NAT Gateway charges are the most commonly overlooked cost. Any workload communicating with S3, DynamoDB, ECR, or other AWS services should use VPC endpoints.

  3. gp3 is strictly better than gp2. There is no reason to run gp2 volumes. The migration is zero-downtime and takes one API call per volume.

  4. Savings Plans over Reserved Instances. Compute Savings Plans provide comparable discounts with far more flexibility. Commit to 60–70% of your baseline, not 100%.

  5. Right-sizing is not a one-time event. Workloads change. The instance that was correctly sized 6 months ago may be 3x over-provisioned today. Automate the detection.

  6. Governance prevents drift. SCPs, budgets, automated schedules, and a weekly review cadence are what separate a one-time cost cut from sustained efficiency.

Cloud cost optimization is an engineering discipline, not a procurement exercise. The tools are mature, the data is available, and the process is repeatable. The 40% reduction documented here is not unusual — it is typical of what a structured FinOps practice finds in accounts that have never been systematically reviewed.
