Infrastructure drift, the divergence between what is declared in code and what is actually running, is the root cause of a large class of production incidents. GitOps addresses this by making Git the single source of truth for both application deployments and infrastructure state. This post covers a practical GitOps implementation using ArgoCD for Kubernetes workloads and Terraform with Atlantis for cloud infrastructure.
GitOps Principles
GitOps is not a tool; it is a set of operational principles:
- Declarative: The entire system is described declaratively. For Kubernetes, this means YAML manifests or Helm charts. For infrastructure, Terraform HCL.
- Versioned and immutable: The desired state is stored in Git. Every change is a commit with an author, timestamp, and review trail.
- Pulled automatically: An agent (ArgoCD, Flux, a Terraform controller) pulls the desired state from Git and applies it. No human runs kubectl apply or terraform apply against production.
- Continuously reconciled: The agent detects drift between desired and actual state and corrects it. If someone manually scales a deployment, the agent scales it back to the declared replica count.
The value is operational: when something breaks, you run git log to find what changed, git revert to undo it, and the agent applies the rollback. No SSH access to production, no manual intervention.
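A minimal, self-contained sketch of that rollback loop, using a throwaway repo in place of the real config repository (file names and values are illustrative):

```shell
set -e
# Throwaway repo standing in for the config monorepo.
tmp=$(mktemp -d) && cd "$tmp"
git init -q && git config user.email dev@example.com && git config user.name dev
echo "replicas: 3" > deployment.yaml
git add . && git commit -qm "baseline"
echo "replicas: 50" > deployment.yaml
git commit -qam "bad change"
# Roll back: revert the bad commit; the GitOps agent would sync the reverted state.
git revert --no-edit HEAD >/dev/null
grep replicas deployment.yaml
# → replicas: 3
```

The revert is itself a commit, so the audit trail stays intact: history records both the bad change and its rollback.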
ArgoCD for Kubernetes Workloads
ArgoCD watches Git repositories and synchronizes Kubernetes resources to match the declared state.
Installation and Core Configuration
# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Expose via ingress (production setup)
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: argocd-server
namespace: argocd
annotations:
nginx.ingress.kubernetes.io/ssl-passthrough: "true"
nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
rules:
- host: argocd.internal.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: argocd-server
port:
number: 443
EOF
Application CRD
An ArgoCD Application defines the mapping between a Git repository path and a Kubernetes cluster/namespace:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: payment-service
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: production
source:
repoURL: https://github.com/org/platform-configs.git
targetRevision: main
path: services/payment-service/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: payment
syncPolicy:
automated:
prune: true # Delete resources removed from Git
selfHeal: true # Revert manual changes
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
retry:
limit: 3
backoff:
duration: 5s
factor: 2
maxDuration: 3m
ignoreDifferences:
- group: apps
kind: Deployment
jsonPointers:
- /spec/replicas # Allow HPA to manage replicas
Key sync policy decisions:
- prune: true: Resources deleted from Git are deleted from the cluster. Without this, orphaned resources accumulate.
- selfHeal: true: Manual changes (e.g., someone runs kubectl scale) are reverted. This enforces Git as the sole source of truth.
- ignoreDifferences: The HPA changes replica counts, which conflicts with the declared count. Ignoring this field lets the HPA and ArgoCD coexist.
ApplicationSets for Service Auto-Discovery
With 30+ microservices, defining each Application manually is tedious. Use an ApplicationSet to auto-discover services from the repository structure:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: platform-services
namespace: argocd
spec:
generators:
- git:
repoURL: https://github.com/org/platform-configs.git
revision: main
directories:
- path: services/*/overlays/production
template:
metadata:
name: '{{path[1]}}'
namespace: argocd
labels:
team: '{{path[1]}}'
spec:
project: production
source:
repoURL: https://github.com/org/platform-configs.git
targetRevision: main
path: '{{path}}'
destination:
server: https://kubernetes.default.svc
namespace: '{{path[1]}}'
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
This generator scans the repository for directories matching services/*/overlays/production and creates an Application for each. Adding a new service means creating a directory; ArgoCD picks it up automatically.
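To illustrate, the generator's directory match can be simulated with a plain glob (the service name and temp directory are made up):

```shell
set -e
# Simulate what the git generator discovers: any directory
# matching services/*/overlays/production registers a service.
repo=$(mktemp -d)
mkdir -p "$repo/services/checkout-service/overlays/production"
cd "$repo"
ls -d services/*/overlays/production
# → services/checkout-service/overlays/production
```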
Helm vs Kustomize with ArgoCD
Use Kustomize when: You have plain YAML manifests and need per-environment patches (replica count, resource limits, ingress hostnames). Kustomize overlays are transparent โ you can read the base and the patch separately.
Use Helm when: You need complex templating logic (conditionals, loops), or you consume third-party charts (nginx-ingress, cert-manager, prometheus).
A common hybrid: Helm charts for third-party dependencies, Kustomize for in-house services.
services/
payment-service/
base/
deployment.yaml
service.yaml
kustomization.yaml
overlays/
development/
kustomization.yaml # patches: replicas=1, resources=low
staging/
kustomization.yaml # patches: replicas=2, resources=medium
production/
kustomization.yaml # patches: replicas=3, resources=high
infrastructure/
cert-manager/
Chart.yaml
values-production.yaml
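For reference, the base kustomization for an in-house service can stay minimal; a sketch of assumed contents (the overlays shown later layer patches on top of this):

```yaml
# services/payment-service/base/kustomization.yaml (assumed contents)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
```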
Terraform for Infrastructure: Bridging the GitOps Gap
Terraform is not natively GitOps. It requires a plan step (preview changes), an apply step (execute changes), and maintains state in a backend. The reconciliation loop is not continuous โ it runs on demand.
Atlantis: GitOps Workflow for Terraform
Atlantis bridges this gap by integrating Terraform into the PR workflow:
- Developer opens a PR that modifies .tf files
- Atlantis automatically runs terraform plan and comments the plan output on the PR
- Reviewer examines the plan and approves the PR
- Developer comments atlantis apply to execute the changes
- Atlantis applies, comments the result, and the PR is merged
# atlantis.yaml (repo-level configuration)
version: 3
projects:
- name: networking
dir: infrastructure/networking
workspace: production
terraform_version: v1.6.4
autoplan:
when_modified:
- "*.tf"
- "*.tfvars"
enabled: true
apply_requirements:
- approved
- mergeable
workflow: default
- name: compute
dir: infrastructure/compute
workspace: production
terraform_version: v1.6.4
autoplan:
when_modified:
- "*.tf"
- "*.tfvars"
- "../modules/ecs/**/*.tf"
enabled: true
apply_requirements:
- approved
- mergeable
workflow: default
- name: data
dir: infrastructure/data
workspace: production
terraform_version: v1.6.4
autoplan:
when_modified:
- "*.tf"
- "*.tfvars"
enabled: true
apply_requirements:
- approved
- mergeable
workflow: custom-with-checkov
workflows:
custom-with-checkov:
plan:
steps:
- init
- plan
- run: checkov -d . --framework terraform --quiet --compact
apply:
steps:
- apply
Drift Detection
ArgoCD continuously compares Git to cluster state. Terraform does not do this by default. Implement scheduled drift detection:
# GitHub Actions workflow for Terraform drift detection
name: Terraform Drift Detection
on:
schedule:
- cron: '0 8 * * 1-5' # Weekdays at 8 AM
jobs:
drift-check:
runs-on: ubuntu-latest
strategy:
matrix:
project: [networking, compute, data]
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.6.4"
- name: Terraform Init
working-directory: infrastructure/${{ matrix.project }}
run: terraform init -backend-config=backend-prod.hcl
- name: Terraform Plan (Drift Detection)
working-directory: infrastructure/${{ matrix.project }}
id: plan
run: |
terraform plan -detailed-exitcode -out=plan.tfplan 2>&1 | tee plan-output.txt
echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
continue-on-error: true
- name: Notify on Drift
if: steps.plan.outputs.exitcode == '2'
run: |
curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
-H 'Content-Type: application/json' \
-d "{\"text\": \"โ ๏ธ Terraform drift detected in ${{ matrix.project }}. Review: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\"}"
Exit code 2 from terraform plan -detailed-exitcode means changes are needed: the infrastructure has drifted from the declared state. (Note that $? after a pipeline reports the last command's status, which is why the workflow reads ${PIPESTATUS[0]} to capture Terraform's exit code rather than tee's.)
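The exit-code contract is easy to script against. A runnable sketch, with a stub function standing in for the real terraform binary so the handling logic runs anywhere:

```shell
# Stub standing in for the real terraform binary (returns 2 to
# simulate a successful plan that found pending changes).
terraform() { return 2; }

terraform plan -detailed-exitcode -out=plan.tfplan
code=$?
case $code in
  0) status="in sync" ;;          # plan succeeded, no changes
  2) status="drift detected" ;;   # plan succeeded, changes pending
  *) status="plan failed" ;;      # real error
esac
echo "$status"  # → drift detected
```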
Repository Structure
The repository structure significantly impacts workflow clarity and team autonomy.
Recommended: Monorepo with Clear Boundaries
platform-configs/
├── infrastructure/              # Terraform → cloud resources
│   ├── modules/                 # Reusable Terraform modules
│   │   ├── vpc/
│   │   ├── ecs/
│   │   ├── rds/
│   │   └── s3/
│   ├── networking/              # VPC, subnets, route tables
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── backend-prod.hcl
│   ├── compute/                 # ECS clusters, EKS, EC2
│   │   ├── main.tf
│   │   └── ...
│   └── data/                    # RDS, ElastiCache, S3
│       ├── main.tf
│       └── ...
├── services/                    # Kubernetes manifests → app workloads
│   ├── payment-service/
│   │   ├── base/
│   │   └── overlays/
│   │       ├── development/
│   │       ├── staging/
│   │       └── production/
│   ├── user-service/
│   │   ├── base/
│   │   └── overlays/
│   └── ...
├── platform/                    # Cluster-wide Kubernetes resources
│   ├── cert-manager/
│   ├── ingress-nginx/
│   ├── prometheus/
│   └── argocd/
├── atlantis.yaml
└── .github/
    └── workflows/
        └── drift-detection.yaml
Why monorepo: A change that modifies both Terraform (new RDS instance) and Kubernetes manifests (new service connecting to that RDS) can be reviewed in a single PR. Cross-cutting changes are atomic.
Why not monorepo: If infrastructure and application teams have different review cadences, a monorepo can create a PR bottleneck. In that case, split into infra-configs and app-configs repos, with ArgoCD watching each independently.
Secrets in GitOps
Secrets cannot be stored in plaintext in Git. Three approaches:
Sealed Secrets
Encrypt secrets client-side, store the encrypted version in Git, decrypt in-cluster:
# Install Sealed Secrets controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets -n kube-system
# Encrypt a secret
kubectl create secret generic db-credentials \
--from-literal=password=s3cret \
--dry-run=client -o yaml | \
kubeseal --format yaml > sealed-db-credentials.yaml
The sealed secret can safely be committed to Git. Only the Sealed Secrets controller (with its private key) can decrypt it.
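The committed file ends up looking roughly like this (ciphertext abbreviated, namespace assumed):

```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials
  namespace: payment
spec:
  encryptedData:
    password: AgBy3i4OJSWK+PiTySYZZA9rO43cGD...  # ciphertext; only the controller's private key decrypts it
  template:
    metadata:
      name: db-credentials
      namespace: payment
```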
SOPS with Age
Mozilla SOPS encrypts specific values within YAML files, keeping keys readable:
# Generate an age key
age-keygen -o key.txt
# Encrypt with SOPS
sops --encrypt --age age1ql3z7hjy54pw3hyww5ayyfg7zqgvc7w3j2elw8zmrj2kg5sfn9aqmcac8p \
--encrypted-regex '^(data|stringData)$' \
secret.yaml > secret.enc.yaml
# ArgoCD decrypts via kustomize-sops plugin or helm-secrets
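A repo-level .sops.yaml keeps those flags out of every invocation; a sketch reusing the age public key above (the file-naming convention in path_regex is an assumption):

```yaml
# .sops.yaml — assumed repo-level SOPS configuration
creation_rules:
  - path_regex: .*secret\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: age1ql3z7hjy54pw3hyww5ayyfg7zqgvc7w3j2elw8zmrj2kg5sfn9aqmcac8p
```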
External Secrets Operator
Fetch secrets from external stores (AWS Secrets Manager, Vault, GCP Secret Manager) at runtime:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: db-credentials
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: ClusterSecretStore
target:
name: db-credentials
data:
- secretKey: password
remoteRef:
key: production/payment-service/db
property: password
This is the approach we recommend for production. Secrets never touch Git, not even in encrypted form.
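The aws-secrets-manager store referenced above has to be defined somewhere; a sketch of the backing ClusterSecretStore (the region, service account name, and IRSA-based auth are assumptions):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa        # assumed SA with IAM role for Secrets Manager
            namespace: external-secrets
```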
Promotion Workflows
Changes should flow through environments: development → staging → production.
PR-Based Promotion with Kustomize
# Structure
services/payment-service/
base/
deployment.yaml # image: payment-service:IMAGE_TAG
kustomization.yaml
overlays/
development/
kustomization.yaml # image tag: abc123 (latest commit)
staging/
kustomization.yaml # image tag: abc123 (promoted from dev)
production/
kustomization.yaml # image tag: def456 (promoted from staging)
Promotion is a PR that updates the image tag in the next environment's overlay:
# overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
images:
- name: payment-service
newName: ghcr.io/org/payment-service
newTag: abc123f # Updated by promotion PR
patches:
- target:
kind: Deployment
name: payment-service
patch: |
- op: replace
path: /spec/replicas
value: 2
Automate the promotion PR with a GitHub Actions workflow:
name: Promote to Staging
on:
workflow_dispatch:
inputs:
service:
description: 'Service name'
required: true
tag:
description: 'Image tag to promote'
required: true
jobs:
promote:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Update staging image tag
run: |
cd services/${{ inputs.service }}/overlays/staging
kustomize edit set image \
${{ inputs.service }}=ghcr.io/org/${{ inputs.service }}:${{ inputs.tag }}
- name: Create promotion PR
uses: peter-evans/create-pull-request@v6
with:
title: "Promote ${{ inputs.service }}:${{ inputs.tag }} to staging"
branch: promote/${{ inputs.service }}-${{ inputs.tag }}-staging
commit-message: "chore: promote ${{ inputs.service }}:${{ inputs.tag }} to staging"
body: |
Automated promotion from development.
- Service: `${{ inputs.service }}`
- Image tag: `${{ inputs.tag }}`
reviewers: platform-team
ArgoCD Health Checks and Rollback
Custom Health Checks
ArgoCD can monitor resource health beyond standard Kubernetes readiness:
# argocd-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: argocd-cm
namespace: argocd
data:
resource.customizations.health.argoproj.io_Rollout: |
hs = {}
if obj.status ~= nil then
if obj.status.conditions ~= nil then
for _, condition in ipairs(obj.status.conditions) do
if condition.type == "Degraded" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message
return hs
end
end
end
if obj.status.currentPodHash ~= nil then
if obj.status.currentPodHash == obj.status.stableRS then
hs.status = "Healthy"
else
hs.status = "Progressing"
end
return hs
end
end
hs.status = "Progressing"
return hs
Automated Rollback
ArgoCD does not natively perform automated rollback on health check failure, but you can combine it with Argo Rollouts for canary/blue-green deployments with automatic rollback:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: payment-service
spec:
replicas: 3
strategy:
canary:
steps:
- setWeight: 20
- pause: { duration: 5m }
- setWeight: 50
- pause: { duration: 5m }
- setWeight: 100
analysis:
templates:
- templateName: success-rate
startingStep: 1
args:
- name: service-name
value: payment-service
selector:
matchLabels:
app: payment-service
template:
metadata:
labels:
app: payment-service
spec:
containers:
- name: payment-service
image: ghcr.io/org/payment-service:abc123
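The success-rate template referenced in the analysis block lives in a separate resource. A sketch of what it might contain (the Prometheus address, metric names, and threshold are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      failureLimit: 1                      # one failed measurement aborts the canary
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

When a measurement fails, the Rollout aborts the canary and shifts traffic back to the stable ReplicaSet automatically.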
Disaster Recovery
ArgoCD Backup
# Export all ArgoCD applications
kubectl get applications -n argocd -o yaml > argocd-apps-backup.yaml
# Export ArgoCD configuration
kubectl get configmap argocd-cm argocd-rbac-cm -n argocd -o yaml > argocd-config-backup.yaml
kubectl get secret argocd-secret -n argocd -o yaml > argocd-secret-backup.yaml
# Automated backup with CronJob
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
name: argocd-backup
namespace: argocd
spec:
schedule: "0 2 * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
kubectl get applications -n argocd -o yaml > /backup/apps.yaml
aws s3 cp /backup/apps.yaml s3://backups/argocd/apps-\$(date +%Y%m%d).yaml
volumeMounts:
- name: backup
mountPath: /backup
volumes:
- name: backup
emptyDir: {}
restartPolicy: OnFailure
serviceAccountName: argocd-backup-sa
EOF
Terraform State Recovery
# Always use versioned S3 bucket for state
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
# Recover a previous state version
aws s3api list-object-versions \
--bucket terraform-state-bucket \
--prefix networking/terraform.tfstate
aws s3api get-object \
--bucket terraform-state-bucket \
--key networking/terraform.tfstate \
--version-id "abc123" \
recovered-state.tfstate
# Push recovered state (use with extreme caution)
terraform state push recovered-state.tfstate
Case Study: Platform Team Managing 30+ Microservices
A platform team responsible for 30+ microservices across development, staging, and production environments was deploying via a combination of Jenkins pipelines and manual kubectl apply commands. Deployments happened weekly, coordinated over Slack. Rollbacks required SSH access to a bastion host. Configuration drift was discovered only when something broke.
Before GitOps
- Deployment frequency: 4-5 per week (batched releases)
- Mean time to deploy: 45 minutes per service (manual process)
- Rollback time: 20-60 minutes (SSH to bastion, find the previous manifest, apply)
- Drift incidents: 2-3 per month (someone manually patched production)
- Config stored in: Jenkins job parameters, the team wiki, and individual laptops
Implementation
The Stripe Systems platform engineering team implemented GitOps over 6 weeks:
Week 1-2: Repository structure and ArgoCD setup. Migrated all Kubernetes manifests from Jenkins workspace directories to the monorepo structure shown above. Used Kustomize overlays for environment-specific configuration.
Week 3: ApplicationSet auto-discovery. The ApplicationSet generator (shown earlier) eliminated the need to manually register each service. New services were onboarded by creating a directory.
Week 4: Atlantis for Terraform. Migrated Terraform from a shared Jenkins agent to Atlantis with PR-based plan/apply. Added checkov scanning to the plan workflow.
Week 5: Secrets migration. Moved from Kubernetes secrets (stored in a shared 1Password vault and applied manually) to External Secrets Operator pulling from AWS Secrets Manager.
Week 6: Promotion workflows and monitoring. Implemented the PR-based promotion pipeline and set up ArgoCD notifications to Slack.
ArgoCD ApplicationSet Configuration
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: all-services
namespace: argocd
spec:
generators:
- matrix:
generators:
- git:
repoURL: https://github.com/org/platform-configs.git
revision: main
directories:
- path: services/*
- list:
elements:
- env: development
cluster: https://dev-cluster.internal:6443
- env: staging
cluster: https://staging-cluster.internal:6443
- env: production
cluster: https://prod-cluster.internal:6443
template:
metadata:
name: '{{path.basename}}-{{env}}'
namespace: argocd
spec:
project: '{{env}}'
source:
repoURL: https://github.com/org/platform-configs.git
targetRevision: main
path: 'services/{{path.basename}}/overlays/{{env}}'
destination:
server: '{{cluster}}'
namespace: '{{path.basename}}'
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
- ServerSideApply=true
This single ApplicationSet creates 90+ Applications (30 services × 3 environments) automatically.
After GitOps
- Deployment frequency: 15-20 per day (individual service deploys)
- Mean time to deploy: 3 minutes (commit to running in production)
- Rollback time: Under 2 minutes (git revert + auto-sync)
- Drift incidents: 0 per month (self-heal reverts manual changes)
- Config stored in: Git (single source of truth, fully auditable)
The quantitative improvement was significant, but the qualitative change mattered more: on-call engineers could diagnose and roll back issues without production access. Every change was traceable to a PR with an author and reviewer. The weekly deployment coordination meeting was eliminated entirely.