Stripe Systems
DevOps · March 13, 2026 · 11 min read

GitOps with ArgoCD and Terraform: The Infrastructure Deployment Workflow That Eliminates Drift

โœ๏ธ
Stripe Systems Engineering

Infrastructure drift, the divergence between what is declared in code and what is actually running, is the root cause of a large class of production incidents. GitOps addresses this by making Git the single source of truth for both application deployments and infrastructure state. This post covers the practical implementation of GitOps using ArgoCD for Kubernetes workloads and Terraform with Atlantis for cloud infrastructure.

GitOps Principles

GitOps is not a tool; it is a set of operational principles:

  1. Declarative: The entire system is described declaratively. For Kubernetes, this means YAML manifests or Helm charts. For infrastructure, Terraform HCL.
  2. Versioned and immutable: The desired state is stored in Git. Every change is a commit with an author, timestamp, and review trail.
  3. Pulled automatically: An agent (ArgoCD, Flux, Terraform controller) pulls the desired state from Git and applies it. No human runs kubectl apply or terraform apply against production.
  4. Continuously reconciled: The agent detects drift between desired and actual state and corrects it. If someone manually scales a deployment, the agent scales it back to the declared replica count.

The value is operational: when something breaks, you git log to find what changed, git revert to undo it, and the agent applies the rollback. No SSH access to production, no manual intervention.
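That loop is easy to demonstrate with Git alone. A minimal, self-contained sketch in a throwaway repository (file names and commit messages are illustrative; in a real setup the agent, not you, applies the reverted state):

```shell
# Simulate the GitOps rollback loop in a scratch repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email ci@example.com
git config user.name ci

echo "replicas: 3" > deployment.yaml            # declared good state
git add deployment.yaml
git commit -q -m "scale payment-service to 3"

echo "replicas: 10" > deployment.yaml           # the bad change
git commit -q -am "scale payment-service to 10"

git log --oneline | head -n 2                   # find what changed
git revert --no-edit HEAD >/dev/null            # undo it with a new commit
cat deployment.yaml                             # back to the declared state
```

The rollback is itself a commit, so the audit trail stays intact and the sync agent converges the cluster to the reverted state on its next reconciliation.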

ArgoCD for Kubernetes Workloads

ArgoCD watches Git repositories and synchronizes Kubernetes resources to match the declared state.

Installation and Core Configuration

# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Expose via ingress (production setup)
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
  namespace: argocd
  annotations:
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  rules:
    - host: argocd.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: argocd-server
                port:
                  number: 443
EOF

Application CRD

An ArgoCD Application defines the mapping between a Git repository path and a Kubernetes cluster/namespace:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    repoURL: https://github.com/org/platform-configs.git
    targetRevision: main
    path: services/payment-service/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payment
  syncPolicy:
    automated:
      prune: true       # Delete resources removed from Git
      selfHeal: true     # Revert manual changes
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # Allow HPA to manage replicas

Key sync policy decisions:

  • prune: true - Resources deleted from Git are deleted from the cluster. Without this, orphaned resources accumulate.
  • selfHeal: true - Manual changes (e.g., someone runs kubectl scale) are reverted. This enforces Git as the sole source of truth.
  • ignoreDifferences - HPA changes replica counts, which conflicts with the declared count. Ignore this field to let HPA and ArgoCD coexist.

The ApplicationSet Pattern

With 30+ microservices, defining each Application manually is tedious. Use an ApplicationSet to auto-discover services from the repository structure:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/org/platform-configs.git
        revision: main
        directories:
          - path: services/*/overlays/production
  template:
    metadata:
      name: '{{path[1]}}'
      namespace: argocd
      labels:
        team: '{{path[1]}}'
    spec:
      project: production
      source:
        repoURL: https://github.com/org/platform-configs.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path[1]}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true

This generator scans the repository for directories matching services/*/overlays/production and creates an Application for each. Adding a new service means creating a directory; ArgoCD picks it up automatically.

Helm vs Kustomize with ArgoCD

Use Kustomize when: You have plain YAML manifests and need per-environment patches (replica count, resource limits, ingress hostnames). Kustomize overlays are transparent: you can read the base and the patch separately.

Use Helm when: You need complex templating logic (conditionals, loops), or you consume third-party charts (nginx-ingress, cert-manager, prometheus).

A common hybrid: Helm charts for third-party dependencies, Kustomize for in-house services.

services/
  payment-service/
    base/
      deployment.yaml
      service.yaml
      kustomization.yaml
    overlays/
      development/
        kustomization.yaml    # patches: replicas=1, resources=low
      staging/
        kustomization.yaml    # patches: replicas=2, resources=medium
      production/
        kustomization.yaml    # patches: replicas=3, resources=high
  infrastructure/
    cert-manager/
      Chart.yaml
      values-production.yaml
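For the hybrid layout, the third-party half plugs into ArgoCD as a Helm-sourced Application. A sketch for cert-manager, assuming the Jetstack chart repository; the pinned chart version and destination are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://charts.jetstack.io   # third-party chart repository
    chart: cert-manager
    targetRevision: v1.14.4               # pin chart versions explicitly
    helm:
      parameters:
        - name: installCRDs
          value: "true"
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```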

Terraform for Infrastructure: Bridging the GitOps Gap

Terraform is not natively GitOps. It requires a plan step (preview changes), an apply step (execute changes), and maintains state in a backend. The reconciliation loop is not continuous; it runs on demand.
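That shared state backend is what lets Atlantis and CI operate on the same infrastructure. A sketch of the two pieces behind the backend-prod.hcl files referenced below (bucket, table, and region are illustrative):

```hcl
# main.tf: declare an empty s3 backend, filled in per environment
terraform {
  backend "s3" {}
}
```

```hcl
# backend-prod.hcl: passed via `terraform init -backend-config=backend-prod.hcl`
bucket         = "terraform-state-bucket"
key            = "networking/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "terraform-locks"   # state locking
encrypt        = true
```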

Atlantis: GitOps Workflow for Terraform

Atlantis bridges this gap by integrating Terraform into the PR workflow:

  1. Developer opens a PR that modifies .tf files
  2. Atlantis automatically runs terraform plan and comments the plan output on the PR
  3. Reviewer examines the plan and approves the PR
  4. Developer comments atlantis apply to execute the changes
  5. Atlantis applies, comments the result, and the PR is merged

# atlantis.yaml (repo-level configuration)
version: 3
projects:
  - name: networking
    dir: infrastructure/networking
    workspace: production
    terraform_version: v1.6.4
    autoplan:
      when_modified:
        - "*.tf"
        - "*.tfvars"
      enabled: true
    apply_requirements:
      - approved
      - mergeable
    workflow: default

  - name: compute
    dir: infrastructure/compute
    workspace: production
    terraform_version: v1.6.4
    autoplan:
      when_modified:
        - "*.tf"
        - "*.tfvars"
        - "../modules/ecs/**/*.tf"
      enabled: true
    apply_requirements:
      - approved
      - mergeable
    workflow: default

  - name: data
    dir: infrastructure/data
    workspace: production
    terraform_version: v1.6.4
    autoplan:
      when_modified:
        - "*.tf"
        - "*.tfvars"
      enabled: true
    apply_requirements:
      - approved
      - mergeable
    workflow: custom-with-checkov

workflows:
  custom-with-checkov:
    plan:
      steps:
        - init
        - plan
        - run: checkov -d . --framework terraform --quiet --compact
    apply:
      steps:
        - apply

Drift Detection

ArgoCD continuously compares Git to cluster state. Terraform does not do this by default. Implement scheduled drift detection:

# GitHub Actions workflow for Terraform drift detection
name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 8 * * 1-5'  # Weekdays at 08:00 UTC

jobs:
  drift-check:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        project: [networking, compute, data]
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.4"

      - name: Terraform Init
        working-directory: infrastructure/${{ matrix.project }}
        run: terraform init -backend-config=backend-prod.hcl

      - name: Terraform Plan (Drift Detection)
        working-directory: infrastructure/${{ matrix.project }}
        id: plan
        run: |
          set +e  # don't abort on exit code 2; PIPESTATUS[0] is terraform's code, not tee's
          terraform plan -detailed-exitcode -out=plan.tfplan 2>&1 | tee plan-output.txt
          echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
        continue-on-error: true

      - name: Notify on Drift
        if: steps.plan.outputs.exitcode == '2'
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d "{\"text\": \"โš ๏ธ Terraform drift detected in ${{ matrix.project }}. Review: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\"}"

Exit code 2 from terraform plan -detailed-exitcode means changes are pending: infrastructure has drifted from the declared state.
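The three-way exit-code contract (0 = no changes, 1 = error, 2 = changes pending) is worth handling explicitly rather than treating any non-zero code as failure. A minimal sketch, where tf_plan is a stand-in for `terraform plan -detailed-exitcode`:

```shell
# Branch on terraform's -detailed-exitcode contract.
tf_plan() { return 2; }   # stub simulating a drifted environment

tf_plan
case $? in
  0) status="in-sync" ;;
  2) status="drift-detected" ;;
  *) status="plan-error" ;;
esac
echo "$status"
```

In a pipeline, "in-sync" exits quietly, "drift-detected" triggers the notification path, and "plan-error" should fail the job loudly.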

Repository Structure

The repository structure significantly impacts workflow clarity and team autonomy.

Recommended: Monorepo with Clear Boundaries

platform-configs/
├── infrastructure/           # Terraform: cloud resources
│   ├── modules/              # Reusable Terraform modules
│   │   ├── vpc/
│   │   ├── ecs/
│   │   ├── rds/
│   │   └── s3/
│   ├── networking/           # VPC, subnets, route tables
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── backend-prod.hcl
│   ├── compute/              # ECS clusters, EKS, EC2
│   │   ├── main.tf
│   │   └── ...
│   └── data/                 # RDS, ElastiCache, S3
│       ├── main.tf
│       └── ...
├── services/                 # Kubernetes manifests: app workloads
│   ├── payment-service/
│   │   ├── base/
│   │   └── overlays/
│   │       ├── development/
│   │       ├── staging/
│   │       └── production/
│   ├── user-service/
│   │   ├── base/
│   │   └── overlays/
│   └── ...
├── platform/                 # Cluster-wide Kubernetes resources
│   ├── cert-manager/
│   ├── ingress-nginx/
│   ├── prometheus/
│   └── argocd/
├── atlantis.yaml
└── .github/
    └── workflows/
        └── drift-detection.yaml

Why monorepo: A change that modifies both Terraform (new RDS instance) and Kubernetes manifests (new service connecting to that RDS) can be reviewed in a single PR. Cross-cutting changes are atomic.

Why not monorepo: If infrastructure and application teams have different review cadences, a monorepo can create a PR bottleneck. In that case, split into infra-configs and app-configs repos, with ArgoCD watching each independently.

Secrets in GitOps

Secrets cannot be stored in plaintext in Git. Three approaches:

Sealed Secrets

Encrypt secrets client-side, store the encrypted version in Git, decrypt in-cluster:

# Install Sealed Secrets controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets -n kube-system

# Encrypt a secret
kubectl create secret generic db-credentials \
  --from-literal=password=s3cret \
  --dry-run=client -o yaml | \
  kubeseal --format yaml > sealed-db-credentials.yaml

The sealed secret can safely be committed to Git. Only the Sealed Secrets controller (with its private key) can decrypt it.

SOPS with Age

Mozilla SOPS encrypts specific values within YAML files, keeping keys readable:

# Generate an age key
age-keygen -o key.txt

# Encrypt with SOPS
sops --encrypt --age age1ql3z7hjy54pw3hyww5ayyfg7zqgvc7w3j2elw8zmrj2kg5sfn9aqmcac8p \
  --encrypted-regex '^(data|stringData)$' \
  secret.yaml > secret.enc.yaml

# ArgoCD decrypts via kustomize-sops plugin or helm-secrets

External Secrets Operator

Fetch secrets from external stores (AWS Secrets Manager, Vault, GCP Secret Manager) at runtime:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: production/payment-service/db
        property: password

This is the approach we recommend for production. Secrets never touch Git, not even in encrypted form.
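The secretStoreRef above points at a ClusterSecretStore that must exist separately. A sketch, assuming IRSA-style JWT authentication (region, service account name, and namespace are illustrative):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa      # bound to an IAM role via IRSA
            namespace: external-secrets
```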

Promotion Workflows

Changes should flow through environments: development → staging → production.

PR-Based Promotion with Kustomize

# Structure
services/payment-service/
  base/
    deployment.yaml          # image: payment-service:IMAGE_TAG
    kustomization.yaml
  overlays/
    development/
      kustomization.yaml     # image tag: abc123 (latest commit)
    staging/
      kustomization.yaml     # image tag: abc123 (promoted from dev)
    production/
      kustomization.yaml     # image tag: def456 (promoted from staging)

Promotion is a PR that updates the image tag in the next environment's overlay:

# overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: payment-service
    newName: ghcr.io/org/payment-service
    newTag: abc123f  # Updated by promotion PR
patches:
  - target:
      kind: Deployment
      name: payment-service
    patch: |
      - op: replace
        path: /spec/replicas
        value: 2

Automate the promotion PR with a GitHub Actions workflow:

name: Promote to Staging
on:
  workflow_dispatch:
    inputs:
      service:
        description: 'Service name'
        required: true
      tag:
        description: 'Image tag to promote'
        required: true

jobs:
  promote:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Update staging image tag
        run: |
          cd services/${{ inputs.service }}/overlays/staging
          kustomize edit set image \
            ${{ inputs.service }}=ghcr.io/org/${{ inputs.service }}:${{ inputs.tag }}

      - name: Create promotion PR
        uses: peter-evans/create-pull-request@v6
        with:
          title: "Promote ${{ inputs.service }}:${{ inputs.tag }} to staging"
          branch: promote/${{ inputs.service }}-${{ inputs.tag }}-staging
          commit-message: "chore: promote ${{ inputs.service }}:${{ inputs.tag }} to staging"
          body: |
            Automated promotion from development.
            - Service: `${{ inputs.service }}`
            - Image tag: `${{ inputs.tag }}`
          reviewers: platform-team

ArgoCD Health Checks and Rollback

Custom Health Checks

ArgoCD can monitor resource health beyond standard Kubernetes readiness:

# argocd-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.argoproj.io_Rollout: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.conditions ~= nil then
        for _, condition in ipairs(obj.status.conditions) do
          if condition.type == "Degraded" and condition.status == "True" then
            hs.status = "Degraded"
            hs.message = condition.message
            return hs
          end
        end
      end
      if obj.status.currentPodHash ~= nil then
        if obj.status.currentPodHash == obj.status.stableRS then
          hs.status = "Healthy"
        else
          hs.status = "Progressing"
        end
        return hs
      end
    end
    hs.status = "Progressing"
    return hs

Automated Rollback

ArgoCD does not natively perform automated rollback on health check failure, but you can combine it with Argo Rollouts for canary/blue-green deployments with automatic rollback:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 3
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
        args:
          - name: service-name
            value: payment-service
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: ghcr.io/org/payment-service:abc123

Disaster Recovery

ArgoCD Backup

# Export all ArgoCD applications
kubectl get applications -n argocd -o yaml > argocd-apps-backup.yaml

# Export ArgoCD configuration
kubectl get configmap argocd-cm argocd-rbac-cm -n argocd -o yaml > argocd-config-backup.yaml
kubectl get secret argocd-secret -n argocd -o yaml > argocd-secret-backup.yaml

# Automated backup with CronJob
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argocd-backup
  namespace: argocd
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: bitnami/kubectl:latest  # note: build a custom image; this one lacks the AWS CLI
              command:
                - /bin/sh
                - -c
                - |
                  kubectl get applications -n argocd -o yaml > /backup/apps.yaml
                  aws s3 cp /backup/apps.yaml s3://backups/argocd/apps-\$(date +%Y%m%d).yaml
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              emptyDir: {}
          restartPolicy: OnFailure
          serviceAccountName: argocd-backup-sa
EOF

Terraform State Recovery

# Always use versioned S3 bucket for state
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Recover a previous state version
aws s3api list-object-versions \
  --bucket terraform-state-bucket \
  --prefix networking/terraform.tfstate

aws s3api get-object \
  --bucket terraform-state-bucket \
  --key networking/terraform.tfstate \
  --version-id "abc123" \
  recovered-state.tfstate

# Push recovered state (use with extreme caution)
terraform state push recovered-state.tfstate

Case Study: Platform Team Managing 30+ Microservices

A platform team responsible for 30+ microservices across development, staging, and production environments was deploying via a combination of Jenkins pipelines and manual kubectl apply commands. Deployments happened weekly, coordinated over Slack. Rollbacks required SSH access to a bastion host. Configuration drift was discovered only when something broke.

Before GitOps

  • Deployment frequency: 4-5 per week (batched releases)
  • Mean time to deploy: 45 minutes per service (manual process)
  • Rollback time: 20-60 minutes (SSH to bastion, find previous manifest, apply)
  • Drift incidents: 2-3 per month (someone manually patched production)
  • Config stored in: Jenkins job parameters, team wiki, individual laptops

Implementation

The Stripe Systems platform engineering team implemented GitOps over 6 weeks:

Week 1-2: Repository structure and ArgoCD setup. Migrated all Kubernetes manifests from Jenkins workspace directories to the monorepo structure shown above. Used Kustomize overlays for environment-specific configuration.

Week 3: ApplicationSet auto-discovery. The ApplicationSet generator (shown earlier) eliminated the need to manually register each service. New services were onboarded by creating a directory.

Week 4: Atlantis for Terraform. Migrated Terraform from a shared Jenkins agent to Atlantis with PR-based plan/apply. Added checkov scanning to the plan workflow.

Week 5: Secrets migration. Moved from Kubernetes secrets (stored in a shared 1Password vault and applied manually) to External Secrets Operator pulling from AWS Secrets Manager.

Week 6: Promotion workflows and monitoring. Implemented the PR-based promotion pipeline and set up ArgoCD notifications to Slack.

ArgoCD ApplicationSet Configuration

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: all-services
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - git:
              repoURL: https://github.com/org/platform-configs.git
              revision: main
              directories:
                - path: services/*
          - list:
              elements:
                - env: development
                  cluster: https://dev-cluster.internal:6443
                - env: staging
                  cluster: https://staging-cluster.internal:6443
                - env: production
                  cluster: https://prod-cluster.internal:6443
  template:
    metadata:
      name: '{{path.basename}}-{{env}}'
      namespace: argocd
    spec:
      project: '{{env}}'
      source:
        repoURL: https://github.com/org/platform-configs.git
        targetRevision: main
        path: 'services/{{path.basename}}/overlays/{{env}}'
      destination:
        server: '{{cluster}}'
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - ServerSideApply=true

This single ApplicationSet creates 90+ Applications (30 services × 3 environments) automatically.

After GitOps

  • Deployment frequency: 15-20 per day (individual service deploys)
  • Mean time to deploy: 3 minutes (commit to running in production)
  • Rollback time: under 2 minutes (git revert + auto-sync)
  • Drift incidents: 0 per month (self-heal reverts manual changes)
  • Config stored in: Git (single source of truth, fully auditable)

The quantitative improvement was significant, but the qualitative change mattered more: on-call engineers could diagnose and roll back issues without production access. Every change was traceable to a PR with an author and reviewer. The weekly deployment coordination meeting was eliminated entirely.
