Infrastructure drift, the divergence between what is declared in code and what is actually running, is the root cause of a large class of production incidents. GitOps addresses this by making Git the single source of truth for both application deployments and infrastructure state. This post covers a practical GitOps implementation using ArgoCD for Kubernetes workloads and Terraform with Atlantis for cloud infrastructure.
GitOps Principles
GitOps is not a tool; it is a set of operational principles:
- Declarative: The entire system is described declaratively. For Kubernetes, this means YAML manifests or Helm charts. For infrastructure, Terraform HCL.
- Versioned and immutable: The desired state is stored in Git. Every change is a commit with an author, timestamp, and review trail.
- Pulled automatically: An agent (ArgoCD, Flux, a Terraform controller) pulls the desired state from Git and applies it. No human runs kubectl apply or terraform apply against production.
- Continuously reconciled: The agent detects drift between desired and actual state and corrects it. If someone manually scales a deployment, the agent scales it back to the declared replica count.
The value is operational: when something breaks, you run git log to find what changed, git revert to undo it, and the agent applies the rollback. No SSH access to production, no manual intervention.
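A minimal, self-contained sketch of that rollback loop, using a throwaway repo in place of the real config repository (file names and values are illustrative):

```shell
set -e
# Throwaway repo standing in for the config monorepo.
tmp=$(mktemp -d) && cd "$tmp"
git init -q && git config user.email dev@example.com && git config user.name dev
echo "replicas: 3" > deployment.yaml
git add . && git commit -qm "baseline"
echo "replicas: 50" > deployment.yaml
git commit -qam "bad change"
# Roll back: revert the bad commit; the GitOps agent would sync the reverted state.
git revert --no-edit HEAD >/dev/null
grep replicas deployment.yaml
# → replicas: 3
```

The revert is itself a commit, so the audit trail stays intact: history records both the bad change and its rollback.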
ArgoCD for Kubernetes Workloads
ArgoCD watches Git repositories and synchronizes Kubernetes resources to match the declared state.
Installation and Core Configuration
# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Expose via ingress (production setup)
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: argocd-server
namespace: argocd
annotations:
nginx.ingress.kubernetes.io/ssl-passthrough: "true"
nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
rules:
- host: argocd.internal.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: argocd-server
port:
number: 443
EOF
Application CRD
An ArgoCD Application defines the mapping between a Git repository path and a Kubernetes cluster/namespace:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: payment-service
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: production
source:
repoURL: https://github.com/org/platform-configs.git
targetRevision: main
path: services/payment-service/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: payment
syncPolicy:
automated:
prune: true # Delete resources removed from Git
selfHeal: true # Revert manual changes
syncOptions:
- CreateNamespace=true
- PrunePropagationPolicy=foreground
retry:
limit: 3
backoff:
duration: 5s
factor: 2
maxDuration: 3m
ignoreDifferences:
- group: apps
kind: Deployment
jsonPointers:
- /spec/replicas # Allow HPA to manage replicas
Key sync policy decisions:
- prune: true: Resources deleted from Git are deleted from the cluster. Without this, orphaned resources accumulate.
- selfHeal: true: Manual changes (e.g., someone runs kubectl scale) are reverted. This enforces Git as the sole source of truth.
- ignoreDifferences: The HPA changes replica counts, which conflicts with the declared count. Ignoring this field lets the HPA and ArgoCD coexist.
ApplicationSets for Service Auto-Discovery
With 30+ microservices, defining each Application manually is tedious. Use an ApplicationSet to auto-discover services from the repository structure:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: platform-services
namespace: argocd
spec:
generators:
- git:
repoURL: https://github.com/org/platform-configs.git
revision: main
directories:
- path: services/*/overlays/production
template:
metadata:
name: '{{path[1]}}'
namespace: argocd
labels:
team: '{{path[1]}}'
spec:
project: production
source:
repoURL: https://github.com/org/platform-configs.git
targetRevision: main
path: '{{path}}'
destination:
server: https://kubernetes.default.svc
namespace: '{{path[1]}}'
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
This generator scans the repository for directories matching services/*/overlays/production and creates an Application for each. Adding a new service means creating a directory; ArgoCD picks it up automatically.
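To illustrate, the generator's directory match can be simulated with a plain glob (the service name and temp directory are made up):

```shell
set -e
# Simulate what the git generator discovers: any directory
# matching services/*/overlays/production registers a service.
repo=$(mktemp -d)
mkdir -p "$repo/services/checkout-service/overlays/production"
cd "$repo"
ls -d services/*/overlays/production
# → services/checkout-service/overlays/production
```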
Helm vs Kustomize with ArgoCD
Use Kustomize when: You have plain YAML manifests and need per-environment patches (replica count, resource limits, ingress hostnames). Kustomize overlays are transparent โ you can read the base and the patch separately.
Use Helm when: You need complex templating logic (conditionals, loops), or you consume third-party charts (nginx-ingress, cert-manager, prometheus).
A common hybrid: Helm charts for third-party dependencies, Kustomize for in-house services.
services/
payment-service/
base/
deployment.yaml
service.yaml
kustomization.yaml
overlays/
development/
kustomization.yaml # patches: replicas=1, resources=low
staging/
kustomization.yaml # patches: replicas=2, resources=medium
production/
kustomization.yaml # patches: replicas=3, resources=high
infrastructure/
cert-manager/
Chart.yaml
values-production.yaml
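For reference, the base kustomization for an in-house service can stay minimal; a sketch of assumed contents (the overlays shown later layer patches on top of this):

```yaml
# services/payment-service/base/kustomization.yaml (assumed contents)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
```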
Terraform for Infrastructure: Bridging the GitOps Gap
Terraform is not natively GitOps. It requires a plan step (preview changes), an apply step (execute changes), and maintains state in a backend. The reconciliation loop is not continuous โ it runs on demand.
Atlantis: GitOps Workflow for Terraform
Atlantis bridges this gap by integrating Terraform into the PR workflow:
- Developer opens a PR that modifies .tf files
- Atlantis automatically runs terraform plan and comments the plan output on the PR
- Reviewer examines the plan and approves the PR
- Developer comments atlantis apply to execute the changes
- Atlantis applies, comments the result, and the PR is merged
# atlantis.yaml (repo-level configuration)
version: 3
projects:
- name: networking
dir: infrastructure/networking
workspace: production
terraform_version: v1.6.4
autoplan:
when_modified:
- "*.tf"
- "*.tfvars"
enabled: true
apply_requirements:
- approved
- mergeable
workflow: default
- name: compute
dir: infrastructure/compute
workspace: production
terraform_version: v1.6.4
autoplan:
when_modified:
- "*.tf"
- "*.tfvars"
- "../modules/ecs/**/*.tf"
enabled: true
apply_requirements:
- approved
- mergeable
workflow: default
- name: data
dir: infrastructure/data
workspace: production
terraform_version: v1.6.4
autoplan:
when_modified:
- "*.tf"
- "*.tfvars"
enabled: true
apply_requirements:
- approved
- mergeable
workflow: custom-with-checkov
workflows:
custom-with-checkov:
plan:
steps:
- init
- plan
- run: checkov -d . --framework terraform --quiet --compact
apply:
steps:
- apply
Drift Detection
ArgoCD continuously compares Git to cluster state. Terraform does not do this by default. Implement scheduled drift detection:
# GitHub Actions workflow for Terraform drift detection
name: Terraform Drift Detection
on:
schedule:
- cron: '0 8 * * 1-5' # Weekdays at 8 AM
jobs:
drift-check:
runs-on: ubuntu-latest
strategy:
matrix:
project: [networking, compute, data]
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.6.4"
- name: Terraform Init
working-directory: infrastructure/${{ matrix.project }}
run: terraform init -backend-config=backend-prod.hcl
- name: Terraform Plan (Drift Detection)
working-directory: infrastructure/${{ matrix.project }}
id: plan
run: |
terraform plan -detailed-exitcode -out=plan.tfplan 2>&1 | tee plan-output.txt
echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
continue-on-error: true
- name: Notify on Drift
if: steps.plan.outputs.exitcode == '2'
run: |
curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
-H 'Content-Type: application/json' \
-d "{\"text\": \"โ ๏ธ Terraform drift detected in ${{ matrix.project }}. Review: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\"}"
Exit code 2 from terraform plan -detailed-exitcode means changes are needed: the infrastructure has drifted from the declared state. (Note that $? after a pipeline reports the last command's status, which is why the workflow reads ${PIPESTATUS[0]} to capture Terraform's exit code rather than tee's.)
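The exit-code contract is easy to script against. A runnable sketch, with a stub function standing in for the real terraform binary so the handling logic runs anywhere:

```shell
# Stub standing in for the real terraform binary (returns 2 to
# simulate a successful plan that found pending changes).
terraform() { return 2; }

terraform plan -detailed-exitcode -out=plan.tfplan
code=$?
case $code in
  0) status="in sync" ;;          # plan succeeded, no changes
  2) status="drift detected" ;;   # plan succeeded, changes pending
  *) status="plan failed" ;;      # real error
esac
echo "$status"  # → drift detected
```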
Repository Structure
The repository structure significantly impacts workflow clarity and team autonomy.
Recommended: Monorepo with Clear Boundaries
platform-configs/
├── infrastructure/              # Terraform → cloud resources
│   ├── modules/                 # Reusable Terraform modules
│   │   ├── vpc/
│   │   ├── ecs/
│   │   ├── rds/
│   │   └── s3/
│   ├── networking/              # VPC, subnets, route tables
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── backend-prod.hcl
│   ├── compute/                 # ECS clusters, EKS, EC2
│   │   ├── main.tf
│   │   └── ...
│   └── data/                    # RDS, ElastiCache, S3
│       ├── main.tf
│       └── ...
├── services/                    # Kubernetes manifests → app workloads
│   ├── payment-service/
│   │   ├── base/
│   │   └── overlays/
│   │       ├── development/
│   │       ├── staging/
│   │       └── production/
│   ├── user-service/
│   │   ├── base/
│   │   └── overlays/
│   └── ...
├── platform/                    # Cluster-wide Kubernetes resources
│   ├── cert-manager/
│   ├── ingress-nginx/
│   ├── prometheus/
│   └── argocd/
├── atlantis.yaml
└── .github/
    └── workflows/
        └── drift-detection.yaml
Why monorepo: A change that modifies both Terraform (new RDS instance) and Kubernetes manifests (new service connecting to that RDS) can be reviewed in a single PR. Cross-cutting changes are atomic.
Why not monorepo: If infrastructure and application teams have different review cadences, a monorepo can create a PR bottleneck. In that case, split into infra-configs and app-configs repos, with ArgoCD watching each independently.
Secrets in GitOps
Secrets cannot be stored in plaintext in Git. Three approaches:
Sealed Secrets
Encrypt secrets client-side, store the encrypted version in Git, decrypt in-cluster:
# Install Sealed Secrets controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets -n kube-system
# Encrypt a secret
kubectl create secret generic db-credentials \
--from-literal=password=s3cret \
--dry-run=client -o yaml | \
kubeseal --format yaml > sealed-db-credentials.yaml
The sealed secret can safely be committed to Git. Only the Sealed Secrets controller (with its private key) can decrypt it.
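The committed file ends up looking roughly like this (ciphertext abbreviated, namespace assumed):

```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials
  namespace: payment
spec:
  encryptedData:
    password: AgBy3i4OJSWK+PiTySYZZA9rO43cGD...  # ciphertext; only the controller's private key decrypts it
  template:
    metadata:
      name: db-credentials
      namespace: payment
```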
SOPS with Age
Mozilla SOPS encrypts specific values within YAML files, keeping keys readable:
# Generate an age key
age-keygen -o key.txt
# Encrypt with SOPS
sops --encrypt --age age1ql3z7hjy54pw3hyww5ayyfg7zqgvc7w3j2elw8zmrj2kg5sfn9aqmcac8p \
--encrypted-regex '^(data|stringData)$' \
secret.yaml > secret.enc.yaml
# ArgoCD decrypts via kustomize-sops plugin or helm-secrets
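A repo-level .sops.yaml keeps those flags out of every invocation; a sketch reusing the age public key above (the file-naming convention in path_regex is an assumption):

```yaml
# .sops.yaml — assumed repo-level SOPS configuration
creation_rules:
  - path_regex: .*secret\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: age1ql3z7hjy54pw3hyww5ayyfg7zqgvc7w3j2elw8zmrj2kg5sfn9aqmcac8p
```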
External Secrets Operator
Fetch secrets from external stores (AWS Secrets Manager, Vault, GCP Secret Manager) at runtime:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: db-credentials
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: ClusterSecretStore
target:
name: db-credentials
data:
- secretKey: password
remoteRef:
key: production/payment-service/db
property: password
This is the approach we recommend for production. Secrets never touch Git, not even in encrypted form.
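The aws-secrets-manager store referenced above has to be defined somewhere; a sketch of the backing ClusterSecretStore (the region, service account name, and IRSA-based auth are assumptions):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa        # assumed SA with IAM role for Secrets Manager
            namespace: external-secrets
```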
Promotion Workflows
Changes should flow through environments: development → staging → production.
PR-Based Promotion with Kustomize
# Structure
services/payment-service/
base/
deployment.yaml # image: payment-service:IMAGE_TAG
kustomization.yaml
overlays/
development/
kustomization.yaml # image tag: abc123 (latest commit)
staging/
kustomization.yaml # image tag: abc123 (promoted from dev)
production/
kustomization.yaml # image tag: def456 (promoted from staging)
Promotion is a PR that updates the image tag in the next environment's overlay:
# overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base
images:
- name: payment-service
newName: ghcr.io/org/payment-service
newTag: abc123f # Updated by promotion PR
patches:
- target:
kind: Deployment
name: payment-service
patch: |
- op: replace
path: /spec/replicas
value: 2
Automate the promotion PR with a GitHub Actions workflow:
name: Promote to Staging
on:
workflow_dispatch:
inputs:
service:
description: 'Service name'
required: true
tag:
description: 'Image tag to promote'
required: true
jobs:
promote:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Update staging image tag
run: |
cd services/${{ inputs.service }}/overlays/staging
kustomize edit set image \
${{ inputs.service }}=ghcr.io/org/${{ inputs.service }}:${{ inputs.tag }}
- name: Create promotion PR
uses: peter-evans/create-pull-request@v6
with:
title: "Promote ${{ inputs.service }}:${{ inputs.tag }} to staging"
branch: promote/${{ inputs.service }}-${{ inputs.tag }}-staging
commit-message: "chore: promote ${{ inputs.service }}:${{ inputs.tag }} to staging"
body: |
Automated promotion from development.
- Service: `${{ inputs.service }}`
- Image tag: `${{ inputs.tag }}`
reviewers: platform-team
ArgoCD Health Checks and Rollback
Custom Health Checks
ArgoCD can monitor resource health beyond standard Kubernetes readiness:
# argocd-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: argocd-cm
namespace: argocd
data:
resource.customizations.health.argoproj.io_Rollout: |
hs = {}
if obj.status ~= nil then
if obj.status.conditions ~= nil then
for _, condition in ipairs(obj.status.conditions) do
if condition.type == "Degraded" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message
return hs
end
end
end
if obj.status.currentPodHash ~= nil then
if obj.status.currentPodHash == obj.status.stableRS then
hs.status = "Healthy"
else
hs.status = "Progressing"
end
return hs
end
end
hs.status = "Progressing"
return hs
Automated Rollback
ArgoCD does not natively perform automated rollback on health check failure, but you can combine it with Argo Rollouts for canary/blue-green deployments with automatic rollback:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: payment-service
spec:
replicas: 3
strategy:
canary:
steps:
- setWeight: 20
- pause: { duration: 5m }
- setWeight: 50
- pause: { duration: 5m }
- setWeight: 100
analysis:
templates:
- templateName: success-rate
startingStep: 1
args:
- name: service-name
value: payment-service
selector:
matchLabels:
app: payment-service
template:
metadata:
labels:
app: payment-service
spec:
containers:
- name: payment-service
image: ghcr.io/org/payment-service:abc123
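The success-rate template referenced in the analysis block lives in a separate resource. A sketch of what it might contain (the Prometheus address, metric names, and threshold are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      failureLimit: 1                      # one failed measurement aborts the canary
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

When a measurement fails, the Rollout aborts the canary and shifts traffic back to the stable ReplicaSet automatically.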
Disaster Recovery
ArgoCD Backup
# Export all ArgoCD applications
kubectl get applications -n argocd -o yaml > argocd-apps-backup.yaml
# Export ArgoCD configuration
kubectl get configmap argocd-cm argocd-rbac-cm -n argocd -o yaml > argocd-config-backup.yaml
kubectl get secret argocd-secret -n argocd -o yaml > argocd-secret-backup.yaml
# Automated backup with CronJob
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
name: argocd-backup
namespace: argocd
spec:
schedule: "0 2 * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: bitnami/kubectl:latest
command:
- /bin/sh
- -c
- |
kubectl get applications -n argocd -o yaml > /backup/apps.yaml
aws s3 cp /backup/apps.yaml s3://backups/argocd/apps-\$(date +%Y%m%d).yaml
volumeMounts:
- name: backup
mountPath: /backup
volumes:
- name: backup
emptyDir: {}
restartPolicy: OnFailure
serviceAccountName: argocd-backup-sa
EOF
Terraform State Recovery
# Always use versioned S3 bucket for state
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
# Recover a previous state version
aws s3api list-object-versions \
--bucket terraform-state-bucket \
--prefix networking/terraform.tfstate
aws s3api get-object \
--bucket terraform-state-bucket \
--key networking/terraform.tfstate \
--version-id "abc123" \
recovered-state.tfstate
# Push recovered state (use with extreme caution)
terraform state push recovered-state.tfstate
Case Study: Platform Team Managing 30+ Microservices
A platform team responsible for 30+ microservices across development, staging, and production environments was deploying via a combination of Jenkins pipelines and manual kubectl apply commands. Deployments happened weekly, coordinated over Slack. Rollbacks required SSH access to a bastion host. Configuration drift was discovered only when something broke.
Before GitOps
- Deployment frequency: 4-5 per week (batched releases)
- Mean time to deploy: 45 minutes per service (manual process)
- Rollback time: 20-60 minutes (SSH to bastion, find the previous manifest, apply)
- Drift incidents: 2-3 per month (someone manually patched production)
- Config stored in: Jenkins job parameters, the team wiki, and individual laptops
Implementation
The Stripe Systems platform engineering team implemented GitOps over 6 weeks:
Week 1-2: Repository structure and ArgoCD setup. Migrated all Kubernetes manifests from Jenkins workspace directories to the monorepo structure shown above. Used Kustomize overlays for environment-specific configuration.
Week 3: ApplicationSet auto-discovery. The ApplicationSet generator (shown earlier) eliminated the need to manually register each service. New services were onboarded by creating a directory.
Week 4: Atlantis for Terraform. Migrated Terraform from a shared Jenkins agent to Atlantis with PR-based plan/apply. Added checkov scanning to the plan workflow.
Week 5: Secrets migration. Moved from Kubernetes secrets (stored in a shared 1Password vault and applied manually) to External Secrets Operator pulling from AWS Secrets Manager.
Week 6: Promotion workflows and monitoring. Implemented the PR-based promotion pipeline and set up ArgoCD notifications to Slack.
ArgoCD ApplicationSet Configuration
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: all-services
namespace: argocd
spec:
generators:
- matrix:
generators:
- git:
repoURL: https://github.com/org/platform-configs.git
revision: main
directories:
- path: services/*
- list:
elements:
- env: development
cluster: https://dev-cluster.internal:6443
- env: staging
cluster: https://staging-cluster.internal:6443
- env: production
cluster: https://prod-cluster.internal:6443
template:
metadata:
name: '{{path.basename}}-{{env}}'
namespace: argocd
spec:
project: '{{env}}'
source:
repoURL: https://github.com/org/platform-configs.git
targetRevision: main
path: 'services/{{path.basename}}/overlays/{{env}}'
destination:
server: '{{cluster}}'
namespace: '{{path.basename}}'
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
- ServerSideApply=true
This single ApplicationSet creates 90+ Applications (30 services × 3 environments) automatically.
After GitOps
- Deployment frequency: 15-20 per day (individual service deploys)
- Mean time to deploy: 3 minutes (commit to running in production)
- Rollback time: Under 2 minutes (git revert + auto-sync)
- Drift incidents: 0 per month (self-heal reverts manual changes)
- Config stored in: Git (single source of truth, fully auditable)
The quantitative improvement was significant, but the qualitative change mattered more: on-call engineers could diagnose and roll back issues without production access. Every change was traceable to a PR with an author and reviewer. The weekly deployment coordination meeting was eliminated entirely.