Most engineering organisations think they have a DevOps problem. They do not. They have a DevOps belief problem — they believe their CI/CD pipeline, weekly deploys, and a Datadog dashboard amount to a mature DevOps practice. The data says otherwise. The gap between Medium-tier and Elite-tier teams in 2026 is not "more tools"; it is a small set of operational disciplines, applied consistently, that compound into roughly 2× delivery velocity at half the change-failure rate.
This benchmark distills what that gap actually looks like — in numbers, in concrete practices, and in the anti-patterns that quietly hold the middle of the market in place. It is the empirical companion to our DevOps Maturity Matrix self-assessment.
TL;DR — The Numbers That Define the Gap
After triangulating the 2024 DORA Accelerate State of DevOps Report, the 2025 GitHub State of the Octoverse, the Puppet State of DevOps surveys, and our own per-engagement assessments, the four DORA metrics in 2026 cluster as follows:
| DORA metric | Elite (top ~17%) | High (~31%) | Medium (~36%) | Low (~16%) |
|---|---|---|---|---|
| Deployment frequency | On demand (multiple per day) | Daily to weekly | Weekly to monthly | Monthly+ |
| Lead time for changes | < 1 day | 1 day - 1 week | 1 week - 1 month | > 1 month |
| Change failure rate | 0-5% | 5-15% | 16-30% | > 30% |
| Failed-deployment recovery (MTTR) | < 1 hour | < 1 day | 1 day - 1 week | > 1 week |
Two observations matter more than the headline numbers:
- The bands are bimodal, not continuous. Teams cluster in Medium or Elite — there are surprisingly few teams in the "almost Elite" zone. The leap from Medium to High requires a structural change (typically the shift to trunk-based development and feature flags), not incremental polish.
- Change failure rate is the single best predictor of which band a team will land in. Teams with sub-10% change-failure rates almost always also have fast lead time and high deployment frequency, because they trust their pipeline enough to use it aggressively. Teams with > 20% change-failure rates ration their deployments — and rationing deployments is the path to ossification.
The remainder of this post is what the gap actually consists of, dimension by dimension.
The Four DORA Metrics, Properly Defined
Before any benchmarking is meaningful, the metrics need to be defined unambiguously — most "DORA dashboards" we see in the wild are subtly mis-instrumented:
Deployment frequency = the number of times code is deployed to production, not merged to main. Some teams measure merges-to-main as a proxy and mistakenly report Elite-tier numbers; if main is hours or days behind production (because of release-train batching), the proxy is wrong.
Lead time for changes = wall-clock time from first commit to that change being live in production, measured per change-set. The key word is live — not "merged", not "deployed to staging". Many DORA dashboards report commit-to-merge time and mislabel it.
Change failure rate = the percentage of production changes that result in degraded service requiring rollback, hotfix, or recovery action — within a defined window (typically 7 days). Counting only severity-1 incidents understates this; counting any rollback overstates it. The defensible definition is "production changes that triggered an unplanned customer-facing remediation".
Failed-deployment recovery time (MTTR) = time from incident detection to service restoration, measured per incident, not per commit. This is recovery, not root-cause-fix. A rollback in 8 minutes counts as 8 minutes even if the root-cause fix lands a week later.
Mis-instrumented DORA metrics are an Elite-tier anti-signal: a team confidently reporting Elite numbers without explaining how they measure each metric is almost certainly counting something easier than the canonical definition.
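Getting these right is easier to check in code than in prose. Below is a minimal sketch of the canonical calculations over deploy and incident records; the record shape and field names are illustrative assumptions, not any particular tool's schema.

```python
from datetime import datetime
from statistics import median

# Illustrative records -- the field names are assumptions, not a real tool's schema.
deploys = [
    {"first_commit_at": datetime(2026, 4, 1, 9, 0),
     "live_in_prod_at": datetime(2026, 4, 1, 15, 30),
     "caused_remediation": False},
    {"first_commit_at": datetime(2026, 4, 2, 11, 0),
     "live_in_prod_at": datetime(2026, 4, 2, 13, 45),
     "caused_remediation": True},
]
incidents = [
    {"detected_at": datetime(2026, 4, 2, 14, 0),
     "service_restored_at": datetime(2026, 4, 2, 14, 8)},
]

def deployment_frequency(deploys, days):
    # Deploys that reached *production*, not merges to main.
    return len(deploys) / days

def lead_time(deploys):
    # First commit -> live in production, per change-set; median, not mean.
    return median(d["live_in_prod_at"] - d["first_commit_at"] for d in deploys)

def change_failure_rate(deploys):
    # Share of production changes that triggered unplanned, customer-facing remediation.
    return sum(d["caused_remediation"] for d in deploys) / len(deploys)

def recovery_time(incidents):
    # Detection -> restoration, per *incident*: the 8-minute rollback counts
    # as 8 minutes even if the root-cause fix lands a week later.
    return median(i["service_restored_at"] - i["detected_at"] for i in incidents)
```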
Where Teams Sit at Each Maturity Level
Our DevOps Maturity Matrix tool measures eight dimensions on a CMMI-style 1-5 scale. Here is what teams genuinely look like at each level — not the marketing version.
Level 1 — Initial
- Deployments are individual events. Someone calls a meeting, a deploy plan is written, the team gathers, and the process is executed manually with much breath-holding.
- Tests exist, but coverage is uneven and trust is low. "Run the smoke tests after we deploy" is normal.
- Monitoring is a Datadog or CloudWatch dashboard nobody actively watches. Alerts come from customers.
- Security is a Q4 audit. "Production access" is approximately equal to "the engineering team".
- IaC, if it exists, is a partial set of Terraform modules around a hand-built core that nobody fully understands.
Realistic deployment frequency: 1-2 per month. Lead time: weeks. Change failure rate: 30-50%.
Level 2 — Managed
- A CI pipeline exists and runs on every push. It can be bypassed when "urgent".
- Some tests run automatically; coverage is tracked but not enforced.
- A staging environment exists; it differs subtly from production in ways that bite during deploys.
- On-call is a rota; the playbook is "Slack the senior engineer".
- IaC covers the bulk of new infrastructure but legacy services are still hand-built.
- Security scans run quarterly, sometimes automated.
Realistic deployment frequency: weekly. Lead time: 1-3 weeks. Change failure rate: 20-30%.
Level 3 — Defined
- CI is mandatory, not bypassable. Test coverage thresholds are enforced.
- Deployment is one-button (or one-merge) for most services; a small number of legacy services still require human attention.
- SLOs exist for the most-trafficked services; error budgets are computed monthly but rarely spent deliberately.
- IaC is the default for all new work; configuration drift in legacy services is acknowledged and on a backlog.
- Security findings have an SLA; criticals are remediated in days.
- Chaos engineering is "we did a game day last quarter".
Realistic deployment frequency: daily. Lead time: 1-3 days. Change failure rate: 10-20%.
Level 4 — Quantitatively Managed
- DORA metrics are tracked, instrumented correctly, and reviewed in engineering reviews. Trends drive engineering investment.
- Trunk-based development is the standard; long-lived branches are unusual and require justification.
- Feature flags decouple deploy from release; "did you deploy that?" is the wrong question because deploys are not events.
- SLO error budgets are operationalised — when a service is over budget, that team's feature work is paused until the budget recovers. Engineering leadership respects this.
- Continuous compliance: SOC 2, PCI, ISO 27001 controls are continuously evidenced from CI artifacts, not assembled in a sprint before the audit.
- A platform engineering team operates an internal developer platform with documented golden paths.
Realistic deployment frequency: many per day. Lead time: hours. Change failure rate: 5-10%. MTTR: under an hour.
Level 5 — Optimizing
- The platform team measures DORA metrics by team, not by org, and uses the variance as a coaching signal — not as a stack rank.
- Progressive delivery is the default: every change reaches 1% of traffic before 10%, 10% before 50%, 50% before 100%, with auto-rollback wired into each gate.
- Resilience testing is continuous: chaos experiments run on prod-equivalent loads weekly, and findings drive architecture decisions.
- Security shift-left is operational: SAST/DAST/SCA run pre-merge; SBOMs are generated for every artifact; runtime threat detection is wired into incident response.
- The platform itself has a roadmap and OKRs, run as a product team — not as "ops".
Realistic deployment frequency: per-commit, on demand. Lead time: minutes-to-hours. Change failure rate: under 5%. MTTR: under 30 minutes.
The progression is non-linear. The hardest leap is L2 → L3 (where automation must become non-bypassable) and L3 → L4 (where engineering leadership must agree that error budgets actually pause feature work). The leap from L4 → L5 is comparatively simpler — by the time you are L4, the cultural and structural foundations are mostly in place.
What Top 1% Teams Actually Do Differently
Beyond the metrics, the practices that materially separate Elite-tier teams from the merely-good are concrete enough that you can audit your own setup against them.
1. They run trunk-based development, not "kind of trunk-based"
The conventional wisdom is "merge to main frequently". The Elite-tier discipline is sharper:
- No long-lived feature branches. Branches live < 24 hours; the average is hours, not days.
- Feature flags decouple merge from release. Code is shipped behind a flag and turned on independently — sometimes minutes later, sometimes weeks.
- Pre-merge CI is fast (< 10 minutes for the inner loop). A 45-minute pre-merge CI gate is incompatible with trunk-based development at scale, because engineers context-switch and lose the muscle memory.
- Trunk is always releasable. If you cannot deploy main right now, that is a P1 — fix it before doing anything else.
Teams that "do trunk-based development" but still routinely have main broken at 3pm are practicing the form, not the function.
2. They operationalise error budgets
Error budgets only matter if they are spent deliberately. The Elite-tier practice:
- Each service has an SLO (e.g., 99.9% successful requests over 28 days), translated to an error budget (roughly 40 minutes of downtime per 28-day window).
- When a service is over its error budget, the team's feature work pauses. Reliability work takes over until the budget recovers.
- Engineering leadership respects this. A VP who routinely overrides error-budget pauses to ship a feature is the single most reliable signal that an organisation is stuck at L3.
- Conversely, when a team is well under its error budget for multiple cycles, leadership encourages more risk — bigger refactors, faster experiments, more aggressive deploys. Underspent budgets are a signal of conservatism, not health.
The bidirectional discipline matters as much as the gate itself. SRE practitioners coined the phrase "error budgets are not crash budgets, they are use budgets". Most teams treat them as crash budgets.
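The budget arithmetic and the two-sided policy are simple enough to sketch; the thresholds below are policy choices, not standards:

```python
WINDOW_MINUTES = 28 * 24 * 60                 # 40,320 minutes in a 28-day window

def error_budget_minutes(slo: float) -> float:
    # A 99.9% SLO permits 0.1% of the window as downtime: ~40.3 minutes.
    return WINDOW_MINUTES * (1 - slo)

def budget_policy(slo: float, bad_minutes: float) -> str:
    remaining = 1 - bad_minutes / error_budget_minutes(slo)
    if remaining < 0:
        return "over budget: pause feature work until the budget recovers"
    if remaining > 0.75:                      # "underspent" threshold is a policy choice
        return "underspent: take more risk (bigger refactors, faster rollouts)"
    return "healthy: ship normally"

print(round(error_budget_minutes(0.999), 1))  # 40.3
print(budget_policy(0.999, 55.0))             # over budget
```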
3. They have a platform engineering team that ships a platform, not a backlog
Two failure modes dominate Medium-tier "platform" teams:
- The ticket queue. The platform team is a glorified ops queue answering tickets. It scales linearly with the number of services, never gets ahead of demand, and burns out.
- The science project. The platform team builds an over-engineered internal abstraction layer on top of Kubernetes that nobody else uses; the rest of the org bypasses it.
The Elite-tier model:
- The platform team is staffed and run as a product team — with PMs, OKRs, customer (= internal developers) feedback loops, and quarterly roadmaps.
- It ships a coherent internal developer platform — typically: a paved-road service template, deployment pipeline, observability defaults, and SLO instrumentation as a single bundle. New services adopt the bundle on day one.
- Adoption is measurable. The platform team's primary KPI is "% of services on the golden path" — not "tickets closed".
4. They treat security shift-left as an engineering pipeline concern, not a security-team concern
Medium-tier security:
- Security scans run on a schedule. Findings land in a security-team JIRA queue. Tickets are filed against engineering teams. Resolution takes weeks.
- "DevSecOps" is mostly a slide in a management deck.
Elite-tier security:
- SAST runs pre-merge with sensible default rules; criticals block merge (a mechanical sketch of such a gate follows this list). The security team curates the rule set; engineering teams own remediation.
- DAST runs against staging environments; an SBOM is generated for every build artifact; SCA flags vulnerable dependencies in PRs with auto-PR remediation suggestions.
- Runtime detection (eBPF or equivalent) catches process-level anomalies in production; alerts route to the on-call engineer, not to a separate SOC.
- Compliance evidence (SOC 2, PCI, ISO 27001) is generated from CI artifacts — every passing build produces auditable evidence for the relevant controls.
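Mechanically, "criticals block merge" can be as small as a script that parses scanner output and fails a required CI check. A sketch, assuming the scanner emits SARIF; how severity is encoded varies by tool, so the blocking rule and file path here are assumptions:

```python
import json
import sys

BLOCKING_LEVELS = {"error"}   # SARIF 'level'; many scanners also attach a
                              # 'security-severity' property -- adjust per tool.

def blocking_findings(sarif_path):
    # Walk the standard SARIF structure: runs[] -> results[] -> level/ruleId.
    with open(sarif_path) as f:
        sarif = json.load(f)
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            if result.get("level") in BLOCKING_LEVELS:
                findings.append(result.get("ruleId", "unknown-rule"))
    return findings

if __name__ == "__main__":
    blockers = blocking_findings("sast-results.sarif")
    for rule in blockers:
        print(f"BLOCKING: {rule}")
    sys.exit(1 if blockers else 0)   # nonzero exit fails the required CI check
```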
5. They have observability — not monitoring
The distinction is subtle but matters:
- Monitoring answers "is this thing up?" with predefined dashboards.
- Observability answers "why is this thing slow for this customer right now?" with arbitrary structured queries against logs, metrics, and traces.
Elite-tier teams:
- Emit structured logs with rich context (request ID, customer ID, feature flag values, version) on every service; a minimal example of the log shape follows this list.
- Use distributed tracing (OpenTelemetry) by default, with sampling that is high enough to be useful (1-10% in production for high-volume services; 100% for low-volume critical services).
- Build service-level dashboards using SLOs as the top-line view, not raw CPU/memory.
- Most importantly, they spend time on debug ergonomics — i.e., they invest in their own ability to slice and dice telemetry quickly. The team that can answer "is this regression affecting Tier-2 customers more than Tier-1?" in 3 minutes is operating at a different level than the team that needs an hour and a Slack thread.
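A minimal example of the log shape, using only the standard library; the context field names are illustrative, not a standard:

```python
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Context fields ride on the record via `extra=`; the names below
        # are illustrative assumptions, not a schema.
        payload = {
            "ts": time.time(),
            "level": record.levelname,
            "msg": record.getMessage(),
            **getattr(record, "ctx", {}),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("charge succeeded", extra={"ctx": {
    "request_id": str(uuid.uuid4()),
    "customer_id": "cus_1234",
    "feature_flags": {"new_checkout_flow": True},
    "version": "2026.04.2",
}})
# -> one JSON line, queryable by any field: exactly the "slice and dice" property.
```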
6. They run progressive delivery by default
Deploy and release are decoupled:
- Every meaningful production change goes through canary stages: 1% → 10% → 50% → 100%, with auto-rollback wired to SLO-burn detection at each stage (the gate logic is sketched after this list).
- For deeply user-affecting changes, the rollout is cohorted (5% of customers, or all customers in a single region) before being expanded.
- A/B experiments and rollouts share the same infrastructure — a flag system that supports both targeted release and treatment-control comparison.
- The deployment pipeline emits a change record at each stage that links to the SLO-burn data and the feature flag history. Forensics are easy.
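A sketch of the gate logic, not a real controller: the traffic-shifting, burn-rate, and rollback hooks are injected stand-ins for the mesh/ingress, the metrics backend, and the deploy system, and the burn-rate threshold is an assumed policy value.

```python
import time

STAGES = [0.01, 0.10, 0.50, 1.00]     # 1% -> 10% -> 50% -> 100%
MAX_BURN_RATE = 2.0                   # assumed policy value, not a standard

def progressive_rollout(version, route_traffic, burn_rate, rollback,
                        soak_seconds=600):
    """route_traffic(version, fraction) shifts traffic; burn_rate(version)
    returns observed error rate relative to what the SLO allows."""
    for fraction in STAGES:
        route_traffic(version, fraction)      # shift this share of traffic
        time.sleep(soak_seconds)              # soak before judging the stage
        if burn_rate(version) > MAX_BURN_RATE:
            rollback(version)                 # auto-rollback is part of the gate
            return f"rolled back at {fraction:.0%}"
    return "released to 100% of traffic"
```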
7. They invest in chaos engineering — for the easy failures
The popular conception of chaos engineering — "break random things in production!" — is largely a distraction. The Elite-tier practice is more boring:
- Game days are run quarterly and focus on the predictable failure modes (datacenter outage, dependency outage, certificate expiry, IAM credential rotation, regional traffic spike).
- The practice is not "what breaks?" but "what response breaks?" — i.e., when this dependency goes away, do the runbooks work? Does the on-call engineer know what to do?
- Findings drive concrete architecture changes (cache TTL adjustments, circuit breakers, regional failover wiring; a circuit-breaker example follows this list) — not just "we should write a runbook".
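As an illustration of the kind of change a game-day finding turns into, here is a minimal circuit-breaker sketch; the thresholds are illustrative, and production implementations usually come from a resilience library rather than hand-rolled code:

```python
import time

class CircuitBreaker:
    """Fail fast with a fallback instead of hanging on a dead dependency --
    the typical fix for a finding like "checkout hangs when pricing is down"."""
    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.opened_at = None

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback        # open: skip the dependency entirely
            self.opened_at = None      # half-open: let one call probe it
        try:
            result = fn(*args, **kwargs)
            self.failures = 0          # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback
```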
Anti-Patterns That Quietly Cap Mid-Tier Teams
After auditing many engineering operations, a small set of anti-patterns shows up repeatedly. Most of them are local optima — they made sense at some point and were never re-examined.
| Anti-pattern | What it looks like | What it costs |
|---|---|---|
| Release trains | Code is merged daily but only released to prod weekly or biweekly via a "train" | Lead time inflates from hours to weeks; rollback granularity collapses |
| The "QA gate" | A separate QA team certifies releases before they go to prod | Quality becomes someone-else's-problem; engineers under-invest in tests |
| The single staging environment | One shared staging, used for everything from feature testing to release rehearsal | Conflicts; flaky tests; staging looks nothing like production by Friday |
| Shared deploy credentials | All engineers SSH or assume-role with the same prod credentials | Compliance risk; no per-actor audit trail; dangerous blast radius on mistakes |
| The Friday deploy freeze | "We don't deploy on Fridays" because Friday deploys break things | The freeze is a symptom of low confidence — the underlying issue is change failure rate, not the day of the week |
| Long-lived feature branches | Branches live for weeks; merges become "mega-merges" | Merge conflicts at scale; bugs hidden until late; rollback granularity terrible |
| Manual prod-data exports for testing | Engineers periodically copy prod data to staging "for realistic testing" | PII exfiltration risk; staging diverges from prod in unpredictable ways |
| The DevOps team | A separate "DevOps team" owns deploys, the rest of engineering files tickets | Reinvents the dev/ops divide that DevOps was supposed to remove |
| The dashboard mausoleum | 14 Grafana dashboards exist; nobody knows which 2 are authoritative | Alert fatigue; new engineers can't ramp on observability |
| Audit-driven security | Security work happens in the 6 weeks before SOC 2 / ISO renewal | Security debt accumulates linearly; renewals become engineering disasters |
| One golden path that nobody is on | The platform team builds a paved road; existing services are too costly to migrate | Paved road exists in theory only; ROI of platform investment evaporates |
If you recognise three or more of these in your organisation, you are a Medium-tier team — regardless of what your dashboards say.
DORA Gaming: How Mid-Tier Teams Fake Elite Numbers
A surprising number of "Elite-tier" claims in surveys do not survive scrutiny. The common gaming patterns:
- Counting deploy-pipeline runs as deployments. A pipeline that re-deploys identical artifacts 50 times per day is not running 50 deployments — it is exercising a pipeline.
- Measuring lead time from PR-merge instead of from first commit. The interesting work happens before merge.
- Excluding "configuration changes" from change-failure rate. Configuration is a deploy. A misconfigured feature flag that takes the site down is a change failure.
- Measuring MTTR per-rollback rather than per-incident. Several rollbacks within a single incident count as one incident, not many.
- Reporting Elite for the platform team while the rest of the org is Medium. Org-level DORA only matters if it covers all teams shipping production changes.
A defensible self-assessment uses the canonical definitions, captures all teams shipping production changes, and is willing to publish the methodology along with the numbers.
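Two of these gaming patterns are cheap to guard against in the instrumentation itself. A sketch, with assumed record fields and an assumed time-gap clustering rule:

```python
def true_deployments(pipeline_runs):
    # A pipeline re-deploying the same artifact is not a new deployment:
    # count unique artifact digests that reached production.
    return len({run["artifact_digest"] for run in pipeline_runs
                if run["reached_prod"]})

def incidents_from_rollbacks(rollbacks, gap_minutes=60):
    # Several rollbacks in quick succession are one incident, not several.
    # The grouping rule (time-gap clustering) is an assumption, not a standard.
    rollbacks = sorted(rollbacks, key=lambda r: r["at"])
    incidents, last = 0, None
    for r in rollbacks:
        if last is None or (r["at"] - last).total_seconds() > gap_minutes * 60:
            incidents += 1
        last = r["at"]
    return incidents
```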
The Cost-vs-Maturity Curve
A common executive question is "what does it cost to move from L3 to L4?". The honest answer:
- L1 → L2: small (~5-10% of engineering capacity for 6-9 months). Mostly tooling spend and one or two senior hires.
- L2 → L3: medium (~15-20% of engineering capacity for 9-12 months). The expensive part is the cultural shift to "main is always releasable".
- L3 → L4: large (~25-35% of engineering capacity for 12-18 months). This is where you build a real platform engineering function and operationalise error budgets. It also requires VP-level commitment to error-budget gating — the most expensive line item is leadership willingness, not headcount.
- L4 → L5: smaller (~10-15% of capacity for 12 months). Mostly tooling polish, progressive-delivery infrastructure, chaos-engineering practice.
Past L3, the gains are non-linear in the right direction. L4 teams typically deliver ~2× the lead-time-for-changes improvement and ~3× the change-failure-rate reduction for ~2× the investment relative to L2 → L3.
Frequently-Asked Questions
Q: Are these benchmarks credible across industries, or are they skewed toward tech-native companies? A: The 2024 DORA report covers ~36,000 respondents across 100+ countries and is moderately industry-balanced. Regulated industries (financial services, healthcare, government) are over-represented in the Medium tier and under-represented in the Elite tier — not because they cannot be Elite, but because their compliance friction adds 6-18 months to most maturity transitions. The benchmarks themselves remain a reasonable yardstick.
Q: Can a small team (5 engineers) be Elite-tier? A: Yes, and small teams are often closer to Elite than they realise — the structural barriers (release trains, separate QA gates, ticket-queue platform teams) are mostly artifacts of org scale. A five-person team running trunk-based development with feature flags, SLO instrumentation, and pre-merge CI under 10 minutes is meaningfully Elite. The tooling debt is the gating factor, not the headcount.
Q: Our team has Elite deployment frequency but a 25% change failure rate — what does that mean? A: You are deploying often but not safely. The two metrics are coupled in the data — Elite teams have both high frequency and low failure rate because the second enables the first. A high-frequency, high-failure-rate team is a team practicing velocity theatre — running the deploy pipeline aggressively without the underlying test, rollback, and observability discipline. The fix is to invest in change-failure-rate reduction; deployment frequency will follow naturally.
Q: How do we get leadership to take error budgets seriously? A: The pragmatic argument is risk-adjusted velocity, not technical purity. A team consistently spending error budget on incidents is delivering velocity at the expense of customer experience that has not been priced in. Showing the cost of a single severity-1 incident in dollar terms (lost revenue, support cost, reputational cost) — and comparing it to the engineering hours that would have been freed by error-budget-driven reliability work — usually moves the needle faster than principled DevOps arguments.
Q: How does AI-assisted development change DevOps maturity? A: It is a force multiplier in both directions. Teams already operating at L3+ are using AI assistants to compress code review, generate test scaffolding, and accelerate runbook writing — pushing them faster toward L4. Teams at L1-L2 sometimes mistake AI-generated tests and pipelines for real DevOps maturity; AI-generated artifacts that nobody reviews introduce subtle failure modes that quietly raise change failure rate. AI is an accelerator, not a substitute for the underlying disciplines.
Q: What is the ROI of moving from L3 to L4? A: The defensible answer is industry-specific, but the pattern is consistent: ~30-50% reduction in incident-response cost, ~25-40% increase in per-engineer throughput, ~20-30% reduction in mean recovery time. A 100-engineer organisation typically recovers the L3→L4 investment within 12-18 months at a 4-6× ROI thereafter. The harder-to-monetise wins (engineer retention, faster onboarding, reduced burnout) are usually larger than the directly-measurable ones.
Q: How does offshore/distributed team composition affect maturity? A: Net-net, modestly positive — distributed teams are forced to invest in the documentation, SLO discipline, and pipeline reliability that co-located teams sometimes get away without. The lift is in the L1 → L3 range, where async-friendly tooling and process maturity converge with what good DevOps requires anyway. Past L3, geographic distribution is roughly neutral; what matters is the team's tooling and discipline, not where the engineers sit. Our 2026 Global Software Engineering Rate Benchmark breaks down the country-by-country picture.
Q: How long does a realistic L1 → L4 transition take? A: 24-36 months for a 50-150 engineer organisation, assuming consistent leadership commitment. The fastest transition we have observed was 18 months — driven by an Elite-tier VP of Engineering who treated the transition as a top-three company OKR. The slowest was over four years — paralysed by mid-management who alternately committed to and abandoned the discipline every quarter.
Q: What is the single highest-leverage change a Medium-tier team can make? A: Make CI mandatory and non-bypassable. Most other improvements compound on top of that one. The teams that get stuck at Medium are almost always teams where the "important" releases skip CI. Closing that loophole is psychologically hard but mechanically easy — and it changes the rest of the maturity ladder from a slope to a staircase.
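On GitHub, closing that loophole is one API call (or the equivalent settings screen). A sketch using the branch-protection REST API; the org, repo, token, and check names are placeholders:

```python
import requests

OWNER, REPO, TOKEN = "acme", "payments", "ghp_..."   # placeholders

resp = requests.put(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/main/protection",
    headers={"Authorization": f"Bearer {TOKEN}",
             "Accept": "application/vnd.github+json"},
    json={
        "required_status_checks": {
            "strict": True,                       # branch must be up to date with main
            "contexts": ["ci/test", "ci/lint"],   # your pipeline's check names
        },
        "enforce_admins": True,   # the crucial bit: admins cannot bypass "urgent" releases
        "required_pull_request_reviews": {"required_approving_review_count": 1},
        "restrictions": None,
    },
)
resp.raise_for_status()
```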
Methodology and Sources
This benchmark was compiled in April 2026 from the following primary sources:
- DORA Accelerate State of DevOps Report (Google Cloud), 2024 edition — primary source for DORA-metric distribution bands.
- GitHub State of the Octoverse 2025 — for adoption rates of CI/CD, IaC, and progressive delivery practices.
- Puppet State of DevOps Report 2024 and 2025 editions — for platform engineering and security shift-left practice adoption.
- CNCF State of Cloud-Native Development 2025 — for observability and Kubernetes adoption practices.
- Stripe Systems internal engagements (2022-2025) — for the L1-L5 narrative descriptions and the cost-vs-maturity curve, drawn from anonymised assessments across our client base.
If you want to see where your organisation actually sits on this scale, run the 8-dimension self-assessment at /tools/devops-maturity-matrix — it takes about seven minutes and produces a radar profile, prescriptive next-level guidance, and an emailable report.
If you would prefer to discuss your maturity profile with our team, book a conversation — we are happy to walk through the assessment with engineering leaders looking for an outside read.