What is Toil in DevOps? (And What It's Actually Costing You)

The deploy script hasn’t changed in eight months. Every Friday afternoon, someone on your team runs it by hand, copying the steps from a Notion doc, pasting commands into the terminal, and watching logs scroll by.

Nobody’s complained. It works. But somewhere in the back of your mind, you know this shouldn’t still be manual.

That’s DevOps toil. And if that scenario sounds familiar, you’ve got more of it than you realize.

The term comes from Google’s SRE (Site Reliability Engineering) team, but what it describes lives in your codebase right now, even if you don’t have a dedicated SRE, a three-person ops team, or a defined reliability function. Toil is the work that never gets better. It’s the task you did last week, will do again next week, and will still be doing manually a year from now if nothing changes.

This article covers what DevOps toil actually is, how to tell it apart from legitimate engineering work, what it’s costing you in real dollars, and how to know if you have a problem worth fixing.

The Official Definition (And Why It Matters for Small Teams)

Google’s SRE Book defines toil as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” The book dedicates an entire chapter to eliminating it. There’s also a follow-up in the SRE Workbook.

The book identifies six specific characteristics of toil:

  1. Manual: requires human action to complete
  2. Repetitive: you’re doing it more than once
  3. Automatable: a machine could do it the same way
  4. Tactical/reactive: it interrupts rather than advances
  5. Devoid of enduring value: completing it today doesn’t prevent it tomorrow
  6. Scales linearly: more infrastructure means more of this work, not less

That definition was written for Google’s operations team, which runs thousands of services at a scale most companies will never approach. But every single characteristic applies to a 10-person startup where the lead developer is handling deployments between feature tickets.

The term is less familiar in small-team circles than in big-company SRE shops, but the concept is exactly the same. If your developers are doing ops on the side, that work is almost certainly toil. (If you’re wondering whether you need a dedicated DevOps hire to fix it, that’s a different question, but it starts here.)

Want someone to audit how much toil is in your setup? It’s free and async, no call required.

Toil vs. Engineering Work: How to Tell the Difference

Not everything repetitive is toil. Some work repeats because it’s genuinely complex and requires judgment each time. Debugging a novel production issue is hard and happens often, but it’s not toil because it requires thinking, not just executing.

The clearest test: would a machine do this the same way? If the answer is yes, it’s probably toil.

| | Toil | Engineering Work |
|---|---|---|
| Who could do it? | Anyone with a checklist | Requires judgment and context |
| What happens next time? | Same thing, again | Problem is solved or system is better |
| Does it scale? | Gets worse as you grow | Multiplies your team’s capacity |
| Example | Manual certificate renewal | Writing the automation that renews certs |

A practical rule: if you’re doing it again this week without it getting meaningfully easier, it’s toil.

This is also how you distinguish toil from technical debt. Technical debt is design or implementation decisions that slow you down later. Toil is operational work that burns time right now, every single week. They’re related (technical debt often generates toil), but they’re not the same thing.

What DevOps Toil Actually Looks Like in a Small Team

Here’s the list we see at almost every small-team client we work with. These aren’t hypothetical scenarios pulled from an SRE textbook. This is the stuff that shows up in a one-week task log.

  • Manual deployments triggered by a Slack message to whoever “knows the process”
  • Restarting services when they crash, instead of fixing the crash or automating the restart
  • Certificate renewals tracked in someone’s personal calendar
  • Hunting logs in CloudWatch every time something breaks: same searches, same filters, same conclusion
  • Manual RDS backups or integrity checks run on a schedule someone has memorized
  • Rotating AWS access keys for developers by hand when someone leaves
  • Adjusting EC2 capacity by logging in and clicking through the console
  • Responding to the same CloudWatch alarm every Tuesday morning
  • Copying environment variables between .env files for each new developer onboarding
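To make one item on that list concrete, here is a minimal sketch of replacing the calendar-tracked certificate renewal with a scheduled check. It uses only the Python standard library. The function names, the 30-day warning window, and the idea of running it from a scheduler are our assumptions for illustration, not a prescribed setup.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    not_after uses the format returned by getpeercert(),
    for example "Jan  1 00:00:00 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(not_after: str, warn_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within warn_days (assumed window)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Run on a schedule, something like `needs_renewal(fetch_not_after("example.com"))` could post to a team channel instead of relying on one person’s calendar. The check only flags the expiry; automating the renewal itself depends on your certificate provider.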

We had a client, a 12-person SaaS team, who asked us to look at their ops setup. We asked them to track every manual task for one week. The list came back with 23 items.

More than half of them had been on someone’s to-do list, or someone’s calendar reminder, for over a year. Nobody had complained about any of them. Toil is invisible until you count it.

What Toil Is Costing You

The number nobody calculates. Here’s the math.

In our experience, the average engineer on a small team spends 25-35% of their time on operational work that could be automated or eliminated. That’s not a guess; it’s what surfaces every time we ask a team to log their tasks for a week.

Run it at your own fully-loaded cost. At $150K per engineer, a 5-person team losing 30% of their time to toil is burning $225K per year on work that shouldn’t be manual.

Across team sizes, the breakdown looks like this:

| Team Size | Loaded Cost/Engineer | Time in Toil | Annual Toil Tax |
|---|---|---|---|
| 3 engineers | $150K | 30% | ~$135K/year |
| 5 engineers | $150K | 30% | ~$225K/year |
| 10 engineers | $150K | 30% | ~$450K/year |
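If you want to rerun that arithmetic with your own numbers, a short sketch follows. The function name `annual_toil_tax` is ours; the inputs are the same ones used in the table ($150K loaded cost, 30% of time in toil).

```python
def annual_toil_tax(engineers: int, loaded_cost: float, toil_fraction: float) -> int:
    """Yearly cost of toil: headcount x loaded cost x share of time lost."""
    return round(engineers * loaded_cost * toil_fraction)


for size in (3, 5, 10):
    print(f"{size} engineers: ${annual_toil_tax(size, 150_000, 0.30):,}/year")
# 3 engineers: $135,000/year
# 5 engineers: $225,000/year
# 10 engineers: $450,000/year
```

Swap in your own loaded cost and the percentage from your one-week task log to get a figure specific to your team.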

That’s what you’re paying to have your developers do work a CI/CD pipeline, a Terraform config, or a cron job could handle for free. A 2025 DuploCloud study found the same pattern at scale: DevOps teams spending ~33% of their time on manual work, adding up to $3.3M/year for a 50-person org. The dollar figure scales down with team size, but the percentage doesn’t.

There’s also a retention cost. Engineers don’t quit over one bad week, but sustained toil is a known burnout driver. When your best developers spend 30% of their time doing manual ops work they hate, they start updating their LinkedIn profiles.

Replacing a mid-level engineer is commonly estimated to cost 50-100% of annual salary. Toil doesn’t just waste time; it creates turnover risk.

The full cost breakdown is here if you want to run the numbers for your specific team size.

How to Know If You Have a Toil Problem

Google’s SRE Book sets a clear threshold: no engineer should spend more than 50% of their time on operational work, including toil. If the number creeps above that, Google recommends escalating to management, not as a personal failure, but as a signal that the system is broken.

That’s a useful ceiling. But 50% is high. If 40% of your engineers’ time is disappearing into toil, you’ve already got a real problem. The threshold is an upper limit, not a goal.
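Checking where you stand takes one division. The sketch below assumes you have a week of logged hours; the 16-of-40 figures are a hypothetical example, and the function names are ours. The 50% default is the SRE Book ceiling cited above.

```python
def toil_share(toil_hours: float, total_hours: float) -> float:
    """Fraction of logged time that went to toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_ceiling(share: float, ceiling: float = 0.50) -> bool:
    """True when the share exceeds the ceiling (50% per the SRE Book)."""
    return share > ceiling


week = {"toil": 16.0, "total": 40.0}  # hypothetical one-week log
share = toil_share(week["toil"], week["total"])
print(f"{share:.0%} of the week went to toil; over ceiling: {over_ceiling(share)}")
# 40% of the week went to toil; over ceiling: False
```

Note what the example shows: 40% passes the ceiling check and is still a real problem. Treat the boolean as an escalation trigger, not a clean bill of health.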

For teams without a formal SRE function, here’s a practical self-assessment. Answer yes or no to each:

  1. Do you have tasks your team does every week that haven’t changed in the last six months?
  2. Could someone follow a document and complete one of those tasks without understanding why they’re doing it?
  3. Would that task still need to be done manually in a year if nothing else changed?
  4. Do the same monitoring alerts fire repeatedly for the same root cause?
  5. Is there one person on your team who “just knows” how to do something critical?

Three or more yes answers means you have a toil problem. Five yes answers means it’s worth treating as a priority, not a backlog item.
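The scoring rule is simple enough to write down. Here is a minimal sketch, assuming answers arrive as five booleans in question order. The shortened question labels and the result strings (including "below threshold" for fewer than three yes answers) are our own shorthand.

```python
from typing import Sequence

QUESTIONS = (
    "Weekly tasks unchanged for six months?",
    "Task completable from a document without understanding?",
    "Still manual in a year if nothing changes?",
    "Same alerts firing for the same root cause?",
    "One person who 'just knows' something critical?",
)


def assess(answers: Sequence[bool]) -> str:
    """Apply the rule above: 3+ yes is a toil problem, 5 yes is a priority."""
    if len(answers) != len(QUESTIONS):
        raise ValueError(f"expected {len(QUESTIONS)} answers")
    yes_count = sum(bool(a) for a in answers)
    if yes_count == len(QUESTIONS):
        return "priority"
    if yes_count >= 3:
        return "toil problem"
    return "below threshold"


print(assess([True, True, False, True, False]))  # toil problem
```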

How to Start Reducing DevOps Toil (Without Boiling the Ocean)

The typical advice is “automate everything.” That’s the wrong place to start.

Some toil isn’t worth automating. If a task takes 20 minutes per month and would require three weeks of engineering time to properly automate, the ROI is negative. Some toil is actually a process problem; the right fix is eliminating the task entirely, not building a script around it. Before writing a single line of automation, triage.

Here’s the framework we use:

Step 1: Log it for a week. Ask everyone on the team to note every manual, repetitive task they complete. Don’t filter. Just write it down with a rough time estimate and how often it happens.

Step 2: Calculate cost vs. fix time. For each item, estimate (hours per month) × (hourly rate) = monthly cost. Then estimate how long it would take to automate or eliminate. Anything with a payback period under three months gets prioritized.

Step 3: Start with the highest-pain, lowest-effort fix. Don’t start with the hardest problem. Start with the one that takes an afternoon to fix and has been annoying your team for a year. Build momentum. One automation win makes the next one easier to justify internally.
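Steps 2 and 3 can be sketched in a few lines. This assumes fix time is valued at the same hourly rate as the toil itself, and it uses payback period as a stand-in for "highest pain, lowest effort." The task names, hours, and the $75/hour rate are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToilItem:
    name: str
    hours_per_month: float  # from the one-week log, scaled to a month
    fix_hours: float        # estimate to automate or eliminate

    def monthly_cost(self, hourly_rate: float) -> float:
        return self.hours_per_month * hourly_rate

    def payback_months(self, hourly_rate: float) -> float:
        """Months until the fix has paid for itself."""
        monthly = self.monthly_cost(hourly_rate)
        if monthly <= 0:
            return float("inf")
        return (self.fix_hours * hourly_rate) / monthly


def prioritize(items: List[ToilItem], hourly_rate: float,
               max_payback_months: float = 3.0) -> List[ToilItem]:
    """Keep items that pay back in under three months, fastest first."""
    quick = [i for i in items if i.payback_months(hourly_rate) < max_payback_months]
    return sorted(quick, key=lambda i: i.payback_months(hourly_rate))


log = [
    ToilItem("manual deploy", hours_per_month=8, fix_hours=16),
    ToilItem("monthly report", hours_per_month=0.5, fix_hours=120),
]
for item in prioritize(log, hourly_rate=75):
    print(item.name, item.payback_months(75))
# manual deploy 2.0
```

The second item never makes the list: half an hour a month against 120 hours of fix time is a 240-month payback, which is the negative-ROI case described earlier. For those, ask whether the task can be eliminated rather than automated.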

A 2026-specific note worth making: AI agents can accelerate toil reduction, but only if you know what you’re reducing first. Teams that add AI automation on top of uncharted manual processes tend to automate the wrong things, or create new toil validating what the AI produced. The toil reduction roadmap gives you the systematic approach; DevOps quick wins covers the fastest places to start.

We send a written report and Loom walkthrough of what we find. No meeting required unless you want one. See what a free audit looks like.

Frequently Asked Questions

What’s the difference between toil and technical debt?

Technical debt is code or architecture that slows down future work: shortcuts taken now that cost you later. Toil is operational work burning time right now, every single week. They’re related: technical debt often generates toil. But toil is a current, measurable drain; technical debt is a future liability.

How does Google SRE define toil?

Google defines toil as work that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth. The SRE Book, Chapter 5, is the canonical source. Google recommends keeping toil below 50% of any engineer’s time.

How much toil is acceptable?

Less than 50% of any engineer’s time, per Google SRE. High-functioning teams typically aim for under 20-25%. If toil is consuming more than a third of your engineering week, treat it as a priority rather than background noise.

Is all repetitive work toil?

No. Some work repeats because it’s complex and requires judgment every time. Incident response during a real outage isn’t toil, because it requires context, diagnosis, and real decision-making. The key question: could a machine do this the same way? If yes, it’s toil. If no, it’s engineering work.

The Bottom Line

If there’s a task on your team that someone does every week, that hasn’t gotten meaningfully easier in the last year, and that could be handed off to a runbook with no loss of quality, that’s toil. It’s not a personal failing, and it’s not unique to your team. We see it at almost every company we audit. What varies is how much it costs and how long it’s been invisible.

Three things you can do this week:

  1. Ask your team to log every manual task for five days. Just time-box it and write it down.
  2. Run the five-question self-assessment above. Three or more yes answers means you have a starting point.
  3. Pick the most painful item on that list and estimate the fix time. If it’s under a week of work, put it on the next sprint.

That’s it. No 12-step transformation plan. No DevOps maturity assessment. Just a list and a decision.

Get Your Free Infrastructure Audit → We’ll look at your setup, record a Loom walkthrough of what we find, and send a written report. Async-first. No call required.
