How to Identify Toil in Your Infrastructure: A Practical Checklist

Your team is losing 10-15 hours a week to work that should be automated. The problem isn’t that nobody cares. It’s that nobody’s counting.
Toil hides in plain sight. It’s the deploy script someone runs manually every afternoon. The alert that fires every Thursday at 3am because the disk fills up. The DNS record change that requires SSH’ing into a box and editing a file by hand. Each task feels small. But add them up across your team and you’re looking at 30-40% of engineering time burned on work that produces zero lasting value.
This article is the checklist. Go through it layer by layer with your team. By the end, you’ll have a scored list of your biggest toil sources and a clear picture of what to fix first.
Want us to run this audit for you? Request a free async audit and we’ll send you a written report with specific findings. No call required.
What Counts as Toil (Quick Refresher)
Google’s SRE team defined toil as work that meets all six of these criteria: it’s manual, repetitive, automatable, tactical (reactive rather than strategic), produces no lasting value, and scales linearly as your services grow.
The key word is “automatable.” On-call response to a novel production issue isn’t toil. Debugging a new memory leak isn’t toil. Those require human judgment. But restarting the same service every time the same OOM alert fires? That’s toil. For the full deep dive on what qualifies, see our complete guide to DevOps toil.
The distinction matters because not everything that feels tedious is toil. Code reviews are tedious sometimes. They’re still engineering work. The checklist below focuses specifically on tasks that are repetitive, manual, and could be handled by automation.
The Toil Identification Checklist
Go through each layer of your infrastructure. For every item that applies to your team, note it. We’ll score them in the next section.
CI/CD Pipeline Toil
- Manual build triggers. Someone clicks “Run” in the CI dashboard instead of builds triggering automatically on push.
- Hand-edited pipeline configs. Updating YAML pipeline files for each new service or environment by copy-pasting from another project.
- Flaky test babysitting. Re-running the test suite 2-3 times until it passes because of intermittent failures nobody has time to fix.
- Manual artifact promotion. Moving builds between staging and production by hand, or manually tagging releases.
- Waiting for builds. Engineers sitting idle (or context-switching) while a 30-minute build runs because the pipeline isn’t parallelized.
Deployment Toil
- Multi-step deploy scripts. Deploys that require running 3-4 commands in sequence, checking output at each step.
- SSH-to-server deployments. Logging into boxes to pull code, restart services, and verify they came up.
- Environment-specific config edits. Changing env vars, database connection strings, or feature flags by hand for each environment.
- Manual rollbacks. When a deploy goes wrong, someone reverts by hand instead of running a single rollback command.
- Deploy coordination. Slack messages like “nobody deploy for the next hour, I’m pushing the billing update.”
If your deploy requires more than git push and a CI pipeline, it’s probably toil.
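One common first step is collapsing a multi-command deploy into a single entry point that stops at the first failure. Here's a minimal sketch in Python; the commands in DEPLOY_STEPS are hypothetical placeholders, not a recommended deploy process:

```python
import subprocess
import sys

# Hypothetical deploy steps -- substitute your own commands.
DEPLOY_STEPS = [
    "git pull --ff-only",
    "docker compose build",
    "docker compose up -d",
]

def run_steps(steps):
    """Run each step in order; stop at the first failure."""
    for step in steps:
        print(f"==> {step}")
        result = subprocess.run(step, shell=True)
        if result.returncode != 0:
            print(f"Step failed: {step}", file=sys.stderr)
            return False
    return True
```

A wrapper like this is a stopgap, not a destination: it removes the "check output at each step" toil today, and the step list becomes your spec when you move the same sequence into a real CI/CD pipeline.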
Infrastructure Provisioning Toil
- Console clicking. Spinning up servers, databases, or networking through the AWS/GCP/Azure console instead of infrastructure as code.
- Manual DNS updates. Adding or editing DNS records through a web interface.
- Hand-managed SSL certificates. Calendar reminders to renew certs. Manually running certbot. Copy-pasting cert files to servers.
- Snowflake environments. Staging doesn’t match production because environments were built by hand at different times. Nobody knows exactly what’s running where.
- No-IaC resources. Critical infrastructure that exists only because someone clicked “Create” in a console six months ago. No Terraform, no Pulumi, no way to recreate it.
Monitoring and Incident Response Toil
- False-positive alerts. Alerts that fire regularly but don’t indicate real problems. The team acknowledges and ignores them.
- Manual log searching. SSH’ing into servers or scrolling through CloudWatch to find what went wrong.
- Recurring known-issue alerts. The same alert fires every week for the same reason. Everyone knows the fix but nobody has automated it.
- Manual escalation. When an alert fires, someone has to figure out who to page and do it manually.
- Status page updates. Manually updating a status page during an incident instead of it updating automatically based on health checks.
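A large share of false positives come from alerting on momentary spikes. One simple fix is requiring the condition to hold for several consecutive checks before firing. A sketch of that pattern (the class name and thresholds are illustrative, not from any particular monitoring tool):

```python
class SustainedAlert:
    """Fire only when a metric breaches its threshold for
    `required` consecutive checks, suppressing one-off spikes."""

    def __init__(self, threshold, required=3):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def check(self, value):
        # Count consecutive breaches; any healthy reading resets the streak.
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.required
```

Most monitoring systems expose this natively (often called "for" duration or evaluation periods), so in practice you'd set a config value rather than write code; the logic is the same either way.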
We ran this checklist with a client, a 15-person SaaS company. They assumed their biggest toil problem was deployments. Turns out it was monitoring. Three engineers were spending six hours a week combined acknowledging and responding to false-positive alerts. Fixing the alert thresholds took an afternoon. That one change saved more time than their planned CI/CD automation project would have.
Security and Compliance Toil
- Manual password and secret rotations. Rotating API keys, database passwords, or tokens by hand on a schedule.
- Hand-run vulnerability scans. Running security scanners manually instead of integrating them into CI.
- Manual access reviews. Quarterly access audits done by pulling user lists and reviewing them in a spreadsheet.
- Unautomated audit evidence. Collecting compliance evidence (logs, configs, access reports) by hand for SOC 2 or ISO audits.
Data and Database Toil
- Manual backups. Running backup scripts by hand or relying on cron jobs that nobody monitors.
- Hand-run migrations. Applying database migrations manually in production instead of as part of the deploy pipeline.
- One-off data fix queries. Running SQL to patch bad data because the application doesn’t handle edge cases.
- Manual reporting extracts. Someone runs the same SQL query every Monday to pull metrics for the weekly standup.
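Backup verification is one of the easiest items on this list to automate: instead of a human checking that last night's dump exists, a scheduled check asserts it. A minimal sketch, assuming a daily backup written to a known path (the freshness and size thresholds here are illustrative):

```python
import os
import time

def backup_is_healthy(path, max_age_hours=26, min_bytes=1024):
    """Return True if the backup file exists, is recent, and is non-trivial.
    Thresholds are illustrative -- tune them to your backup schedule."""
    if not os.path.isfile(path):
        return False
    stat = os.stat(path)
    age_hours = (time.time() - stat.st_mtime) / 3600
    return age_hours <= max_age_hours and stat.st_size >= min_bytes
```

Wire a check like this into your existing alerting and the "did the backup run?" question answers itself; you only hear about it when the answer is no.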
The Toil Scoring System
Now that you’ve identified your toil, you need to prioritize. Not everything is worth automating. A monthly task that takes five minutes is probably cheaper to keep than to build automation around.
Score each toil item on three dimensions:
| Dimension | Score 3 | Score 2 | Score 1 |
|---|---|---|---|
| Frequency | Daily | Weekly | Monthly or less often |
| Duration | >30 minutes | 10-30 minutes | <10 minutes |
| Risk | User-facing impact | Internal disruption | No risk |
Add the three scores. Maximum is 9 (daily, long, user-facing). Minimum is 3 (monthly, short, no risk).
Worked Example
| Task | Frequency | Duration | Risk | Total |
|---|---|---|---|---|
| Manual deploy (3 commands, check output) | Daily (3) | 10-30 min (2) | User-facing (3) | 8 |
| False-positive alerts (acknowledge + dismiss) | Daily (3) | <10 min (1) | Internal (2) | 6 |
| SSL cert renewal | Monthly (1) | 10-30 min (2) | User-facing (3) | 6 |
| Manual DB backup verification | Weekly (2) | <10 min (1) | Internal (2) | 5 |
| Quarterly access review | Monthly (1) | >30 min (3) | No risk (1) | 5 |
Sort by total score. Your top 3-5 items are your automation priorities. In this example, manual deploys score highest because they happen daily and directly affect users.
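If you'd rather not score by hand, the same system fits in a few lines. This sketch reproduces the worked example above; the task list is the one from the table:

```python
# Dimension scores from the scoring table.
FREQUENCY = {"daily": 3, "weekly": 2, "monthly": 1}
DURATION = {"long": 3, "medium": 2, "short": 1}   # >30 min / 10-30 / <10
RISK = {"user-facing": 3, "internal": 2, "none": 1}

def toil_score(frequency, duration, risk):
    """Sum the three dimension scores (range 3..9)."""
    return FREQUENCY[frequency] + DURATION[duration] + RISK[risk]

tasks = [
    ("Manual deploy", "daily", "medium", "user-facing"),
    ("False-positive alerts", "daily", "short", "internal"),
    ("SSL cert renewal", "monthly", "medium", "user-facing"),
    ("DB backup verification", "weekly", "short", "internal"),
    ("Quarterly access review", "monthly", "long", "none"),
]

# Highest score first -- these are your automation priorities.
ranked = sorted(tasks, key=lambda t: toil_score(*t[1:]), reverse=True)
```

Running this puts the manual deploy (8/9) at the top, matching the table.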
Toil vs. Not Toil: 15 Real Examples
The trickiest part of identifying toil is distinguishing it from legitimate engineering work. Here’s a quick reference:
| Task | Toil? | Why |
|---|---|---|
| Running a deploy script manually | Yes | Automatable, repetitive, no judgment needed |
| Debugging a novel production outage | No | Requires investigation and judgment |
| Restarting a service after the same OOM alert | Yes | Same fix every time, scriptable |
| Writing a post-incident review | No | Requires analysis and writing |
| Manually rotating database passwords | Yes | Automatable with secrets management |
| Reviewing a pull request | No | Requires human judgment on code quality |
| Updating DNS records by hand | Yes | Automatable with IaC |
| Capacity planning for next quarter | No | Strategic, requires forecasting |
| Re-running flaky tests until they pass | Yes | The fix is fixing the tests, not re-running |
| Responding to a new type of security alert | No | Requires investigation |
| Copying config files between environments | Yes | Template-able, automatable |
| Architecting a new microservice | No | Creative engineering work |
| Manually checking if backups succeeded | Yes | Automatable with monitoring |
| Mentoring a junior engineer | No | High-value human interaction |
| Manually scaling infrastructure for a traffic spike | Yes | Auto-scaling exists for this |
How to Run a Toil Audit With Your Team
You don’t need a week-long process. A 30-minute team meeting works. Here’s the format:
Before the meeting (5 minutes): Send the checklist above to your team. Ask each person to note which items apply to their week and estimate hours spent per item.
In the meeting (25 minutes):
Go layer by layer (10 min). Walk through CI/CD, deploys, provisioning, monitoring, security, and data. For each layer, ask: “Who’s doing any of these?” Capture every item on a shared doc.
Score each item (5 min). Use the Frequency + Duration + Risk scoring. Don’t debate the scores for long. Rough estimates are fine.
Calculate your toil cost (5 min). Add up the weekly hours across the team. Multiply by your average hourly engineering cost. For a detailed formula, see our guide to calculating the true cost of toil.
Pick your top 3 (5 min). The three highest-scoring items are your immediate automation targets. Assign an owner and a timeframe for each.
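The cost step in the meeting is simple enough to sketch directly. The numbers below are illustrative, not benchmarks: 12 toil hours per week across the team at a $75 blended hourly cost:

```python
def weekly_toil_cost(hours_per_week, hourly_cost):
    """Weekly toil cost across the whole team."""
    return hours_per_week * hourly_cost

def annual_toil_cost(hours_per_week, hourly_cost, weeks=48):
    """Annualized cost, assuming roughly 48 working weeks per year."""
    return weekly_toil_cost(hours_per_week, hourly_cost) * weeks

# Illustrative: 12 team hours/week at a $75/hour blended cost.
example = annual_toil_cost(12, 75)   # 12 * 75 * 48 = $43,200 per year
```

Even rough inputs produce a number concrete enough to justify (or kill) an automation project.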
That’s it. You now have a prioritized toil reduction plan. For a detailed roadmap on how to systematically eliminate what you’ve found, see our toil reduction roadmap.
Want us to run this audit for you? We do this for clients every week. Get a free async audit and we’ll go through your infrastructure, score your toil sources, and send you a written report with the top 3 fixes. Loom walkthrough included. No call required.
What to Do With Your Top 3
Once you’ve identified your highest-impact toil, you have three options:
Automate it. CI/CD pipelines, Terraform, auto-scaling, automated cert renewal, secrets rotation. These are solved problems. If your top toil source has a well-known automation solution, build it.
Outsource it. Some toil is complex enough that building the automation yourself takes months. A fractional DevOps retainer puts senior expertise on it immediately. We’ve seen clients go from “45-minute manual deploys” to “automated CI/CD” in their first month of a retainer.
Eliminate the root cause. Sometimes the best fix isn’t automating the toil. It’s removing the thing that creates it. False-positive alerts? Fix the thresholds. Flaky tests? Fix the tests. Snowflake environments? Rebuild them with IaC so they stay consistent.
The scoring system helps you avoid the common trap of automating everything. A monthly 5-minute task scoring 3/9 isn’t worth a two-week automation project. Focus your energy on the 8s and 9s.
FAQ
What percentage of time should be spent on toil? Google’s SRE teams target a maximum of 50% and average around 33%. For a small team without dedicated ops, we recommend targeting under 25%. Most teams we audit are in the 30-40% range before they start reducing.
Is on-call considered toil? Not automatically. Being on-call is operational work, but it only becomes toil when you’re doing the same repetitive fix for the same recurring issue. Responding to a novel incident requires judgment. Restarting the same crashed service for the third time this week is toil.
How do you track toil over time? Monthly check-ins work best. Pick one day per month to estimate your team’s toil percentage. Track the trend. If it’s going up, you’ve added services without investing in automation to support them.
Should I automate everything I identify as toil? No. Score it first. If a task is monthly, takes five minutes, and has no risk, it’s cheaper to keep than to automate. Automation has its own cost — someone has to build and maintain it. Focus on high-frequency, high-duration tasks that score 6+ on the scoring system.
Start Counting
Most teams are surprised by how much toil they’re carrying. It’s invisible until you sit down and count it. The checklist above takes 30 minutes with your team. The number that comes out of it changes how you think about your engineering spend.
We run toil audits for small engineering teams every week. If you want a senior set of eyes on your infrastructure, request a free async audit. You’ll get a Loom walkthrough, a written report, and specific recommendations for your top 3 toil sources. No call. No pressure. Just the numbers.
Related Articles
The True Cost of Toil: How to Calculate What Manual Ops Is Costing Your Team
Your 10-person engineering team is spending roughly $490,000 a year on work that a script could do. That’s not a guess. That’s the math, and we’ll show you how to run it for your own team in the next five minutes.
Case Study: How We Reduced One Client's Toil by 60%
One of the engineers on the team I’m going to describe was spending roughly three hours a day not engineering. He was SSHing into servers to run deploys, digging through log files via tunnels, triaging alerts that didn’t mean anything, and resetting staging environments that never stayed stable. He was good at his job. He was also quietly looking at job listings.
How to Build a Toil Reduction Roadmap
The DORA 2024 report dropped a finding that should have caused a minor crisis in every engineering org: toil rose to 30% of engineering time, up from 25% the year before. That’s the first increase in five years, and it happened while teams were actively adopting AI tools and automation platforms. More tooling, more toil. Something isn’t working.


