How to Identify Toil in Your Infrastructure: A Practical Checklist

Your team is losing 10-15 hours a week to work that should be automated. The problem isn’t that nobody cares. It’s that nobody’s counting.
Toil hides in plain sight. It’s the deploy script someone runs manually every afternoon. The alert that fires every Thursday at 3am because the disk fills up. The DNS record change that requires SSH’ing into a box and editing a file by hand. Each task feels small. But add them up across your team and you’re looking at 30-40% of engineering time burned on work that produces zero lasting value.
This article is the checklist. Go through it layer by layer with your team. By the end, you’ll have a scored list of your biggest toil sources and a clear picture of what to fix first.
Want us to run this audit for you? Request a free async audit and we’ll send you a written report with specific findings. No call required.
What Counts as Toil (Quick Refresher)
Google’s SRE team defined toil as work that meets all six of these criteria: it’s manual, repetitive, automatable, tactical (reactive rather than strategic), produces no lasting value, and scales linearly as your services grow.
The key word is “automatable.” On-call response to a novel production issue isn’t toil. Debugging a new memory leak isn’t toil. Those require human judgment. But restarting the same service every time the same OOM alert fires? That’s toil. For the full deep dive on what qualifies, see our complete guide to DevOps toil.
The distinction matters because not everything that feels tedious is toil. Code reviews are tedious sometimes. They’re still engineering work. The checklist below focuses specifically on tasks that are repetitive, manual, and could be handled by automation.
The Toil Identification Checklist
Go through each layer of your infrastructure. For every item that applies to your team, note it. We’ll score them in the next section.
CI/CD Pipeline Toil
- Manual build triggers. Someone clicks “Run” in the CI dashboard instead of builds triggering automatically on push.
- Hand-edited pipeline configs. Updating YAML pipeline files for each new service or environment by copy-pasting from another project.
- Flaky test babysitting. Re-running the test suite 2-3 times until it passes because of intermittent failures nobody has time to fix.
- Manual artifact promotion. Moving builds between staging and production by hand, or manually tagging releases.
- Waiting for builds. Engineers sitting idle (or context-switching) while a 30-minute build runs because the pipeline isn’t parallelized.
Deployment Toil
- Multi-step deploy scripts. Deploys that require running 3-4 commands in sequence, checking output at each step.
- SSH-to-server deployments. Logging into boxes to pull code, restart services, and verify they came up.
- Environment-specific config edits. Changing env vars, database connection strings, or feature flags by hand for each environment.
- Manual rollbacks. When a deploy goes wrong, someone reverts by hand instead of running a single rollback command.
- Deploy coordination. Slack messages like “nobody deploy for the next hour, I’m pushing the billing update.”
If your deploy requires more than git push and a CI pipeline, it’s probably toil.
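One common first step is collapsing a multi-command deploy into a single entry point that stops at the first failure. Here's a minimal sketch in Python; the commands in DEPLOY_STEPS are hypothetical placeholders, not a recommended deploy process:

```python
import subprocess
import sys

# Hypothetical deploy steps -- substitute your own commands.
DEPLOY_STEPS = [
    "git pull --ff-only",
    "docker compose build",
    "docker compose up -d",
]

def run_steps(steps):
    """Run each step in order; stop at the first failure."""
    for step in steps:
        print(f"==> {step}")
        result = subprocess.run(step, shell=True)
        if result.returncode != 0:
            print(f"Step failed: {step}", file=sys.stderr)
            return False
    return True
```

A wrapper like this is a stopgap, not a destination: it removes the "check output at each step" toil today, and the step list becomes your spec when you move the same sequence into a real CI/CD pipeline.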
Infrastructure Provisioning Toil
- Console clicking. Spinning up servers, databases, or networking through the AWS/GCP/Azure console instead of infrastructure as code.
- Manual DNS updates. Adding or editing DNS records through a web interface.
- Hand-managed SSL certificates. Calendar reminders to renew certs. Manually running certbot. Copy-pasting cert files to servers.
- Snowflake environments. Staging doesn’t match production because environments were built by hand at different times. Nobody knows exactly what’s running where.
- No-IaC resources. Critical infrastructure that exists only because someone clicked “Create” in a console six months ago. No Terraform, no Pulumi, no way to recreate it.
Monitoring and Incident Response Toil
- False-positive alerts. Alerts that fire regularly but don’t indicate real problems. The team acknowledges and ignores them.
- Manual log searching. SSH’ing into servers or scrolling through CloudWatch to find what went wrong.
- Recurring known-issue alerts. The same alert fires every week for the same reason. Everyone knows the fix but nobody has automated it.
- Manual escalation. When an alert fires, someone has to figure out who to page and do it manually.
- Status page updates. Manually updating a status page during an incident instead of it updating automatically based on health checks.
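A large share of false positives come from alerting on momentary spikes. One simple fix is requiring the condition to hold for several consecutive checks before firing. A sketch of that pattern (the class name and thresholds are illustrative, not from any particular monitoring tool):

```python
class SustainedAlert:
    """Fire only when a metric breaches its threshold for
    `required` consecutive checks, suppressing one-off spikes."""

    def __init__(self, threshold, required=3):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def check(self, value):
        # Count consecutive breaches; any healthy reading resets the streak.
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.required
```

Most monitoring systems expose this natively (often called "for" duration or evaluation periods), so in practice you'd set a config value rather than write code; the logic is the same either way.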
We ran this checklist with a client, a 15-person SaaS company. They assumed their biggest toil problem was deployments. Turns out it was monitoring. Three engineers were spending six hours a week combined acknowledging and responding to false-positive alerts. Fixing the alert thresholds took an afternoon. That one change saved more time than their planned CI/CD automation project would have.
Security and Compliance Toil
- Manual password and secret rotations. Rotating API keys, database passwords, or tokens by hand on a schedule.
- Hand-run vulnerability scans. Running security scanners manually instead of integrating them into CI.
- Manual access reviews. Quarterly access audits done by pulling user lists and reviewing them in a spreadsheet.
- Unautomated audit evidence. Collecting compliance evidence (logs, configs, access reports) by hand for SOC 2 or ISO audits.
Data and Database Toil
- Manual backups. Running backup scripts by hand or relying on cron jobs that nobody monitors.
- Hand-run migrations. Applying database migrations manually in production instead of as part of the deploy pipeline.
- One-off data fix queries. Running SQL to patch bad data because the application doesn’t handle edge cases.
- Manual reporting extracts. Someone runs the same SQL query every Monday to pull metrics for the weekly standup.
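Backup verification is one of the easiest items on this list to automate: instead of a human checking that last night's dump exists, a scheduled check asserts it. A minimal sketch, assuming a daily backup written to a known path (the freshness and size thresholds here are illustrative):

```python
import os
import time

def backup_is_healthy(path, max_age_hours=26, min_bytes=1024):
    """Return True if the backup file exists, is recent, and is non-trivial.
    Thresholds are illustrative -- tune them to your backup schedule."""
    if not os.path.isfile(path):
        return False
    stat = os.stat(path)
    age_hours = (time.time() - stat.st_mtime) / 3600
    return age_hours <= max_age_hours and stat.st_size >= min_bytes
```

Wire a check like this into your existing alerting and the "did the backup run?" question answers itself; you only hear about it when the answer is no.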
The Toil Scoring System
Now that you’ve identified your toil, you need to prioritize. Not everything is worth automating. A monthly task that takes five minutes is probably cheaper to keep than to build automation around.
Score each toil item on three dimensions:
| Dimension | Score 3 | Score 2 | Score 1 |
|---|---|---|---|
| Frequency | Daily | Weekly | Monthly or less often |
| Duration | >30 minutes | 10-30 minutes | <10 minutes |
| Risk | User-facing impact | Internal disruption | No risk |
Add the three scores. Maximum is 9 (daily, long, user-facing). Minimum is 3 (monthly, short, no risk).
Worked Example
| Task | Frequency | Duration | Risk | Total |
|---|---|---|---|---|
| Manual deploy (3 commands, check output) | Daily (3) | 10-30 min (2) | User-facing (3) | 8 |
| False-positive alerts (acknowledge + dismiss) | Daily (3) | <10 min (1) | Internal (2) | 6 |
| SSL cert renewal | Monthly (1) | 10-30 min (2) | User-facing (3) | 6 |
| Manual DB backup verification | Weekly (2) | <10 min (1) | Internal (2) | 5 |
| Quarterly access review | Monthly (1) | >30 min (3) | No risk (1) | 5 |
Sort by total score. Your top 3-5 items are your automation priorities. In this example, manual deploys score highest because they happen daily and directly affect users.
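If you'd rather not score by hand, the same system fits in a few lines. This sketch reproduces the worked example above; the task list is the one from the table:

```python
# Dimension scores from the scoring table.
FREQUENCY = {"daily": 3, "weekly": 2, "monthly": 1}
DURATION = {"long": 3, "medium": 2, "short": 1}   # >30 min / 10-30 / <10
RISK = {"user-facing": 3, "internal": 2, "none": 1}

def toil_score(frequency, duration, risk):
    """Sum the three dimension scores (range 3..9)."""
    return FREQUENCY[frequency] + DURATION[duration] + RISK[risk]

tasks = [
    ("Manual deploy", "daily", "medium", "user-facing"),
    ("False-positive alerts", "daily", "short", "internal"),
    ("SSL cert renewal", "monthly", "medium", "user-facing"),
    ("DB backup verification", "weekly", "short", "internal"),
    ("Quarterly access review", "monthly", "long", "none"),
]

# Highest score first -- these are your automation priorities.
ranked = sorted(tasks, key=lambda t: toil_score(*t[1:]), reverse=True)
```

Running this puts the manual deploy (8/9) at the top, matching the table.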
Toil vs. Not Toil: 15 Real Examples
The trickiest part of identifying toil is distinguishing it from legitimate engineering work. Here’s a quick reference:
| Task | Toil? | Why |
|---|---|---|
| Running a deploy script manually | Yes | Automatable, repetitive, no judgment needed |
| Debugging a novel production outage | No | Requires investigation and judgment |
| Restarting a service after the same OOM alert | Yes | Same fix every time, scriptable |
| Writing a post-incident review | No | Requires analysis and writing |
| Manually rotating database passwords | Yes | Automatable with secrets management |
| Reviewing a pull request | No | Requires human judgment on code quality |
| Updating DNS records by hand | Yes | Automatable with IaC |
| Capacity planning for next quarter | No | Strategic, requires forecasting |
| Re-running flaky tests until they pass | Yes | The fix is fixing the tests, not re-running |
| Responding to a new type of security alert | No | Requires investigation |
| Copying config files between environments | Yes | Template-able, automatable |
| Architecting a new microservice | No | Creative engineering work |
| Manually checking if backups succeeded | Yes | Automatable with monitoring |
| Mentoring a junior engineer | No | High-value human interaction |
| Manually scaling infrastructure for a traffic spike | Yes | Auto-scaling exists for this |
How to Run a Toil Audit With Your Team
You don’t need a week-long process. A 30-minute team meeting works. Here’s the format:
Before the meeting (5 minutes): Send the checklist above to your team. Ask each person to note which items apply to their week and estimate hours spent per item.
In the meeting (25 minutes):
Go layer by layer (10 min). Walk through CI/CD, deploys, provisioning, monitoring, security, and data. For each layer, ask: “Who’s doing any of these?” Capture every item on a shared doc.
Score each item (5 min). Use the Frequency + Duration + Risk scoring. Don’t debate the scores for long. Rough estimates are fine.
Calculate your toil cost (5 min). Add up the weekly hours across the team. Multiply by your average hourly engineering cost. For a detailed formula, see our guide to calculating the true cost of toil.
Pick your top 3 (5 min). The three highest-scoring items are your immediate automation targets. Assign an owner and a timeframe for each.
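The cost step in the meeting is simple enough to sketch directly. The numbers below are illustrative, not benchmarks: 12 toil hours per week across the team at a $75 blended hourly cost:

```python
def weekly_toil_cost(hours_per_week, hourly_cost):
    """Weekly toil cost across the whole team."""
    return hours_per_week * hourly_cost

def annual_toil_cost(hours_per_week, hourly_cost, weeks=48):
    """Annualized cost, assuming roughly 48 working weeks per year."""
    return weekly_toil_cost(hours_per_week, hourly_cost) * weeks

# Illustrative: 12 team hours/week at a $75/hour blended cost.
example = annual_toil_cost(12, 75)   # 12 * 75 * 48 = $43,200 per year
```

Even rough inputs produce a number concrete enough to justify (or kill) an automation project.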
That’s it. You now have a prioritized toil reduction plan. For a detailed roadmap on how to systematically eliminate what you’ve found, see our toil reduction roadmap.
Want us to run this audit for you? We do this for clients every week. Get a free async audit and we’ll go through your infrastructure, score your toil sources, and send you a written report with the top 3 fixes. Loom walkthrough included. No call required.
What to Do With Your Top 3
Once you’ve identified your highest-impact toil, you have three options:
Automate it. CI/CD pipelines, Terraform, auto-scaling, automated cert renewal, secrets rotation. These are solved problems. If your top toil source has a well-known automation solution, build it.
Outsource it. Some toil is complex enough that building the automation yourself takes months. A fractional DevOps retainer puts senior expertise on it immediately. We’ve seen clients go from “45-minute manual deploys” to “automated CI/CD” in their first month of a retainer.
Eliminate the root cause. Sometimes the best fix isn’t automating the toil. It’s removing the thing that creates it. False-positive alerts? Fix the thresholds. Flaky tests? Fix the tests. Snowflake environments? Rebuild them with IaC so they stay consistent.
The scoring system helps you avoid the common trap of automating everything. A monthly 5-minute task scoring 3/9 isn’t worth a two-week automation project. Focus your energy on the 8s and 9s.
FAQ
What percentage of time should be spent on toil? Google’s SRE teams target a maximum of 50% and average around 33%. For a small team without dedicated ops, we recommend targeting under 25%. Most teams we audit are in the 30-40% range before they start reducing.
Is on-call considered toil? Not automatically. Being on-call is operational work, but it only becomes toil when you’re doing the same repetitive fix for the same recurring issue. Responding to a novel incident requires judgment. Restarting the same crashed service for the third time this week is toil.
How do you track toil over time? Monthly check-ins work best. Pick one day per month to estimate your team’s toil percentage. Track the trend. If it’s going up, you’ve added services without investing in automation to support them.
Should I automate everything I identify as toil? No. Score it first. If a task is monthly, takes five minutes, and has no risk, it’s cheaper to keep than to automate. Automation has its own cost — someone has to build and maintain it. Focus on high-frequency, high-duration tasks that score 6+ on the scoring system.
Start Counting
Most teams are surprised by how much toil they’re carrying. It’s invisible until you sit down and count it. The checklist above takes 30 minutes with your team. The number that comes out of it changes how you think about your engineering spend.
We run toil audits for small engineering teams every week. If you want a senior set of eyes on your infrastructure, request a free async audit. You’ll get a Loom walkthrough, a written report, and specific recommendations for your top 3 toil sources. No call. No pressure. Just the numbers.
Related Articles
The True Cost of Toil: How to Calculate What Manual Ops Is Costing Your Team
Your 10-person engineering team is spending roughly $490,000 a year on work that a script could do. That’s not a guess. That’s the math, and we’ll show you how to run it for your own team in the next five minutes.
Case Study: How We Reduced One Client's Toil by 60%
One of the engineers on the team I’m going to describe was spending roughly three hours a day not engineering. He was SSHing into servers to run deploys, digging through log files via tunnels, triaging alerts that didn’t mean anything, and resetting staging environments that never stayed stable. He was good at his job. He was also quietly looking at job listings.
How to Build a Toil Reduction Roadmap
The DORA 2024 report dropped a finding that should have caused a minor crisis in every engineering org: toil rose to 30% of engineering time, up from 25% the year before. That’s the first increase in five years, and it happened while teams were actively adopting AI tools and automation platforms. More tooling, more toil. Something isn’t working.


