These days businesses and teams are geared toward cranking out work as quickly as possible. From what I’ve seen, the prevailing sentiment is that time not spent adding features or ticking checkboxes is wasted time. Often this means that:
- POC/spike, design, and implementation phases of a project are conflated.
- Something that mostly works is labeled as good enough.
- Improving internal tooling is deferred.
The most common reasons I’ve heard, in order of increasing commonality:
- They have never seen true end-to-end automation, and therefore don’t recognize its value.
- They are unable to articulate the value of proper design, planning, and execution to the organization as a whole.
- They think they don’t have enough time, or that it’s not worth their time, to properly design and automate a solution.
There is an old analogy that comes in several flavors: a butcher, a woodcutter, or some other tradesperson is busy. Their blade is dull, which slows them down. But because they are moving so slowly, they think there isn’t time to stop and sharpen the blade.
As such, we see people spinning their wheels. And since nobody likes spinning their wheels, we see ~~excuses~~ business justifications for wheel spinning disguised as truisms on an almost daily basis.
20% of the work gets you 80% of the way there. Premature optimization is the root of all evil. And my personal favorite: Minimal. Viable. Product.[^1]
But that logic is flawed.
I’m going to call B.S. on the way most orgs use the term MVP. In essence it’s great: you want to lay out some constraints to avoid over-engineering and wasting time on unneeded features? Cool, that keeps us on task. But, more often than not, it’s used to mark tricky features that aren’t complete as out of scope.
## Super common workflows labeled “good enough”
There are a lot of IaC and CI/CD automation design patterns that get labeled and marketed as good enough. What winds up happening is that each portion of a design is treated as its own little automated unit, rather than the entire process being treated as a whole.
For deployment changes, the chain of PRs might look like this:
- PR for terraform module
- PR for the terraform implementation/deployment (cdktf, terragrunt, or vanilla terraform)
- PR for helm chart
- PR for chart deployment
And for changes to a core library, it might look like this:
- PR for npm/pip package
- PR for downstream code, of which there are often several
- PR for helm chart
- PR for chart deployment
And all of those little steps of opening and closing PRs are hand-waved over since they’re “automated”.[^2]
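One alternative is to model the whole release as a single ordered pipeline, where each step hands its output to the next instead of a human shuttling version numbers between PRs. Here is a minimal sketch of that idea; the step names and the context-passing mechanism are hypothetical stand-ins for whatever your CI system actually provides.

```python
# Sketch: treat the four-PR chain as one pipeline instead of four
# hand-operated PRs. All names here are illustrative, not a real API.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[dict], None]


@dataclass
class Pipeline:
    steps: list[Step] = field(default_factory=list)

    def run(self, context: dict) -> list[str]:
        completed = []
        for step in self.steps:
            step.run(context)  # each step passes data forward via the shared context
            completed.append(step.name)
        return completed


# Each step records what it produced so the next step can consume it,
# instead of a human copy-pasting a version number between PRs.
release = Pipeline([
    Step("publish package", lambda ctx: ctx.update(version="1.4.2")),
    Step("bump downstream deps", lambda ctx: ctx.update(deps_pinned=ctx["version"])),
    Step("update helm chart", lambda ctx: ctx.update(chart_app_version=ctx["version"])),
    Step("deploy chart", lambda ctx: ctx.update(deployed=ctx["chart_app_version"])),
])

ctx: dict = {}
order = release.run(ctx)
```

The point isn’t the toy code; it’s that ordering and data flow live in one place, so nobody has to remember which PR to merge first.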
## Napkin math cost breakdown with conservative numbers
When you start operating automation at scale,[^3] those little bits that were hand-waved over really start to add up.
Let’s pretend we have engineers with:
- The average base salary in Seattle, $150,000[^4]
- A 401k match and insurance that cost the employer $15,000 a year
- And we’ll omit equipment, travel, reimbursements, and the costs of HR and tech support.
That’s a cost of ~$80 an hour to the org.
Assuming each PR approval/merge and subsequent CI run takes 10 minutes (and there are no mistakes), you’re looking at 40 minutes for a change.
If an engineer makes two non-trivial changes a day,[^5] that’s $104. $104 a day just fiddling with VCS/CI web UIs, copy-pasting data, and running CLI commands.
Let’s also pretend that they take several weeks off this year.
$80/hour × 1.3 hours/day × 5 days/week × 45 weeks = $23,400
Since this is napkin math, we’ll round down: $23k in overhead a year, for each engineer. $23k a year that is adding exactly $0 in value to the organization. In fact, that is money going right down the drain.
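For anyone who wants to check the napkin, here’s the same math as a script. All numbers come from the text above, including its down-rounding of 80 minutes to 1.3 hours.

```python
# Napkin math from the post, spelled out. All figures come from the text.
hourly_cost = (150_000 + 15_000) / (52 * 40)  # salary + benefits over a 40 h/week year
# -> roughly $79/hour, which the post rounds to ~$80

minutes_per_pr = 10   # one PR approval/merge plus its CI run
prs_per_change = 4    # the four-PR chain above
changes_per_day = 2
overhead_minutes = minutes_per_pr * prs_per_change * changes_per_day  # 80 minutes
overhead_hours = 1.3  # the post rounds 80 minutes down to 1.3 hours

cost_per_day = 80 * overhead_hours       # $104/day of pure PR shuffling
annual_overhead = cost_per_day * 5 * 45  # 5 days/week, 45 working weeks
print(cost_per_day, annual_overhead)     # 104.0 23400.0
```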
## And that was conservative
Not to mention, steps like this are error-prone because they aren’t tightly coupled.
- The PRs have to be opened and merged in a specific order.
- Data often has to be moved from one step to another by hand. Manual processes.
There are going to be bugs caught by unit tests, and performance regressions, and rollbacks.[^6] Things such as DNS/endpoint names, services, account information, roles, and secrets need to be shuffled around. It’s all too common to see organizations use terraform to deploy infrastructure and then manually copy-paste information into Vault or a values file in helm. Or to set permissions on objects in S3 manually. Or to manually create database or IAM credentials.
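The terraform-to-helm copy-paste in particular is easy to eliminate: `terraform output -json` is machine-readable, and helm accepts JSON values files (JSON being a subset of YAML). A rough sketch, with hypothetical output names; in CI the `raw` string would come from actually running `terraform output -json` rather than being inlined.

```python
# Sketch: feed `terraform output -json` straight into a helm values file
# instead of copy-pasting. The output names below are hypothetical.
import json

# In CI this would be the captured stdout of `terraform output -json`.
raw = """
{
  "db_endpoint":  {"value": "db.internal.example.com", "sensitive": false},
  "service_role": {"value": "arn:aws:iam::123456789012:role/app", "sensitive": false}
}
"""

# terraform wraps each output in {"value": ..., "sensitive": ...}; unwrap it.
outputs = {name: data["value"] for name, data in json.loads(raw).items()}

# helm accepts JSON values files, since valid JSON is valid YAML.
values = {"externalServices": outputs}
with open("values.generated.json", "w") as f:
    json.dump(values, f, indent=2)
```

From there it’s `helm upgrade --install app ./chart -f values.generated.json`, and the endpoint names and roles can never drift from what terraform actually created.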
If you are a DevOps engineer, work with 6 other engineers in your organization, and automate all of the ticky-tack little steps that take them 10 to 15 minutes, several times a day:
Congratulations, you improved six people’s day-to-day lives, and it is as if the org now has additional head count without the added cost or the time investment of onboarding and ramp-up.
[^1]: Not that these expressions don’t have a use. They’re just applied incorrectly.

[^2]: Which is so strange, because there are often really straightforward ways to tie that all together.

[^3]: And I’m not just talking BigCo web scale. You can have a small product and support and build tooling for a lot of teams. Scaling is not necessarily tied to your customer base. And conversely, you can have a product that is hammered hard but requires next to no infrastructure automation. Size != complexity of a given problem domain.

[^4]: In 2022.

[^5]: Because if they’re only doing a bunch of little trivial things, why don’t they have the time to automate it all in the first place?

[^6]: And if it’s not 100% automated, rollbacks and hot-fixes can be awfully unforgiving.