
Is it really automated?

2022-08-29

This is written in the context of IaC, code deployments, and cloud vendors.

What is automation?

From Wikipedia,

Automation describes a wide range of technologies that reduce human intervention in processes.

Cool. I’ve always been more of an enabler than an interventionist. Sounds like a good time, sign me up. And from the same Wikipedia page:

The main advantages of automation are:

* Increased throughput or productivity
* Improved quality
* Increased predictability
* Improved robustness (consistency), of processes or product
* Increased consistency of output
* Reduced direct human labor costs and expenses
* Reduced cycle time
* Increased accuracy
* Relieving humans of monotonously repetitive work
* Required work in development, deployment, maintenance, and operation of automated processes — often structured as “jobs”
* Increased human freedom to do other things

All good things, all good things

But let’s narrow this down a bit to the things I think about when discussing automation.

  • Increased throughput or productivity
  • Relieving humans of monotonously repetitive work
  • Increased human freedom to do other things

So given the things that I care about, our productivity and throughput should be increased. I shouldn’t have to do the same thing over and over again, and I should be able to go off and solve new problems.

Am I alone in feeling that this is not the current state of DevOps, infrastructure, or software engineering as a whole?

Framing of the problem

Conway’s Law is an interesting lens through which to look at automation tooling. It states:

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.

Looking at a few popular IaC tools:

| tool | original author | author background |
| --- | --- | --- |
| puppet | Luke Kanies | Unix administrator |
| saltstack | Thomas Hatch | data center architect and systems administrator |
| terraform | Mitchell Hashimoto | software developer |
| helm | Deis Labs | tools for enabling software developers to quickly deploy to the cloud |

Puppet and Saltstack tend to be geared towards wiring up many disparate parts into a cohesive application or deployment. While Terraform is built to wire up disparate parts as well, it and Helm are geared towards enabling people, primarily developers, to leverage cloud technologies. Purpose and background are not exactly communication structures. But much like communication structures, they seem to be functions of how each role traditionally operates.

Despite the differences, they share common traits.

  • These tools accept a description of end state
  • From that description a delta between what exists and what must exist is created
  • Actions are taken to resolve said delta

But the specific focus of the tooling naturally drives design decisions. As Terraform and Helm are designed to facilitate rapid cloud deployments, they are naturally built to describe individual applications or stacks. So defining my hand-wavy application foo in terraform could look something like this.

# foo.tf
terraform {
  required_providers {
    bar = {}
    baz = {}
  }
}

provider "bar" { }
provider "baz" { }

resource "bar_thing" "bar" {
}

resource "baz_thing" "baz" {
}

While most use Puppet and Saltstack in a similar manner1, you are able to break things down into their constituent parts and stitch them back together cohesively.2 Which, when you’re in the business of making a lot of changes en masse, is a godsend. The same application foo could look like this in Saltstack.

# Yep, an empty file

Because we’d apply bar and baz to everything and drive it with separate data, if and only if that data was defined.

{% if bar is defined %}
bar_thing:
  bar: bar.options.whatever
{% endif %}
{% if baz is defined %}
baz_thing:
  baz: baz.options.whatever
{% endif %}
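
And “applying bar and baz to everything” can be as blunt as a top file along these lines (a sketch, with the state names assumed from the example above):

# top.sls (sketch): apply the same states to every minion
base:
  '*':
    - bar
    - baz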

We’d tell salt to walk a tree like this, merging the data in what basically amounts to a nested for-loop.

.
├── bar
│   ├── default.yml
│   └── foo
│       └── file.yml
└── baz
    ├── default.yml
    └── foo
        └── file.yml

And at that point you have a really nice and tidy taxonomy; if you need to change something about baz universally, you have a one-stop shop. So rather than having to copy-paste even a small amount of boilerplate and tweak the corresponding bits or feed them in as environment variables at build time, you’re relying on a series of two-way merges to take a default configuration and augment or override specific parts of it.

# ./bar/default.yml
bar:
  options:
    a: 1

# ./bar/foo/file.yml (augments the default)
bar:
  options:
    b: 2

# ./baz/default.yml
baz:
  options:
    a: a

# ./baz/foo/file.yml (overrides the default)
baz:
  options:
    a: b
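
Merged together for application foo, the effective data would look something like this (my sketch of the result, assuming a straightforward recursive merge):

# effective data for foo
bar:
  options:
    a: 1    # from ./bar/default.yml
    b: 2    # augmented by ./bar/foo/file.yml
baz:
  options:
    a: b    # ./baz/foo/file.yml overrides ./baz/default.yml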

Which, by the way, is all the way DRY.

DRY? More like damp.

Whether folks like to organize their taxonomy by application or by its constituent parts, I’ve noticed that I’m not alone: lots of people like the idea of DRY3. There are even tools that set out to address this. Helmfile4 and Terragrunt are prime examples. Terragrunt is particularly interesting because it heavily markets itself as a DRY tool.

Terragrunt lends itself to a hierarchy; in fact, it was designed specifically to cut down on duplicated terraform code. Another interesting bit is that rather than traversing down or across a directory tree, it walks upwards, so you can do something like this.

include {
  path = find_in_parent_folders()
}
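
For example, with a layout like this (paths are hypothetical), the include in the leaf resolves to the terragrunt.hcl at the root:

.
├── terragrunt.hcl          # root / shared configuration
└── foo
    └── bar
        └── terragrunt.hcl  # contains the include block above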

And it’ll walk up the tree looking for a parent terragrunt.hcl file. It’s a little annoying to have to put that include statement in every child terragrunt.hcl, and it’s not really in the spirit of DRY, but I’ll play along.

[modified for clarity]
time=2022-08-25T16:16:07-08:00 level=error msg=/home/matt/example/foo/bar/baz/example.hcl
includes /home/matt/example/terraform/foo/bar/example.hcl, 
which itself includes /home/matt/example/foo/example.hcl. 
Only one level of includes is allowed.

Just kidding, it can’t actually do that. But to be fair there is a commonly used yaml merging workaround, much like my salt example above.

inputs = merge(
  yamldecode(
    file("${get_terragrunt_dir()}/${find_in_parent_folders("foo.yml", local.defaults)}"),
  ),
  yamldecode(
    file("${get_terragrunt_dir()}/${find_in_parent_folders("bar.yml", local.defaults)}"),
  ),
)

But this is not exactly like my example above. There is a subtle difference. Remember how Terragrunt walks up the directory tree? This means that you have to define the yaml you’re going to merge at the bottom of the tree. So rather than being able to drop in the same elements to override configuration, or complements to augment it, you have to put a big ol’ block of boilerplate in every application definition.

Walking up has another subtle difference. Any given node can only have one parent. When walking down, a node can have multiple children, and you are afforded more taxonomical options.

This goes back to how the tools were designed to be used: as a way to lower the barrier to entry for cloud automation, not as a means of centrally managing large swaths of deployments.

Admittedly, Helmfile does an alright job of this by being able to pull in some base configuration and merge input values.

bases:
  - ../../thing1.yml
  - ../thing2.yml
values:
  - ./default/prod.yml
  - ./{{ .Environment.Name }}.yml

But you’re still left with the problem of boilerplate and the additional issue of relative paths.

Event-driven decisions

Due to the agentless nature of these cloud provisioning IaC tools, event-driven changes are impossible. You want something to happen and a modification made on the fly? You can’t express that in these tools. You’re going to have to use additional tooling outside of your IaC. You might be able to describe the additional automation with your IaC, but it will still live outside the tool itself.5

For a contrived example: say you have a service that has metrics you can query, and based on those metrics you want to horizontally scale a database preventively. There are a lot of requests in the queue that will eventually work their way through the system as database read operations. You simply cannot do that without a service that knows how to:

  • check metrics
  • determine that an event has happened
  • match that event with an action
  • generate a delta between the current state and the end state defined by the action
  • take actions to resolve the delta

This looks similar to the list above, with one notable exception. This delta isn’t generated from a declarative configuration, it’s generated from an event. The event handler is generated from a declarative configuration, but the actual action is triggered by a listener. Which naturally has to be running if it’s going to be listening.
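
A hand-wavy sketch of what such a declarative event-handler definition could look like; none of this is any particular tool’s syntax, the names and fields are made up just to show the shape of the thing:

# hypothetical event-handler definition, consumed by an always-on listener
watch:
  metric: queue_depth        # check metrics
  condition: "> 10000"       # determine that an event has happened
on_event:                    # match that event with an action
  scale:
    target: database
    add_replicas: 2          # the listener generates and resolves the delta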

I’m not the only person who’s taken note of the trade-offs one makes with agentless configuration. For example, CI and CD tools such as ArgoCD are always on, listening for changes in a git repo to deploy applications to Kubernetes. It’s a great alternative to wiring up a bunch of weird CI jobs.

But then we separate things

But it’s yet another tool that needs to be configured. And what I’ve seen in practice is that users like to separate their code by framework or tool. So it’s terraform modules in one repo, terraform configuration in a separate one, the application in another, and the helm configuration in yet another. Then if you want to add your helm configuration to ArgoCD, it’s yet another PR. I’m not a proponent of large, all-encompassing mono-repos6, but some amount of grouping can be sensible.

If everything is grouped by framework or tool, rather than by how it’s commonly interacted with, human intervention is almost always required. Let’s say I have to change an application, modifying it to consume a new external service. I’ll potentially have to open four pull requests and merge them in a very specific order.

  • build or modify a terraform module
  • implement a change in terraform
  • modify the application
  • change the helm chart

Sometimes more are required if it is a multi-step change or we have to wait for cloud provisioning to finish.

It’s easy to get the dependency ordering wrong, and all too common to see ‘no-op’ PRs created with the sole purpose of triggering another CI run after something fails. Not to mention it’s completely mundane and repetitive. I’m not saying this workflow is completely broken; the deployments themselves are reproducible, which is great. But calling this automated is disingenuous, as there are a lot of repetitive tasks that could easily go away with a little bit of up-front planning.

But planning is hard to do if you’re working fast and conflate your design and implementation phases, figuring it out as you go. It’s a lot “easier”, well at first anyways, to write a bunch of little things that you can reason about without a clear big picture. But the trade-off is that now you’ve de-coupled everything and will need to shuffle data in and out of your little units7. Which typically winds up being a manual process or some really brittle automation, which is just a manual process in disguise.

And again with automation

If we look back to my personal three items

Increased throughput or productivity

With many cloud provisioning tools, the initial lift, learning curve, cognitive load, or whatever you want to call it is lower on the front end, when you’re getting started. But managing a lot of changes in many locations becomes more difficult to orchestrate.

This might be fine for small outfits or even startups, whose whole business model is to move fast while VCs more-or-less throw bologna at the wall to see what sticks. Many of these companies will never have large-scale infrastructure. They don’t need to fully automate. Or many of the people in that situation won’t be around to deal with the fallout. They’re either software engineers white-knuckling it until an infra or DevOps person is hired, will have moved on to the next role, or the company will no longer exist.

I guess that might reflect why everything is geared towards quick initial lift at the moment.

Relieving humans of monotonously repetitive work

Man, the number of PRs and ticky-tack little changes that have to be opened to accomplish anything at some organizations is astounding. The copy-pasting of boilerplate. Or even worse, tracking down and modifying said boilerplate. All dreadfully boring, all dreadfully common. It’s all too easy to design a workflow that is convoluted.

Increased human freedom to do other things

It’s very common for a DevOps engineer to basically be a glorified operator: help tickets and Slack messages to debug convoluted CI/CD pipelines, coordinating lots of repetitive changes that need to be merged in a specific order, tracking down old boilerplate that needs to be modified, and anything else mentioned above. It’s a time sink, and these tasks aren’t adding value to organizations. Rather than manually orchestrate “automated” processes, these individuals would be better served solving other problems for the business.

But it’s not all fire and brimstone

It wasn’t my intent to come across all doom-and-gloom and bad-mouth a bunch of tools. To clarify: I like these tools. I use them daily. I reach for them when designing new things. In terms of automation as an industry, we’re at the highest point we’ve ever been. More people than ever have access to tools that help them automate tasks. Not to mention there is a network effect of steering everyone towards the same path8. Even if it’s not straightforward to achieve the levels of automation previous tools afforded, it’s not an apples-to-apples comparison. More is possible now.

I just wanted to say that it can be better. Those of us that automated it then are automating it now and we have opinions.

If you have an opinion on any of these tools, think I was completely off-base, have a correction, or would like to highlight additional tooling I would love to hear from you. contact at pallissard dot net


  1. these are how salt’s states or puppet’s manifests are designed ↩︎

  2. Ok, you can actually do this with some of the terraform tools, but it’s not as straightforward. ↩︎

  3. I’d actually been complaining about it for years before I heard the official term “DRY” mentioned. ↩︎

  4. or is it here? It’s been unclear for a while, but it looks like they got it sorted out ↩︎

  5. And for a lot of things, this makes sense. ↩︎

  6. I lied, actually I am, but I’m pragmatic. Sometimes it just doesn’t make sense. ↩︎

  7. Not to mention the fact that if you don’t have a clear picture you’re certainly not enforcing standards or infrastructure requirements with any sort of technical guard-rail. ↩︎

  8. Yeah, this has a lot of downsides as well but I think it’s overwhelmingly a net positive. ↩︎