
Testing Terraform The Right Way
Why should you test Terraform code?
As some may know, I work at a fairly large company that uses the cloud, and therefore Terraform, to provide different services that our software can rely on. One simple example would be provisioning a PostgreSQL database for a Spring Boot app.
Since this happens quite frequently, we also have Terraform modules that are built internally and tackle the most common configurations and challenges we need to solve. One of those is to provide or use IAM roles that can be assumed from within our network and are the only way to connect to such a database.
We struggled to have stable versions of this tooling internally because the team that provided those modules wasn’t using them! I know, I know, that’s a problem in itself. However, it caused trouble internally because features that were built weren’t applied before being tagged as stable releases. So there were several cases where teams weren’t able to use the latest versions, because the configuration of those modules simply wasn’t applicable.
Let’s solve this!
What kinds of Terraform code?
To start off, we need some oversight: which types of modules exist, and how would we test whether they work or not? I’m going to begin by differentiating between the types of modules I’ve seen most often.
I differentiate between two module types:
- “Configuration” modules
- “Provisioning” modules
Configuration modules
These provide information that many teams may need (IP addresses, VPC IDs, CIDRs, etc.). They don’t use the `resource` keyword anywhere in their definition and live strictly for their outputs. Any kind of information about existing resources can be looked up via the `data` keyword in Terraform.
These modules are the easiest and fastest to test, as they don’t really do much: in the background they only perform some GET requests and deserialize the responses. Most failing tests will be due to your CI not having access to your cloud of choice.
So what would a simple configuration module look like?
data "vpc" {
name = "myOwnVpc"
}
output "vpc_id" {
value = data.vpc.id
description = "The standard vpc used for all centrally provided services."
}
Believe it or not, this could be an entire configuration module. Anyone who needs the VPC ID can just use this module and rely on it to work, since it’s provided by the same people who created and maintain this VPC. If it ever breaks, you can update to the latest version and expect it to work again.
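To illustrate the consumer side (the module source below is only a placeholder, and the security group merely stands in for anything that needs a VPC ID), using the output is a one-liner:
# Hypothetical consumer: pull in the configuration module and reuse its output
module "network" {
  source = "github.com/sironheart/name-of-config-module.tf" # placeholder source, not a real module path
}

resource "aws_security_group" "app" {
  name   = "example-app"
  vpc_id = module.network.vpc_id # the VPC ID exposed by the configuration module
}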
Provisioning modules
These are a bit more complex to create but also less difficult to grasp, since this is basically what you’re doing when using Terraform anyway. The only difference is that you wouldn’t normally `apply` them yourself, but rather provide settings to configure them during their creation.
These modules are more difficult to test, since you’re going to create/destroy resources.
As an example, let’s say we want to provide an AWS S3 bucket, but every bucket must have tags applied to it, since you want to know which bucket belongs to which service.
It would look something like this:
variable "service_name" {
type = string
description = "The name of the service"
}
variable "tags" {
type = map(string)
description = "A map of tags we use for the service"
}
resource "aws_s3_bucket" "provisioned_bucket" {
bucket = "${var.service_name}-bucket"
tags = merge(
var.tags,
{
"service" = var.service_name
}
)
}
Now, to use this module, all we need to do is include it in our service’s Terraform code.
module "s3_bucket" {
source = "github.com/sironheart/name-of-module.tf"
service_name = "example-app"
tags = {
"owner" = "sironheart"
"organization" = "altf4llc"
}
}
How do you test these modules?
After multiple discussions, we decided to start with smoke testing, simply due to time constraints. We were already relying on the usual suspects of `terraform validate` and `terraform fmt -check -diff`, but those weren’t of any help for the problem we were facing.
We were having problems during the `apply` process due to restrictions on resources that were not defined in the module. For example, when one of our users tried to use a Redis module, it wasn’t able to start and/or provide the cluster because of conflicting configurations in our definition.
When running the `apply`, AWS wouldn’t allow the configuration that we passed. In other words, it was an issue of the module, not of the user of the module.
What is smoke testing?
Smoke testing refers to two older practices that both made sure the tested object was ready for further testing.
In plumbing, the term ‘smoke test’ was used to check for cracks, leaks or breakages in the pipes. Once fluid showed up on the outside of the welds, you knew something had gone wrong.
In electrical wiring, it refers to actual smoke coming from systems that aren’t wired properly. One example would be a smoking fuse due to too high a voltage or a short circuit.
In both cases it’s pretty clear: if you fail this test, you can’t test any further. Translated to our situation, that means our test fails when an `apply` or `plan` fails.
How do you set up smoke tests?
Let’s start by implementing a test module for our code. It could look something like this:
# file: tests/main.tf
module "testing_module" {
  source = "../src"
}
Well, that was… easy. Pretty straightforward, no complex logic, and it’s easily applicable to almost any module!
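One caveat: if the module under test declares required variables, as the S3 example above does, the test module has to provide values for them. A minimal sketch for that case, with made-up dummy inputs:
# file: tests/main.tf
module "testing_module" {
  source = "../src"

  # Dummy values for the module's required variables; anything plausible will do for a smoke test
  service_name = "smoke-test"
  tags = {
    "owner" = "platform-team" # made-up value
  }
}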
Let’s continue by automating the testing part! As I am using GitLab, we’re going to use its CI system for this example.
# file: .gitlab-ci.yml
default:
  image: ghcr.io/opentofu/opentofu:1.9.0 # We'll be using the opentofu image for all our jobs

stages:
  - smoke-test # Custom stage names must be declared, otherwise GitLab rejects the jobs below

tf:smoke:init:
  stage: smoke-test
  needs: [] # Let's not be blocked by previous steps, such as static analysis, formatting, etc.
  rules:
    - if: $CI_MERGE_REQUEST_ID # Only run on merge requests
  script:
    - tofu -chdir=tests init
  artifacts: # Files we want to share across our pipeline
    access: "developer"
    paths:
      - tests/.terraform
      - tests/.terraform.lock.hcl

tf:smoke:plan:
  stage: smoke-test
  needs: [ tf:smoke:init ] # We need init to have run before we can plan
  rules:
    - if: $CI_MERGE_REQUEST_ID
  script:
    - tofu -chdir=tests plan -out=plan # We save the plan so that later jobs always apply the same changes
  artifacts:
    access: "developer"
    paths:
      - tests/plan

tf:smoke:apply:
  stage: smoke-test
  needs: [ tf:smoke:init, tf:smoke:plan ] # We rely on the plan being made
  allow_failure: true # We always want to destroy our changes, so jobs after the apply must still run
  rules:
    - if: $CI_MERGE_REQUEST_ID
  script:
    - tofu -chdir=tests apply -auto-approve plan # Apply the planned changes from the previous step
  artifacts:
    access: "developer"
    when: always # Upload the state even if the apply failed, so the destroy job can still clean up
    paths:
      - tests/terraform.tfstate

tf:smoke:destroy:
  stage: smoke-test
  needs: [ tf:smoke:init, tf:smoke:plan, tf:smoke:apply ] # Destroy needs to run last, so it depends on all previous jobs
  rules:
    - if: $CI_MERGE_REQUEST_ID
  script:
    - tofu -chdir=tests destroy -auto-approve # Destroy the changes from the previous steps
Let’s quickly check what this pipeline does, step by step:
- `tf:smoke:init` initializes the test module and shares the provider and lock files with the later jobs.
- `tf:smoke:plan` creates a saved plan, so that every later job applies exactly the same changes.
- `tf:smoke:apply` applies that plan; it is allowed to fail so that the cleanup still runs.
- `tf:smoke:destroy` tears everything down again, whether the apply succeeded or not.
What value does this bring?
When we can `apply` our module, we can be sure that others will be able to `apply` it as well! There are no show-stoppers left that would make a user’s project fail without any way to deal with it. This also earns the providing team a good bit of credibility, because they test their modules, and the errors that still occur will be due to configuration on the user’s side.
In the past, you couldn’t even be sure the module would apply at all. From here on out, you can add additional checks!
Where do you go from here?
Since we now know that our module is applicable, we can actually start testing it properly! On my team, we started doing real checks. Some examples are:
- Can your CI access the service?
- Can you execute database migrations with the provided database?
- Can you run CRUD operations against the system?
We use the provisioned services for this, since they aren’t relevant for prod and will be destroyed anyway. We can just add another step between the `apply` and `destroy` steps.
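As a sketch, such a check could simply be another job wired in between the two (the destroy job’s `needs` list would then also include it). The aws CLI call below is purely illustrative and assumes the dummy service name from the test module sketch above:
tf:smoke:check:
  stage: smoke-test
  image:
    name: amazon/aws-cli:latest
    entrypoint: [""] # This image's entrypoint is the aws binary itself, so reset it for CI scripts
  needs: [ tf:smoke:apply ] # Run right after the apply
  allow_failure: true # The destroy job must still run if this check fails
  rules:
    - if: $CI_MERGE_REQUEST_ID
  script:
    # Illustrative check: can we actually reach the bucket the module just provisioned?
    - aws s3api head-bucket --bucket "smoke-test-bucket"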
Another option would be to use the official testing framework that Terraform provides (OpenTofu ships the equivalent `tofu test` command).
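For reference, here is a minimal sketch of what that could look like for the S3 module from earlier (the file name and assertion are made up, and the test has to be run from the module’s own directory, since the framework targets the configuration it is invoked in). Both `terraform test` and `tofu test` pick up `*.tftest.hcl` files automatically:
# file: tests/bucket.tftest.hcl
run "bucket_gets_service_tag" {
  command = plan # switch to `command = apply` to provision real infrastructure during the test

  variables {
    service_name = "example-app"
    tags = {
      "owner" = "sironheart"
    }
  }

  assert {
    condition     = aws_s3_bucket.provisioned_bucket.tags["service"] == "example-app"
    error_message = "The bucket must be tagged with the name of the service"
  }
}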
Does this help solve your problem?
For my team, this helped a lot! We can now publish our internal modules with confidence and know that we did not screw up at the first hurdle! The teams that use our modules update more often, using tools such as Renovate Bot, and are happier, since we’re able to provide new features and configurations more frequently and in a safer fashion.
We also have a lot more confidence when we take over responsibility for modules from other teams. We can build tests within minutes that give us a safety net in these unfamiliar modules and start refactoring.
Of course, there are limitations. As with all tests, it might happen that we did not test configuration options that don’t work together. But with the smoke tests, we can easily make sure those things don’t happen again and give the teams examples of how to use our module correctly.
Current issues
At the time of writing, we know of three problems in our setup:
- Colliding applies: It can happen that your tests ‘collide’ across pipelines. This usually happens when the `apply` is (partly) done, but the `destroy` has not happened yet. This may cause pipelines to fail even though the code is fine.
- Destructive changes: We do not know how the changes behave compared to already provisioned infrastructure. There may be changes that cause your infrastructure to be destroyed before being recreated. An example of such a destructive action would be a major version upgrade of a database.
- Cost impacts: As we provision real infrastructure, the cloud bills us for real infrastructure. At our current scale of modules this amounts to mere cents per month, but we haven’t migrated all Terraform modules over to this kind of testing yet. You really need to consider whether you want to use this test methodology, since some cloud services have an upfront cost with little ongoing cost. One example would be domain zones, which are billed per month or year: you probably should not order random domain names on every merge request.