Well-Architected: Banishing the Curse of ClickOps

Written by Alex Kearns

Introduction

ClickOps – a word that (should) send shivers through anyone who manages production infrastructure. 

ClickOps is a colloquialism describing the process of provisioning infrastructure by clicking through a graphical user interface (GUI). In this case, we’re going to talk about it in the context of launching infrastructure or configuring services through the AWS Management Console, rather than using an infrastructure-as-code tool like AWS CloudFormation or HashiCorp Terraform.

As this post forms part of the Well-Architected series, it’d be remiss of me not to link back to AWS’ best-practice framework. The manner in which you operate in AWS is closely linked to the Operational Excellence pillar. Within the pillar, there is a design principle titled “Perform operations as code”, which talks about applying “the same engineering discipline that you use for application code to your entire environment” and states that “by performing operations as code, you limit human error and create consistent responses to events.” Who wouldn’t want these behaviours?

I’ve adopted a more formal approach to thought leadership in previous posts. This time, I’d like to try something a little bit different. We’re going to learn how to rid your organisational culture of ClickOps (where appropriate) by telling a story. We won’t go deep into technical detail; instead, we will skim the surface of some ideas and concepts that will help us along the way.

So buckle up – you’re about to embark on a journey that will banish the curse of ClickOps!

Please note that the story that follows is a work of fiction. Any resemblance to actual persons, living or dead, organisations, or events is purely coincidental.

Let’s meet our main character.

Say hello to the protagonist of our story, Clara.

Clara is a Cloud Engineer working for WBSPro, a medium-sized organisation that provides project management software to other organisations through a B2B SaaS model. She joined along with a couple of others just over three months ago when the organisation decided to invest in dedicated cloud skillsets rather than rely upon their Software Engineers as they had done up to that point.

WBSPro’s AWS environment was a bit of a mess. Everything had been deployed into one account, and all the resources had been created by hand. Clara had spent her first few months getting to grips with the infrastructure. It wasn’t easy; documentation was pretty sparse, and the Software Engineers didn’t respond particularly well to having responsibility taken away from them.

Previously, the organisation hesitated to invest time and energy in improving its AWS environment, thinking, “Well, it already works.” No one had articulated strong enough reasons to convince them otherwise.

All is well until it’s not.

One of the reasons the organisation hired dedicated Cloud Engineers was to prepare for the major launch of a new version of its application. The latest version had been in the works for almost a year, and the product manager was starting to push harder on its timeline, wanting to release it in beta to a select number of customers.

The Software Engineering team indicated that the currently deployed infrastructure wouldn’t support the new version. Additionally, if upgraded in place, it would break the current version.

Sounds like a no-win situation for Clara! 

Clara saw this as an excellent opportunity to move WBSPro towards AWS best practice by building new infrastructure the right way, in a new AWS environment.

Clara and her colleagues in the Cloud Engineering team devised a time estimate for building the infrastructure into the new environment using infrastructure-as-code tools. They thought it would take about three weeks. When this estimate was fed back to the product manager, the team was told it was too long and that they needed to have infrastructure ready within the next week… Fun times! 

Clara and her colleagues now had to perform a miracle by delivering something in a third of the time to keep their stakeholders happy. The Cloud Engineering team decided to work 16-hour days, got the job done, and all suffered from burnout the following week. They all lived happily ever after, the end… 

Just kidding – that’s never the right solution! It was time for them to re-think their approach and view it from a different angle.

Banishing the curse

Now for the exciting part of our story: time to banish the curse of ClickOps. Fortunately for Clara, many organisations have trodden this path before and worked out what needs to be done. Banishing this particular curse requires three key steps:

  1. Discover the damage
  2. Rescue the resources
  3. Prevent the curse from returning

1. Discover the damage

Clara and her team’s first action was to determine just how much damage the ClickOps curse had inflicted on the environment. To do this, they reached straight for AWS Resource Explorer.

Resource Explorer found resources deployed across AWS regions in the single AWS account in which the organisation operated. It doesn’t support every resource type, but when Clara reviewed the list of supported resources alongside her and her colleagues’ limited knowledge of what was deployed, she was fairly confident that the vast majority had been discovered.

Using AWS Resource Explorer enabled the team to build a list of resources that comprised the existing application’s required infrastructure. They needed this information as they proceeded to the next step.

2. Rescue the resources

Once the resources in the AWS estate had been discovered, it was time to get infrastructure-as-code (IaC) involved.

Clara’s IaC tool of choice was Terraform. Terraform came with some pretty nifty functionality that allowed her to generate configuration by using the import block. Combining this with the -generate-config-out option of terraform plan meant that she was able to convert resource IDs (e.g. EC2 instance IDs) into fully formed Terraform code.

This functionality made it possible to retrofit IaC to the existing environment. She needed to define import blocks, generate configuration, and then run terraform apply to import the resources into the Terraform state. She combined this with Terraform Workspaces to keep the existing environment’s state separate from that of potential future environments.
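
To make that concrete, here is a minimal sketch of what one of Clara’s import blocks might have looked like. The resource name app_server and the instance ID are placeholders invented for illustration, and import blocks with generated configuration need Terraform 1.5 or later.

```hcl
# imports.tf
# Hypothetical example: the resource address and instance ID are placeholders,
# not values from WBSPro's real environment.
import {
  to = aws_instance.app_server # address the resource will have in the configuration
  id = "i-0123456789abcdef0"   # ID of the existing, hand-built EC2 instance
}

# Typical workflow, run from the same directory:
#   terraform plan -generate-config-out=generated.tf   # writes HCL for the resource being imported
#   terraform apply                                     # records the resource in the Terraform state
```

Generated configuration is a starting point rather than a finished product, so it’s worth reviewing generated.tf before running terraform apply.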

Hopefully, you can see how this is starting to banish the curse. Clara had a defined Terraform configuration, and the Terraform state reflected the existing environment. Already, she was in a much stronger position with the ability to make controlled and recorded changes.

I can hear you asking, but how did this help Clara with her pesky product manager who needed a new environment?

This is where one of IaC’s most significant benefits comes in – repeatability. Terraform configuration defines the desired state, and Terraform state files record the resources that Terraform actually manages. By using Terraform Workspaces, state files can be isolated per environment. This meant that Clara could create a new workspace, point Terraform at credentials for a new AWS account, run terraform apply, and deploy all the defined architecture there.

With this, she could also make the required changes to the configuration to support the new application. Terraform supports variables, and so Clara was able to define different values for each environment.

For example, the EC2 instances used in the old environment could be of type t2.medium, whereas the new application required t4g.large. A variable called ec2_instance_type could be specified and then referenced in the code as var.ec2_instance_type.
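
A rough sketch of how that variable could be wired up is shown below. Only ec2_instance_type comes from the story; the ami_id variable, the resource name, the new-app workspace, and the new-app.tfvars file are assumptions added to make the example complete.

```hcl
# variables.tf
variable "ec2_instance_type" {
  description = "EC2 instance type for the application servers"
  type        = string
  default     = "t2.medium" # value used by the existing environment
}

variable "ami_id" {
  description = "AMI to launch (hypothetical variable, added so the example is complete)"
  type        = string
}

# main.tf
resource "aws_instance" "app_server" {
  ami           = var.ami_id
  instance_type = var.ec2_instance_type
}

# new-app.tfvars (hypothetical file holding the new environment's values):
#   ec2_instance_type = "t4g.large"
#   ami_id            = "<ID of an arm64 AMI>"
#
# Deploying the new environment from its own workspace:
#   terraform workspace new new-app
#   terraform apply -var-file=new-app.tfvars
```

One detail to watch: t4g instances run on Arm-based Graviton processors, so the new environment would need an arm64 AMI rather than the x86 image a t2.medium uses. That is why the AMI is a variable here too.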

3. Prevent the curse from returning

So, where are we up to?

Clara has now deployed a version of the infrastructure to the new environment and accomplished what the product manager needed – result! But how did she stop the curse of ClickOps from returning? The key was in control and culture.

She and her team needed to prevent Software Engineers from making destructive changes to AWS infrastructure. They did this by implementing authorisation controls with AWS IAM. Software Engineers were allowed to have read-only access to AWS, whereas members of the Cloud Engineering team were granted a higher level of access.
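
One way the read-only half of that split could be expressed in Terraform is sketched below. The group name is an assumption made for illustration; ReadOnlyAccess is an AWS managed policy. An organisation using federated access would more likely attach the policy to a role or permission set instead.

```hcl
# iam.tf
# Hypothetical group for the Software Engineers; the name is illustrative.
resource "aws_iam_group" "software_engineers" {
  name = "software-engineers"
}

# Attach the AWS managed ReadOnlyAccess policy so that members can view
# resources in the console but cannot create, change, or delete them.
resource "aws_iam_group_policy_attachment" "software_engineers_read_only" {
  group      = aws_iam_group.software_engineers.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

Keeping the access rules in the same version-controlled configuration as the infrastructure means that any loosening of permissions goes through review like every other change.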

Culture was a tricky one. It wasn’t something that Clara could implement overnight, although control and process helped to speed it up. She decided that embedding a strong DevOps culture would ensure the team’s hard work wasn’t undone. She empowered Software Engineers to release changes through well-controlled automation driven by IaC. For example, rather than people running commands on their local machines to deploy new software versions, the team implemented pipelines driven by version-controlled source code repositories (e.g. Git).

Looking back

Let’s round out our story with some reflection (and not the Java kind).

Clara’s journey started with a product manager asking for infrastructure to be ready for testing a new release in a third of the time that she and her team had estimated. She explored infrastructure-as-code, retrofitting Terraform to the existing environment to bring it closer to best practice. With Terraform Workspaces, she deployed an identical environment to a new AWS account and made the required changes to support the latest release of the application.

WBSPro’s AWS estate is now in a much better state than at the start of this story. With their new Terraform configuration, it’s easy for them to launch ephemeral environments for testing and tear them down when no longer required.

The skills Clara and her team learnt in this story mean that ClickOps doesn’t necessarily have to be a complete no-go. It has its place for small-scale development, where only a handful of resources are being created. It’s all about balancing order and chaos. For production-grade applications, there’s no reason for ClickOps to ever be the first choice.

The end

That’s all folks!

So, that draws our story to an end. We’ve banished the curse of ClickOps, and now WBSPro’s AWS estate is in a much better condition.

I hope you’ve enjoyed this light-hearted read and learnt something valuable along the way.

If you’re finding that your AWS environment is suffering from the curse of ClickOps, please do get in touch. Ubertas Consulting are experts at liberating infrastructure from such pesky curses, and no one knows AWS better.

 


Alex Kearns
Principal Solutions Architect, Ubertas Consulting
AWS Ambassador, Community Builder & User Group Leader
