Well-Architected: A Guide to Cost Optimisation

As you’re already very much aware, Cost Optimisation isn’t a one-off checklist item that one can ‘fire-and-forget’. It’s an ongoing, repeated process. And that can be a chore, especially if you’re the only one driving it.

But what about a more proactive approach to Cost Optimisation? How can we apply an attitude of continual improvement to how we manage our overall cloud investment, its returns and other benefits to business outcomes? And how can we leverage the huge number of AWS services to realise this goal?

In this blog, we will explore Cost Optimisation as not just a cost-centric, resource-focused exercise, but also a result of adopting a positive and forward-thinking technology and business culture. One where cost savings are the responsibility of the many, and not the few.


The 5 Focal Areas of Cost Optimisation

Cost Optimisation forms one of the 6 pillars of the AWS Well-Architected Framework. In turn, that pillar comprises the following 5 focal areas:

  1. Practice Cloud Financial Management
  2. Expenditure and usage awareness
  3. Cost-effective resources
  4. Manage demand and supply resources
  5. Optimise over time

1. Practice Cloud Financial Management

According to AWS, Cloud Financial Management (CFM), “…enables organizations to realize business value and financial success as they optimize their cost and usage and scale on AWS…”

This is broken down further, into the following items:

Functional Ownership

Create a business function – either an individual or a team – that will take responsibility for driving improved cost awareness in the business. This function should focus its efforts in two ways: centralised and decentralised.

Centralised Approach

This focuses mainly around economies-of-scale incentives, such as Reserved Instances and Savings Plans.

These cost models provide a good degree of flexibility, allowing you to make all-upfront commitments to reserve data-centre capacity (CAPEX cost model) or a pay-monthly commitment (OPEX model), or even a 50/50 hybrid of the two.

Reservation commitments can be made for either 12 or 36 months, either when reserving instances (such as EC2 or RDS), or taking out compute plans (useful for adopters of Fargate and Lambda, for example).

In cases where you know that you are guaranteed to require a certain number of always-on EC2 instances, pretty much forever, you can achieve the highest cost saving by choosing an all-upfront 3-year reservation. Typically, you can save in the region of 62% by doing this. If you cannot commit to that much capital expenditure, you can pay half upfront and the rest in monthly instalments across the 3 years. This brings the saving down to ~61%. No capex budget? Then go with the no-upfront, pay-monthly deal, and save ~60%. Dropping down to the 1-year deal will see these discounts drop to around the 40% mark.
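To make the trade-off concrete, here is a minimal sketch of the arithmetic, using the approximate 3-year discounts quoted above. The on-demand hourly rate is a made-up figure for illustration only; real prices vary by instance type and region.

```python
# Hypothetical on-demand rate (assumption); check current AWS pricing for real figures.
ON_DEMAND_HOURLY_USD = 0.10
HOURS_PER_YEAR = 24 * 365

# Approximate 3-year discounts quoted in the text.
DISCOUNTS_3YR = {
    "all-upfront": 0.62,
    "partial-upfront": 0.61,
    "no-upfront": 0.60,
}


def three_year_cost(discount: float, instances: int = 1) -> float:
    """Total 3-year cost of always-on instances after applying a discount."""
    on_demand_total = ON_DEMAND_HOURLY_USD * HOURS_PER_YEAR * 3 * instances
    return round(on_demand_total * (1 - discount), 2)


if __name__ == "__main__":
    print(f"on-demand: ${three_year_cost(0.0):,.2f}")
    for option, discount in DISCOUNTS_3YR.items():
        print(f"{option}: ${three_year_cost(discount):,.2f}")
```

The gap between the three payment options is small compared with the gap between any of them and on-demand, which is why the decision usually comes down to how much capital you can commit rather than the last percentage point.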

By contrast, if you’re planning migrations from EC2 to Fargate or Lambda, then you might choose to take out one of the more flexible Savings Plans – such as the Compute Savings Plan – which will provide cover across all three technologies whilst you migrate. Despite this flexibility, you still make a single commitment (to a level of hourly spend, rather than to specific capacity), and are still able to pay all-upfront, monthly or both. Discounts vary in the range of ~30-40% in the 12-month bracket, and move up to ~50% in the 3-year deals.

Decentralised Approach

This is where your new Cost Optimisation function will come into its own, as it begins to drive an improved cultural and behavioural attitude towards cost awareness and avoidance. Constant trade-offs between short-term cost avoidance vs long-term investment in improved business efficiencies (often driven by technology choices) will be of particular focus here.

Finance and Technology Partnership

Ensure that finance and technical teams invest a good proportion of their time to understand each other’s requirements. This is often overlooked, and usually with frictional results.

As your technology teams migrate further away from on-premise and traditional data-centre or fixed infrastructure, so too must your finance team move away from fixed-cost forecasts and budgets (CAPEX) to a more flexible, variable-cost model (OPEX).

Use AWS services such as AWS Budgets & Alarms, AWS Cost Explorer and AWS Cost & Usage Reports to help finance teams become more agile in their own processes and culture. Having better real-time awareness of variable costs and related data will help them to align their management practices with your technology teams’ deliveries and requirements.

Cloud Budgets and Forecasts

As you make better use of cloud technologies, your costs are likely to become more variable. This is a good thing, as it means you are likely to enjoy vastly reduced costs during quiet periods. But you need to protect yourself against sudden spikes in variable costs when demand is high.

One of the ways to do this is to set up AWS Budgets and Forecasts.

You can easily set fixed usage or monetary budgets for a given period, and have AWS send you alerts when your forecast usage is expected to exceed them. These are highly customisable, easy to configure and also offer a sliding-scale model (which is useful for product launches or re-platforming projects).
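As a rough illustration, a monthly cost budget with a forecast-based email alert could be described like this. The sketch only builds the request payload, in the shape accepted by the AWS Budgets `CreateBudget` API. The budget name, limit and email address are placeholders.

```python
def build_forecast_budget(name, monthly_limit_usd, alert_email, threshold_pct=100.0):
    """Build a CreateBudget-style payload: a monthly cost budget that alerts
    when the *forecast* spend exceeds threshold_pct of the limit."""
    budget = {
        "BudgetName": name,
        "BudgetLimit": {"Amount": f"{monthly_limit_usd:.2f}", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    }
    notification = {
        "Notification": {
            "NotificationType": "FORECASTED",  # alert on forecast, not actual spend
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": threshold_pct,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": alert_email}],
    }
    return {"Budget": budget, "NotificationsWithSubscribers": [notification]}


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_forecast_budget("sandbox-monthly", 50, "budget-owner@example.com"))
```

With boto3 installed and credentials configured, the payload could then be submitted with something like `boto3.client("budgets").create_budget(AccountId="123456789012", **payload)`, where the account ID is again a placeholder.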

Also consider using AWS Cost Explorer, which can forecast daily (up to 3 months) or monthly (up to 12 months) cloud costs, based on machine-learning algorithms applied to your historical costs (trend-based forecasting).

Cost-Aware Processes

Encourage everyone in your business to think about the cost (investment) and any return on that cost (Return On Investment), before undertaking new projects in particular.

For example, Business Analysts should seek to understand how cost factors have driven existing processes and purchasing decisions; Product Owners should seek to understand how product improvements can reduce demand on digital services; and Technical Leaders should support both roles by suggesting innovative use of technology to deliver short-term objectives with longer-lasting cost benefits.

Engineers, meanwhile, should enjoy a high level of autonomy in their role, so they can iterate and deliver at high speed. But they should also be constantly educated about the cost implications of their own activities. Developer sandbox accounts should have their own budgets and alarms for example, including escalations to line managers and budget controllers.

You can use AWS Organizations and AWS SSO to provision sandbox accounts to individuals; leverage the organisation-wide billing capabilities of AWS Cost Explorer and AWS Cost & Usage Report to understand and monitor these costs.

Quantify Business Value Delivered Through Cost Optimisation

Whether you are migrating services from one compute or database model to another, applying Savings Plans or Reserved Instances, terminating old EC2 instances, or decommissioning other resources, it’s important to keep a good track record of before-and-after costs.

This can be achieved using AWS Cost & Usage Reports before and after a given cost-reduction exercise. Cost savings can be monitored mid-flight using the sliding-scale model in AWS Budgets.

This is all very useful, but immediate cost savings are not the sole benefit of Cost Optimisation. The freedom to invest in longer-term efficiencies is just as important.

AWS offer the following examples:

Executing cost optimization best practices

For example, resource lifecycle management reduces infrastructure and operational costs and creates time and unexpected budget for experimentation. This increases organization agility and uncovers new opportunities for revenue generation.

Implementing automation

For example, Auto Scaling, which ensures elasticity at minimal effort, and increases staff productivity by eliminating manual capacity planning work.

Forecasting future AWS costs

Forecasting enables finance stakeholders to set expectations with other internal and external organization stakeholders, and helps improve your organization’s financial predictability.

2. Expenditure and Usage Awareness

Every product team or revenue-generating area of your business should have its own cost-monitoring, budgeting and forecasting capability, as well as the business as a whole having a centralised one.

This decentralised model ensures that those closest to the coal face can remain agile in their delivery, whilst also having ownership, accountability and control of their own costs (and savings).

The three following items should be considered:

Governance

Refine Policies

By leveraging AWS services, such as IAM, AWS Organizations & Service Control Policies and AWS Config, your business can gradually introduce measures to ensure that use or modification of certain services or resources is restricted.

Basic examples might include applying a region restriction to your AWS Organization, or disallowing the use of expensive, overpowered EC2 instance families.
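Both of those basic examples can be expressed as Service Control Policies. The sketch below generates the policy JSON. The region, the allowed instance-type patterns and the list of exempted global services are assumptions chosen for illustration. The exemption list in particular is deliberately short and would need reviewing against the services you actually use.

```python
import json


def region_restriction_scp(allowed_regions):
    """Deny requests outside the allowed regions, except for a few global services."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideAllowedRegions",
            "Effect": "Deny",
            # Illustrative, incomplete exemption list for global services.
            "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": list(allowed_regions)}
            },
        }],
    }


def instance_type_scp(allowed_patterns):
    """Deny launching EC2 instances whose type matches none of the allowed patterns."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnapprovedInstanceTypes",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringNotLike": {"ec2:InstanceType": list(allowed_patterns)}
            },
        }],
    }


if __name__ == "__main__":
    print(json.dumps(region_restriction_scp(["eu-west-2"]), indent=2))
    print(json.dumps(instance_type_scp(["t3.*", "t4g.*"]), indent=2))
```

The resulting JSON documents can be attached to an organisational unit or account through AWS Organizations. Test them against a sandbox account first, as an overly broad Deny is easy to write.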

With time, you can refine these restrictions to target specific AWS accounts or environments, IAM roles or user groups. Your developers might have ample freedom to explore different services in their sandbox accounts, but only on free-tier. In a staging environment, larger instances may be allowed, but with the avoidance of multi-regional deployments, for example. In production, restrictions are constantly re-evaluated to provide the best trade-off of demand vs performance.

Once again, lean on AWS Budgets to provide real-time notifications of off-policy activities that perhaps have not been explicitly enforced.

Use Multiple Accounts

AWS has a one-parent-to-many-children account structure, commonly described in terms of a management account (the parent) and member accounts (the children).

A best practice is to always have at least one management account with one member account, regardless of your organisation size or usage. All workload resources should reside only within member accounts.

Use consolidated billing within AWS Organizations to aggregate all costs into the parent account. This is useful for finance teams and bill-payers, for obvious reasons. However, each child account can still contain its own budgets and alerts, which is useful for implementing the distributed model of Cost Optimisation accountability and autonomy in your wider teams.

As costs and usage are aggregated in the management account, this allows you to maximise your service volume discounts, and maximise the use of your commitment discounts (Savings Plans and Reserved Instances) to achieve the highest discounts.

AWS Control Tower can quickly set up and configure multiple AWS accounts, ensuring that governance is aligned with your organisation’s requirements.

Track & Decommission Resources

Ensure that you track the entire lifecycle of a workload. This ensures that when workloads or workload components are no longer required, they can be decommissioned or modified.

You can use AWS Config or AWS Systems Manager to quickly collate inventories of existing resources and configurations, regardless of how many accounts you operate. You can then begin to terminate resources that are no longer required.

Monitor Cost and Usage

Your teams must be able to track and understand their own costs before they can be expected to control them.

There are several things you should consider here.

Configure Detailed Data Sources

“Enable hourly granularity in Cost Explorer and create a Cost and Usage Report (CUR). These data sources provide the most accurate view of cost and usage across your entire organization. The CUR provides daily or hourly usage granularity, rates, costs, and usage attributes for all chargeable AWS services. All possible dimensions are in the CUR including: tagging, location, resource attributes, and account IDs.”

Consider using Amazon QuickSight to visualise and explore your CUR. Other services, such as Amazon Athena, can be used to perform custom data analysis, should your organisational scale require it.

Identify Cost Attribution Categories

Your efforts to categorise costs can be greatly supported using AWS Tag Editor, for example, which can be used to manage resource tagging at scale. For example, you might want to tag all currently running EC2s in your development account, or certain DB instances for some other cost-tracking purpose. This can easily be done here.

Use AWS Tag Policies in your Organization, in order to prevent misuse of tagging across your accounts.

Establish Workload Metrics

Using services such as Amazon CloudWatch Metrics, begin to align workload success (or failure) metrics with associated cost metrics or budgets in AWS Cost Explorer or AWS Budgets. Use these tools to understand how investments in new technology may have improved or worsened your workload’s output. This information can be used to help determine overall business value and ROI.

Use Cost Categories

“AWS Cost Categories allows you to assign organization meaning to your costs, without requiring tags on resources. You can map your cost and usage information to unique internal organization structures. You define category rules to map and categorize costs using billing dimensions, such as accounts and tags. This provides another level of management capability in addition to tagging. You can also map specific accounts and tags to multiple projects.”

Decommission Resources

The basic steps to decommissioning resources are as follows:

Track Resources Over Their Lifetime

Tag, tag and tag some more, especially in pre-production environments where resources are often short-lived but regularly left running. If you deploy resources automatically, in a CI/CD pipeline, make that pipeline responsible for tearing them down again. But tag everything. Any resources that remain can then be tracked using AWS Cost Explorer.

For longer-lived resources, consider tagging them with future dates for decommissioning, if these dates are known. Have a process for filtering those tags and reviewing their accuracy.
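A review process like this can be as simple as a scheduled script. The sketch below assumes a hypothetical tag key, `decommission-after`, holding an ISO date, and a plain list of resource records such as you might assemble from an inventory export. Neither the tag key nor the record shape is an AWS convention.

```python
from datetime import date

# Hypothetical tag key (assumption); use whatever your tagging standard defines.
DECOMMISSION_TAG = "decommission-after"


def review_decommission_tags(resources, today):
    """Return (resource_id, reason) pairs for resources that need attention:
    either the decommission date has passed or the tag value is not a valid date."""
    findings = []
    for resource in resources:
        raw_value = resource.get("tags", {}).get(DECOMMISSION_TAG)
        if raw_value is None:
            continue  # no planned decommission date recorded
        try:
            deadline = date.fromisoformat(raw_value)
        except ValueError:
            findings.append((resource["id"], "invalid date tag"))
            continue
        if deadline < today:
            findings.append((resource["id"], "past decommission date"))
    return findings


if __name__ == "__main__":
    # Made-up inventory records for illustration.
    inventory = [
        {"id": "i-aaa", "tags": {DECOMMISSION_TAG: "2024-01-31"}},
        {"id": "i-bbb", "tags": {DECOMMISSION_TAG: "end of Q3"}},
        {"id": "i-ccc", "tags": {"team": "payments"}},
    ]
    for resource_id, reason in review_decommission_tags(inventory, date.today()):
        print(resource_id, reason)
```

Flagging malformed values alongside overdue ones covers the "reviewing their accuracy" half of the process, since a tag nobody can parse is as unhelpful as no tag at all.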

Implement a Decommissioning Process

“Implement a standardized process across your organization to identify and remove unused resources. The process should define the frequency searches are performed, and the processes to remove the resource to ensure that all organization requirements are met.”

Decommission Resources

“The frequency and effort to search for unused resources should reflect the potential savings, so an account with a small cost should be analyzed less frequently than an account with larger costs. Searches and decommission events can be triggered by state changes in the workload, such as a product going end of life or being replaced. Searches and decommission events may also be triggered by external events, such as changes in market conditions or product termination.”

Decommission Resources Automatically

“Use automation to reduce or remove the associated costs of the decommissioning process. Designing your workload to perform automated decommissioning will reduce the overall workload costs during its lifetime. You can use AWS Auto Scaling to perform the decommissioning process. You can also implement custom code using the API or SDK to decommission workload resources automatically.”

4. Manage Demand and Supply Resources

The Cloud is billed on-demand by default. This can be a blessing and a curse, as we have discussed in other sections. Cloud resources are also scalable – some vertically and others horizontally. You should work to understand how these two scaling models compare.

For example, a static EC2 instance with a large amount of computational headroom will meet a temporary spike in demand, but it is a sledge-hammer to crack a walnut in cost terms. And if you require redundancy, then your already-high wastage will double or triple. You could introduce the AWS Auto-Scaling feature of EC2 in order to scale up when demand is high, but scale back down when demand is low.

But you also need to consider the speed at which auto-scaling works. Migrating your workloads to AWS ECS+Fargate or Lambda Functions will increase their scaling speed, whilst also reducing the size of each node; so you save twice.

Consider this slicker horizontal scaling model to be ‘just-in-time’ provisioning of resources.

You can also plan to preemptively upscale for known high-traffic events, such as an upcoming marketing campaign. AWS Auto-Scaling can be triggered by AWS CloudWatch Events (now EventBridge) either based on real-time events or on a schedule. You can also trigger auto-scaling events via the CLI or SDKs.
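For the scheduled case, one possible approach is a pair of scheduled actions on an Auto Scaling group: one to scale out shortly before the campaign, and one to scale back in afterwards. The sketch below only builds the parameters, in the shape used by the EC2 Auto Scaling `PutScheduledUpdateGroupAction` API. The group name, capacities and 30-minute warm-up are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def campaign_scaling_actions(group_name, campaign_start, campaign_end,
                             normal_capacity, peak_capacity, max_capacity,
                             warm_up=timedelta(minutes=30)):
    """Build two scheduled actions: scale out before the campaign, scale in after."""
    if not normal_capacity <= peak_capacity <= max_capacity:
        raise ValueError("expected normal <= peak <= max capacity")
    if campaign_end <= campaign_start:
        raise ValueError("campaign must end after it starts")

    scale_out = {
        "AutoScalingGroupName": group_name,
        "ScheduledActionName": "campaign-scale-out",
        "StartTime": campaign_start - warm_up,  # allow instances time to warm up
        "MinSize": peak_capacity,
        "MaxSize": max_capacity,
        "DesiredCapacity": peak_capacity,
    }
    scale_in = {
        "AutoScalingGroupName": group_name,
        "ScheduledActionName": "campaign-scale-in",
        "StartTime": campaign_end,
        "MinSize": normal_capacity,
        "MaxSize": max_capacity,
        "DesiredCapacity": normal_capacity,
    }
    return scale_out, scale_in


if __name__ == "__main__":
    # Placeholder dates and sizes for illustration.
    start = datetime(2030, 6, 1, 9, 0, tzinfo=timezone.utc)
    for action in campaign_scaling_actions("web-asg", start,
                                           start + timedelta(hours=8), 2, 10, 12):
        print(action)
```

With boto3 available, each dictionary could be passed to `boto3.client("autoscaling").put_scheduled_update_group_action(**action)`. Pairing the two actions matters for cost: forgetting the scale-in leaves the peak capacity running long after the campaign has ended.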

You can also use AWS Cost Explorer or Amazon QuickSight with your Cost & Usage Report (CUR) or your application logs, to perform a visual analysis of workload demand. This is useful for future capacity planning and rightsizing requirements.

Manage Demand

Throttling

If you’re simply receiving more traffic than you can afford to service, then you can implement throttling on selected services, such as API Gateway.
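API Gateway describes its throttling in terms of a token bucket, with a steady-state rate and a burst allowance. The sketch below is a simplified, single-process illustration of that idea, not API Gateway's actual implementation. Time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Simplified token bucket: `burst` tokens at most, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last_refill = 0.0

    def allow(self, now):
        """Return True if a request arriving at time `now` (in seconds) is admitted."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # throttled: the caller would typically receive HTTP 429


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_second=2, burst=3)
    for t in [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]:
        print(t, bucket.allow(t))
```

The burst value lets short spikes through without paying for permanent headroom, while the rate caps the sustained load (and therefore the sustained cost) that the back end has to serve.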

Buffering

Decouple boundary-layer transactions (such as API calls) from underlying back-end processing. This will allow transactions to execute faster, by minimising the scope of synchronous tasks. Asynchronous messages can be pushed to a queue system, such as Amazon SQS, and be processed at a slightly later time, or at a slower rate (which improves end-user experience and also reduces costs).

When architecting with a buffer-based approach, ensure that you architect your workload to service the request in the required time, and that you are able to handle duplicate requests for work.
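Handling duplicates matters because queues such as standard SQS queues offer at-least-once delivery, so the same message can occasionally arrive twice. Here is a minimal in-memory sketch of an idempotent consumer. The message shape and the trivial processing step are assumptions, and a real worker would keep the processed IDs in durable storage rather than in a Python set.

```python
from collections import deque


class IdempotentWorker:
    """Processes each message ID at most once, ignoring redelivered duplicates."""

    def __init__(self):
        self.processed_ids = set()  # in production: a durable store, not memory
        self.results = []

    def handle(self, message):
        """Return True if the message was processed, False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self.processed_ids:
            return False
        self.results.append(message["body"].upper())  # stand-in for real work
        self.processed_ids.add(message_id)
        return True


def drain(queue, worker):
    """Process everything currently buffered; return how many messages did real work."""
    handled = 0
    while queue:
        if worker.handle(queue.popleft()):
            handled += 1
    return handled


if __name__ == "__main__":
    buffered = deque([
        {"id": "m-1", "body": "resize image"},
        {"id": "m-2", "body": "send receipt"},
        {"id": "m-1", "body": "resize image"},  # redelivered duplicate
    ])
    worker = IdempotentWorker()
    print(drain(buffered, worker), worker.results)
```

Because the worker drains the buffer at its own pace, it can be sized for average load rather than peak load, which is where the cost saving comes from.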

5. Optimise Over Time

One of the great benefits of the cloud is that it’s always improving. That’s good for your business, because you will be able to benefit from these improvements, all the while avoiding the direct ownership costs of operating bare-metal resources in your own data centres.

However, it also means that you must stay abreast of new services or improvements to existing ones, in order to maintain your Cost Optimisation benefits.

Regular audits are your friend, but so too is your new culture of shared ownership of costs, savings and other related optimisations. Continue to run internal workshops, continue to promote ongoing improvements to cost visibility, continue to learn and educate and share ideas.

Deliberately set aside time to review and evaluate new services, and prove or disprove their Cost Optimisation potential through POC and MVP deliverables.

Good luck!

 


Jim Wood
Solutions Architect, Ubertas Consulting
