Amazon Web Services Cloud Development Kit

22 Sep 2024 - Conan Mercer

Problem Statement

CloudFormation (Cfn) templates, while effective, can become time-consuming to manage and are usually written in a language that differs from the application’s service code. Deployment is often handled through a third party build server (such as TeamCity), which introduces extra components and associated costs. This post explores the use of Amazon Web Services Cloud Development Kit (CDK) and GitHub Actions as potential alternatives, providing a more cohesive, service-driven approach to infrastructure. By aligning infrastructure code with service development, this approach aims to improve workflow efficiency and simplify deployment processes.

Note: A repository of code has been written in-conjunction with this post. I have written examples to accompany this post in C Sharp and Python. The code is functionally the same, however written in two different languages to provide proof of concepts in multiple languages.

The accompanying code to this post resides on my GitHub - it can be viewed from here

A Case for Amazon Web Services Cloud Development Kit

CDK is a software development framework used to model and provision cloud application resources in AWS. It can be used with 5 programming languages:

Python
Java
.NET
GO
TypeScript

This enables infrastructure to be provisioned through object-oriented abstractions, harnessing the power of these object-oriented programming languages.

Advantages over Alternatives

Advantages over Cloudformation Templates

Cloudformation (Cfn) is wonderful for small scale infrastructure deployments, but can quickly become unsuitable for large enterprise cloud environments. The main reason for this is because Cfn is not programmatic. Cfn typically results in a massive amount of JSON or YAML code, that needs to be fed into TeamCity (or similar build servers) to request, maintain, provision, and destroy stacks. CDK outputs Cfn, so the end result is the same, however using CDK can simplify, automate, and organise code more efficiently. It also comes with more benefits which are discussed in detail throughout the rest of this post.

Advantages over Terraform

Terraform is an infrastructure as code tool. Terraform can be used with different cloud providers, which is a better choice if the system architecture uses multiple cloud providers, or if in the future it is likely that the system will be migrated to another cloud provider other than AWS. CDK is built by the AWS team specifically for AWS Cloud, this means that CDK is supported to the fullest extent possible (see subsection 3.1 for an expanded argument). Terraform is free to use but costs can occur for additional capabilities. CDK is totally free and is much more likely to remain free as it is a tool for AWS to get customers on board to their cloud offerings. Terraform is written in HashiCorp Configuration Language (HCL), whereas CDK can be written in the same language as the application/services, lowering the developer barrier for adoption.

Native CDK Advantages

Time to develop. Writing Cfn templates and using, for example, TeamCity to deploy them, is laborious. It works, but it takes more time to complete a development cycle. When changes are made locally, there is no simple method of checking template validity. It is true that the AWS CLI can be used to send the Cfn template to AWS, which then returns if a template is valid, but this needs to be setup. With CDK, by running cdk synth in the command line the developer can immediately observe if their code changes are at least valid. Another useful command is cdk diff. This compares differences between the current state of stacks in the local CDK app against deployed stacks. This command is of great use when developing, and also when identifying if changes desired are changes that will actually be deployed. It is also very powerful when refactoring IaC. Post refactor, this command can be run to ensure that the underlying cloud resources have not changed.

CDK Application

CDK Applications are comprised of various components. It is a collection of one or more Stacks, each Stack is composed of one or more Constructs. Ultimately, the output of a CDK app is a Cfn template, it is the journey CDK takes the developer on that makes this output achievable in a programmatic, reliable, and efficient manner. An example is shown in Figure 1

Overview of AWS Stacks Architecture — Figure 1: Simplified AWS CDK app

Constructs

Constructs in CDK are organised into 3 levels. With each level comes an increase in the level of abstraction. The highest level of abstraction (level 3) is easier to configure and requires less expertise, conversely Level 1 constructs offer no abstraction and are therefore more challenging.

Level 1 Constructs

Level 1 (L1) constructs, are denoted by Cfn and offer no abstraction. They are useful when a deep knowledge of AWS is held by the developer, and complete control over the deployed infrastructure is required. For example, CfnNatGateway is a L1 construct that maps directly to AWS::EC2::NatGateway in a Cloudformation resource. If a resource exists in AWS CloudFormation, it will be available in CDK as an L1 construct.

Level 2 Constructs

L2 constructs provide some abstraction, for example the construct s3.Bucket creates an S3 bucket with associated policy objects. The main difference between L1 and L2 constructs is that L2 constructs provide intelligent default property configurations, notably security settings. For this reason they are the most used constructs.

An example of L1, L2, and L3 Constructs for creating S3 buckets can be found here

Level 3 Constructs

L3 constructs provide the highest level of abstraction. They can contain a collection of resources, not just one, and can be used to deploy in some cases a completed architecture solution. For example the ecsPatterns.ApplicationLoadBalancedFargateService class provides a Fargate service running on ECS with an included load balancer.

Stacks

Single Stack

At some point, the CloudFormation template generated by a single Stack will become too large, reaching AWS limits. When this happens, the project will need to be split into multiple Stacks. Updates to stacks will take longer, as AWS must process the entire stack for any changes, making it harder to pinpoint what has changed. The output of cdk diff will become difficult to read, as developers will have to sift through many lines of output to find the relevant differences. In large applications, a small issue in one part of the stack (such as a reference to a deleted policy or IAM role) can block the entire stack deployment, potentially delaying critical changes from being deployed to production.

Multi Stack

A multi-stack approach addresses many of the problems seen in single stacks, but introduces some new challenges, such as more frequent updates across multiple stacks and handling cross-stack dependencies. If a resource is shared across multiple stacks, removing it can be tricky. However, on balance, using multiple stacks is generally the best way forward.

Stack Best Practices

Put stateful resources in their own stacks. This avoids an ”atomic block” where making a change cannot disturb another stack.
Always provide the account and region for every stack that is being created.
Refactor the code often, using cdk diff to ensure that the existing deployed stack wont change.
Snap-shot test the stack often
Avoid having more than 7 constructs per stack, anything more usually means a refactor to avoid complicating the stack

Infrastructure and Tooling

Infrastructure and Application Code

If infrastructure code is written in the same language as the application and services code, this simplifies the architecture of the entire system. The benefits here should be obvious, but I expand on some themes in the following sub sections.

Collaboration and Team Productivity

Option 1: Everything in one repository

Having IaC, deployment, and application code reside within the same repository is ideal because it enables all engineers involved in the project to have the same visibility, and hopefully innovate together. In theory it can help to facilitate faster development as silos are reduced between infrastructure and application code.

Option 2: Separated application code pipeline

If it is not ideal to have IaC and application code to reside in the same repository, and silos are actually desired, this can be achieved. This would decouple IaC and application code by sorting them into separate repository’s. Both the service and IaC can still be written in the same language.

GitHub Actions Deployment

Because the IaC and application can reside in the same repository, and the repository is on GitHub, all deployment code can also reside in the same repository by using GitHub Actions to deploy infrastructure. This is very powerful because at this point all the IaC, application code, service code, and deployment code are in the same repository. This also means that external deployment software such as TeamCity can be done away with, including their expensive licenses. Github Actions is free until a point, but costs do occur so it is not a free alternative, but the benefits outweigh the disadvantages. With very little code, a GitHub Action can setup AWS credentials, configure the environment, install necessary dependencies, run tests, and then deploy the stack using AWS CDK. An example is shown in Figure 2.

Overview of GitHub Actions workflow automating the deployment — Figure 2: Example of GitHub Actions workflow automating the deployment of stacks to a staging environment for a DMS project when changes are pushed to a specific branch. It sets up AWS credentials, configures the environment, installs necessary dependencies, runs tests, and then deploys the stacks using AWS CDK

GitHub Action Authorisation

Although using long lived keys or access ID is common practice, it is far from ideal. A more secure approach is to use OpenID Connect (OIDC). OIDC allows GitHub Actions workflows to access resources in AWS, without needing to store the AWS credentials as long-lived GitHub secrets. More information on implementing this can be found here

Version Control and Tracking

With IaC, infrastructure configurations can be stored in Git version control. This brings transparency, making it simpler to track changes and revert to previous versions if needed. When this is coupled with GitHub Actions for the deployment of infrastructure, not only can the infrastructure change be tracked, but also the tooling used to actually deploy the change. This should result in enhanced debugging compared with traditional build servers such as TeamCity, which do not make it easy to track changes to build configurations.

In the event of a disaster, because the infrastructure code, the deployment pipelines, and the application/service code are all contained within the same GitHub Repository, a true recovery from nothing is possible, and can be automated and tested regularly to simulate how quickly a recovery can occur from ground zero. Imagine a brand new AWS Account that needs to be provisioned exactly as an older, compromised, AWS Account.

Because the IaC and deployment pipelines are both versioned, they can be subjected to peer review. This is an often overlooked aspect of deployment pipelines. With TeamCity, it is very difficult to subject a pipeline change to peer review. This causes two major issues:

Only the creator of the change typically is aware a change has been made
Peers cannot review the change for possible errors and have little to no opportunity to improve the code of the committer

Overview of versioning deployment pipelines — Figure 3: Example of how versioning deployment pipelines can lead to peer review benefits. The top figure is the result of not having a peer review mechanism, the bottom figure is an IaC deployment pipeline that can be subjected to a peer review

Having both the application/service and IaC versioned equally is a big improvement. The version could also be tagged with the git hash of the commit that caused a version change. That would mean that a version can be tied back to a specific commit. For example a docker container could be tagged in ECR with the commit hash, it would then be very trivial to look at what changed between two versions of a docker container. This would make debugging docker container issues easier.

Testing and Validation

CDK code is testable, just the same as standard software development testing works. CDK apps use three types of testing categories:

Fine-grained assertions
Snapshot tests
Integration tests

Fine-grained assertions can be used to test for regression. They work by testing the output (Cfn template) for example ”does a resource have a specified property with a certain value”. They can be useful in test driven development, by writing the test first and then making it pass by writing a correct implementation in CDK.

Snapshot tests can be used when refactoring CDK code. This tests that the refactored code is not going to change at all the state of the already deployed infrastructure. It does this by comparing a synthesised template (the new refactored template) against a previously stored baseline template, the template currently deployed for example.

Once tests have been written, they can be included in CI/CD builds and infrastructure deployments. Depending on how strict the deployment needs to be, the tests can be run before any deployment occurs, and would fail a deployment if the tests fail for any reason. This adds a layer of safety and builds confidence when deploying changes to infrastructure. If for lower environments it is acceptable to not be as strict, the tests could be triggered to run only on Pull Requests. An example of a pipeline is shown in Figure 4.

Figure 4: Simplified Github Actions CDK testing pipeline

Integration testing is also possible with CDK, I will not go into detail here, but infrastructure can be deployed into 3 different regions and tested, this is very useful when planning and testing for disaster recovery, and by deploying (temporarily while testing) into multiple regions can uncover a lot of mistakes that would not normally be caught until a real world disaster.

Code Organisation for Large Projects

A CDK codebase needs to be well organised, and scale well. In this context, scaling means that the IaC doesn’t become hard to navigate, extend to multiple teams incoherently, and does not become slow to develop with or hard to on board new colleagues with.

Factories for Consistent Patterns

Abstract factories are common, well known, design patterns in software engineering. These can be used to promote code reuse and simplify the process of managing constructs.

Take for example provisioning EBS volumes. To provision EBS volumes efficiently, a design pattern that encapsulates volume creation logic can be used. The code defines an abstract base class called EBSStorage and several subclasses (GeneralPurposeEbs, IopsEbs, and ColdHddEbs) that represent different types of EBS volumes. Each subclass implements the initialize method to create specific EBS volumes with tailored properties, such as volume size and type.

An abstract factory, AbstractFactory, defines a method for creating EBS volumes, while EbsFactory provides concrete implementations based on the desired volume type (e.g., gp2, io1, sc1). This pattern simplifies the creation of EBS volumes by encapsulating the setup logic, promoting code reuse, and ensuring adherence to best practices.

The CdkEbsStack class demonstrates how to use this factory to provision and initialize an EBS volume in a CDK stack. By selecting the volume type through the factory, developers can easily manage and configure EBS resources in a consistent and maintainable manner. As shown in Figure 5, the Factory Pattern is used to create and initialize different types of EBS volumes in AWS CDK.

View the Python code for the CDK EBS stack on GitHub by following this link and the C Sharp code following this link.

Figure 5: Example of Factory Pattern used to create and initialize different types of EBS volumes in AWS CDK, showing the relationships between abstract classes, concrete storage types, and the factory class

Migration from an AWS CloudFormation template

When migrating workflows and IaC, time to develop, foresight, and costs need to be addressed. Ultimately if the time taken to complete a migration is too long, or if it costs too much money, it may not be the best route forward. The corollary argument is that accumulating technical debt, the risk of a technology stack becoming too old, is a risk that is often overlooked. With that said, here are some thoughts on how a project can go about minimising these risks as much as possible.

For large enterprise projects that have already been written in Cfn templates, it is possible to migrate those Cfn templates into CDK code. The AWS CDK CLI can be used to migrate from the following sources:

Deployed AWS resources
Deployed AWS CloudFormation stacks
Local AWS CloudFormation templates

This strategy can be used to get a CDK migration off the ground. It will try to create a CDK stack on a best effort basis, so while not foolproof it will go a long way to assist. Once the Cfn template has been migrated into a CDK app, normal CDK development and modification can occur. To test the new CDK version of the original Cfn template, cdk diff can be run locally to ensure no changes have been made to the underlying infrastructure.

This feature is experimental, and all 5 of the programming languages supported by CDK can be used. For example to convert an API Gateway Cfn template stack into a CDK app written in Python code, the following command can be run:

cdk migrate --from-path "./gateway.json" --stack-name "myCloudFormationStack" --language python

Figure 6: Example of cdk migrate command in the CLI

Figure 7: Example of migrating an existing Cfn template into a CDK app and testing for differences before deploying new CDK app

Integrating AWS CDK with Application Code for Real-Time Infrastructure Monitoring and Synchronization

The following is a more advanced use case of CDK with application and service code where there is automated communication between both. In short, the application or service can know when there is a problem with the infrastructure it is running on and make adjustments to correct itself. I will outline a rough example to demonstrate this possibility. Actual implementations can take a different approach but this is useful to gain a clearer picture of the concept. This is accompanied by Figure 8

Example of the integration between AWS CDK-deployed infrastructure and an application
for real-time monitoring — Figure 8: Example of the integration between AWS CDK-deployed infrastructure and an application for real-time monitoring, health event detection, and self-healing mechanisms. The process involves using CloudWatch Alarms, SNS for notifications, AWS Health APIs, EventBridge for event handling, and SSM Parameter Store for version synchronization. The application is capable of detecting infrastructure issues and initiating corrective actions, such as redeploying infrastructure components or restarting services like EC2 instances.

Using CloudWatch Alarms and SNS for Notifications

CloudWatch Alarms can be created, which the application or service can listen to in case of an infrastructure issue via SNS notifications. The application can subscribe to the SNS topic using AWS SDK or other methods to receive notifications when an alarm is triggered.

Using AWS Health APIs and EventBridge

AWS Health APIs can be used to monitor CDK derived AWS infrastructure health. EventBridge rules can then capture AWS Health events and notify the application.

AWS Systems Manager Parameter Store for Syncing Versions

The infrastructure version can be stored in SSM Parameter Store using CDK, this would allow the application to ensure it is synced with the correct infrastructure version. The application can query the parameter store to validate infrastructure versions using the AWS SDK.

Application-Driven Self-Healing

If the application detects issues with the infrastructure, it can trigger self-healing by updating or re-deploying specific parts of the CDK stack.

Example Scenario Auto-Restarting an EC2 Instance on Failure

Suppose an application is running on an EC2 instance. If the instance becomes unhealthy (e.g., due to high CPU usage, disk space exhaustion, or a process crash), the application could experience downtime. A self-healing solution could automatically detect this issue, stop the unhealthy instance, and launch a new one or restart the same instance.

Summary

CloudWatch Alarms and SNS can be used to notify the application in real-time whenever infrastructure issues arise. EventBridge captures AWS Health events, such as planned maintenance or outages, and relays them to the application. To ensure consistency, the infrastructure and application versions can be synchronized using SSM Parameter Store. Additionally, self-healing mechanisms can be implemented by allowing the application to trigger infrastructure automation via AWS CDK in response to detected problems.

Conclusion

In conclusion, AWS CDK offers significant advantages over traditional CloudFormation templates and Terraform. Its use of general purpose programming languages such as C Sharp and Python allows developers to leverage existing coding skills, create reusable constructs, and benefit from familiar software development best practices like version control, unit testing, and refactoring. This flexibility accelerates infrastructure development and reduces errors by enabling the use of logic, loops, and conditions within the code, which can be more cumbersome with JSON or YAML based CloudFormation templates.

One of the key benefits of CDK is the ability to synchronize application and infrastructure code within the same project, enabling both to be versioned together. This ensures that any changes to infrastructure are consistently aligned with the corresponding application or service updates, simplifying deployment workflows and minimizing potential issues from misaligned versions. While Terraform provides cross-cloud capabilities, CDK’s tight integration with AWS services and its developer-friendly approach make it an ideal choice for AWS-centric organizations, offering enhanced productivity, scalability, and maintainability.

« A Guide to Creating Your Very First Docker Application SQS FIFO - Why It Might Not Be What You Want It To Be »

Conan Mercer Site Reliability Engineering Team Lead

Amazon Web Services Cloud Development Kit - A Strategy for Infrastructure as Code

Problem Statement

A Case for Amazon Web Services Cloud Development Kit