EKS Cluster with On-demand Capacity Reservations (ODCR)¶
This pattern demonstrates how to consume on-demand capacity reservations (ODCRs) with Amazon EKS. The solution consists primarily of three components:
- The node group that will utilize the ODCRs should only be given subnets that reside in the availability zone where the ODCR capacity is allocated. For example, if the ODCR(s) are allocated to us-west-2b, the node group should only be provided subnet IDs that reside in us-west-2b. If subnets that reside in other AZs are provided, it's possible to encounter an error such as InvalidParameterException: The following supplied instance types do not exist .... This error is not guaranteed to appear and may seem random, since the underlying autoscaling group(s) will provision nodes into the provided AZs at random. It only occurs when the underlying autoscaling group tries to provision instances into an AZ where capacity is not allocated and there is insufficient on-demand capacity for the desired instance type.
- A custom launch template is required in order to specify the capacity_reservation_specification arguments. This is how the ODCRs are integrated into the node group (i.e., it tells the autoscaling group to utilize the provided capacity reservation(s)).
  Info: By default, the terraform-aws-eks module creates and utilizes a custom launch template with EKS managed node groups, which means users just need to supply the capacity_reservation_specification in their node group definition.
- A resource group will need to be created for the capacity reservations. The resource group acts like a container, allowing ODCRs to be added or removed as needed to adjust the available capacity. Utilizing the resource group allows this additional capacity to be adjusted without any modification or disruption to the existing node group or launch template. As soon as the ODCR has been associated with the resource group, the node group can scale up to start utilizing that capacity.
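The subnet restriction described in the first component can be sketched without hard-coding a list index. This is a minimal sketch, not part of the pattern itself: it assumes module.vpc is the terraform-aws-modules/vpc/aws module (which exposes the azs and private_subnets outputs) and that one private subnet is defined per AZ in the same order as azs. The local names odcr_availability_zone and odcr_subnet_ids are hypothetical.

```hcl
locals {
  # Assumption: the AZ where YOUR capacity reservation is allocated
  odcr_availability_zone = "us-west-2b"

  # Keep only the private subnet(s) whose AZ matches the reservation's AZ.
  # Assumes module.vpc.private_subnets[i] resides in module.vpc.azs[i].
  odcr_subnet_ids = [
    for idx, az in module.vpc.azs :
    module.vpc.private_subnets[idx] if az == local.odcr_availability_zone
  ]
}
```

Under those assumptions, the node group could then set subnet_ids = local.odcr_subnet_ids so that the autoscaling group never attempts to launch into an AZ without reserved capacity.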
Links:
- Tutorial: Launch On-Demand Instances using targeted Capacity Reservations
- Target a group of Amazon EC2 On-Demand Capacity Reservations
Code¶
################################################################################
# Required Input
################################################################################

variable "capacity_reservation_arns" {
  description = "List of on-demand capacity reservation ARNs for the node group"
  type        = list(string)
}

################################################################################
# Cluster
################################################################################

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.26"

  cluster_name    = local.name
  cluster_version = "1.31"

  # Give the Terraform identity admin access to the cluster
  # which will allow it to deploy resources into the cluster
  enable_cluster_creator_admin_permissions = true
  cluster_endpoint_public_access           = true

  cluster_addons = {
    coredns                = {}
    eks-pod-identity-agent = {}
    kube-proxy             = {}
    vpc-cni = {
      most_recent = true
    }
  }

  # Add security group rules on the node group security group to
  # allow EFA traffic
  enable_efa_support = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    odcr = {
      # The EKS AL2023 NVIDIA AMI provides all of the necessary components
      # for accelerated workloads w/ EFA
      ami_type       = "AL2023_x86_64_NVIDIA"
      instance_types = ["p5.48xlarge"]

      # Mount instance store volumes in RAID-0 for kubelet and containerd
      # https://github.com/awslabs/amazon-eks-ami/blob/master/doc/USER_GUIDE.md#raid-0-for-kubelet-and-containerd-raid0
      cloudinit_pre_nodeadm = [
        {
          content_type = "application/node.eks.aws"
          content      = <<-EOT
            ---
            apiVersion: node.eks.aws/v1alpha1
            kind: NodeConfig
            spec:
              instance:
                localStorage:
                  strategy: RAID0
          EOT
        }
      ]

      min_size     = 2
      max_size     = 2
      desired_size = 2

      # This will:
      # 1. Create a placement group to place the instances close to one another
      # 2. Ignore subnets that reside in AZs that do not support the instance type
      # 3. Expose all of the available EFA interfaces on the launch template
      enable_efa_support = true

      labels = {
        "vpc.amazonaws.com/efa.present" = "true"
        "nvidia.com/gpu.present"        = "true"
      }

      taints = {
        # Ensure only GPU workloads are scheduled on this node group
        gpu = {
          key    = "nvidia.com/gpu"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }

      # First subnet is in the "${local.region}a" availability zone
      # where the capacity reservation is created
      # TODO - Update the subnet to match the availability zone of YOUR capacity reservation
      subnet_ids = [element(module.vpc.private_subnets, 0)]

      # Targeted on-demand capacity reservation
      capacity_reservation_specification = {
        capacity_reservation_target = {
          capacity_reservation_resource_group_arn = aws_resourcegroups_group.odcr.arn
        }
      }
    }

    # This node group is for core addons such as CoreDNS
    default = {
      instance_types = ["m5.large"]

      min_size     = 1
      max_size     = 2
      desired_size = 2
    }
  }

  tags = local.tags
}

################################################################################
# Resource Group
################################################################################

resource "aws_resourcegroups_group" "odcr" {
  name        = "${local.name}-p5-odcr"
  description = "P5 instance on-demand capacity reservations"

  configuration {
    type = "AWS::EC2::CapacityReservationPool"
  }

  configuration {
    type = "AWS::ResourceGroups::Generic"

    parameters {
      name   = "allowed-resource-types"
      values = ["AWS::EC2::CapacityReservation"]
    }
  }
}

resource "aws_resourcegroups_resource" "odcr" {
  count = length(var.capacity_reservation_arns)

  group_arn    = aws_resourcegroups_group.odcr.arn
  resource_arn = element(var.capacity_reservation_arns, count.index)
}
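Because the resource group is meant to let ODCRs be added or removed over time, it may be worth keying the group membership by ARN rather than by list position. The following is a hedged alternative to the count-based resource above, not the pattern's own implementation: with count, removing an ARN from the middle of the list shifts the indexes of the entries after it, which can cause Terraform to replace memberships that did not actually change.

```hcl
# Alternative sketch: replaces the count-based "odcr" resource above
resource "aws_resourcegroups_resource" "odcr" {
  # Each membership is tracked by its ARN, so adding or removing one
  # reservation leaves the other memberships untouched
  for_each = toset(var.capacity_reservation_arns)

  group_arn    = aws_resourcegroups_group.odcr.arn
  resource_arn = each.value
}
```

Note that switching an already-deployed configuration from count to for_each changes the resource addresses in state, so existing memberships would need to be moved or recreated.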
Deploy¶
See here for the prerequisites and steps to deploy this pattern.
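Since capacity_reservation_arns is a required input, it has to be supplied at deploy time. One way to do so is a terraform.tfvars file alongside the pattern; the account ID and reservation ID below are placeholders for illustration only and must be replaced with the ARNs of your own capacity reservations.

```hcl
# terraform.tfvars
# Placeholder ARN: substitute the ARN(s) of YOUR on-demand capacity reservation(s)
capacity_reservation_arns = [
  "arn:aws:ec2:us-west-2:111122223333:capacity-reservation/cr-0123456789abcdef0",
]
```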
Validate¶
- Navigate to the EC2 console page. On the left-hand side, click on Capacity Reservations under the Instances section. You should see the capacity reservation(s) that have been created, similar to the screenshot below. For this example, you can see that the Available capacity column is empty, which means that the capacity reservations have been fully utilized by the example (as expected).
- Click on one of the capacity reservation IDs to view the details of the capacity reservation. You should see the details of the capacity reservation similar to the screenshot below. For this example, you can see that Available capacity is 0 instances, which means that the capacity reservation has been fully utilized by the example (as expected).
Destroy¶
terraform destroy -target="module.eks_blueprints_addons" -auto-approve
terraform destroy -target="module.eks" -auto-approve
terraform destroy -auto-approve
See here for more details on cleaning up the resources created.