EKS w/ ML Capacity Block Reservation (CBR)¶
This pattern demonstrates how to use ML capacity block reservations (CBR) with Amazon EKS. The solution consists primarily of two components:
- The self-managed node group that uses the CBR should only be given subnets in the availability zone where the CBR has been allocated. For example, if the CBR is allocated to us-west-2b, the node group should only be given subnet IDs that reside in us-west-2b. If subnets in other AZs are provided, it's possible to encounter an error such as InvalidParameterException: The following supplied instance types do not exist ... This error is not guaranteed to appear and may seem random, since the underlying autoscaling group(s) provision nodes into different AZs at random. It only occurs when the autoscaling group tries to provision instances into an AZ where capacity is not allocated and there is insufficient on-demand capacity for the desired instance type.
  Warning: Self-managed node groups are required at this time to support capacity block reservations within EKS. This pattern will be updated to demonstrate EKS managed node groups once support has been implemented by the EKS service.
- The launch template should specify the instance_market_options and capacity_reservation_specification arguments. This is how the node group uses the CBR: these arguments tell the autoscaling group to launch instances using the provided capacity reservation.
Code¶
################################################################################
# Required Input
################################################################################

# See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-blocks-using.html
# on how to obtain a ML capacity block reservation. Once acquired, you can provide
# the reservation ID through this input to deploy the pattern
variable "capacity_reservation_id" {
  description = "The ID of the ML capacity block reservation for the node group"
  type        = string
}

################################################################################
# Cluster
################################################################################

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.9"

  cluster_name    = local.name
  cluster_version = "1.29"

  # Give the Terraform identity admin access to the cluster
  # which will allow it to deploy resources into the cluster
  enable_cluster_creator_admin_permissions = true
  cluster_endpoint_public_access           = true

  cluster_addons = {
    coredns                = {}
    eks-pod-identity-agent = {}
    kube-proxy             = {}
    vpc-cni                = {}
  }

  # Add security group rules on the node group security group to
  # allow EFA traffic
  enable_efa_support = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    # This node group is for core addons such as CoreDNS
    default = {
      instance_types = ["m5.large"]

      min_size     = 1
      max_size     = 2
      desired_size = 2
    }
  }

  # Note: ML capacity block reservations are only supported
  # on self-managed node groups at this time
  self_managed_node_groups = {
    cbr = {
      # The EKS AL2 GPU AMI provides all of the necessary components
      # for accelerated workloads w/ EFA
      ami_type      = "AL2_x86_64_GPU"
      instance_type = "p5.48xlarge"

      pre_bootstrap_user_data = <<-EOT
        # Mount instance store volumes in RAID-0 for kubelet and containerd
        # https://github.com/awslabs/amazon-eks-ami/blob/master/doc/USER_GUIDE.md#raid-0-for-kubelet-and-containerd-raid0
        /bin/setup-local-disks raid0

        # Ensure only GPU workloads are scheduled on this node group
        export KUBELET_EXTRA_ARGS='--node-labels=vpc.amazonaws.com/efa.present=true,nvidia.com/gpu.present=true \
          --register-with-taints=nvidia.com/gpu=true:NoSchedule'
      EOT

      min_size     = 2
      max_size     = 2
      desired_size = 2

      # This will:
      # 1. Create a placement group to place the instances close to one another
      # 2. Ignore subnets that reside in AZs that do not support the instance type
      # 3. Expose all of the available EFA interfaces on the launch template
      enable_efa_support = true

      # ML capacity block reservation
      instance_market_options = {
        market_type = "capacity-block"
      }
      capacity_reservation_specification = {
        capacity_reservation_target = {
          capacity_reservation_id = var.capacity_reservation_id
        }
      }
    }
  }

  tags = local.tags
}
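To make the role of the two CBR arguments more concrete, the sketch below shows how they might look on a standalone aws_launch_template resource. It is illustrative only: the pattern relies on the EKS module to create the launch template, the resource name cbr_example is hypothetical, and it assumes an AWS provider version recent enough to accept the capacity-block market type.

```hcl
# Illustrative only: the pattern itself lets the EKS module create the
# launch template. The resource name "cbr_example" is hypothetical.
resource "aws_launch_template" "cbr_example" {
  name_prefix   = "cbr-example-"
  instance_type = "p5.48xlarge"

  # Launch instances against a capacity block rather than on-demand or spot
  instance_market_options {
    market_type = "capacity-block"
  }

  # Target the specific reservation supplied through the pattern's input
  capacity_reservation_specification {
    capacity_reservation_target {
      capacity_reservation_id = var.capacity_reservation_id
    }
  }
}
```

The map-style arguments on the cbr node group above carry the same two settings; in the resource they are written as nested blocks rather than maps.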
Deploy¶
See here for the prerequisites and steps to deploy this pattern.
Destroy¶
terraform destroy -target="module.eks_blueprints_addons" -auto-approve
terraform destroy -target="module.eks" -auto-approve
terraform destroy -auto-approve
See here for more details on cleaning up the resources created.