EKS Cluster w/ AWS Neuron Devices and EFA for Machine Learning

This pattern demonstrates an Amazon EKS cluster with an EFA-enabled node group of trn1.32xlarge instances intended for distributed, multi-node machine learning workloads.

The following components are demonstrated in this pattern:

  • A "default" node group that supports addons and components that do not require AWS Neuron nor EFA devices. Any pods that do not tolerate the taints of the Neuron node group will be scheduled on instances within this node group.
  • A node group of trn1.32xlarge instances with:
    • all 8 EFA network interfaces enabled
    • provisioned within a placement group so that the instances are placed close to one another, in a single availability zone that supports the instance type
    • a common taint of "aws.amazon.com/neuron:NoSchedule" to ensure only the intended applications are permitted to run on the nodes created
    • two labels identifying that this node group supports AWS Neuron and EFA devices, allowing pods to target these nodes with node selectors (an example pod spec follows this list)
    • the NVMe instance store volumes mounted in a RAID-0 array to provide a single, large, high-performance storage volume for the Neuron workloads
    • kubelet and containerd configured to utilize the RAID-0 volume, allowing kubelet to discover the additional storage as ephemeral storage that can be utilized by pods
  • A Helm chart deployment for the Neuron device plugin to expose and mount the Neuron devices provided by the instances to the pods that request them
  • A Helm chart deployment for the EFA device plugin to expose and mount the EFA network interfaces provided by the instances to the pods that request them. Since the EFA interfaces are only present on the instances that provide AWS Neuron devices in this pattern, an additional taint for the EFA interfaces is not applied, avoiding over-constrained scheduling.
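
To illustrate how workloads target this node group, below is a minimal sketch of a pod spec that tolerates the Neuron taint, selects the two labels, and requests the resources exposed by the device plugins. The pod name, container image, and resource counts are placeholders (the counts shown assume a pod consuming a full trn1.32xlarge); only the label keys, the taint, and the resource names (aws.amazon.com/neuron, vpc.amazonaws.com/efa) come from this pattern and its device plugins.

apiVersion: v1
kind: Pod
metadata:
  name: neuron-efa-example # placeholder name
spec:
  # Schedule only onto the Neuron/EFA node group created by this pattern
  nodeSelector:
    aws.amazon.com/neuron.present: "true"
    vpc.amazonaws.com/efa.present: "true"
  # Tolerate the taint applied to the Neuron node group
  tolerations:
    - key: aws.amazon.com/neuron
      operator: Exists
      effect: NoSchedule
  containers:
    - name: training
      image: <your-training-image> # placeholder
      resources:
        limits:
          aws.amazon.com/neuron: 16 # Neuron devices advertised by the Neuron device plugin (16 per trn1.32xlarge)
          vpc.amazonaws.com/efa: 8  # EFA interfaces advertised by the EFA device plugin (8 per trn1.32xlarge)
          ephemeral-storage: 100Gi  # backed by the RAID-0 instance store volumes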

Code

Cluster

################################################################################
# Cluster
################################################################################

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.26"

  cluster_name    = local.name
  cluster_version = "1.31"

  # Give the Terraform identity admin access to the cluster
  # which will allow it to deploy resources into the cluster
  enable_cluster_creator_admin_permissions = true
  cluster_endpoint_public_access           = true

  cluster_addons = {
    coredns                = {}
    eks-pod-identity-agent = {}
    kube-proxy             = {}
    vpc-cni = {
      most_recent = true
    }
  }

  # Add security group rules on the node group security group to
  # allow EFA traffic
  enable_efa_support = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    neuron-efa = {
      # The EKS AL2023 Neuron AMI provides all of the necessary components
      # for accelerated workloads w/ EFA
      ami_type       = "AL2023_x86_64_NEURON"
      instance_types = ["trn1.32xlarge"]

      # Mount instance store volumes in RAID-0 for kubelet and containerd
      # https://github.com/awslabs/amazon-eks-ami/blob/master/doc/USER_GUIDE.md#raid-0-for-kubelet-and-containerd-raid0
      cloudinit_pre_nodeadm = [
        {
          content_type = "application/node.eks.aws"
          content      = <<-EOT
            ---
            apiVersion: node.eks.aws/v1alpha1
            kind: NodeConfig
            spec:
              instance:
                localStorage:
                  strategy: RAID0
          EOT
        }
      ]

      min_size     = 2
      max_size     = 2
      desired_size = 2

      # This will:
      # 1. Create a placement group to place the instances close to one another
      # 2. Ignore subnets that reside in AZs that do not support the instance type
      # 3. Expose all of the available EFA interfaces on the launch template
      enable_efa_support = true

      labels = {
        "vpc.amazonaws.com/efa.present" = "true"
        "aws.amazon.com/neuron.present" = "true"
      }

      taints = {
        # Ensure only Neuron workloads are scheduled on this node group
        gpu = {
          key    = "aws.amazon.com/neuron"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }

    # This node group is for core addons such as CoreDNS
    default = {
      instance_types = ["m5.large"]

      min_size     = 1
      max_size     = 2
      desired_size = 2
    }
  }

  tags = local.tags
}

Device Plugins

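# Note: ECR Public authorization tokens can only be issued in us-east-1,
# so the aws.ecr provider alias (defined elsewhere in this pattern) is
# expected to target that region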
data "aws_ecrpublic_authorization_token" "token" {
  provider = aws.ecr
}

################################################################################
# Helm charts
################################################################################

resource "helm_release" "neuron" {
  name             = "neuron"
  repository       = "oci://public.ecr.aws/neuron"
  chart            = "neuron-helm-chart"
  version          = "1.0.0"
  namespace        = "neuron"
  create_namespace = true
  wait             = false

  # Public ECR
  repository_username = data.aws_ecrpublic_authorization_token.token.user_name
  repository_password = data.aws_ecrpublic_authorization_token.token.password

  values = [
    <<-EOT
      nodeSelector:
        aws.amazon.com/neuron.present: 'true'
      npd:
        enabled: false
    EOT
  ]
}

resource "helm_release" "aws_efa_device_plugin" {
  name       = "aws-efa-k8s-device-plugin"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-efa-k8s-device-plugin"
  version    = "v0.5.5"
  namespace  = "kube-system"
  wait       = false

  values = [
    <<-EOT
      nodeSelector:
        vpc.amazonaws.com/efa.present: 'true'
      tolerations:
        - key: aws.amazon.com/neuron
          operator: Exists
          effect: NoSchedule
    EOT
  ]
}

Deploy

See here for the prerequisites and steps to deploy this pattern.

Validate

  1. List the nodes and their instance type:

    kubectl get nodes -L node.kubernetes.io/instance-type
    
    NAME                                        STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
    ip-10-0-12-200.us-east-2.compute.internal   Ready    <none>   82m   v1.31.0-eks-a737599   m5.large
    ip-10-0-24-248.us-east-2.compute.internal   Ready    <none>   82m   v1.31.0-eks-a737599   m5.large
    ip-10-0-39-213.us-east-2.compute.internal   Ready    <none>   75m   v1.31.0-eks-a737599   trn1.32xlarge
    ip-10-0-43-172.us-east-2.compute.internal   Ready    <none>   75m   v1.31.0-eks-a737599   trn1.32xlarge
    

    You should see two EFA-enabled nodes (trn1.32xlarge in this example) in the list.
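
  2. Optionally, confirm that the device plugins are advertising the accelerator resources on the trn1.32xlarge nodes. The resource names below (aws.amazon.com/neuron and vpc.amazonaws.com/efa) are the ones exposed by the Neuron and EFA device plugins, respectively; each node should report a non-zero allocatable count for both:

    kubectl get nodes -l aws.amazon.com/neuron.present=true \
      -o custom-columns="NAME:.metadata.name,NEURON:.status.allocatable.aws\.amazon\.com/neuron,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"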

Destroy

terraform destroy -target="module.eks_blueprints_addons" -auto-approve
terraform destroy -target="module.eks" -auto-approve
terraform destroy -auto-approve

See here for more details on cleaning up the resources created.