
Multi-Node Inference w/ vLLM

This pattern demonstrates an Amazon EKS Cluster with an EFA-enabled node group that supports multi-node inference using vLLM and LWS (LeaderWorkerSet).

This example is based on the LWS example found here

The following components are demonstrated in this pattern:

  • A "default" node group that supports addons and components that do not require GPUs nor EFA devices. Any pods that do not tolerate the taints of the GPU node group will be scheduled on instances within this node group.
  • A node group of g6e.8xlarge instances with:
    • all EFA network interfaces enabled
    • provisioned within a placement group so that the instances are located close to one another, in a single availability zone that supports the instance type
    • a common NVIDIA taint of "nvidia.com/gpu:NoSchedule" to ensure only the intended applications are scheduled onto the nodes created
    • two labels identifying that this node group supports NVIDIA GPUs and EFA devices, allowing pods to target the node group via node selectors on these labels
    • the NVMe instance store volumes are mounted in a RAID-0 array to provide a single, large, high-performance storage volume for the GPU workloads
      • kubelet and containerd are configured to utilize the RAID-0 volume, allowing kubelet to discover the additional storage as ephemeral storage that can be utilized by pods
  • A Helm chart deployment for the NVIDIA device plugin to expose and mount the GPUs provided by the instances to the pods that request them
  • A Helm chart deployment for the EFA device plugin to expose and mount the EFA network interfaces provided by the instances to the pods that request them (an example pod spec is shown after this list). Since the EFA network interfaces are only found on the instances that provide NVIDIA GPUs in this pattern, we do not apply an additional taint for the EFA network interfaces to avoid over-constraining scheduling.
  • A Dockerfile that demonstrates how to build a container image with the necessary collective communication libraries for multi-node inference with EFA. An ECR repository is created as part of the deployment to store the container image.
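
For example, a workload intended for this GPU/EFA node group would tolerate the GPU taint, select on the labels above, and request the extended resources exposed by the device plugins. A minimal sketch (the pod name, image, and resource counts are illustrative and not taken from this pattern):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-efa-example # hypothetical name, for illustration only
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
    vpc.amazonaws.com/efa.present: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: app
      image: <your-image> # e.g. the image built from the Dockerfile below
      resources:
        limits:
          nvidia.com/gpu: 1        # extended resource exposed by the NVIDIA device plugin
          vpc.amazonaws.com/efa: 1 # extended resource exposed by the EFA device plugin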

Code

Cluster

################################################################################
# EKS Cluster
################################################################################

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.34"

  cluster_name    = local.name
  cluster_version = "1.32"

  # Gives Terraform identity admin access to cluster which will
  # allow deploying resources into the cluster
  enable_cluster_creator_admin_permissions = true
  cluster_endpoint_public_access           = true

  # These will become the default in the next major version of the module
  bootstrap_self_managed_addons   = false
  enable_irsa                     = false
  enable_security_groups_for_pods = false

  cluster_addons = {
    coredns                   = {}
    eks-node-monitoring-agent = {}
    eks-pod-identity-agent = {
      before_compute = true
    }
    kube-proxy = {}
    vpc-cni = {
      most_recent    = true
      before_compute = true
    }
  }

  # Add security group rules on the node group security group to
  # allow EFA traffic
  enable_efa_support = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_group_defaults = {
    node_repair_config = {
      enabled = true
    }
  }

  eks_managed_node_groups = {
    g6e = {
      # The EKS AL2023 NVIDIA AMI provides all of the necessary components
      # for accelerated workloads w/ EFA
      ami_type       = "AL2023_x86_64_NVIDIA"
      instance_types = ["g6e.8xlarge"]

      min_size     = 2
      max_size     = 5
      desired_size = 2

      # Mount instance store volumes in RAID-0 for kubelet and containerd
      # https://github.com/awslabs/amazon-eks-ami/blob/master/doc/USER_GUIDE.md#raid-0-for-kubelet-and-containerd-raid0
      cloudinit_pre_nodeadm = [
        {
          content_type = "application/node.eks.aws"
          content      = <<-EOT
            ---
            apiVersion: node.eks.aws/v1alpha1
            kind: NodeConfig
            spec:
              instance:
                localStorage:
                  strategy: RAID0
          EOT
        }
      ]

      # This will:
      # 1. Create a placement group to place the instances close to one another
      # 2. Ignore subnets that reside in AZs that do not support the instance type
      # 3. Expose all of the available EFA interfaces on the launch template
      enable_efa_support = true
      subnet_ids         = [element(module.vpc.private_subnets, 2)]

      labels = {
        "vpc.amazonaws.com/efa.present" = "true"
        "nvidia.com/gpu.present"        = "true"
      }

      taints = {
        # Ensure only GPU workloads are scheduled on this node group
        gpu = {
          key    = "nvidia.com/gpu"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }

    # This node group is for core addons such as CoreDNS
    default = {
      instance_types = ["m5.large"]

      min_size     = 2
      max_size     = 2
      desired_size = 2
    }
  }

  tags = local.tags
}

Helm Charts

################################################################################
# Device Plugin(s)
################################################################################

resource "helm_release" "nvidia_device_plugin" {
  name             = "nvidia-device-plugin"
  repository       = "https://nvidia.github.io/k8s-device-plugin"
  chart            = "nvidia-device-plugin"
  version          = "0.17.1"
  namespace        = "nvidia-device-plugin"
  create_namespace = true
  wait             = false
}

resource "helm_release" "aws_efa_device_plugin" {
  name       = "aws-efa-k8s-device-plugin"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-efa-k8s-device-plugin"
  version    = "v0.5.7"
  namespace  = "kube-system"
  wait       = false

  values = [
    <<-EOT
      nodeSelector:
        vpc.amazonaws.com/efa.present: 'true'
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
    EOT
  ]
}
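
Once both device plugins are running, the GPUs and EFA interfaces show up as allocatable extended resources on the GPU nodes. A quick way to confirm this (a sketch; the reported counts depend on the instance type):

kubectl describe nodes -l nvidia.com/gpu.present=true | grep -E 'nvidia.com/gpu|vpc.amazonaws.com/efa'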

################################################################################
# LWS (LeaderWorkerSet)
################################################################################

locals {
  lws_version = "v0.5.1"
}

data "http" "lws" {
  url = "https://github.com/kubernetes-sigs/lws/releases/download/${local.lws_version}/manifests.yaml"
}

data "kubectl_file_documents" "lws" {
  content = data.http.lws.response_body
}

resource "kubectl_manifest" "lws" {
  for_each = data.kubectl_file_documents.lws.manifests

  yaml_body         = each.value
  server_side_apply = true
}
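
The LWS controller installed above reconciles LeaderWorkerSet resources, which group a leader pod with a fixed number of worker pods and roll them out as a single unit. A minimal sketch of the resource shape (field values and images are illustrative; the lws.yaml referenced in the Validate steps is this pattern's actual definition):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1              # number of leader/worker groups
  leaderWorkerTemplate:
    size: 2                # pods per group: 1 leader + (size - 1) workers
    leaderTemplate:
      spec:
        containers:
          - name: vllm-leader
            image: <your-ecr-image> # e.g. the image built from the Dockerfile below
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: <your-ecr-image>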

Dockerfile

# syntax=docker/dockerfile:1

FROM ubuntu:22.04

RUN <<EOT
  rm -f /etc/apt/apt.conf.d/docker-clean
  echo 'Binary::apt::APT::Keep-Downloaded-Packages "false";' > /etc/apt/apt.conf.d/keep-cache
  echo 'APT::Keep-Downloaded-Packages "false";' > /etc/apt/apt.conf.d/99custom-conf
  echo 'APT::Install-Suggests "0";' >> /etc/apt/apt.conf.d/00-docker
  echo 'APT::Install-Recommends "0";' >> /etc/apt/apt.conf.d/00-docker

  # Install CUDA keyring
  apt update
  apt upgrade -y
  apt install -y \
    ca-certificates \
    curl \
    gnupg2

  curl -fsSLO https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
  dpkg -i cuda-keyring_1.1-1_all.deb
  rm cuda-keyring_1.1-1_all.deb

  apt clean
  rm -rf /var/{log,dpkg.log}
EOT

# vLLM requires python3-dev
RUN <<EOT
  apt update
  apt install -y \
    python3.11 \
    python3-dev \
    python3-pip \
    python-is-python3
  apt clean
  rm -rf /var/{log,dpkg.log}
EOT

RUN <<EOT
  pip install --no-cache-dir vllm
  pip uninstall opencv-python-headless pytest torchaudio torchvision -y

  # NCCL is installed below since it's compiled/linked with aws-ofi-nccl
  rm -rf /usr/local/lib/python3.10/dist-packages/nvidia/nccl
EOT

ARG AWS_OFI_NCCL_VERSION=1.13.2-aws
ARG EFA_INSTALLER_VERSION=1.37.0
ARG NCCL_VERSION=2.25.1

# CUDA version needs to match vLLM https://github.com/vllm-project/vllm/blob/66e16a038e9fe8bf04e133858621cd9803e7145b/Dockerfile#L8
ARG CUDA_MAJOR_VERSION=12
ARG CUDA_MINOR_VERSION=4

RUN <<EOT
  apt update
  apt install -y \
    gcc-10 \
    g++-10 \
    cuda-minimal-build-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
    git \
    libhwloc15 \
    libhwloc-dev \
    make

  # TODO - https://github.com/vllm-project/vllm/blob/eb8b5eb183b8428f7e58adf2559e7a8d9400fc30/Dockerfile#L33-L34
  update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 110 --slave /usr/bin/g++ g++ /usr/bin/g++-10

  # EFA installer
  cd /tmp
  curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz | tar xvz
  cd aws-efa-installer

  # OpenMPI v4 vs v5 doesn't matter since it's not used here, but v4 is slightly smaller
  ./efa_installer.sh --yes --skip-kmod --skip-limit-conf --no-verify --mpi openmpi4

  echo "/opt/amazon/openmpi/lib" >> /etc/ld.so.conf.d/000_efa.conf
  ldconfig

  # NCCL
  cd /tmp
  git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1
  cd nccl

  make -j $(nproc) src.build \
    BUILDDIR=/opt/nccl \
    CUDA_HOME=/usr/local/cuda \
    NVCC_GENCODE="-gencode=arch=compute_89,code=sm_89"

  echo "/opt/nccl/lib" >> /etc/ld.so.conf.d/000_efa.conf
  ldconfig

  # AWS-OFI-NCCL plugin
  cd /tmp
  curl -sL https://github.com/aws/aws-ofi-nccl/releases/download/v${AWS_OFI_NCCL_VERSION}/aws-ofi-nccl-${AWS_OFI_NCCL_VERSION}.tar.gz | tar xvz
  cd aws-ofi-nccl-${AWS_OFI_NCCL_VERSION}

  ./configure --prefix=/opt/aws-ofi-nccl/install \
    --with-mpi=/opt/amazon/openmpi \
    --with-libfabric=/opt/amazon/efa \
    --with-cuda=/usr/local/cuda \
    --enable-tests=no \
    --enable-platform-aws
  make -j $(nproc)
  make install

  echo "/opt/aws-ofi-nccl/install/lib" >> /etc/ld.so.conf.d/000_efa.conf
  ldconfig

  # Remove static libs to avoid copying them to the final image
  find / -name '*.a' | xargs rm
  rm -rf /tmp/*

  apt-get purge --autoremove -y \
    cuda-minimal-build-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
    git
  apt clean
  rm -rf /var/{log,dpkg.log}
EOT

WORKDIR /vllm-workspace
RUN <<EOT
  curl -O https://raw.githubusercontent.com/kubernetes-sigs/lws/main/docs/examples/vllm/build/ray_init.sh
  chmod +x ray_init.sh

  # For vLLM debug/issue reporting
  curl -O https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py

  pip install --no-cache-dir --upgrade hf_transfer
EOT

Deploy

See here for the prerequisites and steps to deploy this pattern.

Warning

This example provisions two g6e.8xlarge instances, which require at least 64 vCPUs of the Running On-Demand G and VT instances EC2 service quota (the maximum number of vCPUs assigned to Running On-Demand G and VT instances). If the g6e.8xlarge instances fail to provision and you see the following error in the Auto Scaling events log, navigate to the Service Quotas section of the AWS console and request a quota increase for Running On-Demand G and VT instances to at least 64.

Could not launch On-Demand Instances. VcpuLimitExceeded - You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.
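
You can also check the current quota value and request an increase from the AWS CLI. The quota code below is believed to correspond to Running On-Demand G and VT instances; confirm it in the Service Quotas console before requesting:

aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-DB2E81BA

aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --desired-value 64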

Validate

  1. List the nodes and their instance type:

    kubectl get nodes -L node.kubernetes.io/instance-type
    
    NAME                                        STATUS   ROLES    AGE    VERSION               INSTANCE-TYPE
    ip-10-0-20-54.us-east-2.compute.internal    Ready    <none>   12m    v1.32.0-eks-aeac579   g6e.8xlarge
    ip-10-0-23-209.us-east-2.compute.internal   Ready    <none>   12m    v1.32.0-eks-aeac579   g6e.8xlarge
    ip-10-0-26-209.us-east-2.compute.internal   Ready    <none>   12m    v1.32.0-eks-aeac579   m7a.xlarge
    ip-10-0-40-21.us-east-2.compute.internal    Ready    <none>   12m    v1.32.0-eks-aeac579   m7a.xlarge
    
  2. Verify that the LWS controller, EFA device plugin, and NVIDIA device plugin pods are running:

    kubectl get pods -A
    
    NAMESPACE              NAME                                     READY   STATUS    RESTARTS   AGE
    kube-system            aws-efa-k8s-device-plugin-9jxp9          1/1     Running   0          56m
    kube-system            aws-efa-k8s-device-plugin-hrwfm          1/1     Running   0          6h34m
    kube-system            aws-efa-k8s-device-plugin-lzpfs          1/1     Running   0          6h34m
    kube-system            aws-efa-k8s-device-plugin-z5j46          1/1     Running   0          56m
    kube-system            aws-node-9hph2                           2/2     Running   0          7h6m
    kube-system            aws-node-ddwr5                           2/2     Running   0          7h5m
    kube-system            aws-node-g9zgq                           2/2     Running   0          7h6m
    kube-system            aws-node-ldtsd                           2/2     Running   0          56m
    kube-system            aws-node-w7mwb                           2/2     Running   0          56m
    kube-system            aws-node-xlxnw                           2/2     Running   0          7h5m
    kube-system            coredns-6b94694fcb-88wzs                 1/1     Running   0          7h5m
    kube-system            coredns-6b94694fcb-zw4wt                 1/1     Running   0          7h5m
    kube-system            kube-proxy-h6p9k                         1/1     Running   0          7h5m
    kube-system            kube-proxy-j4q5f                         1/1     Running   0          7h5m
    kube-system            kube-proxy-r7rq4                         1/1     Running   0          7h5m
    kube-system            kube-proxy-t8rp4                         1/1     Running   0          56m
    kube-system            kube-proxy-vtd9k                         1/1     Running   0          56m
    kube-system            kube-proxy-whdws                         1/1     Running   0          7h5m
    lws-system             lws-controller-manager-fbb6489f9-9n98v   1/1     Running   0          7h7m
    lws-system             lws-controller-manager-fbb6489f9-z8x4k   1/1     Running   0          7h7m
    nvidia-device-plugin   nvidia-device-plugin-pjt52               1/1     Running   0          56m
    nvidia-device-plugin   nvidia-device-plugin-s9pt5               1/1     Running   0          6h34m
    nvidia-device-plugin   nvidia-device-plugin-sv2qg               1/1     Running   0          56m
    nvidia-device-plugin   nvidia-device-plugin-xqbv8               1/1     Running   0          6h34m
    
  3. Build the provided Dockerfile and push the resulting container image to ECR (the build.sh file is created as part of terraform apply):

    ./build.sh
    

    Warning

    Building and pushing the Docker image will take a considerable amount of resources and time. Building and pushing this image took 26 minutes on a system without any prior images/layers cached; this was on an AMD Ryzen Threadripper 1900X 8-core 4.2 GHz CPU with 128GB of RAM and a 500GB NVMe SSD. The resultant image is roughly 7.6GB in size (unpacked).
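
    The generated build.sh performs a standard ECR build-and-push. Conceptually it is roughly equivalent to the following (the account ID, region, and repository name are placeholders, not values taken from this pattern):

    aws ecr get-login-password --region <region> | \
      docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
    docker build -t <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:latest .
    docker push <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:latest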

  4. Update the provided lws.yaml file with your Hugging Face token, which is used to pull the meta-llama/Llama-3.3-70B-Instruct model used in this example (see the snippet below).
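
    In the upstream LWS vLLM example, the token is passed to the containers as an environment variable; the relevant portion of lws.yaml looks roughly like the following (the exact structure in this pattern's file may differ):

    env:
      - name: HUGGING_FACE_HUB_TOKEN
        value: <your-hf-token>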

  5. Deploy the LeaderWorkerSet and its associated K8s service:

    kubectl apply -f lws.yaml
    kubectl get pods
    
    NAME       READY   STATUS    RESTARTS   AGE
    vllm-0     1/1     Running   0          24m
    vllm-0-1   1/1     Running   0          24m
    vllm-0-2   1/1     Running   0          24m
    vllm-0-3   1/1     Running   0          24m
    
  6. Verify that the distributed tensor-parallel inference works:

    kubectl logs vllm-0 | grep -i "Loading model weights took"
    

    You should see output similar to this:

    INFO 03-10 22:53:24 model_runner.py:1115] Loading model weights took 33.8639 GB
    (RayWorkerWrapper pid=208, ip=10.0.45.39) INFO 03-10 22:53:10 model_runner.py:1115] Loading model weights took 33.8639 GB
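
    If the deployment sets NCCL_DEBUG=INFO (a common setting when validating EFA), you can also check that NCCL selected the Libfabric/EFA transport provided by aws-ofi-nccl rather than plain TCP sockets (a sketch; the exact log lines vary by NCCL and plugin version):

    kubectl logs vllm-0 | grep -iE 'NET/(OFI|Libfabric)|Selected Provider'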
    
  7. Use kubectl port-forward to forward local port 8080:

    kubectl port-forward svc/vllm-leader 8080:8080
    
  8. Open another terminal and send a request to the model:

    curl -s http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }' | jq
    

    The output should be similar to the following:

    {
        "id": "cmpl-48e678b2dacf4db9ac7f32dffa32c913",
        "object": "text_completion",
        "created": 1741483804,
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "choices": [
            {
                "index": 0,
                "text": " top destination for travelers, with its",
                "logprobs": null,
                "finish_reason": "length",
                "stop_reason": null,
                "prompt_logprobs": null
            }
        ],
        "usage": {
            "prompt_tokens": 5,
            "total_tokens": 12,
            "completion_tokens": 7,
            "prompt_tokens_details": null
        }
    }
    

Destroy

terraform destroy -target="module.eks_blueprints_addons" -auto-approve
terraform destroy -target="module.eks" -auto-approve
terraform destroy -auto-approve

See here for more details on cleaning up the resources created.