Multi-Node Inference w/ vLLM¶
This pattern demonstrates an Amazon EKS cluster with an EFA-enabled node group that supports multi-node inference using vLLM, KubeRay, and LWS (LeaderWorkerSet).
This example is based on the LWS example found here.
The following components are demonstrated in this pattern:
- A "default" node group that supports addons and components which do not require GPUs or EFA devices. Any pods that do not tolerate the taints of the GPU node group will be scheduled on instances within this node group.
- A node group of `g6e.8xlarge` instances with:
    - all EFA network interfaces enabled
    - instances provisioned within a placement group so that they are placed close to one another in a single availability zone that supports the instance type
    - a common NVIDIA taint of `"nvidia.com/gpu:NoSchedule"` to ensure only the intended applications are allowed to run on the nodes created
    - two labels identifying that this node group supports NVIDIA GPUs and EFA devices, allowing pods to target these nodes via node selectors
    - the NVMe instance store volumes mounted in a RAID-0 array to provide a single, large, high-performance storage volume for the GPU workloads
    - kubelet and containerd configured to utilize the RAID-0 volume, allowing kubelet to discover the additional storage as ephemeral storage that can be utilized by pods
- A Helm chart deployment for the NVIDIA device plugin to expose and mount the GPUs provided by the instances to the pods that request them
- A Helm chart deployment for the EFA device plugin to expose and mount the EFA network interfaces provided by the instances to the pods that request them. Since the EFA network interfaces are only found on the instances that provide NVIDIA GPUs in this pattern, an additional taint for the EFA network interfaces is not applied to avoid over-constraining scheduling; a sketch of a pod that consumes both resource types is shown after this list.
- A Dockerfile that demonstrates how to build a container image with the necessary collective communication libraries for multi-node inference with EFA. An ECR repository is created as part of the deployment to store the container image.
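For reference, this is a minimal sketch of how a workload would target the GPU node group and request both resource types exposed by the device plugins above. The labels, taint, and resource names come from this pattern; the pod name, image, and resource counts are placeholders, not part of this pattern:

```sh
# Sketch only: labels, taint, and resource names come from this pattern;
# the pod name, image, and counts are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: efa-gpu-example
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
    vpc.amazonaws.com/efa.present: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:1.36  # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
          vpc.amazonaws.com/efa: 1
EOF
```

Without the toleration and node selector, such a pod would be scheduled onto the "default" node group; with them, it lands on the GPU nodes, where kubelet also exposes the RAID-0 instance store as ephemeral storage that the pod can request.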
Code¶
Cluster¶
################################################################################
# EKS Cluster
################################################################################
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.33"
cluster_name = local.name
cluster_version = "1.31"
# Gives Terraform identity admin access to cluster which will
# allow deploying resources into the cluster
enable_cluster_creator_admin_permissions = true
cluster_endpoint_public_access = true
# These will become the default in the next major version of the module
bootstrap_self_managed_addons = false
enable_irsa = false
enable_security_groups_for_pods = false
cluster_addons = {
coredns = {}
kube-proxy = {}
vpc-cni = {
before_compute = true
}
}
# Add security group rules on the node group security group to
# allow EFA traffic
enable_efa_support = true
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
eks_managed_node_groups = {
# This node group is for core addons such as CoreDNS
default = {
ami_type = "AL2023_x86_64_STANDARD"
instance_types = [
"m7a.xlarge",
"m7i.xlarge",
]
min_size = 2
max_size = 3
desired_size = 2
}
g6 = {
# The EKS AL2023 NVIDIA AMI provides all of the necessary components
# for accelerated workloads w/ EFA
ami_type = "AL2023_x86_64_NVIDIA"
instance_types = ["g6e.8xlarge"]
min_size = 2
max_size = 5
desired_size = 2
# Mount instance store volumes in RAID-0 for kubelet and containerd
# https://github.com/awslabs/amazon-eks-ami/blob/master/doc/USER_GUIDE.md#raid-0-for-kubelet-and-containerd-raid0
cloudinit_pre_nodeadm = [
{
content_type = "application/node.eks.aws"
content = <<-EOT
---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
instance:
localStorage:
strategy: RAID0
EOT
}
]
# This will:
# 1. Create a placement group to place the instances close to one another
# 2. Ignore subnets that reside in AZs that do not support the instance type
# 3. Expose all of the available EFA interfaces on the launch template
enable_efa_support = true
subnet_ids = [element(module.vpc.private_subnets, 2)]
labels = {
"vpc.amazonaws.com/efa.present" = "true"
"nvidia.com/gpu.present" = "true"
}
taints = {
# Ensure only GPU workloads are scheduled on this node group
gpu = {
key = "nvidia.com/gpu"
value = "true"
effect = "NO_SCHEDULE"
}
}
}
}
tags = local.tags
}
Helm Charts¶
################################################################################
# Device Plugin(s)
################################################################################
resource "helm_release" "nvidia_device_plugin" {
name = "nvidia-device-plugin"
repository = "https://nvidia.github.io/k8s-device-plugin"
chart = "nvidia-device-plugin"
version = "0.17.0"
namespace = "nvidia-device-plugin"
create_namespace = true
wait = false
}
resource "helm_release" "aws_efa_device_plugin" {
name = "aws-efa-k8s-device-plugin"
repository = "https://aws.github.io/eks-charts"
chart = "aws-efa-k8s-device-plugin"
version = "v0.5.7"
namespace = "kube-system"
wait = false
values = [
<<-EOT
nodeSelector:
vpc.amazonaws.com/efa.present: 'true'
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
EOT
]
}
################################################################################
# LWS (LeaderWorkerSet)
################################################################################
locals {
lws_version = "v0.5.0"
}
data "http" "lws" {
url = "https://github.com/kubernetes-sigs/lws/releases/download/${local.lws_version}/manifests.yaml"
}
data "kubectl_file_documents" "lws" {
content = data.http.lws.response_body
}
resource "kubectl_manifest" "lws" {
for_each = data.kubectl_file_documents.lws.manifests
yaml_body = each.value
server_side_apply = true
}
Dockerfile¶
# syntax=docker/dockerfile:1
ARG CUDA_VERSION=12.4.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS efa-build
ARG AWS_OFI_NCCL_VERSION=1.13.2-aws
ARG EFA_INSTALLER_VERSION=1.37.0
ARG NCCL_VERSION=2.23.4
RUN <<EOT
rm -f /etc/apt/apt.conf.d/docker-clean
echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache
echo 'APT::Install-Suggests "0";' >> /etc/apt/apt.conf.d/00-docker
echo 'APT::Install-Recommends "0";' >> /etc/apt/apt.conf.d/00-docker
echo 'tzdata tzdata/Areas select America' | debconf-set-selections
echo 'tzdata tzdata/Zones/America select Chicago' | debconf-set-selections
EOT
RUN <<EOT
apt update
apt install -y \
curl \
git \
libhwloc-dev \
pciutils \
python3
# EFA installer
cd /tmp
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz | tar xvz
cd aws-efa-installer
./efa_installer.sh --yes --skip-kmod --skip-limit-conf --no-verify --mpi openmpi5
echo "/opt/amazon/openmpi5/lib" > /etc/ld.so.conf.d/openmpi.conf
ldconfig
# NCCL
cd /tmp
git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1
cd nccl
make -j $(nproc) src.build \
BUILDDIR=/opt/nccl \
CUDA_HOME=/usr/local/cuda \
NVCC_GENCODE="-gencode=arch=compute_89,code=sm_89"
# remove the static libraries after the build to keep the image smaller
rm /opt/nccl/lib/*.a
echo "/opt/nccl/lib" > /etc/ld.so.conf.d/000_nccl.conf
ldconfig
# AWS-OFI-NCCL plugin
cd /tmp
curl -sL https://github.com/aws/aws-ofi-nccl/releases/download/v${AWS_OFI_NCCL_VERSION}/aws-ofi-nccl-${AWS_OFI_NCCL_VERSION}.tar.gz | tar xvz
cd aws-ofi-nccl-${AWS_OFI_NCCL_VERSION}
./configure --prefix=/opt/aws-ofi-nccl/install \
--with-mpi=/opt/amazon/openmpi5 \
--with-libfabric=/opt/amazon/efa \
--with-cuda=/usr/local/cuda \
--enable-tests=no \
--enable-platform-aws
make -j $(nproc)
make install
echo "/opt/aws-ofi-nccl/install/lib" > /etc/ld.so.conf.d/000-aws-ofi-nccl.conf
ldconfig
EOT
################################################################
FROM docker.io/vllm/vllm-openai:latest
COPY --from=efa-build /opt/amazon /opt/amazon
COPY --from=efa-build /opt/aws-ofi-nccl /opt/aws-ofi-nccl
COPY --from=efa-build /opt/nccl/lib /opt/nccl/lib
COPY --from=efa-build /etc/ld.so.conf.d /etc/ld.so.conf.d
ENV LD_PRELOAD=/opt/nccl/lib/libnccl.so
COPY ray_init.sh /vllm-workspace/ray_init.sh
RUN chmod +x /vllm-workspace/ray_init.sh
Deploy¶
See here for the prerequisites and steps to deploy this pattern.
Warning
This example provisions two `g6e.8xlarge` instances, which require at least 64 vCPUs of the `Running On-Demand G and VT instances` EC2 service quota (the maximum number of vCPUs assigned to Running On-Demand G and VT instances). If the `g6e.8xlarge` instances fail to provision and you see the following error in the Auto Scaling group activity log, navigate to the Service Quotas section of the AWS console and request a quota increase for `Running On-Demand G and VT instances` to at least 64.
Could not launch On-Demand Instances. VcpuLimitExceeded - You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.
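Once the quota is in place, deployment follows the standard pattern workflow; a minimal sketch is shown below, and the linked prerequisites page remains the authoritative sequence (it may prescribe targeted applies first):

```sh
terraform init
terraform apply -auto-approve
```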
Validate¶
- List the nodes and their instance type:
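One way to do this (the `-L` label-column flag is an assumption, not taken from this pattern) is:

```sh
kubectl get nodes -L node.kubernetes.io/instance-type
```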
NAME                                        STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-10-0-20-54.us-east-2.compute.internal    Ready    <none>   12m   v1.31.4-eks-aeac579   g6e.8xlarge
ip-10-0-23-209.us-east-2.compute.internal   Ready    <none>   12m   v1.31.4-eks-aeac579   g6e.8xlarge
ip-10-0-26-209.us-east-2.compute.internal   Ready    <none>   12m   v1.31.4-eks-aeac579   m7a.xlarge
ip-10-0-40-21.us-east-2.compute.internal    Ready    <none>   12m   v1.31.4-eks-aeac579   m7a.xlarge
- Verify that the lws, EFA device plugin, and NVIDIA device plugin pods are running:
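For example:

```sh
kubectl get pods -A
```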
NAMESPACE              NAME                                     READY   STATUS    RESTARTS   AGE
kube-system            aws-efa-k8s-device-plugin-4b4jh          1/1     Running   0          2m
kube-system            aws-efa-k8s-device-plugin-h2vqn          1/1     Running   0          2m
kube-system            aws-node-rdx66                           2/2     Running   0          2m
kube-system            aws-node-w9d8t                           2/2     Running   0          2m
kube-system            aws-node-xs7wv                           2/2     Running   0          2m
kube-system            aws-node-xtslm                           2/2     Running   0          2m
kube-system            coredns-6b94694fcb-kct65                 1/1     Running   0          2m
kube-system            coredns-6b94694fcb-tzg25                 1/1     Running   0          2m
kube-system            kube-proxy-4znrq                         1/1     Running   0          2m
kube-system            kube-proxy-bkzmz                         1/1     Running   0          2m
kube-system            kube-proxy-brpt5                         1/1     Running   0          2m
kube-system            kube-proxy-f9qvw                         1/1     Running   0          2m
lws-system             lws-controller-manager-fbb6489f9-hrltq   1/1     Running   0          2m
lws-system             lws-controller-manager-fbb6489f9-hxdpj   1/1     Running   0          2m
nvidia-device-plugin   nvidia-device-plugin-g5lwg               1/1     Running   0          2m
nvidia-device-plugin   nvidia-device-plugin-v6gkj               1/1     Running   0          2m
- Build and push the provided Dockerfile as a container image into ECR (the `build.sh` file is created as part of `terraform apply`):
Warning
Building and pushing the Docker image will take a considerable amount of resources and time. Building and pushing this image took a little over 1 hour and 10 minutes on a system without any prior images/layers cached; this was on an AMD Ryzen Threadripper 1900X 8-core 4.2 GHz CPU with 128GB of RAM and a 500GB NVMe SSD. The resultant image is roughly 16.7GB in size (unpacked).
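Assuming the generated script is in the current pattern directory, running it builds the image and pushes it to the ECR repository created by Terraform:

```sh
# build.sh is generated by `terraform apply`
./build.sh
```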
- Update the provided `lws.yaml` file with your Hugging Face token that will be used to pull down the `meta-llama/Llama-3.1-8B-Instruct` model used in this example.
- Deploy the LeaderWorkerSet and its associated K8s service:
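Assuming the manifest name from the previous step:

```sh
kubectl apply -f lws.yaml
```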
- Verify that the distributed tensor-parallel inference works. One way to check (sketched below) is to inspect the leader pod's logs, which should show the model loading and the vLLM API server starting.
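A minimal sketch, assuming the LeaderWorkerSet is named `vllm` and the leader container `vllm-leader` (adjust both to match `lws.yaml`):

```sh
# pod and container names are assumptions based on the upstream LWS vLLM example
kubectl logs vllm-0 -c vllm-leader
```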
- Use `kubectl port-forward` to forward local port 8080 to the service:
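For example, assuming the service created by `lws.yaml` is named `vllm-leader` and listens on port 8080 (adjust to match your manifest):

```sh
kubectl port-forward svc/vllm-leader 8080:8080
```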
- Open another terminal and send a request to the model:

curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }' | jq
The output should be similar to the following:

{
  "id": "cmpl-7b171b2a1a5b4f56805e721a60b923f4",
  "object": "text_completion",
  "created": 1738278714,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top tourist destination, and for good",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7,
    "prompt_tokens_details": null
  }
}
Destroy¶
terraform destroy -target="module.eks_blueprints_addons" -auto-approve
terraform destroy -target="module.eks" -auto-approve
terraform destroy -auto-approve
See here for more details on cleaning up the resources created.