
Optimizing NVIDIA GPUs with Partitioning in Kubernetes

Ishan Khare



With the evolution of generative AI and models like ChatGPT (GPT-3.5 and GPT-4), Llama 2, Falcon-7B, and Falcon-40B, every company is trying to build AI features into its product lineup, and all of them depend on GPUs. GPUs are used both for training machine learning models and for inference. As the most critical and expensive piece of hardware in the AI/ML infrastructure, their efficient usage is paramount to optimizing cost and performance for the product or solution you are building.

Most cloud services now offer GPU capabilities, and there are specialized GPU cloud providers that often offer a wider range of GPUs at more affordable prices compared to mainstream cloud providers. A prevalent approach involves containerizing AI/ML workloads and managing them through Kubernetes-based systems. By doing so, you can create your own scalable platform, free from the constraints of closed, proprietary cloud solutions, and potentially realize cost savings. Additionally, this approach allows you to achieve vendor independence.

A winning combination: Kubernetes + NVIDIA Triton Inference Server

NVIDIA Triton Inference Server

Model inference is the process of generating output from live data using a model trained on a large dataset. Triton Inference Server is open source inference serving software that streamlines AI inferencing. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference across cloud, datacenter, edge, and embedded devices on NVIDIA GPUs, x86 and Arm CPUs, or AWS Inferentia. Triton delivers optimized performance for many query types, including real-time, dynamic batching, ensembles, and audio/video streaming. It is a highly performant server, available as open source under the BSD-3-Clause license, which allows commercial use as well.
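
To get a concrete feel for what "serving" means here, the sketch below starts Triton from the NGC container image and checks its readiness endpoint. The image tag and model repository path are illustrative placeholders, so pin them to the Triton release and layout you actually use.

```bash
# A minimal sketch: run Triton from the NGC container and point it at a
# local model repository (tag and paths are illustrative).
$ docker run --gpus=1 --rm \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v $PWD/model_repository:/models \
    nvcr.io/nvidia/tritonserver:23.08-py3 \
    tritonserver --model-repository=/models

# In another shell: port 8000 serves the HTTP health and inference endpoints
# (8001 is gRPC, 8002 is Prometheus metrics).
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
200
```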

How does Triton achieve compatibility with so many different frameworks?

Behind the scenes, Triton's architecture is designed around the principle of loosely coupled components. Triton defines what's called a backend.

A Triton backend is the implementation that executes a model. A backend can be a wrapper around a deep-learning framework like PyTorch, TensorFlow, TensorRT, ONNX Runtime or OpenVINO.

The backend itself is an implementation of a C interface that’s defined in Triton Core.

There are already backend implementations for all the major frameworks listed above, and new ones can be implemented using the Backend API for your specific use case.

Once implemented, a backend needs to be compiled as a shared library that follows a specific naming convention: libtriton_<backend-name>.so
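
The sketch below shows where these shared libraries live inside the NGC Triton container and how a model's config.pbtxt selects one of them. The directory listing and the model name "my_model" are illustrative, so verify the layout against your Triton version.

```bash
# Backends ship as shared libraries under /opt/tritonserver/backends in the
# NGC container, one directory per backend (listing is illustrative):
$ docker run --rm nvcr.io/nvidia/tritonserver:23.08-py3 \
    ls /opt/tritonserver/backends
onnxruntime  openvino  python  pytorch  tensorflow  tensorrt  ...

# A model in the model repository picks its backend in config.pbtxt
# ("my_model" is a hypothetical model name):
$ cat model_repository/my_model/config.pbtxt
backend: "onnxruntime"   # resolved to backends/onnxruntime/libtriton_onnxruntime.so
max_batch_size: 8
```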

The case for Kubernetes

Having seen the possibilities of serving ML models with Triton server as a containerized workload, the next step is naturally container orchestration, and what better platform for it than Kubernetes. While there are alternatives to Kubernetes, nothing comes close to its versatility, robustness, and reliability for scheduling and provisioning these workloads on demand.

Support for fault tolerance, scheduled jobs, and autoscaling are just a few of the things that come baked into Kubernetes, along with compatibility with a wide variety of tools from the cloud-native landscape that align with Kubernetes in one way or another.

But so far we have only talked about use cases where a Triton-served model runs on a per-node basis. The general idea with Kubernetes, however, is that you can have multiple replicas of a workload scheduled across one or more nodes.

This gives us two advantages (see the Deployment sketch after this list):

  1. A multi-node architecture helps us avoid single points of failure.
  2. Running multiple replicas of a workload on a node lets us make sure we consume the underlying hardware resources as fully as possible.
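
As a rough illustration of both points, the sketch below runs several Triton replicas as a Kubernetes Deployment. The image tag, ports, volume, and GPU resource name are illustrative placeholders; substitute your own model repository and, once MIG is configured, a MIG resource type.

```bash
# A minimal sketch of running multiple Triton replicas on Kubernetes
# (image tag, volume, and resource names are illustrative).
$ kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
spec:
  replicas: 3                      # spread inference capacity across nodes/GPUs
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:23.08-py3
        args: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000      # HTTP
        - containerPort: 8001      # gRPC
        - containerPort: 8002      # metrics
        resources:
          limits:
            nvidia.com/gpu: 1      # later: a MIG resource such as nvidia.com/mig-1g.5gb
        volumeMounts:
        - name: models
          mountPath: /models
      volumes:
      - name: models
        emptyDir: {}               # placeholder; mount your real model repository here
EOF
```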

How to optimize the usage of GPUs?

And this brings us to the main issue we might face in this setup: underutilization of our expensive GPUs, i.e. unsaturated GPUs.

Basically, we do not want to pay for expensive hardware like GPUs while not utilizing it to its full potential. If we can solve this underutilization issue, we gain at least a few advantages:

  1. Couple the solution to underutilization with GPU sharing and we can bring down costs significantly for individual customers.
  2. Since more customer workloads can now fit onto each node/GPU instance, even cloud providers can earn better margins, provided they can guarantee good tenant isolation between these workloads.

Solution: Use MIG - Multi-Instance GPU

MIG, or Multi-Instance GPU, can maximize the utilization of large GPUs such as the A100 or the HGX H100. It also enables multiple users to share a single GPU by running multiple workloads in parallel as if there were multiple, smaller GPUs.

MIG divides a single GPU into multiple partitions called GPU instances. Each instance has dedicated memory and compute resources, so hardware-level isolation ensures simultaneous workload execution with guaranteed quality of service and fault isolation.


Prerequisite

To use MIG, you must enable MIG mode and create MIG devices on the GPUs. We will discuss the steps in the next few sections.
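
As a quick preview, enabling MIG mode itself is a one-line nvidia-smi call. The sketch below assumes GPU index 0 on a bare-metal node with a recent NVIDIA driver; on some platforms a GPU reset or node reboot is required before the mode change takes effect.

```bash
# Check the current MIG mode of GPU 0 (output shown is illustrative)
$ sudo nvidia-smi -i 0 --query-gpu=mig.mode.current --format=csv
mig.mode.current
Disabled

# Enable MIG mode on GPU 0 (may require a GPU reset or node reboot afterwards)
$ sudo nvidia-smi -i 0 -mig 1
```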

Partitioning the GPUs

There are several options to achieve the desired partitioning of GPUs (both are sketched below):

  • Use nvidia-smi to create GPU instances and compute instances manually
  • Use the nvidia-mig-parted tool to declaratively define a set of possible MIG configurations and apply them to GPU nodes
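
Here is a rough sketch of both approaches for GPU 0. The profile IDs and the nvidia-mig-parted config follow the A100-40GB examples from NVIDIA's documentation and the project's README at the time of writing; verify them against your hardware and tool version (nvidia-smi mig -lgip lists the profiles your GPU supports).

```bash
# 1. Manually with nvidia-smi: create one 3g.20gb (ID 9), one 2g.10gb (ID 14)
#    and one 1g.5gb (ID 19) GPU instance, plus their default compute instances (-C):
$ sudo nvidia-smi mig -cgi 9,14,19 -C

# List the GPU instances that were created:
$ sudo nvidia-smi mig -lgi

# 2. Declaratively with nvidia-mig-parted, e.g. splitting every GPU into
#    seven 1g.5gb devices (config format as per the project's README):
$ cat <<'EOF' > mig-config.yaml
version: v1
mig-configs:
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
EOF
$ sudo nvidia-mig-parted apply -f mig-config.yaml -c all-1g.5gb
```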

Partitioning Schemes

Based on the GPU type, various combinations of GPU SM slices, Compute Instances (CI), and memory slices can be created. However, there is a limit to the maximum number of partitions each device can be broken down into, as summarized in the table below.

| Product | Architecture | Microarchitecture | Compute Capability | Memory Size | Max Number of Instances |
|---|---|---|---|---|---|
| H100-SXM5 | Hopper | GH100 | 9.0 | 80GB | 7 |
| H100-PCIE | Hopper | GH100 | 9.0 | 80GB | 7 |
| H100-SXM5 | Hopper | GH100 | 9.0 | 94GB | 7 |
| H100-PCIE | Hopper | GH100 | 9.0 | 94GB | 7 |
| H100 on GH200 | Hopper | GH100 | 9.0 | 96GB | 7 |
| A100-SXM4 | NVIDIA Ampere | GA100 | 8.0 | 40GB | 7 |
| A100-SXM4 | NVIDIA Ampere | GA100 | 8.0 | 80GB | 7 |
| A100-PCIE | NVIDIA Ampere | GA100 | 8.0 | 40GB | 7 |
| A100-PCIE | NVIDIA Ampere | GA100 | 8.0 | 80GB | 7 |
| A30 | NVIDIA Ampere | GA100 | 8.0 | 24GB | 4 |

You can find more information about the partitioning schemes in NVIDIA's MIG user guide.

Kubernetes Specific Requirements

Once we have the base layer set up as described above and have partitioned the GPUs on the node, we need to make sure these devices are propagated properly so that Kubernetes can identify them and schedule the requested workloads onto them. For this, we rely on two components:

k8s-device-plugin: a DaemonSet that allows you to automatically:

  • Expose the number of GPUs on each node of your cluster
  • Keep track of the health of your GPUs
  • Run GPU enabled containers in your Kubernetes cluster

gpu-feature-discovery: a software component that allows you to automatically generate labels for the set of GPUs available on a node. It leverages Node Feature Discovery to perform this labeling.

Both of the above components can be installed in a variety of supported ways, including via convenient Helm charts.
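
For example, installation with Helm looks roughly like the sketch below. The repository URLs, chart names, and the migStrategy value reflect the projects' READMEs at the time of writing, so double-check them against the upstream docs; migStrategy selects one of the strategies described in the next section.

```bash
# Add the Helm repositories for the device plugin and GPU feature discovery
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
$ helm repo update

# Install both, choosing a MIG strategy (none | single | mixed)
$ helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
    --namespace kube-system \
    --set migStrategy=single
$ helm install gpu-feature-discovery nvgfd/gpu-feature-discovery \
    --namespace kube-system \
    --set migStrategy=single
```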

Available Strategies

  • None -- (Expose all underlying GPUs as-is, whether MIG is enabled or not, including non-partitioned ones) – The none strategy is designed to keep the nvidia-device-plugin running the same as it always has. The plugin makes no distinction between GPUs that have MIG enabled or not, and enumerates all GPUs on the node, making them available via the nvidia.com/gpu resource type.
  • Single -- (Expose the underlying GPUs according to the partitioning scheme, but only homogeneous partition layouts are supported – e.g. 2x(4 memory slices, 3 compute slices), 3x(2 memory, 2 compute), 7x(1 memory, 1 compute)) – The single strategy is designed to keep the user experience of working with GPUs in Kubernetes the same as it has always been. MIG devices are enumerated with the nvidia.com/gpu resource type just as before. However, the properties associated with that resource type now map to the MIG devices available on that node instead of the full GPUs.
  • Mixed -- (Expose a mixed set of underlying partitioned GPUs as nvidia.com/mig-<slice_count>g.<memory_size>gb) – The mixed strategy is designed to enumerate a different resource type for every MIG device configuration available in the cluster.

Demo of Various Strategies

None

You can verify GPU availability on the Kubernetes nodes.

```bash
$ kubectl describe node
...
Capacity:
  nvidia.com/gpu: 1
...
Allocatable:
  nvidia.com/gpu: 1
...
```

Check the node labels starting with nvidia.com/. Correct labeling is important for GPUs to function properly on Kubernetes.

```bash
$ kubectl get node -o json | \
    jq '.items[0].metadata.labels | with_entries(select(.key | startswith("nvidia.com")))'
{
  "nvidia.com/cuda.driver.major": "450",
  "nvidia.com/cuda.driver.minor": "80",
  "nvidia.com/cuda.driver.rev": "02",
  "nvidia.com/cuda.runtime.major": "11",
  "nvidia.com/cuda.runtime.minor": "0",
  "nvidia.com/gfd.timestamp": "1605312111",
  "nvidia.com/gpu.compute.major": "8",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "1",
  "nvidia.com/gpu.family": "ampere",
  "nvidia.com/gpu.machine": "NVIDIA DGX",
  "nvidia.com/gpu.memory": "40537",
  "nvidia.com/gpu.product": "A100-SXM4-40GB"
}
```

Let's create a Pod to consume the GPU. We use the nvidia/cuda image to run the nvidia-smi command and set the nvidia.com/gpu resource limit to 1.

```bash
$ kubectl run -it --rm \
    --image=nvidia/cuda:11.0-base \
    --restart=Never \
    --limits=nvidia.com/gpu=1 \
    mig-none-example -- nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-15f0798d-c807-231d-6525-a7827081f0f1)
```

Single

  • Describe nodes
```bash
$ kubectl describe node
...
Capacity:
  nvidia.com/gpu: 7
...
Allocatable:
  nvidia.com/gpu: 7
...
```
  • Get node labels
```bash
$ kubectl get node -o json | \
    jq '.items[0].metadata.labels | with_entries(select(.key | startswith("nvidia.com")))'
{
  "nvidia.com/cuda.driver.major": "450",
  "nvidia.com/cuda.driver.minor": "80",
  "nvidia.com/cuda.driver.rev": "02",
  "nvidia.com/cuda.runtime.major": "11",
  "nvidia.com/cuda.runtime.minor": "0",
  "nvidia.com/gfd.timestamp": "1605657366",
  "nvidia.com/gpu.compute.major": "8",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "7",
  "nvidia.com/gpu.engines.copy": "1",
  "nvidia.com/gpu.engines.decoder": "0",
  "nvidia.com/gpu.engines.encoder": "0",
  "nvidia.com/gpu.engines.jpeg": "0",
  "nvidia.com/gpu.engines.ofa": "0",
  "nvidia.com/gpu.family": "ampere",
  "nvidia.com/gpu.machine": "NVIDIA DGX",
  "nvidia.com/gpu.memory": "4864",
  "nvidia.com/gpu.multiprocessors": "14",
  "nvidia.com/gpu.product": "A100-SXM4-40GB-MIG-1g.5gb",
  "nvidia.com/gpu.slices.ci": "1",
  "nvidia.com/gpu.slices.gi": "1",
  "nvidia.com/mig.strategy": "single"
}
```
  • Deploy 7 Pods, each consuming one MIG device, and read their logs
```bash
$ for i in $(seq 7); do
    kubectl run \
      --image=nvidia/cuda:11.0-base \
      --restart=Never \
      --limits=nvidia.com/gpu=1 \
      mig-single-example-${i} -- bash -c "nvidia-smi -L; sleep infinity"
  done
pod/mig-single-example-1 created
pod/mig-single-example-2 created
pod/mig-single-example-3 created
pod/mig-single-example-4 created
pod/mig-single-example-5 created
pod/mig-single-example-6 created
pod/mig-single-example-7 created

$ for i in $(seq 7); do
    echo "mig-single-example-${i}";
    kubectl logs mig-single-example-${i}
    echo "";
  done
mig-single-example-1
GPU 0: A100-SXM4-40GB (UUID: GPU-4200ccc0-2667-d4cb-9137-f932c716232a)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/7/0)

mig-single-example-2
GPU 0: A100-SXM4-40GB (UUID: GPU-4200ccc0-2667-d4cb-9137-f932c716232a)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/9/0)
...
```

Mixed

To test this strategy, we check that all MIG devices are enumerated using their fully qualified names, like nvidia.com/mig-<slice_count>g.<memory_size>gb.

  • Describe nodes
```bash
$ kubectl describe node
...
Capacity:
  nvidia.com/mig-1g.5gb:  1
  nvidia.com/mig-2g.10gb: 1
  nvidia.com/mig-3g.20gb: 1
...
Allocatable:
  nvidia.com/mig-1g.5gb:  1
  nvidia.com/mig-2g.10gb: 1
  nvidia.com/mig-3g.20gb: 1
...
```
  • Check node labels
```bash
$ kubectl get node -o json | \
    jq '.items[0].metadata.labels | with_entries(select(.key | startswith("nvidia.com")))'
{
  "nvidia.com/cuda.driver.major": "450",
  "nvidia.com/cuda.driver.minor": "80",
  "nvidia.com/cuda.driver.rev": "02",
  "nvidia.com/cuda.runtime.major": "11",
  "nvidia.com/cuda.runtime.minor": "0",
  "nvidia.com/gfd.timestamp": "1605658841",
  "nvidia.com/gpu.compute.major": "8",
  "nvidia.com/gpu.compute.minor": "0",
  "nvidia.com/gpu.count": "1",
  "nvidia.com/gpu.family": "ampere",
  "nvidia.com/gpu.machine": "NVIDIA DGX",
  "nvidia.com/gpu.memory": "40537",
  "nvidia.com/gpu.product": "A100-SXM4-40GB",
  "nvidia.com/mig-1g.5gb.count": "1",
  "nvidia.com/mig-1g.5gb.engines.copy": "1",
  "nvidia.com/mig-1g.5gb.engines.decoder": "0",
  "nvidia.com/mig-1g.5gb.engines.encoder": "0",
  "nvidia.com/mig-1g.5gb.engines.jpeg": "0",
  "nvidia.com/mig-1g.5gb.engines.ofa": "0",
  "nvidia.com/mig-1g.5gb.memory": "4864",
  "nvidia.com/mig-1g.5gb.multiprocessors": "14",
  "nvidia.com/mig-1g.5gb.slices.ci": "1",
  "nvidia.com/mig-1g.5gb.slices.gi": "1",
  "nvidia.com/mig-2g.10gb.count": "1",
  "nvidia.com/mig-2g.10gb.engines.copy": "2",
  "nvidia.com/mig-2g.10gb.engines.decoder": "1",
  "nvidia.com/mig-2g.10gb.engines.encoder": "0",
  "nvidia.com/mig-2g.10gb.engines.jpeg": "0",
  "nvidia.com/mig-2g.10gb.engines.ofa": "0",
  "nvidia.com/mig-2g.10gb.memory": "9984",
  "nvidia.com/mig-2g.10gb.multiprocessors": "28",
  "nvidia.com/mig-2g.10gb.slices.ci": "2",
  "nvidia.com/mig-2g.10gb.slices.gi": "2",
  "nvidia.com/mig-3g.21gb.count": "1",
  "nvidia.com/mig-3g.21gb.engines.copy": "3",
  "nvidia.com/mig-3g.21gb.engines.decoder": "2",
  "nvidia.com/mig-3g.21gb.engines.encoder": "0",
  "nvidia.com/mig-3g.21gb.engines.jpeg": "0",
  "nvidia.com/mig-3g.21gb.engines.ofa": "0",
  "nvidia.com/mig-3g.21gb.memory": "20096",
  "nvidia.com/mig-3g.21gb.multiprocessors": "42",
  "nvidia.com/mig-3g.21gb.slices.ci": "3",
  "nvidia.com/mig-3g.21gb.slices.gi": "3",
  "nvidia.com/mig.strategy": "mixed"
}
```
  • Deploy a Pod consuming one of the available MIG devices. Note that the resource name differs from the previous strategies.
```bash
$ kubectl run -it --rm \
    --image=nvidia/cuda:11.0-base \
    --restart=Never \
    --limits=nvidia.com/mig-1g.5gb=1 \
    mig-mixed-example -- nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-4200ccc0-2667-d4cb-9137-f932c716232a)
  MIG 1g.5gb Device 0: (UUID: MIG-GPU-4200ccc0-2667-d4cb-9137-f932c716232a/9/0)
```

Conclusion

Due to high demand, GPU availability has been a problem, and to democratize AI and make it available to everyone in a cost-effective manner, technologies like Kubernetes and NVIDIA MIG play an important role. We have tried to share what is possible on the current platform so that you can run your models more efficiently. Using MIG together with advanced Kubernetes scheduling strategies, a large GPU can serve multiple users or multiple apps and be utilized to its full potential.

Let us know if you are looking for any additional help; we are just an email or a meeting away.

Credits - Thanks to our partner NeevCloud for providing the A30 GPUs and bare metal machines for testing.

Thanks to our guest author, Ishan, for sharing his thoughts.
