

EPJ Web of Conferences 337, 01079 (2025)  
CHEP 2024  
Improving overall GPU sharing and usage efficiency with  
Kubernetes  
Diana Gaponcic1,∗, Ricardo Rocha1,∗∗, Diogo Filipe Tomas Guerra1, and Dejan Golubovic1  
1CERN  
Abstract. GPUs and accelerators are enabling High Energy Physics (HEP) to  
keep pace with the growing data volume and computational complexity. The  
challenge remains to improve overall efficiency and sharing opportunities of  
what are currently expensive and scarce resources.  
In this paper, we describe the common patterns of GPU usage in HEP, including  
spiky requirements with low overall usage for interactive access, as well as  
more predictable but potentially bursty workloads. We then explore the multiple  
mechanisms to share and partition GPUs, covering time-slicing and physical partitioning (MIG) for NVIDIA devices.  
We conclude with the results of an extensive set of benchmarks for representative HEP use cases. We highlight the limitations of each option and the use cases where they fit best. Finally, we cover the deployment aspects and the different options available targeting a centralized GPU pool that can significantly push the overall GPU usage efficiency.  
1 Introduction  
GPUs are shaping the way organizations access and use their data, and CERN is not an  
exception. High Energy Physics (HEP) analysis and deployments are being rethought due  
to the growing complexity and volume of data, which traditional computational methods  
struggle to process efficiently. In this context, accelerators remain the key to enabling efficient Machine Learning (ML) [1].  
The problem is that, in reality, many use cases cannot fully utilize the available hardware. This can have many causes, such as workloads designed with CPUs in mind, poorly chosen batch sizes, or wrong assumptions about hardware requirements.  
Even when the experts understand the software and the hardware well, and are invested  
in getting maximum performance, some resources will stay idle due to the iterative nature  
of the development process. To solve this problem, it is important to come up with ways of  
sharing the available resources. This can be done either at the infrastructure level or at the  
GPU level itself.  
∗e-mail: diana.gaponcic@cern.ch  
∗∗e-mail: ricardo.rocha@cern.ch  
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative  
Commons Attribution License 4.0 (https://creativecommons.org/licenses/by/4.0/).  
   
2 Sharing Resources at the Infrastructure Level  
Sharing at the infrastructure level means having a single point of GPU access. As you can  
see in figure 1, this needs to be a platform that acts as the entry point for workloads requiring  
a GPU, regardless of what is being run on it: simulations, inference, CI jobs, training, etc.  
Users access the platform, choose which GPU is preferred or required, how much memory is needed, and what the usage pattern is, and then get access to accelerators from a common  
pool of resources. This way, GPUs are always in use. As soon as a GPU is released, it can be  
reassigned to someone else. This also creates an opportunity to access GPUs, or even other accelerators (TPUs, IPUs), through the public clouds (especially for specialized hardware that we cannot get on-premises).  
Figure 1. Single point of GPU access for various use cases  
Kubernetes does a great job at being the common infrastructure for different use cases  
and utilization patterns. Many of them can be covered with Kubeflow [2] (a foundation of  
tools for AI platforms on Kubernetes), e.g. distributed training for machine learning. For  
the others, dedicated services can be deployed in the same cluster or across multiple clusters.  
Having a common pool of resources:  
- increases the GPU offering at CERN (since many of the dedicated GPUs can be added to the common pool)  
- increases the overall GPU usage (since GPUs stay idle for less time).  
Even though this is a huge improvement, one issue still remains: some resources will be wasted if the use case cannot fully utilize the obtained GPU. To solve this, it is important to think of sharing at a more granular level: sharing at the GPU level itself.  
3 Sharing Resources at the GPU Level  
There are many benefits to sharing GPUs, but sharing also comes with added complexity. One big benefit is cost optimization: accelerators are expensive, and organizations benefit from giving access to more users and getting increased overall hardware utilization. The same applies to energy and sustainability awareness.  
At the same time, a multi-tenant setup introduces new problems that didn't previously exist, such as the noisy neighbor issue, the lack of data isolation, and the loss of efficiency due to the complexity of managing multiple users concurrently. These issues are key points that need to be addressed to ensure multiple workloads can run safely at the same time on the same chip.  
3.1 Time-slicing  
Time-slicing is a sharing mechanism where the scheduler gives an equal share of time to all GPU processes and alternates between them in a round-robin fashion. Figure 2 shows how processes share the memory, while the compute resources are assigned to one process at a time.  
Figure 2. High-level overview of how time-slicing works  
To provision GPUs in Kubernetes, we need to treat them as special resources and have a  
way to identify them, allocate them to workloads, and monitor their health. To help with this  
task, we can use the NVIDIA gpu-operator [3]. The operator automates the management of the NVIDIA software needed to use GPUs, including the NVIDIA drivers, the device plugin, and the container toolkit.  
To enable time-slicing on Kubernetes, some extra configuration needs to be provided to the gpu-operator. First we need to let the device plugin know that it needs to use the configuration available in a ConfigMap (see figure 3); in this case, we called the ConfigMap nvidia-time-slicing-config. Then the description of the desired sharing options needs to be added to the referenced ConfigMap. For example, with slice-4, the GPU will be shared into 4 replicas with time-slicing, and the GPU resource will be renamed.  
Figure 3. The configuration needed to enable time-slicing  
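Figure 3 is shown only as an image in the original; the following is a sketch of what the referenced ConfigMap can look like, following the device plugin's time-slicing sharing schema. The ConfigMap name and the slice-4 key come from the text; the exact field names may differ between gpu-operator versions.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-time-slicing-config
data:
  slice-4: |
    version: v1
    sharing:
      timeSlicing:
        # Advertise the shared resource as nvidia.com/gpu.shared
        # instead of the default nvidia.com/gpu.
        renameByDefault: true
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

The device plugin is pointed at this ConfigMap through the gpu-operator Helm values (e.g. the devicePlugin.config.name and devicePlugin.config.default settings in recent operator versions).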
When the configuration works properly, the node will start advertising nvidia.com/gpu.shared resources instead of the default nvidia.com/gpu:  
Allocatable:  
  ...  
  nvidia.com/gpu:         0  
  nvidia.com/gpu.shared:  4  
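A workload then requests the renamed resource like any other extended resource; a minimal pod sketch (the pod name and container image are hypothetical, any CUDA-enabled image works):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: timeslice-demo               # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu.shared: 1   # one of the 4 time-sliced replicas
```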
Time-slicing works on a wide range of NVIDIA architectures, it is an easy way to set up GPU concurrency, and it offers an unlimited number of partitions. On the other hand, there is no process/memory isolation and no ability to set priorities.  
3.2 Multi Instance GPU  
Multi Instance GPU (MIG) is a mechanism that allows the partitioning of a GPU into up to  
seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute  
cores. The partitions on an A100 40GB are shown in figure 4.  
Figure 4. MIG Profiles on A100 40GB [4]  
To enable MIG on Kubernetes, some extra configuration needs to be provided to the GPU  
operator:  
# values.yaml in NVIDIA gpu-operator Helm chart  
...  
mig:  
  strategy: mixed  
migManager:  
  config:  
    name: nvidia-mig-config  

# $ cat nvidia-mig-config.yaml  
apiVersion: v1  
kind: ConfigMap  
metadata:  
  name: nvidia-mig-config  
data:  
  config.yaml: |  
    version: v1  
    mig-configs:  
      # A100-40GB  
      3g.20gb-2x2g.10gb:  
        - devices: all  
          mig-enabled: true  
          mig-devices:  
            "2g.10gb": 2  
            "3g.20gb": 1  
First we need to decide on a MIG strategy (mixed/single/none), then let the MIG manager know that it needs to use the configuration available in nvidia-mig-config. Lastly, the description of the MIG options has to be added to the referenced ConfigMap. For example, with 3g.20gb-2x2g.10gb, the GPU will be split into 3 instances: two 2g.10gb instances and one 3g.20gb instance. If we sum up the instances, we get 7g.40gb, which is the equivalent of the full GPU.  
If the configuration works properly, the node will start advertising the new resources,  
which in this case are named based on the instance they represent:  
Allocatable:  
  ...  
  nvidia.com/gpu:          0  
  nvidia.com/mig-2g.10gb:  2  
  nvidia.com/mig-3g.20gb:  1  
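Workloads pick a specific MIG profile by requesting the corresponding resource; a minimal sketch under the mixed strategy described above (the pod name and container image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo                      # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/mig-2g.10gb: 1   # one of the two 2g.10gb instances
```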
A very important MIG addition is the possibility to have telemetry data per instance. As shown in figure 5, every instance can be monitored independently. The telemetry allows for detailed insights and analysis of each instance's performance, resource utilization, and health metrics.  
Figure 5. Monitoring a MIG partitioned GPU per instance  
MIG offers numerous advantages in various computing environments. One key benefit is the hardware isolation that allows processes to run securely in parallel without influencing each other. Additionally, MIG provides monitoring and telemetry data at the partition level. This makes MIG a very flexible solution that can be tailored to diverse workload requirements.  
However, MIG also comes with certain disadvantages that need to be considered. First of all, it is only available for the Ampere, Hopper, and Blackwell architectures. Another challenge is that reconfiguring the partition layout requires all running processes to be evicted, which can disrupt ongoing tasks. Furthermore, there is a potential loss of available memory depending on the chosen profile layout, although this risk can be mitigated if the partitioning layout is selected thoughtfully.  
4 Benchmarking GPU Sharing Mechanisms  
The benchmarking was done with a simulation of particles turning in the LHC. We used an OpenCL-oriented Simpletrack benchmark built for a selection of GPU and CPU platforms; in this case, we only used the NVIDIA one. More details on the benchmark script:  
- built with Xsuite [5]  
- heavy on GPU usage  
- low on CPU-to-GPU communication and memory accesses  
- run on an NVIDIA A100 40GB PCIe GPU  
- run on a Kubernetes 1.22 cluster (CUDA version: 11.6, driver version: 470.129.06)  
The script [6] allows changing the number of particles and turns. We used the default value for turns, which is 15, but varied the number of particles to increase or decrease the amount of computation to be done on the NVIDIA device.  
4.1 Time-slicing  
First we schedule only one process on the GPU, then progressively double the number of processes. Initially we compare the execution times on a full GPU and a GPU time-sliced into 2; the results are shown in table 1. We consider as the reference value the execution time when only one process is running, multiplied by the number of processes. Afterwards, we compare the execution time when time-slicing into 2 vs 4 (see table 2) and 4 vs 8 (see table 3).  
Number of particles | Shared x1 [s] | Shared x1 * 2 [s] | Shared x2 [s] | Loss [%]  
15 000 000          | 77.12         | 154.24            | 212.71        | 37.90  
20 000 000          | 99.91         | 199.82            | 276.23        | 38.23  
30 000 000          | 152.61        | 305.22            | 423.08        | 38.61  
Table 1. Comparing full GPU vs sharing between two processes with time-slicing  
Number of particles | Shared x2 [s] | Shared x4 [s] | Loss [%]  
15 000 000          | 212.71        | 421.55        | 0  
20 000 000          | 276.23        | 546.19        | 0  
30 000 000          | 423.08        | 838.55        | 0  
Table 2. Benchmarking time-slicing with two and four processes  
Number of particles | Shared x4 [s] | Shared x8 [s] | Loss [%]  
15 000 000          | 421.55        | 838.22        | 0  
20 000 000          | 546.19        | 1087.99       | 0  
30 000 000          | 838.55        | 1672.95       | 0  
Table 3. Comparing sharing between four and eight processes with time-slicing  
Conclusions:  
1. If we run 1 process on a dedicated GPU, and then 2 identical processes, we would  
expect the execution time to double. In practice, since the GPU needs to perform  
context switching (going from shared x1 to shared x2), there is a performance loss of  
38%.  
2. Sharing a GPU with time-slicing between 2 processes introduces a big performance penalty. However, increasing the number of processes on the GPU (to 4 or 8) doesn't introduce additional performance loss.  
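The loss column in table 1 can be reproduced directly from the raw timings; a minimal sketch using the 15 000 000-particle row:

```python
# Timings from table 1 (15M particles), in seconds.
t_single = 77.12        # one process alone on the full GPU
t_shared_x2 = 212.71    # two processes time-sliced on the same GPU

# Two identical processes would ideally take twice the single-process
# time; the extra time is the context-switching overhead of time-slicing.
ideal = 2 * t_single
loss_pct = (t_shared_x2 - ideal) / ideal * 100
print(f"time-slicing loss: {loss_pct:.1f}%")  # close to the 37.90% in table 1
```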
4.2 Multi Instance GPU  
A full A100 GPU has 108 Streaming Multiprocessors (SMs). When MIG is enabled on the GPU (7g.40gb), 10 SMs are lost, which represents 9.25% of the available compute. Figure 6 shows how much those lost SMs influence the number of CUDA cores and tensor cores available to perform the actual work.  
Figure 6. Performance loss for an A100 GPU when MIG is enabled  
First we compare the benchmarks on the full GPU, then on a full GPU but with MIG enabled (see table 4). Afterwards we partition into smaller instances (table 5) and compare the performance scaling (table 6).  
Number of particles | Whole GPU, no MIG [s] | Whole GPU, with MIG (7g.40gb) [s] | Loss [%]  
5 000 000           | 26.365                | 28.732                            | 8.97  
10 000 000          | 51.135                | 55.930                            | 9.37  
15 000 000          | 76.374                | 83.184                            | 8.91  
Table 4. Benchmarking on a full GPU and a GPU with MIG enabled  
Number of particles | 7g.40gb [s] | 3g.20gb [s] | 2g.10gb [s] | 1g.5gb [s]  
5 000 000           | 28.732      | 62.268      | 92.394      | 182.32  
10 000 000          | 55.930      | 122.864     | 183.01      | 362.10  
15 000 000          | 83.184      | 183.688     | 273.7       | 542.3  
Table 5. Benchmarking MIG partitioning  
Number of particles | 3g.20gb / 7g.40gb | 2g.10gb / 3g.20gb | 1g.5gb / 2g.10gb  
5 000 000           | 2.16              | 1.48              | 1.97  
10 000 000          | 2.19              | 1.48              | 1.97  
15 000 000          | 2.20              | 1.48              | 1.98  
ideal scale         | 7/3 = 2.33        | 3/2 = 1.5         | 2/1 = 2  
Table 6. Comparing the scaling between MIG partitions  
Conclusions:  
1. When we run the benchmarking script on a full GPU without MIG and with MIG  
enabled (7g.40gb), we conclude that the theoretical loss of 9.25% is also seen experi-  
mentally.  
2. We should never enable MIG on a GPU if we will not make use of it by partitioning, as it introduces a performance loss without any benefit in return.  
3. The scaling between partitions converges to the ideal values: as the available resources increase, the execution time decreases roughly linearly.  
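The convergence toward the ideal scale in table 6 can be checked from the raw timings in table 5; a short sketch for the 15 000 000-particle row:

```python
# Execution times from table 5 (15M particles), in seconds.
times = {"7g.40gb": 83.184, "3g.20gb": 183.688,
         "2g.10gb": 273.7, "1g.5gb": 542.3}

# Compare the measured slowdown between adjacent profiles with the
# ideal ratio of their compute fractions (7/3, 3/2, 2/1).
pairs = [("3g.20gb", "7g.40gb", 7 / 3),
         ("2g.10gb", "3g.20gb", 3 / 2),
         ("1g.5gb", "2g.10gb", 2 / 1)]
for small, large, ideal in pairs:
    measured = times[small] / times[large]
    print(f"{small} vs {large}: measured {measured:.2f}, ideal {ideal:.2f}")
```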
5 Monitoring GPUs  
Monitoring is an important part of any infrastructure. It allows tracking and identifying issues, predicting usage or behaviour, setting up alerting, etc. Even though it can be complicated to set up, Kubernetes makes it quite easy to get insights about the GPU usage on a cluster.  
We can start by using kube-prometheus-stack [7], a set of manifests, tools, and dashboards that makes cluster monitoring easy. On the other side we have the gpu-operator (in particular NVIDIA DCGM [8]), which collects metrics from the GPUs in the cluster and exposes them at an endpoint that needs to be scraped. In this way, by using kube-prometheus-stack + gpu-operator, the GPU metrics are already generated and available to the Prometheus/Grafana stack for visualization.  
Usually, the default GPU metrics are enough, but when more granular/specific metrics are  
needed, one can make use of DCGM Field Identifiers [9], which allow enabling and disabling  
a very wide range of metrics.  
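As an illustration, dcgm-exporter reads its metric list from a CSV file, so a custom set of fields can be supplied through a ConfigMap referenced in the gpu-operator values. This is only a sketch: the ConfigMap name is hypothetical, and while the field identifiers below are standard DCGM ones, the exact key names should be checked against the operator version in use.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-dcgm-metrics   # hypothetical; referenced via dcgmExporter.config.name
data:
  dcgm-metrics.csv: |
    # Format: DCGM field identifier, Prometheus metric type, description
    DCGM_FI_DEV_GPU_UTIL,  gauge, GPU utilization (in %).
    DCGM_FI_DEV_FB_USED,   gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_GPU_TEMP,  gauge, GPU temperature (in C).
```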
An example dashboard can be seen in figure 7. It shows the GPU utilization across all available GPUs, lists the MIG devices, has extensive information about the memory utilization per device or partition and the temperature, and can be adapted and extended as needed.  
Figure 7. Example dashboard for monitoring GPUs  
6 Conclusions  
To conclude, having a single point of access for GPUs is a solution that can greatly decrease  
the idle time of resources, while also increasing the overall GPU oering at CERN. Still, this  
cannot be a solution when the use cases cannot fully utilize the available GPUs. To improve  
this, we need to start thinking of sharing at the GPU level (either logical or hardware). Still,  
since sharing comes with performance tradeos, there must be a mechanism to provide dedi-  
cated GPUs to use cases that will fully utilize them, to avoid performance losses. Therefore a  
combination of sharing at dierent infrastructure levels is needed to gain optimal GPU usage.  
References  
[1] Efficient Access to Shared GPU Resources, https://kubernetes.web.cern.ch/tags/gpu/. Accessed on: December 12 2024  
[2] Kubeflow, https://www.kubeflow.org/. Accessed on: June 18 2025  
[3] NVIDIA gpu-operator, https://github.com/NVIDIA/. Accessed on: December 12 2024  
[4] MIG User Guide, https://docs.nvidia.com/datacenter/tesla/mig-user-guide/. Accessed on: December 12 2024  
[5] Xsuite, https://github.com/xsuite/xsuite. Accessed on: June 20 2025  
[7] kube-prometheus-stack, main/charts/kube-prometheus-stack. Accessed on: December 23 2024  
[8] NVIDIA DCGM, https://developer.nvidia.com/dcgm. Accessed on: December 23 2024  
[9] DCGM Field Identifiers, dcgm-api-field-ids.html. Accessed on: December 23 2024  