Accessing GPUs

In the JSC Cloud environment, four hosts are available with 3x Nvidia A100 PCIe 80 GB cards each, and 16 hosts with 1x Nvidia V100 PCIe 16 GB card each. A GPU can be accessed by a virtual machine via PCIe passthrough.

Warning

While a VM with GPUs might speed up your code, there are several technical limitations with GPUs and their OpenStack integration. Because live migration is not supported, there are strong operational limitations, which are pointed out in the official RedHat Documentation. You must deploy your code in a way that allows you to redeploy it easily, and you should not store important data inside those virtual machines, as they might be lost or show unexpected behaviour during maintenance of the underlying cloud infrastructure.

In case no GPU is left, OpenStack will fail to schedule your VM.
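
In that case the instance usually ends up in the ERROR state. A minimal check with the OpenStack CLI, assuming your VM is named gpu-vm as in the creation example further down:

$ openstack server show gpu-vm -c status -c fault   # a fault like "No valid host was found" indicates no GPU host had free capacity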

Prerequisites

A VM needs to be prepared before it can use the GPU assigned to it.

Please pick one of the GPU flavors mentioned at Virtual Machine Types. Note that these GPU VM flavors are currently private and need to be requested first.
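
Once a GPU flavor has been made available to your project, you can verify that you see it with the OpenStack CLI. The grep pattern below is only an example derived from the flavor name used later in this guide:

$ openstack flavor list | grep GN
$ openstack flavor show "SCS-16L:64:20n-z3-GNa:108"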

Create a VM with the RockyLinux 9.3 image and a GPU flavor. You might take a look at the First Steps page.

$ openstack server create --flavor SCS-16L:64:20n-z3-GNa:108 --security-group your_group --key-name your_key --network your_net --image "RockyLinux 9.3" gpu-vm
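
Before logging in, you can wait for the instance to finish building; a small check, assuming the VM name gpu-vm from the command above:

$ openstack server show gpu-vm -c status -f value   # should eventually print ACTIVE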

Install Nvidia driver for GPU support

Install the Nvidia GPU driver.

Log on to the VM, bring it up to date, and install some dependencies.

$ sudo dnf update -y # update all packages to get the latest kernel in
$ sudo shutdown -r 1
$ sudo dnf config-manager --set-enabled crb
$ sudo dnf -y install epel-release
$ sudo dnf install -y gcc gcc-c++ make kernel-headers-$(uname -r) kernel-devel-$(uname -r) tar bzip2 automake elfutils-libelf-devel libglvnd libglvnd-devel libglvnd-opengl libglvnd-glx acpid pciutils dkms
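
At this point you can already verify that the GPU has been passed through on the PCI level and that the headers match the running kernel; the exact device string depends on the GPU model:

$ lspci | grep -i nvidia                 # the passed-through GPU should be listed here
$ uname -r                               # the running kernel version ...
$ rpm -q kernel-devel kernel-headers     # ... should match the installed kernel-devel/kernel-headers packages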

Install the Nvidia driver via the following commands.

$ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
$ sudo dnf module -y install nvidia-driver:latest-dkms
$ sudo shutdown -r 1
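
After the reboot, you can confirm that DKMS built the kernel module for the running kernel:

$ dkms status   # should list an nvidia module built for the running kernel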

After the driver build has finished successfully and the VM has been rebooted, the GPU device should be visible:

[rocky@gpu-vm ~]$ nvidia-smi
Mon Dec 18 14:42:12 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:00:05.0 Off |                    0 |
| N/A   34C    P0              57W / 300W |      4MiB / 81920MiB |     21%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

You can now proceed with CUDA, VirtualGL, or some other GPU workload using the device shown.
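
For a quick functional test beyond nvidia-smi, you could, for example, install the CUDA toolkit from the repository added above and compile a minimal program. This is only a sketch; the cuda-toolkit package name and the nvcc path are assumptions and may differ between CUDA releases:

$ sudo dnf install -y cuda-toolkit
$ cat > devicecount.cu <<'EOF'
#include <cstdio>

// Minimal sanity check: ask the CUDA runtime how many devices it sees.
int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA devices visible: %d\n", count);
    return 0;
}
EOF
$ /usr/local/cuda/bin/nvcc devicecount.cu -o devicecount
$ ./devicecount   # should report at least one device on a GPU VM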