Accessing Virtual GPUs

In the JUSUF cloud environment, some hosts are equipped with an Nvidia V100 PCIe 16 GB card. With the Nvidia vGPU feature, the GPU can be shared among several virtual machines on the host.

Warning

While vGPUs might speed up your code, they and their OpenStack integration have several technical limitations. Because live migration is not supported, there are strong operational limitations, as pointed out in the official Red Hat documentation. Deploy your code in a way that lets you redeploy it easily, and do not store important data within those virtual machines, as they might be lost or show unexpected behaviour during maintenance of the underlying cloud infrastructure.

If no vGPU is left, OpenStack will fail to schedule your VM.

Prerequisites

A VM must be prepared before it can use a vGPU assigned to it.

Please pick one of the GPU flavors mentioned in the Quick Introduction.

Create a VM with the CentOS 8.2 image and a GPU flavor. You might want to take a look at the First Steps page.

$ openstack server create --flavor gpu.m --security-group your_group --key-name your_key --network your_net --image CentOS-8-GenericCloud-8.2.2004-20200611.2.x86_64 gpu-vm
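
Since scheduling fails when no vGPU is left, it can help to wait for the VM to leave the build state before logging on. The following is a minimal sketch, assuming the VM name gpu-vm from the command above; wait_for_server is a hypothetical helper, not part of the OpenStack CLI, and the retry count and delay are arbitrary defaults.

```shell
# Sketch: poll a server until it is ACTIVE or ERROR.
# wait_for_server is a hypothetical helper name; the defaults
# (30 tries, 10 s delay) are assumptions, not recommended values.
wait_for_server() {
    local name tries delay i status
    name="$1"; tries="${2:-30}"; delay="${3:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Ask OpenStack for the bare status value of the server.
        status="$(openstack server show "$name" -f value -c status)"
        case "$status" in
            ACTIVE) echo "ACTIVE"; return 0 ;;
            ERROR)  echo "ERROR"; return 1 ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    echo "TIMEOUT"
    return 2
}

# Usage:
#   wait_for_server gpu-vm
```

If the helper reports ERROR, the output of openstack server show gpu-vm usually contains a fault entry that explains why scheduling failed, for example because no host with a free vGPU was found.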

Install Nvidia driver for vGPU support

Install the Nvidia vGPU driver and register the vGPU with the license server.

Log on to the VM, bring it up to date, reboot, and install some dependencies.

$ sudo yum update -y # cosmetic for getting the latest kernel in
$ sudo shutdown 1 -r
$ sudo yum install -y gcc make kernel-devel elfutils-libelf-devel libglvnd libglvnd-devel pciutils gcc-c++ epel-release
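
The driver installer builds a kernel module, so the kernel-devel package has to match the kernel that is actually running. A small check along the following lines may save a failed build; kernel_devel_matches is a hypothetical helper name, and the sketch assumes an RPM-based image such as the CentOS one used here.

```shell
# Sketch: confirm that kernel-devel matches the running kernel
# before building the driver. kernel_devel_matches is a
# hypothetical helper name.
kernel_devel_matches() {
    local running
    running="$(uname -r)"
    # rpm -q succeeds only if exactly this package version is installed.
    if rpm -q "kernel-devel-${running}" >/dev/null 2>&1; then
        echo "kernel-devel matches running kernel ${running}"
    else
        echo "kernel-devel for ${running} is missing" >&2
        return 1
    fi
}

# Usage:
#   kernel_devel_matches || echo "reboot into the updated kernel first"
```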

Install the Nvidia driver via the following commands.

$ curl -o /tmp/NVIDIA-Driver.latest.run https://hpsrepo.fz-juelich.de/jusuf/nvidia/NVIDIA-Driver.latest
$ chmod 755 /tmp/NVIDIA-Driver.latest.run
$ sudo mkdir -p /etc/nvidia
$ curl -o /tmp/gridd.conf https://hpsrepo.fz-juelich.de/jusuf/nvidia/gridd.conf
$ sudo mv /tmp/gridd.conf /etc/nvidia/gridd.conf
$ sudo /tmp/NVIDIA-Driver.latest.run --ui=none --no-questions --disable-nouveau
$ sudo shutdown 1 -r
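
Once the VM is back up, it can be worth confirming that the license configuration is in place. The sketch below prints the license server configured in gridd.conf. gridd_server_address is a hypothetical helper name, and the sketch assumes that the downloaded file sets the ServerAddress key used in Nvidia's gridd.conf template.

```shell
# Sketch: print the license server configured in gridd.conf.
# gridd_server_address is a hypothetical helper name; that the
# downloaded file sets ServerAddress is an assumption.
gridd_server_address() {
    local conf
    conf="${1:-/etc/nvidia/gridd.conf}"
    # Lines starting with "#" do not match the pattern and are skipped;
    # only the value after "=" of the first match is printed.
    sed -n 's/^[[:space:]]*ServerAddress[[:space:]]*=[[:space:]]*//p' "$conf" | head -n 1
}

# Usage:
#   gridd_server_address
```

An empty result suggests that the file is missing the entry or was not copied to /etc/nvidia.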

After the driver build has finished successfully and you have rebooted the VM, a GPU device should be visible:

$ nvidia-smi
Tue Sep  1 08:13:12 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.56       Driver Version: 440.56       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID V100-4C        On   | 00000000:00:05.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    304MiB /  4096MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
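
For automated deployments, the same check can be done without reading the table. The following sketch uses the CSV query mode of nvidia-smi. vgpu_visible and the NVIDIA_SMI variable are hypothetical names introduced here for illustration.

```shell
# Sketch: check from a script that a GPU device is visible.
# vgpu_visible and NVIDIA_SMI are hypothetical names; NVIDIA_SMI
# only allows pointing the helper at a different binary.
vgpu_visible() {
    local smi info
    smi="${NVIDIA_SMI:-nvidia-smi}"
    # Query only the device name and total memory, one line per GPU.
    info="$("$smi" --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null)" || return 1
    [ -n "$info" ] || return 1
    echo "$info"
}

# Usage:
#   vgpu_visible || echo "no GPU visible, check the driver installation"
```

The helper returns a non-zero status when nvidia-smi is missing, fails, or reports no device, so it can gate later steps of a deployment script.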

You can now proceed with CUDA, VirtualGL, or some other GPU workload on the device shown.