Nvidia GPUDirect
Background Research
See Background Research for background information I studied before doing this.
Helpful Links
RDMA Aware Programming Typical Application Flow
Source Code
See the test file
Description of the Problem
In the steps below we play a packet from one Mellanox card directly into another. Before sending the packet, we fill a region of GPU memory with a specific pattern; the packet we send carries a different pattern. As a proof of concept, we expect the packet’s data to overwrite this memory buffer.
See this post for the original description of the problem.
- A queue pair and its associated resources are established exactly as described in the [generic application flow](https://docs.nvidia.com/networking/display/RDMAAwareProgrammingv17/Typical+Application)
  - Lines 0-192 of the attached code
- Register a region of host memory and fill it with a known pattern
  - Lines 192-195
- Register a region of GPU memory
  - Lines 197-223
- Send a packet containing a known pattern from one Mellanox device to another
  - Lines 223-375
- Copy the data from the GPU device’s memory region into host system memory, which we expect to overwrite the host memory’s bit pattern with the one we just sent
  - Lines 375-380
- Confirm that the memory patterns match. We sent a new pattern from one Mellanox device to the other and then copied what the GPU received over the pattern already in system memory, so system memory should now hold the pattern that was sent.
  - Lines 391-396
At no point does the CUDA toolkit report an error. Every call returns success; however, the pattern in system memory is not overwritten. We need to determine why this simple POC does not work in order to move forward with the customer.
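The final comparison in the steps above can be made more informative than a plain match/no-match. The following is a minimal sketch, not taken from the test file: `PatternState` and `classify_buffer` are hypothetical names, and it assumes both patterns are a single repeated byte and that the buffer passed in holds only the payload region already copied back to the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for the check; not part of the test file.
enum class PatternState {
    Overwritten,  // every byte holds the pattern that was sent
    Unchanged,    // every byte still holds the initial pattern
    Mixed         // anything else (partial write, offset error, other data)
};

// Classify a host-side copy of the receive buffer.
// `initial` is the byte written before the send, `sent` is the byte carried
// in the packet payload. Both are assumptions about how the patterns are built.
PatternState classify_buffer(const std::vector<std::uint8_t>& host_copy,
                             std::uint8_t initial, std::uint8_t sent) {
    std::size_t n_initial = 0;
    std::size_t n_sent = 0;
    for (std::uint8_t byte : host_copy) {
        if (byte == sent) {
            ++n_sent;
        } else if (byte == initial) {
            ++n_initial;
        }
    }
    if (!host_copy.empty() && n_sent == host_copy.size()) {
        return PatternState::Overwritten;
    }
    if (n_initial == host_copy.size()) {
        return PatternState::Unchanged;
    }
    return PatternState::Mixed;
}
```

Distinguishing `Unchanged` from `Mixed` helps narrow the failure: an untouched buffer suggests the data never reached the buffer being checked, while a mixed buffer points at a length or offset problem.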
Lab Configuration
Hardware Configuration
Dell R750 with a Mellanox MLX6 (ConnectX-6 Dx) as the transmitting device and an MLX5 (ConnectX-5 Ex) as the receiving device. Worth noting is that at the time of writing there is no special MLX6 driver. All device names will appear as MLX5.
RHEL Version
    NAME="Red Hat Enterprise Linux"
    VERSION="8.5 (Ootpa)"
    ID="rhel"
    ID_LIKE="fedora"
    VERSION_ID="8.5"
    PLATFORM_ID="platform:el8"
    PRETTY_NAME="Red Hat Enterprise Linux 8.5 (Ootpa)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
    HOME_URL="https://www.redhat.com/"
    DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"

    REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
    REDHAT_BUGZILLA_PRODUCT_VERSION=8.5
    REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
    REDHAT_SUPPORT_PRODUCT_VERSION="8.5"
    Red Hat Enterprise Linux release 8.5 (Ootpa)
    Red Hat Enterprise Linux release 8.5 (Ootpa)

GCC Version
    [root@gputest ~]# gcc --version
    gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4)
    Copyright (C) 2018 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Nvidia SMI
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-PCI...  Off  | 00000000:CA:00.0 Off |                    0 |
    | N/A   26C    P0    35W / 250W |    541MiB / 40536MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      4969      G   /usr/libexec/Xorg                  26MiB |
    |    0   N/A  N/A      5345      G   /usr/bin/gnome-shell               98MiB |
    |    0   N/A  N/A    606994      C   /tmp/test/rdma-loopback           413MiB |
    +-----------------------------------------------------------------------------+

CUDA Version
    [root@gputest ~]# nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Thu_Nov_18_09:45:30_PST_2021
    Cuda compilation tools, release 11.5, V11.5.119
    Build cuda_11.5.r11.5/compiler.30672275_0

### MLX Config
See [MLX5_0](images/mlx5_0.log) and [MLX5_2](images/mlx5_2.log)
## Installation
### MLNX_OFED
1. Download MLNX_OFED drivers from https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed
   1. MLNX_OFED is version dependent. I suggest you use `subscription-manager release --set=8.4` to ensure your version of RHEL stays at the version for which MLNX_OFED was compiled
2. Upload the ISO to the target box
3. Run:

```bash
dnf group install "Development Tools" -y
dnf install -y tk tcsh tcl gcc-gfortran kernel-modules-extra gcc-g++ gdb rsync ninja-build make zip
mount MLNX* /mnt
cd /mnt
./mlnxofedinstall
```

Setting Up the Code Environment
CUDA Development Packages
- Make sure you have an Nvidia GPU that shows up on the system with `lspci | grep -i nvidia`
- Make sure you have the kernel development headers for your running kernel with `rpm -qa | grep devel | grep kernel && uname -r`
- Run the following (see Nvidia’s instructions for details):

    dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
    subscription-manager repos --enable=rhel-8-for-x86_64-appstream-rpms
    subscription-manager repos --enable=rhel-8-for-x86_64-baseos-rpms
    subscription-manager repos --enable=codeready-builder-for-rhel-8-x86_64-rpms
    sudo rpm -i cuda-repo-rhel8-11-5-local-11.5.1_495.29.05-1.x86_64.rpm
    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo dnf clean expire-cache
    sudo dnf module install -y nvidia-driver:latest-dkms
    sudo dnf install -y cuda
    export PATH=/usr/local/cuda-11.5/bin${PATH:+:${PATH}}
    echo 'export PATH=/usr/local/cuda-11.5/bin${PATH:+:${PATH}}' >> /root/.bashrc
    modprobe nvidia-peermem # YOU MUST RUN THIS MANUALLY
    # Below is just for debugging. You don't have to install them.
    # Make sure, even though it is RHEL 8, you use the word yum here.
    yum debuginfo-install libgcc-8.5.0-4.el8_5.x86_64 libibverbs-55mlnx37-1.55103.x86_64 libnl3-3.5.0-1.el8.x86_64 libstdc++-8.5.0-4.el8_5.x86_64 nvidia-driver-cuda-libs-495.29.05-1.el8.x86_64

WARNING: Whenever you want to run this code you must manually load the nvidia-peermem module. See Nvidia peermem. Load it with `modprobe nvidia-peermem`.
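Because the module has to be loaded by hand, it is easy to forget, so a test program could check for it at startup and fail early. The following is a minimal sketch under that assumption; `module_listed` and `peermem_loaded` are hypothetical helper names that do not appear in the test file. It relies on `/proc/modules` listing one module per line with the name as the first field, and on the kernel reporting hyphens in module names as underscores.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns true if `module` is the first field of any line in a
// /proc/modules-style stream. Hyphens are normalised to underscores,
// so "nvidia-peermem" matches the kernel's "nvidia_peermem".
bool module_listed(std::istream& modules, std::string module) {
    for (char& c : module) {
        if (c == '-') {
            c = '_';
        }
    }
    std::string line;
    while (std::getline(modules, line)) {
        std::istringstream fields(line);
        std::string name;
        if (fields >> name && name == module) {
            return true;
        }
    }
    return false;
}

// Convenience wrapper that checks the running kernel.
// Returns false if /proc/modules cannot be opened.
bool peermem_loaded() {
    std::ifstream modules("/proc/modules");
    return modules && module_listed(modules, "nvidia-peermem");
}
```

Calling `peermem_loaded()` before any memory registration and exiting with a clear message when it returns false would turn a forgotten `modprobe` into an obvious error.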
Prepare the Code
First we need to make some manual adjustments to some parameters in the code. For this you need the MAC addresses.
My first challenge was that the system was entirely remote, so I had to figure out how to determine exactly which interfaces belonged to which card. To find the MAC addresses remotely you can use `lspci -v | grep -i ethernet` and compare that with the output of `ethtool -i <interface_name>`. This will allow you to correlate the model name of the NIC to the interface/MAC using the PCIe bus number. For example:
    [root@gputest ~]# ip a s
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: eno8303: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether b0:7b:25:f8:44:d2 brd ff:ff:ff:ff:ff:ff
        inet 172.28.1.40/24 brd 172.28.1.255 scope global dynamic noprefixroute eno8303
           valid_lft 71017sec preferred_lft 71017sec
        inet6 fe80::b27b:25ff:fef8:44d2/64 scope link noprefixroute
           valid_lft forever preferred_lft forever
    3: eno8403: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
        link/ether b0:7b:25:f8:44:d3 brd ff:ff:ff:ff:ff:ff
    4: eno12399: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
        link/ether b4:96:91:cd:e8:ac brd ff:ff:ff:ff:ff:ff
    5: eno12409: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
        link/ether b4:96:91:cd:e8:ad brd ff:ff:ff:ff:ff:ff
    6: ens6f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
        link/ether b8:ce:f6:cc:9e:dc brd ff:ff:ff:ff:ff:ff
    7: ens6f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether b8:ce:f6:cc:9e:dd brd ff:ff:ff:ff:ff:ff
    8: ens5f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether 0c:42:a1:73:8d:e6 brd ff:ff:ff:ff:ff:ff
    9: ens5f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
        link/ether 0c:42:a1:73:8d:e7 brd ff:ff:ff:ff:ff:ff
    10: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
        link/ether 52:54:00:b7:e9:a7 brd ff:ff:ff:ff:ff:ff
        inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
           valid_lft forever preferred_lft forever
    11: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master virbr0 state DOWN group default qlen 1000
        link/ether 52:54:00:b7:e9:a7 brd ff:ff:ff:ff:ff:ff
    [root@gputest test_code]# lspci -v | grep -i ethernet
    04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
    04:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
    31:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
            Subsystem: Intel Corporation Ethernet 25G 2P E810-XXV OCP
    31:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
            Subsystem: Intel Corporation Ethernet 25G 2P E810-XXV OCP
    98:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
    98:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
    b1:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    b1:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    [root@gputest test_code]# ethtool -i ens6f1
    driver: mlx5_core
    version: 5.5-1.0.3
    firmware-version: 22.31.1014 (DEL0000000027)
    expansion-rom-version:
    bus-info: 0000:98:00.1
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes
    [root@gputest ~]# ethtool -i ens5f0
    driver: mlx5_core
    version: 5.5-1.0.3
    firmware-version: 16.27.6106 (DEL0000000004)
    expansion-rom-version:
    bus-info: 0000:b1:00.0
    supports-statistics: yes
    supports-test: yes
    supports-eeprom-access: no
    supports-register-dump: no
    supports-priv-flags: yes

So here we can see from the bus numbers that, in my case, the MLX6 device is ens6f0/ens6f1 and the MLX5 device is ens5f0/ens5f1. My transmit interface will be b8:ce:f6:cc:9e:dd/ens6f1 and my receive interface will be 0c:42:a1:73:8d:e6/ens5f0.
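The bus-number matching described above can also be done programmatically when there are many interfaces to sort through. The following is a minimal sketch that is not part of the test file: `bus_info` and `lspci_line_for` are hypothetical helper names, the inputs are the captured text of `ethtool -i` and `lspci`, and it assumes `bus-info` carries a PCI domain prefix (`0000:`) that `lspci` omits, as in the output above.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Extract the value of the "bus-info:" line from `ethtool -i` output,
// e.g. "0000:98:00.1". Returns an empty string if the line is missing.
std::string bus_info(const std::string& ethtool_output) {
    const std::string key = "bus-info:";
    std::istringstream in(ethtool_output);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, key.size(), key) != 0) {
            continue;
        }
        const std::size_t start = line.find_first_not_of(" \t\r", key.size());
        if (start == std::string::npos) {
            return "";
        }
        const std::size_t end = line.find_last_not_of(" \t\r");
        return line.substr(start, end - start + 1);
    }
    return "";
}

// Return the lspci line whose slot matches `bus` once the domain prefix
// is stripped ("0000:98:00.1" -> "98:00.1"), or an empty string if none does.
std::string lspci_line_for(const std::string& lspci_output,
                           const std::string& bus) {
    const std::size_t colon = bus.find(':');
    if (colon == std::string::npos) {
        return "";
    }
    const std::string slot = bus.substr(colon + 1);
    if (slot.empty()) {
        return "";
    }
    std::istringstream in(lspci_output);
    std::string line;
    while (std::getline(in, line)) {
        // Indented "Subsystem:" lines never start with a slot, so they are skipped.
        if (line.size() > slot.size() &&
            line.compare(0, slot.size(), slot) == 0 &&
            line[slot.size()] == ' ') {
            return line;
        }
    }
    return "";
}
```

Feeding it the `ethtool -i ens6f1` and `lspci` text shown above would return the `98:00.1` line, which names the ConnectX-6 Dx.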
Next you have to figure out the Mellanox device numbers. You can do this using `mlxconfig`. To get this listing, run `mlxconfig -d mlx5_0 q`. In the header of each block you should see something like this:
    Device #1:
    ----------

    Device type:    ConnectX6DX
    Name:           0F6FXM_08P2T2_Ax
    Description:    Mellanox ConnectX-6 Dx Dual Port 100 GbE QSFP56 Network Adapter
    Device:         mlx5_0

From my experimentation they are in order. So mlx5_0 is the first interface on the MLX6 device, which lines up with ens6f0. This means mlx5_1 is the second interface and then mlx5_3 would be the first interface of the MLX5 device, which we can confirm with `mlxconfig -d mlx5_3 q`:
    Device #1:
    ----------

    Device type:    ConnectX5
    Name:           09FTMY_071C1T_Ax
    Description:    Mellanox ConnectX-5 Ex Dual Port 100 GbE QSFP Network Adapter
    Device:         mlx5_3

Compiling and Running the App
    g++ rdma-loopback.cc -o rdma-loopback -libverbs -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart

Debugging
    gdb --args rdma-loopback 0

Inside gdb, `set print pretty on` is helpful.
To get the config of the Mellanox devices, run `mlxconfig -d mlx5_0 q > mlx5_0.log`. Replace mlx5_0 with your device name.