Using BlueField

A DPU hackathon held in February 2023 included many useful tips on how to use DINE. A presentation providing usage examples, in particular the lab exercises on slides 45 and 84, is available from cosma-support: please request UK_DPU_Hackathon.pptx.

Network information

DINE has several networks:

Command and control network

Used for login, SLURM submission, etc. Both the hosts and the BlueField cards are connected to this network. You can address the nodes as:

b[101-124] for the hosts, and

bluefield[101-124] for the cards

This network is accessible from the login nodes.
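
For example, from a login node these names should be reachable over this network (a minimal sketch; node 101 is an arbitrary example, and it assumes ICMP is not blocked):

ping -c1 b101          # an x86 host
ping -c1 bluefield101  # the BlueField card in that host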

InfiniBand network

The high-performance (200 Gbit/s HDR) fabric used for inter-node communication and some file system access. Both the hosts and the BlueField cards are connected to this network. You can address the nodes as:

bfd[101-124].ib for the hosts

bfh[101-124].ib for the cards

This fabric is not accessible from the login nodes: it is only accessible to other DINE nodes (hosts and BlueField cards).
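
For example, from a DINE node (but not from a login node) the IB-side names can be checked in the same way (a sketch; it assumes ICMP over the IPoIB interfaces is permitted):

ping -c1 bfd101.ib   # an x86 host on the IB fabric
ping -c1 bfh101.ib   # the BlueField card in that host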

Local BlueField card access

From a host, the hostname bfl provides access to the attached BlueField card over a slow internal network.
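
For example (a sketch; it assumes SSH is the intended access method and is run from a DINE host, e.g. within a job):

ssh bfl   # log in to the BlueField card attached to this host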

SLURM submission

To submit to DINE, you need to belong to the "durham" or "do008" group, and submit with directives such as:

#SBATCH -p bluefield1

#SBATCH -A durham

or

#SBATCH -A do008

depending on which project you belong to.
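
Putting these directives together, a minimal batch script might look like the following (a sketch only; ./my_program, the time limit and the node count are placeholders to adapt):

#!/bin/bash -l
#SBATCH -p bluefield1
#SBATCH -A durham
#SBATCH -t 00:10:00
#SBATCH --nodes=1

./my_program

Submit it with sbatch from a login node.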

MPI notes

Some OpenMPI versions may require:

mpirun --mca btl_tcp_if_include p1p2

For stability, adding --mca oob_tcp_if_include p1p2 (or p2p1, as appropriate) may also help.

or

mpirun --mca btl_tcp_if_include p1p2 --mca oob_tcp_if_include p1p2 -x UCX_NET_DEVICES=mlx5_1:1 my_script

Another setting which may help performance when on RoCE devices (e.g. p2p1, mlx5_2):

--mca btl ^openib
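
For example, the options above can be combined for a run over the RoCE interface (a sketch; ./my_app, the process count and the interface/device names are placeholders and should match your nodes):

mpirun -np 4 \
  --mca btl_tcp_if_include p2p1 \
  --mca oob_tcp_if_include p2p1 \
  --mca btl ^openib \
  -x UCX_NET_DEVICES=mlx5_2:1 \
  ./my_app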

Intel MPI

Using mpiexec -genv UCX_NET_DEVICES mlx5_1:1 ... might help.
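
For example (a sketch; ./my_app and the process count are placeholders):

mpiexec -genv UCX_NET_DEVICES mlx5_1:1 -n 4 ./my_app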

When using intel_mpi/2018, the following steps are required to run on p2p1 (mlx5_2):
  1. Create a file dat.conf with the contents:
     ofa-v2-cma-p2p1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "p2p1 0" ""
  2. Set the following environment variables:
     export I_MPI_FABRICS=shm:dapl
     export DAT_OVERRIDE=./dat.conf
     export I_MPI_FALLBACK=0

When running on the ib0 (EDR InfiniBand) network, the dat.conf file should instead contain:

ofa-v2-mlx5_2-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_2 1" ""
Setting UCX_NET_DEVICES is only relevant to Intel MPI 2019/2020 and perhaps to OpenMPI.
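
As a shell sketch of the intel_mpi/2018 recipe above (./my_app and the process count are placeholders; the dat.conf entry is the p2p1 variant):

cat > dat.conf <<'EOF'
ofa-v2-cma-p2p1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "p2p1 0" ""
EOF
export I_MPI_FABRICS=shm:dapl
export DAT_OVERRIDE=./dat.conf
export I_MPI_FALLBACK=0
mpiexec -n 4 ./my_app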

SLURM submission to both host (x86) and BlueField device (ARM) cores

To submit a job that will run across both the host (x86) and BlueField (ARM) cores, the following procedure can be used (courtesy of Mark Turner).

A SLURM script such as:

#!/bin/bash -l
#SBATCH -o smartmpi.out
#SBATCH -e smartmpi.err
#SBATCH -p bluefield1
#SBATCH -A durham
#SBATCH -t 00:30:00
#SBATCH --nodes=2

module purge
module load python/3.6.5

# Get a comma separated list of IPs for the host and
# Smart NICs that SLURM has assigned us
IPs=$( python3 smartmpi/scripts/dine_config.py )
echo "IPs in use: " $IPs

# Assumes alternating topology with 2 ranks per node
# (one on x86; one on arm64)
np=$(( $SLURM_JOB_NUM_NODES * 2 ))
echo "Num processes: " $np

# Prevent SLURM from blocking the use of Smart NICs
unset SLURM_JOBID

mpirun --mca btl_tcp_if_exclude tmfifo_net0,lo,ib0,em1 -host $IPs -np $np launcher_script.sh

Where the dine_config.py file is defined as:

import os

def extract_nodes(nodes):
    # Expand a SLURM-style list such as "101-103,105" into individual node numbers.
    for node_entry in nodes.split(','):
        elem = node_entry.split('-')
        if len(elem) == 1:
            yield int(elem[0])
        elif len(elem) == 2:
            node_range = list(map(int, elem))
            for i in range(node_range[0], node_range[1]+1):
                yield i
        else:
            raise ValueError('format error in %s' % node_entry)

def print_ips(node_list):
    # Each DINE node bNNN maps to a pair of addresses on 192.168.101.0/24:
    # one for the x86 host and one for its BlueField card.
    node_ips = []
    for node in extract_nodes(node_list):
        basenumber = 100
        ip_elem = node - basenumber
        node_ips.append(f"192.168.101.{2*ip_elem-1}")
        node_ips.append(f"192.168.101.{2*ip_elem}")
    print(",".join(node_ips))


"""
This script is intended for use on the DINE cluster. It should be used within SLURM
jobs before the mpirun command.

It prints the comma separated IPs for the x86 hosts and arm64 Smart NICs allocated
to us by SLURM. In SLURM I capture this stdout within a variable and pass it to
the `-host` argument to mpirun when not using a rankfile.

For full documentation on smarTeaMPI on DINE, see docs/.
"""
if __name__ == "__main__":
    # Strip the leading "b[" and trailing "]" from e.g. "b[101-102]".
    node_list = os.environ["SLURM_JOB_NODELIST"]
    print_ips(node_list[2:-1])
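
For example, for a two-node allocation the script expands the nodelist into four addresses, two per node. A quick check, run by hand with the script saved as dine_config.py:

SLURM_JOB_NODELIST='b[101-102]' python3 dine_config.py
# prints: 192.168.101.1,192.168.101.2,192.168.101.3,192.168.101.4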

And the launcher script is (assuming Peano is the code to be run):

#!/bin/bash -l

# Pick the binary and thread count according to whether this rank landed
# on a BlueField card (ARM) or an x86 host.
case "$HOSTNAME" in
"bluefield"* )
    export OMP_NUM_THREADS=16
    ./Peano_bfd/examples/exahype2/euler/peano4 --timeout 300
    ;;
"b1"* )
    export OMP_NUM_THREADS=32
    ./Peano/examples/exahype2/euler/peano4 --timeout 300
    ;;
esac

These scripts can be found in /cosma/home/sample-user/dine/

Intel MKL

DINE is an AMD system, and Intel MKL is known to be hobbled on non-Intel CPUs. There are some workarounds to improve performance.

Intel Compiler

If you compile with -xHost, the resulting binary may refuse to run on the AMD nodes with an error such as:

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, POPCNT and AVX instructions.
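
One possible workaround (an assumption based on general Intel compiler behaviour, not taken from the DINE documentation) is to target the instruction set explicitly instead of using -xHost, since the -march form does not embed the Intel-only CPU check:

# Hypothetical example: build for AVX2 without the -xHost CPU check
icc -O2 -march=core-avx2 -o my_app my_app.c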

Intel MKL problems?

Some links: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/799716

https://github.com/QMCPACK/qmcpack/issues/1158

To solve the problem:

export FI_PROVIDER=tcp