A DPU hackathon was held in February 2023, which included many useful tips on how to use DINE. A presentation with usage examples (in particular the lab exercises on slides 45 and 84) is available from cosma-support: please request UK_DPU_Hackathon.pptx.
DINE has several networks:
The management network, used for login, SLURM submission, etc. Both the hosts and the BlueField cards are connected to this network. Nodes can be specified as:
b[101-124] for the hosts, and
bluefield[101-124] for the cards
This network is accessible from the login nodes.
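For example, from a COSMA login node these names can be used directly (a minimal sketch; ssh to a compute node may require an active job on it):
ssh b101          # an x86 host
ssh bluefield101  # the corresponding BlueField card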
The high-performance (200 Gbit/s HDR InfiniBand) fabric, used for inter-node communication within jobs and for some file system access. Both the hosts and the BlueField cards are connected to this fabric. Nodes can be specified as:
bfd[101-124].ib for the hosts
bfh[101-124].ib for the cards
This fabric is not accessible from the login nodes: it is only accessible to other DINE nodes (hosts and BlueField cards).
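For a quick connectivity check over this fabric (a sketch; run from a DINE node, e.g. inside an interactive job, since these names are not visible from the login nodes):
ping -c 1 bfd102.ib   # another host over the HDR fabric
ping -c 1 bfh101.ib   # a BlueField card over the HDR fabric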
From a host, the hostname bfl provides access to the attached BlueField card over a slow internal network.
To submit to DINE, you need to belong to the "durham" or "do008" group, and include directives such as:
#SBATCH -p bluefield1
#SBATCH -A durham
or
#SBATCH -A do008
(or another project account, as appropriate).
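Putting these directives together, a minimal submission script might look like the following (a sketch; the time limit, node count and executable are placeholders, and suitable compiler/MPI modules are assumed to be loaded):
#!/bin/bash -l
#SBATCH -p bluefield1
#SBATCH -A durham      # or do008, as appropriate
#SBATCH -t 00:10:00
#SBATCH --nodes=2
module purge
# load your compiler/MPI modules here
mpirun ./my_program    # placeholder executable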
Some OpenMPI versions may require:
mpirun --mca btl_tcp_if_include p1p2
To help with stability, adding the following option may also help (use p2p1 instead, as appropriate):
--mca oob_tcp_if_include p1p2
A complete example combining these options:
mpirun --mca btl_tcp_if_include p1p2 --mca oob_tcp_if_include p1p2 -x UCX_NET_DEVICES=mlx5_1:1 my_script
Another setting which may help performance when using RoCE devices (e.g. p2p1, mlx5_2):
--mca btl ^openib
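For example, combined with the interface selection above (a sketch; substitute the interface and UCX device names that actually exist on your nodes):
mpirun --mca btl ^openib --mca btl_tcp_if_include p2p1 -x UCX_NET_DEVICES=mlx5_2:1 my_script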
With mpiexec (e.g. Intel MPI), the following might help:
mpiexec -genv UCX_NET_DEVICES mlx5_1:1 ...
For Intel MPI using the DAPL fabric, the following settings can be used:
export I_MPI_FABRICS=shm:dapl
export DAT_OVERRIDE=./dat.conf
export I_MPI_FALLBACK=0
with a local dat.conf file containing entries such as:
ofa-v2-cma-p2p1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "p2p1 0" ""
ofa-v2-mlx5_2-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_2 1" ""
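Put together in a job script, this might look like the following (a sketch; assumes Intel MPI is loaded and that the dat.conf is written in the working directory, with a placeholder executable):
# Write the DAPL provider definitions to a local dat.conf
cat > dat.conf <<'EOF'
ofa-v2-cma-p2p1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "p2p1 0" ""
ofa-v2-mlx5_2-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx5_2 1" ""
EOF
export I_MPI_FABRICS=shm:dapl
export DAT_OVERRIDE=./dat.conf
export I_MPI_FALLBACK=0
mpirun ./my_program   # placeholder executable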
To submit a job that will run across both the x86 host cores and the arm64 cores on the BlueField cards, the following procedure can be used (courtesy of Mark Turner).
A SLURM script such as:
#!/bin/bash -l
#SBATCH -o smartmpi.out
#SBATCH -e smartmpi.err
#SBATCH -t 00:30:00
#SBATCH --nodes=2
#SBATCH -p bluefield1
#SBATCH -A durham   # or do008, as appropriate
module purge
module load python/3.6.5
# Get a comma separated list of IPs for the host and
# Smart NICs that SLURM has assigned us
IPs=$( python3 smartmpi/scripts/dine_config.py )
echo "IPs in use: " $IPs
# Assumes alternating topology with 2 ranks per node
# (one on x86; one on arm64)
np=$(( $SLURM_JOB_NUM_NODES * 2 ))
echo "Num processes: " $np
# Prevent SLURM from blocking the use of Smart NICs
unset SLURM_JOBID
mpirun --mca btl_tcp_if_exclude tmfifo_net0,lo,ib0,em1 -host $IPs -np $np launcher_script.sh
Where the dine_config.py file is defined as:
"""
This script is intended for use on the DINE cluster. It should be used within SLURM
jobs before the mpirun command.
It prints the comma separated IPs for the x86 hosts and arm64 Smart NICs allocated
to us by SLURM. In SLURM I capture this stdout within a variable and pass it to
the `-host` argument to mpirun when not using a rankfile.
For full documentation on smarTeaMPI on DINE, see docs/.
"""
import os


def extract_nodes(nodes):
    for node_entry in nodes.split(','):
        elem = node_entry.split('-')
        if len(elem) == 1:
            yield int(elem[0])
        elif len(elem) == 2:
            node_range = list(map(int, elem))
            for i in range(node_range[0], node_range[1]+1):
                yield i
        else:
            raise ValueError('format error in %s' % node_entry)


def print_ips(node_list):
    node_ips = []
    for node in extract_nodes(node_list):
        basenumber = 100
        ip_elem = node - basenumber
        node_ips.append(f"192.168.101.{2*ip_elem-1}")
        node_ips.append(f"192.168.101.{2*ip_elem}")
    print(",".join(node_ips))


if __name__ == "__main__":
    # Strip the leading "b[" and trailing "]" from a list such as "b[101-103]"
    node_list = os.environ["SLURM_JOB_NODELIST"]
    print_ips(node_list[2:-1])
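To see what the script produces, it can be run by hand with a hand-set node list (a sketch; the node list below is hypothetical). Each allocated node contributes two IPs, one for the x86 host and one for its Smart NIC:
export SLURM_JOB_NODELIST='b[101-102]'
python3 smartmpi/scripts/dine_config.py
# prints: 192.168.101.1,192.168.101.2,192.168.101.3,192.168.101.4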
And the launcher script is (assuming Peano is the code to be run):
case "$HOSTNAME" in
"bluefield"* )
export OMP_NUM_THREADS=16
./Peano_bfd/examples/exahype2/euler/peano4 --timeout 300
;;
"b1"* )
export OMP_NUM_THREADS=32
./Peano/examples/exahype2/euler/peano4 --timeout 300
esac
These scripts can be found in /cosma/home/sample-user/dine/
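Since mpirun invokes launcher_script.sh directly, it must be executable before the job is submitted (a sketch; the batch script file name is a placeholder):
chmod +x launcher_script.sh
sbatch smartmpi_job.sh   # the SLURM script shown above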
DINE is an AMD system, and Intel MKL is known to be hobbled on non-Intel CPUs. There are some workarounds to improve performance.
If you compile with -xHost, you may get an error such as:
Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, POPCNT and AVX instructions.
Some links: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/799716
https://github.com/QMCPACK/qmcpack/issues/1158
To work around the problem, set:
export FI_PROVIDER=tcp
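This is an environment variable, so it needs to be exported in the job script before the application is launched, e.g. (a sketch; the executable name is a placeholder):
export FI_PROVIDER=tcp
mpirun ./my_program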