Half of COSMA7 now has a Rockport Ethernet network fabric, rather than InfiniBand.
This is available for anyone with an active DiRAC project on COSMA7 to use. Please ask if you are unsure (or just try to use it).
Please note, COSMA7 nodes have 28 cores and 512 GB RAM each. Both the cosma7 and cosma7-rp (Rockport) partitions have 224 nodes.
This is exploratory and will allow us to benchmark and explore this technology for use in future DiRAC procurements.
It is based on a 6D torus topology, rather than the 2:1 blocking fat-tree fabric used by the rest of COSMA7. It should offer consistently low latency even when the network is heavily congested, so workloads that are latency-sensitive, or that suffer under congestion, should benefit. Each node has 100 Gbit/s connectivity (as do the InfiniBand nodes).
To use the Rockport nodes, you should submit jobs to the cosma7-rp partition. For example, within your batch script you would use:
#SBATCH -p cosma7-rp
You are also advised to load the rockport-settings environment module within your SLURM batch script:
module load rockport-settings
This sets some environment variables which then allow the Rockport fabric to be used most efficiently. To see these, you can use module show rockport-settings.
module show rockport-settings
If using Intel MPI, you can simply use mpirun.
If using OpenMPI, use mpirun $RP_OPENMPI_ARGS ...
mpirun $RP_OPENMPI_ARGS
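For example, a launch line inside a batch script might look like the following, where ./my_program is a hypothetical executable name standing in for your own code. For Intel MPI:
mpirun ./my_program
and for OpenMPI:
mpirun $RP_OPENMPI_ARGS ./my_program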
The recommended MPI libraries to use are (as of June 2022):
Intel 2018: module load intel_comp/2018
module load intel_comp/2018
Intel 2022: module load intel_comp/2022.1.2 compiler mpi (and you may also wish to module load ucx)
module load intel_comp/2022.1.2 compiler mpi
module load ucx
OpenMPI 4.1.1: module load openmpi/4.1.1
module load openmpi/4.1.1
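As a minimal sketch (not a definitive recipe), a complete batch script using the Intel 2022 stack on the Rockport partition might look like the following. The project account, node and task counts, wall time and the ./my_program executable are placeholders to adapt to your own job:

#!/bin/bash -l
#SBATCH -p cosma7-rp
#SBATCH -A <your_project>          # your DiRAC project account (placeholder)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28       # COSMA7 nodes have 28 cores each
#SBATCH -t 01:00:00

module load intel_comp/2022.1.2 compiler mpi
module load ucx
module load rockport-settings

# With Intel MPI, a plain mpirun is sufficient on the Rockport fabric
mpirun ./my_program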
To enable multipath on a per-job basis (when using UCX), add the following to the mpirun command line. For OpenMPI:
-x UCX_IB_TRAFFIC_CLASS=160
and for Intel MPI (after 2018):
-genv UCX_IB_TRAFFIC_CLASS=160
This allows packets to take multiple routes between pairs of nodes, increasing the direct bandwidth between them.
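Putting this together, a launch line with multipath enabled might look like the following sketch, with ./my_program again a placeholder. For OpenMPI:
mpirun $RP_OPENMPI_ARGS -x UCX_IB_TRAFFIC_CLASS=160 ./my_program
and for Intel MPI (2022):
mpirun -genv UCX_IB_TRAFFIC_CLASS=160 ./my_program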
Rockport recommend some other settings for RDMA transports:
--fwd-mpirun-port
--mca oob_tcp_listen_mode listen_thread
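These appear to be OpenMPI mpirun options; a combined launch line might look like the following sketch (again with ./my_program as a placeholder):
mpirun $RP_OPENMPI_ARGS --fwd-mpirun-port --mca oob_tcp_listen_mode listen_thread ./my_program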
The Rockport partition does not currently have dedicated (fabric-native) storage. Instead, the existing COSMA storage (/cosma7 etc.) is routed through 4 dedicated nodes, which results in a slight reduction in performance. Therefore, when benchmarking codes, please exclude any time spent reading from or writing to storage.
Not all storage systems are mounted on the Rockport partition - please check before use. For example, /cosma6 is not currently mounted.
We are very interested to see how the Rockport partition compares (performance-wise) with the cosma7 partition. Therefore, if you are able to do any tests or code comparisons, please let us know, and pass on any results that you get.
An additional time allocation is available for benchmarking - please let us know if you would like to use this.