Ed Fauchon-Jones, June 2019

We had been trying to run on COSMA7 since the beginning of March, with most runs failing. Here is a summary of our attempts to solve the issue.

* Two versions of our code
We tried two versions of our code. Both have been used successfully on
COSMA5 and COSMA6 hardware.

* Using multiple compiler/MPI library combinations
We have tried to build our code using the following compiler and MPI library
combinations. All jobs failed using these combinations.
  * gnu_comp/7.3.0, openmpi/3.0.1
  * intel_comp/2018, intel_mpi/2018 (used successfully on COSMA6 hardware)
  * intel_comp/2019-update2, intel_mpi/2019-update2

* Different compiler options
We tried different levels of compiler optimisation, both with and without
debugging information. All led to failures.

* Non-actionable and inconsistent errors
Failures were usually not reproducible: identical jobs typically failed with
different errors. Unfortunately, the error messages never provided enough
actionable information to find a solution, and no one, including our
resident research software engineer, was able to find one at the time.

* Non-actionable instrumented runs
We also performed some instrumented runs, including with AddressSanitizer,
to attempt to debug our code. The instrumented code no longer failed, but it
performed much worse.

These results suggested that the issue was with our code rather than with
COSMA7 or its software modules.
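
The non-reproducibility noted above can be made more concrete by launching the same job several times and counting the distinct ways it ends. The following is a minimal sketch only, not the procedure we used: the function name `tally_outcomes`, the choice of "exit code plus last line of stderr" as a failure signature, and the stand-in command are all assumptions made for illustration.

```python
import subprocess
import sys
from collections import Counter


def tally_outcomes(command, repeats=5, timeout=60):
    """Run the same command several times and count distinct outcomes.

    An outcome is the pair (exit code, last non-empty line of stderr).
    Both the function and this signature are hypothetical choices.
    """
    outcomes = Counter()
    for _ in range(repeats):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout
            )
            lines = [ln for ln in result.stderr.splitlines() if ln.strip()]
            signature = (result.returncode, lines[-1] if lines else "")
        except subprocess.TimeoutExpired:
            # A hang is recorded as its own kind of outcome.
            signature = ("timeout", "")
        outcomes[signature] += 1
    return outcomes


if __name__ == "__main__":
    # Stand-in job: replace with the real launch line for the code under test.
    demo = [sys.executable, "-c", "import sys; sys.exit(0)"]
    for (code, message), count in tally_outcomes(demo, repeats=3).items():
        print(f"{count}x exit={code} {message}")
```

If identical runs produce several different signatures, as ours did, the fault is unlikely to be a simple deterministic bug in the input or configuration.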

In early May we used the newly available openmpi/20190429 module alongside
gnu_comp/7.3.0. This particular combination worked. Based on the earlier
failures, we believe the issue we had with the OpenMPI library was related to
https://github.com/open-mpi/ompi/issues/4686 and was solved by
https://github.com/open-mpi/ompi/pull/4767, which was not available in any
of the OpenMPI modules previously available on COSMA7.