Ed Fauchon-Jones, June 2019
We had been trying since the beginning of March to run on COSMA7, with most runs resulting in failures. Here is a summary of our attempts to solve this issue.
* We have tried with two versions of our code. Both versions of our code have been used successfully on COSMA5 and COSMA6 hardware.
* Using multiple compiler/MPI library combinations. We have tried to build our code using the following compiler and MPI library combinations. All jobs failed using these combinations.
  * gnu_comp/7.3.0, openmpi/3.0.1
  * intel_comp/2018, intel_mpi/2018 (used successfully on COSMA6 hardware)
  * intel_comp/2019-update2, intel_mpi/2019-update2
* Different compiler options. We tried different levels of compiler optimisation, both with and without debugging information, all leading to failures.
* Non-actionable and inconsistent errors. Failures were usually not reproducible: identical jobs typically failed with different errors. Unfortunately, the error messages never provided enough actionable information to find a solution, and no one, including our resident research software engineer, was able to find one at the time.
* Non-actionable instrumented runs. We also performed some instrumented runs, including with AddressSanitizer, to attempt to debug our code. The instrumented code no longer failed, but it performed much worse.
Based on these results, the issue appeared to be with our code rather than with COSMA7 or the software modules.
In early May we used the newly available openmpi/20190429 module alongside gnu_comp/7.3.0. This particular combination worked. Based on the previous failures, we believe the issue we had with the OpenMPI library was related to https://github.com/open-mpi/ompi/issues/4686 and was solved by https://github.com/open-mpi/ompi/pull/4767, which was not available in any of the OpenMPI modules previously available for COSMA7.