AMPI failure with migration under Cray compiler due to tcmalloc bugs or incompatibility
We are seeing errors in AMPI with a version of a Jacobi code built with the Cray compiler. The Jacobi code makes periodic calls to MPI_Migrate(), which trigger migration when used with RotateLB.
To isolate the problem, we created a smaller code that reproduces a minimal subset of the Jacobi behavior and still fails with the same execution error as Jacobi. The pseudo-code for this example is just the following:
allocate array (~1 KB memory)
write to array
When executed on two JYC nodes, the code above seg-faults after one or two migrations. There are two versions of the code, one in Fortran and one in C, and both fail on JYC when compiled with an AMPI built with the Cray compiler. When the program is compiled with an AMPI built with GNU, the code works fine.
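For readers without access to JYC, the allocate-and-write pattern from the pseudo-code might look roughly like the C sketch below. This is not the code stored under ~cmendes/BUG-AMPI-BUG-2/. The function name run_reproducer, the migrate_hook callback, the iteration loop, and the migration period are all assumptions made for illustration; under AMPI the hook would wrap the MPI_Migrate() call.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hook: under AMPI this would wrap the MPI_Migrate() call.
 * Passing NULL runs the loop without any migration. */
typedef void (*migrate_hook)(void);

/* Assumed structure: allocate once, then write to the array every
 * iteration and call the hook every `migrate_period` iterations.
 * Returns the number of completed iterations, or -1 if malloc fails. */
int run_reproducer(int iterations, int migrate_period, migrate_hook migrate)
{
    const size_t nbytes = 1024;            /* ~1 KB, as in the pseudo-code */
    unsigned char *array = malloc(nbytes); /* allocate array */
    int done = 0;

    if (array == NULL)
        return -1;

    for (int i = 1; i <= iterations; i++) {
        memset(array, i & 0xff, nbytes);   /* write to array */
        if (migrate != NULL && migrate_period > 0 && i % migrate_period == 0)
            migrate();                     /* periodic migration point */
        done++;
    }

    free(array);
    return done;
}
```

With the failing build, the expectation from the report is that a write or allocation following the first or second migration would crash; with the GNU build the loop should run to completion.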
The programs with the C and Fortran examples are on JYC, with proper Makefiles, pbs scripts and examples of the produced outputs for the good and bad executions. This is under ~cmendes/BUG-AMPI-BUG-2/ with a C/ and a FORTRAN/ subdir for the C and Fortran versions, respectively. To use the Makefile, it is necessary to have the correct PrgEnv module loaded, e.g. to produce the version with the Cray compiler it's necessary to have PrgEnv-cray loaded. The build can be done as "make frotate-cray", for the Fortran case and the Cray compiler.
The compilation flags in the Makefile are needed for building the real Jacobi example, which is intended to be extended in the future to use GPUs (hence dynamic linking is needed). The Charm++ version in use was checked out from the Git repository on March 6, 2013, and the Cray compiler in use was cce/8.1.5 (the current default on JYC). The PBS script sets the variable ATP_ENABLED so that a stack trace is displayed upon a seg-fault.
#2 Updated by Nikhil Jain over 5 years ago
Seems like isomalloc and tcmalloc don't like each other. A hack for now is to install a malloc outside Charm, force it to use dl* function names, and link to those functions using dlsym. Nothing else seems to be working. Steps:
1. Download ptmalloc and install a threadsafe version with dl* function naming enabled.
2. Checkout isomalloc-cray-hack branch of Charm.
3. Compile your code with additional LDOPTS="-L/u/staff/nikhilj/installs/ptmalloc3 -lptmalloc3 -lpthread -Wl,-rpath=/u/staff/nikhilj/installs/ptmalloc3" with os-isomalloc.
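To illustrate the general idea of looking up dl*-named allocator functions with dlsym, here is a minimal C sketch. It is not taken from the isomalloc-cray-hack branch. The symbol names dlmalloc and dlfree are an assumption about what the dl* naming option produces, and struct allocator and resolve_allocator are hypothetical names. When the symbols are not found, the sketch falls back to the regular malloc and free.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*malloc_fn)(size_t);
typedef void (*free_fn)(void *);

/* Hypothetical holder for the allocator entry points in use. */
struct allocator {
    malloc_fn alloc;
    free_fn release;
    int external;   /* 1 if the dl* symbols were found, 0 otherwise */
};

/* Look up dlmalloc/dlfree (assumed names) among the symbols visible to
 * the running program. Both must resolve, so that allocation and free
 * always come from the same allocator. */
struct allocator resolve_allocator(void)
{
    struct allocator a = { malloc, free, 0 };
    void *self = dlopen(NULL, RTLD_LAZY);   /* handle for the main program */

    if (self == NULL)
        return a;

    void *m = dlsym(self, "dlmalloc");
    void *f = dlsym(self, "dlfree");
    if (m != NULL && f != NULL) {
        a.alloc = (malloc_fn)m;
        a.release = (free_fn)f;
        a.external = 1;
    }
    return a;
}
```

A caller would use a.alloc(n) and a.release(p) in place of malloc and free. On systems with an older glibc the program may also need to be linked with -ldl.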
#3 Updated by Phil Miller over 5 years ago
Per https://bluewaters.ncsa.illinois.edu/known-issues, tcmalloc is known to be problematic on BW:
TCMALLOC: Codes may crash in tcmalloc. Cray is working on fixing these. In the mean time, use the flag "-hsystem_alloc" to avoid linking in tcmalloc. Using this flag may result in a new set of problems. So use it with caution. Continue to open bugs on Jira if you see problems with tcmalloc.
So, it's not just us or isomalloc's problem.
#5 Updated by Phil Miller over 5 years ago
- Status changed from Resolved to Upstream
What's the resolution? Will ampicc make sure that -hsystem_alloc gets passed until this is fixed? Or are we just documenting the issue and instructing users to pass it themselves?
Either way, until Cray deals with it, let's keep it open on our tracker (status Upstream) so that we can clean up the workaround once it's unnecessary.