
Bug #78

AMPI failure with migration under Cray compiler due to tcmalloc bugs or incompatibility

Added by Celso Mendes over 5 years ago. Updated about 1 year ago.

Status:
Upstream
Priority:
Normal
Assignee:
Category:
AMPI
Target version:
Start date:
03/08/2013
Due date:
% Done:

100%

Spent time:

Description

We are seeing errors in AMPI with a version of a Jacobi code built with the Cray compiler. This Jacobi code makes periodic calls to MPI_Migrate(), which trigger migration when run with RotateLB.

To isolate the problem, we created a smaller code that reproduces a minimum of the Jacobi behavior and still fails with the same execution error as Jacobi. The pseudo-code for this example is just the following:

MPI_Init()
allocate array (~1 KB memory)
loop
    write to array
    MPI_Barrier()
    MPI_Migrate()
    MPI_Barrier()
end-loop
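The pseudo-code above corresponds to a minimal AMPI reproducer along these lines. This is a sketch, not the exact code under ~cmendes/BUG-AMPI-BUG-2/; the array size and iteration count are illustrative, and MPI_Migrate() is an AMPI extension (not standard MPI), so it only builds with ampicc and runs under the Charm++ runtime:

```c
/* Minimal AMPI migration reproducer (sketch; build with ampicc,
 * run with a load balancer such as RotateLB).
 * NOTE: MPI_Migrate() is an AMPI extension, not part of standard MPI. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define ARRAY_BYTES 1024  /* ~1 KB, as in the pseudo-code */
#define ITERATIONS 10     /* illustrative; failure was seen after 1-2 migrations */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Heap allocation, so the memory is subject to isomalloc/tcmalloc handling */
    char *array = malloc(ARRAY_BYTES);

    for (int i = 0; i < ITERATIONS; i++) {
        memset(array, i & 0xff, ARRAY_BYTES);  /* write to array */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Migrate();                         /* request migration */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    free(array);
    MPI_Finalize();
    return 0;
}
```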

When executed on two JYC nodes, the code above segfaults after one or two migrations. There are two versions of the code, one in Fortran and one in C, and both fail on JYC when compiled with a version of AMPI generated by the Cray compiler. Meanwhile, when the program is compiled with an AMPI generated with GNU, the code works fine.

The C and Fortran examples are on JYC, with proper Makefiles, PBS scripts, and sample outputs from the good and bad executions. They are under ~cmendes/BUG-AMPI-BUG-2/, with C/ and FORTRAN/ subdirectories for the C and Fortran versions, respectively. To use the Makefile, the correct PrgEnv module must be loaded; e.g., to produce the version with the Cray compiler, PrgEnv-cray must be loaded. The build can then be done as "make frotate-cray" for the Fortran case with the Cray compiler.

The compilation flags in the Makefile are needed for building the real Jacobi example, which is intended to be extended in the future to use GPUs (hence dynamic linking is required). The Charm++ version in use was extracted from the Git repository on March 6, 2013, and the Cray compiler in use was cce/8.1.5 (the current default on JYC). The PBS script sets the variable ATP_ENABLED so that a stack trace is displayed upon a segfault.

History

#1 Updated by Nikhil Jain over 5 years ago

  • Target version set to 6.5.1

#2 Updated by Nikhil Jain over 5 years ago

Seems like isomalloc and tcmalloc don't like each other. A hack for now is to install a malloc outside Charm, force it to use dl*-prefixed names, and link it in using dlsym. Nothing else seems to work. Steps:

1. Download ptmalloc and build a thread-safe version with dl* function naming enabled.
2. Check out the isomalloc-cray-hack branch of Charm.
3. Compile your code with os-isomalloc and the additional LDOPTS="-L/u/staff/nikhilj/installs/ptmalloc3 -lptmalloc3 -lpthread -Wl,-rpath=/u/staff/nikhilj/installs/ptmalloc3".

#3 Updated by Phil Miller over 5 years ago

Per https://bluewaters.ncsa.illinois.edu/known-issues tcmalloc is known to be problematic on BW:

TCMALLOC: Codes may crash in tcmalloc. Cray is working on fixing these. In the mean time, use the flag "-hsystem_alloc" to avoid linking in tcmalloc. Using this flag may result in a new set of problems. So use it with caution. Continue to open bugs on Jira if you see problems with tcmalloc.

So it's not just us, and not specifically isomalloc's problem.

#4 Updated by Nikhil Jain over 5 years ago

  • % Done changed from 0 to 100
  • Status changed from New to Resolved

#5 Updated by Phil Miller over 5 years ago

  • Status changed from Resolved to Upstream

What's the resolution? Will ampicc make sure that -hsystem_alloc gets passed until this is fixed? Or are we just documenting the issue and instructing users to pass it themselves?

Either way, until Cray deals with it, let's keep it open on our tracker (status Upstream) so that we can clean up the workaround once it's unnecessary.

#6 Updated by Phil Miller over 5 years ago

  • Subject changed from AMPI failure with migration under Cray compiler to AMPI failure with migration under Cray compiler due to tcmalloc bugs or incompatibility

Updated the subject so I needn't read through the description and notes every time.

#7 Updated by Phil Miller over 5 years ago

  • Target version changed from 6.5.1 to 6.5.2

#8 Updated by Nikhil Jain almost 5 years ago

  • Target version changed from 6.5.2 to 6.7.0

#9 Updated by Phil Miller about 3 years ago

  • Target version changed from 6.7.0 to 6.8.0

#10 Updated by Phil Miller about 2 years ago

  • Category set to AMPI

#11 Updated by Sam White over 1 year ago

  • Assignee changed from Nikhil Jain to Sam White

Charm/AMPI do actually compile on the Cray Compiler now (CCE 8.5.4+), so I will try it out at some point before the release.

#12 Updated by Sam White over 1 year ago

  • Target version changed from 6.8.0 to 6.8.1

This is very low priority since it is A) AMPI-specific and B) CrayCC-specific.

#13 Updated by Phil Miller about 1 year ago

  • Target version changed from 6.8.1 to Unscheduled

I don't think there's going to be urgency for fixing this any time soon, so pulling it off any release target.
