Bug #1708

Charm++ programs hang with mpi-crayxc build on Edison when run on 2 hosts

Added by Nitin Bhat 2 months ago. Updated 2 months ago.

Status: Implemented
Priority: Immediate
Assignee:
Category: -
Target version: -
Start date: 10/06/2017
Due date:
% Done: 0%


Description

The hang happens when a 2-process (2 logical nodes) run is made across 2 hosts. A 2-process run on 1 host works. The other architecture of the same layer (mpi-linux-x86_64) also works, both on Golub and on a lab machine.


Related issues

Related to Charm++ - Bug #968: charm++ programs fail to run on BlueWaters due to craype-hugepages8M Upstream 02/04/2016

History

#1 Updated by Nitin Bhat 2 months ago

The bug was not caught by autobuild. My guess is that autobuild runs the mpi-crayxc tests on only 1 host.

#2 Updated by Nitin Bhat 2 months ago

  • Subject changed from examples/zerocopy/pingpong hangs with mpi-crayxc build on Edison when run on 2 hosts to Charm++ programs hang with mpi-crayxc build on Edison when run on 2 hosts

I ran into this while running examples/charm++/zerocopy/pingpong/ to get numbers for mpi-crayxc build. I later found that other programs like leanmd and hello were also hanging.

To reproduce:

1. Build charm with: `./build charm++ mpi-crayxc -j24 -g -O0`
2. Allocate 2 physical nodes interactively on Edison with: `salloc -N 2 -p debug -t 00:20:00 --ntasks-per-node=1`
3. Run leanmd (I see the hang with any Charm++ program) with: `srun -n 2 -c 1 ./leanmd` or `srun -n 2 -c 2 ./leanmd`

#3 Updated by Nitin Bhat 2 months ago

  • Priority changed from Normal to Immediate

#4 Updated by Nitin Bhat 2 months ago

When does this issue occur?
The issue presumably stems from an incompatibility between Cray MPI and Cray hugepages when the craype-hugepages module is loaded. Sam mentioned this in a Slack discussion.

On testing, it was found that the applications work fine when craype-hugepages is not loaded during build time.

In the table below, YES means craype-hugepages8M is loaded at that stage and NO means it is not loaded.

Charm++ build | Application build | Application run | Hang occurs?
--------------|-------------------|-----------------|-------------
YES           | YES               | YES             | YES
YES           | YES               | NO              | NO
YES           | NO                | YES             | NO
YES           | NO                | NO              | NO
NO            | YES               | YES             | YES
NO            | YES               | NO              | NO
NO            | NO                | YES             | NO
NO            | NO                | NO              | NO

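So the hang occurs only when the module is loaded both at application build time and at application run time; whether it was loaded while building Charm++ makes no difference. In particular, the issue never occurs when the module is not loaded during application build time. The fix would be to add a check at application build time that ensures the module is not loaded.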
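A minimal sketch of such a build-time check, not the actual change in charmc. It assumes the Environment Modules convention that loaded modules are listed, colon-separated, in `$LOADEDMODULES`; the function names `hugepages_loaded` and `check_build_env` are hypothetical.

```shell
# Return 0 if any craype-hugepages module appears in $LOADEDMODULES.
# Assumption: the module system exports a colon-separated LOADEDMODULES.
hugepages_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *:craype-hugepages*) return 0 ;;
    esac
    return 1
}

# Hypothetical guard to call before compiling an application.
# Prints a message to stderr and returns non-zero if the module is loaded.
check_build_env() {
    if hugepages_loaded; then
        echo "error: a craype-hugepages module is loaded; unload it before building" >&2
        return 1
    fi
    return 0
}
```

A build script could call `check_build_env || exit 1` before invoking the compiler, so the build stops early instead of producing a binary that may hang at run time.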

#5 Updated by Nitin Bhat 2 months ago

  • Status changed from New to Implemented

Added a fix to check that hugepages is not loaded while building using charmc for mpi-crayxc and mpi-crayxe builds. Fix: https://charm.cs.illinois.edu/gerrit/#/c/3123/

#6 Updated by Jim Phillips 2 months ago

The mpi-crayxc build of NAMD works fine with craype-hugepages8M loaded at Charm++ build, NAMD build, and run.
I also set at runtime:
setenv HUGETLB_DEFAULT_PAGE_SIZE 8M
setenv HUGETLB_MORECORE no
but the module sets
setenv HUGETLB_MORECORE yes

Does this only happen with exactly two pes?

#7 Updated by Jim Phillips 2 months ago

The hang occurs in NAMD with "setenv HUGETLB_MORECORE yes" but not "setenv HUGETLB_MORECORE no".
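For sh-compatible shells, the same override can be applied to a single launch without changing the rest of the session. This is only a sketch based on the NAMD observation above; the wrapper name `run_without_morecore` is hypothetical, and whether the setting avoids the hang for other Charm++ programs is not confirmed here.

```shell
# Run one command with HUGETLB_MORECORE=no, overriding the value the
# craype-hugepages module sets, without altering the calling shell.
run_without_morecore() {
    env HUGETLB_MORECORE=no "$@"
}
```

For example, `run_without_morecore srun -n 2 -c 1 ./leanmd` launches the reproduction case from comment #2 with the override in place.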

#8 Updated by Phil Miller 6 days ago

  • Related to Bug #968: charm++ programs fail to run on BlueWaters due to craype-hugepages8M added
