Charm++ programs hang with mpi-crayxc build on Edison when run on 2 hosts
The hang occurs when a 2-process (logical node) run is made across 2 hosts. A 2-process run on a single host works. The other architecture of the same layer (mpi-linux-x86_64) also works, both on Golub and on a lab machine.
#2 Updated by Nitin Bhat about 1 year ago
- Subject changed from examples/zerocopy/pingpong hangs with mpi-crayxc build on Edison when run on 2 hosts to Charm++ programs hang with mpi-crayxc build on Edison when run on 2 hosts
I ran into this while running examples/charm++/zerocopy/pingpong/ to get numbers for the mpi-crayxc build. I later found that other programs, such as leanmd and hello, were also hanging.
1. Build charm with: `./build charm++ mpi-crayxc -j24 -g -O0`
2. Allocate 2 physical nodes interactively on Edison with: `salloc -N 2 -p debug -t 00:20:00 --ntasks-per-node=1`
3. Run leanmd (I see the hang with any Charm++ program) with: `srun -n 2 -c 1 ./leanmd` or `srun -n 2 -c 2 ./leanmd`
#4 Updated by Nitin Bhat about 1 year ago
When does this issue occur?
The issue presumably stems from an incompatibility between Cray MPI and the Cray hugepages module (craype-hugepages). Sam mentioned this in a Slack discussion.
In testing, the applications worked fine when craype-hugepages was not loaded at build time.
YES - craype-hugepages8M is loaded
NO - craype-hugepages8M is not loaded
|Charm++ Build Time|Application Build Time|Application Runtime|Does the Issue occur?|
|YES|YES|YES|YES (Hang Occurs)|
|NO|YES|YES|YES (Hang Occurs)|
So the issue never occurs when the module is not loaded at application build time. A possible fix is to add a check at application build time that ensures the module is not loaded.
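A build-time check of this kind could look roughly like the sketch below. It is only an illustration, not the actual Charm++ change: the function name `hugepages_module_loaded` is hypothetical, and it assumes the module system exports the loaded modules as a colon-separated list in `LOADEDMODULES`, as Environment Modules does.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (returns 0) if any craype-hugepages* module
# appears in LOADEDMODULES, which is assumed to be a colon-separated list.
hugepages_module_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *:craype-hugepages*) return 0 ;;  # e.g. craype-hugepages8M
        *) return 1 ;;
    esac
}

# Example use in a build wrapper: warn before compiling the application.
if hugepages_module_loaded; then
    echo "warning: a craype-hugepages module is loaded; mpi-crayxc runs may hang" >&2
fi
```

Whether such a check should only warn or should abort the build is a design choice; the sketch warns so that builds known to work with the module loaded are not blocked.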
#6 Updated by Jim Phillips about 1 year ago
The mpi-crayxc build of NAMD works fine with craype-hugepages8M loaded at Charm++ build, NAMD build, and run.
I also set the following at runtime:
setenv HUGETLB_DEFAULT_PAGE_SIZE 8M
setenv HUGETLB_MORECORE no
whereas the module itself sets
setenv HUGETLB_MORECORE yes
Does this happen only with exactly two PEs?