Charm++ programs hang with mpi-crayxc build on Edison when run on 2 hosts
The hang occurs when a 2-process (2 logical node) run spans 2 hosts. The same 2-process run works on 1 host. The other architecture for the same layer (mpi-linux-x86_64) also works on Golub and on a lab machine.
#2 Updated by Nitin Bhat 12 months ago
- Subject changed from examples/zerocopy/pingpong hangs with mpi-crayxc build on Edison when run on 2 hosts to Charm++ programs hang with mpi-crayxc build on Edison when run on 2 hosts
I ran into this while running examples/charm++/zerocopy/pingpong/ to get numbers for the mpi-crayxc build. I later found that other programs, such as leanmd and hello, also hang.
1. Build charm with: `./build charm++ mpi-crayxc -j24 -g -O0`
2. Allocate 2 physical nodes interactively on Edison with: `salloc -N 2 -p debug -t 00:20:00 --ntasks-per-node=1`
3. Run leanmd (I see the hang with any Charm++ program) with: `srun -n 2 -c 1 ./leanmd` or `srun -n 2 -c 2 ./leanmd`
#4 Updated by Nitin Bhat 12 months ago
When does this issue occur?
The issue presumably stems from an incompatibility between Cray MPI and the Cray hugepages module (craype-hugepages) when that module is loaded. Sam mentioned this in a Slack discussion.
In testing, applications ran fine when craype-hugepages was not loaded at application build time.
YES - craype-hugepages8M is loaded
NO - craype-hugepages8M is not loaded
|Charm++ Build Time|Application Build Time|Application Runtime|Does the Issue occur?|
|YES|YES|YES|YES (Hang Occurs)|
|NO|YES|YES|YES (Hang Occurs)|
So the issue never occurs when the module is not loaded at application build time. A fix would be to add a check at application build time that ensures the module is not loaded.
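A minimal sketch of such a build-time check, not the actual Charm++ fix. It assumes the module system exports `LOADEDMODULES` as a colon-separated list of loaded modules (the Environment Modules convention). The function names `hugepages_module_loaded` and `check_hugepages_for_build` are hypothetical.

```shell
# Sketch only: detect a loaded craype-hugepages module before building an
# application against an mpi-crayxc Charm++ build.
# Assumption: LOADEDMODULES is a colon-separated list of loaded modules.
# Both function names are hypothetical, not part of Charm++.

hugepages_module_loaded() {
    # Wrap the list in colons so every entry is preceded by ':'.
    case ":${LOADEDMODULES:-}:" in
        *:craype-hugepages*) return 0 ;;  # e.g. craype-hugepages8M is loaded
        *) return 1 ;;
    esac
}

check_hugepages_for_build() {
    if hugepages_module_loaded; then
        echo "warning: a craype-hugepages module is loaded;" \
             "applications built on mpi-crayxc may hang across hosts" >&2
        return 1
    fi
    return 0
}
```

A build wrapper could call `check_hugepages_for_build` and either stop or print the warning and continue. Which behavior is appropriate is left open here.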
#6 Updated by Jim Phillips 12 months ago
The mpi-crayxc build of NAMD works fine with craype-hugepages8M loaded at Charm++ build, NAMD build, and run.
I also set at runtime:
setenv HUGETLB_DEFAULT_PAGE_SIZE 8M
setenv HUGETLB_MORECORE no
but the module sets
setenv HUGETLB_MORECORE yes
Does this only happen with exactly two PEs?
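Since the module and the manual settings above disagree on `HUGETLB_MORECORE`, it may help to record what a job actually sees. The following is a small sketch for a POSIX shell, where `export VAR=value` is the counterpart of csh `setenv VAR value`. The function name `report_hugetlb_env` is hypothetical; the variable names are the ones quoted above.

```shell
# Sketch only: print the hugepage-related settings visible to the current
# shell so that a hanging and a working configuration can be compared.
# The function name is hypothetical; an unset variable is reported as "unset".

report_hugetlb_env() {
    printf 'HUGETLB_DEFAULT_PAGE_SIZE=%s\n' "${HUGETLB_DEFAULT_PAGE_SIZE-unset}"
    printf 'HUGETLB_MORECORE=%s\n' "${HUGETLB_MORECORE-unset}"
}
```

Calling `report_hugetlb_env` in the job script just before `srun` shows whether the module's value or the manual override is in effect.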