AMPI zero copy patches are hanging on gni-crayxc
While testing this whole series of patches on Cori (Haswell), I found that the run hangs when calling rget() for an inter-process message transfer. After rget() is called, the callback given in the CkNcpy object (the ampi::completedRecv method) is never invoked.
Here is the top patch in the series of AMPI zero copy changes: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/4084/
It works for all transfers within the same process, and it works across processes on mpi-crayxc, ofi-linux-x86_64, and verbs-linux-x86_64. This leads me to believe there is an issue in the gni implementation of the direct API; however, all the example and test programs for the direct API pass.
#1 Updated by Sam White about 1 year ago
$ cd charm/
# Checkout this patch: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/4084/
$ ./build AMPI gni-crayxc -j16 --with-production --enable-error-checking --without-romio -g -O0
$ cd gni-crayxc/examples/ampi/pingpong/
$ make OPTS="-g -O0"
$ salloc -N 1 -q interactive -C haswell -t 00:30:00
$ srun -n 2 ./pgm 1 1000000
This attempts a single pingpong iteration with a message size of 1000000 bytes.
#2 Updated by Sam White about 1 year ago
- Status changed from New to Rejected
Never mind: this is an issue with the commit that adds a free list for CkNcpy objects. The earlier patches all work.
Offending commit: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/3946/
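For readers unfamiliar with the technique, here is a minimal sketch of a free list for objects that carry a completion callback. It is not the Charm++ code from the commit above, and the thread does not say what actually went wrong there; Transfer, TransferPool, onComplete, and done are hypothetical names chosen only for illustration. The sketch shows the general pattern: released objects are cached and handed out again, so any per-use state has to be reset before reuse.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a transfer descriptor; NOT the real CkNcpy class.
struct Transfer {
  std::function<void()> onComplete;  // completion callback, empty when unset
  bool done = false;                 // per-use state
};

// Minimal free list: released objects are cached and reused
// instead of being deleted and reallocated on every transfer.
class TransferPool {
  std::vector<Transfer*> freeList_;

public:
  TransferPool() = default;
  TransferPool(const TransferPool&) = delete;
  TransferPool& operator=(const TransferPool&) = delete;

  // Only idle (released) objects are owned by the pool at destruction time.
  ~TransferPool() {
    for (Transfer* t : freeList_) delete t;
  }

  // Hand out a cached object if one is idle, otherwise allocate a new one.
  Transfer* acquire() {
    if (freeList_.empty()) return new Transfer();
    Transfer* t = freeList_.back();
    freeList_.pop_back();
    return t;
  }

  // Return an object to the pool, clearing per-use state so the next
  // user does not inherit a stale callback or completion flag.
  void release(Transfer* t) {
    assert(t != nullptr);
    t->onComplete = nullptr;
    t->done = false;
    freeList_.push_back(t);
  }

  // Number of objects currently cached for reuse.
  std::size_t idle() const { return freeList_.size(); }
};
```

In this sketch a caller would acquire() an object, set onComplete, start the transfer, and release() the object only after the callback has run. Releasing earlier, or skipping the reset in release(), are the kinds of mistakes a free list makes possible, though whether either applies to the commit above is not stated in this thread.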