Bug #1894

AMPI zero copy patches are hanging on gni-crayxc

Added by Sam White about 1 year ago. Updated about 1 year ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 05/02/2018
Due date:
% Done: 0%


Description

While testing this whole series of patches on Cori (Haswell), I found that the program hangs when calling rget() for an inter-process message transfer. After rget() is called, the callback given in the CkNcpy object (the ampi::completedRecv method) is never invoked.

Here is the top patch in the series of AMPI zero copy changes: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/4084/

It works for all transfers within the same process, and it works across processes on mpi-crayxc, ofi-linux-x86_64, and verbs-linux-x86_64. This leads me to believe there is some issue in the gni implementation of the direct API; however, all the example and test programs for the direct API pass.

History

#1 Updated by Sam White about 1 year ago

To reproduce:

$ cd charm/

# Check out this patch: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/4084/

$ ./build AMPI gni-crayxc -j16 --with-production --enable-error-checking --without-romio -g -O0
$ cd gni-crayxc/examples/ampi/pingpong/
$ make OPTS="-g -O0" 
$ salloc -N 1 -q interactive -C haswell -t 00:30:00
$ srun -n 2 ./pgm 1 1000000

This will try to do a single iteration of pingpong with a message size of 1000000 bytes.
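
Since the failure mode is a hang rather than a crash, it can help to bound the run so that the allocation is not consumed while waiting. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the login/compute environment; `run_with_limit` is a hypothetical helper name and is not part of Charm++ or AMPI, and the 60-second limit is arbitrary.

```shell
#!/bin/sh
# Hypothetical helper (not part of Charm++/AMPI): run a command under a
# time limit and report whether it completed or had to be killed.
# Relies on GNU coreutils `timeout`, which exits with status 124 when the
# limit expires before the command finishes.
run_with_limit() {
    limit="$1"; shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG: no completion within ${limit}s"
    elif [ "$status" -eq 0 ]; then
        echo "OK"
    else
        echo "FAILED: exit status $status"
    fi
    return "$status"
}

# Intended use inside the salloc session from the steps above:
#   run_with_limit 60 srun -n 2 ./pgm 1 1000000
```

With this wrapper, a run that never reaches the receive callback would be reported as "HANG" after the limit instead of blocking until the 30-minute allocation ends.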

#2 Updated by Sam White about 1 year ago

  • Status changed from New to Rejected

Never mind, this is an issue with the commit that adds a free list for CkNcpy objects. The earlier patches all work.

Offending commit: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/3946/
