Eliminate extra copies in AMPI reduce/gather(v) receive paths
As pointed out by Phil in review of change #985:
Aside from inter-communicator collectives, which we don't currently support (and maybe even in that case, depending on implementation details):
When gather(v)Result is called, the object it was called on must have participated in the collective that fed into it (otherwise the reduction implementing the collective could not have completed). That means a matching receive buffer has already been allocated on that rank, and the data could be copied into it directly, without the additional dynamic memory allocation and copy currently done in gather(v)Result.
#1 Updated by Phil Miller about 3 years ago
Just a note for when this gets implemented: we won't be able to use
memcpy straight out of the buffer in the reduction message, because the receive data type might not be contiguous. We'll have to go through DDT to unpack from the sequential representation into the receive buffer.
#2 Updated by Sam White about 3 years ago
- Status changed from New to In Progress
- Subject changed from Eliminate extra copy in ampi::gather(v)Result to Eliminate extra copy in AMPI reductions/gathers
Started working on this. A couple notes:
1) We can eliminate an intermediate copy from the CkReductionMsg to an AmpiMsg for reductions.
2) We can also eliminate the ugly hack of giving MPI_REDUCE_TAG its own special semantics in recv (to account for the reduction result message data containing the AmpiOpHeader).
3) Removing AMPI support for CMI_PIC_ELFCOPY will simplify the implementations of these (issue #1106).
#3 Updated by Sam White about 3 years ago
A couple more notes:
1. Gathers currently have two extra copies on the recv side. The recv path is: first copy from a CkReductionMsg to a buffer ordered by contributor's rank, then from that intermediate buffer to an AmpiMsg, then from the AmpiMsg to the user's recvbuf. We can deserialize and reorder the data directly from the CkReductionMsg into the user's recvbuf to eliminate both extra copies.
2. I found another extra copy in AMPI reductions, this one on the sender side: we have to copy the contributed data into a CkReductionMsg prefixed with an AmpiOpHeader struct, which carries the user function for the MPI_Op that performs the reduction. At least for predefined types (e.g. MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE) combined with basic operations (e.g. MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN), that header is unnecessary. For now we can avoid the copy of the AmpiOpHeader and benefit from most builtin reducers being streamable, and we could completely eliminate this copy if/when Charm++ gets RDMA versions of contribute(). We could also potentially expand the set of built-in reduction types in Charm to cover types supported in MPI but not in Charm. See issue #1107.
#6 Updated by Sam White almost 3 years ago
This patch is now held up because it breaks BigSim, with aborts for recv'ing unexpected reductions in ampi::rednResult. Here is the output of a 4d stencil program that does an MPI_Reduce at the end of each iteration, run on 1 PE to trace 16 VPs:
Running Jacobi on 16 processors with (2, 2, 2, 2) elements
Array Dimensions: 2 2 2 2
Block Dimensions: 1 1 1 1
Rank 1 completed iter 1 at time 0.001960
Rank 2 completed iter 1 at time 0.001972
Rank 5 completed iter 1 at time 0.001895
Rank 6 completed iter 1 at time 0.001908
Rank 7 completed iter 1 at time 0.001900
Rank 9 completed iter 1 at time 0.001895
Rank 10 completed iter 1 at time 0.001898
Rank 11 completed iter 1 at time 0.001890
Rank 13 completed iter 1 at time 0.001892
Rank 14 completed iter 1 at time 0.001884
Rank 15 completed iter 1 at time 0.001890
Rank 4 completed iter 1 at time 0.001914
Rank 8 completed iter 1 at time 0.001895
Rank 12 completed iter 1 at time 0.001889
In ampi::blockOnRedn() at time 0.001961
Rank 3 completed iter 1 at time 0.001956
Rank 11 completed iter 2 at time 0.001935
Rank 11 done at time 0.001944
Rank 15 completed iter 2 at time 0.001931
Rank 15 done at time 0.001938
Rank 7 completed iter 2 at time 0.001947
Rank 7 done at time 0.001954
Rank 8 completed iter 2 at time 0.001938
Rank 8 done at time 0.001945
Rank 10 completed iter 2 at time 0.001942
Rank 10 done at time 0.001949
Rank 9 completed iter 2 at time 0.001926
Rank 9 done at time 0.001933
Rank 12 completed iter 2 at time 0.001923
Rank 12 done at time 0.001929
Rank 14 completed iter 2 at time 0.001915
Rank 14 done at time 0.001922
Rank 13 completed iter 2 at time 0.001935
Rank 13 done at time 0.001942
Rank 4 completed iter 2 at time 0.001952
Rank 4 done at time 0.001960
Rank 6 completed iter 2 at time 0.001945
Rank 6 done at time 0.001952
Rank 5 completed iter 2 at time 0.001938
Rank 5 done at time 0.001946
Rank 0 completed iter 1 at time 0.001999
In ampi::blockOnRedn() at time 0.002024
Rank 3 completed iter 2 at time 0.002000
Rank 3 done at time 0.002007
Rank 2 completed iter 2 at time 0.002021
Rank 2 done at time 0.002028
Rank 1 completed iter 2 at time 0.002024
Rank 1 done at time 0.002032
AMPI recv'ed a blocking reduction at time 0.002041
AMPI recv'ed a blocking reduction unexpectedly at time 0.002064
Rank 0 completed iter 2 at time 0.002071
Rank 0 done at time 0.002087
Time elapsed per iteration: 0.000077
AMPI_Exit called by rank 0 at time 0.002096
Number is numX:16 numY:1 numZ:1 numCth:1 numWth:1 numEmulatingPes:1 totalWorkerProcs:16 bglog_ver:6
Wrote to disk for 16 BG nodes.
BG> BigSim emulator shutdown gracefully!
BG> Emulation took 0.023349 seconds!
[Partition 0][Node 0] End of program
Note the timings on rank 0 and the reduction operations: rank 0, which is the root of the reductions, seems to run ahead of the reduction, as if it were being resumed prematurely. The second reduction would normally abort, but I have changed ampi::rednResult to just return after recv'ing an unexpected message.
#8 Updated by Sam White almost 3 years ago
I can reproduce it on 1 PE with the examples/ampi/creduce program, which is just one MPI_Reduce followed by one MPI_Ireduce. We get the same abort on the MPI_Ireduce call, even for just 1 VP, run like './pgm +vp1 +x1 +y1 +z1'.
Update: that failure was just because of a typo in one of the patch sets. It passes now.
#9 Updated by Sam White almost 3 years ago
I think the issue here might actually be a bug in the interaction of AMPI_Wait(all) and BigSim that manifests as a failure in the reduction in the second iteration of stencil4d. BigSim passes all 'make bgtest TESTOPTS=+bglog' programs except the run below, where it segfaults in AMPI_Wait. The stencil4d example passes if I remove the Waitall. Here's the failure from one of megampi's bgtest runs (the other runs pass):
$ ./charmrun ./pgm +p2 +vp4 +x2 +y1 +z1 +bglog
Charmrun> scalable start enabled.
Charmrun> started all node programs in 1.236 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.7.0-279-g02d0d52
Charm++> scheduler running in netpoll mode.
BG info> Simulating 2x1x1 nodes with 1 comm + 1 work threads each.
BG info> Network type: bluegene. alpha: 1.000000e-07 packetsize: 1024 CYCLE_TIME_FACTOR:1.000000e-03. CYCLES_PER_HOP: 5 CYCLES_PER_CORNER: 75.
BG info> cpufactor is 1.000000.
BG info> floating point factor is 0.000000.
BG info> Using WallTimer for timing method.
BG info> Generating timing log.
CharmLB> Load balancer ignores processor background load.
CharmLB> Load balancer assumes all CPUs are same.
Trace: traceroot: /dcsdata/home/swhite/tmp/charm/netlrts-linux-x86_64-bigemulator/tests/ampi/megampi/./pgm
RandCentLB created
Testing: Testing...
Testing: Testing...
------------- Processor 0 Exiting: Caught Signal ------------
Reason: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
Stack Traceback:
[0:0] [0x6d1777]
[0:1] +0x36cb0 [0x7ffff722bcb0]
[0:2] _ZN5CkVecIP9BgTimeLogE11insertAtEndERKS1_+0x1c [0x5cdce6]
[0:3] _ZN9BgTimeLog14addBackwardDepEPS_+0xbd [0x5cc13b]
[0:4] _Z16BgAddBackwardDepP9BgTimeLogS0_+0x23 [0x5ce0db]
[0:5] _ZN13TraceBluegene14addBackwardDepEPv+0xb5 [0x704b57]
[0:6] AMPI_Wait+0x4b1 [0x5779ff]
[0:7] _ZN10MPI_Tester4testEv+0x193 [0x553df7]
[0:8] _Z13AMPI_Main_cppiPPc+0xd7 [0x554ab4]
[0:9] AMPI_Fallback_Main+0x20 [0x569581]
[0:10] _ZN17MPI_threadstart_t5startEv+0x5c [0x59de7a]
[0:11] AMPI_threadstart+0x57 [0x56995a]
[0:12] [0x555559]
[0:13] CthStartThread+0x59 [0x6cf07f]
[0:14] +0x49800 [0x7ffff723e800]
Fatal error on PE 0> segmentation violation
Update: the issue above and the bigger issue with the new reduction recv path for BigSim on AMPI are both fixed in patch set #30 of the commit above. The problem was collectives and pt2pt messages sharing ampi's resumeOnRecv variable; separating them into resumeOnRecv and resumeOnColl fixed it.
#10 Updated by Sam White almost 3 years ago
- Closed date set to 2016-08-21 15:47:24.558291
- Status changed from Implemented to Merged
- Subject changed from Eliminate extra copies in AMPI reduce/gather(v) recv paths to Eliminate extra copies in AMPI reduce/gather(v) receive paths