Bug #1084

Eliminate extra copies in AMPI reduce/gather(v) receive paths

Added by Sam White about 3 years ago. Updated almost 3 years ago.

Status: Merged
Priority: Normal
Assignee:
Category: AMPI
Target version:
Start date: 06/19/2016
Due date:
% Done: 100%


Description

As pointed out by Phil in review of change #985:

Aside from inter-communicator collectives, which we don't currently support (and maybe even in that case, depending on implementation details):
When gather(v)Result is called, the object it was called on must have participated in the collective that fed into it (otherwise the reduction implementing the collective could not have completed). That means a matching receive buffer is already allocated on that rank, and the data could be copied into it directly, without the additional dynamic memory allocation and copy in gather(v)Result.
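The difference between the two receive paths can be sketched as below. This is a minimal illustration, not AMPI's actual code; the function names and signatures are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical sketch: the current path allocates a temporary buffer and
// copies twice; the proposed path writes straight into the receive buffer
// the rank already posted when it entered the collective.
void extraCopyPath(const char* rednData, size_t n, char* postedRecvbuf) {
  std::vector<char> tmp(n);                   // extra dynamic allocation
  std::memcpy(tmp.data(), rednData, n);       // copy 1: into the temporary
  std::memcpy(postedRecvbuf, tmp.data(), n);  // copy 2: into the user's buffer
}

void directPath(const char* rednData, size_t n, char* postedRecvbuf) {
  std::memcpy(postedRecvbuf, rednData, n);    // single copy, no allocation
}
```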


Subtasks

Feature #1107: Use built-in reducer types for basic types/operations in AMPI (Merged, Sam White)


Related issues

Related to Charm++ - Feature #985: Make AMPI_Gather and its variants use Tuple/Set Reductions (Merged, 02/17/2016)

History

#1 Updated by Phil Miller about 3 years ago

Just a note for when this gets implemented: we won't be able to use memcpy straight out of the buffer in the reduction message, because the receive data type might not be contiguous. We'll have to go through DDT to unpack from the sequential representation into the receive buffer.
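A minimal sketch of the point above, assuming a simple vector-style (strided) receive datatype. `unpackVector` is a hypothetical stand-in for going through DDT, not a real AMPI API: the packed payload in the reduction message is sequential, so a single memcpy is only valid when the receive datatype is contiguous; otherwise the data must be unpacked block by block.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical DDT-style unpack: copy `count` blocks of `blocklen` doubles
// from the sequential (packed) representation into a strided receive buffer.
void unpackVector(const double* packed, double* recvbuf,
                  int count, int blocklen, int stride) {
  for (int i = 0; i < count; i++) {
    std::memcpy(recvbuf + i * stride,   // start of the i-th block in the user buffer
                packed + i * blocklen,  // blocks are adjacent in the packed form
                blocklen * sizeof(double));
  }
}
```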

#2 Updated by Sam White about 3 years ago

  • Status changed from New to In Progress
  • Subject changed from Eliminate extra copy in ampi::gather(v)Result to Eliminate extra copy in AMPI reductions/gathers

Started working on this. A couple notes:
1) We can eliminate an intermediate copy from the CkReductionMsg to an AmpiMsg for reductions.
2) We can also eliminate the ugly hack that is MPI_REDUCE_TAG having its own special semantics in recv (to account for the result message data containing the AmpiOpHeader).
3) Removing AMPI support for CMI_PIC_ELFCOPY will simplify the implementations of these (issue #1106).

#3 Updated by Sam White about 3 years ago

A couple more notes:

1. Gathers currently have two extra copies on the recv side. The recv path is: first copy from the CkReductionMsg to a buffer ordered by contributor rank, then from that intermediate buffer to an AmpiMsg, then from the AmpiMsg to the user's recvbuf. By deserializing and reordering the data directly from the CkReductionMsg into the user's recvbuf, we can get rid of both extra copies.

2. I found another extra copy in AMPI reductions, this one on the sender side: we have to copy the contributed data into a CkReductionMsg prefixed with an AmpiOpHeader struct, which contains the MPI_Op user function to perform the reduction. For predefined types (MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, etc.) combined with basic operations (MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, etc.), we can avoid the copy of the AmpiOpHeader and benefit from the fact that most built-in reducers are streamable; we could eliminate this copy entirely if/when Charm++ gets RDMA versions of contribute(). We could also expand Charm's set of built-in reduction types to cover types supported in MPI but not in Charm. See issue #1107.
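The direct deserialize-and-reorder idea from note 1 can be sketched as follows. `Contribution` and `scatterToRecvbuf` are hypothetical names, not AMPI's actual types: each contribution carries its source rank, so it can be placed straight into its rank-ordered slot in the user's recvbuf, with no intermediate rank-ordered buffer and no AmpiMsg in between.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical: one gathered contribution per rank, tagged with its source.
struct Contribution { int srcRank; const int* data; };

// Copy each contribution directly into its rank-ordered position in the
// user's recvbuf, even if contributions arrive out of rank order.
void scatterToRecvbuf(const std::vector<Contribution>& contribs,
                      int* recvbuf, int recvcount) {
  for (const Contribution& c : contribs)
    std::memcpy(recvbuf + c.srcRank * recvcount, c.data,
                recvcount * sizeof(int));
}
```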

#4 Updated by Sam White about 3 years ago

  • Subject changed from Eliminate extra copy in AMPI reductions/gathers to Eliminate extra copies in AMPI reduce/gather(v) recv paths
  • Status changed from In Progress to Implemented

#5 Updated by Sam White about 3 years ago

This patch was held up by issue #1146.

#6 Updated by Sam White almost 3 years ago

This patch is now held up because it breaks BigSim: it aborts after recv'ing unexpected reductions in ampi::rednResult. Here is the output of a 4d stencil program that does an MPI_Reduce at the end of each iteration, run on 1 PE to trace 16 VPs:

Running Jacobi on 16 processors with (2, 2, 2, 2) elements
Array Dimensions: 2 2 2 2
Block Dimensions: 1 1 1 1
Rank  1 completed iter 1 at time 0.001960
Rank  2 completed iter 1 at time 0.001972
Rank  5 completed iter 1 at time 0.001895
Rank  6 completed iter 1 at time 0.001908
Rank  7 completed iter 1 at time 0.001900
Rank  9 completed iter 1 at time 0.001895
Rank 10 completed iter 1 at time 0.001898
Rank 11 completed iter 1 at time 0.001890
Rank 13 completed iter 1 at time 0.001892
Rank 14 completed iter 1 at time 0.001884
Rank 15 completed iter 1 at time 0.001890
Rank  4 completed iter 1 at time 0.001914
Rank  8 completed iter 1 at time 0.001895
Rank 12 completed iter 1 at time 0.001889
    In ampi::blockOnRedn() at time 0.001961
Rank  3 completed iter 1 at time 0.001956
Rank 11 completed iter 2 at time 0.001935
Rank 11 done at time 0.001944
Rank 15 completed iter 2 at time 0.001931
Rank 15 done at time 0.001938
Rank  7 completed iter 2 at time 0.001947
Rank  7 done at time 0.001954
Rank  8 completed iter 2 at time 0.001938
Rank  8 done at time 0.001945
Rank 10 completed iter 2 at time 0.001942
Rank 10 done at time 0.001949
Rank  9 completed iter 2 at time 0.001926
Rank  9 done at time 0.001933
Rank 12 completed iter 2 at time 0.001923
Rank 12 done at time 0.001929
Rank 14 completed iter 2 at time 0.001915
Rank 14 done at time 0.001922
Rank 13 completed iter 2 at time 0.001935
Rank 13 done at time 0.001942
Rank  4 completed iter 2 at time 0.001952
Rank  4 done at time 0.001960
Rank  6 completed iter 2 at time 0.001945
Rank  6 done at time 0.001952
Rank  5 completed iter 2 at time 0.001938
Rank  5 done at time 0.001946
Rank  0 completed iter 1 at time 0.001999
    In ampi::blockOnRedn() at time 0.002024
Rank  3 completed iter 2 at time 0.002000
Rank  3 done at time 0.002007
Rank  2 completed iter 2 at time 0.002021
Rank  2 done at time 0.002028
Rank  1 completed iter 2 at time 0.002024
Rank  1 done at time 0.002032
    AMPI recv'ed a blocking reduction at time 0.002041
    AMPI recv'ed a blocking reduction unexpectedly at time 0.002064
Rank  0 completed iter 2 at time 0.002071
Rank  0 done at time 0.002087
Time elapsed per iteration: 0.000077
AMPI_Exit called by rank 0 at time 0.002096
[0] Number is numX:16 numY:1 numZ:1 numCth:1 numWth:1 numEmulatingPes:1 totalWorkerProcs:16 bglog_ver:6
[0] Wrote to disk for 16 BG nodes. 

BG> BigSim emulator shutdown gracefully!
BG> Emulation took 0.023349 seconds!
[Partition 0][Node 0] End of program

Note the timings on rank 0 and the reduction operations, particularly how rank 0, which is the root of the reductions, seems to run ahead of the reduction, as if it is being resumed prematurely. The second reduction would normally abort, but I have changed ampi::rednResult to simply return after recv'ing an unexpected message.

#7 Updated by Phil Miller almost 3 years ago

Can you reduce that test case further? Fewer PEs? One PE? Fewer VPs?

#8 Updated by Sam White almost 3 years ago

I can reproduce it on 1 PE with the examples/ampi/creduce program, which is just one MPI_Reduce followed by one MPI_Ireduce. We get the same abort on the MPI_Ireduce call, even for just 1 VP, run like './pgm +vp1 +x1 +y1 +z1'.

Update: that failure was just because of a typo in one of the patch sets. It passes now.

#9 Updated by Sam White almost 3 years ago

I think the issue here might actually be a bug in AMPI_Wait(all) and BigSim that manifests as a failure in the reduction in the second iteration of stencil4d. BigSim passes all 'make bgtest TESTOPTS=+bglog' programs except the run below, where it segfaults in AMPI_Wait. The stencil4d example passes if I remove the Waitall. Here's the failure from one of megampi's bgtest runs (the other runs pass):

$ ./charmrun ./pgm +p2 +vp4 +x2 +y1 +z1  +bglog
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 1.236 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.7.0-279-g02d0d52
Charm++> scheduler running in netpoll mode.
BG info> Simulating 2x1x1 nodes with 1 comm + 1 work threads each.
BG info> Network type: bluegene.
alpha: 1.000000e-07    packetsize: 1024    CYCLE_TIME_FACTOR:1.000000e-03.
CYCLES_PER_HOP: 5    CYCLES_PER_CORNER: 75.
BG info> cpufactor is 1.000000.
BG info> floating point factor is 0.000000.
BG info> Using WallTimer for timing method. 
BG info> Generating timing log. 
CharmLB> Load balancer ignores processor background load.
CharmLB> Load balancer assumes all CPUs are same.
Trace: traceroot: /dcsdata/home/swhite/tmp/charm/netlrts-linux-x86_64-bigemulator/tests/ampi/megampi/./pgm
[0] RandCentLB created
[0] Testing: Testing...
[0] Testing: Testing...
------------- Processor 0 Exiting: Caught Signal ------------
Reason: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
[0] Stack Traceback:
  [0:0]   [0x6d1777]
  [0:1] +0x36cb0  [0x7ffff722bcb0]
  [0:2] _ZN5CkVecIP9BgTimeLogE11insertAtEndERKS1_+0x1c  [0x5cdce6]
  [0:3] _ZN9BgTimeLog14addBackwardDepEPS_+0xbd  [0x5cc13b]
  [0:4] _Z16BgAddBackwardDepP9BgTimeLogS0_+0x23  [0x5ce0db]
  [0:5] _ZN13TraceBluegene14addBackwardDepEPv+0xb5  [0x704b57]
  [0:6] AMPI_Wait+0x4b1  [0x5779ff]
  [0:7] _ZN10MPI_Tester4testEv+0x193  [0x553df7]
  [0:8] _Z13AMPI_Main_cppiPPc+0xd7  [0x554ab4]
  [0:9] AMPI_Fallback_Main+0x20  [0x569581]
  [0:10] _ZN17MPI_threadstart_t5startEv+0x5c  [0x59de7a]
  [0:11] AMPI_threadstart+0x57  [0x56995a]
  [0:12]   [0x555559]
  [0:13] CthStartThread+0x59  [0x6cf07f]
  [0:14] +0x49800  [0x7ffff723e800]
Fatal error on PE 0> segmentation violation

Update: the above issue and the larger issue with the new reduction recv path for BigSim on AMPI are fixed in patch set #30 of the commit above. The problem was collectives and pt2pt messages sharing ampi's resumeOnRecv variable; separating them into resumeOnRecv and resumeOnColl fixed it.
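A minimal sketch of why separating the flags fixes the premature resume, with hypothetical names (AMPI's real mechanism suspends and resumes user-level threads, not booleans): with one shared flag, a point-to-point arrival could wake a thread that was actually blocked on a collective, letting the root run ahead of the reduction.

```cpp
#include <cassert>

// Hypothetical per-rank state: which kind of event the blocked thread is
// actually waiting for. Keeping the two wake-up paths independent means a
// pt2pt arrival can never resume a thread blocked on a collective.
struct RankState {
  bool resumeOnRecv = false;  // thread is blocked in a pt2pt recv
  bool resumeOnColl = false;  // thread is blocked on a collective result

  // Each returns true if the blocked thread should be resumed by this event.
  bool onPt2ptArrival() { bool wake = resumeOnRecv; resumeOnRecv = false; return wake; }
  bool onRednArrival()  { bool wake = resumeOnColl; resumeOnColl = false; return wake; }
};
```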

#10 Updated by Sam White almost 3 years ago

  • Closed date set to 2016-08-21 15:47:24
  • Status changed from Implemented to Merged
  • Subject changed from Eliminate extra copies in AMPI reduce/gather(v) recv paths to Eliminate extra copies in AMPI reduce/gather(v) receive paths
