Bug #1633

NodeGroup broadcasts create many copies of the message for point-to-point sends

Added by Eric Bohm about 1 year ago. Updated 2 months ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Machine Layers
Target version:
-
Start date:
07/14/2017
Due date:
% Done:

0%


Description

Broadcast to a nodegroup results in the message being copied once for each destination node. These are then sent in a loop.

This results in numnodes copies of the message, with time, space, and bandwidth costs proportional to both the message size and the number of hosts.

For large messages (where "large" varies by machine layer, but 128k should be a safe threshold), this should be replaced by an RDMA scheme that arranges for each destination to RDMA-get the payload, reducing the space and time cost. This could be further enriched by a spanning-tree approach to reduce the single-link bandwidth cost, though at the price of increased latency at the leaves. It is probably best to cut over to that scheme when the number of hosts is large and/or the message is very large (~1G). This should be implementable using our existing zero-copy semantics, avoiding a new implementation in each machine layer. If not, we should extend those semantics to facilitate this kind of usage.
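As a concrete reference point for the spanning-tree option, the fan-out can be sketched as a k-ary tree over node ranks, renumbered relative to the root so any node can act as the root. This is a generic illustration, not the actual machine-broadcast.c code; the function name and parameters are made up for this sketch.

```cpp
#include <vector>

// Children of 'node' in a k-ary spanning tree over 'nnodes' nodes rooted at
// 'root'. Positions are computed relative to the root, so the same formula
// works for any root. With branching factor k, each link carries at most k
// copies of the message instead of nnodes-1 at the source, at the cost of
// O(log_k nnodes) hops of latency at the leaves.
std::vector<int> spanning_tree_children(int node, int nnodes, int root,
                                        int branch) {
  int rel = (node - root + nnodes) % nnodes;  // position relative to the root
  std::vector<int> children;
  for (int i = 1; i <= branch; ++i) {
    int child = rel * branch + i;
    if (child >= nnodes) break;  // past the last node: 'node' is a leaf
    children.push_back((child + root) % nnodes);
  }
  return children;
}
```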

An even more advanced scheme would use message-layer broadcast primitives. However, their applicability and portability have their own research agenda, so whoever goes in that direction should create a distinct subtask for it.


Related issues

Related to Charm++ - Feature #1184: SMP-safe CmiReference and CmiFree Implemented 08/24/2016

History

#1 Updated by Phil Miller about 1 year ago

  • Related to Feature #1184: SMP-safe CmiReference and CmiFree added

#2 Updated by Phil Miller about 1 year ago

This all boils down to src/arch/util/machine-broadcast.c and the setting of CMK_BROADCAST_USE_CMIREFERENCE in src/conv-core/conv-config.h. It defaults to 0 because CmiReference and CmiFree would have to use atomic increment/decrement operations. We should make a version of those that uses atomic operations and performance-test it on applications. My expectation is that it shouldn't hurt anything, and would make a lot of other things much easier.
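A minimal sketch of the atomic reference-counting pattern described above, using std::atomic. The header layout and names here are illustrative only; the real Converse message header and the actual CmiReference/CmiFree signatures differ.

```cpp
#include <atomic>
#include <cstdlib>
#include <new>

// Illustrative message header carrying an atomic reference count in front of
// the payload. Only the fetch_add/fetch_sub pattern is the point here.
struct MsgHeader {
  std::atomic<int> refcount;
};

void* msg_alloc(std::size_t payload) {
  auto* h = static_cast<MsgHeader*>(std::malloc(sizeof(MsgHeader) + payload));
  new (&h->refcount) std::atomic<int>(1);  // creator holds one reference
  return h + 1;                            // user pointer starts after header
}

MsgHeader* header_of(void* msg) {
  return static_cast<MsgHeader*>(msg) - 1;
}

// Analogous to CmiReference: a single atomic increment, safe from any thread.
void msg_reference(void* msg) {
  header_of(msg)->refcount.fetch_add(1, std::memory_order_relaxed);
}

// Analogous to CmiFree: whichever thread drops the last reference frees.
void msg_free(void* msg) {
  MsgHeader* h = header_of(msg);
  if (h->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
    std::free(h);
}
```

With this in place, a broadcast loop could call msg_reference once per destination and hand out the same pointer, instead of calling CmiCopyMsg per destination.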

The important characteristic to note is that nothing in the messages themselves (including any header fields) should change from one recipient to the next. However, the 'destination rank' field does appear to get set in SendSpanningTreeChildren.

There's also the concern of recipient PEs unpacking the message once they have a pointer to it. Depending on whether we want to optimize for latency or memory footprint, there's a tradeoff between making a copy for local delivery while sending the still-packed copy off to other nodes, or waiting for the remote sends to complete and then delivering the original locally.
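The "memory footprint" option (deliver the original only after the remote sends complete) can be sketched with a per-broadcast counter of outstanding sends. All names here are hypothetical; real machine layers would drive this from their send-complete callbacks.

```cpp
#include <atomic>
#include <functional>

// Sketch: hold the original packed message, count remote sends still in
// flight, and enqueue the original for local delivery only when the last
// send-complete notification arrives. Avoids the extra local copy at the
// cost of delaying local delivery behind the slowest remote send.
struct BroadcastState {
  std::atomic<int> pending{0};             // remote sends still in flight
  std::function<void()> deliver_locally;   // enqueue original on local node
};

// Called once per remote destination when the machine layer signals that the
// send no longer needs the buffer.
void on_send_complete(BroadcastState* st) {
  if (st->pending.fetch_sub(1, std::memory_order_acq_rel) == 1)
    st->deliver_locally();  // last send done: original is safe to consume
}
```

The "latency" option is the mirror image: copy once for immediate local delivery and let the refcount on the packed original drain as sends complete.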

#3 Updated by Sam White 9 months ago

  • Target version deleted (6.9.0)

#4 Updated by Evan Ramos 3 months ago

  • Assignee changed from Nitin Bhat to Evan Ramos
  • Status changed from New to Implemented

#5 Updated by Evan Ramos 3 months ago

Enabling this for SMP builds causes megatest to hang at test 0: initiated [groupring (milind)] when run as ./charmrun ++local +p4 ++ppn 2 ./pgm.

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fd6d00 (LWP 24496))]
#0  0x0000555555929f82 in CmiGetNonLocalNodeQ () at machine-common-core.c:1835
1835        if (!CMIQueueEmpty(CsvAccess(NodeState).NodeRecv)) {
(gdb) bt
#0  0x0000555555929f82 in CmiGetNonLocalNodeQ () at machine-common-core.c:1835
#1  0x0000555555931e80 in CsdNextMessage (s=0x7fffffffe160) at convcore.C:1815
#2  0x0000555555932095 in CsdScheduleForever () at convcore.C:1922
#3  0x0000555555932004 in CsdScheduler (maxmsgs=-1) at convcore.C:1861
#4  0x0000555555929beb in ConverseRunPE (everReturn=0) at machine-common-core.c:1591
#5  0x0000555555929857 in ConverseInit (argc=2, argv=0x7fffffffe3a8, fn=0x555555834dd3 <_initCharm(int, char**)>, usched=0, initret=0) at machine-common-core.c:1484
#6  0x000055555582dd53 in main (argc=2, argv=0x7fffffffe3a8) at main.C:9
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff6ee2700 (LWP 24500))]
#0  0x0000555555931e27 in CsdNextMessage (s=0x7ffff6ee1850) at convcore.C:1789
1789            if ( NULL!=(msg=CmiGetNonLocal()) || 
(gdb) bt
#0  0x0000555555931e27 in CsdNextMessage (s=0x7ffff6ee1850) at convcore.C:1789
#1  0x0000555555932095 in CsdScheduleForever () at convcore.C:1922
#2  0x0000555555932004 in CsdScheduler (maxmsgs=-1) at convcore.C:1861
#3  0x0000555555929beb in ConverseRunPE (everReturn=0) at machine-common-core.c:1591
#4  0x000055555592659f in call_startfn (vindex=0x1) at machine-smp.c:414
#5  0x00007ffff7bbd7fc in start_thread (arg=0x7ffff6ee2700) at pthread_create.c:465
#6  0x00007ffff6ff7b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 3
[Switching to thread 3 (Thread 0x7ffff66e1700 (LWP 24502))]
#0  0x00007ffff6feb951 in __GI___poll (fds=0x7ffff66e0780, nfds=4, timeout=0) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6feb951 in __GI___poll (fds=0x7ffff66e0780, nfds=4, timeout=0) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000055555592cf7b in CheckSocketsReady (withDelayMs=0) at machine-eth.c:79
#2  0x000055555592e742 in CommunicationServerNet (sleepTime=0, where=0) at machine-eth.c:720
#3  0x000055555592ee72 in LrtsAdvanceCommunication (whileidle=1) at machine.C:1762
#4  0x0000555555929c1e in AdvanceCommunication (whenidle=1) at machine-common-core.c:1611
#5  0x0000555555929c36 in CommunicationServer (sleepTime=5) at machine-common-core.c:1636
#6  0x0000555555929c77 in CommunicationServerThread (sleepTime=5) at machine-common-core.c:1655
#7  0x0000555555929baf in ConverseRunPE (everReturn=0) at machine-common-core.c:1586
#8  0x000055555592659f in call_startfn (vindex=0x2) at machine-smp.c:414
#9  0x00007ffff7bbd7fc in start_thread (arg=0x7ffff66e1700) at pthread_create.c:465
#10 0x00007ffff6ff7b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 4
Unknown thread 4.
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fd6d00 (LWP 24489))]
#0  0x0000555555938057 in ccd_heap_update (curWallTime=18.305255174636841) at conv-conds.c:379
379       ccd_heap_elem *e = h+CpvAccess(ccd_heapmaxlen);
(gdb) bt
#0  0x0000555555938057 in ccd_heap_update (curWallTime=18.305255174636841) at conv-conds.c:379
#1  0x0000555555938a5a in CcdCallBacks () at conv-conds.c:561
#2  0x000055555593211a in CsdScheduleForever () at convcore.C:1959
#3  0x0000555555932004 in CsdScheduler (maxmsgs=-1) at convcore.C:1861
#4  0x0000555555929beb in ConverseRunPE (everReturn=0) at machine-common-core.c:1591
#5  0x0000555555929857 in ConverseInit (argc=2, argv=0x7fffffffe3a8, fn=0x555555834dd3 <_initCharm(int, char**)>, usched=0, initret=0) at machine-common-core.c:1484
#6  0x000055555582dd53 in main (argc=2, argv=0x7fffffffe3a8) at main.C:9
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff6ee2700 (LWP 24501))]
#0  0x000055555595d7a3 in CdsFifo_Dequeue (q=0x7ffff0001470) at conv-lists.C:18
18      void *  CdsFifo_Dequeue(CdsFifo q) { return ((_Fifo*)q)->deq(); }
(gdb) bt
#0  0x000055555595d7a3 in CdsFifo_Dequeue (q=0x7ffff0001470) at conv-lists.C:18
#1  0x0000555555931e3b in CsdNextMessage (s=0x7ffff6ee1850) at convcore.C:1790
#2  0x0000555555932095 in CsdScheduleForever () at convcore.C:1922
#3  0x0000555555932004 in CsdScheduler (maxmsgs=-1) at convcore.C:1861
#4  0x0000555555929beb in ConverseRunPE (everReturn=0) at machine-common-core.c:1591
#5  0x000055555592659f in call_startfn (vindex=0x1) at machine-smp.c:414
#6  0x00007ffff7bbd7fc in start_thread (arg=0x7ffff6ee2700) at pthread_create.c:465
#7  0x00007ffff6ff7b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 3
[Switching to thread 3 (Thread 0x7fffee6e1700 (LWP 24503))]
#0  0x00007ffff6feb951 in __GI___poll (fds=0x7fffee6e0780, nfds=4, timeout=0) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6feb951 in __GI___poll (fds=0x7fffee6e0780, nfds=4, timeout=0) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000055555592cf7b in CheckSocketsReady (withDelayMs=0) at machine-eth.c:79
#2  0x000055555592e742 in CommunicationServerNet (sleepTime=0, where=0) at machine-eth.c:720
#3  0x000055555592ee72 in LrtsAdvanceCommunication (whileidle=1) at machine.C:1762
#4  0x0000555555929c1e in AdvanceCommunication (whenidle=1) at machine-common-core.c:1611
#5  0x0000555555929c36 in CommunicationServer (sleepTime=5) at machine-common-core.c:1636
#6  0x0000555555929c77 in CommunicationServerThread (sleepTime=5) at machine-common-core.c:1655
#7  0x0000555555929baf in ConverseRunPE (everReturn=0) at machine-common-core.c:1586
#8  0x000055555592659f in call_startfn (vindex=0x2) at machine-smp.c:414
#9  0x00007ffff7bbd7fc in start_thread (arg=0x7fffee6e1700) at pthread_create.c:465
#10 0x00007ffff6ff7b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) thread 4
Unknown thread 4.

#6 Updated by Evan Ramos 3 months ago

  • Assignee changed from Evan Ramos to Nitin Bhat
  • Status changed from Implemented to In Progress

The std::atomic prerequisite has been satisfied as of https://charm.cs.illinois.edu/gerrit/4108 but the existing implementation of CMK_BROADCAST_USE_CMIREFERENCE is incomplete and/or buggy.

#7 Updated by Evan Ramos 2 months ago

When running megatest as ./pgm +p2, failure sometimes occurs before the first test begins, with the message

[1] Assertion "getMsgtype()==BocInitMsg || getMsgtype()==ForBocMsg || getMsgtype()==NodeBocInitMsg || getMsgtype()==ForNodeBocMsg" failed in file ./envelope.h line 437.

getMsgtype() is reported to be 18, or ArrayEltInitMsg.

I sometimes get this instead:

------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: _processBufferedBocInits: empty message

I changed CMK_BROADCAST_SPANNING_TREE to 0 and the problems still occur, so those code sections can be ruled out. I also disabled the larger switched sections to no effect -- it must come down to some side effect elsewhere in the system that is brought out by the small code sections that replace a CopyMsg with a CmiReference.
