Bug #1329

Hang in exit in TRAM test on gni-crayxc-smp

Added by Sam White over 2 years ago. Updated over 2 years ago.

Status: Merged
Priority: Normal
Category: -
Target version:
Start date: 12/14/2016
Due date:
% Done: 100%


Description

tests/charm++/streamingAllToAll/ appears to hang every night in autobuild on gni-crayxc-smp and then time out:

http://charm.cs.uiuc.edu/autobuild/cur/gni-crayxc-smp.txt

Entering directory `/scratch1/scratchdirs/acun/autobuild/gni-crayxc-smp/charm/gni-crayxc-smp/tests/charm++/streamingAllToAll'
srun -n 4 -c 2 ./ataTest +CmiNoProcForComThread
Charm++> Running on Gemini (GNI) with 4 processes
Charm++> static SMSG
Charm++> SMSG memory: 19.8KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 4,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: 0e19d77
Charm++> Note: The option +CmiNoProcForComThread has been superseded by +CmiSleepOnIdle
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (48-way SMP).
size of envelope: 80

TEST 1: Using 2D TRAM Topology: 4 1
Elapsed time for all-to-all of       32 bytes sent in      1  iteration of 32 bytes each (    using TRAM): 0.019991 seconds
Elapsed time for all-to-all of       64 bytes sent in      2 iterations of 32 bytes each (    using TRAM): 0.000056 seconds
Elapsed time for all-to-all of      128 bytes sent in      4 iterations of 32 bytes each (    using TRAM): 0.000043 seconds
Elapsed time for all-to-all of      256 bytes sent in      8 iterations of 32 bytes each (    using TRAM): 0.000040 seconds
Elapsed time for all-to-all of      512 bytes sent in     16 iterations of 32 bytes each (    using TRAM): 0.000040 seconds
Elapsed time for all-to-all of     1024 bytes sent in     32 iterations of 32 bytes each (    using TRAM): 0.000193 seconds
Elapsed time for all-to-all of     2048 bytes sent in     64 iterations of 32 bytes each (    using TRAM): 0.000050 seconds
Elapsed time for all-to-all of     4096 bytes sent in    128 iterations of 32 bytes each (    using TRAM): 0.003155 seconds
Elapsed time for all-to-all of     8192 bytes sent in    256 iterations of 32 bytes each (    using TRAM): 0.000087 seconds
Elapsed time for all-to-all of    16384 bytes sent in    512 iterations of 32 bytes each (    using TRAM): 0.000136 seconds

TEST 2: Using point to point sends
Elapsed time for all-to-all of       32 bytes sent in      1  iteration of 32 bytes each (not using TRAM): 0.000032 seconds
Elapsed time for all-to-all of       64 bytes sent in      2 iterations of 32 bytes each (not using TRAM): 0.000051 seconds
Elapsed time for all-to-all of      128 bytes sent in      4 iterations of 32 bytes each (not using TRAM): 0.000060 seconds
Elapsed time for all-to-all of      256 bytes sent in      8 iterations of 32 bytes each (not using TRAM): 0.000075 seconds
Elapsed time for all-to-all of      512 bytes sent in     16 iterations of 32 bytes each (not using TRAM): 0.000130 seconds
Elapsed time for all-to-all of     1024 bytes sent in     32 iterations of 32 bytes each (not using TRAM): 0.000199 seconds
Elapsed time for all-to-all of     2048 bytes sent in     64 iterations of 32 bytes each (not using TRAM): 0.000381 seconds
Elapsed time for all-to-all of     4096 bytes sent in    128 iterations of 32 bytes each (not using TRAM): 0.000838 seconds
Elapsed time for all-to-all of     8192 bytes sent in    256 iterations of 32 bytes each (not using TRAM): 0.001428 seconds
Elapsed time for all-to-all of    16384 bytes sent in    512 iterations of 32 bytes each (not using TRAM): 0.002528 seconds
[Partition 0][Node 0] End of program
slurmstepd: error: *** STEP 2982328.38 ON nid00546 CANCELLED AT 2016-12-12T05:08:20 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 2982328 ON nid00546 CANCELLED AT 2016-12-12T05:08:20 DUE TO TIME LIMIT ***

History

#1 Updated by Sam White over 2 years ago

  • Assignee set to Karthik Senthil

Assigning to Karthik because he is looking at optimizations for TRAM in SMP mode.

Also note that this hang is nondeterministic; it doesn't happen every time.

#2 Updated by Sam White over 2 years ago

The core group would like to see some progress on this issue. Do you have access to Edison to reproduce this?

#3 Updated by Karthik Senthil over 2 years ago

The bug does not happen in a 'debug' build of Charm++ on Edison.

I am currently trying to find the exact location of the bug. The program flow reaches the CkExit() invocation, and there is no QD involved in the program or the TRAM library. One possible suspect is the CthYield() that is explicitly called in the non-TRAM test case whenever the counter reaches 1024.

#4 Updated by Phil Miller over 2 years ago

Any progress on this? If you need help, please ask for it.

#5 Updated by Karthik Senthil over 2 years ago

I have isolated the bug: it occurs in the "directSends" version of the test. The "usingTram" part is clean. As suspected, it is most likely due to a race condition or a mistake in the calls to CthYield(). I am looking into this right now and will update as soon as I find the fix.

#6 Updated by Karthik Senthil over 2 years ago

Some notes on the experiments I performed today for this bug:
https://charm.cs.illinois.edu/newTms/tasks/1753

#7 Updated by Sam White over 2 years ago

  • Status changed from New to In Progress
  • Subject changed from Hang in TRAM tests on gni-crayxc-smp to Hang in exit in TRAM test on gni-crayxc-smp

The hang is not related to TRAM; it just happens in the TRAM test.

The real issue seems to be somewhere in our exit sequence on gni-smp builds.

#8 Updated by Karthik Senthil over 2 years ago

The hang does not happen when using run options like +p4 ++ppn4. However, for +p8 ++ppn4 the program hangs.
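
The difference between the two configurations can be sketched with a small calculation. This is only an illustration, and it assumes that +p gives the total number of worker threads and ++ppn gives the number of worker threads per process, which is how these options are generally described for Charm++ SMP builds. The function name `process_count` is hypothetical and is not part of Charm++.

```python
def process_count(total_worker_threads, worker_threads_per_process):
    """Return how many OS processes a +p / ++ppn pair implies.

    Assumption: +p is the total number of worker threads and ++ppn is the
    number of worker threads per process.
    """
    if worker_threads_per_process <= 0:
        raise ValueError("worker_threads_per_process must be positive")
    if total_worker_threads % worker_threads_per_process != 0:
        raise ValueError("total must be a multiple of the per-process count")
    return total_worker_threads // worker_threads_per_process


if __name__ == "__main__":
    # Configuration reported as not hanging.
    print("+p4 ++ppn4 ->", process_count(4, 4), "process(es)")
    # Configuration reported as hanging.
    print("+p8 ++ppn4 ->", process_count(8, 4), "process(es)")
```

Under that assumption, the run that does not hang uses a single process and the run that hangs uses two, which would be consistent with the exit problem needing more than one process to appear. That reading is an inference from the two data points above, not a confirmed diagnosis.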

#10 Updated by Karthik Senthil over 2 years ago

I debugged the test case with various setups. The following are some notes from the experiments:
  • The hang is not related to TRAM as such. I commented out all TRAM-related code in the test case and the program still hangs.
  • It is also not related to threaded entry methods (Participant::communicate) or to explicit calls to CthYield(). I ran multiple tests with this code removed, but the program still hangs.
  • A premature call to CkExit() is possible in the test case. The chares in the allToAllGroup (a group chare) contribute to exit after messaging all other chares in the same array. It is possible that all these chares process the contribute call before processing the messages from other chares, which leads to an early CkExit().
  • However, is this premature exit an issue, given that these messages are not significant for the test program? If it is, why doesn't the bug reproduce in other builds on other machines?
  • This could also be a bug in the Charm++ internal exit sequence handling for group chares (particularly for the gni layer?).
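
The premature-exit ordering described in the notes above can be modelled with a toy scheduler. This is a minimal sketch in plain Python, not Charm++ code: the function `simulate`, the two queues, and the `reduction_first` switch are all hypothetical stand-ins for whatever order the real runtime happens to pick. Each modelled chare sends one message to every other chare and then contributes; exit fires once every contribution has been processed.

```python
from collections import deque


def simulate(num_chares, reduction_first):
    """Return how many data messages are still undelivered when exit fires.

    reduction_first=True models a scheduler that happens to process all
    contribute calls before any of the point-to-point messages.
    """
    data = deque()           # pending point-to-point messages as (src, dst)
    contributions = deque()  # pending contributions to the exit reduction

    # Every chare messages all the others, then contributes.
    for src in range(num_chares):
        for dst in range(num_chares):
            if dst != src:
                data.append((src, dst))
        contributions.append(src)

    contributed = 0
    while data or contributions:
        if contributions and (reduction_first or not data):
            contributions.popleft()
            contributed += 1
            if contributed == num_chares:
                # The reduction is complete, so the exit callback runs now.
                return len(data)
        else:
            data.popleft()  # deliver one data message
    return 0


if __name__ == "__main__":
    print("undelivered at exit, contributions first:", simulate(4, True))
    print("undelivered at exit, messages first:", simulate(4, False))
```

In this model, with 4 chares, processing the contributions first reaches exit with all 12 data messages still queued, while processing the messages first reaches exit with none pending. The model only shows that the ordering is possible; it says nothing about why the hang appears on gni-smp builds in particular.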

#11 Updated by Karthik Senthil over 2 years ago

  • Closed date set to 2017-03-02 13:26:54.131311
  • % Done changed from 0 to 100
  • Status changed from In Progress to Merged
