Bug #1726

Bigsim autobuild failures in checkpoint/restart test

Added by Sam White over 1 year ago. Updated over 1 year ago.

Status: Merged
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 10/28/2017
Due date:
% Done: 0%


Description

tests/charm++/chkpt/ has been failing for the past couple of days on the Bigsim autobuild target on Charity.

./build LIBS netlrts-linux-x86_64 bigsim
cd netlrts-linux-x86_64/tests/charm++/chkpt/
make

./charmrun  ./hello +p2 +x3 +y1 +z1  ++local ++no-va-randomization
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.007 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: 57cc10f
Charm++> scheduler running in netpoll mode.
BG info> Simulating 3x1x1 nodes with 1 comm + 1 work threads each.
BG info> Network type: bluegene.
alpha: 1.000000e-07    packetsize: 1024    CYCLE_TIME_FACTOR:1.000000e-03.
CYCLES_PER_HOP: 5    CYCLES_PER_CORNER: 75.
BG info> cpufactor is 1.000000.
BG info> floating point factor is 0.000000.
BG info> Using WallTimer for timing method. 
CharmLB> Load balancer ignores processor background load.
CharmLB> Load balancer assumes all CPUs are same.
Trace: traceroot: ./hello
Running Hello on 3 processors for 8 elements
Charmrun> error on request socket to node 0 '127.0.0.1'--
Socket closed before recv.

https://charm.cs.illinois.edu/autobuild/cur/netlrts-linux-x86_64-bigsim.txt


Related issues

Related to Charm++ - Bug #1735: Hang in syncfttest restart after fixing #537 Merged 11/02/2017

History

#1 Updated by Sam White over 1 year ago

  • Assignee set to Karthik Senthil

#2 Updated by Sam White over 1 year ago

Build system changes broke bigsim even worse for the past couple of days, but those have been fixed, so this is showing up again.

#3 Updated by Sam White over 1 year ago

Phil commented that git bisect shows this commit at fault: https://charm.cs.illinois.edu/gerrit/#/c/381/

#4 Updated by Karthik Senthil over 1 year ago

This bug is not specific to the checkpoint/restart test. The hang happens when the number of Charm PEs does not evenly divide the number of bigsim PEs. For example, the following run command also hangs in tests/charm++/simplearrayhello:

./charmrun +p2 ./hello 10 +x3 +y1 +z1 +wth1
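
The divisibility condition above can be checked before launching a run. The following is a minimal sketch, not part of Charm++: pes_divide_evenly is a hypothetical helper, and it assumes the number of bigsim PEs equals x * y * z * wth, which is inferred from the log line "Simulating 3x1x1 nodes with 1 comm + 1 work threads each" followed by "Running Hello on 3 processors". It only flags configurations that match the condition described here; it is neither a workaround nor the fix.

```shell
# Hypothetical helper (not part of Charm++): reports whether the number of
# Charm PEs (+p) evenly divides the number of simulated bigsim PEs.
# Assumption: bigsim PEs = x * y * z * wth, inferred from the run log.
pes_divide_evenly() {
    charm_pes=$1; x=$2; y=$3; z=$4; wth=$5
    bigsim_pes=$((x * y * z * wth))
    if [ $((bigsim_pes % charm_pes)) -eq 0 ]; then
        echo "ok: $charm_pes Charm PEs divide $bigsim_pes bigsim PEs"
        return 0
    else
        echo "uneven: $charm_pes Charm PEs do not divide $bigsim_pes bigsim PEs"
        return 1
    fi
}

# +p2 with +x3 +y1 +z1 +wth1: 3 is not a multiple of 2 (the hanging case).
pes_divide_evenly 2 3 1 1 1 || true
# +p3 with the same simulated machine: 3 is a multiple of 3.
pes_divide_evenly 3 3 1 1 1 || true
```

With the arguments from the command above (+p2 and 3x1x1 nodes with one work thread each), the helper reports the uneven case, which is the configuration reported to hang.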

#5 Updated by Phil Miller over 1 year ago

  • Related to Bug #1735: Hang in syncfttest restart after fixing #537 added

#6 Updated by Phil Miller over 1 year ago

  • Status changed from New to Implemented

It looks like the fix for #1735 also solves this. I'll run the tests a few more times to gain a bit of confidence.

#7 Updated by Phil Miller over 1 year ago

Looks like bgtest is happy with that patch, too.

#9 Updated by Phil Miller over 1 year ago

  • Assignee changed from Karthik Senthil to Phil Miller

#10 Updated by Phil Miller over 1 year ago

  • Status changed from Implemented to Merged
