
Bug #1630

OpenAtom crashes during application startup on Blue Waters using master charm (again)

Added by Eric Bohm 13 days ago. Updated 7 days ago.

Status: Rejected
Priority: High
Assignee: -
Category: -
Target version: -
Start date: 07/12/2017
Due date: -
% Done: 100%
Spent time: -

Description

OpenAtom ran fine with Charm++ commit 75ea9d902febb0552170ab61ff79d14588494d7c,

but crashes during application startup with the current master when run on 32 nodes using water_256M_70Ry.

Currently bisecting; this is time-consuming due to the ridiculously long (40+ minute) charm build time on Blue Waters.


Related issues

Related to Charm++ - Bug #1632: ST_RecursivePartition (spanning tree) crashes during init with large node counts on systems without topo info Merged 07/14/2017

History

#1 Updated by Eric Bohm 11 days ago

Edison dies with a topology error at higher node counts. It is unclear whether this is actually a related bug or a different issue. 32 nodes works; 45 nodes crashes as follows:

------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: ST_RecursivePartition:: Increase bitset size to match size of largest topology dimension
------------- Processor 8 Exiting: Called CmiAbort ------------
Reason: ST_RecursivePartition:: Increase bitset size to match size of largest topology dimension
... etc
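
For context on what the abort means: the message suggests a fixed-size std::bitset in the spanning-tree code that must be at least as large as the largest dimension reported by the machine topology. A minimal sketch of that kind of guard, assuming a hypothetical kBitsetSize constant and a dims array from the topology query (not the actual Charm++ implementation):

#include <algorithm>
#include <bitset>
#include <cstdio>
#include <cstdlib>

// Hypothetical compile-time capacity, analogous to the bitset size the
// abort message asks to increase.
constexpr std::size_t kBitsetSize = 64;

void checkTopologyFits(const int *dims, int ndims) {
  // Largest dimension of the machine's mesh/torus topology.
  int largest = *std::max_element(dims, dims + ndims);
  if (static_cast<std::size_t>(largest) > kBitsetSize) {
    // Mirrors the CmiAbort in the log: a fixed-size bitset cannot index
    // every coordinate along the largest dimension.
    std::fprintf(stderr, "ST_RecursivePartition:: Increase bitset size to "
                         "match size of largest topology dimension\n");
    std::abort();
  }
  std::bitset<kBitsetSize> coords;  // one bit per coordinate along a dimension
  (void)coords;
}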

#2 Updated by Juan Galvez 11 days ago

That wouldn't happen on Blue Waters.

#3 Updated by Juan Galvez 11 days ago

Opened new issue for that here: https://charm.cs.illinois.edu/redmine/issues/1632
Fix coming soon.

#4 Updated by Phil Miller 11 days ago

  • Related to Bug #1632: ST_RecursivePartition (spanning tree) crashes during init with large node counts on systems without topo info added

#5 Updated by Juan Galvez 11 days ago

Like I said, these shouldn't be related at all, since the error on Edison would not appear on Blue Waters.

#6 Updated by Phil Miller 11 days ago

The relationship link records how #1632 was discovered: a link in the comments here wouldn't show, on that issue, where #1632 came from. It's not meant to indicate anything more.

#7 Updated by Phil Miller 11 days ago

Also, one hopes that fixing #1632 will make debugging this issue faster.

#8 Updated by Juan Galvez 11 days ago

Sounds reasonable, I agree.

#9 Updated by Juan Galvez 9 days ago

I built the latest charm master branch on BW (up to 58f6f16077535e24f5c474cc1ee7ea8aaa059bb2: "Bug #1632: ST_RecursivePartition crashes with large node counts without topo info")

with ./build charm++ gni-crayxe smp --enable-error-checking -O0 -g

I then built OpenAtom against this and ran water_256M_70Ry like this:
./tidy water
aprun -n 256 -N 4 -d 8 ~/openatom/build-O0-g/OpenAtom cpaimd_config.1024.smart.topo water.input +ppn 7 +commap 0,8,16,24 +pemap 1-7,9-15,17-23,25-31

and it ran without issues.

I then ran with 32 nodes and got:

Before control_new_mapping_function user mem 8.369904 MB
Before make_rho_runs user mem 8.369904 MB
tempering output dir TEMPER_OUT
Temperature exchange frequency set to 1000
TemperController for Tempers 1 and beads 1
expected energies 1
Initializing TopoManager
After Init topoManager user mem 8.370239 MB
Choose a larger Gstates_per_pe than 1 such that { (no. of processors [896] / no. of Instances [1]) / (no. of states [1024] / Gstates_per_pe [1]) } is > 0
OpenAtom: ../src_charm_driver/main/cpaimd.C:523: main::main(CkArgMsg*): Assertion `pm > 0' failed.
_pmiu_daemon(SIGCHLD): [NID 08508] [c14-9c0s1n0] [Sun Jul 16 22:31:25 2017] PE RANK 0 exit signal Aborted
[NID 08508] 2017-07-16 22:31:26 Apid 59517080: initiated application termination
Application 59517080 exit codes: 134
Application 59517080 exit signals: Killed
Application 59517080 resources: utime ~66s, stime ~55s, Rss ~62192, inblocks ~1183088, outblocks ~3600973

But this just looks like a run with incorrect input.
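
For reference, the failed assertion is just integer arithmetic on the reported values: 32 nodes with 4 processes per node and +ppn 7 gives 32 * 4 * 7 = 896 worker PEs, and with 1024 states and Gstates_per_pe = 1 the quotient (896 / 1) / (1024 / 1) truncates to 0. A minimal sketch of that computation (variable names are illustrative, not taken from cpaimd.C):

#include <cassert>

int main() {
  // Values from the log above (illustrative reconstruction).
  int numPes       = 32 * 4 * 7;  // 896 worker PEs: 32 nodes x 4 procs x +ppn 7
  int numInstances = 1;
  int numStates    = 1024;
  int gStatesPerPe = 1;           // Gstates_per_pe from the input file

  // Integer division: (896 / 1) / (1024 / 1) == 0, so pm > 0 fails,
  // which is why the message asks for a larger Gstates_per_pe.
  int pm = (numPes / numInstances) / (numStates / gStatesPerPe);
  assert(pm > 0);  // fails with these inputs, matching the reported abort
  return 0;
}

So increasing Gstates_per_pe in the input (or running with enough PEs per state) would make the quotient positive, consistent with the "incorrect input" reading above.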

#10 Updated by Eric Bohm 8 days ago

  • Status changed from New to Rejected
  • % Done changed from 0 to 100

I managed to partially automate the bisect process and have reproduced Juan's finding. Binaries produced by current charm master and openatom master do not have this bug. My best guess is that there was some residual cruft in the build tree used in the original process. I have started a fresh build using the charm version implicated in the original bug, in the hope of learning a bit more about the original issue. Regardless, I'm rejecting this redmine issue, as it is not reproducible.

#11 Updated by Eric Bohm 7 days ago

A run of the openatom build with the charm version that produced the original failure (v6.8.0-beta1-295-g8b554d0) now runs to completion. Mystery unsolved, but it is unlikely that further digging would reveal anything useful.
