Bug #1899

AMPI jacobi.iso crashes in migration on gni-crayxe-persistent-smp autobuild

Added by Nitin Bhat 7 months ago. Updated 5 months ago.

Status: New
Priority: Low
Assignee:
Category: -
Target version: -
Start date: 05/08/2018
Due date:
% Done: 0%
Tags:

Description

Running on 3 processors:  ./jacobi.iso 2 2 2 40 +vp8 +balancer RotateLB +LBDebug 1 +CmiSleepOnIdle
srun -n 3 -c 2 ./jacobi.iso 2 2 2 40 +vp8 +balancer RotateLB +LBDebug 1 +CmiSleepOnIdle
Launched in background. Redirecting stdin to /dev/null
srun: job 9141768 queued and waiting for resources
srun: job 9141768 has been allocated resources
Charm++> Running on Gemini (GNI) with 3 processes
Charm++> static SMSG
Charm++> SMSG memory: 14.8KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 3,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Warning> Using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
Converse/Charm++ Commit ID: 48fb2e9
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (48-way SMP).
CharmLB> RotateLB created.
iter 1 time: 0.035972 maxerr: 2020.200000
iter 2 time: 0.034367 maxerr: 1696.968000
iter 3 time: 0.034251 maxerr: 1477.170240
iter 4 time: 0.034648 maxerr: 1319.433024
iter 5 time: 0.034682 maxerr: 1200.918072

CharmLB> RotateLB: PE [0] step 0 starting at 0.538062 Memory: 35.077301 MB
CharmLB> RotateLB: PE [0] strategy starting at 0.549759
CharmLB> RotateLB: PE [0] Memory: LBManager: 890 KB CentralLB: 2 KB
CharmLB> RotateLB: PE [0] #Objects migrating: 8, LBMigrateMsg size: 0.00 MB
CharmLB> RotateLB: PE [0] strategy finished at 0.549769 duration 0.000009 s
srun: error: nid03300: tasks 1-2: Segmentation fault
srun: Terminating job step 9141768.0
slurmstepd: error: *** STEP 9141768.0 ON nid03300 CANCELLED AT 2018-05-07T02:26:58 ***
srun: error: nid03300: task 0: Terminated
srun: Force Terminated job step 9141768.0
Makefile:42: recipe for target 'test' failed

History

#1 Updated by Nitin Bhat 7 months ago

  • Subject changed from gni-crayxe-persistent-smp crashes with a segmentation fault on Edison to AMPI jacobi crashes with a segmentation fault on Edison with gni-crayxe-persistent-smp build

#2 Updated by Nitin Bhat 7 months ago

Stacktrace:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/global/project/projectdirs/m2609/autobuild/gni-crayxe-persistent-smp/charm/gni'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000203242be in _INTERNALf2d32e59::registerMessage (msg=<optimized out>, size=<optimized out>, seqno=<optimized out>, memh=0x2aaaecbfb178) at machine.C:2611
2611            *memh = GetMemHndl(msg);
[Current thread is 1 (Thread 0x2aaaecb2c700 (LWP 47666))]
(gdb) bt
#0  0x00000000203242be in _INTERNALf2d32e59::registerMessage (msg=<optimized out>, size=<optimized out>, seqno=<optimized out>, memh=0x2aaaecbfb178) at machine.C:2611
#1  _INTERNALf2d32e59::getLargeMsgRequest (header=0x2aaaaaab9c88, inst_id=<optimized out>, tag=<optimized out>, bufferRdmaQueue=<optimized out>) at machine.C:2712
#2  _INTERNALf2d32e59::PumpNetworkSmsg () at machine.C:2286
#3  0x0000000020329437 in LrtsAdvanceCommunication (whileidle=<optimized out>) at machine.C:3675
#4  _INTERNALf2d32e59::AdvanceCommunication (whenidle=<optimized out>) at machine-common-core.c:1559
#5  _INTERNALf2d32e59::CommunicationServer (sleepTime=<optimized out>) at machine-common-core.c:1584
#6  CommunicationServerThread (sleepTime=<optimized out>) at machine-common-core.c:1603
#7  _INTERNALf2d32e59::ConverseRunPE (everReturn=1448209285) at machine-common-core.c:1534
#8  0x000000002031d43a in _INTERNALf2d32e59::call_startfn (vindex=0x7d1740ca5651eb85) at machine-smp.c:414
#9  0x00000000204126c4 in start_thread (arg=0x2aaaecb2c700) at pthread_create.c:457
#10 0x00000000207b4e89 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)

#3 Updated by Sam White 7 months ago

  • Subject changed from AMPI jacobi crashes with a segmentation fault on Edison with gni-crayxe-persistent-smp build to AMPI jacobi.iso crashes in migration on gni-crayxe-persistent-smp autobuild

#4 Updated by Eric Bohm 7 months ago

  • Assignee set to Sam White

#5 Updated by Sam White 5 months ago

  • Priority changed from Normal to Low

This seems to have been a transient failure? It's not showing up anymore, so downgrading to "low" priority.
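
One way to check whether the failure really has stopped showing up would be to scan the saved autobuild test logs for the crash signature from the description (the `srun: error: ... Segmentation fault` line). The following is only a minimal sketch, assuming the logs are available as plain text. The name `find_segfaults` and the regular expression are hypothetical and not part of Charm++, AMPI, or the autobuild scripts.

```python
import re

# Hypothetical pattern for srun crash lines such as:
#   srun: error: nid03300: tasks 1-2: Segmentation fault
SEGFAULT_RE = re.compile(
    r"^srun: error: (?P<node>\S+): tasks? (?P<tasks>[\d,\-]+): Segmentation fault$"
)


def find_segfaults(log_text):
    """Return (node, tasks) pairs for every srun segfault line in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = SEGFAULT_RE.match(line.strip())
        if match:
            hits.append((match.group("node"), match.group("tasks")))
    return hits


if __name__ == "__main__":
    # Two lines copied from the log in the description.
    sample = (
        "srun: error: nid03300: tasks 1-2: Segmentation fault\n"
        "srun: error: nid03300: task 0: Terminated\n"
    )
    for node, tasks in find_segfaults(sample):
        print(f"segfault on {node}, tasks {tasks}")
```

An empty result over recent logs would be consistent with a transient failure, but it would not rule out an intermittent bug in the migration path.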
