Bug #1871

get_put_pingpong segfaults on gni-crayxc-smp

Added by Nitin Bhat over 1 year ago. Updated about 1 year ago.

Status: Merged
Priority: High
Assignee:
Category: -
Target version:
Start date: 04/17/2018
Due date:
% Done: 0%
Tags:

Description

Running on 2 processors:  ./get_put_pingpong 20 +CmiSleepOnIdle
srun -n 2 -c 2 ./get_put_pingpong 20 +CmiSleepOnIdle
Launched in background. Redirecting stdin to /dev/null
srun: job 9001760 queued and waiting for resources
srun: job 9001760 has been allocated resources
Charm++> Running on Gemini (GNI) with 2 processes
Charm++> static SMSG
Charm++> SMSG memory: 9.9KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 2,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: ef05db8
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (48-way SMP).
[1][1][1] Rget call completed
srun: error: nid00087: task 1: Segmentation fault
srun: Terminating job step 9001760.0
slurmstepd: error: *** STEP 9001760.0 ON nid00087 CANCELLED AT 2018-04-17T03:01:10 ***
srun: error: nid00087: task 0: Terminated
srun: Force Terminated job step 9001760.0
Makefile:19: recipe for target 'test' failed
make[6]: Leaving directory '/global/project/projectdirs/m2609/autobuild/gni-crayxc-smp/charm/gni-crayxc-smp/examples/charm++/zerocopy/direct_api/prereg/get_put_pingpong

Autobuild: http://charm.cs.illinois.edu/autobuild/old.2018_04_17__01_00/gni-crayxc-smp.txt

History

#1 Updated by Nitin Bhat over 1 year ago

  • Status changed from New to In Progress

Stack trace from the core dump is below. The same segfault occurs for simple_rget and simple_rput. It occurs only in the prereg and reg modes, not in the unreg mode.

Core was generated by `/global/u1/n/nbhat4/software/charm/gni-crayxc-smp/examples/charm++/zerocopy/dir'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000202748f2 in GNI_GetCompleted ()
[Current thread is 1 (Thread 0x2aaaeb1d4700 (LWP 6141))]
(gdb) bt
#0  0x00000000202748f2 in GNI_GetCompleted ()
#1  0x0000000020178079 in PumpOneSidedRDMATransactions (rdma_cq=0x40875a20, rdma_cq_lock=0x0) at machine-onesided.c:228
#2  0x00000000201745a8 in LrtsAdvanceCommunication (whileidle=1) at machine.c:3697
#3  0x0000000020167cb9 in AdvanceCommunication (whenidle=1) at machine-common-core.c:1555
#4  0x000000002016f2c6 in CommunicationServer (sleepTime=5) at machine-common-core.c:1579
#5  0x000000002016f307 in CommunicationServerThread (sleepTime=5) at machine-common-core.c:1598
#6  0x000000002016f24e in ConverseRunPE (everReturn=0) at machine-common-core.c:1530
#7  0x000000002016b3f9 in call_startfn (vindex=0x1) at machine-smp.c:414
#8  0x0000000020238784 in start_thread (arg=0x2aaaeb1d4700) at pthread_create.c:457
#9  0x00000000205cbaf9 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) q

#2 Updated by Nitin Bhat about 1 year ago

  • Status changed from In Progress to Implemented

#3 Updated by Nitin Bhat about 1 year ago

  • Status changed from Implemented to Merged
