Bug #1675

OFI replica crashes

Added by Jim Phillips 2 months ago. Updated about 1 month ago.

Status: Merged
Priority: Normal
Category: Machine Layers
Target version:
Start date: 09/13/2017
Due date:
% Done: 0%


Description

Testing NAMD replicas (8 nodes, 2 replicas) on Bridges with the non-smp ofi layer hangs or crashes most of the time during startup phase 7, right after "Info: PME USING 54 GRID NODES AND 54 TRANS NODES".

Converse/Charm++ Commit ID: v6.8.0-57-g5705b64-namd-charm-6.8.0-build-2017-Sep-12-21260
Info: useSync: 1 useProxySync: 0
Info: useSync: 1 useProxySync: 0
Charm++> Warning: Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 0-27
Charm++> Running on 4 unique compute nodes (28-way SMP).
Charm++> cpu topology info is gathered in 0.015 seconds.
Info: NAMD 2.12 for Linux-x86_64-Bridges
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60800 for ofi-linux-x86_64-iccstatic
Info: Built Tue Sep 12 17:58:38 EDT 2017 by jphillip on br005.pvt.bridges.psc.edu
Info: 1 NAMD  2.12  Linux-x86_64-Bridges  112    r423.pvt.bridges.psc.edu  jphillip
Info: Running on 112 processors, 112 nodes, 4 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.0766191 s
Info: 199.551 MB of memory in use based on /proc/self/stat
Info: Configuration file is apoa1/apoa1.namd

Related issues

Related to Charm++ - Feature #975: OFI Layer (Merged, 09/11/2017)
Related to Charm++ - Bug #1709: Need a test that uses +partitions (Implemented, 10/08/2017)

History

#1 Updated by Eric Bohm 2 months ago

  • Assignee set to Ronak Buch

#2 Updated by Ronak Buch 2 months ago

  • Assignee changed from Ronak Buch to Karthik Senthil

#3 Updated by Karthik Senthil 2 months ago

On 8 nodes of Stampede 2 with the OFI layer (non-smp), the following command results in a hang at the end of the simulation:

ibrun -n 512 -o 0 ./namd2 apoa1/apoa1.namd +replicas 2

#4 Updated by Karthik Senthil 2 months ago

I was able to reproduce the hang in a smaller case with the following command on Stampede 2:

ibrun -n 4 -o 0 ./namd2 apoa1/apoa1.namd +replicas 2

On inspection with gdb I got the following stack trace:

Program received signal SIGINT, Interrupt.
0x00002aaaab4c023b in pthread_spin_trylock () from /usr/lib64/libpthread.so.0
(gdb) bt
#0  0x00002aaaab4c023b in pthread_spin_trylock ()
   from /usr/lib64/libpthread.so.0
#1  0x00002aaaaad350b3 in psmx_cq_poll_mq () from /usr/lib64/libfabric.so.1
#2  0x00002aaaaad35756 in psmx_cq_readfrom () from /usr/lib64/libfabric.so.1
#3  0x0000000000eeda48 in fi_cq_read (cq=0x14cef70, buf=0x7fffffff8920, 
    count=8) at /usr/include/rdma/fi_eq.h:375
#4  0x0000000000ef2bb2 in process_completion_queue () at machine.c:1162
#5  0x0000000000ef2d54 in LrtsAdvanceCommunication (whileidle=0)
    at machine.c:1298
#6  0x0000000000eed5b6 in AdvanceCommunication (whenidle=0)
    at machine-common-core.c:1317
#7  0x0000000000eed826 in CmiGetNonLocal () at machine-common-core.c:1487
#8  0x0000000000ef5290 in CsdNextMessage (s=0x7fffffff8bf0) at convcore.c:1779
#9  0x0000000000ef55dc in CsdSchedulePoll () at convcore.c:1970
#10 0x0000000000c49815 in replica_barrier ()
#11 0x0000000000bd6621 in ScriptTcl::run() ()
#12 0x000000000073d44d in after_backend_init(int, char**) ()
#13 0x00000000006c409b in main ()

This points to the while loop at namd/src/DataExchanger.C:172 inside the replica_barrier function; control is handed off to the machine layer from there.
It looks like a trylock failure, meaning that whatever acquired the lock is not releasing it, so the hang is this process spinning while trying to acquire the lock.
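
For reference, the polling pattern implied by this backtrace is roughly the sketch below. It is not the actual NAMD/Charm++ source: the barrier flag, the entry-buffer size, and the error handling are assumptions for illustration; only CsdSchedulePoll() (Converse) and fi_cq_read() (libfabric) are real APIs, matching frames #9 and #3 above.

#include <rdma/fabric.h>
#include <rdma/fi_errno.h>   // FI_EAGAIN
#include <rdma/fi_eq.h>      // fi_cq_read(), struct fi_cq_tagged_entry
extern "C" void CsdSchedulePoll(void);    // Converse scheduler poll (frame #9)

static volatile int barrier_reached = 0;  // assumed flag, set by a handler when
                                          // the barrier reply message is delivered

// NAMD side (frame #10): replica_barrier drains the scheduler until the barrier
// completes; if the reply never arrives, this loop spins forever, which is the
// hang observed here.
void replica_barrier_sketch() {
  while (!barrier_reached) CsdSchedulePoll();
}

// OFI machine-layer side (frames #3-#4): each scheduler poll drains the libfabric
// completion queue; in the PSM provider this polls under a pthread spin lock
// (psmx_cq_poll_mq, frames #0-#2), which is where the trace is sitting.
void process_completion_queue_sketch(struct fid_cq *cq) {
  struct fi_cq_tagged_entry entries[8];   // buffer size chosen arbitrarily
  ssize_t n = fi_cq_read(cq, entries, 8);
  if (n > 0) {
    // hand the n completed sends/receives to the Converse message handlers ...
  } else if (n != -FI_EAGAIN) {
    // a genuine error (or an entry on the error queue) would be handled here
  }
}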

#5 Updated by Jim Phillips 2 months ago

This is a non-smp run, right? Are there other threads created by libfabric? Does pthreads require initialization?

#6 Updated by Nitin Bhat 2 months ago

Yes, it is a non-smp run. I'm guessing these are pthreads created by libfabric; I've asked the folks at Intel about it.

#7 Updated by Jim Phillips about 2 months ago

But are there actually multiple threads launched, or is there just a single thread making pthread calls?

#8 Updated by Nitin Bhat about 2 months ago

In a single instance, one thread is created, but I'm not sure exactly how replicas work. Would 2 replicas (2 Charm++ instances) result in two user threads contending for the same resource? If so, those two threads might be causing the deadlock.

#9 Updated by Jim Phillips about 2 months ago

Replicas never share a process; the launched processes are simply divided among the replicas.
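
As a concrete illustration (a hedged sketch, not code from this report): when Charm++ is started with +replicas/+partitions N, the launched processes are split into N disjoint partitions, and CmiMyPe()/CmiNumPes() are local to the calling process's partition. To my understanding, CmiMyPartition() and CmiNumPartitions() are the Converse calls that report that layout.

#include "converse.h"   // Converse API; assumes a Charm++ build environment

// With "ibrun -n 4 ./namd2 apoa1/apoa1.namd +replicas 2", the 4 processes form
// 2 partitions of 2 PEs each, and no process belongs to more than one replica.
void print_partition_layout() {
  CmiPrintf("partition %d of %d: PE %d of %d within this replica\n",
            CmiMyPartition(), CmiNumPartitions(),  // which replica this process is in
            CmiMyPe(), CmiNumPes());               // numbering local to the partition
}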

#10 Updated by Phil Miller about 2 months ago

#12 Updated by Karthik Senthil about 1 month ago

  • Status changed from New to Implemented

#13 Updated by Phil Miller about 1 month ago

  • Status changed from Implemented to Merged

#14 Updated by Phil Miller about 1 month ago

  • Related to Bug #1709: Need a test that uses +partitions added
