OFI replica crashes
Testing NAMD replicas (8 nodes, 2 replicas) on Bridges with the non-SMP OFI layer hangs or crashes most of the time in startup phase 7, right after "Info: PME USING 54 GRID NODES AND 54 TRANS NODES".
Converse/Charm++ Commit ID: v6.8.0-57-g5705b64-namd-charm-6.8.0-build-2017-Sep-12-21260
Info: useSync: 1 useProxySync: 0
Info: useSync: 1 useProxySync: 0
e kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 0-27
Charm++> Running on 4 unique compute nodes (28-way SMP).
Charm++> cpu topology info is gathered in 0.015 seconds.
Info: NAMD 2.12 for Linux-x86_64-Bridges
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60800 for ofi-linux-x86_64-iccstatic
Info: Built Tue Sep 12 17:58:38 EDT 2017 by jphillip on br005.pvt.bridges.psc.edu
Info: 1 NAMD 2.12 Linux-x86_64-Bridges 112 r423.pvt.bridges.psc.edu jphillip
Info: Running on 112 processors, 112 nodes, 4 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.0766191 s
Info: 199.551 MB of memory in use based on /proc/self/stat
Info: Configuration file is apoa1/apoa1.namd
#4 Updated by Karthik Senthil 9 months ago
I was able to reproduce the hang with a smaller case using the following command on Stampede 2:
ibrun -n 4 -o 0 ./namd2 apoa1/apoa1.namd +replicas 2
On inspection with gdb I got the following stack trace:
Program received signal SIGINT, Interrupt.
0x00002aaaab4c023b in pthread_spin_trylock () from /usr/lib64/libpthread.so.0
(gdb) bt
#0 0x00002aaaab4c023b in pthread_spin_trylock () from /usr/lib64/libpthread.so.0
#1 0x00002aaaaad350b3 in psmx_cq_poll_mq () from /usr/lib64/libfabric.so.1
#2 0x00002aaaaad35756 in psmx_cq_readfrom () from /usr/lib64/libfabric.so.1
#3 0x0000000000eeda48 in fi_cq_read (cq=0x14cef70, buf=0x7fffffff8920, count=8) at /usr/include/rdma/fi_eq.h:375
#4 0x0000000000ef2bb2 in process_completion_queue () at machine.c:1162
#5 0x0000000000ef2d54 in LrtsAdvanceCommunication (whileidle=0) at machine.c:1298
#6 0x0000000000eed5b6 in AdvanceCommunication (whenidle=0) at machine-common-core.c:1317
#7 0x0000000000eed826 in CmiGetNonLocal () at machine-common-core.c:1487
#8 0x0000000000ef5290 in CsdNextMessage (s=0x7fffffff8bf0) at convcore.c:1779
#9 0x0000000000ef55dc in CsdSchedulePoll () at convcore.c:1970
#10 0x0000000000c49815 in replica_barrier ()
#11 0x0000000000bd6621 in ScriptTcl::run() ()
#12 0x000000000073d44d in after_backend_init(int, char**) ()
#13 0x00000000006c409b in main ()
This points to the while loop at namd/src/DataExchanger.C:172, inside the
replica_barrier function. From there, control passes through CsdSchedulePoll down to the machine layer (frames #9 to #4 above).
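To make the shape of that loop concrete, here is a minimal sketch of a barrier that keeps polling until a completion condition becomes true. It is not the code from DataExchanger.C: barrier_wait, poll_fn, arrives_after, never_arrives and the max_polls cap are all hypothetical names introduced for illustration, and the poll callback only stands in for the role CsdSchedulePoll plays in the trace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one scheduler poll; returns true once the
   barrier condition has been observed. Not a Charm++ or NAMD API. */
typedef bool (*poll_fn)(void *state);

/* Poll until the condition holds or max_polls attempts have been made.
   Returns the number of polls used, or -1 if the cap was reached.
   Without the cap, a condition that never becomes true spins forever,
   which is how a hang inside a polling barrier would look. */
long barrier_wait(poll_fn poll, void *state, long max_polls)
{
    for (long n = 1; n <= max_polls; n++) {
        if (poll(state)) {
            return n;
        }
    }
    return -1;
}

/* Example poll: the condition becomes true after *state polls. */
bool arrives_after(void *state)
{
    long *remaining = state;
    return --(*remaining) <= 0;
}

/* Example poll: the expected message never arrives. */
bool never_arrives(void *state)
{
    (void)state;
    return false;
}
```

Calling barrier_wait with never_arrives returns -1 once the cap is hit; an uncapped loop of the same shape would simply keep re-entering the poll, which matches a process that is always found somewhere below CsdSchedulePoll when interrupted.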
It looks like a trylock failure: something has acquired a lock and is not releasing it, so this process hangs while repeatedly trying to acquire that lock.
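For reference, here is a small sketch of what a failed trylock looks like at the POSIX level. It only uses the standard pthread spin lock calls; the helper names trylock_while_held and trylock_when_free are hypothetical, and nothing here reflects how libfabric's psmx provider actually manages its locks.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: try to take a spin lock that is already held.
   pthread_spin_trylock does not block; it returns EBUSY instead. */
int trylock_while_held(void)
{
    pthread_spinlock_t lock;
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_lock(&lock);              /* the lock now has an owner */
    int rc = pthread_spin_trylock(&lock);  /* second attempt fails at once */
    pthread_spin_unlock(&lock);
    pthread_spin_destroy(&lock);
    return rc;
}

/* Hypothetical helper: try to take a spin lock that nobody holds.
   The call succeeds and returns 0. */
int trylock_when_free(void)
{
    pthread_spinlock_t lock;
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    int rc = pthread_spin_trylock(&lock);
    if (rc == 0) {
        pthread_spin_unlock(&lock);
    }
    pthread_spin_destroy(&lock);
    return rc;
}
```

Because the call returns immediately either way, a caller that retries in a loop never sleeps in the lock itself. If the owner never releases the lock, every retry gets EBUSY, and the process would be found in or near pthread_spin_trylock whenever it is interrupted.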