Bug #1048

Verbs on Bridges at PSC crashes or hangs

Added by Bilge Acun about 3 years ago. Updated almost 2 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
Machine Layers
Target version:
Start date:
04/28/2016
Due date:
% Done:

0%

Spent time:

Description

Megatest sometimes hangs with:

Info: Startup phase 0 took 0.000252008 s, 221.703 MB of memory in use
[0] wc[0] status 9 wc[i].opcode 0

or crashes with a length mismatch error; the messages appear to be corrupted.


Charm++> Running in non-SMP mode: numPes 56
Converse/Charm++ Commit ID:
v6.7.0-42-gf835f2a-namd-charm-6.7.0-build-2016-Feb-07-94042
Warning> Randomization of stack pointer is turned on in kernel, thread
migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space'
as root to disable it, or try run with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 0-27
------------- Processor 6 Exiting: Called CmiAbort ------------
Reason:

                 Length mismatch!!

[6] Stack Traceback:
   [6:0]   [0x5f0cc5]
   [6:1]   [0x608312]
   [6:2]   [0x627cfc]
   [6:3]   [0x4d3799]
   [6:4]   [0x604173]
   [6:5]   [0x4d0ed1]
   [6:6] __libc_start_main+0xf5  [0x7feb01fb1b15]
   [6:7]   [0x4076e9]
------------- Processor 9 Exiting: Called CmiAbort ------------
Reason:

                 Length mismatch!!

[9] Stack Traceback:
   [9:0]   [0x5f0cc5]
   [9:1]   [0x608312]
   [9:2]   [0x627cfc]
   [9:3]   [0x4d3799]
   [9:4]   [0x604173]
   [9:5]   [0x4d0ed1]
   [9:6] __libc_start_main+0xf5  [0x7f3f9597cb15]
   [9:7]   [0x4076e9]
------------- Processor 11 Exiting: Called CmiAbort ------------
Reason:

                 Length mismatch!!

[11] Stack Traceback:
   [11:0]   [0x5f0cc5]
   [11:1]   [0x608312]
   [11:2]   [0x627cfc]
   [11:3]   [0x4d3799]
   [11:4]   [0x604173]
   [11:5]   [0x4d0ed1]
   [11:6] __libc_start_main+0xf5  [0x7f1236209b15]
   [11:7]   [0x4076e9]
Fatal error on PE 6>

                 Length mismatch!!


Related issues

Related to Charm++ - Bug #1381: Crash in LrtsInitCpuTopo() on Quartz with verbs layer Rejected 01/26/2017
Related to Charm++ - Bug #1409: verbs crashes on Stampede KNL and Bridges Rejected 02/10/2017
Related to Charm++ - Bug #529: LrtsInitCpuTopo() operates asynchronously and overlaps with group constructors Closed 07/06/2014

History

#1 Updated by Sam White about 3 years ago

ChaNGa is seeing crashes/hangs for some runs on verbs, which are fixed with this commit: https://charm.cs.illinois.edu/gerrit/#/c/1152/

#2 Updated by Jim Phillips about 3 years ago

Sam White wrote:

ChaNGa is seeing crashes/hangs for some runs on verbs, which are fixed with this commit: https://charm.cs.illinois.edu/gerrit/#/c/1152/

That is unlikely to be connected.

Bilge neglected to mention that what is unique about Bridges is the new Intel Omni-Path interconnect.

#3 Updated by Eric Bohm about 3 years ago

  • Assignee set to Bilge Acun

#4 Updated by Jim Phillips about 3 years ago

Ping.

#5 Updated by Jim Phillips about 3 years ago

Ran a few more tests. Enabling debugging or error checking makes no difference. Running the latest git verbs-linux-x86_64-iccstatic megatest with +p56 on 2 nodes of Bridges always succeeds for me, but +p56 on 3, 4, or 8 nodes always crashes after "test 17: initiated [marshall (olawlor)]" with these two errors:

[0] Assertion "0" failed in file machine-ibverbs.c line 1406.
[19] Assertion "0" failed in file machine-ibverbs.c line 1372.

It's always test 17 and always pes 0 and 19, regardless of node count.

These lines are the checks for ibv_poll_cq returning wc[i].status != IBV_WC_SUCCESS in pollSendCq (line 1406) and pollRecvCq (line 1372), so the error is related to some message sent from pe 0 to pe 19. With 2 nodes these pes are on the same node, but with 3 nodes they are not. I see that marshall.C defines numElements=20 and element 19 calls megatest_finish().
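For readers unfamiliar with the verbs layer: the failing assertions are the generic status check performed after polling a completion queue. A minimal sketch of that pattern (simplified stand-in types and a mock poll with illustrative values, not the actual machine-ibverbs.c code):

```c
/* Sketch of the completion-queue polling pattern behind the asserts.
 * ibv_poll_cq() fills an array of struct ibv_wc; any completion whose
 * status is not IBV_WC_SUCCESS trips the Assertion "0" above.
 * Types, values, and the poll function here are simplified stand-ins. */
#include <stdio.h>

enum { WC_SUCCESS = 0, WC_REM_INV_REQ_ERR = 9 };  /* values follow enum ibv_wc_status */

struct wc_sketch { int status; unsigned qp_num; unsigned byte_len; };

/* Stand-in for ibv_poll_cq(): pretend one failed completion arrived. */
static int poll_cq_sketch(struct wc_sketch *wc, int max_entries) {
    (void)max_entries;
    wc[0].status   = WC_REM_INV_REQ_ERR;  /* a status-9 send-side failure */
    wc[0].qp_num   = 300682;              /* illustrative queue-pair number */
    wc[0].byte_len = 4167;                /* illustrative message size */
    return 1;                             /* one completion polled */
}

/* Shape of the check in pollSendCq/pollRecvCq: return the first
 * non-success status, or 0 if every polled completion succeeded. */
int check_cq(void) {
    struct wc_sketch wc[32];
    int n = poll_cq_sketch(wc, 32);
    for (int i = 0; i < n; i++) {
        if (wc[i].status != WC_SUCCESS) {
            fprintf(stderr, "wc[%d] status %d qp_num %u size %u\n",
                    i, wc[i].status, wc[i].qp_num, wc[i].byte_len);
            return wc[i].status;  /* machine-ibverbs.c calls CmiAbort here */
        }
    }
    return 0;
}
```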

If I remove the marshall test the rest of megatest completes.

If I run 8 nodes with 28 pes per node (224 pes) then all megatest tests pass.

If I run NAMD with 28 pes per node on 2 nodes I get the same errors in startup phase 1 (namdOneSend/Recv):

[0] Assertion "0" failed in file machine-ibverbs.c line 1406.
[28] Assertion "0" failed in file machine-ibverbs.c line 1372.

Running NAMD on a single node succeeds.

#6 Updated by Jim Phillips about 3 years ago

Error on the send side:

[0] pollSendCq wc[0] status 9 vendor_err 0 qp_num 300682 size 4167
[0] Assertion "0" failed in file machine-ibverbs.c line 1407.

IBV_WC_REM_INV_REQ_ERR (9) - This event is generated when the receive buffer is smaller than the incoming send. It is generated on the sender side of the connection. It may also be generated if the QP attributes are not set correctly, particularly those governing MR access.

and on the receive side:

[28] pollRecvCq wc[0] status 2 vendor_err 0 qp_num 151486 size 4800
[28] Assertion "0" failed in file machine-ibverbs.c line 1373.

IBV_WC_LOC_QP_OP_ERR (2) - This event is generated when a QP error occurs. For example, it may be generated if a user neglects to specify responder_resources and initiator_depth values in struct rdma_conn_param before calling rdma_connect() on the client side and rdma_accept() on the server side.

If I reduce the message buffer size in NAMD's Communicate.C from 4096 to 2048 I start getting a garbage size and a different error:

[28] pollSendCq wc[0] status 1 vendor_err 0 qp_num 44532 size 52017920
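For reference, the status numbers in these work completions come from enum ibv_wc_status in &lt;infiniband/verbs.h&gt;, and real code can translate them with libibverbs' ibv_wc_status_str(). A hedged sketch of the codes seen in this thread (numeric values as defined in verbs.h, to the best of my knowledge):

```c
/* Sketch lookup for the ibv_wc_status codes seen in this thread.
 * Values follow enum ibv_wc_status in <infiniband/verbs.h>; real
 * code would call ibv_wc_status_str() rather than hand-rolling this. */
#include <string.h>  /* for callers comparing the returned names */

const char *wc_status_name(int status) {
    switch (status) {
    case 0:  return "IBV_WC_SUCCESS";
    case 1:  return "IBV_WC_LOC_LEN_ERR";      /* local length error (garbage-size case) */
    case 2:  return "IBV_WC_LOC_QP_OP_ERR";    /* receive-side QP error */
    case 9:  return "IBV_WC_REM_INV_REQ_ERR";  /* remote invalid request (send side) */
    default: return "unknown";
    }
}
```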

#7 Updated by Jim Phillips about 3 years ago

If I #define QLOGIC 1 in machine-ibverbs.c, NAMD runs!

#8 Updated by Bilge Acun about 3 years ago

Strange, I'm pretty sure I've tried testing with QLOGIC enabled and it didn't work for me. Did you change anything else?

#9 Updated by Jim Phillips about 3 years ago

Adding --with-qlogic to the Charm++ build line appears to fix the issue for NAMD.

There is a performance issue with verbs (but not net-ibverbs) non-smp on 2 nodes (but not on 8), and smp on both 2 and 8 nodes:

jun24_ibverbs_2.log:Info: Initial time: 56 CPUs 0.0215851 s/step 0.249828 days/ns 1296.97 MB memory
jun24_ibverbs_2.log:Info: Initial time: 56 CPUs 0.0215135 s/step 0.248999 days/ns 1296.97 MB memory
jun24_ibverbs_2.log:Info: Initial time: 56 CPUs 0.0178656 s/step 0.206777 days/ns 1296.97 MB memory
jun24_ibverbs_2.log:Info: Benchmark time: 56 CPUs 0.0170679 s/step 0.197545 days/ns 1296.97 MB memory
jun24_ibverbs_2.log:Info: Benchmark time: 56 CPUs 0.0169763 s/step 0.196485 days/ns 1296.97 MB memory
jun24_ibverbs_2.log2:Info: Initial time: 56 CPUs 0.0211198 s/step 0.244442 days/ns 1297.02 MB memory
jun24_ibverbs_2.log2:Info: Initial time: 56 CPUs 0.0211551 s/step 0.244851 days/ns 1297.02 MB memory
jun24_ibverbs_2.log2:Info: Initial time: 56 CPUs 0.0179597 s/step 0.207867 days/ns 1297.02 MB memory
jun24_ibverbs_2.log2:Info: Benchmark time: 56 CPUs 0.0172216 s/step 0.199325 days/ns 1297.02 MB memory
jun24_ibverbs_2.log2:Info: Benchmark time: 56 CPUs 0.0171032 s/step 0.197954 days/ns 1297.02 MB memory

jun24_ibverbs_8.log:Info: Initial time: 224 CPUs 0.00948906 s/step 0.109827 days/ns 1297.77 MB memory
jun24_ibverbs_8.log:Info: Initial time: 224 CPUs 0.0094577 s/step 0.109464 days/ns 1297.77 MB memory
jun24_ibverbs_8.log:Info: Initial time: 224 CPUs 0.00636057 s/step 0.0736177 days/ns 1297.77 MB memory
jun24_ibverbs_8.log:Info: Benchmark time: 224 CPUs 0.00542004 s/step 0.062732 days/ns 1297.77 MB memory
jun24_ibverbs_8.log:Info: Benchmark time: 224 CPUs 0.00545617 s/step 0.0631501 days/ns 1297.77 MB memory

jun24_verbs_2.log:Info: Initial time: 56 CPUs 0.0316243 s/step 0.366022 days/ns 1301.98 MB memory
jun24_verbs_2.log:Info: Initial time: 56 CPUs 0.0373169 s/step 0.431909 days/ns 1301.98 MB memory
jun24_verbs_2.log:Info: Initial time: 56 CPUs 0.0336255 s/step 0.389184 days/ns 1301.98 MB memory
jun24_verbs_2.log:Info: Benchmark time: 56 CPUs 0.0351351 s/step 0.406656 days/ns 1301.98 MB memory
jun24_verbs_2.log:Info: Benchmark time: 56 CPUs 0.0350701 s/step 0.405903 days/ns 1301.98 MB memory
jun24_verbs_2.log2:Info: Initial time: 56 CPUs 0.0293083 s/step 0.339216 days/ns 1302.04 MB memory
jun24_verbs_2.log2:Info: Initial time: 56 CPUs 0.0384236 s/step 0.444718 days/ns 1302.04 MB memory
jun24_verbs_2.log2:Info: Initial time: 56 CPUs 0.0337439 s/step 0.390554 days/ns 1302.04 MB memory
jun24_verbs_2.log2:Info: Benchmark time: 56 CPUs 0.0367625 s/step 0.425492 days/ns 1302.04 MB memory
jun24_verbs_2.log2:Info: Benchmark time: 56 CPUs 0.0351549 s/step 0.406886 days/ns 1302.04 MB memory
jun24_verbs_2.log3:Info: Initial time: 56 CPUs 0.0294955 s/step 0.341383 days/ns 1302.04 MB memory
jun24_verbs_2.log3:Info: Initial time: 56 CPUs 0.0358186 s/step 0.414567 days/ns 1302.04 MB memory
jun24_verbs_2.log3:Info: Initial time: 56 CPUs 0.0352052 s/step 0.407467 days/ns 1302.04 MB memory
jun24_verbs_2.log3:Info: Benchmark time: 56 CPUs 0.034646 s/step 0.400995 days/ns 1302.04 MB memory
jun24_verbs_2.log3:Info: Benchmark time: 56 CPUs 0.0378174 s/step 0.437701 days/ns 1302.04 MB memory
jun24_verbs_2.log4:Info: Initial time: 56 CPUs 0.0300484 s/step 0.347783 days/ns 1302.04 MB memory
jun24_verbs_2.log4:Info: Initial time: 56 CPUs 0.036702 s/step 0.424792 days/ns 1302.04 MB memory
jun24_verbs_2.log4:Info: Initial time: 56 CPUs 0.0349405 s/step 0.404404 days/ns 1302.04 MB memory
jun24_verbs_2.log4:Info: Benchmark time: 56 CPUs 0.037088 s/step 0.429259 days/ns 1302.04 MB memory
jun24_verbs_2.log4:Info: Benchmark time: 56 CPUs 0.0361525 s/step 0.418432 days/ns 1302.04 MB memory

jun24_verbs_8.log:Info: Initial time: 224 CPUs 0.0095907 s/step 0.111003 days/ns 1303.32 MB memory
jun24_verbs_8.log:Info: Initial time: 224 CPUs 0.00950899 s/step 0.110058 days/ns 1303.32 MB memory
jun24_verbs_8.log:Info: Initial time: 224 CPUs 0.00640776 s/step 0.0741639 days/ns 1303.32 MB memory
jun24_verbs_8.log:Info: Benchmark time: 224 CPUs 0.00547675 s/step 0.0633883 days/ns 1303.32 MB memory
jun24_verbs_8.log:Info: Benchmark time: 224 CPUs 0.00538353 s/step 0.0623094 days/ns 1303.32 MB memory

jun24_ismp_2.log:Info: Initial time: 54 CPUs 0.02172 s/step 0.251389 days/ns 30842.1 MB memory
jun24_ismp_2.log:Info: Initial time: 54 CPUs 0.0213978 s/step 0.24766 days/ns 30847.9 MB memory
jun24_ismp_2.log:Info: Initial time: 54 CPUs 0.0193198 s/step 0.223608 days/ns 30847.9 MB memory
jun24_ismp_2.log:Info: Benchmark time: 54 CPUs 0.019652 s/step 0.227454 days/ns 30847.9 MB memory
jun24_ismp_2.log:Info: Benchmark time: 54 CPUs 0.0174578 s/step 0.202058 days/ns 30847.9 MB memory

jun24_ismp_8.log:Info: Initial time: 216 CPUs 0.00887376 s/step 0.102706 days/ns 30842.1 MB memory
jun24_ismp_8.log:Info: Initial time: 216 CPUs 0.010526 s/step 0.121829 days/ns 30842.1 MB memory
jun24_ismp_8.log:Info: Initial time: 216 CPUs 0.00722224 s/step 0.0835907 days/ns 30842.1 MB memory
jun24_ismp_8.log:Info: Benchmark time: 216 CPUs 0.00774039 s/step 0.0895878 days/ns 30842.1 MB memory
jun24_ismp_8.log:Info: Benchmark time: 216 CPUs 0.00814082 s/step 0.0942225 days/ns 30842.1 MB memory

jun24_smp_2.log:Info: Initial time: 54 CPUs 0.02053 s/step 0.237616 days/ns 30842.3 MB memory
jun24_smp_2.log:Info: Initial time: 54 CPUs 0.0203944 s/step 0.236047 days/ns 30845.8 MB memory
jun24_smp_2.log:Info: Initial time: 54 CPUs 0.0260431 s/step 0.301425 days/ns 30843.7 MB memory
jun24_smp_2.log:Info: Benchmark time: 54 CPUs 0.0205534 s/step 0.237886 days/ns 30843.7 MB memory
jun24_smp_2.log:Info: Benchmark time: 54 CPUs 0.0263952 s/step 0.3055 days/ns 30843.7 MB memory

jun24_smp_8.log:Info: Initial time: 216 CPUs 0.0179319 s/step 0.207545 days/ns 30841.6 MB memory
jun24_smp_8.log:Info: Initial time: 216 CPUs 0.019759 s/step 0.228692 days/ns 30841.6 MB memory
jun24_smp_8.log:Info: Initial time: 216 CPUs 0.0276869 s/step 0.32045 days/ns 30841.6 MB memory
jun24_smp_8.log:Info: Benchmark time: 216 CPUs 0.0205866 s/step 0.23827 days/ns 30841.6 MB memory
jun24_smp_8.log:Info: Benchmark time: 216 CPUs 0.0198554 s/step 0.229808 days/ns 30841.6 MB memory

#10 Updated by Jim Phillips about 3 years ago

Bilge Acun wrote:

Strange, I'm pretty sure I've tried testing with QLOGIC enabled and it didn't work for me. Did you change anything else?

No, --with-qlogic is all. It's possible they fixed something, or megatest fails for some other reason.

#11 Updated by Jim Phillips about 3 years ago

Jim Phillips wrote:

Adding --with-qlogic to the Charm++ build line appears to fix the issue for NAMD.

There is a performance issue with verbs (but not net-ibverbs) non-smp on 2 nodes (but not on 8), and smp on both 2 and 8 nodes:

[...]

The 2-node non-smp issue is stretches in ComputePmeMgr::recvTrans and recvUntrans entries that send no messages but immediately follow entries that create 55 messages. I can't see any network polling inside these entries, so maybe it is an interrupt. For apoa1, PME uses a maximum of 54 slabs, so on two nodes the transposes are all-to-all (27 pes per node, each sending 54 messages, so 1458 per node). Leaving a core free or disabling cpu affinity has no effect.
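The transpose-volume arithmetic above can be encoded as a trivial check (hypothetical helper name; just the stated numbers: 54 slabs, 27 PEs per node):

```c
/* Trivial check of the all-to-all transpose volume described above:
 * with apoa1's 54 PME slabs and 27 worker PEs per node on 2 nodes,
 * every PE sends one transpose message per slab. */
int transpose_msgs_per_node(int slabs, int pes_per_node) {
    return pes_per_node * slabs;  /* one message per (PE, slab) pair */
}
```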

#12 Updated by Phil Miller almost 3 years ago

  • Category set to Machine Layers

#13 Updated by Phil Miller over 2 years ago

Would it be possible to try loading older versions of the relevant libraries to see which ones do or don't crash? Then we could test for having a sufficiently new version in configure, and pressure admins at other sites to upgrade appropriately.

#14 Updated by Phil Miller over 2 years ago

  • Target version set to 6.8.0-beta1

#15 Updated by Phil Miller over 2 years ago

  • Target version changed from 6.8.0-beta1 to 6.8.1

#16 Updated by Phil Miller about 2 years ago

  • Related to Bug #529: LrtsInitCpuTopo() operates asynchronously and overlaps with group constructors added

#17 Updated by Sam White almost 2 years ago

  • Assignee changed from Bilge Acun to Jaemin Choi
  • Subject changed from Verbs on Bridges at PSC crashes or hanges to Verbs on Bridges at PSC crashes or hangs

#18 Updated by Phil Miller almost 2 years ago

(Updating all of #1048, #1381, #1409)

Since these machines are all ostensibly going to be using the ofi network layer from 6.8.1 onward, do we want to close these bugs as 'rejected'?

#19 Updated by Sam White almost 2 years ago

  • Status changed from New to Rejected

OFI should be used on Bridges.
