Bug #1473

verbs build hangs in tests/charm++/communication_overhead

Added by Phil Miller over 1 year ago. Updated 9 months ago.

Status: Implemented
Priority: Normal
Assignee:
Category: Build & Test Automation
Target version:
Start date: 03/21/2017
Due date:
% Done: 0%


Description

==> old.2017_03_18__01_00/verbs-linux-x86_64.txt <==
Charm++ group communication with allocation timing enabled

 PE             MSG SIZE     PER MSG TIME(us)             BW(MB/s)         OVERHEAD(us)
slurmstepd: *** JOB 8367429 ON c457-401 CANCELLED AT 2017-03-18T08:55:38 DUE TO TIME LIMIT ***
make[2]: *** [test] Terminated
make[1]: *** [test] Terminated
make: make[3]: *** [test] Terminated*** [test] Terminated

fatal> error code 1 during remote> ./instead_test.sh charm/verbs-linux-x86_64/tmp make  test ++timeout 180 +isomalloc_sync
Returned from executing scripts/verbs-linux-x86_64/test on remote host
fatal> Test on remote host failed with fatal error (0)
Bad: Test on remote host failed with fatal error (0)

==> old.2017_03_19__01_00/verbs-linux-x86_64.txt <==
Charm++ group communication with allocation timing enabled

 PE             MSG SIZE     PER MSG TIME(us)             BW(MB/s)         OVERHEAD(us)
slurmstepd: *** JOB 8368997 ON c457-703 CANCELLED AT 2017-03-19T15:39:30 DUE TO TIME LIMIT ***
make[2]: make[1]: *** [test] Terminated*** [test] Terminated

make: *** [test] Terminatedmake[3]: 
*** [test] Terminated
fatal> error code 1 during remote> ./instead_test.sh charm/verbs-linux-x86_64/tmp make  test ++timeout 180 +isomalloc_sync
Returned from executing scripts/verbs-linux-x86_64/test on remote host
fatal> Test on remote host failed with fatal error (0)
Bad: Test on remote host failed with fatal error (0)

==> old.2017_03_20__01_00/verbs-linux-x86_64.txt <==
Charm++ 1D array communication with allocation timing enabled

 PE             MSG SIZE     PER MSG TIME(us)             BW(MB/s)         OVERHEAD(us)
slurmstepd: *** JOB 8370620 ON c461-402 CANCELLED AT 2017-03-20T18:12:32 DUE TO TIME LIMIT ***
make[2]: make[1]: *** [test] Terminated*** [test] Terminated

make: *** [test] Terminated
make[3]: *** [test] Terminated
fatal> error code 1 during remote> ./instead_test.sh charm/verbs-linux-x86_64/tmp make  test ++timeout 180 +isomalloc_sync
Returned from executing scripts/verbs-linux-x86_64/test on remote host
fatal> Test on remote host failed with fatal error (0)
Bad: Test on remote host failed with fatal error (0)

Related issues

Related to Charm++ - Bug #664: charm++/communication_overhead test fails with randomized queues New 02/11/2015

History

#1 Updated by Phil Miller over 1 year ago

  • Tags changed from verbs to verbs, hang

#2 Updated by Phil Miller over 1 year ago

  • Tags changed from verbs, hang to hang, verbs

Tried the same test back on 6.7.1. It also hangs, but somewhat later - in the 1D array cases, instead of the group cases.

Might there be a memory leak at issue here, possibly screwing with the registration pool?

#3 Updated by Phil Miller over 1 year ago

Observed what seems like a memory leak in operationFinished, in that the message it receives is sometimes not deleted. Fixing that doesn't make the hang go away, though.
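
To illustrate the kind of leak described above, here is a minimal sketch in plain C++, not the actual test code. It assumes the handler owns the message it receives and must free it on every path. `SimpleMessage` here is a hypothetical stand-in with a live-object counter, and `operationFinishedSketch` is an illustrative name rather than the real entry method.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the test's message type; not the real Charm++ class.
struct SimpleMessage {
    static int live;   // number of messages currently allocated
    std::size_t size;
    explicit SimpleMessage(std::size_t s) : size(s) { ++live; }
    ~SimpleMessage() { --live; }
};
int SimpleMessage::live = 0;

// Sketch of a handler that takes ownership of the message on entry.
// Returns true when the message has the expected size.
bool operationFinishedSketch(SimpleMessage* raw, std::size_t expectedSize) {
    assert(raw != nullptr);
    std::unique_ptr<SimpleMessage> msg(raw);  // freed on every return path
    if (msg->size != expectedSize) {
        return false;  // early exit: the message is still deleted
    }
    // ... timing bookkeeping for this message size would go here ...
    return true;
}
```

With ownership taken at the top of the handler, an early return cannot skip the delete. As noted above, plugging the leak alone did not make the hang go away.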

#4 Updated by Phil Miller over 1 year ago

  • Related to Bug #664: charm++/communication_overhead test fails with randomized queues added

#5 Updated by Phil Miller over 1 year ago

  • Tags changed from verbs, hang to verbs, hang, message-race
  • Category changed from Machine Layers to Build & Test Automation
  • Target version changed from 6.8.0-beta1 to 6.8.0
  • Priority changed from Urgent to Normal
  • Status changed from New to Implemented

The test itself is indeed broken. A commit disabling it has been pushed for now.

#6 Updated by Eric Bohm over 1 year ago

  • Assignee set to Eric Bohm

#7 Updated by Sam White over 1 year ago

  • Target version changed from 6.8.0 to 6.8.1

This has been worked around for now.

#8 Updated by Phil Miller about 1 year ago

  • Target version changed from 6.8.1 to 6.9.0

#9 Updated by Eric Bohm 11 months ago

  • Target version changed from 6.9.0 to 6.9.1

#10 Updated by Eric Bohm 9 months ago

The current issue is not a hang; the test crashes.

PE             MSG SIZE     PER MSG TIME(us)             BW(MB/s)         OVERHEAD(us)
[1] 16 3.634 4.199 0.496
Expected message of size 16, got message of size 32
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Out of order message
[0] Stack Traceback:
[0:0] CmiAbortHelper+0x65 [0x5a56a5]
[0:1] [0x5a56fb]
[0:2] _ZN26CkIndex_CommunicationGroup37_call_operationFinished_SimpleMessageEPvS0_+0x42 [0x49cf32]
[0:3] CkDeliverMessageFree+0x3a [0x4afa2a]
[0:4] _Z15_processHandlerPvP11CkCoreState+0x369 [0x4b7f39]
[0:5] CsdScheduleForever+0x70 [0x5af4b0]
[0:6] CsdScheduler+0x2d [0x5af7fd]
[0:7] [0x5abb22]
[0:8] ConverseInit+0x307 [0x5ad2d7]
[0:9] main+0x27 [0x495027]
[0:10] __libc_start_main+0xfd [0x2b96c185ed1d]
[0:11] [0x4955ad]
Fatal error on PE 0> Out of order message
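
The abort above fires when a 32-byte message arrives while the handler still expects 16-byte ones, which suggests the test assumes in-order delivery. One possible way to make such a check order-tolerant is to count arrivals per message size instead of comparing against a single expected size. The following is a minimal sketch in plain C++ under that assumption; `SizeTally` and its methods are hypothetical names and are not part of the Charm++ test.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical tally keyed by message size; not part of the Charm++ test.
class SizeTally {
public:
    explicit SizeTally(int perSize) : perSize_(perSize) { assert(perSize > 0); }

    // Record one arrival of the given size.
    // Returns true when this arrival completes the batch for that size.
    bool record(std::size_t size) {
        int& count = counts_[size];
        ++count;
        return count == perSize_;
    }

    // Number of messages of this size seen so far.
    int seen(std::size_t size) const {
        auto it = counts_.find(size);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    int perSize_;                         // messages expected per size
    std::map<std::size_t, int> counts_;   // arrivals seen, keyed by size
};
```

A receiver built this way would accept a 32-byte message that overtakes the last 16-byte one and simply complete each batch when its count is reached. Whether the benchmark's timing would still be meaningful under such reordering is a separate question that this sketch does not address.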
