Bug #1988

Exit hang on Summit with pami

Added by Bilge Acun 2 months ago. Updated about 2 months ago.

Status: Merged
Priority: Normal
Assignee:
Category: Machine Layers
Target version:
Start date: 10/09/2018
Due date:
% Done: 0%


Description

NAMD occasionally hangs at exit when using the pami and pamilrts machine layers on Summit because of a race condition. The key code where the race occurs is shown below: rank 0 can call exit(0) before all the other threads have called pthread_exit(0). On Blue Gene/Q the problem appears to have been worked around with a hack that adds a sleep on rank 0 before exit.

charm-v6.8.2/src/arch/pami/machine.c line 1038:
CmiNodeBarrier(); //wait for all worker threads
if (rank0) {
#if CMK_BLUEGENEQ
   Delay(100000);
#endif
   exit(0);
}
else
   pthread_exit(0);

The sleep is not there for non-BGQ architectures. Adding usleep(100000) resolved the hang for me on Summit. A simple counter-based exit scheme could be implemented instead.
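
A counter-based exit scheme of the kind suggested above could look roughly like the following. This is only a minimal standalone sketch using plain pthreads and C11 atomics, not the change that was actually made in the machine layer. The names exited_workers, counted_exit_path and run_exit_demo are hypothetical and do not exist in Charm++. Each non-zero rank bumps a shared counter just before calling pthread_exit, and rank 0 waits until the counter reaches nthreads - 1 before going on to the point where it would call exit(0).

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical names throughout; none of these exist in Charm++. */
static atomic_int exited_workers;   /* ranks != 0 that reached the exit path */

typedef struct {
    int rank;
    int nthreads;
} exit_arg_t;

/* Exit path for one thread, meant to run after the node barrier.
 * Non-zero ranks never return from this function.
 * Rank 0 returns only after every other rank has checked in. */
static void counted_exit_path(int rank, int nthreads)
{
    if (rank != 0) {
        atomic_fetch_add_explicit(&exited_workers, 1, memory_order_release);
        pthread_exit(NULL);
    }
    while (atomic_load_explicit(&exited_workers, memory_order_acquire)
           < nthreads - 1) {
        sched_yield();              /* let the remaining workers run */
    }
    /* In the machine layer, rank 0 would call exit(0) at this point. */
}

static void *worker(void *p)
{
    exit_arg_t *arg = p;
    counted_exit_path(arg->rank, arg->nthreads);
    return NULL;                    /* not reached for non-zero ranks */
}

/* Demo driver: the calling thread plays rank 0. Returns the number of
 * workers that had checked in when rank 0 was released. */
int run_exit_demo(int nthreads)
{
    assert(nthreads >= 1);
    pthread_t *tids = malloc(sizeof *tids * (size_t)nthreads);
    exit_arg_t *args = malloc(sizeof *args * (size_t)nthreads);
    if (tids == NULL || args == NULL)
        abort();

    atomic_store(&exited_workers, 0);
    for (int r = 1; r < nthreads; r++) {
        args[r].rank = r;
        args[r].nthreads = nthreads;
        if (pthread_create(&tids[r], NULL, worker, &args[r]) != 0)
            abort();
    }

    counted_exit_path(0, nthreads);
    int seen = atomic_load(&exited_workers);

    /* Joining is only for the demo; the real code would have exited. */
    for (int r = 1; r < nthreads; r++)
        pthread_join(tids[r], NULL);
    free(tids);
    free(args);
    return seen;
}
```

Calling run_exit_demo(8) starts seven workers and returns 7 once rank 0 has been released. Note that a worker increments the counter slightly before it actually calls pthread_exit, so this narrows the window rather than provably closing it; whether that is sufficient on Summit is an assumption that would need testing.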

History

#1 Updated by Sam White 2 months ago

  • Category set to Machine Layers
  • Target version set to 6.9.0
  • Assignee set to Nitin Bhat

#2 Updated by Nitin Bhat 2 months ago

I'm currently not able to run jobs on Summit as their batch queue is inactive because of acceptance testing. For now, I'll try to reproduce it on BGQ by removing the Delay(100000).

Bilge, what are the conditions for reproducing the hang? Do you see the hang with other smaller applications? How many nodes/threads did you run it with?

#3 Updated by Nitin Bhat 2 months ago

  • Status changed from New to Implemented

Gerrit: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/4681/

I was not able to reproduce the bug on BG/Q with smaller examples or tests. And for some reason, I couldn't compile NAMD successfully on BG/Q.
Bilge, could you verify if the fix is working?

#4 Updated by Bilge Acun about 2 months ago

I did about 200 NAMD runs and there were no hangs with the fix. It used to hang 1 in 20 runs when a large number of threads (+ppn42) was used.

#5 Updated by Sam White about 2 months ago

  • Status changed from Implemented to Merged
