Bug #1585

CmiCheckAffinity fails on large core counts with SMP

Added by Thomas Quinn about 2 years ago. Updated about 2 years ago.

Status: Merged
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 06/04/2017
Due date:
% Done: 0%


Description

Running ChaNGa on Blue Waters with 1024 nodes and affinity options of:

+ppn 15 +setcpuaffinity +pemap 1-15,17-31 +commap 0,16

I get:
ATP Stack walkback for Rank 649 starting:
start_thread@pthread_create.c:301
call_startfn@0x202c4bdd
ConverseRunPE@0x202c4705
_initCharm(int, char**)@0x201d3cc4
:650
:379
ATP Stack walkback for Rank 649 done
Process died with signal 11: 'Segmentation fault'

charm commit cd4d6f80b970ae702874a14404189f042e83478a

Going in with the debugger, I see:
#0 0x00000000202da1cc in LrtsNodeOf (pe=31369) at cputopology.C:379
#1 0x00000000202da27d in CmiPhysicalNodeID (pe=<optimized out>)
at cputopology.C:629
#2 0x00000000202d9359 in CmiCheckAffinity () at cpuaffinity.c:650
#3 0x00000000201d3cc5 in _initCharm(int, char**) ()

Line 650 of cpuaffinity.c: } else if ((CmiPhysicalNodeID(CmiMyPe()) == 0) && (CmiMyPe() < CmiNumPes())) {

CmiMyPe() returns 31369. This is larger than the number of worker PEs (30*1024 = 30720) but smaller than the total number of PEs (32*1024 = 32768). It appears that cpuTopo.nodeIDs[], which LrtsNodeOf() indexes, has only been initialized for the worker PEs.
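
The arithmetic above can be checked with a small standalone sketch. It does not use Charm++ at all; the helper names are hypothetical, and the per-node figures (30 workers, 32 threads) are the ones quoted in the report. The last helper assumes that the extra threads are numbered directly after the worker PEs, which the report does not state; under that assumption the offset for 31369 comes out as 649, the same number as the failing rank in the walkback.

```c
/* Figures taken from the report: 1024 nodes, 30 worker PEs per node,
   32 threads per node in total. */
enum { NODES = 1024, WORKERS_PER_NODE = 30, THREADS_PER_NODE = 32 };

/* Hypothetical helpers, not part of the Charm++ API. */
static int num_worker_pes(void)  { return NODES * WORKERS_PER_NODE; }
static int num_all_threads(void) { return NODES * THREADS_PER_NODE; }

/* Nonzero if pe lies past the worker range but inside the total thread
   count, i.e. in the range the report says is not covered by nodeIDs[]. */
static int is_past_worker_range(int pe) {
    return pe >= num_worker_pes() && pe < num_all_threads();
}

/* ASSUMPTION: non-worker threads are numbered right after the workers,
   so this offset identifies which one it is. */
static int offset_past_workers(int pe) {
    return pe - num_worker_pes();
}
```

Calling is_past_worker_range(31369) returns nonzero, so an array sized for num_worker_pes() entries would be read out of bounds at that index.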

History

#1 Updated by Sam White about 2 years ago

  • Target version set to 6.8.0
  • Priority changed from High to Urgent
  • Assignee set to Juan Galvez

#2 Updated by Juan Galvez about 2 years ago

  • Status changed from New to Implemented

#3 Updated by Juan Galvez about 2 years ago

Yep, the diagnosis is correct. CmiPhysicalNodeID can only be called for non-comm threads/PEs. I had the conditionals in the wrong order in the if statement. Fixed now.
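
Why the order of the conditionals matters can be illustrated with a standalone sketch. This is not the actual patch: physical_node_id(), node_ids[] and NUM_WORKER_PES are hypothetical stand-ins for CmiPhysicalNodeID(), cpuTopo.nodeIDs[] and CmiNumPes(), and the stand-in counts out-of-range lookups instead of crashing. Because C evaluates && left to right and stops at the first false operand, putting the range check first keeps the lookup from running for PEs outside the worker range.

```c
enum { NUM_WORKER_PES = 4 };                 /* stand-in for CmiNumPes() */

/* Stand-in for cpuTopo.nodeIDs[]: sized for worker PEs only. */
static const int node_ids[NUM_WORKER_PES] = {0, 0, 1, 1};

/* Counts lookups that would have read past the end of node_ids[]. */
static int out_of_range_lookups = 0;

/* Hypothetical stand-in for CmiPhysicalNodeID(); records the bad access
   instead of dereferencing out of bounds. */
static int physical_node_id(int pe) {
    if (pe < 0 || pe >= NUM_WORKER_PES) {
        out_of_range_lookups++;
        return -1;
    }
    return node_ids[pe];
}

/* Same operand order as line 650 in the report: lookup first. */
static int on_first_node_lookup_first(int pe) {
    return (physical_node_id(pe) == 0) && (pe < NUM_WORKER_PES);
}

/* Range check first: the lookup is skipped when pe is out of range. */
static int on_first_node_check_first(int pe) {
    return (pe < NUM_WORKER_PES) && (physical_node_id(pe) == 0);
}
```

Both variants return the same result for every PE; only the lookup-first version touches the table for an out-of-range PE, which in the real code is the access that segfaults.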

#4 Updated by Sam White about 2 years ago

  • Status changed from Implemented to Merged
