Bug #833

mpi smp build is locked to one core per node by default

Added by Thomas Quinn about 3 years ago. Updated 4 months ago.

Status: Merged
Priority: High
Assignee:
Category: Machine Layers
Target version:
Start date: 09/11/2015
Due date:
% Done: 0%

Description

Repeat by:
build ChaNGa mpi-linux-x86_64 smp ...
Compile ChaNGa.
Run a test on (e.g.) Stampede with "ibrun ChaNGa ++ppn 15 test.param".
Note that it is an order of magnitude slower than it should be; also, logging in to a compute node shows only one core running while the load is 16.
Adding "+cpuaffinity +commap 0 +pemap 1-15" gets it running at full speed, but Charm++ should do something sensible when these options are not included.
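
For reference, the reported workaround as a single launch line (a sketch only; it just combines the flags quoted above with the original ibrun command, and the exact maps depend on the node's core count):

ibrun ChaNGa ++ppn 15 +cpuaffinity +commap 0 +pemap 1-15 test.param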

I also suspect this is a problem for the GNI builds.


Related issues

Related to Charm++ - Feature #1173: Automatic process launching, thread spawning, and hardware binding (In Progress, 08/23/2016)

History

#1 Updated by Eric Bohm about 3 years ago

  • Assignee set to Eric Bohm

#2 Updated by Jim Phillips about 3 years ago

  • Assignee deleted (Eric Bohm)

Does adding this to the job script fix the problem?
export OMP_NUM_THREADS=16

Cray's aprun does have similar issues.

It would be good for the smp/multicore builds to detect at startup that the cpu mask is limited to a single core (or insufficient cores) and treat this as a fatal error if cpu affinity options are not set.
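
A minimal sketch of the kind of startup check suggested here, assuming Linux's sched_getaffinity interface (the function name and error message are made up for illustration; this is not the actual Charm++ code):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Abort if the CPU mask inherited from the launcher has fewer cores than
   the number of worker threads this process is about to spawn. */
static void check_inherited_mask(int threads_per_process)
{
  cpu_set_t mask;
  if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
    return;  /* no affinity information available; skip the check */
  if (CPU_COUNT(&mask) < threads_per_process) {
    fprintf(stderr, "Fatal: CPU mask has %d core(s) for %d threads; "
            "pass +pemap/+commap or fix the launcher binding.\n",
            CPU_COUNT(&mask), threads_per_process);
    exit(1);
  }
}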

NAMD also refuses to run smp builds with ppn 1.

#3 Updated by Jim Phillips about 3 years ago

  • Assignee set to Eric Bohm

#4 Updated by Nikhil Jain almost 3 years ago

  • Assignee changed from Eric Bohm to Juan Galvez

#5 Updated by Phil Miller almost 3 years ago

Basically, look at whatever code handles +pemap currently. In there, if no +pemap argument was provided, have each PE examine its CPU affinity mask. If multiple PEs turn out to be bound to the same single CPU, we should detect that and at least print a warning, if not error out entirely.
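
A hedged sketch of that per-PE check (the function name is an assumption, not the actual patch): gather each PE's affinity mask and flag the case where two PEs are each bound to exactly the same single CPU.

#define _GNU_SOURCE
#include <sched.h>

/* Returns 1 if at least two PEs are each pinned to the same single CPU. */
static int pes_share_a_single_cpu(const cpu_set_t *pe_masks, int num_pes)
{
  for (int i = 0; i < num_pes; i++) {
    if (CPU_COUNT(&pe_masks[i]) != 1) continue;  /* only consider pinned PEs */
    for (int j = i + 1; j < num_pes; j++) {
      if (CPU_COUNT(&pe_masks[j]) == 1 &&
          CPU_EQUAL(&pe_masks[i], &pe_masks[j]))
        return 1;                                /* two PEs on one CPU */
    }
  }
  return 0;
}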

If that case happens even with +pemap, we may still want to warn about potentially inadvertent oversubscription and recommend adjusting the options or, if it's intentional, passing +CmiSleepOnIdle so that the PEs won't interfere as badly.

#6 Updated by Juan Galvez almost 3 years ago

  • Status changed from New to In Progress

#7 Updated by Juan Galvez almost 3 years ago

I have implemented a solution for this issue.

I added a function called checkAffinity in cpuaffinity.c that checks whether multiple PEs are assigned to the same core (currently only on physical node 0). This check is performed whether or not affinity options have been specified. If affinity options were neither set by the user nor calculated by Charm++, the function aborts when it detects oversubscription; otherwise it only prints a warning.

Note that checkAffinity requires cputopology to be enabled, so it's necessary to change ck-core/init.C to initialize cputopology before cpuaffinity.

Currently the check is very simple and will not detect oversubscription in many cases, but it probably will in all the instances we are interested in. The check consists of OR-ing together the CPU sets of the PEs on the physical node and verifying that the size of the resulting set is greater than or equal to the number of PEs. I also implemented an algorithm which detects most (all?) oversubscription cases, but it's C++ and the cpuaffinity module is C. The algorithm is simple, but I feel implementing it in C would make it unnecessarily complex (I'm using std::vector to dynamically store PE affinities in a specific way). cputopology.C is C++, so I'm not sure whether we can also make cpuaffinity.c C++, whether it's worth it, or whether we should just stick to the simple check.
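
A minimal sketch of the OR-based check just described (the function name is an assumption; this is not the actual checkAffinity code). By the pigeonhole principle, if the union of the PEs' masks covers fewer cores than there are PEs, some core must host more than one PE; the converse does not hold, which is why the simple check misses some oversubscription cases.

#define _GNU_SOURCE
#include <sched.h>

/* Returns 1 (oversubscription detected) if the union of the PEs' CPU sets
   on this physical node is smaller than the number of PEs. */
static int union_too_small(const cpu_set_t *pe_masks, int num_pes)
{
  cpu_set_t u;
  CPU_ZERO(&u);
  for (int i = 0; i < num_pes; i++)
    CPU_OR(&u, &u, &pe_masks[i]);
  return CPU_COUNT(&u) < num_pes;
}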

I tested this by running the jacobi3d test in SMP mode across multiple lab machines and on Blue Waters (4 physical nodes). I haven't encountered any issues.

On Blue Waters I ran cases where the system assigns all threads to core 0, and the check detects it accordingly. I also tested specifying good and bad pemaps, and the code acts accordingly in both cases.

#8 Updated by Phil Miller almost 3 years ago

Porting cpuaffinity from C to C++ should be totally fine, and is well justified if it simplifies implementation of desired features.

On the other hand, getting this check in a small, simple patch would make porting it into a 6.7.1 patch release much easier.

#9 Updated by Juan Galvez over 2 years ago

  • Status changed from In Progress to Implemented

#10 Updated by Sam White about 2 years ago

  • Target version set to 6.8.0

Bump: this has been stuck in gerrit since March

#11 Updated by Juan Galvez about 2 years ago

I sort of agreed with Eric to wait until the hwloc changes are finalized, since this is going to bring changes to cputopology and cpuaffinity. I will merge my patch with those additions if appropriate.

#12 Updated by Phil Miller about 2 years ago

  • Category set to Machine Layers

#13 Updated by Sam White over 1 year ago

Since we won't have hwloc integration for the release, I think it would be good to get this into 6.8.0 if it is ready.

#14 Updated by Sam White over 1 year ago

Eric Bohm will test this on a machine that is known to have bad default mapping.

#15 Updated by Phil Miller over 1 year ago

We're not immediately going to worry about the case where the user gives a bad pemap/commap argument set.

#16 Updated by Phil Miller over 1 year ago

  • Priority changed from Normal to High

#17 Updated by Eric Bohm over 1 year ago

  • Assignee changed from Juan Galvez to Eric Bohm

#18 Updated by Michael Robson over 1 year ago

  • Tags set to namd, changa, openatom

#19 Updated by Eric Bohm over 1 year ago

  • Assignee changed from Eric Bohm to Juan Galvez

#20 Updated by Juan Galvez over 1 year ago

Code hangs on Blue Waters with multiple processes per node.

The problem is that the first process on the node doesn't receive any messages from the other processes on the node. For some reason, having the comm threads in a loop calling `CmiNetworkProgress` doesn't work on CRAYXE. Instead, for messages to actually be sent, the comm threads need to be allowed to go through Charm++ init.

But the problem then is achieving synchronization between threads: several parts of Charm++ init call CmiNodeAllBarrier, and for my code to work the comm threads must not encounter any more such barriers, because if the comm threads stop, my code hangs.

From what I have seen in cpuaffinity and cputopology, the code avoids using Converse communication for CRAYXE, so I am not sure how to fix this right now.

There don't seem to be any more CmiNodeAllBarrier calls after cpuaffinity initialization, only this block:

#if CMK_USE_PXSHM && ( CMK_CRAYXE || CMK_CRAYXC ) && CMK_SMP
    // for SMP on Cray XE6 (hopper) it seems pxshm has to be initialized
    // again after cpuaffinity is done
    if (CkMyRank() == 0) {
      CmiInitPxshm(argv);
    }
    CmiNodeAllBarrier();
#endif

which isn't enabled by default. One option, for the Cray-specific code, could be to let the comm threads exit cpuaffinity and assume they will finish Charm++ init without blocking, but that assumption might not hold in the future.

#21 Updated by Sam White over 1 year ago

  • Status changed from Implemented to In Progress

#22 Updated by Juan Galvez over 1 year ago

  • Status changed from In Progress to Implemented

Posted a new patch in gerrit.

It should work on all architectures, including Cray, because it does not rely on the NetworkProgress call.

There was also a very rare race condition on netlrts-linux that caused a segfault because Converse attempted to process a message whose handler had not been registered yet. I solved it for netlrts and made a change that should make it very unlikely to happen on other architectures (I haven't analyzed the code to determine whether it can still happen).

Have tested on Blue Waters and netlrts-linux with multiple processes per physical node and it's working.

#23 Updated by Sam White over 1 year ago

  • Status changed from Implemented to Merged

#24 Updated by Evan Ramos 4 months ago

  • Tags changed from changa, namd, openatom to changa, namd, openatom, provisioning
