Bug #571

pxshm shared queue lockless implementation is invalid

Added by Eric Bohm almost 5 years ago. Updated over 1 year ago.

Target version:
Start date:
Due date:
% Done:



In studying the issue of the auto-detection of +nodesize, I found several bugs in the existing implementation.

The correct implementations in pxshm use locking, which requires thread infrastructure from the -pthread set of libraries. Works ok for the SMP context where multiple reader/writer situations arise. It is theoretically possible to do a lockless version of that, but that is outside the scope of the immediate problem.

The non-smp (PXHSM only) build defaults to using memory fences and is erroneous. This can be reproduced in megatest in with --with-production --enable-error-checking build of megatest with TESTOPTS="+nodesize 4", which will abort. "ptr - recvBuf->data == recvBuf->header->bytes" failed in file machine-pxshm.c Other choices of +nodesize may result in hangs.

The code as implemented tries to use volatile variables and memory fences as though they were locks. Neither of those is equivalent to a critical section lock on its own, or in combination, and therefore the implementation is riddled with race condiiions. It needs to be rewritten. I am working on a scheme that uses atomic operations.

Related issues

Related to Charm++ - Feature #420: PXSHM: use external information instead of +nodesize runtime argument Merged 02/17/2014
Related to Charm++ - Bug #717: pxshm and smp build options don't work together Merged 04/21/2015


#1 Updated by Eric Bohm over 4 years ago

  • Status changed from New to Upstream

Best fix would use C11 atomics, or conversion of machine layer infrastructure to C++ and use C++11 atomics. Need compiler ecosystem to catch up before the former could be applied, and the latter would be a massive undertaking.

#2 Updated by Phil Miller about 4 years ago

  • Priority changed from Normal to Low

#3 Updated by Phil Miller almost 4 years ago

  • Target version changed from 6.7.0 to 6.8.0

#4 Updated by Sam White over 2 years ago

  • Status changed from Upstream to New

Note that the compiler ecosystem has basically all caught up, and so C11 atomics are being used in the new lockless PCQueue:

#5 Updated by Sam White over 2 years ago

  • Category changed from Machine Layers to SMP
  • Priority changed from Low to Normal
  • Target version changed from 6.8.0 to 6.8.1

pxshm is becoming slightly more important with wider nodes, so bumping up the priority. Also putting under SMP because this issue is more related to SMP ones than other machine layer ones.

Perhaps we should re-assign this if it is not a priority for you?

#6 Updated by Eric Bohm almost 2 years ago

  • Target version changed from 6.8.1 to 6.9.0

#7 Updated by Sam White over 1 year ago

  • Target version deleted (6.9.0)

We are looking to use CMA rather than pxshm where possible, possibly obviating the need for this.

Also available in: Atom PDF