pxshm shared queue lockless implementation is invalid
In studying the issue of the auto-detection of +nodesize, I found several bugs in the existing implementation.
The correct implementations in pxshm use locking, which requires thread infrastructure from the -pthread set of libraries. Works ok for the SMP context where multiple reader/writer situations arise. It is theoretically possible to do a lockless version of that, but that is outside the scope of the immediate problem.
The non-smp (PXHSM only) build defaults to using memory fences and is erroneous. This can be reproduced in megatest in with --with-production --enable-error-checking build of megatest with TESTOPTS="+nodesize 4", which will abort. "ptr - recvBuf->data == recvBuf->header->bytes" failed in file machine-pxshm.c Other choices of +nodesize may result in hangs.
The code as implemented tries to use volatile variables and memory fences as though they were locks. Neither of those is equivalent to a critical section lock on its own, or in combination, and therefore the implementation is riddled with race condiiions. It needs to be rewritten. I am working on a scheme that uses atomic operations.
#1 Updated by Eric Bohm over 4 years ago
- Status changed from New to Upstream
Best fix would use C11 atomics, or conversion of machine layer infrastructure to C++ and use C++11 atomics. Need compiler ecosystem to catch up before the former could be applied, and the latter would be a massive undertaking.
#5 Updated by Sam White over 2 years ago
- Category changed from Machine Layers to SMP
- Priority changed from Low to Normal
- Target version changed from 6.8.0 to 6.8.1
pxshm is becoming slightly more important with wider nodes, so bumping up the priority. Also putting under SMP because this issue is more related to SMP ones than other machine layer ones.
Perhaps we should re-assign this if it is not a priority for you?