Project

General

Profile

Cleanup #536

Cleanup #535: Errors reported by ThreadSanitizer

Data Races in SMP PCQueue

Added by Phil Miller almost 5 years ago. Updated about 1 year ago.

Status:
Merged
Priority:
High
Assignee:
Category:
SMP
Target version:
Start date:
07/28/2014
Due date:
% Done:

0%


Description

// PCQueuePush vs PCQueueEmpty - lots more like this, with PCQueuePush vs PCQueuePop as well
SUMMARY: ThreadSanitizer: data race ??:0 __tsan_atomic32_fetch_add
SUMMARY: ThreadSanitizer: data race ./pcqueue.h:226 PCQueuePop
SUMMARY: ThreadSanitizer: data race ./pcqueue.h:311 PCQueuePush

This is the most core bit of the SMP messaging functionality, and probably the most sensitive to compilers messing us up. We need versions that just lock, and that use atomic intrinsics appropriately where they're available.

(Detailed stack traces to follow)


Related issues

Related to Charm++ - Cleanup #621: Convert Converse from C to C++ Merged 11/11/2014
Related to Charm++ - Feature #541: SMP mesage passing must enforce memory ordering consistency Merged 07/28/2014

History

#1 Updated by Yanhua Sun almost 5 years ago

In push, code is protected by lock for smp case. In pop case, lock is not needed since pcqueue is only for one consumer. I do not see the race condition in our use case.

#2 Updated by Yanhua Sun over 4 years ago

Phil and I looked together and moved the memory fence to the places where we think correct. However, thread sanitizer still complains no synchronization. We also tested to use lock (PCQUEUE_LOCK), it also complains about data race due to some unprotected variables. This requires more careful work than we expect.

#3 Updated by Yanhua Sun over 4 years ago

  • Status changed from New to In Progress

Discussed with Phil about memory fence again. Only moving the fence to right order (write fence after data[push] write in push function, read fence before read data) does not solve the problem. Q->len still has the race. We tried to make Q->len as atomic and add __ATOMIC_SEQ_CST as full barrier for Q->len removes all the complaints.

Phil will check that code in the git. This only works with C11.

#4 Updated by Phil Miller over 4 years ago

  • Status changed from In Progress to Implemented

http://charm.cs.uiuc.edu/gerrit/#/c/418/ depends on C11 support. In GCC, that appears in 4.9, which is available essentially nowhere yet. Commercial compilers are likely even further behind. If we can convert the relevant code to C++, the standardized atomics there seem to have been implemented earlier.

#5 Updated by Phil Miller over 4 years ago

  • Target version changed from 6.6.1 to 6.7.0

#6 Updated by Phil Miller about 4 years ago

  • Assignee changed from Yanhua Sun to Phil Miller
  • Priority changed from Normal to High
  • Status changed from Implemented to In Progress

It turns out that fixing this is rather more important than we realized, because ThreadSanitizer freaks out and fails if it detects too many races. This is the only case in the runtime that occurs repeatedly during runtime, rather than in small constant numbers at startup and shutdown.

I don't think we can just wait on C11 _Atomic becoming available with adoption of GCC 4.9, and later in commercial compilers. GCC introduced ThreadSanitizer in 4.8, which is much more widely installed (Cray & BGQ, stable releases of many Linux distros). Clang introduced it in 3.2.

#7 Updated by Nikhil Jain over 3 years ago

  • Target version changed from 6.7.0 to 6.8.0

#8 Updated by Sam White over 2 years ago

  • Category set to SMP

Does the new lockless queue fix this?

#9 Updated by Phil Miller about 2 years ago

  • Target version changed from 6.8.0 to 6.8.1

#10 Updated by Phil Miller almost 2 years ago

  • Target version changed from 6.8.1 to 6.9.0

#11 Updated by Eric Bohm over 1 year ago

  • Related to Feature #541: SMP mesage passing must enforce memory ordering consistency added

#12 Updated by Phil Miller over 1 year ago

  • Assignee changed from Phil Miller to Nitin Bhat

#13 Updated by Nitin Bhat over 1 year ago

  • Status changed from In Progress to New

#14 Updated by Phil Miller over 1 year ago

The work in the linked Gerrit change should be directly translatable from C to C++, if the machine layer is actually going through the switch-over. Hopefully, there wouldn't be a whole lot of additional work involved here.

#15 Updated by Evan Ramos over 1 year ago

  • Status changed from New to In Progress

#16 Updated by Evan Ramos over 1 year ago

  • Assignee changed from Nitin Bhat to Evan Ramos
  • Status changed from In Progress to Implemented

Implemented in patch series concluding with: https://charm.cs.illinois.edu/gerrit/418

Merge will need to wait for other pending changes to files renamed in the series. I am also having trouble using TSan to validate that the data races are in fact resolved.

#17 Updated by Sam White about 1 year ago

  • Status changed from Implemented to Merged

Also available in: Atom PDF