Bug #860

CUDA threads not killed at end of program in multicore builds

Added by Jim Phillips almost 4 years ago. Updated over 3 years ago.

Status: Merged
Priority: High
Assignee: -
Category: -
Target version: -
Start date: 10/20/2015
Due date: -
% Done: 0%
Spent time: -

Description

Hangs at end of NAMD CUDA run on multicore-linux64-iccstatic in CUDA library:

[Thread 0x7ffff58da700 (LWP 185513) exited]
[Partition 0][Node 0] End of program
[Thread 0x7ffff7fbc0c0 (LWP 185444) exited]
[Thread 0x7ffff7fb8700 (LWP 185511) exited]
[Thread 0x7ffff75b7700 (LWP 185512) exited]
^C
Program received signal SIGINT, Interrupt.
[Switching to Thread 0x7ffff4ed9700 (LWP 185524)]
0x000000301e0df343 in poll () from /lib64/libc.so.6
(gdb) where
#0  0x000000301e0df343 in poll () from /lib64/libc.so.6
#1  0x00007ffff6312a59 in ?? () from /usr/lib64/libcuda.so
#2  0x00007ffff5d1c6e2 in ?? () from /usr/lib64/libcuda.so
#3  0x00007ffff63130e8 in ?? () from /usr/lib64/libcuda.so
#4  0x000000301e4079d1 in start_thread () from /lib64/libpthread.so.0
#5  0x000000301e0e8b6d in clone () from /lib64/libc.so.6
(gdb) thread
[Current thread is 5 (Thread 0x7ffff4ed9700 (LWP 185524))]
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fffe216e700 (LWP 185525))]#0  0x000000301e0df343 in poll ()
   from /lib64/libc.so.6
(gdb) where
#0  0x000000301e0df343 in poll () from /lib64/libc.so.6
#1  0x00007ffff6312a59 in ?? () from /usr/lib64/libcuda.so
#2  0x00007ffff5d1c6e2 in ?? () from /usr/lib64/libcuda.so
#3  0x00007ffff63130e8 in ?? () from /usr/lib64/libcuda.so
#4  0x000000301e4079d1 in start_thread () from /lib64/libpthread.so.0
#5  0x000000301e0e8b6d in clone () from /lib64/libc.so.6

Almost certainly the same cause as Bug #847: "Tcl notifier thread not killed at end of program", which was due to the fix for "segfault at exit in multicore build".


Related issues

Related to Charm++ - Bug #847: Tcl notifier thread not killed at end of program (Rejected, 09/25/2015)

History

#1 Updated by Jim Phillips almost 4 years ago

  • Target version set to 6.7.0

#2 Updated by Jim Phillips almost 4 years ago

  • Target version deleted (6.7.0)

Workaround in NAMD to kill CUDA threads:

code:
#include <cuda_runtime.h>

/* Reset every visible device so the driver tears down its
   per-device state and service threads before the process exits. */
int ndevs = 0;
cudaGetDeviceCount(&ndevs);
for ( int dev = 0; dev < ndevs; ++dev ) {
  cudaSetDevice(dev);
  cudaDeviceReset();
}
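
The reset presumably works because cudaDeviceReset() destroys each device's primary context, leaving the driver with no per-device state to service; once its helper threads are gone, nothing remains blocked in poll() when the process shuts down.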

#3 Updated by Jim Phillips over 3 years ago

MIC offload has the same issue.

#4 Updated by Prateek Jindal over 3 years ago

  • Assignee set to Prateek Jindal

I will be pair-programming with Mike on this bug.

#5 Updated by Prateek Jindal over 3 years ago

Michael and I were able to reproduce this bug on 'talent'. The 'hello' CUDA example works fine with a 'net-linux-x86_64' build, but the same example hangs with a 'multicore-linux64' build.

#6 Updated by Jim Phillips over 3 years ago

Ping.

#7 Updated by Phil Miller over 3 years ago

  • Subject changed from CUDA threads not killed at end of program to CUDA threads not killed at end of program in multicore builds

On SMP builds, the comm thread explicitly calls exit(), so this isn't an issue.

#8 Updated by Jim Phillips over 3 years ago

  • Target version set to 6.7.0

#9 Updated by Jim Phillips over 3 years ago

Is anything happening on this?

#10 Updated by Prateek Jindal over 3 years ago

If this bug needs to be resolved before Supercomputing, it needs to be assigned to someone else. I have to fix some Argo bugs for the demo at Supercomputing, so I am unable to work on this bug until then.

#11 Updated by Michael Robson over 3 years ago

  • Assignee changed from Prateek Jindal to Michael Robson

Reassigned to me to finish ASAP.

#12 Updated by Prateek Jindal over 3 years ago

Now I am getting an error that CUDA is not supported in multicore builds. Was there any CUDA-related change to the multicore build recently?

#13 Updated by Prateek Jindal over 3 years ago

Now I remember that Mike previously told me to copy three files, conv-cuda-mach.{h,sh} and special.sh, from the net-linux-x86_64 directory. After copying these files, the multicore version builds fine.

#14 Updated by Prateek Jindal over 3 years ago

I ran the program through gdb. The CUDA threads are created in a function called initHybridAPI(). After CsdScheduleForever() returns, the main thread exits fine, but the CUDA threads keep hanging.

#15 Updated by Jim Phillips over 3 years ago

Yes, that's the issue. The problem is that CkExit calls pthread_exit() on every Charm++ thread, but libraries like CUDA launch their own service threads, which pthread_exit() leaves running. The solution is to do what net-smp does: call pthread_exit() on all but one Charm++ thread (in net-smp's case the comm thread; for multicore, probably PE 0) and then call exit() on that remaining thread alone, which kills the process cleanly. (This bug was introduced while cleaning up the old behavior of calling exit() on every thread, which caused crashes because exit() is not thread-safe.)
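
For reference, a minimal sketch of that exit discipline, using hypothetical names (pe_main, NPES) rather than the actual Charm++ shutdown code: every PE thread except PE 0 leaves via pthread_exit(), and PE 0 alone calls exit(), which terminates the whole process along with any service threads CUDA or Tcl started.

code:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NPES 4  /* hypothetical PE count for the sketch */

/* Stand-in for one PE returning from its scheduler loop at shutdown. */
static void *pe_main(void *arg) {
    long rank = (long)arg;
    printf("PE %ld leaving scheduler\n", rank);
    if (rank != 0)
        pthread_exit(NULL);  /* ends only this Charm++ worker thread */
    exit(0);                 /* PE 0 kills the entire process, taking
                                any library service threads with it */
}

int main(void) {
    pthread_t tid[NPES];
    for (long r = 0; r < NPES; ++r)
        pthread_create(&tid[r], NULL, pe_main, (void *)r);
    for (int r = 0; r < NPES; ++r)
        pthread_join(tid[r], NULL);  /* never completes: exit(0) on PE 0
                                        tears the process down first */
    return 0;
}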

#16 Updated by Phil Miller over 3 years ago

  • Description updated (diff)

#18 Updated by Jim Phillips over 3 years ago

FYI, cudaDeviceReset() does not fix the hang on exit on MacOS, so CkExit() really does need to call exit() on multicore.

#20 Updated by Phil Miller over 3 years ago

  • Status changed from New to Merged
  • Closed date set to 2015-12-03 14:59:44
  • Assignee changed from Michael Robson to Jim Phillips
