Bug #802

CUDA examples broken on Blue Waters

Added by Michael Robson almost 4 years ago. Updated over 1 year ago.

Status: In Progress
Priority: Normal
Category: GPU Support
Target version: -
Start date: 08/11/2015
Due date: -
% Done: 80%
Spent time: -
Description

From Eric Bohm:

Fatal CUDA Error at cuda-hybrid-api.cu:450.
Return value 46 from 'cudaStreamCreate(&kernel_stream)'.  Exiting.

This is on Blue Waters in non-SMP mode. It looks like it happens when there is more than one PE per node, which is the expected case in non-SMP mode.

NOTE: I think the node-level manager we're working on will fix this, but I wanted to go ahead and document it outside of ChaNGa.

History

#1 Updated by Jim Phillips almost 4 years ago

I'm guessing that the GPU is in process-exclusive mode.
The real bug is that the error message prints the raw error code rather than the equivalent error string, and that it omits the node and device number.
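
For illustration, here is a minimal sketch of that kind of error reporting using the standard CUDA runtime API. The CUDA_CHECK macro and the message format are hypothetical, not the actual patch; a real fix inside the hybrid API would also print the node number from the Charm++ runtime.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

/* Hypothetical helper: report the error string, the failing call, and the
 * device, instead of only the raw numeric error code. */
#define CUDA_CHECK(call)                                                      \
  do {                                                                        \
    cudaError_t err = (call);                                                 \
    if (err != cudaSuccess) {                                                 \
      int dev = -1;                                                           \
      cudaGetDevice(&dev);                                                    \
      fprintf(stderr,                                                         \
              "Fatal CUDA error at %s:%d on device %d: %s (%d) from '%s'\n",  \
              __FILE__, __LINE__, dev, cudaGetErrorString(err), (int)err,     \
              #call);                                                         \
      exit(EXIT_FAILURE);                                                     \
    }                                                                         \
  } while (0)

int main() {
  cudaStream_t kernel_stream;
  /* On a busy or exclusive device this would print the readable string
   * "all CUDA-capable devices are busy or unavailable" instead of "46". */
  CUDA_CHECK(cudaStreamCreate(&kernel_stream));
  CUDA_CHECK(cudaStreamDestroy(kernel_stream));
  return 0;
}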

#2 Updated by Jim Phillips almost 4 years ago

As a follow-up, you never want multiple processes using the same GPU unless you are using the CUDA Multi-Process Service (MPS), which aggregates the API calls so that kernels from different processes can overlap.

#3 Updated by Michael Robson over 3 years ago

When running the same hello program on Blue Waters using the nodeGPU patch, we also get an error, albeit a different one:

mprobson@nid25349:~/charm/gni-crayxe-cuda-nodeGPU-g/examples/charm++/cuda/hello> make test
./charmrun  ./hello +p4 10 

Running on 4 processors:  ./hello 10
aprun -n 4 -d 2 ./hello 10
Init HYBRID API [0]
Init HYBRID API [0]
Fatal CUDA Error all CUDA-capable devices are busy or unavailable at cuda-hybrid-api.cu:913.
 Return value 46.Exiting on cudaMallocHost((void **)&hd, (sizeof(Header)+bufSize)) .
_pmiu_daemon(SIGCHLD): [NID 25200] [c5-2c0s7n0] [Mon Feb 15 14:13:29 2016] PE RANK 2 exit signal Aborted
[NID 25200] 2016-02-15 14:13:29 Apid 33630471: initiated application termination
Application 33630471 exit codes: 134
Application 33630471 resources: utime ~0s, stime ~1s, Rss ~6240, inblocks ~8653, outblocks ~19125
make: *** [test] Error 134

I'm currently working on improving the error reporting to help debug this.
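
As a related debugging aid, the compute mode of a device can be queried from the CUDA runtime. A minimal stand-alone sketch (not part of the patch) might look like this:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int dev = 0;
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
    fprintf(stderr, "failed to query device %d\n", dev);
    return 1;
  }
  /* computeMode distinguishes Default (shareable between processes),
   * Exclusive-Process, and Prohibited. */
  const char *mode =
      prop.computeMode == cudaComputeModeDefault          ? "Default" :
      prop.computeMode == cudaComputeModeExclusiveProcess ? "Exclusive-Process" :
      prop.computeMode == cudaComputeModeProhibited       ? "Prohibited" : "Other";
  printf("device %d (%s) compute mode: %s\n", dev, prop.name, mode);
  return 0;
}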

#4 Updated by Michael Robson over 3 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 80
  • Subject changed from CUDA Hybrid API Broken in non-SMP mode (Blue Waters) to CUDA examples broken on Blue Waters

As a result of Debug Day:
  1. I have an improved error-checking patch that I'm going to clean up and submit; I'll post the link here when I do.
  2. With this patch I was able to determine that nodeGPU hits the same error.
  3. This means that Jim was (probably) right about the machine being in process-exclusive mode by default.
     • I checked, and BW uses the default compute mode, so multiple processes should be able to access the device.
     • However, that doesn't actually seem to be the case.
     • Fortunately, you can fix this with export CRAY_CUDA_MPS=1 (see the job-script sketch after this list).
     • See this link: https://bluewaters.ncsa.illinois.edu/accelerator-jobs-on-xk-nodes
     • Enabling CRAY_CUDA_MPS fixes the crash in both builds.
  4. My plan is to add this to the manual.
  5. I also plan to try this on Titan and see if there is a similar configuration issue.
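
For reference, a minimal sketch of where the MPS workaround would go in a Blue Waters batch script. Only the export CRAY_CUDA_MPS=1 line and the aprun invocation come from this thread and the linked page; the PBS resource line and paths are placeholders:

#!/bin/bash
#PBS -l nodes=1:ppn=16:xk        # placeholder XK node request
export CRAY_CUDA_MPS=1           # enable the CUDA proxy so multiple PEs can share the GPU
cd $PBS_O_WORKDIR
aprun -n 4 -d 2 ./hello 10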

#5 Updated by Sam White over 2 years ago

  • Category set to GPU Support
  • Target version set to 6.8.1

Is this issue up to date or did things get merged for it?

#6 Updated by Eric Bohm almost 2 years ago

  • Target version changed from 6.8.1 to 6.9.0

#7 Updated by Sam White over 1 year ago

Bump, what is the status of this issue?

#8 Updated by Sam White over 1 year ago

  • Target version deleted (6.9.0)

Not a release blocker
