CUDA examples broken on Blue Waters
From Eric Bohm:
Fatal CUDA Error at cuda-hybrid-api.cu:450. Return value 46 from 'cudaStreamCreate(&kernel_stream)'. Exiting.
This is on Blue Waters in non-SMP mode. It looks like it happens when there is more than one PE per node, which is the expected case in non-SMP mode.
NOTE: I think the node-level manager we're working on will fix this, but I wanted to go ahead and document it outside of Changa.
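Return value 46 is cudaErrorDevicesUnavailable ("all CUDA-capable devices are busy or unavailable"), which is what a second process gets when the GPU is in an exclusive compute mode or already fully occupied. A quick way to see what mode the device is actually in is a standalone check of `computeMode` via the standard CUDA runtime API (a minimal sketch, not part of the hybrid API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int dev = 0;
  cudaDeviceProp prop;
  cudaError_t err = cudaGetDeviceProperties(&prop, dev);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
            cudaGetErrorString(err));
    return 1;
  }
  // computeMode values: cudaComputeModeDefault, cudaComputeModeExclusive,
  // cudaComputeModeProhibited, cudaComputeModeExclusiveProcess
  printf("device %d (%s) computeMode = %d\n", dev, prop.name, prop.computeMode);
  return 0;
}
```

If this prints a non-default mode while multiple PEs per node are launched, every PE after the first will hit error 46.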
#3 Updated by Michael Robson over 3 years ago
When running the same hello program on blue waters using the nodeGPU patch we also get an error, albeit a different one:
mprobson@nid25349:~/charm/gni-crayxe-cuda-nodeGPU-g/examples/charm++/cuda/hello> make test
./charmrun ./hello +p4 10
Running on 4 processors:  ./hello 10
aprun -n 4 -d 2 ./hello 10
Init HYBRID API
Init HYBRID API
Fatal CUDA Error all CUDA-capable devices are busy or unavailable at cuda-hybrid-api.cu:913. Return value 46. Exiting on cudaMallocHost((void **)&hd, (sizeof(Header)+bufSize)).
_pmiu_daemon(SIGCHLD): [NID 25200] [c5-2c0s7n0] [Mon Feb 15 14:13:29 2016] PE RANK 2 exit signal Aborted
[NID 25200] 2016-02-15 14:13:29 Apid 33630471: initiated application termination
Application 33630471 exit codes: 134
Application 33630471 resources: utime ~0s, stime ~1s, Rss ~6240, inblocks ~8653, outblocks ~19125
make: *** [test] Error 134
I'm currently working on improving error reporting to debug.
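The improved reporting can be as simple as a wrapper macro that prints the decoded error string plus the failing call, file, and line, in the same format as the messages above. This is a hypothetical sketch of the idea, not the actual patch (`cudaChk` is an illustrative name):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking wrapper: decodes the error and reports
// the exact call site before aborting.
#define cudaChk(call)                                                         \
  do {                                                                        \
    cudaError_t err_ = (call);                                                \
    if (err_ != cudaSuccess) {                                                \
      fprintf(stderr, "Fatal CUDA Error %s (%d) at %s:%d on '%s'. Exiting.\n",\
              cudaGetErrorString(err_), (int)err_, __FILE__, __LINE__, #call);\
      abort();                                                                \
    }                                                                         \
  } while (0)

// Usage:
//   cudaStream_t kernel_stream;
//   cudaChk(cudaStreamCreate(&kernel_stream));
```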
#4 Updated by Michael Robson over 3 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 80
- Subject changed from CUDA Hybrid API Broken in non-SMP mode (Blue Waters) to CUDA examples broken on Blue Waters
- I have an improved error-checking patch that I'm going to clean up and submit; I'll post the link here when I do
- Due to this patch I was able to determine that nodeGPU has the same error
- Which means that Jim was (probably) right about the machine being in process-exclusive mode by default
- I checked, and BW reports the default compute mode, so multiple processes should be able to access the device
- However, that doesn't actually seem to be the case in practice
- Fortunately, you can fix this by enabling the CUDA proxy (MPS); see https://bluewaters.ncsa.illinois.edu/accelerator-jobs-on-xk-nodes
- Enabling CRAY_CUDA_MPS fixes the crash in both builds
- My plan is to add this to the manual
- I also plan to try this on Titan and see if there is a similar configuration issue
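For the manual, the workaround above amounts to one line in the batch script before the launch. A sketch of a job-script fragment, assuming the environment variable and launch line from the Blue Waters page linked above:

```shell
#!/bin/bash
# Enable the Cray CUDA proxy (MPS) so multiple PEs per node can
# share the single GPU on an XK node.
export CRAY_CUDA_MPS=1

# Then launch as usual, e.g. 4 non-SMP PEs:
aprun -n 4 -d 2 ./hello 10
```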