send messages from non-PE threads
CUDA can call back to the CPU when a kernel finishes, but the callback executes in a thread launched by the CUDA runtime rather than in a PE thread, so message sends from it fail. If messages could be sent from this thread, it would no longer be necessary to poll the GPU for kernel completion. This would only need to be supported in the SMP/multicore builds. The same mechanism could also be used for dedicated I/O threads.
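
To make the request concrete, here is a minimal sketch of one way it might work, written in plain C++ so that it runs without a GPU. Everything in it is hypothetical: `PeInbox`, `post`, `deliver_one`, and `run_demo` are illustrative names and not part of any existing runtime API. A `std::thread` stands in for the thread that the CUDA runtime would use to run a completion callback (for example one registered with `cudaLaunchHostFunc`). The idea is that the foreign thread only enqueues the message under a lock, and the owning PE thread does the actual delivery.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical per-PE inbox: any thread may post to it,
// but only the owning PE thread drains it.
class PeInbox {
public:
    // Safe to call from any thread, including one the runtime did not create.
    void post(std::function<void()> msg) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            q_.push(std::move(msg));
        }
        cv_.notify_one();
    }

    // Called only by the PE thread: block until a message arrives, then run it.
    void deliver_one() {
        std::function<void()> msg;
        {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            msg = std::move(q_.front());
            q_.pop();
        }
        msg();  // executed on the PE thread, outside the lock
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
};

// The calling thread plays the PE; `foreign` plays the callback thread.
int run_demo() {
    PeInbox inbox;
    int completed_kernel = -1;  // assumed payload: an id for the finished kernel

    std::thread foreign([&inbox, &completed_kernel] {
        // Stand-in for a kernel-completion callback: send a message to the PE.
        inbox.post([&completed_kernel] { completed_kernel = 7; });
    });

    inbox.deliver_one();  // the PE waits for the message instead of polling the GPU
    foreign.join();
    assert(completed_kernel == 7);
    return completed_kernel;
}
```

In a real scheduler the PE thread would presumably check such an inbox as part of its normal loop rather than block in `deliver_one`; the blocking wait is used here only to keep the sketch short. A dedicated I/O thread could post to the same inbox in the same way.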