GPU Handler PEs
Executing heterogeneous tasks on the same PE often hinders the execution of GPU tasks, for two reasons:
1) The completion of a GPU task cannot be handled until the currently executing CPU task is complete.
2) The Charm++ callback function invoked by the handler of the completed GPU task could be delayed if other messages are sitting in the queue, which subsequently delays dependent tasks. The delay of dependent GPU tasks in iterative applications results in underutilization of the GPU and thus degrades performance.
When spare PEs are available, these issues can potentially be resolved by dedicating those PEs to handling only GPU tasks. Spare PEs arise, for example, when two cores share an FP scheduler, as on OLCF Titan, where using both cores is not much better than using one in an FP-intensive application, or on MIC architectures such as Intel's Xeon Phi.
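The two sources of delay, and the effect of a dedicated handler PE, can be illustrated with a toy timing model. This is only a sketch under simplifying assumptions: it is not Charm++ code, and every name in it (`Task`, `shared_pe_callback_start`, `dedicated_pe_callback_start`) is hypothetical. It assumes that a PE runs one task at a time without preemption, that queued messages are processed in order, and that an idle handler PE reacts to a GPU completion immediately.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """A unit of CPU work queued on a PE (hypothetical model, not a Charm++ type)."""
    name: str
    duration: float  # time the task occupies the PE


def shared_pe_callback_start(gpu_done_at: float,
                             running_task_ends_at: float,
                             queued_ahead: List[Task]) -> float:
    """Time at which the GPU completion callback starts on a PE shared with CPU work."""
    # Issue 1: the completion is not handled until the running CPU task is complete.
    noticed_at = max(gpu_done_at, running_task_ends_at)
    # Issue 2: the callback message waits behind messages already in the queue.
    return noticed_at + sum(t.duration for t in queued_ahead)


def dedicated_pe_callback_start(gpu_done_at: float,
                                handler_free_at: float = 0.0) -> float:
    """Time at which the callback starts on a PE that handles only GPU tasks."""
    # Assumption: the handler PE has no CPU tasks, so it only waits for itself.
    return max(gpu_done_at, handler_free_at)


if __name__ == "__main__":
    queue = [Task("msg_a", 1.5), Task("msg_b", 2.5)]
    shared = shared_pe_callback_start(gpu_done_at=2.0,
                                      running_task_ends_at=5.0,
                                      queued_ahead=queue)
    dedicated = dedicated_pe_callback_start(gpu_done_at=2.0)
    print(f"shared PE:    callback starts at t={shared}")
    print(f"dedicated PE: callback starts at t={dedicated}")
    print(f"delay avoided: {shared - dedicated}")
```

In this model the dependent GPU task cannot be launched before the callback runs, so the difference between the two start times is the time the GPU may sit idle in each iteration. Real behavior depends on scheduling details that the model deliberately leaves out.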