Debugging is a fundamental part of software development, and one of the most time-consuming. When developing parallel applications, debugging becomes much harder due to a whole new class of problems that do not occur in sequential applications; race conditions are a famously difficult example. Moreover, a problem sometimes does not manifest itself when an application runs on a few processors, only to appear when a larger number of processors is used. In this scenario, it is important to develop techniques that help both the debugger and the programmer handle large-scale applications. One problem is whether the programmer can directly control the execution of all the allocated processors, even when the debugger itself is capable of handling them. Another is the feasibility of occupying a large machine for the time needed to discover the cause of a problem, which is typically many hours.
In this thesis, we explore a new approach based on a tight integration between the debugger and the application's underlying parallel runtime system. The debugger is responsible for interacting with the user and receiving commands; the parallel runtime system is responsible for managing the application and performing the operations the user requests through the debugger interface. This integration allows debugging techniques to scale to very large machines and helps the user focus on the processors where a problem manifests. Furthermore, the parallel runtime system is in a unique position to enable powerful techniques that can reduce the need for large parallel machines when debugging a large-scale application.