E. Performance Tracing for Analysis

Projections is a performance analysis/visualization framework that helps you understand and investigate performance-related problems in the (Charm++ ) applications. It is a framework with an event tracing component which allows to control the amount of information generated. The tracing has low perturbation on the application. It also has a Java-based visualization and analysis component with various views that help present the performance information in a visually useful manner.

Performance analysis with Projections typically involves two simple steps:

  1. Prepare your application by linking with the appropriate trace generation modules and execute it to generate trace data.
  2. Using the Java-based tool to visually study various aspects of the performance and locate the performance issues for that application execution.

The Charm++ runtime automatically records pertinent performance data for performance-related events during execution. These events include the start and end of entry method execution, message send from entry methods and scheduler idle time. This means most users do not need to manually insert code into their applications in order to generate trace data. In scenarios where special performance information not captured by the runtime is required, an API (see section E.2 ) is available for user-specific events with some support for visualization by the Java-based tool. If greater control over tracing activities (e.g. dynamically turning instrumentation on and off) is desired, the API also allows users to insert code into their applications for such purposes.

The automatic recording of events by the Projections framework introduces the overhead of an if-statement for each runtime event, even if no performance analysis traces are desired. Developers of Charm++ applications who consider such an overhead to be unacceptable (e.g. for a production application which requires the absolute best performance) may recompile the Charm++ runtime with the --with-production flag, which removes the instrumentation stubs. To enable the instrumentation stubs while retaining the other optimizations enabled by --with-production , one may compile Charm++ with both --with-production and --enable-tracing , which explicitly enables Projections tracing.

To enable performance tracing of your application, users simply need to link the appropriate trace data generation module(s) (also referred to as tracemode(s) ). (see section E.1 )

E. 1 Enabling Performance Tracing at Link/Run Time

Projections tracing modules dictate the type of performance data, data detail and data format each processor will record. They are also referred to as ``tracemodes''. There are currently 2 tracemodes available. Zero or more tracemodes may be specified at link-time. When no tracemodes are specified, no trace data is generated.

E. 1 . 1 Tracemode projections

Link time option: -tracemode projections

This tracemode generates files that contain information about all Charm++ events like entry method calls and message packing during the execution of the program. The data will be used by Projections in visualization and analysis.

This tracemode creates a single symbol table file and $p$ ASCII log files for $p$ processors. The names of the log files will be NAME.#.log where NAME is the name of your executable and # is the processor #. The name of the symbol table file is NAME.sts where NAME is the name of your executable.

This is the main source of data needed by the performance visualizer. Certain tools like timeline will not work without the detail data from this tracemode.

The following is a list of runtime options available under this tracemode:

E. 1 . 2 Tracemode summary

Compile option: -tracemode summary

In this tracemode, execution time across all entry points for each processor is partitioned into a fixed number of equally sized time-interval bins. These bins are globally resized whenever they are all filled in order to accommodate longer execution times while keeping the amount of space used constant.

Additional data like the total number of calls made to each entry point is summarized within each processor.

This tracemode will generate a single symbol table file and $p$ ASCII summary files for $p$ processors. The names of the summary files will be NAME.#.sum where NAME is the name of your executable and # is the processor #. The name of the symbol table file is NAME.sum.sts where NAME is the name of your executable.

This tracemode can be used to control the amount of output generated in a run. It is typically used in scenarios where a quick look at the overall utilization graph of the application is desired to identify smaller regions of time for more detailed study. Attempting to generate the same graph using the detailed logs of the prior tracemode may be unnecessarily time consuming or impossible.

The following is a list of runtime options available under this tracemode:

E. 1 . 3 General Runtime Options

The following is a list of runtime options available with the same semantics for all tracemodes:

E. 1 . 4 End-of-run Analysis for Data Reduction

As applications are scaled to thousands or hundreds of thousands of processors, the amount of data generated becomes extremely large and potentially unmanageable by the visualization tool. At the time of documentation, Projections is capable of handling data from 8000+ processors but with somewhat severe tool responsiveness issues. We have developed an approach to mitigate this data size problem with options to trim-off ``uninteresting'' processors' data by not writing such data at the end of an application's execution.

This is currently done through heuristics to pick out interesting extremal (i.e. poorly behaved) processors and at the same time using a k-means clustering to pick out exemplar processors from equivalence classes to form a representative subset of processor data. The analyst is advised to also link in the summary module via +tracemode summary and enable the +sumDetail option in order to retain some profile data for processors whose data were dropped.

This feature is still being developed and refined as part of our research. It would be appreciated if users of this feature could contact the developers if you have input or suggestions.

E. 2 Controlling Tracing from Within the Program

E. 2 . 1 Selective Tracing

Charm++ allows users to start/stop tracing the execution at certain points in time on the local processor. Users are advised to make these calls on all processors and at well-defined points in the application.

Users may choose to have instrumentation turned off at first (by command line option +traceoff - see section E.1.3 ) if some period of time in middle of the applications execution is of interest to the user.

Alternatively, users may start the application with instrumentation turned on (default) and turn off tracing for specific sections of the application.

Again, users are advised to be consistent as the +traceoff runtime option applies to all processors in the application.

E. 2 . 2 Explicit Flushing

By default, when linking with -tracemode projections , log files are flushed to disk whenever the number of entries on a processor reaches the logsize limit (see Section  E.1.1 ). However, this can occur at any time during the execution of the program, potentially causing performance perturbations. To address this, users can explicitly flush to disk using the traceFlushLog() function. Note that automatic flushing will still occur if the logsize limit is reached, but sufficiently frequent explicit flushes should prevent that from happening.

E. 2 . 3 User Events

Projections has the ability to visualize traceable user specified events. User events are usually displayed in the Timeline view as vertical bars above the entry methods. Alternatively the user event can be displayed as a vertical bar that vertically spans the timelines for all processors. Follow these following basic steps for creating user events in a charm++ program:

  1. Register an event with an identifying string and either specify or acquire a globally unique event identifier. All user events that are not registered will be displayed in white.

  2. Use the event identifier to specify trace points in your code of interest to you.

The functions available are as follows:

There are two main types of user events, bracketed and non bracketed. Non-bracketed user events mark a specific point in time. Bracketed user events span an arbitrary contiguous time range. Additionally, the user can supply a short user supplied text string that is recorded with the event in the log file. These strings should not contain newline characters, but they may contain simple html formatting tags such as <br> , <b> , <i> , <font color=#ff00ff> , etc.

The calls for recording user events are the following:

E. 2 . 4 User Stats

Charm++ allows the user to track the progression of any variable or value throughout the program execution. These user specified stats can then be plotted in Projections , either over time or by processor. To enable this feature for Charm++ , build Charm++ with the -enable-tracing flag.

Follow these steps to track user stats in a Charm++ program:

  1. Register a stat with an identifying string and a globally unique integer identifier.

  2. Update the value of the stat at points of interest in the code by calling the update stat functions.

  3. Compile program with -tracemode projections flag.

The functions available are as follows:

E. 2 . 5 Function-level Tracing for Adaptive MPI Applications

Adaptive MPI (AMPI) is an implementation of the MPI interface on top of Charm++ . As with standard MPI programs, the appropriate semantic context for performance analysis is captured through the observation of MPI calls within C/C++/Fortran functions. Users can selectively begin and end tracing in AMPI programs using the routines AMPI_Trace_begin and AMPI_Trace_end . Unfortunately, AMPI's implementation does not grant the runtime access to information about user function calls. As a result, the tracing framework must provide an explicit API for capturing this piece of performance information in addition to MPI calls (which are known to the runtime).

The functions, similar to those used to capture user events, are as follows:

AMPI function events captured by the use of this API are recognized by the visualization system and used for special AMPI-specific views in addition to standard Charm++ entry methods.