C. Running Charm++ Programs

C.1 Launching Programs with charmrun

When compiling Charm++ programs, the charmc linker produces both an executable file and a utility called charmrun, which is used to load the executable onto the parallel machine. To run a Charm++ program named ``pgm'' on four processors, type:

charmrun pgm +p4

Execution on platforms that use platform-specific launchers (e.g., aprun, ibrun) can proceed without charmrun, or charmrun can be used in coordination with those launchers via the ++mpiexec parameter (see C.2.1). Programs built using the network version of Charm++ can be run alone, without charmrun. This restricts you to using the processors on the local machine, but it is convenient and often useful for debugging. For example, a Charm++ program can be run on one processor in the debugger using:

gdb pgm

If the program needs some environment variables to be set for its execution on compute nodes (such as library paths), they can be set in a .charmrunrc file in the user's home directory; charmrun runs that shell script before running the executable. Charmrun normally keeps the number of status messages it prints to a minimum to avoid flooding the terminal with unimportant information. However, if you encounter trouble launching a job, try using the ++verbose option to help diagnose the issue. (See the ++quiet option to suppress output entirely.)

C.2 Command Line Options

A Charm++ program accepts the following command line options:

+auto-provision
Automatically determine the number of worker threads to launch in order to fully subscribe the machine running the program.

+autoProvision
Same as above.

+oneWthPerHost
Launch one worker thread per compute host. By the definition of standalone mode, this always results in exactly one worker thread.

+oneWthPerSocket
Launch one worker thread per CPU socket.

+oneWthPerCore
Launch one worker thread per CPU core.

+oneWthPerPU
Launch one worker thread per CPU processing unit, i.e. hardware thread.

+pN
Explicitly request exactly N worker threads. The default is 1.

+ss
Print summary statistics about chare creation. This option prints the total number of chare creation requests, and the total number of chare creation requests processed across all processors.

+cs
Print statistics about the number of create chare messages requested and processed, the number of messages for chares requested and processed, and the number of messages for branch office chares requested and processed, on a per processor basis. Note that the number of messages created and processed for a particular type of message on a given node may not be the same, since a message may be processed by a different processor from the one originating the request.

user_options
Options to be interpreted by the user program may be mixed with the system options. However, user_options cannot start with +. The user_options will be passed as arguments to the user program via the usual argc/argv construct to the main entry point of the main chare. Charm++ system options will not appear in argc/argv.

C.2.1 Additional Network Options

The following ++ command line options are available in the network version.

First, commands related to subscription of computing resources:

++auto-provision
Automatically determine the number of processes and threads to launch in order to fully subscribe the available resources.

++autoProvision
Same as above.

++processPerHost N
Launch N processes per compute host.

++processPerSocket N
Launch N processes per CPU socket.

++processPerCore N
Launch N processes per CPU core.

++processPerPU N
Launch N processes per CPU processing unit, i.e. hardware thread.

The above options should allow sufficient control over process provisioning for most users. If you need more control, the following advanced options are available:

++n N
Run the program with N processes. Functionally identical to +p in non-SMP mode (see below). The default is 1.

++p N
Total number of processing elements to create. In SMP mode, this refers to worker threads (where n * ppn = p); otherwise it refers to processes (n = p). The default is 1. Use of ++p is discouraged in favor of ++processPer* (and ++oneWthPer* in SMP mode) where desirable, or ++n (and ++ppn) otherwise.
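
The relationship between these counts can be sketched in a few lines of Python. This is only an illustrative model of the arithmetic described above, not charmrun's own logic: the function name process_count is hypothetical, and rejecting a total that is not a multiple of ppn is an assumption of the sketch.

```python
def process_count(p, ppn=None):
    """Return the number of OS processes implied by a total PE count.

    p   -- total number of processing elements (the ++p value)
    ppn -- worker threads per process in SMP mode, or None for non-SMP mode
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if ppn is None:
        # Non-SMP mode: every PE is its own process, so n = p.
        return p
    if ppn < 1:
        raise ValueError("ppn must be at least 1")
    if p % ppn != 0:
        # Assumption of this sketch: an uneven split is treated as an error.
        raise ValueError("p must be a multiple of ppn")
    # SMP mode: n * ppn = p, so n = p / ppn.
    return p // ppn


print(process_count(4))          # non-SMP mode: 4 processes
print(process_count(6, ppn=3))   # SMP mode: 2 processes of 3 worker threads
```

Stating ++n and ++ppn directly, as recommended above, avoids having to derive the process count from the total at all.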

The remaining options cover details of process launch and connectivity:

++local
Run the Charm++ program only on the local machine. No remote shell invocation is needed in this case; the node programs are started directly on the local machine. This can be useful for running a small program on a single machine, such as a laptop.

++mpiexec
Use the cluster's mpiexec job launcher instead of the built-in ssh method.

This will pass -n $P to indicate how many processes to launch. If -n $P is not required because the number of processes to launch is determined via queueing system environment variables then use ++mpiexec-no-n rather than ++mpiexec. An executable named something other than mpiexec can be used with the additional argument ++remote-shell runmpi, with `runmpi' replaced by the necessary name.

To pass additional arguments to mpiexec, specify ++remote-shell and list them as part of the value after the executable name as follows:

./charmrun ++mpiexec ++remote-shell "mpiexec -YourArgumentsHere" ./pgm

Use of this option can potentially provide a few benefits:

  • Faster startup compared to the SSH approach charmrun would otherwise use
  • No need to generate a nodelist file
  • Multi-node job startup on clusters that do not allow connections from the head/login nodes to the compute nodes

At present, this option depends on the environment variables for some common MPI implementations. It supports OpenMPI (OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE), M(VA)PICH (MPIRUN_RANK and MPIRUN_NPROCS or PMI_RANK and PMI_SIZE), and IBM POE (MP_CHILD and MP_PROCS).
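
As a rough illustration of what depending on these variables means, the Python sketch below looks up a rank and a process count from the variable pairs listed above. It is not charmrun's code: the function name, the probing order, and the choice to return None when nothing matches are all assumptions made for the example.

```python
# (rank variable, size variable) pairs named in the text above.
# The order in which they are probed is an assumption of this sketch.
RANK_SIZE_VARS = [
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),  # OpenMPI
    ("MPIRUN_RANK", "MPIRUN_NPROCS"),                  # M(VA)PICH
    ("PMI_RANK", "PMI_SIZE"),                          # M(VA)PICH with PMI
    ("MP_CHILD", "MP_PROCS"),                          # IBM POE
]


def detect_rank_and_size(env):
    """Return (rank, size) from the first complete pair found, else None."""
    for rank_var, size_var in RANK_SIZE_VARS:
        if rank_var in env and size_var in env:
            return int(env[rank_var]), int(env[size_var])
    return None


# A process started by a PMI-based launcher as rank 3 of 8:
print(detect_rank_and_size({"PMI_RANK": "3", "PMI_SIZE": "8"}))
# An environment with none of the variables set:
print(detect_rank_and_size({}))
```

If a launcher sets none of these pairs, a lookup of this kind has nothing to work with, which is why support is limited to the MPI implementations listed.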

++debug
Run each node under gdb in an xterm window, prompting the user to begin execution.

++debug-no-pause
Run each node under gdb in an xterm window immediately (i.e. without prompting the user to begin execution).

If using one of the ++debug or ++debug-no-pause options, the user must ensure the following:

  1. The DISPLAY environment variable points to your terminal. SSH's X11 forwarding does not work properly with Charm++.

  2. The nodes must be authorized to create windows on the host machine (see man pages for xhost and xauth).

  3. xterm, xdpyinfo, and gdb must be in the user's path.

  4. The path must be set in the .cshrc file, not the .login file, because ssh does not run the .login file.

++scalable-start
Scalable start, or SMP-aware startup. It is useful for scalable process launch on multi-core systems since it creates only one ssh session per node and spawns all clients from that ssh session. This is the default startup strategy and the option is retained for backward compatibility.

++batch
Ssh a set of node programs at a time, to avoid overloading the charmrun process. In this strategy, the nodes assigned to a charmrun are divided into sets of fixed size. Charmrun performs ssh to the nodes in the current set, waits for the clients to connect back, and then performs ssh on the next set. The number of nodes in one ssh set is called the batch size.
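
The batching strategy can be sketched as follows. This Python fragment only models how a node list is divided into fixed-size sets; the function name and node names are hypothetical, and the actual ssh launch and the wait for clients to connect back are represented by a comment.

```python
def ssh_batches(nodes, batch_size):
    """Split the node list into consecutive sets of at most batch_size nodes."""
    if batch_size < 1:
        raise ValueError("batch size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


# Hypothetical node names, launched two at a time.
for batch in ssh_batches(["n0", "n1", "n2", "n3", "n4"], 2):
    # A launcher would ssh to every node in this batch here and wait for all
    # of them to connect back before moving on to the next batch.
    print(batch)
```

The last set may be smaller than the batch size when the node count is not an exact multiple of it.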

++maxssh
Maximum number of ssh's to run at a time. For backwards compatibility, this option is also available as ++maxrsh.

++nodelist
File containing list of nodes.

++numHosts
Number of nodes from the nodelist to use. If the value requested is larger than the number of nodes found, Charmrun will error out.

++help
Print help messages.

++runscript
Script to run node-program with. The specified run script is invoked with the node program and parameters. For example:

./charmrun +p4 ./pgm 100 2 3 ++runscript ./set_env_script

In this case, the set_env_script is invoked on each node before launching pgm.

++xterm
Which xterm to use

++in-xterm
Run each node in an xterm window

++display
X Display for xterm

++debugger
Which debugger to use

++remote-shell
Which remote shell to use

++useip
Use IP address provided for charmrun IP

++usehostname
Send nodes our symbolic hostname instead of IP address

++server-auth
CCS Authentication file

++server-port
Port to listen for CCS requests

++server
Enable client-server (CCS) mode

++nodegroup
Which group of nodes to use

++verbose
Print diagnostic messages

++quiet
Suppress runtime output during startup and shutdown

++timeout
Seconds to wait per host connection

++timelimit
Seconds to wait for program to complete

C.2.2 SMP Options

SMP mode in Charm++ spawns one OS process per logical node. Within this process there are two types of threads:

  1. Worker Threads that have objects mapped to them and execute entry methods

  2. Communication Thread that sends and receives data (depending on the network layer)

Charm++ always spawns one communication thread per process when using SMP mode and as many worker threads as the user specifies (see the options below). In general, the worker threads produce messages and hand them to the communication thread, which receives messages and schedules them on worker threads.

To use SMP mode in Charm++, build Charm++ with the smp option, e.g.:

./build charm++ netlrts-linux-x86_64 smp

There are various trade-offs associated with SMP mode. For instance, when using SMP mode there is no waiting to receive messages due to long running entry methods. There is also no time spent in sending messages by the worker threads and memory is limited by the node instead of per core. In SMP mode, intra-node messages use simple pointer passing, which bypasses the overhead associated with the network and extraneous copies. Another benefit is that the runtime will not pollute the caches of worker threads with communication-related data in SMP mode.

However, there are also some drawbacks associated with using SMP mode. First and foremost, you sacrifice one core to the communication thread. This is not ideal for compute bound applications. Additionally, this communication thread may become a serialization bottleneck in applications with large amounts of communication. Keep these trade-offs in mind when evaluating whether to use SMP mode for your application or deciding how many processes to launch per physical node when using SMP mode. Finally, any library code the application may call needs to be thread-safe.

Charm++ provides the following options to control the number of worker threads spawned and the placement of both worker and communication threads:

++oneWthPerHost
Launch one worker thread per compute host.

++oneWthPerSocket
Launch one worker thread per CPU socket.

++oneWthPerCore
Launch one worker thread per CPU core.

++oneWthPerPU
Launch one worker thread per CPU processing unit, i.e. hardware thread.

The above options (and ++auto-provision) should allow sufficient control over thread provisioning for most users. If you need more precise control over thread count and placement, the following options are available:

++ppn N
Number of PEs (or worker threads) per logical node (OS process). This option should be specified even when using platform specific launchers (e.g., aprun, ibrun).

+pemap L[-U[:S[.R]+O]][,...]
Bind the execution threads to the sequence of cores described by the arguments using the operating system's CPU affinity functions. Can be used outside SMP mode.

A single number identifies a particular core. Two numbers separated by a dash identify an inclusive range (lower bound and upper bound). If they are followed by a colon and another number (a stride), that range will be stepped through in increments of the additional number. Within each stride, a dot followed by a run will indicate how many cores to use from that starting point. A plus represents an offset added to the preceding core number. Multiple +offset flags are supported, e.g., 0-7+8+16 expands to 0,8,16,1,9,17, and so on.

For example, the sequence 0-8:2,16,20-24 includes cores 0, 2, 4, 6, 8, 16, 20, 21, 22, 23, 24. On a 4-way quad-core system, if one wanted to use 3 cores from each socket, one could write this as 0-15:4.3. As a further example, ++ppn 10 +pemap 0-11:6.5+12 is equivalent to ++ppn 10 +pemap 0,12,1,13,2,14,3,15,4,16,6,18,7,19,8,20,9,21,10,22.
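
To make the notation concrete, the following Python sketch expands a +pemap argument into the explicit core sequence, following the rules as described in this section. It is an illustrative model, not the runtime's parser: the function name is hypothetical, and truncating a run at the upper bound of its range is an assumption of the sketch.

```python
import re

# One comma-separated term: L[-U[:S[.R]]] followed by zero or more +O offsets.
_TERM = re.compile(r"^(\d+)(?:-(\d+)(?::(\d+)(?:\.(\d+))?)?)?((?:\+\d+)*)$")


def expand_pemap(spec):
    """Expand a +pemap style string into the list of core numbers it names."""
    cores = []
    for term in spec.split(","):
        match = _TERM.match(term.strip())
        if match is None:
            raise ValueError("bad pemap term: %r" % term)
        lower = int(match.group(1))
        upper = int(match.group(2)) if match.group(2) else lower
        stride = int(match.group(3)) if match.group(3) else 1
        run = int(match.group(4)) if match.group(4) else 1
        offsets = [int(o) for o in match.group(5).split("+") if o]
        if upper < lower or stride < 1 or run < 1:
            raise ValueError("bad pemap term: %r" % term)
        for start in range(lower, upper + 1, stride):
            # Assumption: a run never extends past the upper bound.
            for core in range(start, min(start + run, upper + 1)):
                cores.append(core)
                # Each +O is added to the core just emitted.
                cores.extend(core + offset for offset in offsets)
    return cores


print(expand_pemap("0-8:2,16,20-24"))
print(expand_pemap("0-15:4.3"))
print(expand_pemap("0-11:6.5+12"))
```

Run against the examples in this section, the sketch yields the same sequences as the text, which makes it a convenient way to check a map before passing it to +pemap.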

+commap p[,q,...]
Bind communication threads to the listed cores, one per process.

To run applications in SMP mode, we generally recommend using one logical node per socket or NUMA domain. ++ppn N will spawn N worker threads in addition to the one communication thread spawned by the runtime, so the total number of threads will be N+1 per logical node. Consequently, you should map the worker and communication threads to separate cores. Depending on your system and application, it may be necessary to spawn one thread fewer than the number of cores in order to leave one free for the OS to run on. An example run command might look like:

./charmrun ++ppn 3 +p6 +pemap 1-3,5-7 +commap 0,4 ./app <args>

This will create two logical nodes/OS processes (2 = 6 PEs/3 PEs per node), each with three worker threads/PEs (++ppn 3). The worker threads/PEs will be mapped as follows: PE 0 to core 1, PE 1 to core 2, PE 2 to core 3, PE 3 to core 5, PE 4 to core 6, and PE 5 to core 7. PEs/worker threads 0-2 comprise the first logical node and 3-5 the second. Additionally, the communication thread of the first logical node will be mapped to core 0 and that of the second logical node to core 4.

Please keep in mind that +p always specifies the total number of PEs created by Charm++, regardless of mode (the same number as returned by CkNumPes()). The +p option does not include the communication threads; there will always be exactly one of those per logical node.
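
The layout produced by the example command can be reproduced with a short Python sketch. It is a simplified model with hypothetical names: it takes the +pemap and +commap values as already expanded core lists, and it assumes that enough cores are listed for every thread.

```python
def smp_layout(p, ppn, pe_cores, comm_cores):
    """Describe each logical node: its communication core and PE-to-core map."""
    if p % ppn != 0:
        raise ValueError("p must be a multiple of ppn")
    nodes = p // ppn  # number of logical nodes (OS processes)
    if len(pe_cores) < p or len(comm_cores) < nodes:
        # Assumption of this sketch: every thread gets its own listed core.
        raise ValueError("not enough cores listed")
    layout = []
    for node in range(nodes):
        pes = range(node * ppn, (node + 1) * ppn)
        layout.append({
            "node": node,
            "comm_core": comm_cores[node],
            "pe_to_core": {pe: pe_cores[pe] for pe in pes},
        })
    return layout


# ./charmrun ++ppn 3 +p6 +pemap 1-3,5-7 +commap 0,4 ./app
layout = smp_layout(p=6, ppn=3, pe_cores=[1, 2, 3, 5, 6, 7], comm_cores=[0, 4])
for entry in layout:
    print(entry)

# Worker threads plus one communication thread per logical node.
print("threads in total:", len(layout) * (3 + 1))
```

For the example command this gives two logical nodes and eight threads in total: six worker threads counted by +p and two communication threads that +p does not count.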

C.2.3 Multicore Options

On multicore platforms, operating systems (by default) are free to move processes and threads among cores to balance load. This can sometimes degrade the performance of Charm++ applications because of the extra overhead of moving processes and threads, especially for applications that already implement their own dynamic load balancing.

Charm++ provides the following runtime options to set the processor affinity automatically so that processes or threads no longer move. When CPU affinity is supported by the operating system (tested at Charm++ configuration time), the same runtime options can be used for all flavors of Charm++, including network and MPI versions, SMP and non-SMP versions.

+setcpuaffinity
Set CPU affinity automatically for processes (in non-SMP versions of Charm++) or threads (in SMP versions). This option is recommended, as it prevents the OS from unnecessarily moving processes/threads around the processors of a physical node.

+excludecore <core #>
Do not set cpu affinity for the given core number. One can use this option multiple times to provide a list of core numbers to avoid.

C.2.4 IO buffering options

There may be circumstances where a Charm++ application may want to take or relinquish control of stdout buffer flushing. Most systems default to giving the Charm++ runtime control over stdout but a few default to giving the application that control. The user can override these system defaults with the following runtime options:

+io_flush_user
User (application) controls stdout flushing.

+io_flush_system
The Charm++ runtime controls flushing.

C.3 Nodelist file

For a network of workstations, the list of machines on which to run the program can be specified in a file. Without a nodelist file, Charm++ runs the program only on the local machine.

The format of this file allows you to define groups of machines, giving each group a name. Each line of the nodes file is a command. The most important command is:

host <hostname> <qualifiers>

which specifies a host. The other commands are qualifiers: they modify the properties of all hosts that follow them. The qualifiers are:

group <groupname>      - subsequent hosts are members of specified group
login <login>          - subsequent hosts use the specified login
shell <shell>          - subsequent hosts use the specified remote shell
setup <cmd>            - subsequent hosts should execute cmd
pathfix <dir1> <dir2>  - subsequent hosts should replace dir1 with dir2 in the program path
cpus <n>               - subsequent hosts should use n light-weight processes
speed <s>              - subsequent hosts have relative speed rating s
ext <extn>             - subsequent hosts should append extn to the pgm name

Note: By default, charmrun uses the remote shell ``ssh'' to spawn node processes on the remote hosts. The shell qualifier can be used to override it with, say, ``rsh''. One can set the CONV_RSH environment variable or use the charmrun option ++remote-shell to override the default remote shell for all hosts without a shell qualifier.

All qualifiers accept ``*'' as an argument; this resets the modifier to its default value. Note that currently, the passwd, cpus, and speed factors are ignored. Inline qualifiers are also allowed:

host beauty ++cpus 2 ++shell ssh

Except for ``group'', every other qualifier can be inlined, with the restriction that if the ``setup'' qualifier is inlined, it should be the last qualifier on the ``host'' or ``group'' statement line.

Here is a simple nodes file:

        group kale-sun ++cpus 1
          host charm ++shell ssh
          host dp
          host grace
          host dagger
        group kale-sol
          host beauty ++cpus 2
        group main
          host localhost

This defines three groups of machines: group kale-sun, group kale-sol, and group main. The ++nodegroup option is used to specify which group of machines to use. Note that there is wraparound: if you specify more nodes than there are hosts in the group, it will reuse hosts. Thus,

        charmrun pgm ++nodegroup kale-sun +p6

uses hosts (charm, dp, grace, dagger, charm, dp) respectively as nodes (0, 1, 2, 3, 4, 5).
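
The wraparound rule can be sketched in Python as follows. This is a deliberately reduced model rather than charmrun's parser: it reads only group and host lines, ignores all qualifiers, and uses hypothetical function names. The host names are the ones from the example above.

```python
NODES_FILE = """\
group kale-sun ++cpus 1
  host charm ++shell ssh
  host dp
  host grace
  host dagger
group main
  host localhost
"""


def parse_groups(text):
    """Return {group name: [host names]}, ignoring every qualifier."""
    groups = {}
    current = "main"  # assumption: hosts before any group line go to main
    for line in text.splitlines():
        words = line.split()
        if len(words) < 2:
            continue
        if words[0] == "group":
            current = words[1]
            groups.setdefault(current, [])
        elif words[0] == "host":
            groups.setdefault(current, []).append(words[1])
    return groups


def assign_hosts(hosts, node_count):
    """Assign nodes 0..node_count-1 to hosts, reusing hosts when they run out."""
    if not hosts:
        raise ValueError("the group has no hosts")
    return [hosts[node % len(hosts)] for node in range(node_count)]


groups = parse_groups(NODES_FILE)
print(assign_hosts(groups["kale-sun"], 6))  # ++nodegroup kale-sun +p6
print(assign_hosts(groups["main"], 4))      # default group, +p4
```

Node i is placed on host i modulo the number of hosts in the group, which is why a single-host group such as main is simply reused for every node.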

If you don't specify a ++nodegroup, the default is ++nodegroup main. Thus, if one specifies

        charmrun pgm +p4

it will use ``localhost'' four times. ``localhost'' is a Unix trick; it always resolves to whatever machine you're on.

When using ``ssh'', the user will have to set up password-less login to remote hosts using public key authentication based on a key-pair, adding the public keys to the ``.ssh/authorized_keys'' file. See the ``ssh'' documentation for more information. If ``rsh'' is used for remote login to the compute nodes, the user is required to set up remote login permissions on all nodes using the ``.rhosts'' file in their home directory.

In a network environment, charmrun must be able to locate the directory of the executable. If all workstations share a common file name space this is trivial. If they don't, charmrun will attempt to find the executable in a directory with the same path from the $HOME directory. Pathname resolution is performed as follows:

  1. The system computes the absolute path of pgm.
  2. If the absolute path starts with the equivalent of $HOME or the current working directory, the beginning part of the path is replaced with the environment variable $HOME or the current working directory. However, if ++pathfix dir1 dir2 is specified in the nodes file (see above), the part of the path matching dir1 is replaced with dir2.
  3. The system tries to locate this program (with modified pathname and appended extension if specified) on all nodes.
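
The steps above can be approximated by the Python sketch below. It is a loose model under stated assumptions, not charmrun's implementation: the function and parameter names are hypothetical, the home prefix is replaced by the literal text $HOME on the assumption that the remote side expands it, only the first match of dir1 is replaced, and the current working directory case of step 2 is not modelled.

```python
import posixpath


def remote_program_path(pgm, cwd, local_home, pathfix=None, ext=""):
    """Return the path that would be looked up on the remote nodes."""
    # Step 1: compute the absolute path of the program.
    path = posixpath.normpath(posixpath.join(cwd, pgm))

    # Step 2: rewrite the path, either with pathfix or relative to $HOME.
    if pathfix is not None:
        dir1, dir2 = pathfix
        path = path.replace(dir1, dir2, 1)
    else:
        home = local_home.rstrip("/")
        if path == home or path.startswith(home + "/"):
            # Assumption: the literal $HOME is expanded on the remote node.
            path = "$HOME" + path[len(home):]

    # Step 3 would try this name, with any extension appended, on all nodes.
    return path + ext


# Hypothetical directories, for illustration only.
print(remote_program_path("./pgm", "/home/alice/run", "/home/alice"))
print(remote_program_path("./pgm", "/home/alice/run", "/home/alice",
                          pathfix=("/home/alice", "/scratch/alice")))
print(remote_program_path("/opt/apps/pgm", "/tmp", "/home/alice", ext=".x86"))
```

A program outside the home directory keeps its absolute path unchanged, so it must exist at the same location on every node unless a pathfix rule applies.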