- C.1 Launching Programs with charmrun
- C.2 Command Line Options
- C.2.1 Additional Network Options
- C.2.2 SMP Options
- C.2.3 Multicore Options
- C.2.4 IO buffering options
- C.3 Nodelist file
C.1 Launching Programs with charmrun

When compiling Charm++ programs, the charmc linker produces both an executable file and a utility called charmrun, which is used to load the executable onto the parallel machine. To run a Charm++ program named ``pgm'' on four processors, type:

charmrun pgm +p4

Execution on platforms which use platform-specific launchers (e.g., aprun, ibrun) can proceed without charmrun, or charmrun can be used in coordination with those launchers via the ++mpiexec parameter (see C.2.1).

Programs built using the network version of Charm++ can be run alone, without charmrun. This restricts you to using the processors on the local machine, but it is convenient and often useful for debugging. For example, a Charm++ program can be run on one processor in the debugger using:

gdb pgm
If the program needs some environment variables to be set for its execution on compute nodes (such as library paths), they can be set in .charmrunrc in your home directory. charmrun will run that shell script before running the executable.
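As an illustration, a minimal .charmrunrc might export a library path before the executable starts. The specific paths below are hypothetical examples, not defaults:

```shell
# Hypothetical ~/.charmrunrc -- run by charmrun on each node before the executable.
# Add a private library directory to the loader path (example path only).
export LD_LIBRARY_PATH="$HOME/mylibs:$LD_LIBRARY_PATH"
```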
C.2 Command Line Options

A Charm++ program accepts the following command line options:

+pN Run the program with N processors. The default is 1.
+ss Print summary statistics about chare creation. This option prints the total number of chare creation requests, and the total number of chare creation requests processed across all processors.
+cs Print statistics about the number of create chare messages requested and processed, the number of messages for chares requested and processed, and the number of messages for branch office chares requested and processed, on a per processor basis. Note that the number of messages created and processed for a particular type of message on a given node may not be the same, since a message may be processed by a different processor from the one originating the request.
user_options Options to be interpreted by the user program may be included mixed with the system options. However, user_options cannot start with +. The user_options will be passed as arguments to the user program via the usual argc/argv construct to the main entry point of the main chare. Charm++ system options will not appear in argc/argv.
C.2.1 Additional Network Options

The following command line options are available in the network version:
++local Run charm program only on local machines. No remote shell invocation is needed in this case. It starts node programs right on your local machine. This could be useful if you just want to run a small program on only one machine, for example, your laptop.
++mpiexec Use the cluster's mpiexec job launcher instead of the built-in ssh method. This will pass -n $P to indicate how many processes to launch. If -n $P is not required because the number of processes to launch is determined via queueing system environment variables, then use ++mpiexec-no-n rather than ++mpiexec. An executable named something other than mpiexec can be used with the additional argument ++remote-shell runmpi, with `runmpi' replaced by the necessary name.
Use of this option can potentially provide a few benefits:
- Faster startup compared to the SSH approach charmrun would otherwise use
- No need to generate a nodelist file
- Multi-node job startup on clusters that do not allow connections from the head/login nodes to the compute nodes
At present, this option depends on the environment variables of some common MPI implementations. It supports OpenMPI (OMPI_COMM_WORLD_SIZE), M(VA)PICH (PMI_SIZE), and IBM POE (MP_PROCS).
++debug Run each node under gdb in an xterm window, prompting the user to begin execution.
++debug-no-pause Run each node under gdb in an xterm window immediately (i.e., without prompting the user to begin execution).
If using one of the ++debug or ++debug-no-pause options, the user must ensure the following:

- The DISPLAY environment variable points to your terminal. SSH's X11 forwarding does not work properly with Charm++.
- The nodes must be authorized to create windows on the host machine (see man pages for xhost and xauth).
- gdb must be in the user's path.
- The path must be set in the .cshrc file, not the .login file, because ssh does not run the .login script.
++scalable-start Scalable start, or SMP-aware startup. It is useful for scalable process launch on multi-core systems since it creates only one ssh session per node and spawns all clients from that ssh session. This is the default startup strategy and the option is retained for backward compatibility.
++batch Ssh a set of node programs at a time, avoiding overloading the charmrun PE. In this strategy, the nodes assigned to a charmrun are divided into sets of fixed size. Charmrun performs ssh to the nodes in the current set, waits for the clients to connect back, and then performs ssh on the next set. The number of nodes in one ssh set is called the batch size.
++maxssh Maximum number of ssh's to run at a time. For backwards compatibility, this option is also available as ++maxrsh.
++nodelist File containing list of nodes.

++help Print help messages.
++runscript Script to run node-program with. The specified run script is invoked with the node program and parameters. For example:

./charmrun +p4 ./pgm 100 2 3 ++runscript ./set_env_script

In this case, the set_env_script is invoked on each node before launching pgm.
++xterm Which xterm to use.

++in-xterm Run each node in an xterm window.

++display X Display for xterm.

++debugger Which debugger to use.

++remote-shell Which remote shell to use.

++useip Use IP address provided for charmrun IP.

++usehostname Send nodes our symbolic hostname instead of IP address.

++server-auth CCS Authentication file.

++server-port Port to listen for CCS requests.

++server Enable client-server (CCS) mode.

++nodegroup Which group of nodes to use.

++verbose Print diagnostic messages.

++quiet Suppress runtime output during startup and shutdown.

++timeout Seconds to wait per host connection.

++p Number of processes to create.
C.2.2 SMP Options

SMP mode in Charm++ spawns one OS process per logical node. Within this process there are two types of threads:

- Worker threads that have objects mapped to them and execute entry methods
- A communication thread that sends and receives data (depending on the network layer)
Charm++ always spawns one communication thread per process when using SMP
mode and as many worker threads as the user specifies (see the options below).
In general, the worker threads produce messages and hand them to the communication
thread, which receives messages and schedules them on worker threads.
To use SMP mode in Charm++, build charm with the ``smp'' option, e.g.:

./build charm++ netlrts-linux-x86_64 smp
There are various trade-offs associated with SMP mode. For instance, when using SMP mode there is no waiting to receive messages due to long running entry methods. There is also no time spent in sending messages by the worker threads and memory is limited by the node instead of per core. In SMP mode, intra-node messages use simple pointer passing, which bypasses the overhead associated with the network and extraneous copies. Another benefit is that the runtime will not pollute the caches of worker threads with communication-related data in SMP mode.
However, there are also some drawbacks associated with using SMP mode. First and foremost, you sacrifice one core to the communication thread. This is not ideal for compute bound applications. Additionally, this communication thread may become a serialization bottleneck in applications with large amounts of communication. Keep these trade-offs in mind when evaluating whether to use SMP mode for your application or deciding how many processes to launch per physical node when using SMP mode. Finally, any library code the application may call needs to be thread-safe.
Charm++ provides the following options to control the number of worker threads spawned and the placement of both worker and communication threads:
++ppn N Number of PEs (or worker threads) per logical node (OS process). This option should be specified even when using platform-specific launchers (e.g., aprun, ibrun).
+pemap L[-U[:S[.R]]+O][,...] Bind the execution threads to the sequence of cores described by the arguments using the operating system's CPU affinity functions. Can be used outside SMP mode.

A single number identifies a particular core. Two numbers separated by a dash identify an inclusive range (lower bound and upper bound). If they are followed by a colon and another number (a stride), that range will be stepped through in increments of the additional number. Within each stride, a dot followed by a run will indicate how many cores to use from that starting point. A plus represents the offset to the previous core number. Multiple +offset flags are supported, e.g., 0-7+8+16 equals 0,8,16,1,9,17.

For example, the sequence 0-8:2,16,20-24 includes cores 0, 2, 4, 6, 8, 16, 20, 21, 22, 23, 24. On a 4-way quad-core system, if one wanted to use 3 cores from each socket, one could write this as 0-15:4.3. ++ppn 10 +pemap 0-11:6.5+12 equals ++ppn 10 +pemap 0,12,1,13,2,14,3,15,4,16,6,18,7,19,8,20,9,21,10,22.
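The expansion rules above can be sketched in a few lines. The following is an illustrative sketch, not the runtime's actual parser; it interprets each +offset as relative to the base core, which matches the expansions quoted above:

```python
import re

def expand_pemap(spec):
    """Expand a +pemap core list like '0-11:6.5+12' into explicit core numbers.

    Grammar (from the manual): L[-U[:S[.R]]+O][,...]
    Illustrative sketch only -- not charmrun's implementation.
    """
    cores = []
    for term in spec.split(","):
        # Split off any "+O" offsets; each base core also emits core+O.
        base, *offs = term.split("+")
        offsets = [0] + [int(o) for o in offs]
        m = re.fullmatch(r"(\d+)(?:-(\d+)(?::(\d+)(?:\.(\d+))?)?)?", base)
        lo = int(m.group(1))
        hi = int(m.group(2)) if m.group(2) else lo
        # No stride: step over the whole range; no run: one core per stride step.
        stride = int(m.group(3)) if m.group(3) else hi - lo + 1
        run = int(m.group(4)) if m.group(4) else (1 if m.group(3) else hi - lo + 1)
        for start in range(lo, hi + 1, stride):
            for c in range(start, min(start + run, hi + 1)):
                for o in offsets:
                    cores.append(c + o)
    return cores
```

For instance, expand_pemap("0-8:2,16,20-24") yields the list 0, 2, 4, 6, 8, 16, 20, 21, 22, 23, 24 given in the text.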
+commap p[,q,...] Bind communication threads to the listed cores, one per process.
To run applications in SMP mode, we generally recommend using one logical node
per socket or NUMA domain.
Using ++ppn N will spawn N threads in addition to 1 thread spawned by the runtime for the communication thread, so the total number of threads will be N+1 per logical node. Consequently, you should map both the worker and communication threads to separate cores. Depending on your system and application, it may be necessary to spawn one thread less than the number of cores in order to leave one free for the OS to run on. An example run command might look like:
./charmrun ++ppn 3 +p6 +pemap 1-3,5-7 +commap 0,4 ./app <args>
This will create two logical nodes/OS processes (2 = 6 PEs / 3 PEs per node), each with three worker threads/PEs (++ppn 3). The worker threads/PEs will be mapped thusly: PE 0 to core 1, PE 1 to core 2, PE 2 to core 3, and PE 3 to core 5, PE 4 to core 6, PE 5 to core 7. PEs/worker threads 0-2 comprise the first logical node and 3-5 are the second logical node. Additionally, the communication threads will be mapped to core 0, for the communication thread of the first logical node, and to core 4, for the communication thread of the second logical node.
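The placement just described follows a simple rule: PE i lives on logical node i // ppn and is bound to the i-th core of the +pemap list. A small sketch (hypothetical helper, not runtime code):

```python
def pe_placement(total_pes, ppn, pemap_cores):
    """Map each PE to (logical node, core) for a run such as
    ++ppn 3 +p6 +pemap 1-3,5-7. Illustrative sketch only."""
    return {pe: (pe // ppn, pemap_cores[pe]) for pe in range(total_pes)}

# The example run command: 6 PEs, 3 per node, pemap 1-3,5-7 = cores [1,2,3,5,6,7].
placement = pe_placement(6, 3, [1, 2, 3, 5, 6, 7])
```

Here placement[0] is (0, 1) (PE 0 on node 0, core 1) and placement[3] is (1, 5) (PE 3 on node 1, core 5), matching the mapping above.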
Please keep in mind that +p always specifies the total number of PEs created by Charm++, regardless of mode (the same number as returned by CkNumPes()). The +p option does not include the communication thread; there will always be exactly one of those per logical node.
C.2.3 Multicore Options

On multicore platforms, operating systems (by default) are free to move processes and threads among cores to balance load. However, this can sometimes degrade the performance of Charm++ applications due to the extra overhead of moving processes and threads, especially for Charm++ applications that already implement their own dynamic load balancing.

Charm++ provides the following runtime options to set the processor affinity automatically so that processes or threads no longer move. When CPU affinity is supported by an operating system (tested at Charm++ configuration time), the same runtime options can be used for all flavors of Charm++ versions, including network and MPI versions, SMP and non-SMP versions.
+setcpuaffinity Set CPU affinity automatically for processes (when Charm++ is based on non-SMP versions) or threads (when SMP). This option is recommended, as it prevents the OS from unnecessarily moving processes/threads around the processors of a physical node.
+excludecore <core #>
Do not set cpu affinity for the given core number. One can use this option multiple times to provide a list of core numbers to avoid.
C.2.4 IO buffering options

- +io_flush_user - User (application) controls stdout flushing
- +io_flush_system - The Charm++ runtime controls flushing
C.3 Nodelist file

For a network of workstations, the list of machines to run the program can be specified in a file. Without a nodelist file, Charm++ runs the program only on the local machine.
The format of this file allows you to define groups of machines, giving each group a name. Each line of the nodes file is a command. The most important command is:
host <hostname> <qualifiers>
which specifies a host. The other commands are qualifiers: they modify the properties of all hosts that follow them. The qualifiers are:
- group <groupname> - subsequent hosts are members of specified group
- login <login> - subsequent hosts use the specified login
- shell <shell> - subsequent hosts use the specified remote shell
- setup <cmd> - subsequent hosts should execute cmd
- pathfix <dir1> <dir2> - subsequent hosts should replace dir1 with dir2 in the program path
- cpus <n> - subsequent hosts should use N light-weight processes
- speed <s> - subsequent hosts have relative speed rating
- ext <extn> - subsequent hosts should append extn to the pgm name
By default, charmrun uses the remote shell ``ssh'' to spawn node processes on the remote hosts. The shell qualifier can be used to override it with, say, ``rsh''. One can set the CONV_RSH environment variable or use the charmrun option ++remote-shell to override the default remote shell for all hosts with an unspecified shell qualifier.
All qualifiers accept ``*'' as an argument; this resets the modifier to its default value. Note that currently, the passwd, cpus, and speed factors are ignored. Inline qualifiers are also allowed:
host beauty ++cpus 2 ++shell ssh
Except for ``group'', every other qualifier can be inlined, with the restriction that if the ``setup'' qualifier is inlined, it should be the last qualifier on the ``host'' or ``group'' statement line.
Here is a simple nodes file:
group kale-sun ++cpus 1
host charm.cs.illinois.edu ++shell ssh
host dp.cs.illinois.edu
host grace.cs.illinois.edu
host dagger.cs.illinois.edu
group kale-sol
host beauty.cs.illinois.edu ++cpus 2
group main
host localhost
This defines three groups of machines: group kale-sun, group kale-sol, and group main. The ++nodegroup option is used to specify which group of machines to use. Note that there is wraparound: if you specify more nodes than there are hosts in the group, it will reuse hosts. Thus,
charmrun pgm ++nodegroup kale-sun +p6
uses hosts (charm, dp, grace, dagger, charm, dp) respectively as nodes (0, 1, 2, 3, 4, 5).
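The wraparound rule is just modular indexing into the group's host list. A sketch (illustrative, not charmrun's code):

```python
def assign_hosts(hosts, n_pes):
    """Wraparound assignment of nodes to nodelist hosts, as described
    above for the case where +p exceeds the number of hosts in the group.
    Illustrative sketch only."""
    return [hosts[pe % len(hosts)] for pe in range(n_pes)]

# Group kale-sun from the example nodes file, run with +p6:
nodes = assign_hosts(["charm", "dp", "grace", "dagger"], 6)
```

This reproduces the assignment (charm, dp, grace, dagger, charm, dp) for nodes 0 through 5.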
If you don't specify a ++nodegroup, the default is ++nodegroup main. Thus, if one specifies
charmrun pgm +p4
it will use ``localhost'' four times. ``localhost'' is a Unix trick; it always finds a name for whatever machine you're on.
Using ``ssh'', the user will have to set up password-less login to remote hosts using public key authentication based on a key-pair, adding public keys to the ``.ssh/authorized_keys'' file. See the ``ssh'' documentation for more information. If ``rsh'' is used for remote login to the compute nodes, the user is required to set up remote login permissions on all nodes using the ``.rhosts'' file in their home directory.
In a network environment, charmrun must be able to locate the directory of the executable. If all workstations share a common file name space this is trivial. If they don't, charmrun will attempt to find the executable in a directory with the same path from the $HOME directory. Pathname resolution is performed as follows:

- The system computes the absolute path of pgm.
- If the absolute path starts with the equivalent of $HOME or the current working directory, the beginning part of the path is replaced with the environment variable $HOME or the current working directory. However, if ++pathfix dir1 dir2 is specified in the nodes file (see above), the part of the path matching dir1 is replaced with dir2.
- The system tries to locate this program (with modified pathname and appended extension if specified) on all nodes.
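The substitution steps above can be sketched as follows. This is a hypothetical helper under the assumptions stated in the comments, not charmrun's actual implementation; the example paths are invented for illustration:

```python
def resolve_remote_path(abs_path, local_home, pathfix=None):
    """Sketch of the pathname rewriting described above.

    Assumptions: a ++pathfix dir1 dir2 pair (if given) takes effect as a
    prefix substitution, and a path under the local $HOME is rewritten to
    a literal "$HOME" prefix so each remote node expands its own home.
    """
    if pathfix is not None:
        dir1, dir2 = pathfix
        if abs_path.startswith(dir1):
            return dir2 + abs_path[len(dir1):]
    if abs_path.startswith(local_home):
        return "$HOME" + abs_path[len(local_home):]
    return abs_path
```

For example, with a local home of /home/me (hypothetical), /home/me/proj/pgm becomes $HOME/proj/pgm, while ++pathfix /scratch1 /scratch2 rewrites /scratch1/pgm to /scratch2/pgm.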