Adaptive MPI Manual
1 Introduction
This manual describes Adaptive MPI (AMPI ), which is an implementation of a significant subset 1 of MPI-1.1 Standard over Charm++ . Charm++ is a C++ -based parallel programming library being developed by Prof. L. V. Kalé and his students back from 1992 until now at University of Illinois.
We first describe our philosophy behind this work (why we do what we do). Later we give a brief introduction to Charm++ and rationale for AMPI (tools of the trade). We then describe AMPI in detail. Finally we summarize the changes required for original MPI codes to get them working with AMPI (current state of our work). Appendices contain the gory details of installing AMPI , building and running AMPI programs.
1 . 1 Overview
Developing parallel Computational Science and Engineering (CSE) applications is a complex task. One has to implement the right physics, develop or choose and code appropriate numerical methods, decide and implement the proper input and output data formats, perform visualizations, and be concerned with correctness and efficiency of the programs. It becomes even more complex for multi-physics coupled simulations such as the solid propellant rocket simulation application. In addition, many applications are dynamic and adaptively refined so load imbalance is a major challenge. Our philosophy is to lessen the burden of the application developers by providing advanced programming paradigms and versatile runtime systems that can handle many common programming and performance concerns automatically and let the application programmers focus on the actual application content.
Many of these concerns can be addressed using processor virtualization and
over-decomposition philosophy of Charm++ . Thus, the developer only sees
virtual processors and lets the runtime system deal with underlying physical
processors. This is implemented in AMPI by mapping MPI ranks to Charm++ user-level
threads as illustrated in Figure
1
. As an immediate and simple benefit, the programmer can use as
many virtual processors ("MPI ranks") as the problem can be easily decomposed
to them. For example, suppose the problem domain has
parts that can be
easily distributed but programming for general number of MPI processes is burdensome,
then the developer can have
virtual processors on any number of physical ones using AMPI .
AMPI 's execution model consists of multiple user-level threads per process and, typically, there is one process per physical processor. Charm++ scheduler coordinates execution of these threads (also called Virtual Processors or VPs) and controls execution as shown in Figure 2 . These VPs can also migrate between processors because of load balancing or other reasons. The number of VPs per processor specifies the virtualization ratio (degree of over-decomposition). For example, in Figure 2 virtualization ratio is four (there are four VPs per each processor). Figure 3 show how the problem domain is over-decomposed in AMPI 's VPs as opposed to other MPI implementations.
Another benefit of virtualization is communication and computation overlap which is automatically achieved without programming effort. Techniques such as software pipelining require significant programming effort to achieve this goal and improve performance. However, one can use AMPI to have more virtual processors than physical processors to overlap communication and computation. Each time a VP is blocked for communication, Charm++ scheduler picks the next VP among those that are ready to execute. In this manner, while some of the VPs of a physical processor are waiting for a message to arrive, others can continue their execution. Thus, performance will be improved without any change to the source code.
A potential benefit is that of better cache utilization. With over-decomposition, a smaller subdomain is accessed by a VP repeatedly in different function calls before getting blocked by communication and switching to another VP. That smaller subdomain may fit into cache if over-decomposition is enough. This concept is illustrated in Figure 4 where each AMPI subdomain (such as 12) is smaller than corresponding MPI subdomain (such as 3) and may fit into cache memory. Thus, there is a potential performance improvement without changing the source code.
One important concern is that of load imbalance. New generation parallel applications are dynamically varying, meaning that processors' load is shifting during execution. In a dynamic simulation application such as rocket simulation, burning solid fuel, sub-scaling for a certain part of the mesh, crack propagation, particle flows all contribute to load imbalance. Centralized load balancing strategy built into an application is impractical since each individual modules are developed almost independently by various developers. In addition, embedding a load balancing strategy in the code complicates it and programming effort increases significantly. Thus, the runtime system support for load balancing becomes even more critical. Figure 5 shows migration of a VP because of load imbalance. For instance, this domain may correspond to a weather forecast model where there is a tornado in top-left side, which requires more computation to simulate. AMPI will then migrate VP 13 to balance the division of work across processors and improve performance. Note that incorporating this sort of load balancing inside the application code may take a lot of effort and complicates the code.
There are different load balancing strategies built into Charm++ that can be selected. Among those, some may fit better for an application depending on its characteristics. Moreover, one can write a new load balancer, best suited for an application, by the simple API provided inside Charm++ infrastructure. Our approach is based on actual measurement of load information at runtime, and on migrating computations from heavily loaded to lightly loaded processors.
For this approach to be effective, we need the computation to be split into pieces many more in number than available processors. This allows us to flexibly map and re-map these computational pieces to available processors. This approach is usually called ``multi-domain decomposition''.
Charm++ , which we use as a runtime system layer for the work described here, simplifies our approach. It embeds an elaborate performance tracing mechanism, a suite of plug-in load balancing strategies, infrastructure for defining and migrating computational load, and is interoperable with other programming paradigms.
1 . 2 Terminology
- Module
-
A module refers to either a complete program or a library with an
orchestrator subroutine
2
. An orchestrator subroutine specifies the main control flow
of the module by calling various subroutines from the associated library and
does not usually have much state associated with it.
- Thread
-
A thread is a lightweight process that owns a stack and machine
registers including program counter, but shares code and data with other
threads within the same address space. If the underlying operating system
recognizes a thread, it is known as kernel thread, otherwise it is known as
user-thread. A context-switch between threads refers to suspending one thread's
execution and transferring control to another thread. Kernel threads typically
have higher context switching costs than user-threads because of operating
system overheads. The policy implemented by the underlying system for
transferring control between threads is known as thread scheduling policy.
Scheduling policy for kernel threads is determined by the operating system, and
is often more inflexible than user-threads. Scheduling policy is said to be
non-preemptive if a context-switch occurs only when the currently running
thread willingly asks to be suspended, otherwise it is said to be preemptive.
AMPI threads are non-preemptive user-level threads.
- Chunk
-
A chunk is a combination of a user-level thread and the data it
manipulates. When a program is converted from MPI to AMPI , we convert an MPI
process into a chunk. This conversion is referred to as chunkification.
- Object
-
An object is just a blob of memory on which certain computations
can be performed. The memory is referred to as an object's state, and the set
of computations that can be performed on the object is called the interface of
the object.
2 Charm++
Charm++ is an object-oriented parallel programming library for C++ . It differs from traditional message passing programming libraries (such as MPI) in that Charm++ is ``message-driven''. Message-driven parallel programs do not block the processor waiting for a message to be received. Instead, each message carries with itself a computation that the processor performs on arrival of that message. The underlying runtime system of Charm++ is called Converse , which implements a ``scheduler'' that chooses which message to schedule next (message-scheduling in Charm++ involves locating the object for which the message is intended, and executing the computation specified in the incoming message on that object). A parallel object in Charm++ is a C++ object on which a certain computations can be asked to performed from remote processors.
Charm++ programs exhibit latency tolerance since the scheduler always picks up the next available message rather than waiting for a particular message to arrive. They also tend to be modular, because of their object-based nature. Most importantly, Charm++ programs can be dynamically load balanced , because the messages are directed at objects and not at processors; thus allowing the runtime system to migrate the objects from heavily loaded processors to lightly loaded processors. It is this feature of Charm++ that we utilize for AMPI .
Since many CSE applications are originally written using MPI, one would have to do a complete rewrite if they were to be converted to Charm++ to take advantage of dynamic load balancing and other Charm++ benefits. This is indeed impractical. However, Converse - the runtime system of Charm++ - came to our rescue here, since it supports interoperability between different parallel programming paradigms such as parallel objects and threads. Using this feature, we developed AMPI , an implementation of a significant subset of MPI-1.1 standard over Charm++ . AMPI is described in the next section.
3 AMPI
AMPI utilizes the dynamic load balancing and other capabilities of Charm++ by associating a ``user-level'' thread with each Charm++ migratable object. User's code runs inside this thread, so that it can issue blocking receive calls similar to MPI, and still present the underlying scheduler an opportunity to schedule other computations on the same processor. The runtime system keeps track of computation loads of each thread as well as communication graph between AMPI threads, and can migrate these threads in order to balance the overall load while simultaneously minimizing communication overhead.
3 . 1 AMPI Status
Currently all the MPI-1.1 Standard functions are supported in AMPI , with a
collection of our extentions explained in detail in this manual. One-sided
communication calls in MPI-2 are implemented, but they are not taking advantage
of RMA features yet. Also ROMIO
3
has been integrated to support parallel I/O features. Link with
-lampiromio
to take advantage of this library.
Following MPI-1.1 basic datatypes are supported in AMPI . (Some are not available in Fortran binding. Refer to MPI-1.1 Standard for details.)
MPI_DATATYPE_NULL MPI_BYTE MPI_UNSIGNED_LONG MPI_LONG_DOUBLE_INT
MPI_DOUBLE MPI_PACKED MPI_LONG_DOUBLE MPI_2FLOAT
MPI_INT MPI_SHORT MPI_FLOAT_INT MPI_2DOUBLE
MPI_FLOAT MPI_LONG MPI_DOUBLE_INT MPI_LB
MPI_COMPLEX MPI_UNSIGNED_CHAR MPI_LONG_INT MPI_UB
MPI_LOGICAL MPI_UNSIGNED_SHORT MPI_2INT
MPI_CHAR MPI_UNSIGNED MPI_SHORT_INT
Following MPI-1.1 reduction operations are supported in AMPI .
MPI_MAX MPI_MIN MPI_SUM MPI_PROD MPI_MAXLOC MPI_MINLOC
MPI_LAND MPI_LOR MPI_LXOR MPI_BAND MPI_BOR MPI_BXOR
Following are AMPI extension calls, which will be explained in detail in this manual.
MPI_Migrate MPI_Checkpoint MPI_Restart MPI_Register MPI_Get_userdata
MPI_Ialltoall MPI_Iallgather MPI_Iallreduce MPI_Ireduce MPI_IGet
3 . 2 Name for Main Program
To convert an existing program to use AMPI, the main function or program may need to be renamed. The changes should be made as follows:
3 . 2 . 1 Fortran
You must declare the main program as a subroutine called ``MPI_MAIN''. Do not declare the main subroutine as a program because it will never be called by the AMPI runtime.
program pgm -> subroutine MPI_Main
... ...
end program -> end subroutine
3 . 2 . 2 C or C++
The main function can be left as is, if
mpi.h
is included before the main function. This header file has a preprocessor macro that renames main, and the renamed version is called by the AMPI runtime by each thread.
3 . 3 Global Variable Privatization
For the before-mentioned benefits to be effective, one needs to map multiple
user-level threads onto each processor.
Traditional MPI programs assume that the
entire processor is allocated to themselves, and that only one thread of
control exists within the process's address space. So, they may use global and static variables in the program.
However, global and static variables are problematic for multi-threaded environments such as AMPI or OpenMP.
This is because there is a single instance of those variables so they will be
shared among different threads in the single address space and a wrong result may be produced by the program.
Figure
6
shows an example of a multi-threaded application with
two threads in a single process.
is a global or static variable in this
example. Thread 1 assigns a value to it, then it gets blocked for communication
and another thread can continue. Thereby, thread 2 is scheduled next and
accesses
which is wrong. Semantics of this program needs separate
instances of
for each of the threads. Thats where the need arises
to make some transformations to the original MPI program in order to run
correctly with AMPI .
The basic transformation needed to port the MPI program to AMPI is privatization of global variables. 4 With the MPI process model, each MPI node can keep a copy of its own ``permanent variables'' - variables that are accessible from more than one subroutines without passing them as arguments. Module variables, ``saved'' subroutine local variables, and common blocks in Fortran 90 belong to this category. If such a program is executed without privatization on AMPI , all the AMPI threads that reside on one processor will access the same copy of such variables, which is clearly not the desired semantics. To ensure correct execution of the original source program, it is necessary to make such variables ``private'' to individual threads. We are two choices: automatic global swapping and manual code modification.
3 . 3 . 1 Automatic Globals Swapping
Thanks to the ELF Object Format, we have successfully automated the procedure of switching the set of user global variables when switching thread contexts. Executable and Linkable Format (ELF) is a common standard file format for Object Files in Unix-like operating systems. ELF maintains a Global Offset Table (GOT) for globals so it is possible to switch GOT contents at thread context-switch by the runtime system.
The only thing that the user needs to do is to set flag
-swapglobals
at compile and link time (e.g. ``ampicc Ðo prog prog.c -swapglobals"). It does not need
any change to the source code and works with any language (C, C++, Fortran, etc).
However, it does not handle static variables and has a context switching overhead that grows with the number of global variables.
Currently, this feature only works on x86 and x86_64
(e.g. amd64) platforms that fully support ELF. Thus, it may not work on PPC or
Itanium, or on some microkernels such as Catamount. When this feature does
not work for you,
you can try other ways of handling global or static variables, which are detailed in the
following sections.
3 . 3 . 2 Manual Change
We have employed a strategy of argument passing to do this privatization transformation. That is, the global variables are bunched together in a single user-defined type, which is allocated by each thread dynamically. Then a pointer to this type is passed from subroutine to subroutine as an argument. Since the subroutine arguments are passed on a stack, which is not shared across all threads, each subroutine, when executing within a thread operates on a private copy of the global variables.
This scheme is demonstrated in the following examples. The original Fortran 90
code contains a module
shareddata
. This module is used in the main
program and a subroutine
subA
.
!FORTRAN EXAMPLE
MODULE shareddata
INTEGER :: myrank
DOUBLE PRECISION :: xyz(100)
END MODULE
SUBROUTINE MPI_MAIN
USE shareddata
include 'mpif.h'
INTEGER :: i, ierr
CALL MPI_Init(ierr)
CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
DO i = 1, 100
xyz(i) = i + myrank
END DO
CALL subA
CALL MPI_Finalize(ierr)
END PROGRAM
SUBROUTINE subA
USE shareddata
INTEGER :: i
DO i = 1, 100
xyz(i) = xyz(i) + 1.0
END DO
END SUBROUTINE
//C Example
#include <mpi.h>
int myrank;
double xyz[100];
void subA();
int main(int argc, char** argv){
int i;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
for(i=0;i<100;i++)
xyz[i] = i + myrank;
subA();
MPI_Finalize();
}
void subA(){
int i;
for(i=0;i<100;i++)
xyz[i] = xyz[i] + 1.0;
}
AMPI executes the main subroutine inside a user-level thread as a subroutine.
Now we transform this program using the argument passing strategy. We first group the shared data into a user-defined type.
!FORTRAN EXAMPLE
MODULE shareddata
TYPE chunk
INTEGER :: myrank
DOUBLE PRECISION :: xyz(100)
END TYPE
END MODULE
//C Example
struct shareddata{
int myrank;
double xyz[100];
};
Now we modify the main subroutine to dynamically allocate this data and change the
references to them. Subroutine
subA
is then modified to take this data
as argument.
!FORTRAN EXAMPLE
SUBROUTINE MPI_Main
USE shareddata
USE AMPI
INTEGER :: i, ierr
TYPE(chunk), pointer :: c
CALL MPI_Init(ierr)
ALLOCATE(c)
CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
DO i = 1, 100
c%xyz(i) = i + c%myrank
END DO
CALL subA(c)
CALL MPI_Finalize(ierr)
END SUBROUTINE
SUBROUTINE subA(c)
USE shareddata
TYPE(chunk) :: c
INTEGER :: i
DO i = 1, 100
c%xyz(i) = c%xyz(i) + 1.0
END DO
END SUBROUTINE
//C Example
void MPI_Main{
int i,ierr;
struct shareddata *c;
ierr = MPI_Init();
c = (struct shareddata*)malloc(sizeof(struct shareddata));
ierr = MPI_Comm_rank(MPI_COMM_WORLD, c.myrank);
for(i=0;i<100;i++)
c.xyz[i] = i + c.myrank;
subA(c);
ierr = MPI_Finalize();
}
void subA(struct shareddata *c){
int i;
for(i=0;i<100;i++)
c.xyz[i] = c.xyz[i] + 1.0;
}
With these changes, the above program can be made thread-safe. Note that it is
not really necessary to dynamically allocate
chunk
. One could have
declared it as a local variable in subroutine
MPI_Main
. (Or for a
small example such as this, one could have just removed the
shareddata
module, and instead declared both variables
xyz
and
myrank
as
local variables). This is indeed a good idea if shared data are small in size.
For large shared data, it would be better to do heap allocation because in
AMPI , the stack sizes are fixed at the beginning (can be specified from the
command line) and stacks do not grow dynamically.
3 . 3 . 3 Source-to-source Transformation
Another approach is to do the changes described in the previous scheme automatically. It means that we can use a tool to transform the source code to move global or static variables in an object and pass them around. This approach is portable across systems and compilers and may also improve locality and hence cache utilization. It also does not have the context-switch overhead of swapping globals. However, it requires a new implementation of the tool for each language. Currently, there is a tool called Photran 5 for refactoring Fortran codes that can do this transformation. It is Eclipse-based and works by constructing Abstract Syntax Trees (ASTs) of the program.
3 . 3 . 4 TLS-Globals
Thread Local Store (TLS) was originally employed in kernel threads to localize variables and thread safety. It can be used by annotating global/static variables with __thread in the source code. Thus, those variables will have one instance per extant thread. This keyword is not an official extension of the C language, however compiler writers are encouraged to implement this feature. Currently, the ELF file format supports Thread Local Storage.It handles both global and static variables and has no context-switching overhead. Context-switching is just changing the TLS segment register to point to the thread's local copy. However, although it is popular, it is not supported by all compilers. Currently, Charm++ supports it for x86/x86_64 platforms. A modified gfortran is also available to use this feature. To use TLS-Globals, one has to add __thread before all the global variables. For the example above, the following changes to the code handles the global variables:
__thread int myrank;
__thread double xyz[100];
The runtime system also should know that TLS-Globals is used at compile time:
ampiCC -o example example.C -tlsglobals
Table
1
shows portability of different schemes.
3 . 4 Extensions for Migrations
For MPI chunks to migrate, we have added a few calls to AMPI . These include ability to register thread-specific data with the run-time system, to pack all the thread's data, and to express willingness to migrate.
3 . 4 . 1 Registering Chunk data
When the AMPI runtime system decides that load imbalance exists within the
application, it will invoke one of its internal load balancing strategies,
which determines the new mapping of AMPI chunks so as to balance the load.
Then AMPI runtime has to pack up the chunk's state and move it to its new
home processor. AMPI packs up any internal data in use by the chunk,
including the thread's stack in use. This means that the local variables
declared in subroutines in a chunk, which are created on stack, are
automatically packed up by the AMPI runtime system. However, it has no way
of knowing what other data are in use by the chunk. Thus upon starting
execution, a chunk needs to notify the system about the data that it is going
to use (apart from local variables.) Even with the data registration, AMPI
cannot determine what size the data is, or whether the registered data contains
pointers to other places in memory. For this purpose, a packing subroutine also
needs to be provided to the AMPI runtime system along with registered data.
(See next section for writing packing subroutines.) The call provided by
AMPI for doing this is
MPI_Register
. This function takes two
arguments: A data item to be transported alongwith the chunk, and the pack
subroutine, and returns an integer denoting the registration identifier. In
C/C++ programs, it may be necessary to use this return value after migration
completes and control returns to the chunk, using function
MPI_Get_userdata
. Therefore, the return value should be stored in a
local variable.
3 . 4 . 2 Migration
The AMPI runtime system could detect load imbalance by itself and invoke the load balancing strategy. However, since the application code is going to pack/unpack the chunk's data, writing the pack subroutine will be complicated if migrations occur at a stage unknown to the application. For example, if the system decides to migrate a chunk while it is in initialization stage (say, reading input files), application code will have to keep track of how much data it has read, what files are open etc. Typically, since initialization occurs only once in the beginning, load imbalance at that stage would not matter much. Therefore, we want the demand to perform load balance check to be initiated by the application.
AMPI provides a subroutine
MPI_Migrate
for this purpose. Each
chunk periodically calls
MPI_Migrate
. Typical CSE applications are
iterative and perform multiple time-steps. One should call
MPI_Migrate
in each chunk at the end of some fixed number of
timesteps. The frequency of
MPI_Migrate
should be determined by a
tradeoff between conflicting factors such as the load balancing overhead, and
performance degradation caused by load imbalance. In some other applications,
where application suspects that load imbalance may have occurred, as in the
case of adaptive mesh refinement; it would be more effective if it performs a
couple of timesteps before telling the system to re-map chunks. This will give
the AMPI runtime system some time to collect the new load and communication
statistics upon which it bases its migration decisions. Note that
MPI_Migrate
does NOT tell the system to migrate the chunk, but
merely tells the system to check the load balance after all the chunks call
MPI_Migrate
. To migrate the chunk or not is decided only by the
system's load balancing strategy.
3 . 4 . 3 Packing/Unpacking Thread Data
Once the AMPI runtime system decides which chunks to send to which
processors, it calls the specified pack subroutine for that chunk, with the
chunk-specific data that was registered with the system using
MPI_Register
. This section explains how a subroutine should be
written for performing pack/unpack.
There are three steps for transporting the chunk's data to other processor. First, the system calls a subroutine to get the size of the buffer required to pack the chunk's data. This is called the ``sizing'' step. In the next step, which is called immediately afterward on the source processor, the system allocates the required buffer and calls the subroutine to pack the chunk's data into that buffer. This is called the ``packing'' step. This packed data is then sent as a message to the destination processor, where first a chunk is created (along with the thread) and a subroutine is called to unpack the chunk's data from the buffer. This is called the ``unpacking'' step.
Though the above description mentions three subroutines called by the AMPI runtime system, it is possible to actually write a single subroutine that will perform all the three tasks. This is achieved using something we call a ``pupper''. A pupper is an external subroutine that is passed to the chunk's pack-unpack-sizing subroutine, and this subroutine, when called in different phases performs different tasks. An example will make this clear:
Suppose the chunk data is defined as a user-defined type in Fortran 90:
!FORTRAN EXAMPLE
MODULE chunkmod
TYPE, PUBLIC :: chunk
INTEGER , parameter :: nx=4, ny=4, tchunks=16
REAL(KIND=8) t(22,22)
INTEGER xidx, yidx
REAL(KIND=8), dimension(400):: bxm, bxp, bym, byp
END TYPE chunk
END MODULE
//C Example
struct chunk{
double t;
int xidx, yidx;
double bxm,bxp,bym,byp;
};
Then the pack-unpack subroutine
chunkpup
for this chunk module is
written as:
!FORTRAN EXAMPLE
SUBROUTINE chunkpup(p, c)
USE pupmod
USE chunkmod
IMPLICIT NONE
INTEGER :: p
TYPE(chunk) :: c
call pup(p, c%t)
call pup(p, c%xidx)
call pup(p, c%yidx)
call pup(p, c%bxm)
call pup(p, c%bxp)
call pup(p, c%bym)
call pup(p, c%byp)
end subroutine
//C Example
void chunkpup(pup_er p, struct chunk c){
pup_double(p,c.t);
pup_int(p,c.xidx);
pup_int(p,c.yidx);
pup_double(p,c.bxm);
pup_double(p,c.bxp);
pup_double(p,c.bym);
pup_double(p,c.byp);
}
There are several things to note in this example. First, the same subroutine
pup
(declared in module
pupmod
) is called to size/pack/unpack
any type of data. This is possible because of procedure overloading possible in
Fortran 90. Second is the integer argument
p
. It is this argument that
specifies whether this invocation of subroutine
chunkpup
is sizing,
packing or unpacking. Third, the integer parameters declared in the type
chunk
need not be packed or unpacked since they are guaranteed to be
constants and thus available on any processor.
A few other functions are provided in module
pupmod
. These functions
provide more control over the packing/unpacking process. Suppose one modifies
the
chunk
type to include allocatable data or pointers that are
allocated dynamically at runtime. In this case, when the chunk is packed, these
allocated data structures should be deallocated after copying them to buffers,
and when the chunk is unpacked, these data structures should be allocated
before copying them from the buffers. For this purpose, one needs to know
whether the invocation of
chunkpup
is a packing one or unpacking one.
For this purpose, the
pupmod
module provides functions
fpup_isdeleting
(
fpup_isunpacking
). These functions return logical value
.TRUE.
if the invocation is for packing (unpacking), and
.FALSE.
otherwise. Following example demonstrates this:
Suppose the type
dchunk
is declared as:
!FORTRAN EXAMPLE
MODULE dchunkmod
TYPE, PUBLIC :: dchunk
INTEGER :: asize
REAL(KIND=8), pointer :: xarr(:), yarr(:)
END TYPE dchunk
END MODULE
//C Example
struct dchunk{
int asize;
double* xarr, *yarr;
};
Then the pack-unpack subroutine is written as:
!FORTRAN EXAMPLE
SUBROUTINE dchunkpup(p, c)
USE pupmod
USE dchunkmod
IMPLICIT NONE
INTEGER :: p
TYPE(dchunk) :: c
pup(p, c%asize)
IF (fpup_isunpacking(p)) THEN !! if invocation is for unpacking
allocate(c%xarr(asize))
ALLOCATE(c%yarr(asize))
ENDIF
pup(p, c%xarr)
pup(p, c%yarr)
IF (fpup_isdeleting(p)) THEN !! if invocation is for packing
DEALLOCATE(c%xarr(asize))
DEALLOCATE(c%yarr(asize))
ENDIF
END SUBROUTINE
//C Example
void dchunkpup(pup_er p, struct dchunk c){
pup_int(p,c.asize);
if(pup_isUnpacking(p)){
c.xarr = (double *)malloc(sizeof(double)*c.asize);
c.yarr = (double *)malloc(sizeof(double)*c.asize);
}
pup_doubles(p,c.xarr,c.asize);
pup_doubles(p,c.yarr,c.asize);
if(pup_isPacking(p)){
free(c.xarr);
free(c.yarr);
}
}
One more function
fpup_issizing
is also available in module
pupmod
that returns
.TRUE.
when the invocation is a sizing one. In practice one
almost never needs to use it.
3 . 5 Extensions for Checkpointing
The pack-unpack subroutines written for migrations make sure that the current state of the program is correctly packed (serialized) so that it can be restarted on a different processor. Using the same subroutines, it is also possible to save the state of the program to disk, so that if the program were to crash abruptly, or if the allocated time for the program expires before completing execution, the program can be restarted from the previously checkpointed state. Thus, the pack-unpack subroutines act as the key facility for checkpointing in addition to their usual role for migration.
A subroutine for checkpoint purpose has been added to AMPI:
void MPI_Checkpoint(char *dirname);
This subroutine takes a directory name as its argument. It is a collective
function, meaning every virtual processor in the program needs to call this
subroutine and specify the same directory name. (Typically, in an
iterative AMPI program, the iteration number, converted to a character string,
can serve as a checkpoint directory name.) This directory is created, and the
entire state of the program is checkpointed to this directory. One can restart
the program from the checkpointed state by specifying
"+restart
dirname"
on the command-line. This capability is powered by the Charm++
runtime system. For more information about Charm++ checkpoint/restart
mechanism please refer to Charm++ manual.
3 . 6 Extensions for Memory Efficiency
MPI functions usually require the user to preallocate the data buffers needed before the functions being called. For unblocking communication primitives, sometimes the user would like to do lazy memory allocation until the data actually arrives, which gives the oppotunities to write more memory efficient programs. We provide a set of AMPI functions as an extension to the standard MPI-2 one-sided calls, where we provide a split phase MPI_Get called MPI_IGet. MPI_IGet preserves the similar semantics as MPI_Get except that no user buffer is provided to hold incoming data. MPI_IGet_Wait will block until the requested data arrives and runtime system takes care to allocate space, do appropriate unpacking based on data type, and return. MPI_IGet_Free lets the runtime system free the resources being used for this get request including the data buffer. And MPI_IGet_Data is the utility program that returns the actual data.
int MPI_IGet(MPI_Aint orgdisp, int orgcnt, MPI_Datatype orgtype, int rank,
MPI_Aint targdisp, int targcnt, MPI_Datatype targtype, MPI_Win win,
MPI_Request *request);
int MPI_IGet_Wait(MPI_Request *request, MPI_Status *status, MPI_Win win);
int MPI_IGet_Free(MPI_Request *request, MPI_Status *status, MPI_Win win);
char* MPI_IGet_Data(MPI_Status status);
3 . 7 Extensions for Interoperability
Interoperability between different modules is essential for coding coupled
simulations. In this extension to AMPI , each MPI application module runs
within its own group of user-level threads distributed over the physical
parallel machine. In order to let AMPI know which chunks are to be created,
and in what order, a top level registration routine needs to be written. A
real-world example will make this clear. We have an MPI code for fluids and
another MPI code for solids, both with their main programs, then we first
transform each individual code to run correctly under AMPI as standalone
codes. This involves the usual ``chunkification'' transformation so that
multiple chunks from the application can run on the same processor without
overwriting each other's data. This also involves making the main program into
a subroutine and naming it
MPI_Main
.
Thus now, we have two
MPI_Main
s, one for the fluids code and one for
the solids code. We now make these codes co-exist within the same executable,
by first renaming these
MPI_Main
s as
Fluids_Main
and
Solids_Main
6
writing a
subroutine called
MPI_Setup
.
!FORTRAN EXAMPLE
SUBROUTINE MPI_Setup
USE ampi
CALL MPI_Register_main(Solids_Main)
CALL MPI_Register_main(Fluids_Main)
END SUBROUTINE
//C Example
void MPI_Setup(){
MPI_Register_main(Solids_Main);
MPI_Register_main(Fluids_Main);
}
This subroutine is called from the internal initialization routines of AMPI and tells AMPI how many number of distinct chunk types (modules) exist, and which orchestrator subroutines they execute.
The number of chunks to create for each chunk type is specified on the command
line when an AMPI program is run. Appendix B explains how AMPI programs
are run, and how to specify the number of chunks (
+vp
option). In the
above case, suppose one wants to create 128 chunks of Solids and 64 chunks of
Fluids on 32 physical processors, one would specify those with multiple
+vp
options on the command line as:
> charmrun gen1.x +p 32 +vp 128 +vp 64
This will ensure that multiple chunk types representing different complete
applications can co-exist within the same executable. They can also continue to
communicate among their own chunk-types using the same AMPI function calls
to send and receive with communicator argument as
MPI_COMM_WORLD
.
But this would be completely useless if these individual applications cannot
communicate with each other, which is essential for building efficient coupled
codes. For this purpose, we have extended the AMPI functionality to allow
multiple ``
COMM_WORLD
s''; one for each application. These
world
communicators
form a ``communicator universe'': an array of communicators
aptly called
MPI_COMM_UNIVERSE
. This array of communicators is
indexed [1 . . .
MPI_MAX_COMM
]. In the current implementation,
MPI_MAX_COMM
is 8, that is, maximum of 8 applications can co-exist
within the same executable.
The order of these
COMM_WORLD
s within
MPI_COMM_UNIVERSE
is determined by the order in which individual applications are registered in
MPI_Setup
.
Thus, in the above example, the communicator for the Solids module would be
MPI_COMM_UNIVERSE(1)
and communicator for Fluids module would be
MPI_COMM_UNIVERSE(2)
.
Now any chunk within one application can communicate with any chunk in the other application using the familiar send or receive AMPI calls by specifying the appropriate communicator and the chunk number within that communicator in the call. For example if a Solids chunk number 36 wants to send data to chunk number 47 within the Fluids module, it calls:
!FORTRAN EXAMPLE
INTEGER , PARAMETER :: Fluids_Comm = 2
CALL MPI_Send(InitialTime, 1, MPI_Double_Precision, tag,
47, MPI_Comm_Universe(Fluids_Comm)
, ierr)
//C Example
int Fluids_Comm = 2;
ierr = MPI_Send(InitialTime, 1, MPI_DOUBLE, tag,
47, MPI_Comm_Universe(Fluids_Comm)
);
The Fluids chunk has to issue a corresponding receive call to receive this data:
!FORTRAN EXAMPLE
INTEGER , PARAMETER :: Solids_Comm = 1
CALL MPI_Recv(InitialTime, 1, MPI_Double_Precision, tag,
36, MPI_Comm_Universe(Solids_Comm)
, stat, ierr)
//C Example
int Solids_Comm = 1;
ierr = MPI_Recv(InitialTime, 1, MPI_DOUBLE, tag,
36, MPI_Comm_Universe(Solids_Comm)
, &stat);
3 . 8 Extensions for Sequential Re-run of a Parallel Node
In some scenarios, a sequential re-run of a parallel node is desired. One example is instruction-level accurate architecture simulations, in which case the user may wish to repeat the execution of a node in a parallel run in the sequential simulator. AMPI provides support for such needs by logging the change in the MPI environment on a certain processors. To activate the feature, build AMPI module with variable ``AMPIMSGLOG'' defined, like the following command in charm directory. (Linking with zlib ``-lz'' might be required with this, for generating compressed log file.)
> ./build AMPI net-linux -DAMPIMSGLOG
The feature is used in two phases: writing (logging) the environment and repeating the run. The first logging phase is invoked by a parallel run of the AMPI program with some additional command line options.
> ./charmrun ./pgm +p4 +vp4 +msgLogWrite +msgLogRank 2 +msgLogFilename "msg2.log"
In the above example, a parallel run with 4 processors and 4 VPs will be executed, and the changes in the MPI environment of processor 2 (also VP 2, starting from 0) will get logged into diskfile "msg2.log".
Unlike the first run, the re-run is a sequential program, so it is not invoked by charmrun (and omitting charmrun options like +p4 and +vp4), and additional comamnd line options are required as well.
> ./pgm +msgLogRead +msgLogRank 2 +msgLogFilename "msg2.log"
3 . 9 Communication Optimizations for AMPI
AMPI is powered by the Charm++ communication optimization support now! Currently the user needs to specify the communication pattern by command line option. In the future this can be done automatically by the system.Currently there are four strategies available: USE_DIRECT, USE_MESH, USE_HYPERCUBE and USE_GRID. USE_DIRECT sends the message directly. USE_MESH imposes a 2d Mesh virtual topology on the processors so each processor sends messages to its neighbors in its row and column of the mesh which forward the messages to their correct destinations. USE_HYPERCUBE and USE_GRID impose a hypercube and a 3d Grid topologies on the processors. USE_HYPERCUBE will do best for very small messages and small number of processors, 3d has better performance for slightly higher message sizes and then Mesh starts performing best. The programmer is encouraged to try out all the strategies. (Stolen from the CommLib manual by Sameer :)
For more details please refer to the CommLib paper 7 .
Specifying the strategy is as simple as a command line option +strategy. For example:
> ./charmrun +p64 alltoall +vp64 1000 100 +strategy USE_MESH
tells the system to use MESH strategy for CommLib. By default USE_DIRECT is
used.
3 . 10 User Defined Initial Mapping
You can define the initial mapping of virtual processors (vp) to physical processors (p) as a runtime option. You can choose from predefined initial mappings or define your own mappings. Following predefined mappings are available:
- Round Robin
-
This mapping scheme, maps virtual processor to physical processor in round-robin fashion, i.e. if there are 8 virtual processors and 2 physical processors then virtual processors indexed 0,2,4,6 will be mapped to physical processor 0 and virtual processors indexed 1,3,5,7 will be mapped to physical processor 1.
> ./charmrun ./hello +p2 +vp8 +mapping RR_MAP - Block Mapping
-
This mapping scheme, maps virtual processors to physical processor in chunks, i.e. if there are 8 virtual processors and 2 physical processors then virtual processors indexed 0,1,2,3 will be mapped to physical processor 0 and virtual processors indexed 4,5,6,7 will be mapped to physical processor 1.
> ./charmrun ./hello +p2 +vp8 +mapping BLOCK_MAP - Proportional Mapping
-
This scheme takes the processing capability of physical processors into account for mapping virtual processors to physical processors, i.e. if there are 2 processors with different processing power, then number of virtual processors mapped to processors will be in proportion to their processing power.
> ./charmrun ./hello +p2 +vp8 +mapping PROP_MAP > ./charmrun ./hello +p2 +vp8
If you want to define your own mapping scheme, please contact us for help.
3 . 11 Compiling AMPI Programs
Charm++ provides a cross-platform compile-and-link script called
charmc
to compile C, C++ , Fortran, Charm++ and AMPI programs. This script
resides in the
bin
subdirectory in the Charm++ installation
directory. The main purpose of this script is to deal with the differences of
various compiler names and command-line options across various machines on
which Charm++ runs. While,
charmc
handles C and C++ compiler
differences most of the time, the support for Fortran 90 is new, and may have
bugs. But Charm++ developers are aware of this problem and are working to
fix them. Even in its alpha stage of Fortran 90 support,
charmc
still
handles many of the compiler differences across many machines, and it is
recommended that
charmc
be used to compile and linking AMPI programs. One
major advantage of using
charmc
is that one does not have to specify which
libraries are to be linked for ensuring that C++ and Fortran 90 codes are
linked correctly together. Appropriate libraries required for linking such
modules together are known to
charmc
for various machines.
In spite of the platform-neutral syntax of
charmc
, one may have to specify
some platform-specific options for compiling and building AMPI codes.
Fortunately, if
charmc
does not recognize any particular options on its
command line, it promptly passes it to all the individual compilers and linkers
it invokes to compile the program.
A. Installing AMPI
AMPI is included in the source distribution of Charm++ . To get the latest sources from PPL, visit: http://charm.cs.uiuc.edu/
and follow the download link. Now one has to build Charm++ and AMPI from source.
The build script for Charm++ is called
build
. The syntax for this
script is:
> build <target> <version> <opts>
For building AMPI (which also includes building Charm++ and other
libraries needed by AMPI ), specify
<target>
to be
AMPI
. And
<opts>
are command line options passed to the
charmc
compile
script. Common compile time options such as
-g, -O, -Ipath, -Lpath,
-llib
are accepted.
To build a debugging version of AMPI , use the option: ``
-g
''.
To build a production version of AMPI , use the options: ``
-O
-DCMK_OPTIMIZE=1
''.
<version>
depends on the machine, operating system, and the underlying
communication library one wants to use for running AMPI programs.
See the charm/README file for details on picking the proper version.
Following is an example of how to build AMPI under linux and ethernet
environment, with debugging info produced:
> build AMPI net-linux -g
B. Building and Running AMPI Programs
B.. 1 Building
Charm++ provides a compiler called charmc in your charm/bin/ directory. You can use this compiler to build your AMPI program the same way as other compilers like cc. Especially, to build an AMPI program, a command line option -language ampi should be applied. All the command line flags that you would use for other compilers can be used with charmc the same way. For example:
> charmc -language ampi -c pgm.c -O3
> charmc -language ampi -o pgm pgm.o -lm -O3
Shortcuts to the AMPI compiler are provided. If you have added charm/bin into your $PATH environment variable, simply type mpicc, mpiCC, mpif77, and mpif90 as provided by other MPI implementations.
> mpicc -c pgm.c -g
B.. 2 Running
Charm++ distribution contains a script called
charmrun
that makes
the job of running AMPI programs portable and easier across all parallel
machines supported by Charm++ .
charmrun
is copied to a directory
where an AMPI prgram is built using
charmc
. It takes a command line
parameter specifying number of processors, and the name of the program followed
by AMPI options (such as number of chunks to create, and the stack size of
every chunk) and the program arguments. A typical invocation of AMPI program
pgm
with
charmrun
is:
> charmrun pgm +p16 +vp32 +tcharm_stacksize 3276800
Here, the AMPI program
pgm
is run on 16 physical processors with
32 chunks (which will be mapped 2 per processor initially), where each
user-level thread associated with a chunk has the stack size of 3,276,800 bytes.
Footnotes
- ... subset 1
- Currently, 110 MPI-1.1 Standard functions have been implemented.
- ... subroutine 2
- Like many software engineering terms, this term is overused, and unfortunately clashes with Fortran 90 module that denotes a program unit. We specifically refer to the later as ``Fortran 90 module'' to avoid confusion.
- ... ROMIO 3
- http://www-unix.mcs.anl.gov/romio/
- ... variables. 4
-
Typical Fortran MPI programs
contain three types of global variables.
-
Global variables that are ``read-only''. These are either
parameters
that are set at compile-time. Or other variables that are
read as input or set at the beginning of the program and do not change during
execution. It is not necessary to privatize such variables.
-
Global variables that are used as temporary buffers. These are variables
that are used temporarily to store values to be accessible across subroutines.
These variables have a characteristic that there is no blocking call such as
MPI_recvbetween the time the variable is set and the time it is ever used. It is not necessary to privatize such variables either. -
True global variables. These are used across subroutines that contain
blocking receives and therefore the possibility of a context switch between the
definition and use of the variable. These variables need to be privatized.
-
Global variables that are ``read-only''. These are either
parameters
that are set at compile-time. Or other variables that are
read as input or set at the beginning of the program and do not change during
execution. It is not necessary to privatize such variables.
- ...Photran 5
- http://www.eclipse.org/photran
- ...Solids_Main 6
- Currently, we assume that the interface code, which does mapping and interpolation among the boundary values of Fluids and Solids domain, is integrated with one of Fluids and Solids.
- ... paper 7
- L. V. Kale and Sameer Kumar and Krishnan Vardarajan, 2002. http://finesse.cs.uiuc.edu/papers/CommLib.pdf