Converse and Charm++
Libraries Manual


1. Introduction

This manual describes the Charm++ and Converse libraries. It is a work in progress towards a standard library for parallel programming on top of the Converse and Charm++ system. All of these libraries are included in the source and binary distributions of Charm++/Converse.

2. liveViz Library

2.1 Introduction

If array elements compute a small piece of a large 2D image, then these image chunks can be combined across processors to form one large image using the liveViz library. In other words, liveViz provides a way to reduce 2D image data, combining small chunks of images deposited by chares into one large image. This visualization library follows the client-server model. The server, a parallel Charm++ program, does all image assembly and opens a network (CCS) socket which clients use to request and download images. The client is a small Java program. A typical use is:

   cd charm/examples/charm++/lvServer
   make
   ./charmrun ./lvServer +p2 ++server ++server-port 1234
   ~/ccs_tools/bin/lvClient localhost 1234

Before using lvClient, use git to obtain a copy of ccs_tools and build it:

   cd ccs_tools
   ant

2.2 How to use liveViz with a Charm++ program

The liveViz routines are in the Charm++ header ``liveViz.h''. A typical program provides a chare array with one entry method with the following prototype:


   entry void functionName(liveVizRequestMsg *m);

This entry method is supposed to deposit its (array element's) chunk of the image. It has the following structure:


   void myArray::functionName(liveVizRequestMsg *m)
   {
     // prepare image chunk
     ...

     liveVizDeposit(m, startX, startY, width, height, imageBuff, this);

     // delete image buffer if it was dynamically allocated
   }

Here, ``width'' and ``height'' are the size, in pixels, of this array element's portion of the image, contributed in ``imageBuff'' (described below). This will show up on the client's assembled image at 0-based pixel (startX,startY). The client's display width and height are stored in m->req.wid and m->req.ht.

By default, liveViz combines image chunks by performing a saturating sum of overlapping pixel values. If you instead want liveViz to combine image chunks using max (i.e., for overlapping pixels in deposited image chunks, the final image will contain the pixel with the highest intensity, in other words the largest value), you need to pass one more parameter (of type liveVizCombine_t) to the ``liveVizDeposit'' function:


  liveVizDeposit (m, startX, startY, width, height, imageBuff, this, 
                   max_image_data);

You can also reduce floating-point image data using sum_float_image_data or max_float_image_data.

2.3 Format of the deposited image

``imageBuff'' is a run of bytes representing a rectangular portion of the image. The buffer represents the image in row-major format, so 0-based pixel (x,y) (x increasing to the right, y increasing downward in typical graphics fashion) is stored at array offset ``x+y*width''.

If the image is gray-scale (as determined by liveVizConfig, below), each pixel is represented by one byte. If the image is color, each pixel is represented by 3 consecutive bytes representing red, green, and blue intensity.

If the image is floating-point, each pixel is represented by a single `float', and after assembly the image is colorized by calling the user-provided routine below. This routine converts fully assembled `float' pixels to RGB 3-byte pixels, and is called only on processor 0 after each client request.


 extern "C"

void liveVizFloatToRGB(liveVizRequest &req, 
        const float *floatSrc, unsigned char *destRgb,
        int nPixels);
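
For example, a simple colorizer might map each float to a gray value. The following is a minimal sketch; the clamping range [0,1] is an assumption, and real code would scale to the data's actual range:

 extern "C"
 void liveVizFloatToRGB(liveVizRequest &req,
                        const float *floatSrc, unsigned char *destRgb,
                        int nPixels)
 {
   for (int i = 0; i < nPixels; i++) {
     // Clamp to [0,1] (assumed data range), then map to an 8-bit gray value
     float v = floatSrc[i];
     if (v < 0.0f) v = 0.0f;
     if (v > 1.0f) v = 1.0f;
     unsigned char gray = (unsigned char)(255.0f * v);
     destRgb[3*i + 0] = gray;  // red
     destRgb[3*i + 1] = gray;  // green
     destRgb[3*i + 2] = gray;  // blue
   }
 }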

2.4 liveViz Initialization

The liveViz library must be initialized before it can be used for visualization. To initialize it, follow these steps from your main chare:

  1. Create your chare array (array proxy object 'a') with the entry method 'functionName' (described above). You must create the chare array using a CkArrayOptions 'opts' parameter. For instance,

    
      CkArrayOptions opts(rows, cols);
      array = CProxy_Type::ckNew(opts);
    
    
  2. Create a CkCallback object ('c'), specifying 'functionName' as the callback function. This callback will be invoked whenever the client requests a new image.
  3. Create a liveVizConfig object ('cfg'). liveVizConfig takes a number of parameters, as described below.
  4. Call liveVizInit (cfg, a, c, opts).

The liveVizConfig parameters select the image format and delivery mode: the first constructor argument chooses color (true, 3 bytes/pixel) versus grayscale (false, 1 byte/pixel) images, and the second selects server-push mode (true) versus the default request-driven, non-push mode (false).

A typical 2D, RGB, non-push call to liveVizConfig looks like this:


    liveVizConfig cfg(true,false);
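
Putting the four steps together, initialization from the main chare might look like the following sketch, where 'Type' and 'functionName' are the placeholder names used above:

   CkArrayOptions opts(rows, cols);
   CkArrayID a = CProxy_Type::ckNew(opts);

   // Broadcast functionName to the array whenever a client requests an image
   CkCallback c(CkIndex_Type::functionName(0), a);

   liveVizConfig cfg(true, false);  // 2D, RGB, non-push
   liveVizInit(cfg, a, c, opts);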

2.5 Compilation

A Charm++ program that uses liveViz must be linked with '-module liveViz'.

Before compiling a liveViz program, the liveViz library itself may need to be compiled; this is done in the same manner as the other Charm++ libraries (from your-charm-dir/tmp, as described for the FFT library below).
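
For example, a link line might look like the following (the program and object file names are placeholders):

   charmc -o lvServer lvServer.o -module liveViz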

2.6 Poll Mode

In some cases you may want a server to deposit images only when it is ready to do so. In this case the server does not register a callback function that triggers image generation; rather, the server deposits an image at its convenience. For example, a server may want to create a movie or a series of images corresponding to timesteps in a simulation. The server has a timestep loop in which an array computes some data for each timestep, and at the end of each iteration the server deposits the image. LiveViz's Poll Mode supports this style of server-driven image generation.

Poll Mode differs from the standard mode in a few significant ways. First we describe the use of Poll Mode, and then we describe the differences. liveVizPoll must get control during the creation of your array, so you call liveVizPollInit, which takes no parameters, before creating the array:


   liveVizPollInit();
   CkArrayOptions opts(nChares);
   arr = CProxy_lvServer::ckNew(opts);

To deposit an image, the server simply calls liveVizPollDeposit. The server must take care not to generate too many images before a client requests them: each server-generated image is buffered until the client can get it, and the buffered images are stored in memory on processor 0.


   liveVizPollDeposit( this,
                      startX,startY,            // Location of local piece
                      localSizeX,localSizeY,    // Dimensions of the piece I'm depositing
                      globalSizeX,globalSizeY,  // Dimensions of the entire image
                      img,                      // Image byte array
                      sum_image_data,           // Desired image combiner
                      3                         // Bytes/pixel
                    );

The last two parameters are optional. By default they are set to sum_image_data and 3 bytes per pixel.
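
As a sketch of the overall pattern (the entry method and variable names below are hypothetical), a server's timestep loop might deposit one image per iteration:

  void lvServer::iterate() {
    computeTimestep();  // application work for this step

    // Deposit this element's piece of the image for the current step;
    // the combiner and bytes/pixel arguments are left at their defaults.
    liveVizPollDeposit(this,
                       startX, startY,
                       localSizeX, localSizeY,
                       globalSizeX, globalSizeY,
                       img);

    if (++step < nSteps) thisProxy[thisIndex].iterate();
  }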

A sample liveVizPoll server and client are available at:


   .../charm/examples/charm++/lvServer
   .../charm/java/bin/lvClient

This example server uses a PythonCCS command to cause an image to be generated by the server. The client also then gets the image.

LiveViz provides multiple image combiner types. Any supported type can be used as a parameter to liveVizPollDeposit. Valid combiners include: sum_float_image_data, max_float_image_data, sum_image_data, and max_image_data.

The differences in Poll Mode should now be apparent. There is no callback function that causes the server to generate and deposit an image. Furthermore, a server may generate an image before or after a client has sent a request. The deposit function, therefore, is more complicated: the server must specify information about the image it is generating. The client no longer specifies the desired size or other configuration options, since the server may generate the image before the client's request is available to the server. Accordingly, the liveVizPollInit call takes no parameters.

The server should call liveVizPollDeposit with the same global size and combiner type on all of the array elements (the ``this'' parameter identifies the depositing element).

The latest version of liveVizPoll is not backwards compatible with older versions. The old version had some fundamental problems which would occur if a server generated an image before a client requested it; the new version therefore buffers server-generated images until they are requested by a client. Client requests are likewise buffered if they arrive before the server generates the images. The old version could also encounter problems during migration.

2.7 Caveats

If you use the old version of the ``liveVizInit'' method that receives only 3 parameters, you may encounter a known bug caused by how ``liveVizDeposit'' internally uses a reduction to build the image.

With that version of the ``liveVizInit'' method, the contribute call is handled as if the chare calling ``liveVizDeposit'' had itself contributed to the liveViz reduction. If there is any other reduction going on elsewhere in this chare, some liveViz contribute calls might be issued before the corresponding non-liveViz contribute is reached. The image data would then be treated as if it were part of the non-liveViz reduction, leading to unexpected behavior potentially anywhere in the non-liveViz code.

3. Multi-phase Shared Arrays Library

The Multiphase Shared Arrays (MSA) library provides a specialized shared-memory abstraction in Charm++ with automatic memory management. Explicitly shared memory offers the convenience of shared-memory programming while exposing the performance issues to programmers and the ``intelligent'' ARTS.

Each MSA is accessed in one specific mode during each phase of execution: read-only mode, in which any thread can read any element of the array; write-once mode, in which each element of the array is written to (possibly multiple times) by at most one worker thread and no reads are allowed; and accumulate mode, in which any thread can add values to any array element and no reads or writes are permitted. A sync call denotes the end of a phase.

We permit multiple copies of a page of data on different processors and provide automatic fetching and caching of remote data. For example, initially an array might be put in write-once mode while it is populated with data from a file. This determines the cache behavior and the permitted operations on the array during this phase. write-once means every thread can write to a different element of the array. The user is responsible for ensuring that two threads do not write to the same element; the system helps by detecting violations. From the cache maintenance viewpoint, each page of the data can be over-written on its owning processor without worrying about transferring ownership or maintaining coherence. At the sync, the data is simply merged.

Subsequently, the array may be read-only for a while, thereafter data might be accumulate'd into it, followed by a return to read-only mode. In the accumulate phase, each local copy of the page on each processor can have its accumulations tracked independently without maintaining page coherence, and the results are combined at the end of the phase. The accumulate operations also include set-theoretic union operations; i.e., appending items to a set of objects is also a valid accumulate operation. User-level or compiler-inserted explicit prefetch calls can be used to improve performance.

A software engineering benefit that accrues from the explicitly shared memory programming paradigm is the (relative) ease and simplicity of programming. No complex, buggy data-distribution and messaging calculations are required to access data.

To use MSA in a Charm++ program, include the MSA header ``msa/msa.h'' and link with '-module msa'.

For the details of the API, see the example programs in charm/pgms/charm++/multiphaseSharedArrays.
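
As an illustrative sketch only (the template arguments and method names below follow the example programs and the MSA headers, and may differ across Charm++ versions), a 1D array used in accumulate mode might look like this:

  #include "msa/msa.h"

  // A shared 1D histogram of doubles; DefaultEntry and the page-size
  // constant are assumed to come from the MSA headers.
  typedef MSA1D<double, DefaultEntry<double>,
                MSA_DEFAULT_ENTRIES_PER_PAGE> HistArray;

  // Inside each worker thread ('hist', 'numWorkers', 'nSamples', and
  // 'bin' are placeholders):
  hist.enroll(numWorkers);         // all workers join the array
  for (int i = 0; i < nSamples; i++)
    hist.accumulate(bin(i), 1.0);  // accumulate phase: add to any element
  hist.sync();                     // phase boundary
  double h0 = hist.get(0);         // subsequent read-only phase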

4. 3D FFT Library

4.1 Introduction and Motivation

The 3D FFT library provides an interface for doing parallel FFT computation in a scalable fashion. The parallelization is achieved by splitting the 3D transform into multiple phases, and there are two ways of doing the splitting. One is dividing the data space (over which the FFT is to be performed) into a set of slabs (figure 1), where each slab is essentially a collection of planes. First, 2D FFTs are done over the planes in each slab; then a distributed 'transform' sends the data to its destination so that the FFT in the third direction can be performed. This approach takes two computation phases and one 'transform' phase. The second way of splitting is dividing the data into collections of pencils. First, 1D FFTs are computed over the pencils; then a 'transform' is performed and 1D FFTs are done over the second dimension; finally another 'transform' is performed and FFTs are computed over the last dimension. This approach takes three computation phases and two 'transform' phases. In the first approach, the parallelism is limited by the number of planes, while in the second it is limited by the number of pencils. The second approach therefore provides finer-grained parallelism and can perform better when the number of processing units is larger than the number of planes.

4.2 Compilation and Execution

To install the FFT library, you need Charm++ installed on your system; follow the Charm++ manual to do that. You also need FFTW (version 2.1.5) installed, which can be downloaded from http://www.fftw.org. The FFT library source is at your-charm-dir/src/libs/ck-libs/fftlib. Before building the library, make sure that the path for the FFTW library is consistent with your FFTW installation. Then cd to your-charm-dir/tmp and do 'make fftlib'. To compile a program using fftlib, pass the '-lfftlib -L(your-fftwlib-dir) -lfftw' flags to charmc.
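
For example, the build and link steps might look like this (using the document's placeholder paths, with a placeholder program name):

   cd your-charm-dir/tmp
   make fftlib
   charmc -o fftdemo fftdemo.o -lfftlib -L(your-fftwlib-dir) -lfftw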

4.3 Library Initialization and Data Format

To initialize the library, the user constructs a data struct and passes it to the library. For the plane-based version, the struct is called NormalFFTinfo, and its constructor is defined as:

         NormalFFTinfo(CProxy_NormalSlabArray &srcProxy, CProxy_NormalSlabArray &destProxy,
                      int srcDim[2], int destDim[2], int isSrcArray, complex *dataptr, 
                      int srcPlanesPerSlab=1, int destPlanesPerSlab=1)

        Where: 
        CProxy_NormalSlabArray &srcProxy : proxy for source charm array 
        CProxy_NormalSlabArray &destProxy : proxy for destination charm array 
        int srcDim[2] : FFT plane data dimension at source array (*)
        int destDim[2] : FFT plane data dimension at destination array (srcDim[1] must equal destDim[0].) (*)
        int isSrcArray : whether this array is source (1) or destination (0)
        complex *dataptr : pointer to FFT data 
        int srcPlanesPerSlab : number of planes in each slab at source array, default value is 1 (**) 
        int destPlanesPerSlab : number of planes in each slab at destination array, default value is 1 (**)

          * Data layout : The multi-dimensional FFT data is assumed
            to be stored in a contiguous one-dimensional array in
            row-major order. For example, in the source array, the
            data consists of srcPlanesPerSlab planes, where each
            plane is srcDim[0] rows of srcDim[1] numbers. Similar
            rules apply to the destination side.


          ** Currently, srcPlanesPerSlab/destPlanesPerSlab has to be
             the same across all array elements.

          *** Total data size can be calculated by:
              srcPlanesPerSlab*srcDim[0]*srcDim[1] at source array, and
              destPlanesPerSlab*destDim[0]*destDim[1] at destination array
 

For the pencil-based version, the struct is called LineFFTinfo:


         LineFFTinfo(CProxy_NormalLineArray &xProxy, 
                    CProxy_NormalLineArray &yProxy, 
                    CProxy_NormalLineArray &zProxy, 
                    int size[3], int isSrcArray, complex *dataptr, 
                    int srcPencilsPerSlab=1, int destPencilsPerSlab=1) 

        where: 
        CProxy_NormalSlabArray &xProxy : proxy for first charm array 
        CProxy_NormalSlabArray &yProxy : proxy for second charm array 
        CProxy_NormalSlabArray &zProxy : proxy for third charm array 
        int size[3] : FFT plane data dimension (*)
        int isSrcArray : whether this array is source (1) or intermediate (2) or  destination (0)
        complex *dataptr : pointer to FFT data 
        int srcPencilsPerSlab : number of pencils in each slab at source array, default value is 1
        int destPencilsPerSlab : number of pencils in each slab at destination array, default value is 1

          * Data layout : pencils in the three arrays are of size
            size[0], size[1], and size[2], respectively. If there is
            more than one pencil per slab, the other dimension is the
            dimension of the pencils in the next array.


In both cases, data is deposited by passing in a pointer to the data field; the pointer is stored in 'complex *dataptr' in the struct. Memory allocation and initialization of the data field must be done by the user before the pointer is passed in; the library does not allocate any memory for the data field. Also note that the FFTs done internally by the library are in-place FFTs, which means that the data field will be overwritten with results.
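
For example, following the row-major layout described above, the source-side buffer can be allocated and indexed as follows (p, r, and c are a plane, row, and column index, respectively):

   // Total source-side size (see *** above)
   int n = srcPlanesPerSlab * srcDim[0] * srcDim[1];
   complex *dataptr = new complex[n];

   // Element at plane p, row r, column c:
   complex &e = dataptr[(p * srcDim[0] + r) * srcDim[1] + c];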

4.4 Library Interfaces

There are two types of interfaces provided by the library: Charm++ based and AMPI based.

4.4.1 Charm++ Interface

The Charm++ interface is the raw interface of the library; it is slightly more difficult to use but gives more flexibility. To use the Charm++-based library, users create their own charm arrays which derive from predefined arrays in the library. By overriding the default methods, users can add additional functionality.

For the plane-based library, there are several relevant member functions: 'doFFT', 'doIFFT', 'doneFFT' and 'doneIFFT'. 'doFFT' and 'doIFFT' are called to start the computation. 'doneFFT' and 'doneIFFT' are callback functions, and they need to be overridden in the derived class.

The sample code below should shed more light on this. For complete sample programs, refer to the files under your-charm-dir/pgms/charm++/fftdemo/.

In the sample code below, we illustrate how to use the plane-based library in 4 steps: initializing the data struct, creating the array elements, starting the computation, and finally ending the computation.

For initialization, a NormalFFTinfo struct is used. Keep in mind that data storage needs to be allocated and initialized by the user. Since the FFT is done in place, the user should also make duplicate copies of the data when needed.


     main::main(CkArgMsg *m)
    {
         ...
         /* Assume FFT of size  nx*ny*nz */
         int srcDim[] = {ny, nx}, destDim[] = {nx, nz};

         complex *plane =  new complex[nx*ny];
         ... // Initialize FFT data here 

         NormalFFTinfo src_fftinfo(srcArray, destArray, 
                                   srcDim, destDim, true, plane, 1, 1);

         ...
     }

Next step is to create the charm array:


      main::main(CkArgMsg *m)
     {
          ...
          /* Assume FFT of size  nx*ny*nz */
          int srcDim[] = {ny, nx}, destDim[] = {nx, nz};

          /* create the source array */
          srcArray = CProxy_SrcArray::ckNew();
          for (z = 0; z < nz; z++) {
              complex *plane = new complex[nx*ny];
              ... // Initialize FFT data here: data needs to be in x-major order

              NormalFFTinfo src_fftinfo(srcArray, destArray,
                                        srcDim, destDim, true, plane, 1, 1);

              // insert one plane object: this contains data of the x-y plane at this z coordinate
              srcArray(z).insert(src_fftinfo);
          }
          srcArray.doneInserting();

          /* destination array will be created in similar fashion */
          ...
      }

Next, we start the FFT computation by making a call to 'doFFT()'. 'doFFT(int id1, int id2)' takes two inputs: id1 defines the ID number of the source FFT, while id2 defines the ID number of the destination FFT. A similar method, 'doIFFT()', is used to invoke inverse FFTs. In this example, 3 FFTs are done simultaneously by invoking a 'doAllFFT()' method, defined as:


     void SrcArray::doAllFFT() {
         doFFT(0, 0);
         doFFT(1, 1);
         doFFT(2, 2);
     }

The last step is to retrieve the data at the destination side. For this purpose, an overriding implementation of 'doneFFT()' is defined below. 'doneFFT(int id)' takes the FFT ID number as input. For inverse FFTs, the relevant member function is 'doneIFFT()'.


     void destArray::doneFFT(int id) {
         count++;
         if (count == 3) {
             count = 0;
             /* A reduction is induced: this will call the predefined reduction
                client when all array elements finish their computation */
             contribute(sizeof(int), &count, CkReduction::sum_int);
         }
     }

Next we demonstrate the usage of the pencil-based library in similar steps.

First is the initialization of the data struct LineFFTinfo:


      main::main()  
    {
         ...
         /* Assume FFT of size  nx*ny*nz */
         int size[] = {nx, ny, nz};

         complex *pencil =  new complex[nx];
         ... /* Initialize FFT data here */ 

         LineFFTinfo fftinfo(xlinesProxy, ylinesProxy, zlinesProxy, size, true, pencil);
         ...
     }

Second is the creation of the arrays:


      main::main()
     {
          ...
          /* Assume FFT of size nx*ny*nz */
          int size[] = {nx, ny, nz};

          xlinesProxy = CProxy_myXLines::ckNew();
          for (z = 0; z < sizeZ; z++) {
              for (y = 0; y < sizeY; y += THICKNESS) {
                  complex *pencil = new complex[nx];
                  ... /* Initialize FFT data here */

                  LineFFTinfo fftinfo(xlinesProxy, ylinesProxy, zlinesProxy,
                                      size, true, pencil);
                  xlinesProxy(y, z).insert(fftinfo);
              }
          }
          xlinesProxy.doneInserting();

          /* ylinesProxy/zlinesProxy are created in similar fashion */
          ...
      }

Next is the start of the computation. A method called doFirstFFT() needs to be called. doFirstFFT(int id, int direction) takes two parameters: id specifies the ID number of the target FFT, and direction tells whether the FFT is to be done in the forward (direction=1) or backward (direction=0) direction.


     void myXLines::doAllFFT() {
         doFirstFFT(0, 1);
         doFirstFFT(1, 1);
         doFirstFFT(2, 1);
     }

Finally, the FFT is completed at the receiving side; in this example, the destination array is myZLines. As in the plane-based version, doneFFT() is overridden. doneFFT(int id, int direction) takes two inputs, with the same meanings as in doFirstFFT(int id, int direction).


     void myZLines::doneFFT(int id, int direction) {
         count++;
         if (count == 3) {
             count = 0;
             contribute(sizeof(int), &count, CkReduction::sum_int);
         }
     }

4.4.2 AMPI Interface

The MPI-like interface aims at easy migration of MPI programs to the library. It is not available in CVS yet.

The AMPI interface has five functions:

(sample code here)


5. TRAM

5.1 Overview

The Topological Routing and Aggregation Module (TRAM) is a library for optimizing many-to-many and all-to-all collective communication patterns in Charm++ applications. The library performs topological routing and aggregation of network communication in the context of a virtual mesh topology comprising the Charm++ Processing Elements (PEs) in the parallel run. The number of dimensions and their sizes within this topology are specified by the user when initializing an instance of the library. TRAM is implemented as a Charm++ group, so an instance of TRAM has one object on every PE used in the run. We use the term local instance to denote a member of the TRAM group on a particular PE.

Most collective communication patterns involve sending linear arrays of a single data type. In order to more efficiently aggregate and process network data, TRAM restricts the data sent using the library to a single data type specified by the user through a template parameter when initializing an instance of the library. We use the term data item to denote a single object of this data type submitted to the library for sending. While the library is active (i.e. after initialization and before termination), an arbitrary number of data items can be submitted to the library at each PE.

Due to the underlying assumption of a mesh topology, the benefits of TRAM are most pronounced on systems with a mesh or torus network topology, where the virtual mesh of PEs is made to match the physical topology of the network. The latter can easily be accomplished using the Charm++ Topology Manager.

The next two sections explain the routing and aggregation techniques used in the library.

5.1.1 Routing

Let the variables $j$ and $k$ denote PEs within an N-dimensional virtual topology of PEs, and let $x$ denote a dimension of the mesh. We represent the coordinates of $j$ and $k$ within the mesh as $\left(j_0, j_1, \ldots, j_{N-1}\right)$ and $\left(k_0, k_1, \ldots, k_{N-1}\right)$. Also, let

\begin{displaymath}
f(x, j, k) =
\begin{cases}
0, & \text{if } j_x = k_x \\
1, & \text{if } j_x \ne k_x
\end{cases}\end{displaymath}

$j$ and $k$ are peers if

$\displaystyle \sum_{d=0}^{N-1} f(d, j, k) = 1 . \qquad (5.1)$

When using TRAM, PEs communicate directly only with their peers. Sending to a PE which is not a peer is handled inside the library by routing the data through one or more intermediate destinations along the route to the final destination.

Suppose a data item destined for PE $k$ is submitted to the library at PE $j$. If $k$ is a peer of $j$, the data item will be sent directly to $k$, possibly along with other data items for which $k$ is the final or intermediate destination. If $k$ is not a peer of $j$, the data item will be sent to an intermediate destination $m$ along the route to $k$ whose index is $\left(j_0, j_1, \ldots, j_{i-1}, k_i, j_{i+1}, \ldots, j_{N-1}\right)$, where $i$ is the greatest value of $x$ for which $f(x, j, k) = 1$.

Note that in obtaining the coordinates of $m$ from $j$, exactly one of the coordinates of $j$ which differs from the coordinates of $k$ is made to agree with $k$. It follows that $m$ is a peer of $j$, and that using this routing process at $m$ and every subsequent intermediate destination along the route eventually leads to the data item being received at $k$. Consequently, the number of messages $F(j, k)$ that will carry the data item to the destination is

$\displaystyle F(j,k) = \sum_{d=0}^{N-1} f(d, j, k) . \qquad (5.2)$
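
Concretely, the routing rule can be sketched in a few lines of C++; this is an illustration of the rule above, not the library's internal implementation:

 // Compute the next hop m for an item at PE j destined for PE k, where
 // j, k, and m are N-dimensional mesh coordinates.
 void nextHop(int N, const int *j, const int *k, int *m) {
   for (int d = 0; d < N; d++) m[d] = j[d];
   // i is the greatest dimension in which j and k differ
   for (int i = N - 1; i >= 0; i--) {
     if (j[i] != k[i]) {
       m[i] = k[i];   // make exactly one coordinate agree with k
       return;        // m is now a peer of j
     }
   }
   // If no dimension differs, the item is already at its destination.
 }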

5.1.2 Aggregation

Communicating over the network of a parallel machine involves per-message bandwidth and processing overhead. TRAM amortizes this overhead by aggregating data items at the source and at every intermediate destination along the route to the final destination.

Every local instance of the TRAM group buffers the data items that have been submitted locally or received from another PE for forwarding. Because only peers communicate directly in the virtual mesh, it suffices to have a single buffer per PE for every peer. Given a dimension d within the virtual topology, let $s_d$ denote its size, or the number of distinct values a coordinate for dimension d can take. Consequently, each local instance allocates up to $s_d - 1$ buffers per dimension, for a total of $\sum_{d=0}^{N-1} (s_d - 1)$ buffers. For example, in an $8 \times 8 \times 8$ virtual mesh, each local instance allocates up to 7 buffers per dimension, or 21 buffers in total. Buffers are of a constant size. This size can either be specified directly by the user when creating the TRAM instance or, when using an alternative constructor, specified indirectly through a parameter denoting the maximum number of data items to be buffered at any point in time per local instance of the library. This amount is multiplied by a constant overallocation factor inside the library and divided by the number of peers when determining the size of each buffer. To prevent data allocation for peers which rarely or never communicate, TRAM allocates buffers on demand as soon as a data item needs to be sent out to a particular peer.

Sending with TRAM is done by submitting a data item and a destination identifier, either PE or array index, using a function call to the local instance. If the index belongs to a peer, the library places the data item in the buffer for the peer's PE. Otherwise, the library calculates the index of the intermediate destination using the previously described algorithm, and places the data item in the buffer for the resulting PE, which by design is always a peer of the local PE. Buffers are sent out immediately when they become full. When a message is received at an intermediate destination, the data items comprising it are distributed into the appropriate buffers for subsequent sending. In the process, if a data item is determined to have reached its final destination, it is immediately returned to the user through a virtual function reserved for that purpose.

The total buffering capacity specified by the user may be reached even when no single buffer is completely filled up. In that case the buffer with the greatest number of buffered data items is sent.

5.2 Application User Interface

A typical usage scenario for TRAM involves a start-up phase followed by one or more streaming steps. We next describe the interface and details relevant to usage of the library, which normally follows these steps:

  1. Start-up: creation of a TRAM group of an appropriate type and set-up of client arrays and groups
  2. Initialization: calling an appropriate initialization function, which returns through a callback
  3. Sending: an arbitrary number of sends using the insertData function call
  4. Receiving: processing received data items through the virtual process function, which serves as the delivery interface for the library and must be defined by the user
  5. Termination: termination of a streaming step
  6. Reinitialization: chare migration, potentially followed by reinitialization of TRAM (step 2) for a new streaming step

5.2.1 Start-Up

Start-up is typically performed once in a program, often inside the Main function of the mainchare, and involves creating the Mesh Streamer group.

An instance of TRAM is restricted to sending data items of a single user-specified type, which we denote by dtype , to a single user-specified chare array or group. The target group or array should inherit from MeshStreamerGroupClient<dtype> and MeshStreamerArrayClient<dtype, itype>, respectively.

5.2.1.1 Sending to a Group

To use TRAM for sending to a group, a GroupMeshStreamer group should be created. Either of the following two GroupMeshStreamer constructors can be used for that purpose:


 template<class dtype, class ClientType, class RouterType>

GroupMeshStreamer<dtype, ClientType, RouterType>::

GroupMeshStreamer(int maxNumDataItemsBuffered,
                  int numDimensions,
                  int *dimensionSizes,
                  CkGroupID clientGID,
                  bool yieldFlag = 0,
                  double progressPeriodInMs = -1.0);


template<class dtype, class ClientType, class RouterType>

GroupMeshStreamer<dtype, ClientType, RouterType>::

GroupMeshStreamer(int numDimensions,
                  int *dimensionSizes,
                  CkGroupID clientGID,
                  int bufferSize,
                  bool yieldFlag = 0,
                  double progressPeriodInMs = -1.0);


5.2.1.2 Sending to a Chare Array

For sending to a chare array, an ArrayMeshStreamer group should be created, which has a similar constructor interface to GroupMeshStreamer:


 template <class dtype, class itype, class ClientType, class RouterType>

ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::

ArrayMeshStreamer(int maxNumDataItemsBuffered,
                  int numDimensions,
                  int *dimensionSizes,
                  CkArrayID clientAID,
                  bool yieldFlag = 0,
                  double progressPeriodInMs = -1.0);


template <class dtype, class itype, class ClientType, class RouterType>

ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::

ArrayMeshStreamer(int numDimensions,
                  int *dimensionSizes,
                  CkArrayID clientAID,
                  int bufferSize,
                  bool yieldFlag = 0,
                  double progressPeriodInMs = -1.0);


Description of parameters:

Template parameters:

    dtype : the data type of the items sent using the library
    itype : the index type of the client chare array (ArrayMeshStreamer only)
    ClientType : the class of the client group or array
    RouterType : the class implementing the topological routing scheme

Constructor parameters:

    maxNumDataItemsBuffered : maximum number of data items to be buffered at any time per local instance (see the Aggregation section above)
    numDimensions, dimensionSizes : the number of dimensions and their sizes within the virtual mesh of PEs
    clientGID / clientAID : the group or array ID of the client
    bufferSize : the size of each peer buffer, specified directly (alternative constructor)
    yieldFlag : whether the library should periodically yield the PE while processing large numbers of data items
    progressPeriodInMs : the period, in milliseconds, for the periodic flushing mechanism (see the Termination section below)
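
For example, creating an ArrayMeshStreamer over a 2D virtual mesh of PEs might look like the following sketch; DataItem and MyClient stand for the user's data type and client array class, and SimpleMeshRouter for a router class supplied by the library (the names and argument values here are illustrative):

 int dims[2] = {8, 8};  // 2D virtual mesh, assuming a 64-PE run
 CProxy_ArrayMeshStreamer<DataItem, int, MyClient, SimpleMeshRouter>
   aggregator = CProxy_ArrayMeshStreamer<DataItem, int, MyClient,
                  SimpleMeshRouter>::ckNew(1024, 2, dims,
                                           clientProxy.ckGetArrayID());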

5.2.2 Initialization

Before any sending can be done, the library needs to be initialized. There are currently three main modes of operation, depending on the type of termination used: staged completion, completion detection, or quiescence detection. The modes of termination are described later. Here, we present the interface for initializing a streaming step in each case.

When using completion detection, each local instance of TRAM must be initialized using the following init function:


 template <class dtype, class RouterType>

void MeshStreamer<dtype, RouterType>::

init(int numContributors,
     CkCallback startCb,
     CkCallback endCb,
     CProxy_CompletionDetector detector,
     int prio,
     bool usePeriodicFlushing);

Description of parameters:

    numContributors : the global number of done calls expected by the instance (see the Termination section below)
    startCb : callback invoked when initialization is complete and the library is ready to accept data items
    endCb : callback invoked when the streaming step terminates
    detector : the CompletionDetector object to be used by the library
    prio : the Charm++ priority to be used for the library's messages
    usePeriodicFlushing : whether to enable the periodic flushing mechanism (see the Termination section below)

When using staged completion, a completion detector object is not required as input, as the library performs its own specialized form of termination. In this case, each local instance of TRAM must be initialized using a different interface for the overloaded init function:


 template <class dtype, class RouterType>

void MeshStreamer<dtype, RouterType>::

init(int numLocalContributors,
     CkCallback startCb,
     CkCallback endCb,
     int prio,
     bool usePeriodicFlushing);


Note that numLocalContributors denotes the local number of done calls expected, rather than the global as in the first interface of init. In practice, this often precludes calling this version of init through a broadcast, as the desired number of done calls may be different on each PE.

A common case is to have a single chare array perform all the sends in a streaming step, with each element of the array as a contributor. For this case there is a special version of init that takes as input the CkArrayID object for the chare array that will perform the sends, precluding the need to manually determine the number of client chares per PE:


 template <class dtype, class RouterType>

void MeshStreamer<dtype, RouterType>::

init(CkArrayID senderArrayID,
     CkCallback startCb,
     CkCallback endCb,
     int prio,
     bool usePeriodicFlushing);

The init interface for using quiescence detection is:

 template <class dtype, class RouterType>
 void MeshStreamer<dtype, RouterType>::
 init(CkCallback startCb, int prio);

After initialization is finished, the system invokes startCb, signalling to the user that the library is ready to accept data items for sending.

5.2.3 Sending

Sending with TRAM is done through calls to insertData and broadcast.


 template <class dtype, class RouterType>

void MeshStreamer<dtype, RouterType>::

insertData(const dtype& dataItem,
           int destinationPe);


template <class dtype, class itype, class ClientType, class RouterType>

void ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::

insertData(const dtype& dataItem,
           itype arrayIndex);


template <class dtype, class RouterType>

void MeshStreamer<dtype, RouterType>::

broadcast(const dtype& dataItem);

Broadcasting has the effect of delivering the data item once at every local instance of the client group (for GroupMeshStreamer), or once for every element of the client chare array (for ArrayMeshStreamer).

5.2.4 Receiving

To receive data items sent using TRAM, the user must define the process function for each client group and array:


 void process(const dtype  &ran);

process() is called by the library on the destination processor once for every delivered item. The call is made locally, so process does not need to be an entry method.
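
For example, a client chare array might be declared as follows, where DataItem stands for the dtype in use (the names are placeholders; the base class follows the inheritance described in the Start-Up section):

 class MyClient : public MeshStreamerArrayClient<DataItem, int> {
 public:
   MyClient() {}
   MyClient(CkMigrateMessage *m) {}
   // Invoked locally by TRAM once per data item delivered to this element
   void process(const DataItem &item) {
     // ... consume the data item ...
   }
 };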

5.2.5 Termination

A termination mechanism is required due to the buffering of data items in TRAM local instances at the source and intermediate destinations. Without a flushing or termination mechanism, data items may remain buffered indefinitely without ever getting delivered to their final destinations. A termination scheme ensures that these items are delivered, preventing deadlock.

Currently, three means of termination are supported: staged completion, completion detection, and quiescence detection. Periodic flushing is a secondary mechanism which can be enabled or disabled when initiating one of the primary mechanisms.

In general, termination requires the user to issue a number of calls to the done function:


 template <class dtype, class RouterType>

void MeshStreamer<dtype, RouterType>::

done(int numContributorsFinished = 1);

When using completion detection, the number of done calls that are expected globally by the TRAM instance is specified using the numContributors parameter to init. Safe termination requires that no calls to insertData or broadcast are made after the last call to done is performed globally. Because order of execution is uncertain in parallel applications, some care is required to ensure the above condition is met. A simple way to terminate safely is to set numContributors equal to the number of senders, and call done once for each sender that is done sending.
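
The typical pattern at each sender is thus to submit all of its items and then issue a single done call; in the following sketch, 'aggregator' is the TRAM group proxy from the earlier example and the remaining names are placeholders:

 ArrayMeshStreamer<DataItem, int, MyClient, SimpleMeshRouter> *localInstance =
   aggregator.ckLocalBranch();
 for (int i = 0; i < numItems; i++)
   localInstance->insertData(items[i], destIndex[i]);
 localInstance->done();  // one of the numContributors expected done calls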

In contrast to using completion detection, using staged completion involves setting the local number of expected done calls on each local instance using the numLocalContributors parameter in the init function. To ensure safe termination, no insertData or broadcast calls should be made on any PE that has received all its expected done calls.

Staged completion is not supported when array location data is not guaranteed to be correct, as this can potentially violate the termination conditions used to guarantee successful termination. In order to guarantee correct location data in applications with load balancing, Charm++ must be compiled with -DCMK_GLOBAL_LOCATION_UPDATE, which has the effect of performing a global broadcast of location data for chare array elements that migrate during load balancing. Unfortunately, this operation is expensive when migrating large numbers of elements. As an alternative, completion detection and quiescence detection modes will work properly without the global location update mechanism, even in the case of anytime migration.

Another version of init, which takes a CkArrayID object as an argument, provides a simplified interface in the common case when a single chare array performs all the sends within a streaming step, with each of its elements as a contributor. This version of init determines the appropriate number of local contributors automatically and can be called through a broadcast. It also correctly handles the case of PEs without any contributors by immediately marking those PEs as having finished the streaming step. As such, this version of init should be preferred by the user when applicable.

When using quiescence detection, no end callback is specified. Instead, termination of a streaming step is achieved using the quiescence detection framework, which allows passing a callback as a parameter. TRAM is set up such that quiescence will not be detected until all items sent in the last streaming step have been delivered to their final destinations.

The choice of which termination mechanism to use is left to the user. There are cases where it may be easier to use the completion detection scheme. If either mode can be used with ease, staged completion should be preferred, as it achieves termination faster and does not involve the persistent overhead of completion detection due to reductions for determining when the global number of expected done calls is reached. Staged completion, unlike completion detection, does incur a small bandwidth overhead (4 bytes) for every TRAM message, but in practice this is more than offset by the persistent traffic incurred by completion detection.

Periodic flushing is an auxiliary mechanism which checks at a regular interval whether any sends have taken place since the last time the check was performed. If not, the mechanism sends out all the data items buffered per local instance of the library. The period is specified by the user in the TRAM constructor. A typical use case for periodic flushing is when the submission of a data item B to TRAM happens as a result of the delivery of another data item A sent using the same TRAM instance. If A is buffered inside the library and insufficient data items are submitted to cause the buffer holding A to be sent out, a deadlock could arise. With the periodic flushing mechanism, the buffer holding A is guaranteed to be sent out eventually, and deadlock is prevented.

5.2.6 Reinitialization

A TRAM instance that has terminated cannot be used for sending more data items until it has been reinitialized. Reinitialization is achieved by calling init, which prepares the instance of the library for a new streaming step. Reinitialization is useful for time-stepped applications; a new streaming step is normally done for each step of the application.