22. Support for Loop-level Parallelism

To better utilize the multicore chip, it has become increasingly popular to adopt shared-memory multithreading programming methods to exploit parallelism on a node. For example, in hybrid MPI programs, OpenMP is the most popular choice. When launching such hybrid programs, users have to make sure there are spare physical cores allocated to the shared-memory multithreading runtime. Otherwise, the runtime that handles distributed-memory programming may interfere with resource contention because the two independent runtime systems are not coordinated. If spare cores are allocated, in the same way of launching a MPI+OpenMP hybrid program, Charm++ will work perfectly with any shared-memory parallel programming languages (e.g. OpenMP). As with ordinary OpenMP applications, the number of threads used in the OpenMP parts of the program can be controlled with the OMP_NUM_THREADS environment variable. See Sec. C.1 for details on how to propagate such environment variables.

If there are no spare cores allocated, to avoid resource contention, a unified runtime is needed to support both intra-node shared-memory multithreading parallelism and inter-node distributed-memory message-passing parallelism. Additionally, considering that a parallel application may have only a small fraction of its critical computation be suitable for porting to shared-memory parallelism (the savings on critical computation may also reduce the communication cost, thus leading to more performance improvement), dedicating physical cores on every node to the shared-memory multithreading runtime will waste computational power because those dedicated cores are not utilized at all during most of the application's execution time. This case indicates the necessity of a unified runtime supporting both types of parallelism.

22.1 CkLoop library

The CkLoop library is an add-on to the Charm++ runtime to achieve such a unified runtime. The library implements a simple OpenMP-like shared-memory multithreading runtime that reuses Charm++ PEs to perform tasks spawned by the multithreading runtime. This library targets the SMP mode of Charm++ .

The CkLoop library is built in $CHARM_DIR/$MACH_LAYER/tmp/libs/ck-libs/ckloop by executing ``make''. To use it for user applications, one has to include ``CkLoopAPI.h'' in the source code. The interface functions of this library are as follows:

Lambda syntax for CkLoop is also supported. The interface for using lambda syntax is as follows:

Examples using this library can be found in examples/charm++/ckloop and the widely used molecular dynamics simulation application NAMD22.1.

22.1.1 The CkLoop Hybrid library

The CkLoop_Hybrid library is a mode of CkLoop that incorporates specific adaptive scheduling strategies aimed at providing a tradeoff between dynamic load balance and spatial locality. It is used in a build of Charm++ where all chares are placed on core 0 of each node (called the drone-mode, or all-drones-mode). It incorporates a strategy called staggered static-dynamic scheduling (from dissertation work of Vivek Kale). The iteration space is first tentatively divided approximately equally to all available PEs. Each PE's share of the iteration space is divided into a static portion, specified by the staticFraction parameter below, and the remaining dynamic portion. The dynamic portion of a PE is divided into chunks of specified chunksize, and enqueued in the task-queue associated with that PE. Each PE works on its static portion, and then on its own task queue (thus preserving spatial locality, as well as persistence of allocations across outer iterations), and after finishing that, steals work from other PE’s task queues.

CkLoopHybrid support requires the SMP mode of Charm++ and the additional flags -enable-drone-mode and -enable-task-queue to be passed as build options when Charm++ is built.

The changes to the CkLoop API call are the following:

Reduction is supported for type CKLOOP_INT_SUM, CKLOOP_FLOAT_SUM, CKLOOP_DOUBLE_SUM. It is recommended to use this mode without reduction.

22.2 Charm++/Converse Runtime Scheduler Integrated OpenMP

The compiler-provided OpenMP runtime library can work with Charm++ but it creates its own thread pool so that Charm++ and OpenMP can have oversubscription problem. The integrated OpenMP runtime library parallelizes OpenMP regions in each chare and runs on the Charm++ runtime without oversubscription. The integrated runtime creates OpenMP user-level threads, which can migrate among PEs within a node. This fine-grained parallelism by the integrated runtime helps resolve load imbalance within a node easily. When PEs become idle, they help other busy PEs within a node via work-stealing.

22.2.1 Instructions to build and use the integrated OpenMP library Instructions to build

The OpenMP library can be built with `omp' keyword and any smp version of Charm++ including multicore build when you build Charm++ or AMPI.
e.g.) $CHARM_DIR/build charm++ multicore-linux64 omp
      $CHARM_DIR/build charm++ netlrts-linux-x86_64 smp omp
This library is based on the LLVM OpenMP runtime library. So it supports the ABI used by clang, intel and gcc compilers.

The following is the list of compilers which are verified to support this integrated library on Linux.

You can use this integrated OpenMP with clang on IBM Bluegene machines without special compilation flags. (Don't need to add -fopenmp or -openmp on Bluegene clang)

On Linux, the OpenMP supported version of clang has been installed in default recently. For example, Ubuntu has been released with clang higher than 3.7 since 15.10. Depending on which version of clang is installed in your working environments, you should follow additional instructions to use this integrated OpenMP with Clang. The following is the instruction to use clang on Ubuntu where the default clang is older than 3.7. If you want to use clang on other Linux distributions, you can use package managers on those Linux distributions to install clang and OpenMP library. This installation of clang will add headers for OpenMP environmental routines and allow you to parse the OpenMP directives. However, on Ubuntu, the installation of clang doesn't come with its OpenMP runtime library so it results in an error message saying that it fails to link the compiler provided OpenMP library. This library is not needed to use the integrated OpenMP runtime but you need to avoid this error to succeed compiling your codes. The following is the instruction to avoid the error.

/* When you want to compile Integrated OpenMP on Ubuntu where the pre-installed clang
is older than 3.7, you can use integrated openmp with the following instructions.
e.g.) Ubuntu 14.04, the version of default clang is 3.4.  */
sudo apt-get install clang-3.8 //you can use any version of clang higher than 3.8
sudo ln -svT /usr/bin/clang-3.8 /usr/bin/clang
sudo ln -svT /usr/bin/clang++-3.8 /usr/bin/clang

$(CHARM_DIR)/build charm++ multicore-linux64 clang omp --with-production -j8
echo '!<arch>' > $(CHARM_DIR)/lib/libomp.a //Dummy library. This will make you avoid the error message.

On Mac, the Apple-provided clang installed in default doesn't have OpenMP feature. We're working on the support of this library on Mac with OpenMP enabled clang which can be downloaded and installed through `Homebrew or MacPorts`. Currently, this integrated library is built and compiled on Mac with the normal GCC which can be downloaded and installed via Homebrew and MacPorts. If installed globally, GCC will be accessible by appending the major version number and adding it to the invocation of the Charm++ build script. For example:

$CHARM_DIR/build charm++ multicore-linux64 omp gcc-7
$CHARM_DIR/build charm++ netlrts-linux-x86_64 smp omp gcc-7

If this does not work, you should set environment variables so that the Charm++ build script uses the normal gcc installed from Homebrew or MacPorts. The following is an example using Homebrew on Mac OS X 10.12.5.

/* Install Homebrew at
 * Install gcc using 'brew' */
brew install gcc
/* gcc, g++ and other binaries are installed at /usr/local/Cellar/gcc/<version>/bin
 * You need to make symbolic links to the gcc binaries at /usr/local/bin
 * In this example, gcc 7.1.0 is installed at the directory.
cd /usr/local/bin
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-7 gcc
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/g++-7 g++
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-nm-7 gcc-nm
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-ranlib-7 gcc-ranlib
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-ar-7 gcc-ar
/* Finally, you should set PATH variable so that these binaries are accessed first in the build script.
   export PATH=/usr/local/bin:$PATH

In addition, this library will be supported on Windows in the next release of Charm++. How to use the integrated OpenMP on Charm++

To use this library on your applications, you have to add `-module OmpCharm' in compile flags to link this library instead of the compiler-provided library in compilers. Without `-module OmpCharm', your application will use the compiler-provided OpenMP library which running on its own separate runtime. (You don't need to add `-fopenmp or -openmp' with gcc and icc. These flags are included in the predefined compile options when you build Charm++ with `omp')

This integrated OpenMP adjusts the number of OpenMP instances on each chare so the number of OpenMP instances can be changed for each OpenMP region over execution. If your code shares some data structures among OpenMP instances in a parallel region, you can set the size of the data structures before the start of the OpenMP region with ``omp_get_max_threads()'' and use the data structure within each OpenMP instance with ``omp_get_thread_num()''. After the OpenMP region, you can iterate over the data structure to combine partial results with ``CmiGetCurKnownOmpThreads()''. ``CmiGetCurKnownOmpThreads() returns the number of OpenMP threads for the latest OpenMP region on the PE where a chare is running.'' The following is an example to describe how you can use shared data structures for OpenMP regions on the integrated OpenMP with Charm++.

/* Maximum possible number of OpenMP threads in the upcoming OpenMP region.
   Users can restrict this number with 'omp_set_num_threads()' for each chare
   and the environmental variable, 'OMP_NUM_THREADS' for all chares.
   By default, omp_get_max_threads() returns the number of PEs for each logical node.
int maxAvailableThreads = omp_get_max_threads();
int *partialResult = new int[maxAvailableThreads]{0};

/* Partial sum for subsets of iterations assigned to each OpenMP thread.
   The integrated OpenMP runtime decides how many OpenMP threads to create
   with some heuristics internally.
#pragma omp parallel for
for (int i = 0; i < 128; i++) {
  partialResult[omp_get_thread_num()] +=i;
/* We can know how many OpenMP threads are created in the latest OpenMP region
   by CmiCurKnownOmpthreads().
   You can get partial results each OpenMP thread generated */
for (int j = 0; j < CmiCurKnownOmpThreads(); j++)
  CkPrintf("partial sum of thread %d: %d \n", j, partialResult[j]);

22.2.2 The list of supported pragmas

This library is forked from LLVM OpenMP Library supporting OpenMP 4.0. Among many number of directives specified in OpenMP 4.0, limited set of directives are supported. The following list of supported pragmas is verified from the openmp conformance test suite which forked from LLVM OpenMP library and ported to Charm++ program running multiple OpenMP instances on chares. The test suite can be found in tests/converse/openmp_test.
/* omp_<directive>_<clauses> */

/* the following directives means they work within omp parallel region */


/* the following directives means the combination of 'omp parallel and omp for/section' works */

The other directives in OpenMP standard will be supported in the next version.

A simple example using this library can be found in examples/charm++/openmp. You can compare ckloop and the integrated OpenMP with this example. You can see that the total execution time of this example with enough big size of problem is faster with OpenMP than CkLoop thanks to load balancing through work-stealing between threads within a node while the execution time of each chare can be slower on OpenMP because idle PEs helping busy PEs.

22.3 API to control which PEs participating in CkLoop/OpenMP work

User may want certain PE not to be involved in other PE's loop-level parallelization for some cases because it may add latency to works in the PE by helping other PEs. User can enable or disable each PE to participate in the loop-level parallelization through the following API.

void CkSetPeHelpsOtherThreads (int value)

value can be 0 or 1, 0 means this API disable the current PE to help other PEs. value 1 or others can enable the current PE for loop-level parallelization. By default, all the PEs are enabled for the loop-level parallelization by CkLoop and OpenMP. User should explicitly enable the PE again by calling this API with value 1 after they disable it during certain procedure so that the PE can help others after that. The following example shows how this API can be used.


/* codes which can run without the current PE
interuppted by loop-level work from other PEs */