22. Support for Loop-level Parallelism

To better utilize multicore chips, it has become increasingly popular to adopt shared-memory multithreading to exploit parallelism within a node. For example, in hybrid MPI programs, OpenMP is the most popular choice. When launching such hybrid programs, users have to make sure there are spare physical cores allocated to the shared-memory multithreading runtime; otherwise, the two independent runtime systems are not coordinated and will contend for resources. If spare cores are allocated, Charm++ works with any shared-memory parallel programming language (e.g., OpenMP), launched in the same way as an MPI+OpenMP hybrid program. As with ordinary OpenMP applications, the number of threads used in the OpenMP parts of the program can be controlled with the OMP_NUM_THREADS environment variable. See Sec. C.1 for details on how to propagate such environment variables.
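For instance, a launch might look like the following sketch; the binary name ./app and the core counts are illustrative only, and it assumes OMP_NUM_THREADS reaches the remote processes as described in Sec. C.1.

/* Hypothetical launch of an SMP build: 4 worker PEs across 2 processes
   (2 PEs each), leaving spare cores on the node for the OpenMP threads
   that each PE's code spawns (2 per PE here via OMP_NUM_THREADS). */
OMP_NUM_THREADS=2 ./charmrun +p4 ./app ++ppn 2 +setcpuaffinity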

If there are no spare cores allocated, to avoid resource contention, a unified runtime is needed to support both intra-node shared-memory multithreading parallelism and inter-node distributed-memory message-passing parallelism. Additionally, considering that a parallel application may have only a small fraction of its critical computation be suitable for porting to shared-memory parallelism (the savings on critical computation may also reduce the communication cost, thus leading to more performance improvement), dedicating physical cores on every node to the shared-memory multithreading runtime will waste computational power because those dedicated cores are not utilized at all during most of the application's execution time. This case indicates the necessity of a unified runtime supporting both types of parallelism.

The CkLoop library is an add-on to the Charm++ runtime that achieves such a unified runtime. The library implements a simple OpenMP-like shared-memory multithreading runtime that reuses Charm++ PEs to perform tasks spawned by the multithreading runtime. This library targets the SMP mode of Charm++.

The CkLoop library is built in $CHARM_DIR/$MACH_LAYER/tmp/libs/ck-libs/ckloop by executing ``make''. To use it in an application, include ``CkLoopAPI.h'' in the source code. The main interface functions are CkLoop_Init, which is called once on a single PE (e.g., in the main chare) to create the library instance, CkLoop_Exit, which destroys it, and CkLoop_Parallelize, which splits a range of loop iterations into chunks executed by the PEs of the node, as sketched below.
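The following is a minimal sketch of driving a loop through CkLoop_Parallelize, assuming the default arguments of CkLoop_Init; the helper function plusWork and the chunk count are illustrative, and CKLOOP_INT_SUM is assumed to be the integer-sum member of the library's REDUCTION_TYPE enumeration.

#include "CkLoopAPI.h"

/* Helper executed for each chunk: it receives the [first, last] subrange of
   iterations plus the user parameters, and writes its partial reduction
   result into *result. */
static void plusWork(int first, int last, void *result, int paramNum, void *param) {
  int sum = 0;
  for (int i = first; i <= last; i++) sum += i;
  *(int *)result = sum;
}

/* Once, on a single PE (e.g., in the main chare): create the library instance. */
CProxy_FuncCkLoop ckLoop = CkLoop_Init();

/* Later, inside an entry method: run iterations [0, 127] as 8 chunks on the
   PEs of this node, summing the per-chunk results into `total`. */
void doLoop() {
  int total = 0;
  CkLoop_Parallelize(plusWork, 0 /* paramNum */, NULL /* param */,
                     8 /* numChunks */, 0 /* lowerRange */, 127 /* upperRange */,
                     1 /* sync */, &total, CKLOOP_INT_SUM);
  CkPrintf("sum = %d\n", total);
}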

Examples using this library can be found in examples/charm++/ckloop and in the widely used molecular dynamics simulation application NAMD.

22.1 Charm++/Converse Runtime Scheduler Integrated OpenMP

The compiler-provided OpenMP runtime library can work with Charm++, but it creates its own thread pool, so Charm++ and OpenMP can oversubscribe the cores. The integrated OpenMP runtime library parallelizes OpenMP regions in each chare and runs on the Charm++ runtime without oversubscription. The integrated runtime creates OpenMP user-level threads, which can migrate among PEs within a node. This fine-grained parallelism helps resolve load imbalance within a node easily: when PEs become idle, they help other busy PEs within the node via work-stealing.

22.1.1 Instructions to build and use the integrated OpenMP library

22.1.1.1 Instructions to build

The OpenMP library can be built with the `omp' keyword and any SMP version of Charm++ (including the multicore build) when you build Charm++ or AMPI.
e.g.) $CHARM_DIR/build charm++ multicore-linux64 omp
      $CHARM_DIR/build charm++ netlrts-linux-x86_64 smp omp
This library is based on the LLVM OpenMP runtime library, so it supports the ABI used by the clang, Intel, and GCC compilers.

The following compilers have been verified to support this integrated library on Linux:

GCC: 4.8 or newer
ICC: 15.0 or newer
Clang: 3.7 or newer

You can use this integrated OpenMP with clang on IBM Blue Gene machines without special compilation flags (you do not need to add -fopenmp or -openmp with clang on Blue Gene).

On Linux, OpenMP-capable versions of clang have recently been installed by default; for example, Ubuntu has shipped clang newer than 3.7 since 15.10. Depending on which version of clang is installed in your working environment, you may need to follow additional instructions to use this integrated OpenMP with clang. The following instructions are for Ubuntu systems where the default clang is older than 3.7; on other Linux distributions, you can use the distribution's package manager to install clang and the OpenMP library. Installing clang adds headers for the OpenMP environment routines and allows the compiler to parse OpenMP directives. However, on Ubuntu the clang installation does not come with its own OpenMP runtime library, which results in an error message saying that the compiler-provided OpenMP library failed to link. That library is not needed when using the integrated OpenMP runtime, but you need to work around the error for your code to compile. The following instructions avoid the error.

/* To build the integrated OpenMP on Ubuntu where the pre-installed clang
   is older than 3.7 (e.g., Ubuntu 14.04, whose default clang is 3.4),
   use the following instructions. */
sudo apt-get install clang-3.8   // any version of clang 3.8 or newer works
sudo ln -svT /usr/bin/clang-3.8 /usr/bin/clang
sudo ln -svT /usr/bin/clang++-3.8 /usr/bin/clang++

$CHARM_DIR/build charm++ multicore-linux64 clang omp --with-production -j8
echo '!<arch>' > $CHARM_DIR/lib/libomp.a   // dummy library; avoids the link error message

On Mac, the Apple-provided clang installed by default does not support OpenMP. We are working on supporting this library on Mac with an OpenMP-enabled clang, which can be downloaded and installed through Homebrew or MacPorts. Currently, this integrated library is built and compiled on Mac with standard GCC, which can be downloaded and installed via Homebrew or MacPorts. You should set environment variables so that the Charm++ build script uses the GCC installed from Homebrew or MacPorts. The following is an example using Homebrew on Mac OS X 10.12.5.

/* Install Homebrew from https://brew.sh
 * Then install gcc using 'brew'. */
brew install gcc
/* gcc, g++, and other binaries are installed at /usr/local/Cellar/gcc/<version>/bin.
 * You need to make symbolic links to the gcc binaries at /usr/local/bin.
 * In this example, gcc 7.1.0 is installed at that directory.
 */
cd /usr/local/bin
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-7 gcc
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/g++-7 g++
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-nm-7 gcc-nm
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-ranlib-7 gcc-ranlib
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-ar-7 gcc-ar
/* Finally, set the PATH variable so that these binaries are found first by the build script. */
export PATH=/usr/local/bin:$PATH

In addition, this library will be supported on Windows in the next release of Charm++.

22.1.1.2 How to use the integrated OpenMP on Charm++

To use this library in your applications, add `-module OmpCharm' to the compile flags to link this library instead of the compiler-provided one. Without `-module OmpCharm', your application will use the compiler-provided OpenMP library, which runs on its own separate runtime. (You don't need to add `-fopenmp' or `-openmp' with gcc and icc; these flags are included in the predefined compile options when you build Charm++ with `omp'.)
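For example, a build might look like the following sketch; the source file hello.C is hypothetical, and the essential part is `-module OmpCharm' on the link line.

$CHARM_DIR/bin/charmc -c hello.C -o hello.o
$CHARM_DIR/bin/charmc -language charm++ -module OmpCharm -o hello hello.o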

This integrated OpenMP adjusts the number of OpenMP instances on each chare, so the number of OpenMP instances can change from one OpenMP region to another over the execution. If your code shares a data structure among the OpenMP instances in a parallel region, you can size the data structure before the start of the OpenMP region with ``omp_get_max_threads()'' and index into it within each OpenMP instance with ``omp_get_thread_num()''. After the OpenMP region, you can iterate over the data structure to combine the partial results with ``CmiGetCurKnownOmpThreads()'', which returns the number of OpenMP threads used for the latest OpenMP region on the PE where the chare is running. The following example shows how to use a shared data structure across an OpenMP region with the integrated OpenMP on Charm++.

/* Maximum possible number of OpenMP threads in the upcoming OpenMP region.
   Users can restrict this number with 'omp_set_num_threads()' for each chare
   and with the environment variable 'OMP_NUM_THREADS' for all chares.
   By default, omp_get_max_threads() returns the number of PEs of the logical node.
*/
int maxAvailableThreads = omp_get_max_threads();
int *partialResult = new int[maxAvailableThreads]{0};

/* Partial sum over the subset of iterations assigned to each OpenMP thread.
   The integrated OpenMP runtime decides internally, with some heuristics,
   how many OpenMP threads to create.
*/
#pragma omp parallel for
for (int i = 0; i < 128; i++) {
  partialResult[omp_get_thread_num()] += i;
}
/* CmiGetCurKnownOmpThreads() tells how many OpenMP threads were created in
   the latest OpenMP region, so we can read back the partial result each
   OpenMP thread generated. */
for (int j = 0; j < CmiGetCurKnownOmpThreads(); j++)
  CkPrintf("partial sum of thread %d: %d \n", j, partialResult[j]);

22.1.2 Limitations of the current implementation

22.1.2.1 The lack of barriers within an OpenMP region

In the OpenMP standard, worksharing constructs within an OpenMP region have an implicit barrier between one another, so the results of an earlier worksharing construct can be used by later constructs. However, our current implementation doesn't support this barrier between worksharing constructs in the same OpenMP region. The following example executes two `omp for' loops within a single OpenMP region.
int result = 0;
#pragma omp parallel
{
  #pragma omp for reduction(+: result)
  for (int i = 0; i < 128; i++) {
    result += i;
  } /* An implicit barrier should exist here for the threads within this team,
       but the current implementation of the integrated OpenMP doesn't provide
       one. The OpenMP threads within this team continue forward without being
       blocked here, as if the 'omp for' had a 'nowait' clause. */

  #pragma omp for reduction(+: result)
  for (int j = 0; j < 128; j++) {
    result += j;
  }
  /* Because there was no barrier before this 'omp for' construct,
     the value in `result` may not be consistent with what is expected. */
} /* We provide an implicit barrier at the end of each OpenMP parallel region. */

So if you want to use multiple worksharing constructs, you should put them in separate 'parallel' regions. The example above can be rewritten as follows.

int result = 0;
#pragma omp parallel for reduction(+: result)
for (int i = 0; i < 128; i++) {
  result += i;
}

#pragma omp parallel for reduction(+: result)
for (int j = 0; j < 128; j++) {
  result += j;
}
The implicit barrier between worksharing constructs will be supported in the next release of Charm++.

22.1.2.2 The list of supported pragmas

This library is forked from the LLVM OpenMP runtime library supporting OpenMP 4.0. Of the many directives specified in OpenMP 4.0, only a limited set is supported. The following is the list of supported pragmas, which have been confirmed to work with this library (a short example appears after the list).
omp_atomic
omp_master
omp_critical
omp_master_3
omp_get_wtick
omp_for_private
omp_in_parallel
omp_parallel_if
omp_for_reduction
omp_for_lastprivate
omp_for_firstprivate
omp_for_schedule_static
omp_for_schedule_dynamic
omp_get_num_threads
omp_parallel_for_if
omp_parallel_shared
omp_section_private
omp_parallel_default
omp_parallel_private
omp_parallel_reduction
omp_parallel_for_private
omp_parallel_firstprivate
omp_sections_reduction
omp_section_lastprivate
omp_section_firstprivate
omp_parallel_for_firstprivate
omp_parallel_for_lastprivate
omp_parallel_for_reduction
omp_parallel_sections_firstprivate
omp_parallel_sections_lastprivate
omp_parallel_sections_private
omp_parallel_sections_reduction
omp_for_schedule_guided
omp_flush
omp_get_wtime
The other directives in the OpenMP standard will be supported in the next version.
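As an illustration, here is a small sketch exercising a few of the supported constructs above (a parallel for with a static schedule and a reduction, plus a critical section with omp_get_num_threads); the loop bounds and messages are illustrative only.

#include <omp.h>

int total = 0;
#pragma omp parallel for schedule(static) reduction(+: total)
for (int i = 0; i < 128; i++) {
  total += i;
}

#pragma omp parallel
{
  /* 'omp critical' serializes this block across the threads of the team. */
  #pragma omp critical
  CkPrintf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
}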

A simple example using this library can be found in examples/charm++/openmp. You can compare CkLoop and the integrated OpenMP with this example. For a sufficiently large problem size, the total execution time of this example is faster with OpenMP than with CkLoop, thanks to the load balancing that work-stealing among threads within a node provides, even though the execution time of each individual chare can be slower with OpenMP because idle PEs spend time helping busy PEs.