MPI and OpenMP both have their advantages and disadvantages.
MPI can be used on distributed memory clusters and can scale to thousands of nodes. However, it was designed in the days when cluster nodes had only one or two cores. Nowadays CPUs often have more than ten cores and sometimes support multiple hardware threads (or logical cores) per physical core (and in fact may need multiple threads to run at full performance). At the same time, the amount of memory per hardware thread is not increasing and is in fact quite low on several architectures that rely on a large number of slower cores or hardware threads to obtain a high performance within a reasonable power budget. Starting one MPI process per hardware thread is then a waste of resources as each process needs its own communication buffers, OS resources, etc. Managing the hundreds of thousands of MPI processes that we nowadays see on the biggest clusters is also very hard.
OpenMP, on the other hand, is limited to shared-memory parallelism, typically within a single node of a cluster. Moreover, many OpenMP programs don't scale past a few tens of threads, partly because of thread overhead in the OS implementation and partly because of overhead in the OpenMP runtime.
Hybrid programs try to combine the advantages of both approaches to deal with their disadvantages. A hybrid program uses a limited number of MPI processes ("MPI ranks") per node and uses OpenMP threads to further exploit the parallelism within the node. An increasing number of applications are designed or re-engineered in this way. The optimal number of MPI processes (and hence OpenMP threads per process) depends on the code, the cluster architecture and the problem being solved, but often one or, on newer CPUs such as the Intel Haswell, two MPI processes per socket (so two to four for a typical two-socket node) is close to optimal. Compiling and starting such applications requires some care, as we explain on this page.
Preparing your hybrid application to run
To compile and link your hybrid application, you basically have to combine the compiler and linker instructions for MPI and for OpenMP: use mpicc -fopenmp for the GNU toolchain or mpiicc -qopenmp for the Intel toolchain (mpiicc -openmp for older versions), or the corresponding MPI Fortran compiler wrappers for Fortran programs.
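As an illustration, assuming a C or Fortran source file called hybrid_mpi.c or hybrid_mpi.f90 (the file names and optimization flags are only examples; the exact wrapper names depend on the MPI implementation in the toolchain you load), the compile-and-link commands would look along these lines:
mpicc -fopenmp -O2 -o hybrid_mpi hybrid_mpi.c       # GNU toolchain (foss), C
mpiicc -qopenmp -O2 -o hybrid_mpi hybrid_mpi.c      # Intel toolchain, C
mpif90 -fopenmp -O2 -o hybrid_mpi hybrid_mpi.f90    # GNU toolchain (foss), Fortran
mpiifort -qopenmp -O2 -o hybrid_mpi hybrid_mpi.f90  # Intel toolchain, Fortran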
Running hybrid programs on the VSC clusters
When running a hybrid MPI/OpenMP program, fewer MPI processes
have to be started than there are logical cores available to
the application, as every process uses multiple cores for its
OpenMP parallelism. Yet when requesting resources from the
scheduler, one still has to request the total number of logical
cores needed per node. Hence the PBS property "ppn" should not be read
as "processes per node" but rather as "logical cores per node" or
"processing units per node". Instead we
have to tell the MPI launcher (
mpirun for most applications) to
launch fewer processes than there are logical cores on a node and tell
each MPI process to use the correct number of OpenMP threads.
For optimal performance, the threads of one MPI process should be placed as close together as possible in the logical core hierarchy implied by the cache and core topology of a given node. E.g., on a dual-socket node it may make a lot of sense to run 2 MPI processes, with each MPI process using all cores of a single socket. In other applications, it might be better to run only one MPI process per node, or multiple MPI processes per socket. In more technical terms, each MPI process runs in its own MPI domain consisting of a number of logical cores, and we want these domains to be non-overlapping and fixed in time during the life of the MPI job, and the logical cores in a domain to be "close" to each other. This optimises the use of the memory hierarchy (cache and RAM).
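To see how the logical cores of a node map onto sockets and physical cores (and hence how to choose sensible domains), you can inspect the node topology on the command line, e.g. with the standard Linux tools below (a quick interactive check, not part of the job script; availability may differ per cluster):
lscpu                # summary: number of sockets, cores per socket, threads per core
lscpu --extended     # one line per logical core with its socket and physical core id
numactl --hardware   # NUMA nodes and the logical cores belonging to each of them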
OpenMP has several environment variables that control the number
of OpenMP threads and the placement of the threads within the MPI domain. All of these
may also be overridden by the application, so they are not a
bullet-proof way to control the behaviour of OpenMP applications.
Moreover, some of these environment variables are
implementation-specific and hence differ between the Intel
and GNU OpenMP runtimes. The most important variable is
OMP_NUM_THREADS. It sets the number of threads to be used in parallel regions. As
parallel constructs can be nested, a process may still start more
threads than indicated by OMP_NUM_THREADS;
the total number of threads can be limited by the variable OMP_THREAD_LIMIT.
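A minimal sketch (the value 5 is only an example, matching the 5-threads-per-rank case used later on this page):
export OMP_NUM_THREADS=5    # threads used in each parallel region of an MPI rank
export OMP_THREAD_LIMIT=5   # hard cap on the total number of threads, also for nested parallel regions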
Script mympirun (VSC)
The mympirun script is developed by the UGent VSC-team to cope
with differences between different MPI implementations
automatically. It offers support for hybrid programs through
the --hybrid command line switch, which specifies the number of
processes per node. The number of threads per process can then be
computed by dividing the number of logical cores per node by the
number of processes per node.
E.g., to run a hybrid MPI/OpenMP program on 2 nodes using 20 cores on each node and running 4 MPI ranks per node (hence 5 OpenMP threads per MPI rank), your script would contain
#PBS -l nodes=2:ppn=20
near the top to request the resources from the scheduler. The script would then load the module that provides the mympirun command:
module load vsc-mympirun
(besides other modules that are needed to run your application) and finally start your application:
mympirun --hybrid=4 ./hybrid_mpi
assuming your executable is called hybrid_mpi and resides in the
working directory. The mympirun launcher will automatically
determine the correct number of MPI processes to start based on
the resource specifications and the given number of processes per node.
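Putting the pieces together, a minimal job script could look as follows (a sketch for the 2-node, 20-core example above; adapt the resource request and module names to your cluster, and hybrid_mpi is just the example executable name used on this page):
#!/bin/bash -l
#PBS -l nodes=2:ppn=20
module load vsc-mympirun
# load the other modules your application needs here
cd $PBS_O_WORKDIR            # go to the directory the job was submitted from
mympirun --hybrid=4 ./hybrid_mpi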
Intel toolchain (Intel MPI)
On Intel MPI, defining the MPI domains is done through the environment variable I_MPI_PIN_DOMAIN.
Note however that the Linux scheduler is still
free to move the threads of an MPI process to any core within its MPI domain
at any time, so there may be a point in further pinning the OpenMP threads through
the OpenMP environment variables as well.
This is definitely the case if there are more logical cores available
in the process partition than there are OpenMP threads. Some environment
variables to influence the thread placement are
the Intel-specific variable
KMP_AFFINITY and the OpenMP 3.1
standard environment variable OMP_PROC_BIND.
In our case, we want to use all logical cores of a node but make sure
that all cores of a domain are as close together as possible. The
easiest way to accomplish this is to set OMP_NUM_THREADS
to the desired number of OpenMP threads per MPI process and then set
I_MPI_PIN_DOMAIN to the value omp:
export I_MPI_PIN_DOMAIN=omp
The longer version is
export I_MPI_PIN_DOMAIN=omp:compact
where compact tells the launcher explicitly to pack the threads of a single MPI process as close together as possible. This layout is the default on current versions of Intel MPI, so it is not really needed to set it. An alternative, when running 1 MPI process per socket, is to set I_MPI_PIN_DOMAIN to socket.
To enforce binding of each OpenMP thread to a particular logical core, one can additionally set OMP_PROC_BIND to true.
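For the running example of 4 MPI ranks per 20-core node (5 threads per rank), the combination of settings discussed above would look like this sketch:
export OMP_NUM_THREADS=5       # 5 OpenMP threads per MPI rank
export I_MPI_PIN_DOMAIN=omp    # each rank gets a domain of OMP_NUM_THREADS logical cores
export OMP_PROC_BIND=true      # keep every thread on the core it started on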
As an example, assume again we want to run the program hybrid_mpi
on 2 nodes containing 20 cores each, running 4 MPI processes per
node, so 5 OpenMP threads per process.
The following are then essential components of the job script:
Specify the resource requirements:
#PBS -l nodes=2:ppn=20
Load the modules, including one that contains Intel MPI:
module load intel
Create a list of unique hosts assigned to the job:
export HOSTS=`sort -u $PBS_NODEFILE | paste -s -d,`
This step is very important; the program will not start with the correct number of MPI ranks if it is not provided with a list of unique host names.
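For instance, with two (hypothetical) node names node2801 and node2802 listed in $PBS_NODEFILE, the variable would contain the comma-separated list of unique hosts:
echo $HOSTS    # prints something like: node2801,node2802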
Set the number of OpenMP threads per MPI process:
export OMP_NUM_THREADS=5
Pin the MPI processes to their MPI domains:
export I_MPI_PIN_DOMAIN=omp
And launch hybrid_mpi using the Intel MPI launcher and
specifying 4 MPI processes per host:
mpirun -hosts $HOSTS -perhost 4 ./hybrid_mpi
In this case we do not need to specify the total number of MPI
ranks: giving the list of hosts and the number of MPI ranks per
host is sufficient, as we want the same number of MPI ranks on each host.
In case you need a more automatic script that is easy to adapt to a different node configuration or different number of processes per node, you can do some of the computations in Bash. The number of processes per node is set in the shell variable
MPI_RANKS_PER_NODE. The above commands become:
#!/bin/bash -l
# Adapt nodes and ppn on the next line according to the cluster you're using!
#PBS -l nodes=2:ppn=20
...
MPI_RANKS_PER_NODE=4
module load intel
export HOSTS=`sort -u $PBS_NODEFILE | paste -s -d,`
export OMP_NUM_THREADS=$(($PBS_NUM_PPN / $MPI_RANKS_PER_NODE))
export OMP_PROC_BIND=true
export I_MPI_PIN_DOMAIN=omp
mpirun -hosts $HOSTS -perhost $MPI_RANKS_PER_NODE ./hybrid_mpi
Intel documentation on hybrid programming
Some documents on the Intel web site that contain more information on developing and running hybrid programs:
- "Interoperability with OpenMP API" in the MPI Reference Manual explains the concept of MPI domains and how they should be used/set for hybrid programs.
- Beginning Hybrid MPI/OpenMP Development, useful if you develop your own code.
Foss toolchain (GCC and Open MPI)
Open MPI has very flexible options for process and thread placement, but they are not always easy to use. There is however also a simple option to indicate the number of logical cores you want to assign to each MPI rank (MPI process):
-cpus-per-proc <num> with <num> the number of logical cores assigned to each MPI rank.
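For instance, for the 2-node, 20-core example with 4 MPI ranks per node (8 ranks in total, 5 cores each), the launch could look like the sketch below; note that newer Open MPI versions may prefer the --map-by ... :PE=<num> syntax instead of -cpus-per-proc:
mpirun -np 8 -cpus-per-proc 5 ./hybrid_mpi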
To further control the thread placement, one can use the standard OpenMP
mechanisms, e.g. the GNU-specific variable GOMP_CPU_AFFINITY
or the OpenMP 3.1 standard environment variable OMP_PROC_BIND.
As long as we want to use all cores, it won't matter whether
OMP_PROC_BIND is set to true, close or spread. However, setting
OMP_PROC_BIND to true is generally a safe choice: it assures that all threads remain on the core they were started on, which improves cache performance.
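A short sketch of the three settings mentioned above (only one should be active at a time):
export OMP_PROC_BIND=true     # bind threads and keep them where they started
# export OMP_PROC_BIND=close  # bind threads on places close to the master thread
# export OMP_PROC_BIND=spread # bind threads spread out over the available places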
Essential elements of our job script are:
#!/bin/bash -l
# Adapt nodes and ppn on the next line according to the cluster you're using!
#PBS -l nodes=2:ppn=20
...
module load foss
export OMP_NUM_THREADS=5
export OMP_PROC_BIND=true
mpirun -cpus-per-proc $OMP_NUM_THREADS ./hybrid_mpi
Open MPI allows a lot of control over process placement and rank assignment. The Open MPI mpirun command has several options that influence this process:
--map-by influences the mapping of processes on the available processing resources
--rank-by influences the rank assignment
--bind-to influences the binding of processes to sets of processing resources
--report-bindings can then be used to report on the process binding (see the sketch after this list).
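As an illustration of these options, the following invocation is a sketch for the 4-ranks-per-node, 5-threads-per-rank example; the ppr/PE modifiers assume a reasonably recent Open MPI version and the exact spelling may differ between releases:
# map 4 ranks per node with 5 processing elements each, bind each rank to its cores,
# and print the resulting binding to standard error
mpirun --map-by ppr:4:node:PE=5 --bind-to core --report-bindings ./hybrid_mpi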
More information can be found in the manual page for mpirun, available on the Open MPI web pages, and in the following presentations: