.. _openmp:
=============================================================
OpenMP
=============================================================
OpenMP is discussed in :ref:`slides` starting with Lecture 13.
Sample codes
------------
There are a few sample codes in the `$CLASSHG/codes/openmp` directory.
See the `README.txt` file for instructions on compiling and executing.
Here is a very simple code that evaluates a costly function at many
points:
.. literalinclude:: ../codes/openmp/yeval.f90
:language: fortran
:linenos:
Note the following:
* Lines starting with `!$` are executed only if the code is compiled with the
  flag `-fopenmp`; otherwise they are treated as comments.
* `x` must be declared a `private` variable in the `omp parallel do` loop,
  so that each thread has its own copy. Otherwise another thread might
  reset `x` between the time it is assigned a value and the time this value
  is used to set `y(i)`.
* The loop index `i` is private by default, but all other variables
  are shared by default (see the sketch after this list).
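
In case the included file is not at hand, here is a minimal sketch of this
pattern. The value of `n` and the function being evaluated are hypothetical
and will differ from the actual `yeval.f90`; compiling with
`gfortran -fopenmp` activates the `!$omp` line, otherwise it is an ordinary
comment and the loop runs serially.

.. code-block:: fortran

    program yeval_sketch

        implicit none
        integer, parameter :: n = 100000   ! hypothetical problem size
        integer :: i
        real(kind=8) :: x, dx
        real(kind=8), dimension(n) :: y

        dx = 1.d0 / (n+1.d0)

        ! x must be private so each thread has its own copy;
        ! i is private by default; y and dx are shared:
        !$omp parallel do private(x)
        do i=1,n
            x = i*dx
            y(i) = exp(x) * cos(x) * sqrt(x + 1.d0)   ! some costly function
            end do

        print *, "Filled vector y of length", n

    end program yeval_sketch
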
Fine-grain vs. coarse-grain parallelism
---------------------------------------
Consider the problem of normalizing a vector by dividing each element by the
1-norm of the vector, defined by :math:`\|x\|_1 = \sum_{i=1}^n |x_i|`.
We must first loop over all points to compute the norm. Then we must loop
over all points again and set :math:`y_i = x_i / \|x\|_1`. Note that we
cannot combine these two loops into a single loop: the norm is not known
until the first loop has finished.
Here is an example with *fine-grain parallelism*, where we use the OpenMP
`omp parallel do` directive or the `omp do` directive within an `omp
parallel` block. (A sketch of this pattern also follows the notes below.)
.. literalinclude:: ../codes/openmp/normalize1.f90
:language: fortran
:linenos:
Note the following:
* We initialize :math:`x_i=i` as a test, so :math:`\|x\|_1 = n(n+1)/2`.
* The OpenMP compiler and run-time system decide how to split the loop
  iterations between threads.
  The loop starting on line 38 might be split differently than the
  loop starting on line 45.
* Because of this, all threads must have access to all of memory.
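
Here is a minimal sketch of this fine-grain pattern (not the actual
`normalize1.f90`; the problem size is hypothetical, and the sample code may
accumulate the norm differently; the `reduction` clause used below is one
standard way to avoid a data race on the shared sum):

.. code-block:: fortran

    program normalize_fine_sketch

        implicit none
        integer, parameter :: n = 1000    ! hypothetical problem size
        integer :: i
        real(kind=8), dimension(n) :: x, y
        real(kind=8) :: norm

        norm = 0.d0

        ! first loop: initialize x and accumulate the 1-norm.
        ! The reduction clause gives each thread a private partial sum
        ! and combines them when the loop ends:
        !$omp parallel do reduction(+ : norm)
        do i=1,n
            x(i) = dble(i)
            norm = norm + abs(x(i))
            end do

        ! second loop: normalize.  This cannot be merged with the first
        ! loop, since norm is not known until that loop is complete.
        !$omp parallel do
        do i=1,n
            y(i) = x(i) / norm
            end do

        print *, "norm = ", norm, "   n(n+1)/2 = ", n*(n+1)/2

    end program normalize_fine_sketch
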
Next is a version with *coarse-grain parallelism*, where we decide how to
split up the array between threads and then execute the same code on each
thread, but each thread computes its own version of `istart` and `iend`
for its portion of the array. With this code we are guaranteed that
thread 0 always handles `x(1)`, for example, so in principle the data could
be distributed. When using OpenMP on a shared-memory computer this doesn't
matter, but this version is more easily generalized to MPI. (A sketch of
this pattern follows the notes below.)
.. literalinclude:: ../codes/openmp/normalize2.f90
:language: fortran
:linenos:
Note the following:
* `istart` and `iend`, the starting and ending values of `i`
taken by each thread, are explicitly computed in terms of the thread
number. We must be careful to handle the case when the number of
threads does not evenly divide `n`.
* Various variables must be declared `private` in lines 37-38.
* `norm` must be initialized to 0 before the `omp parallel` block.
  Otherwise some thread might set it to 0 after another thread has already
  added its `norm_thread` to it.
* The update to `norm` on line 60
  must be in an `omp critical` block, so two threads
  don't try to update it simultaneously (a data race).
* There must be an `omp barrier` on line 65
  between each thread's update of `norm` and
  the use of `norm` to compute each `y(i)`. We must make sure all threads
  have updated `norm`, or it won't have the correct value when we use it.
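
Here is a minimal sketch of this coarse-grain pattern (not the actual
`normalize2.f90`; the problem size and number of threads are hypothetical,
but the structure follows the notes above):

.. code-block:: fortran

    program normalize_coarse_sketch

        use omp_lib
        implicit none
        integer, parameter :: n = 1000    ! hypothetical problem size
        integer :: i, istart, iend, thread_num, nthreads, points_per_thread
        real(kind=8), dimension(n) :: x, y
        real(kind=8) :: norm, norm_thread

        nthreads = 1                      ! value used in serial mode
        !$ nthreads = 4                   ! hypothetical choice of threads
        !$ call omp_set_num_threads(nthreads)

        ! round up so every point is assigned to some thread even when
        ! nthreads does not evenly divide n:
        points_per_thread = (n + nthreads - 1) / nthreads

        do i=1,n
            x(i) = dble(i)
            end do

        norm = 0.d0     ! must be initialized before the parallel block

        !$omp parallel private(i, istart, iend, thread_num, norm_thread)

        thread_num = 0                    ! value used in serial mode
        !$ thread_num = omp_get_thread_num()

        ! this thread's portion of the array:
        istart = thread_num * points_per_thread + 1
        iend = min((thread_num+1) * points_per_thread, n)

        ! each thread sums its own portion:
        norm_thread = 0.d0
        do i=istart,iend
            norm_thread = norm_thread + abs(x(i))
            end do

        ! only one thread at a time may add its contribution to norm:
        !$omp critical
        norm = norm + norm_thread
        !$omp end critical

        ! wait until every thread has updated norm before using it:
        !$omp barrier

        do i=istart,iend
            y(i) = x(i) / norm
            end do

        !$omp end parallel

        print *, "norm = ", norm, "   n(n+1)/2 = ", n*(n+1)/2

    end program normalize_coarse_sketch
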
For a comparison of fine-grain and
coarse-grain parallelism applied to Jacobi iteration, see
* :ref:`jacobi1`
* :ref:`jacobi2_omp`
Further reading
---------------
* :ref:`biblio_openmp` in bibliography
* `OpenMP home page <http://www.openmp.org/>`_
* `Livermore tutorial `_
* `NERSC tutorial `_