.. _openmp:
=============================================================
OpenMP
=============================================================
OpenMP is discussed in :ref:`slides` starting with Lecture 13.
Sample codes
------------
There are a few sample codes in the `$CLASSHG/codes/openmp` directory.
See the `README.txt` file for instructions on compiling and executing.
Here is a very simple code that evaluates a costly function at many
points:
.. literalinclude:: ../codes/openmp/yeval.f90
:language: fortran
:linenos:
Note the following:
* Lines starting with `!$` are executed only if the code is compiled with the
  flag `-fopenmp`; otherwise they are treated as comments.
* `x` must be declared a `private` variable in the `omp parallel do` loop,
  so that each thread has its own copy. Otherwise another thread might
  reset `x` between the time it is assigned a value and the time this value
  is used to set `y(i)`.
* The loop index `i` is private by default, but all other variables
  are shared by default (see the sketch after this list).
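
In case the included file is not at hand, here is a minimal sketch of this
pattern. The value of `n` and the function being evaluated are hypothetical
and will differ from the actual `yeval.f90`; compiling with
`gfortran -fopenmp` activates the `!$omp` line, otherwise it is an ordinary
comment and the loop runs serially.

.. code-block:: fortran

    program yeval_sketch

        implicit none
        integer, parameter :: n = 100000   ! hypothetical problem size
        integer :: i
        real(kind=8) :: x, dx
        real(kind=8), dimension(n) :: y

        dx = 1.d0 / (n+1.d0)

        ! x must be private so each thread has its own copy;
        ! i is private by default; y and dx are shared:
        !$omp parallel do private(x)
        do i=1,n
            x = i*dx
            y(i) = exp(x) * cos(x) * sqrt(x + 1.d0)   ! some costly function
            end do

        print *, "Filled vector y of length", n

    end program yeval_sketch
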
Fine-grain vs. coarse-grain parallelism
---------------------------------------
Consider the problem of normalizing a vector by dividing each element by the
1-norm of the vector, defined by :math:`\|x\|_1 = \sum_{i=1}^n |x_i|`.
We must first loop over all points to compute the norm. Then we must loop
over all points again and set :math:`y_i = x_i / \|x\|_1`. Note that we
cannot combine these two loops into a single loop: the norm is not known
until the first loop has finished.
Here is an example with *fine-grain parallelism*, where we use the OpenMP
`omp parallel do` directive or the `omp do` directive within an `omp
parallel` block. (A sketch of this pattern also follows the notes below.)
.. literalinclude:: ../codes/openmp/normalize1.f90
:language: fortran
:linenos:
Note the following:
* We initialize :math:`x_i=i` as a test, so :math:`\|x\|_1 = n(n+1)/2`.
* The OpenMP compiler and run-time system decide how to split the loop
  iterations between threads.
  The loop starting on line 38 might be split differently than the
  loop starting on line 45.
* Because of this, all threads must have access to all of memory.
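
Here is a minimal sketch of this fine-grain pattern (not the actual
`normalize1.f90`; the problem size is hypothetical, and the sample code may
accumulate the norm differently; the `reduction` clause used below is one
standard way to avoid a data race on the shared sum):

.. code-block:: fortran

    program normalize_fine_sketch

        implicit none
        integer, parameter :: n = 1000    ! hypothetical problem size
        integer :: i
        real(kind=8), dimension(n) :: x, y
        real(kind=8) :: norm

        norm = 0.d0

        ! first loop: initialize x and accumulate the 1-norm.
        ! The reduction clause gives each thread a private partial sum
        ! and combines them when the loop ends:
        !$omp parallel do reduction(+ : norm)
        do i=1,n
            x(i) = dble(i)
            norm = norm + abs(x(i))
            end do

        ! second loop: normalize.  This cannot be merged with the first
        ! loop, since norm is not known until that loop is complete.
        !$omp parallel do
        do i=1,n
            y(i) = x(i) / norm
            end do

        print *, "norm = ", norm, "   n(n+1)/2 = ", n*(n+1)/2

    end program normalize_fine_sketch
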
Next is a version with *coarse-grain parallelism*, where we decide how to
split up the array between threads and then execute the same code on each
thread, but each thread computes its own version of `istart` and `iend`
for its portion of the array. With this code we are guaranteed that
thread 0 always handles `x(1)`, for example, so in principle the data could
be distributed. When using OpenMP on a shared-memory computer this doesn't
matter, but this version is more easily generalized to MPI. (A sketch of
this pattern follows the notes below.)
.. literalinclude:: ../codes/openmp/normalize2.f90
:language: fortran
:linenos:
Note the following:
* `istart` and `iend`, the starting and ending values of `i`
taken by each thread, are explicitly computed in terms of the thread
number. We must be careful to handle the case when the number of
threads does not evenly divide `n`.
* Various variables must be declared `private` in lines 37-38.
* `norm` must be initialized to 0 before the `omp parallel` block.
  Otherwise some thread might set it to 0 after another thread has already
  added its `norm_thread` to it.
* The update to `norm` on line 60
  must be in an `omp critical` block, so two threads
  don't try to update it simultaneously (a data race).
* There must be an `omp barrier` on line 65
  between each thread's update of `norm` and
  the use of `norm` to compute each `y(i)`. We must make sure all threads
  have updated `norm`, or it won't have the correct value when we use it.
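
Here is a minimal sketch of this coarse-grain pattern (not the actual
`normalize2.f90`; the problem size and number of threads are hypothetical,
but the structure follows the notes above):

.. code-block:: fortran

    program normalize_coarse_sketch

        use omp_lib
        implicit none
        integer, parameter :: n = 1000    ! hypothetical problem size
        integer :: i, istart, iend, thread_num, nthreads, points_per_thread
        real(kind=8), dimension(n) :: x, y
        real(kind=8) :: norm, norm_thread

        nthreads = 1                      ! value used in serial mode
        !$ nthreads = 4                   ! hypothetical choice of threads
        !$ call omp_set_num_threads(nthreads)

        ! round up so every point is assigned to some thread even when
        ! nthreads does not evenly divide n:
        points_per_thread = (n + nthreads - 1) / nthreads

        do i=1,n
            x(i) = dble(i)
            end do

        norm = 0.d0     ! must be initialized before the parallel block

        !$omp parallel private(i, istart, iend, thread_num, norm_thread)

        thread_num = 0                    ! value used in serial mode
        !$ thread_num = omp_get_thread_num()

        ! this thread's portion of the array:
        istart = thread_num * points_per_thread + 1
        iend = min((thread_num+1) * points_per_thread, n)

        ! each thread sums its own portion:
        norm_thread = 0.d0
        do i=istart,iend
            norm_thread = norm_thread + abs(x(i))
            end do

        ! only one thread at a time may add its contribution to norm:
        !$omp critical
        norm = norm + norm_thread
        !$omp end critical

        ! wait until every thread has updated norm before using it:
        !$omp barrier

        do i=istart,iend
            y(i) = x(i) / norm
            end do

        !$omp end parallel

        print *, "norm = ", norm, "   n(n+1)/2 = ", n*(n+1)/2

    end program normalize_coarse_sketch
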
For a comparison of fine-grain and
coarse-grain parallelism applied to Jacobi iteration, see
* :ref:`jacobi1`
* :ref:`jacobi2_omp`
Further reading
---------------
* :ref:`biblio_openmp` in bibliography
* `OpenMP home page <http://www.openmp.org/>`_
* `Livermore tutorial `_
* `NERSC tutorial `_