.. _openmp:

=============================================================
OpenMP
=============================================================

OpenMP is discussed in :ref:`slides` starting with Lecture 13.

Sample codes
------------

There are a few sample codes in the ``$CLASSHG/codes/openmp`` directory.  See
the ``README.txt`` file for instructions on compiling and executing them.

Here is a very simple code that evaluates a costly function at many points:

.. literalinclude:: ../codes/openmp/yeval.f90
   :language: fortran
   :linenos:

Note the following:

* Lines starting with ``!$`` are only executed if the code is compiled and run
  with the flag ``-fopenmp``; otherwise they are comments.

* ``x`` must be declared a private variable in the ``omp parallel do`` loop,
  so that each thread has its own version.  Otherwise another thread might
  reset ``x`` between the time it is assigned a value and the time this value
  is used to set ``y(i)``.

* The loop iterator ``i`` is private by default, but all other variables are
  shared by default.

Fine-grain vs. coarse-grain parallelism
---------------------------------------

Consider the problem of normalizing a vector by dividing each element by the
1-norm of the vector, defined by :math:`\|x\|_1 = \sum_{i=1}^n |x_i|`.

We must first loop over all points to compute the norm.  Then we must loop
over all points and set :math:`y_i = x_i / \|x\|_1`.  Note that we cannot
combine these two loops into a single loop!

Here is an example with *fine-grain parallelism*, where we use the OpenMP
``omp parallel do`` directive, or the ``omp do`` directive within an
``omp parallel`` block.

.. literalinclude:: ../codes/openmp/normalize1.f90
   :language: fortran
   :linenos:

Note the following:

* We initialize :math:`x_i = i` as a test, so :math:`\|x\|_1 = n(n+1)/2`.

* The compiler decides how to split each loop between threads.  The loop
  starting on line 38 might be split differently than the loop starting on
  line 45.

* Because of this, all threads must have access to all of memory.

Next is a version with *coarse-grain parallelism*, where we decide how to
split up the array between threads and then execute the same code on each
thread, but each thread will compute its own version of ``istart`` and
``iend`` for its portion of the array.  With this code we are guaranteed that
thread 0 always handles ``x(1)``, for example, so in principle the data could
be distributed.  When using OpenMP on a shared memory computer this doesn't
matter, but this version is more easily generalized to MPI.

.. literalinclude:: ../codes/openmp/normalize2.f90
   :language: fortran
   :linenos:

Note the following:

* ``istart`` and ``iend``, the starting and ending values of ``i`` taken by
  each thread, are explicitly computed in terms of the thread number.  We
  must be careful to handle the case when the number of threads does not
  evenly divide ``n``.

* Various variables must be declared private in lines 37-38.

* ``norm`` must be initialized to 0 before the ``omp parallel`` block.
  Otherwise some thread might set it to 0 after another thread has already
  updated it by its ``norm_thread``.

* The update to ``norm`` on line 60 must be in an ``omp critical`` block, so
  two threads don't try to update it simultaneously (a data race).

* There must be an ``omp barrier`` on line 65 between updating ``norm`` by
  each thread and using ``norm`` to compute each ``y(i)``.  We must make sure
  all threads have updated ``norm``, or it won't have the correct value when
  we use it.
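As a rough illustration of the difference between the two styles, here are two
minimal sketches of the normalization problem.  They are *not* the
``normalize1.f90`` and ``normalize2.f90`` codes themselves (so the line
numbers cited above do not apply to them), and the fine-grain sketch uses a
``reduction`` clause to accumulate the norm, which is just one possible way to
avoid the data race:

.. code-block:: fortran

    ! Fine-grain sketch: the compiler/runtime splits each loop across threads.
    program normalize_fine
        implicit none
        integer, parameter :: n = 1000
        real(kind=8), dimension(n) :: x, y
        real(kind=8) :: norm
        integer :: i

        do i = 1, n
            x(i) = dble(i)       ! test data, so ||x||_1 = n(n+1)/2
        enddo

        norm = 0.d0
        !$omp parallel do reduction(+:norm)
        do i = 1, n
            norm = norm + abs(x(i))
        enddo
        !$omp end parallel do

        !$omp parallel do
        do i = 1, n
            y(i) = x(i) / norm
        enddo
        !$omp end parallel do

        print *, "norm = ", norm, "   y(n) = ", y(n)
    end program normalize_fine

Here is a corresponding coarse-grain sketch of the pattern described above,
with each thread computing its own ``istart`` and ``iend``.  The helper names
``nthreads`` and ``points_per_thread`` are illustrative and need not match the
sample code:

.. code-block:: fortran

    ! Coarse-grain sketch: each thread works only on its own slice of the array.
    program normalize_coarse
        use omp_lib
        implicit none
        integer, parameter :: n = 1000
        real(kind=8), dimension(n) :: x, y
        real(kind=8) :: norm, norm_thread
        integer :: i, nthreads, points_per_thread, thread_num, istart, iend

        do i = 1, n
            x(i) = dble(i)
        enddo

        norm = 0.d0        ! must be initialized before the parallel block

        !$omp parallel private(i, norm_thread, thread_num, istart, iend, &
        !$omp&                 nthreads, points_per_thread)

        ! each thread computes the bounds of its own portion of the array
        nthreads = omp_get_num_threads()
        points_per_thread = (n + nthreads - 1) / nthreads   ! round up
        thread_num = omp_get_thread_num()
        istart = thread_num * points_per_thread + 1
        iend = min((thread_num + 1) * points_per_thread, n)

        ! each thread sums |x(i)| over its own portion
        norm_thread = 0.d0
        do i = istart, iend
            norm_thread = norm_thread + abs(x(i))
        enddo

        ! only one thread at a time may update the shared variable norm
        !$omp critical
        norm = norm + norm_thread
        !$omp end critical

        ! wait until every thread has added its contribution to norm
        !$omp barrier

        do i = istart, iend
            y(i) = x(i) / norm
        enddo

        !$omp end parallel

        print *, "norm = ", norm, "   y(n) = ", y(n)
    end program normalize_coarse

Both sketches can be compiled with ``gfortran -fopenmp``.  The coarse-grain
sketch uses ``omp_lib`` directly for brevity, so it requires OpenMP support;
as with ``yeval.f90``, the ``!$`` sentinel can be used instead so that a code
also compiles without the flag.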
For comparison of fine-grain and coarse-grain parallelism on Jacobi iteration,
see

* :ref:`jacobi1`
* :ref:`jacobi2_omp`

Further reading
---------------

* :ref:`biblio_openmp` in the bibliography
* Livermore tutorial
* NERSC tutorial