UW AMath High Performance Scientific Computing
 
AMath 483/583 Class Notes
 
Spring Quarter, 2013

Table Of Contents

Previous topic

Fortran example for Newton’s method

Next topic

MPI

This Page

OpenMP

OpenMP is discussed in slides starting with Lecture 13.

Sample codes

There are a few sample codes in the $UWHPSC/codes/openmp directory. See the README.txt file for instructions on compiling and executing.

Here is a very simple code, that simply evaluates a costly function at many points:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
! $UWHPSC/codes/openmp/yeval.f90

! If this code gives a Segmentation Fault when compiled and run with -fopenmp 
! Then you could try:
!   $ ulimit -s unlimited
! to increase the allowed stack size.
! This may not work on all computers.  On a Mac the best you can do is
!   $ ulimit -s hard
! Correction... you can do the following on a Mac:
!   $ gfortran -fopenmp -Wl,-stack_size,2faf2000 yeval.f90 
! here 2faf2000 is a hexadecimal number slightly larger than 8e8, the 
! number of bytes needed for the value of n used in this program.


program yeval
   
   use omp_lib
   implicit none
   integer, parameter :: n = 100000000
   integer :: i, nthreads
   real(kind=8), dimension(n) :: y
   real(kind=8) :: dx, x

   ! Specify number of threads to use:
   !$ print *, "How many threads to use? "
   !$ read *, nthreads
   !$ call omp_set_num_threads(nthreads)
   !$ print "('Using OpenMP with ',i3,' threads')", nthreads

   dx = 1.d0 / (n+1.d0)

   !$omp parallel do private(x) 
   do i=1,n
      x = i*dx
      y(i) = exp(x)*cos(x)*sin(x)*sqrt(5*x+6.d0)
   enddo

   print *, "Filled vector y of length", n

end program yeval

Note the following:

  • Lines starting with !$ are only executed if the code is compiled and run with the flag -fopenmp, otherwise they are comments.

  • x must be declared a private variable in the omp parallel do loop, so that each thread has its own version. Otherwise another thread might reset x between the time its assigned a value and the time this value is used to set y(i).

  • The loop iterator i is private by default, but all other varaibles are shared by default.

  • If you try to run this and get a “segmentation fault”, it is probably because the stack size limit is too small. You can see the limit with:

    $ ulimit -s
    

    On linux you can do:

    $ ulimit -s unlimited
    

    But on a Mac there is a hard limit and the best you can do is:

    $ ulimit -s hard
    

    If you still get a segmentation fault you will have to decrease n for this example.

Other directives

This example illustrates some directives beyond the parallel do:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
! $UWHPSC/codes/openmp/demo2

program demo2
   
   use omp_lib
   implicit none
   integer :: i
   integer, parameter :: n = 100000
   real(kind=8), dimension(n) :: x,y,z
   
   ! Specify number of threads to use:
   !$ call omp_set_num_threads(2)

   !$omp parallel  ! spawn two threads
   !$omp sections  ! split up work between them

     !$omp section
     x = 1.d0   ! one thread initializes x array

     !$omp section
     y = 1.d0   ! another thread initializes y array

   !$omp end sections
   !$omp barrier   ! not needed, implied at end of sections

   !$omp single    ! only want to print once:
   print *, "Done initializing x and y"
   !$omp end single nowait  ! ok for other thread to continue

   !$omp do   ! split work between threads:
   do i=1,n
       z(i) = x(i) + y(i)
       enddo

   !$omp end parallel
   print *, "max value of z is ",maxval(z)
    

end program demo2

Notes:

  • !$omp sections is used to split up work between threads
  • There is an implicit barrier after !$omp end sections, so the explicit barrier here is optional.
  • The print statement is only done once since it is in !$omp single. The nowait clause indicates that the other thread can proceed without waiting for this print to be executed.

Fine-grain vs. coarse-grain paralellism

Consider the problem of normalizing a vector by dividing each element by the 1-norm of the vector, defined by \(\|x\|_1 = \sum_{i=1}^n |x_i|\).

We must first loop over all points to compute the norm. Then we must loop over all points and set \(y_i = x_i / \|x\|_1\). Note that we cannot combine these two loops into a single loop!

Here is an example with fine-grain paralellism, where we use the OpenMP omp parallel do directive or the omp do directive within a omp parallel block.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
! $UWHPSC/codes/openmp/normalize1.f90

! Example of normalizing a vector using fine-grain parallelism.

program main
   
    use omp_lib
    implicit none
    integer :: i, thread_num
    integer, parameter :: n = 1000
 
    real(kind=8), dimension(n) :: x, y
    real(kind=8) :: norm,ynorm
 
    integer :: nthreads
    
    ! Specify number of threads to use:
    nthreads = 1       ! need this value in serial mode
    !$ nthreads = 4    
    !$ call omp_set_num_threads(nthreads)
    !$ print "('Using OpenMP with ',i3,' threads')", nthreads

    ! Specify number of threads to use:
    !$ call omp_set_num_threads(4)
 
    ! initialize x:
    !$omp parallel do 
    do i=1,n
        x(i) = dble(i)  ! convert to double float
    enddo

    norm = 0.d0
    ynorm = 0.d0

    !$omp parallel private(i)

    !$omp do reduction(+ : norm)
    do i=1,n
        norm = norm + abs(x(i))
        enddo

     !$omp barrier   ! not needed (implicit)

    !$omp do reduction(+ : ynorm)
    do i=1,n
        y(i) = x(i) / norm
        ynorm = ynorm + abs(y(i))
        enddo

    !$omp end parallel

    print *, "norm of x = ",norm, "  n(n+1)/2 = ",n*(n+1)/2
    print *, 'ynorm should be 1.0:   ynorm = ', ynorm

end program main

Note the following:

  • We initialize \(x_i=i\) as a test, so \(\|x\|_1 = n(n+1)/2\).
  • The compiler decides how to split the loop between threads. The loop starting on line 38 might be split differently than the loop starting on line 45.
  • Because of this, all threads must have access to all of memory.

Next is a version with coarse-grain parallelism, were we decide how to split up the array between threads and then execute the same code on each thread, but each thread will compute its own version of istart and iend for its portion of the array. With this code we are guaranteed that thread 0 always handles x(1), for example, so in principle the data could be distributed. When using OpenMP on a shared memory computer this doesn’t matter, but this version is more easily generalized to MPI.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
! $UWHPSC/codes/openmp/normalize2.f90

! Example of normalizing a vector using coarse-grain parallelism.

program main
    
    use omp_lib
    implicit none
    integer, parameter :: n = 1000
    real(kind=8), dimension(n) :: x,y
    real(kind=8) :: norm,norm_thread,ynorm,ynorm_thread
    integer :: nthreads, points_per_thread,thread_num
    integer :: i,istart,iend

    ! Specify number of threads to use:
    nthreads = 1       ! need this value in serial mode
    !$ nthreads = 4    
    !$ call omp_set_num_threads(nthreads)
    !$ print "('Using OpenMP with ',i3,' threads')", nthreads

    ! Determine how many points to handle with each thread.
    ! Note that dividing two integers and assigning to an integer will
    ! round down if the result is not an integer.  
    ! This, together with the min(...) in the definition of iend below,
    ! insures that all points will get distributed to some thread.
    points_per_thread = (n + nthreads - 1) / nthreads
    print *, "points_per_thread = ",points_per_thread

    ! initialize x:
    do i=1,n
        x(i) = dble(i)  ! convert to double float
        enddo

    norm = 0.d0
    ynorm = 0.d0

    !$omp parallel private(i,norm_thread, &
    !$omp                  istart,iend,thread_num,ynorm_thread) 

    thread_num = 0     ! needed in serial mode
    !$ thread_num = omp_get_thread_num()    ! unique for each thread

    ! Determine start and end index for the set of points to be 
    ! handled by this thread:
    istart = thread_num * points_per_thread + 1
    iend = min((thread_num+1) * points_per_thread, n)

    !$omp critical
    print 201, thread_num, istart, iend
    !$omp end critical
201 format("Thread ",i2," will take i = ",i6," through i = ",i6)

    norm_thread = 0.d0
    do i=istart,iend
        norm_thread = norm_thread + abs(x(i))
        enddo

    ! update global norm with value from each thread:
    !$omp critical
      norm = norm + norm_thread
      print *, "norm updated to: ",norm
    !$omp end critical

    ! make sure all have updated norm before proceeding:
    !$omp barrier

    ynorm_thread = 0.d0
    do i=istart,iend
        y(i) = x(i) / norm
        ynorm_thread = ynorm_thread + abs(y(i))
        enddo

    ! update global ynorm with value from each thread:
    !$omp critical
      ynorm = ynorm + ynorm_thread
      print *, "ynorm updated to: ",ynorm
    !$omp end critical
    !$omp barrier

    !$omp end parallel 

    print *, "norm of x = ",norm, "  n(n+1)/2 = ",n*(n+1)/2
    print *, 'ynorm should be 1.0:   ynorm = ', ynorm

end program main

Note the following:

  • istart and iend, the starting and ending values of i taken by each thread, are explicitly computed in terms of the thread number. We must be careful to handle the case when the number of threads does not evenly divide n.
  • Various variables must be declared private in lines 37-38.
  • norm must be initialized to 0 before the omp parallel block. Otherwise some thread might set it to 0 after another thread has already updated it by its norm_thread.
  • The update to norm on line 60 must be in a omp critical block, so two threads don’t try to update it simultaneously (data race).
  • There must be an omp barrier on line 65 between updating norm by each thread and using norm to compute each y(i). We must make sure all threads have updated norm or it won’t have the correct value when we use it.

For comparison of fine-grain and coarse-grain parallelism on Jacobi iteration, see

Further reading