UW AMath High Performance Scientific Computing
 
AMath 483/583 Class Notes
 
Spring Quarter, 2013

Homework 3 [2014] Discussion of solutions

Sample solutions can be found in $UWHPSC/solutions/homework3.

Some common errors that people made:

  • If you have a module variable such as nsamples declared in the approx_norm module but you do not use this module in your main program, or use only certain other variables from it, then nsamples will not be defined in the main program. If you use implicit none you should get a compiler error if you don’t have it declared. Otherwise, or if you declare it as a local variable in the main program, then setting it there does not set the module variable, so other places it is used, such as in your subroutine, it might have the value 0 or possibly some garbage.
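
    For example, a minimal sketch of the intended structure (the actual module in the solutions contains more than just this variable):

    module approx_norm
        implicit none
        integer :: nsamples   ! module variable
    end module approx_norm

    program demo
        use approx_norm, only: nsamples  ! without this, nsamples is undefined here
        implicit none
        nsamples = 1000   ! sets the module variable seen by any subroutine
                          ! that also uses the module
        print *, "nsamples = ", nsamples
    end program demo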

  • Several people tried to use if clauses in an omp parallel directive in order to switch between Method 1 and Method 2 for parallelizing approx_norm1. This is very hard to do properly. The easiest way to handle this problem is simply to write

    if (method==1) then
          ! all the code for method 1
    else
          ! all the code for method 2
    endif
    

    with the appropriate omp parallel directives in each section. This means some duplication of code, which it would be nice to avoid, but it is both much clearer and much easier to get right.
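
    For instance, a fleshed-out sketch of this structure, with trivial loop bodies standing in for the actual computation:

    program method_switch
        implicit none
        integer :: i, k, method, n, nsamples
        real(kind=8) :: total
        method = 1
        n = 50
        nsamples = 1000
        total = 0.d0
        if (method == 1) then
            ! Method 1: fork threads once and split the sample loop
            !$omp parallel do reduction(+ : total)
            do k=1,nsamples
                do i=1,n
                    total = total + 1.d0
                    enddo
                enddo
        else
            ! Method 2: fork threads anew on every pass through the sample loop
            do k=1,nsamples
                !$omp parallel do reduction(+ : total)
                do i=1,n
                    total = total + 1.d0
                    enddo
                enddo
        endif
        print *, "total = ", total
    end program method_switch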

    if clauses in directives are best used for things like:

    !$omp parallel do if(n>100000)
    do i=1,n
        ! work on i
        enddo
    

    Note that the loop is always done, regardless of the value of n. The if clause only controls whether it is done by the master thread alone or whether additional threads are forked and the work is split up.
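
    Here is a self-contained sketch of this behavior; the thread number is looked up only when compiled with OpenMP (see the note on !$ spacing below):

    program if_clause_demo
        !$ use omp_lib
        implicit none
        integer :: i, n, tid
        n = 10   ! increase n beyond 100000 to see multiple threads (and much output)
        !$omp parallel do if(n > 100000) private(tid)
        do i=1,n
            tid = 0
            !$ tid = omp_get_thread_num()
            print *, "i = ", i, " on thread ", tid
            enddo
    end program if_clause_demo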

  • Many people were missing reduction clauses in parallel do loops. Remember that if you are summing up elements of a vector or taking a maximum over different columns, then this is a reduction. If a shared variable is updated by more than one thread without this clause, there is a race condition: the code might still get the right answer, but it is likely to get the wrong answer sometimes.
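
    For example, a minimal sketch with both kinds of reduction (a sum and a max) and a toy loop body:

    program reduction_demo
        implicit none
        integer :: i, n
        real(kind=8) :: total, biggest
        n = 1000000
        total = 0.d0
        biggest = 0.d0
        ! without the reduction clauses there would be a race condition on
        ! total and biggest; with them, each thread accumulates its own
        ! partial result and the results are combined at the end of the loop
        !$omp parallel do reduction(+ : total) reduction(max : biggest)
        do i=1,n
            total = total + 1.d0/i
            biggest = max(biggest, 1.d0/i)
            enddo
        print *, "total = ", total, "  biggest = ", biggest
    end program reduction_demo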

  • Remember that in a parallel do loop the loop index is automatically private to each thread, so you do not need to declare it yourself; the private(i) clause here is redundant (though harmless):

    !$omp parallel do private(i)
    do i=1,n
        ! work on i
        enddo
    
  • Also you don’t want to declare n private in this loop. If you do, it will create a private n that is uninitialized for each thread and the loop will not execute properly. (It would work if you declared firstprivate(n), so that each thread’s copy is initialized to the master thread’s value going into the loop, but there’s no need for each thread to have a private n.)
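
    A sketch of the firstprivate variant, which works but is unnecessary here:

    program firstprivate_demo
        implicit none
        integer :: i, n
        n = 10
        ! each thread's private copy of n is initialized to 10 on entry;
        ! with private(n) instead, the copies would be uninitialized garbage
        !$omp parallel do firstprivate(n)
        do i=1,n
            print *, "i = ", i
            enddo
    end program firstprivate_demo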

  • Note that if you create a parallel block using omp parallel and then have a do loop inside it, the do loop will be executed in full by each thread, for example:

    !$omp parallel
    do i=1,n
        ! work on i
        enddo
    !$omp end parallel
    

    will execute all n iterations on each thread, so the body runs n times the number of threads in total.

    If you have a parallel block and within it you want to split the loop iterations between the threads, you need to add an omp do directive:

    !$omp parallel
    ! do other stuff perhaps
    !$omp do
    do i=1,n
        ! work on i
        enddo
    !$omp end parallel
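
    As a quick demonstration of the difference, here is a sketch that counts the iterations both ways:

    program parallel_vs_do
        implicit none
        integer :: i, n, total
        n = 4

        total = 0
        !$omp parallel reduction(+ : total)
        do i=1,n              ! every thread runs all n iterations
            total = total + 1
            enddo
        !$omp end parallel
        print *, "without omp do: ", total   ! n times the number of threads

        total = 0
        !$omp parallel
        !$omp do reduction(+ : total)
        do i=1,n              ! iterations are split between the threads
            total = total + 1
            enddo
        !$omp end parallel
        print *, "with omp do:    ", total   ! n, regardless of thread count
    end program parallel_vs_do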
    
  • Spacing is important in omp statements...

    There should be no space between !$ and omp in a directive, e.g.

    !$omp parallel do
    

    not:

    !$ omp parallel do
    

    But there should be a space following !$ if the line is an ordinary Fortran statement that should be executed only if compiled with OpenMP, e.g.

    !$ nthreads = 4
    

    not:

    !$nthreads = 4
    

    The latter form will simply be interpreted as a comment even when compiling with OpenMP!
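
    For example, a complete sketch using the second form, which runs correctly whether or not it is compiled with OpenMP:

    program sentinel_demo
        !$ use omp_lib
        implicit none
        integer :: nthreads
        nthreads = 1                          ! value if compiled without OpenMP
        !$ nthreads = omp_get_max_threads()   ! executed only with OpenMP
        print *, "nthreads = ", nthreads
    end program sentinel_demo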

Some other comments:

  • Note that in Method 2, threads are being forked twice for each sample because of the two parallel do loops in the loop on k. If nsamples is large and n is not very large, then this is much slower than Method 1, in which threads are forked only once at the start, before the loop on samples. The main program in the sample solution has had some timing statements added to help illustrate this:

    $ make test n=50 nsamples=100000 method=1
    Wrote data to input_data.txt
    ./test.exe
     nthreads =            2
     seed1 for random number generator:       12345
    CPU time =   0.72087800 seconds
    Elapsed time =   0.46263900 seconds
    Using matrix with n =    50   True 1-norm:          28.091553
    Based on 100000 samples, Approximate 1-norm:        25.297151
    

    For this case Method 1 takes less than 1 second, and the elapsed (wall clock) time is considerably shorter than the CPU time, as expected when two threads are doing useful work in parallel (the CPU time sums the time used by all threads).

    But with Method 2, it takes more than 4 seconds and most of the time is spent forking threads, not doing useful computation:

    $ make test n=50 nsamples=100000 method=2
    Wrote data to input_data.txt
    ./test.exe
     nthreads =            2
     seed1 for random number generator:       12345
    CPU time =   4.77755200 seconds
    Elapsed time =   4.04603004 seconds
    Using matrix with n =    50   True 1-norm:          28.091553
    Based on 100000 samples, Approximate 1-norm:        25.297151
    

    Method 2 may be the obvious way to introduce parallelism if you’re starting with a serial code, since you need only add omp parallel do directives, but for this problem it is hard to find a case where it works as well as Method 1.

  • The sample solution for approx_norm.f90 contains code for Methods 1 and 2, and also variants of these called Methods 3 and 4. In these variants, the random number generator is called only once, generating all the random numbers needed for the full set of sample vectors x, rather than once for each sample. If you compare timings, this can be somewhat faster. You might expect it to be faster still to have different threads calling the generator simultaneously, each for a shorter array of results, but as noted in Homework 4 [2014], the random_number routine has been made thread safe by including critical sections that allow only one thread at a time to access the internal state.
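
    The idea behind these variants is roughly as in the following sketch (variable names here are illustrative, not necessarily those in the sample solution):

    program batch_random
        implicit none
        integer :: k, n, nsamples
        real(kind=8), allocatable :: x(:), xr(:)
        n = 50
        nsamples = 1000
        allocate(x(n), xr(n*nsamples))
        call random_number(xr)          ! one call fills the numbers for all samples
        do k=1,nsamples
            x = xr((k-1)*n+1 : k*n)     ! components of the k-th sample vector
            ! ... use x for sample k ...
            enddo
    end program batch_random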