UW AMath High Performance Scientific Computing
 
Coursera Edition

Table Of Contents

This Page

Homework 5

The goals of this homework are to:

  • Get more experience with Fortran and OpenMP.
  • Experiment with coarse grain parallelism.
  • Learn a bit more about quadrature.
  1. I wrote an article recently urging applied mathematicians to share research code, which appeared in SIAM News. Some colleagues at other universities have told me it’s required reading for their students, so of course I have to assign it too! It’s pretty light reading.

  2. The IPython notebook $UWHPSC/codes/homework5/notebook/quadrature2.ipynb is an improved version of the notebook from the last homework. Some of the functions have been made more general and a discussion has been added of Simpson’s Rule, a more accurate formula than Trapezoid.

    This notebook is best viewed live so that you can experiment with changing things in order to explore these examples. If you have a sufficiently recent version of the notebok installed (see IPython_notebook) then you should be able to do:

    $ cd $UWHPSC/codes/homework5/notebook
    $ ipython notebook --pylab inline

    and then click on the quadrature2 notebook.

    You can also view it at https://www.wakari.io/sharing/bundle/rjleveque/quadrature2 . If you have created a Wakari account, you should be able to download it to your account to try it out.

    You can also view a static version of the notebook, which is in $UWHPSC/codes/homework5/notebook/quadrature2.pdf.

    There is also a Python script version of the code in the notebook at $UWHPSC/codes/homework5/notebook/quadrature.py if you find that easier to experiment with. (But see the comments at the top of that file.)

    Experiment with the notebook or the module to make sure you understand the material presented.

    Study this code to make sure you understand it.

  3. The directory $UWHPSC/codes/homework5/ also contains Fortran code that implements the last part of homework 4, with some added enhancements. In particular:

    • timing has been added
    • a counter has been added to the f2 function to count how many times it is called, and this is printed at the end.
    • fevals is a module variable (shared between threads) to store the number of function calls by each thread when OpenMP is used.
    • the error_table subroutine has a new input parameter method to pass in the function that approximates the integral. When this is called from the main program test.f90 the function trapezoid is passed in. In this homework you will also test Simpson’s rule.
    • The function f2 has been moved to a module and the parameter k is a module variable.

    Study this code and experiment with it.

    With 4 threads it might produce something like the following (timings will depend on how many cores you have):

    Using  4 threads
    true integral:   6.00136745954910E+00
    
           n         approximation        error       ratio
          50  6.00200615142458E+00    6.387E-04    0.000E+00
         100  6.01762134207395E+00    1.625E-02    3.929E-02
         200  5.99787907396672E+00    3.488E-03    4.659E+00
         400  5.99537682567465E+00    5.991E-03    5.823E-01
         800  6.00057196798962E+00    7.955E-04    7.531E+00
        1600  6.00118591794817E+00    1.815E-04    4.382E+00
        3200  6.00132301603504E+00    4.444E-05    4.085E+00
        6400  6.00135640717690E+00    1.105E-05    4.021E+00
       12800  6.00136470029559E+00    2.759E-06    4.006E+00
       25600  6.00136677000209E+00    6.895E-07    4.002E+00
       51200  6.00136728718235E+00    1.724E-07    4.000E+00
      102400  6.00136741645906E+00    4.309E-08    4.000E+00
    
    Elapsed time =   0.00554200 seconds
    CPU time =   0.01890300 seconds
    fevals by thread  0:         51211
    fevals by thread  1:         51187
    fevals by thread  2:         51187
    fevals by thread  3:         51165
    Total number of fevals:     204750

    You do not need to code anything for this part.

  4. Create a new subdirectory $MYHPSC/homework5 for the code you write below. You can use the code provided as a starting point.

    Create a new module quadrature2.f90 by starting with quadrature.f90 and adding a new function simpson that implements Simpson’s rule. It should have the same input arguments as trapezoid.

    Write a new main program test2.f90 to test this. Check that it is 4th order accurate on the function f2 provided with various values of k. Check it also with some other functions if you want.

    Note: this module should also be called quadrature2, not just the file name, i.e. it should start with the line:

    module quadrature2

    and test2.f90 should:

    use quadrature2, only: ...
  5. Your simpson routine should include an omp parallel do loop similar to trapezoid. Make sure it gives the same results in the error table for both with and without the -fopenmp during compilation, and for different choices of the number of threads.

    Remember that you can run with more threads than your computer has cores and it should still work, but will probably make it run slower rather than faster.

  6. Create a new version of the quadrature module named quadrature3 that has no parallel loops in trapezoid and instead has a parallel do loop in the error_table routine when it loops over the different values of n to test from the nvals array.

    In this loop make last_error a firstprivate variable and think about what other variables need to be private. More about this below.

    Test this version with a new test program test3.f90 that calls error_table with method = trapezoid.

    Note the following:

    • If you run this with more than one thread, the different lines of the error table probably will not print out in the same order as on a single thread.
    • The values of ratio in the table will be wrong relative to the single thread code for various n. Make sure you understand why. (The values of the error should still agree with the single-thread code, however.)
    • This is not a very good way to try to parallelize this code because it does not have good load balancing. If you run with 2 threads, for example, one of them will do many more function evaluations than the other thread, if you allow OpenMP to split up the values of n between threads in the default manner. Think about why this is so and make sure you understand what’s going on.

    Make the changes for the next two parts also in your quadrature3.f90 version.

  7. Because of the load-balancing issue just mentioned, it is useful to include another clause in the omp parallel do loop directive in error table:

    !$omp parallel do ...  &   ! whatever you needed before
    !$omp          schedule(dynamic)
    do j=1,size(nvals)

    This instructs the compiler to split up the values of j from 1 to size(nvals) dynamically rather than deciding in advance that the first half of the values will go to Thread 0 and the second half to Thread 1, for example. Instead the two threads would start working on j=1 and j=2 and whichever finishes first would start on j=3. This should give a somewhat better balance between threads.

    Note that it can’t do a perfect job for this example since computing the error for the last value of j (the largest value of n) takes more function evaluations that all the others put together!

  8. In order to improve load balancing, reorder the parallel loop so that n is decreasing rather than increasing via:

    do j=size(nvals),1,-1

    Think about why this is better.

    In this case you might get results like this:

    Using  4 threads
    true integral:   6.00136745954910E+00
    
           n         approximation        error       ratio
       12800  6.00136470029559E+00    2.759E-06    0.000E+00
        6400  6.00135640717688E+00    1.105E-05    2.497E-01
       25600  6.00136677000212E+00    6.895E-07    0.000E+00
        1600  6.00118591794817E+00    1.815E-04    3.798E-03
        3200  6.00132301603504E+00    4.444E-05    2.487E-01
         800  6.00057196798962E+00    7.955E-04    2.282E-01
         400  5.99537682567465E+00    5.991E-03    7.419E-03
         200  5.99787907396672E+00    3.488E-03    2.280E-01
         100  6.01762134207395E+00    1.625E-02    3.686E-01
          50  6.00200615142457E+00    6.387E-04    5.462E+00
       51200  6.00136728718236E+00    1.724E-07    0.000E+00
      102400  6.00136741645906E+00    4.309E-08    0.000E+00
    
    Elapsed time =   0.00621600 seconds
    CPU time =   0.01550900 seconds
    fevals by thread  0:         51200
    fevals by thread  1:        102400
    fevals by thread  2:         22600
    fevals by thread  3:         28550
    Total number of fevals:     204750

    (Can you guess from this which thread got which values of n?) Notice that the table is very much out of order in this case, since lines were printed as threads finished their work.

    One could clean up the table by keeping the approximation and error values for each n in a short array and then printing at the end in the proper order, along with the correct ratios. But you don’t need to do this for the assignment.

    The changes for Problems 6,7,8 should all be made in the same version. So what you end up with in quadrature3.f90 will have the parallel loop in error_table, will use dynamic scheduling, and have the loop on j reordered. The test3.f90 program should call error_table with method = trapezoid.

  9. Suppose we want to compute an integral in two space dimensions of the form

    \(\int_a^b \int_c^d g(x,y) \, dy \, dx\)

    This can be rewritten as \(\int_a^b f(x) \, dx\) where the function \(f(x) = \int_c^d g(x,y) \, dy\). As usual, we could approximate the integral of \(f(x)\) by the trapezoid rule in x. But now for each x, in order to approximate \(f(x)\) we must approximate \(f(x)\) by a trapezoid rule approximation to the integral of \(g(x,y)\) in \(y\).

    Create a new directory homework5/quad2d that contains new versions of the codes functions.f90, quadrature.f90, and test.f90 that can be used to approximate

    \(\int_0^2 \int_1^4 \sin(x+y)~dy~dx\)

    for which the true value can be easily calculated for comparison.

    In this case the function f(x) defined in functions.f90 should contain an implementation of the trapezoid rule (in y) that estimates \(\int_1^4 g(x,y) \, dy\) for any value x.

    The functions module should also contain a function g(x,y) that will be called by f.

    For the trapezoid rule in y, always use ny = 1000 points. (Not a great idea, see below, but let’s keep it simple.)

    Modify the test program so that it produces an error table for ten values of n as shown in the sample output below. (These are the values used in the trapezoid rule approximation in x).

    Also modify your code so that it keeps track of how many evaluations of the function g(x,y) each thread does, by introducing a new module variable gevals that is initialized and incremented appropriately.

    Start with the modules provided in $UWHPSC/codes/homework5 and you can leave quadrature.f90 alone. In this module, OpenMP is used for the loop in the trapezoid routine. You do not need to add it to your new trapezoid loop in the definition of f(x). (You would not want to since they you would have nested parallel loops.)

    Sample output might look like this:

    Using  4 threads
    true integral:  -1.17773797385703E+00
    
           n         approximation        error       ratio
           5 -1.15309805294824E+00    2.464E-02    0.000E+00
          10 -1.17288644038560E+00    4.852E-03    5.079E+00
          20 -1.17664941136820E+00    1.089E-03    4.457E+00
          40 -1.17747897159959E+00    2.590E-04    4.203E+00
          80 -1.17767418488683E+00    6.379E-05    4.060E+00
         160 -1.17772156012371E+00    1.641E-05    3.886E+00
         320 -1.17773323092813E+00    4.743E-06    3.461E+00
         640 -1.17773612733693E+00    1.847E-06    2.569E+00
        1280 -1.17773684879809E+00    1.125E-06    1.641E+00
        2560 -1.17773702883453E+00    9.450E-07    1.191E+00
    
    Elapsed time =   0.10095600 seconds
    CPU time =   0.36504200 seconds
    fevals by thread  0:          1298
    fevals by thread  1:          1278
    fevals by thread  2:          1278
    fevals by thread  3:          1261
    Total number of fevals:       5115
    gevals by thread  0:       1298000
    gevals by thread  1:       1278000
    gevals by thread  2:       1278000
    gevals by thread  3:       1261000
    Total number of gevals:    5115000

    Note that the error decreases at the expected rate initially but for larger values of n we do not get the factor of 4 improvement we might hope for. This is because the inner integral in y is always approximated with 1000 points so there is an error in the values of f(x) produced that does not decrease as we increase the number of points used for the outer integral. (A better idea of course would be to decrease ny along with n.)

    Note that you expect the total number of g evaluations to be 1000 times larger than the total number of f evaluations.

When finished

Your homework5 directory should contain:

  • functions.f90 (unchanged from $UWHPSC/codes/homework5)
  • quadrature2.f90
  • test2.f90
  • quadrature3.f90
  • test3.f90
  • Makefile (optional if you find it useful to enhance what’s provided)
  • quad2d/quadrature.f90 (original from $UWHPSC/codes/homework5 should work here)
  • quad2d/functions.f90 (with f modified and g added)
  • quad2d/test.f90 (modified for this problem)
  • Makefile (optional)