.. _homework5: ========================================== Homework 5 ========================================== The goals of this homework are to: * Get more experience with Fortran and OpenMP. * Experiment with coarse grain parallelism. * Learn a bit more about quadrature. #. I wrote an article recently urging applied mathematicians to share research code, which appeared in SIAM News. Some colleagues at other universities have told me it's required reading for their students, so of course I have to assign it too! It's pretty light reading. * ``_ * `some related links `_ #. The IPython notebook `$UWHPSC/codes/homework5/notebook/quadrature2.ipynb` is an improved version of the notebook from the last homework. Some of the functions have been made more general and a discussion has been added of Simpson's Rule, a more accurate formula than Trapezoid. This notebook is best viewed live so that you can experiment with changing things in order to explore these examples. If you have a sufficiently recent version of the notebok installed (see :ref:`ipython_notebook`) then you should be able to do:: $ cd $UWHPSC/codes/homework5/notebook $ ipython notebook --pylab inline and then click on the `quadrature2` notebook. You can also view it at ``_ . If you have created a Wakari account, you should be able to download it to your account to try it out. You can also view a static version of the notebook, which is in `$UWHPSC/codes/homework5/notebook/quadrature2.pdf`. There is also a Python script version of the code in the notebook at `$UWHPSC/codes/homework5/notebook/quadrature.py` if you find that easier to experiment with. (But see the comments at the top of that file.) Experiment with the notebook or the module to make sure you understand the material presented. Study this code to make sure you understand it. #. The directory `$UWHPSC/codes/homework5/` also contains Fortran code that implements the last part of homework 4, with some added enhancements. In particular: * timing has been added * a counter has been added to the f2 function to count how many times it is called, and this is printed at the end. * fevals is a module variable (shared between threads) to store the number of function calls by each thread when OpenMP is used. * the `error_table` subroutine has a new input parameter `method` to pass in the function that approximates the integral. When this is called from the main program `test.f90` the function `trapezoid` is passed in. In this homework you will also test Simpson's rule. * The function `f2` has been moved to a module and the parameter `k` is a module variable. Study this code and experiment with it. With 4 threads it might produce something like the following (timings will depend on how many cores you have):: Using 4 threads true integral: 6.00136745954910E+00 n approximation error ratio 50 6.00200615142458E+00 6.387E-04 0.000E+00 100 6.01762134207395E+00 1.625E-02 3.929E-02 200 5.99787907396672E+00 3.488E-03 4.659E+00 400 5.99537682567465E+00 5.991E-03 5.823E-01 800 6.00057196798962E+00 7.955E-04 7.531E+00 1600 6.00118591794817E+00 1.815E-04 4.382E+00 3200 6.00132301603504E+00 4.444E-05 4.085E+00 6400 6.00135640717690E+00 1.105E-05 4.021E+00 12800 6.00136470029559E+00 2.759E-06 4.006E+00 25600 6.00136677000209E+00 6.895E-07 4.002E+00 51200 6.00136728718235E+00 1.724E-07 4.000E+00 102400 6.00136741645906E+00 4.309E-08 4.000E+00 Elapsed time = 0.00554200 seconds CPU time = 0.01890300 seconds fevals by thread 0: 51211 fevals by thread 1: 51187 fevals by thread 2: 51187 fevals by thread 3: 51165 Total number of fevals: 204750 You do not need to code anything for this part. #. Create a new subdirectory `$MYHPSC/homework5` for the code you write below. You can use the code provided as a starting point. Create a new module `quadrature2.f90` by starting with `quadrature.f90` and adding a new function `simpson` that implements Simpson's rule. It should have the same input arguments as `trapezoid`. Write a new main program `test2.f90` to test this. Check that it is 4th order accurate on the function `f2` provided with various values of `k`. Check it also with some other functions if you want. **Note:** this module should also be called `quadrature2`, not just the file name, i.e. it should start with the line:: module quadrature2 and `test2.f90` should:: use quadrature2, only: ... #. Your `simpson` routine should include an `omp parallel do` loop similar to `trapezoid`. Make sure it gives the same results in the error table for both with and without the `-fopenmp` during compilation, and for different choices of the number of threads. Remember that you can run with more threads than your computer has cores and it should still work, but will probably make it run slower rather than faster. #. Create a new version of the `quadrature` module named `quadrature3` that has no parallel loops in `trapezoid` and instead has a parallel do loop in the `error_table` routine when it loops over the different values of `n` to test from the `nvals` array. In this loop make `last_error` a *firstprivate* variable and think about what other variables need to be *private*. More about this below. Test this version with a new test program `test3.f90` that calls `error_table` with `method = trapezoid`. Note the following: * If you run this with more than one thread, the different lines of the error table probably will not print out in the same order as on a single thread. * The values of `ratio` in the table will be wrong relative to the single thread code for various `n`. Make sure you understand why. (The values of the `error` should still agree with the single-thread code, however.) * This is not a very good way to try to parallelize this code because it does not have good *load balancing*. If you run with 2 threads, for example, one of them will do many more function evaluations than the other thread, if you allow OpenMP to split up the values of `n` between threads in the default manner. Think about why this is so and make sure you understand what's going on. Make the changes for the next two parts also in your `quadrature3.f90` version. #. Because of the load-balancing issue just mentioned, it is useful to include another clause in the `omp parallel do` loop directive in error table:: !$omp parallel do ... & ! whatever you needed before !$omp schedule(dynamic) do j=1,size(nvals) This instructs the compiler to split up the values of `j` from 1 to `size(nvals)` dynamically rather than deciding in advance that the first half of the values will go to Thread 0 and the second half to Thread 1, for example. Instead the two threads would start working on `j=1` and `j=2` and whichever finishes first would start on `j=3`. This should give a somewhat better balance between threads. Note that it can't do a perfect job for this example since computing the error for the last value of `j` (the largest value of `n`) takes more function evaluations that all the others put together! #. In order to improve load balancing, reorder the parallel loop so that `n` is decreasing rather than increasing via:: do j=size(nvals),1,-1 Think about why this is better. In this case you might get results like this:: Using 4 threads true integral: 6.00136745954910E+00 n approximation error ratio 12800 6.00136470029559E+00 2.759E-06 0.000E+00 6400 6.00135640717688E+00 1.105E-05 2.497E-01 25600 6.00136677000212E+00 6.895E-07 0.000E+00 1600 6.00118591794817E+00 1.815E-04 3.798E-03 3200 6.00132301603504E+00 4.444E-05 2.487E-01 800 6.00057196798962E+00 7.955E-04 2.282E-01 400 5.99537682567465E+00 5.991E-03 7.419E-03 200 5.99787907396672E+00 3.488E-03 2.280E-01 100 6.01762134207395E+00 1.625E-02 3.686E-01 50 6.00200615142457E+00 6.387E-04 5.462E+00 51200 6.00136728718236E+00 1.724E-07 0.000E+00 102400 6.00136741645906E+00 4.309E-08 0.000E+00 Elapsed time = 0.00621600 seconds CPU time = 0.01550900 seconds fevals by thread 0: 51200 fevals by thread 1: 102400 fevals by thread 2: 22600 fevals by thread 3: 28550 Total number of fevals: 204750 (Can you guess from this which thread got which values of `n`?) Notice that the table is very much out of order in this case, since lines were printed as threads finished their work. One could clean up the table by keeping the approximation and error values for each n in a short array and then printing at the end in the proper order, along with the correct ratios. But you don't need to do this for the assignment. The changes for Problems 6,7,8 should all be made in the same version. So what you end up with in `quadrature3.f90` will have the parallel loop in `error_table`, will use dynamic scheduling, and have the loop on `j` reordered. The `test3.f90` program should call `error_table` with `method = trapezoid`. #. Suppose we want to compute an integral in two space dimensions of the form :math:`\int_a^b \int_c^d g(x,y) \, dy \, dx` This can be rewritten as :math:`\int_a^b f(x) \, dx` where the function :math:`f(x) = \int_c^d g(x,y) \, dy`. As usual, we could approximate the integral of :math:`f(x)` by the trapezoid rule in `x`. But now for each `x`, in order to approximate :math:`f(x)` we must approximate :math:`f(x)` by a trapezoid rule approximation to the integral of :math:`g(x,y)` in :math:`y`. Create a new directory `homework5/quad2d` that contains new versions of the codes `functions.f90`, `quadrature.f90`, and `test.f90` that can be used to approximate :math:`\int_0^2 \int_1^4 \sin(x+y)~dy~dx` for which the true value can be easily calculated for comparison. In this case the function `f(x)` defined in `functions.f90` should contain an implementation of the trapezoid rule (in `y`) that estimates :math:`\int_1^4 g(x,y) \, dy` for any value `x`. The `functions` module should also contain a function `g(x,y)` that will be called by `f`. For the trapezoid rule in `y`, always use `ny = 1000` points. (Not a great idea, see below, but let's keep it simple.) Modify the test program so that it produces an error table for ten values of `n` as shown in the sample output below. (These are the values used in the trapezoid rule approximation in `x`). Also modify your code so that it keeps track of how many evaluations of the function `g(x,y)` each thread does, by introducing a new module variable `gevals` that is initialized and incremented appropriately. Start with the modules provided in `$UWHPSC/codes/homework5` and you can leave `quadrature.f90` alone. In this module, OpenMP is used for the loop in the `trapezoid` routine. You do not need to add it to your new trapezoid loop in the definition of `f(x)`. (You would not want to since they you would have nested parallel loops.) Sample output might look like this:: Using 4 threads true integral: -1.17773797385703E+00 n approximation error ratio 5 -1.15309805294824E+00 2.464E-02 0.000E+00 10 -1.17288644038560E+00 4.852E-03 5.079E+00 20 -1.17664941136820E+00 1.089E-03 4.457E+00 40 -1.17747897159959E+00 2.590E-04 4.203E+00 80 -1.17767418488683E+00 6.379E-05 4.060E+00 160 -1.17772156012371E+00 1.641E-05 3.886E+00 320 -1.17773323092813E+00 4.743E-06 3.461E+00 640 -1.17773612733693E+00 1.847E-06 2.569E+00 1280 -1.17773684879809E+00 1.125E-06 1.641E+00 2560 -1.17773702883453E+00 9.450E-07 1.191E+00 Elapsed time = 0.10095600 seconds CPU time = 0.36504200 seconds fevals by thread 0: 1298 fevals by thread 1: 1278 fevals by thread 2: 1278 fevals by thread 3: 1261 Total number of fevals: 5115 gevals by thread 0: 1298000 gevals by thread 1: 1278000 gevals by thread 2: 1278000 gevals by thread 3: 1261000 Total number of gevals: 5115000 Note that the error decreases at the expected rate initially but for larger values of `n` we do not get the factor of 4 improvement we might hope for. This is because the inner integral in `y` is always approximated with 1000 points so there is an error in the values of `f(x)` produced that does not decrease as we increase the number of points used for the outer integral. (A better idea of course would be to decrease `ny` along with `n`.) Note that you expect the total number of `g` evaluations to be 1000 times larger than the total number of `f` evaluations. When finished ------------- Your homework5 directory should contain: * functions.f90 (unchanged from `$UWHPSC/codes/homework5`) * quadrature2.f90 * test2.f90 * quadrature3.f90 * test3.f90 * Makefile (optional if you find it useful to enhance what's provided) * quad2d/quadrature.f90 (original from `$UWHPSC/codes/homework5` should work here) * quad2d/functions.f90 (with f modified and g added) * quad2d/test.f90 (modified for this problem) * Makefile (optional)