.. _timing: ============================================================= Timing code ============================================================= .. _timing_unix: Unix time command ----------------- There is a built in command `time` that can be used to time other commands. Just type `time command` at the prompt, e.g.:: $ time ./a.out real 0m5.279s user 0m1.915s sys 0m0.006s This executes the command `./a.out` in this case (running a Fortran executable) and then prints information about the time required to execute, where, roughly speaking, *real* is the wall-clock time, *user* is the time spent executing the user's program, and *sys* is the time spent on system tasks required by the program. There may be a big difference between the *real* time and the sum of the other two times if the computer is simulataneously executing many other tasks and your program is only getting some of its attention. Using *time* does not allow you to examine how much CPU time different parts of the code required, for example. .. _timing_fortran: Fortran timing utilities ------------------------ There are two Fortran intrinsic functions that are very useful. `system_clock` tells the elapsed wall time between two successive calls, and might be used as follows:: integer(kind=8) :: tclock1, tclock2, clock_rate real(kind=8) :: elapsed_time call system_clock(tclock1) call system_clock(tclock2, clock_rate) elapsed_time = float(tclock2 - tclock1) / float(clock_rate) `cpu_time` tells the CPU time used between two successive calls:: real(kind=8) :: t1, t2, elapsed_cpu_time call cpu_time(t1) call cpu_time(t2) elapsed_cpu_time = t2 - t1 Here is an example code that uses both, and tests matrix-matrix multiply. .. literalinclude:: ../codes/fortran/timings.f90 :language: fortran :linenos: Note that the matrix-matrix product is computed 20 times over to give a better estimate of the timings. You might want to experiment with this code and various sizes of the matrices and optimization levels. Here are a few sample results on a MacBook Pro. First, with no optimization and :math:`200\times 200` matrices:: $ gfortran timings.f90 $ ./a.out Will multiply n by n matrices, input n: 200 Performed 20 matrix multiplies: CPU time = 0.81523600 seconds Elapsed time = 5.94083357 seconds Note that the elapsed time include the time required to type in the response to the prompt for `n`, as well as the time required to allocate and initialize the matrices, whereas the CPU time is just for the matrix multiplication loops. Trying a larger :math:`400 \times 400` case gives:: $ ./a.out Will multiply n by n matrices, input n: 400 Performed 20 matrix multiplies: CPU time = 7.33721500 seconds Elapsed time = 9.87114525 seconds Since computing the product of :math:`n \times n` matrices takes :math:`O(n^3)` flops, we expect this to take about 8 times as much CPU time as the previous test. This is roughly what we observe. Doubling the size again gives requires much more that 8 times as long however:: $ ./a.out Will multiply n by n matrices, input n: 800 Performed 20 matrix multiplies: CPU time = 90.49682200 seconds Elapsed time = 93.98917389 seconds Note that 3 matrices that are :math:`400\times 400` require 3.84 MB of memory, whereas three :math:`800 \times 800` matrices require 15.6 MB. Since the MacBook used for this experiment has only 6 MB of L3 cache, the data no longer fit in cache. **Compiling with optimization** If we recompile with the -O3 optimization flag, the last two experiments give these results:: $ gfortran -O3 timings.f90 $ ./a.out Will multiply n by n matrices, input n: 400 Performed 20 matrix multiplies: CPU time = 1.39002200 seconds Elapsed time = 3.58041191 seconds and :: $ ./a.out Will multiply n by n matrices, input n: 800 Performed 20 matrix multiplies: CPU time = 66.39167200 seconds Elapsed time = 68.29921722 seconds Both have sped up relative to the un-optimized code, the first much more dramatically. .. timing_openmp: Timing OpenMP code ------------------ The code in `$UWHPSC/codes/openmp/timings.f90` shows an analogous code for matrix multiplication using OpenMP. The code has been slightly modified so that `system_clock` is only timing the inner loops in order to illustrate that `cpu_time` now computes the sum of the CPU time of all threads. Here's one sample result:: $ gfortran -fopenmp -O3 timings.f90 $ ./a.out Using OpenMP, how many threads? 4 Will multiply n by n matrices, input n: 400 Performed 20 matrix multiplies: CPU time = 1.99064000 seconds Elapsed time = 0.58711302 seconds Note that the CPU time reported is nearly 2 seconds but the elapsed time is only 0.58 seconds in this case when 4 threads are being used. The total CPU time is slightly more than the previous code that did not use OpenMP, but the wall time is considerably less. For :math:`800\times 800` matrices there is a similar speedup:: $ ./a.out Using OpenMP, how many threads? 4 Will multiply n by n matrices, input n: 800 Performed 20 matrix multiplies: CPU time = 79.70573500 seconds Elapsed time = 21.37633133 seconds Here is the code: .. literalinclude:: ../codes/openmp/timings.f90 :language: fortran :linenos: