High Performance Scientific Computing   AMath 483/583 Class Notes   Spring Quarter, 2013

#### Previous topic

Special functions

#### Next topic

Linear Algebra software

# Timing code¶

## Unix time command¶

There is a built in command time that can be used to time other commands. Just type time command at the prompt, e.g.:

$time ./a.out <output from code> real 0m5.279s user 0m1.915s sys 0m0.006s  This executes the command ./a.out in this case (running a Fortran executable) and then prints information about the time required to execute, where, roughly speaking, real is the wall-clock time, user is the time spent executing the user’s program, and sys is the time spent on system tasks required by the program. There may be a big difference between the real time and the sum of the other two times if the computer is simulataneously executing many other tasks and your program is only getting some of its attention. Using time does not allow you to examine how much CPU time different parts of the code required, for example. ## Fortran timing utilities¶ There are two Fortran intrinsic functions that are very useful. system_clock tells the elapsed wall time between two successive calls, and might be used as follows: integer(kind=8) :: tclock1, tclock2, clock_rate real(kind=8) :: elapsed_time call system_clock(tclock1) <code to be timed> call system_clock(tclock2, clock_rate) elapsed_time = float(tclock2 - tclock1) / float(clock_rate)  cpu_time tells the CPU time used between two successive calls: real(kind=8) :: t1, t2, elapsed_cpu_time call cpu_time(t1) <code to be timed> call cpu_time(t2) elapsed_cpu_time = t2 - t1  Here is an example code that uses both, and tests matrix-matrix multiply.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 !$UWHPSC/codes/fortran/optimize/timings.f90 ! Illustrate timing utilities in Fortran. ! system_clock can be used to compute elapsed time between ! two calls (wall time) ! cpu_time can be used to compute CPU time used between two calls. ! Try compiling with different levels of optimization, e.g. -O3 program timings implicit none integer, parameter :: ntests = 20 integer :: n real(kind=8), allocatable, dimension(:,:) :: a,b,c real(kind=8) :: t1, t2, elapsed_time integer(kind=8) :: tclock1, tclock2, clock_rate integer :: i,j,k,itest call system_clock(tclock1) print *, "Will multiply n by n matrices, input n: " read *, n allocate(a(n,n), b(n,n), c(n,n)) ! fill a and b with 1's just for demo purposes: a = 1.d0 b = 1.d0 call cpu_time(t1) ! start cpu timer do itest=1,ntests do j = 1,n do i = 1,n c(i,j) = 0.d0 do k=1,n c(i,j) = c(i,j) + a(i,k)*b(k,j) enddo enddo enddo enddo call cpu_time(t2) ! end cpu timer print 10, ntests, t2-t1 10 format("Performed ",i4, " matrix multiplies: CPU time = ",f12.8, " seconds") call system_clock(tclock2, clock_rate) elapsed_time = float(tclock2 - tclock1) / float(clock_rate) print 11, elapsed_time 11 format("Elapsed time = ",f12.8, " seconds") end program timings 

Note that the matrix-matrix product is computed 20 times over to give a better estimate of the timings.

You might want to experiment with this code and various sizes of the matrices and optimization levels. Here are a few sample results on a MacBook Pro.

First, with no optimization and $$200\times 200$$ matrices:

$gfortran timings.f90$ ./a.out
Will multiply n by n matrices, input n:
200
Performed   20 matrix multiplies: CPU time =   0.81523600 seconds
Elapsed time =   5.94083357 seconds


Note that the elapsed time include the time required to type in the response to the prompt for n, as well as the time required to allocate and initialize the matrices, whereas the CPU time is just for the matrix multiplication loops.

Trying a larger $$400 \times 400$$ case gives:

$./a.out Will multiply n by n matrices, input n: 400 Performed 20 matrix multiplies: CPU time = 7.33721500 seconds Elapsed time = 9.87114525 seconds  Since computing the product of $$n \times n$$ matrices takes $$O(n^3)$$ flops, we expect this to take about 8 times as much CPU time as the previous test. This is roughly what we observe. Doubling the size again gives requires much more that 8 times as long however: $ ./a.out
Will multiply n by n matrices, input n:
800
Performed   20 matrix multiplies: CPU time =  90.49682200 seconds
Elapsed time =  93.98917389 seconds


Note that 3 matrices that are $$400\times 400$$ require 3.84 MB of memory, whereas three $$800 \times 800$$ matrices require 15.6 MB. Since the MacBook used for this experiment has only 6 MB of L3 cache, the data no longer fit in cache.

Compiling with optimization

If we recompile with the -O3 optimization flag, the last two experiments give these results:

$gfortran -O3 timings.f90$ ./a.out
Will multiply n by n matrices, input n:
400
Performed   20 matrix multiplies: CPU time =   1.39002200 seconds
Elapsed time =   3.58041191 seconds


and

$./a.out Will multiply n by n matrices, input n: 800 Performed 20 matrix multiplies: CPU time = 66.39167200 seconds Elapsed time = 68.29921722 seconds  Both have sped up relative to the un-optimized code, the first much more dramatically. ## Timing OpenMP code¶ The code in$UWHPSC/codes/openmp/timings.f90 shows an analogous code for matrix multiplication using OpenMP.

The code has been slightly modified so that system_clock is only timing the inner loops in order to illustrate that cpu_time now computes the sum of the CPU time of all threads.

Here’s one sample result:

$gfortran -fopenmp -O3 timings.f90$ ./a.out
Using OpenMP, how many threads?
4
Will multiply n by n matrices, input n:
400
Performed   20 matrix multiplies: CPU time =   1.99064000 seconds
Elapsed time =   0.58711302 seconds


Note that the CPU time reported is nearly 2 seconds but the elapsed time is only 0.58 seconds in this case when 4 threads are being used.

The total CPU time is slightly more than the previous code that did not use OpenMP, but the wall time is considerably less.

For $$800\times 800$$ matrices there is a similar speedup:

$./a.out Using OpenMP, how many threads? 4 Will multiply n by n matrices, input n: 800 Performed 20 matrix multiplies: CPU time = 79.70573500 seconds Elapsed time = 21.37633133 seconds  Here is the code:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 !$UWHPSC/codes/fortran/optimize/timings1.f90 ! Illustrate timing utilities in Fortran. ! system_clock can be used to compute elapsed time between ! two calls (wall time) ! cpu_time can be used to compute CPU time used between two calls. ! Try compiling with different levels of optimization, e.g. -O3 program timings1 use omp_lib implicit none integer, parameter :: ntests = 20 integer :: n, nthreads real(kind=8), allocatable, dimension(:,:) :: a,b,c real(kind=8) :: t1, t2, elapsed_time integer(kind=8) :: tclock1, tclock2, clock_rate integer :: i,j,k,itest ! Specify number of threads to use: !$print *, "Using OpenMP, how many threads? " !$ read *, nthreads !$call omp_set_num_threads(nthreads) print *, "Will multiply n by n matrices, input n: " read *, n allocate(a(n,n), b(n,n), c(n,n)) ! fill a and b with 1's just for demo purposes: a = 1.d0 b = 1.d0 call system_clock(tclock1) ! start wall timer call cpu_time(t1) ! start cpu timer do itest=1,ntests !$omp parallel do private(i,k) do j = 1,n do i = 1,n c(i,j) = 0.d0 do k=1,n c(i,j) = c(i,j) + a(i,k)*b(k,j) enddo enddo enddo enddo call cpu_time(t2) ! end cpu timer print 10, ntests, t2-t1 10 format("Performed ",i4, " matrix multiplies: CPU time = ",f12.8, " seconds") call system_clock(tclock2, clock_rate) elapsed_time = float(tclock2 - tclock1) / float(clock_rate) print 11, elapsed_time 11 format("Elapsed time = ",f12.8, " seconds") end program timings1