There is a built in command time that can be used to time other commands. Just type time command at the prompt, e.g.:

```
$ time ./a.out
<output from code>
real 0m5.279s
user 0m1.915s
sys 0m0.006s
```

This executes the command ./a.out in this case (running a Fortran
executable) and then prints information
about the time required to execute, where, roughly speaking,
*real* is the wall-clock time, *user* is the time spent executing the
user’s program, and *sys* is the time spent on system tasks required by the
program.

There may be a big difference between the *real* time and the sum of the
other two times if the computer is simulataneously executing many other
tasks and your program is only getting some of its attention.

Using *time* does not allow you to examine how much CPU time different parts
of the code required, for example.

There are two Fortran intrinsic functions that are very useful.

system_clock tells the elapsed wall time between two successive calls, and might be used as follows:

```
integer(kind=8) :: tclock1, tclock2, clock_rate
real(kind=8) :: elapsed_time
call system_clock(tclock1)
<code to be timed>
call system_clock(tclock2, clock_rate)
elapsed_time = float(tclock2 - tclock1) / float(clock_rate)
```

cpu_time tells the CPU time used between two successive calls:

```
real(kind=8) :: t1, t2, elapsed_cpu_time
call cpu_time(t1)
<code to be timed>
call cpu_time(t2)
elapsed_cpu_time = t2 - t1
```

Here is an example code that uses both, and tests matrix-matrix multiply.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | ```
! $UWHPSC/codes/fortran/optimize/timings.f90
! Illustrate timing utilities in Fortran.
! system_clock can be used to compute elapsed time between
! two calls (wall time)
! cpu_time can be used to compute CPU time used between two calls.
! Try compiling with different levels of optimization, e.g. -O3
program timings
implicit none
integer, parameter :: ntests = 20
integer :: n
real(kind=8), allocatable, dimension(:,:) :: a,b,c
real(kind=8) :: t1, t2, elapsed_time
integer(kind=8) :: tclock1, tclock2, clock_rate
integer :: i,j,k,itest
call system_clock(tclock1)
print *, "Will multiply n by n matrices, input n: "
read *, n
allocate(a(n,n), b(n,n), c(n,n))
! fill a and b with 1's just for demo purposes:
a = 1.d0
b = 1.d0
call cpu_time(t1) ! start cpu timer
do itest=1,ntests
do j = 1,n
do i = 1,n
c(i,j) = 0.d0
do k=1,n
c(i,j) = c(i,j) + a(i,k)*b(k,j)
enddo
enddo
enddo
enddo
call cpu_time(t2) ! end cpu timer
print 10, ntests, t2-t1
10 format("Performed ",i4, " matrix multiplies: CPU time = ",f12.8, " seconds")
call system_clock(tclock2, clock_rate)
elapsed_time = float(tclock2 - tclock1) / float(clock_rate)
print 11, elapsed_time
11 format("Elapsed time = ",f12.8, " seconds")
end program timings
``` |

Note that the matrix-matrix product is computed 20 times over to give a better estimate of the timings.

You might want to experiment with this code and various sizes of the matrices and optimization levels. Here are a few sample results on a MacBook Pro.

First, with no optimization and \(200\times 200\) matrices:

```
$ gfortran timings.f90
$ ./a.out
Will multiply n by n matrices, input n:
200
Performed 20 matrix multiplies: CPU time = 0.81523600 seconds
Elapsed time = 5.94083357 seconds
```

Note that the elapsed time include the time required to type in the response to the prompt for n, as well as the time required to allocate and initialize the matrices, whereas the CPU time is just for the matrix multiplication loops.

Trying a larger \(400 \times 400\) case gives:

```
$ ./a.out
Will multiply n by n matrices, input n:
400
Performed 20 matrix multiplies: CPU time = 7.33721500 seconds
Elapsed time = 9.87114525 seconds
```

Since computing the product of \(n \times n\) matrices takes \(O(n^3)\) flops, we expect this to take about 8 times as much CPU time as the previous test. This is roughly what we observe.

Doubling the size again gives requires much more that 8 times as long however:

```
$ ./a.out
Will multiply n by n matrices, input n:
800
Performed 20 matrix multiplies: CPU time = 90.49682200 seconds
Elapsed time = 93.98917389 seconds
```

Note that 3 matrices that are \(400\times 400\) require 3.84 MB of memory, whereas three \(800 \times 800\) matrices require 15.6 MB. Since the MacBook used for this experiment has only 6 MB of L3 cache, the data no longer fit in cache.

**Compiling with optimization**

If we recompile with the -O3 optimization flag, the last two experiments give these results:

```
$ gfortran -O3 timings.f90
$ ./a.out
Will multiply n by n matrices, input n:
400
Performed 20 matrix multiplies: CPU time = 1.39002200 seconds
Elapsed time = 3.58041191 seconds
```

and

```
$ ./a.out
Will multiply n by n matrices, input n:
800
Performed 20 matrix multiplies: CPU time = 66.39167200 seconds
Elapsed time = 68.29921722 seconds
```

Both have sped up relative to the un-optimized code, the first much more dramatically.

The code in $UWHPSC/codes/openmp/timings.f90 shows an analogous code for matrix multiplication using OpenMP.

The code has been slightly modified so that system_clock is only timing the inner loops in order to illustrate that cpu_time now computes the sum of the CPU time of all threads.

Here’s one sample result:

```
$ gfortran -fopenmp -O3 timings.f90
$ ./a.out
Using OpenMP, how many threads?
4
Will multiply n by n matrices, input n:
400
Performed 20 matrix multiplies: CPU time = 1.99064000 seconds
Elapsed time = 0.58711302 seconds
```

Note that the CPU time reported is nearly 2 seconds but the elapsed time is only 0.58 seconds in this case when 4 threads are being used.

The total CPU time is slightly more than the previous code that did not use OpenMP, but the wall time is considerably less.

For \(800\times 800\) matrices there is a similar speedup:

```
$ ./a.out
Using OpenMP, how many threads?
4
Will multiply n by n matrices, input n:
800
Performed 20 matrix multiplies: CPU time = 79.70573500 seconds
Elapsed time = 21.37633133 seconds
```

Here is the code:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | ```
! $UWHPSC/codes/fortran/optimize/timings1.f90
! Illustrate timing utilities in Fortran.
! system_clock can be used to compute elapsed time between
! two calls (wall time)
! cpu_time can be used to compute CPU time used between two calls.
! Try compiling with different levels of optimization, e.g. -O3
program timings1
use omp_lib
implicit none
integer, parameter :: ntests = 20
integer :: n, nthreads
real(kind=8), allocatable, dimension(:,:) :: a,b,c
real(kind=8) :: t1, t2, elapsed_time
integer(kind=8) :: tclock1, tclock2, clock_rate
integer :: i,j,k,itest
! Specify number of threads to use:
!$ print *, "Using OpenMP, how many threads? "
!$ read *, nthreads
!$ call omp_set_num_threads(nthreads)
print *, "Will multiply n by n matrices, input n: "
read *, n
allocate(a(n,n), b(n,n), c(n,n))
! fill a and b with 1's just for demo purposes:
a = 1.d0
b = 1.d0
call system_clock(tclock1) ! start wall timer
call cpu_time(t1) ! start cpu timer
do itest=1,ntests
!$omp parallel do private(i,k)
do j = 1,n
do i = 1,n
c(i,j) = 0.d0
do k=1,n
c(i,j) = c(i,j) + a(i,k)*b(k,j)
enddo
enddo
enddo
enddo
call cpu_time(t2) ! end cpu timer
print 10, ntests, t2-t1
10 format("Performed ",i4, " matrix multiplies: CPU time = ",f12.8, " seconds")
call system_clock(tclock2, clock_rate)
elapsed_time = float(tclock2 - tclock1) / float(clock_rate)
print 11, elapsed_time
11 format("Elapsed time = ",f12.8, " seconds")
end program timings1
``` |