 High Performance Scientific Computing   AMath 483/583 Class Notes   Spring Quarter, 2011

#### Previous topic

Rotating particles and Python efficiency

#### Next topic

Bibliography and further reading

# Rotating particles and Fortran efficiency¶

These examples solve the simple problem described in Section Particle methods, and follow a nearly identical pattern to the sequence of Python programs discussed in Rotating particles and Python efficiency. Comparing these examples to the ones on that page may be useful in understanding the languages as well as some of the efficiency issues.

For a general introduction to Fortran, see Fortran.

For more about the timings reported here and the computer used, see Timing and profiling computer codes.

This is a simple example of a particle code (see Particle methods) in which a set of particles in 2 space dimensions (the x-y plane) are initialized and then moved around. Here the motion is simply taken to be rotation about the origin. For more about this example see Rotating particles and Python efficiency.

The results below show that in Fortran loops are not necessarily bad. In fact the fastest version on this page is rotate3.f90 which has nested loops. In Python loops are bad because it’s an interpreted language. Using the vectorized version gives considerable speedup because it calls a compiled routine to do the loops. Fortran is a compiled language so doing the loops directly can be faster.

## A poorly written code¶

Here we start with the analog of rotate2.py:

```# \$CLASSHG/codes/particles/rotate2.f90

program rotate2

implicit none
integer, parameter :: n = 150000   ! number of particles
integer :: nsteps = 40             ! number of rotation steps
integer :: i, j
real :: t1, t2, pi, dtheta, theta, r, alpha, x1, y1
real, dimension(n) :: x, y         ! arrays of length n

call cpu_time(t1)                  ! start the CPU timer

pi = 4. * atan(1.)                 ! computes pi to full precision
dtheta = 2. * pi / nsteps          ! change in theta between steps

print *, "Rotating ", n, " particles through ", nsteps, " theta values"

do i=1,nsteps
theta = i*dtheta
do j=1,n
alpha = 2. * pi*j/n
r = 1. + 0.5*cos(10.*alpha)
x1 = 2. + r*cos(alpha)
y1 = r*sin(alpha)
x(j) = cos(theta)*x1 - sin(theta)*y1
y(j) = sin(theta)*x1 + cos(theta)*y1
enddo
enddo

call cpu_time(t2)

print '(" CPU time (sec):", f12.8 )', t2-t1

end program rotate2
```

The initial timing is:

```\$ gfortran rotate2.f90 -o rotate2.exe
\$ ./rotate2.exe
Rotating       150000  particles through           40  theta values
CPU time (sec):  1.03054905```

This is much faster than the corresponding Python code (note that we are using 150000 particles rather than 150 here!).

We can speed it up a bit more by turning optimization on:

```\$ gfortran -O3 rotate2.f90 -o rotate2.exe
\$ ./rotate2.exe
Rotating       150000  particles through           40  theta values
CPU time (sec):  0.70011801```

## Eliminating repeated computations¶

Next is the analog of rotate3.py, in which redudant computations have been taken out of the loops. Here is the Fortran version:

```# \$CLASSHG/codes/particles/rotate3.f90

program rotate3

implicit none
integer, parameter :: n = 150000   ! number of particles
integer :: nsteps = 40             ! number of rotation steps
integer :: i, j
real :: t1, t2, pi, dtheta, theta, sinth, costh

real, dimension(n) :: x, y, alpha  ! arrays of length n
real, dimension(n) :: x1, y1, r    ! arrays of length n

call cpu_time(t1)                  ! start the CPU timer

pi = 4. * atan(1.)                 ! computes pi to full precision
dtheta = 2. * pi / nsteps          ! change in theta between steps

print *, "Rotating ", n, " particles through ", nsteps, " theta values"

do j=1,n
alpha(j) = (j-1)*2.*pi / n
r(j) = 1. + 0.5*cos(10*alpha(j))   ! radius for each alpha
x1(j) = 2. + r(j)*cos(alpha(j))    ! x value before rotation
y1(j) = r(j)*sin(alpha(j))         ! y value before rotation
enddo

do i=1,nsteps
theta = i*dtheta
costh = cos(theta)
sinth = sin(theta)
do j=1,n
x(j) = costh*x1(j) - sinth*y1(j)
y(j) = sinth*x1(j) + costh*y1(j)
enddo
enddo

call cpu_time(t2)

print '(" CPU time (sec):", f12.8 )', t2-t1

end program rotate3
```

Without optimization:

```\$ gfortran rotate3.f90 -o rotate3.exe
\$ ./rotate3.exe
Rotating       150000  particles through           40  theta values
CPU time (sec):  0.07432000```

With optimization:

```\$ gfortran -O3 rotate3.f90 -o rotate3.exe
\$ ./rotate3.exe
Rotating       150000  particles through           40  theta values
CPU time (sec):  0.03912600```

This is much faster than the Python version.

## Array operations¶

Version 4 uses array operations rather than loops.

```# \$CLASSHG/codes/particles/rotate4.f90

program rotate4

implicit none
integer, parameter :: n = 150000   ! number of particles
integer :: nsteps = 40             ! number of rotation steps
integer :: i, j
real :: t1, t2, pi, dtheta, theta, sinth, costh

real, dimension(n) :: x, y, alpha  ! arrays of length n
real, dimension(n) :: x1, y1, r    ! arrays of length n

call cpu_time(t1)                  ! start the CPU timer

pi = 4. * atan(1.)                 ! computes pi to full precision
dtheta = 2. * pi / nsteps          ! change in theta between steps

print *, "Rotating ", n, " particles through ", nsteps, " theta values"

do j=1,n
alpha(j) = (j-1)*2.*pi / n
enddo

r = 1. + 0.5*cos(10*alpha)       ! vector with radius for each alpha
x1 = 2. + r*cos(alpha)           ! vector of x before rotation
y1 = r*sin(alpha)                ! vector of y before rotation

do i=1,nsteps
theta = i*dtheta
costh = cos(theta)
sinth = sin(theta)
x = cos(theta)*x1 - sin(theta)*y1
y = sin(theta)*x1 + cos(theta)*y1
enddo

call cpu_time(t2)

print '(" CPU time (sec):", f12.8 )', t2-t1

end program rotate4
```

Recall that in Python vectorizing the loops made a huge difference. Here it actually slows down a bit relative to the loops.

Without optimization:

```\$ gfortran rotate4.f90 -o rotate4.exe
./rotate4.exe
Rotating       150000  particles through           40  theta values
CPU time (sec):  0.08140400```

With optimization:

```\$ gfortran -O3 rotate4.f90 -o rotate4.exe
./rotate4.exe
Rotating       150000  particles through           40  theta values
CPU time (sec):  0.04200900```

## Linear algebra version¶

Finally, Version 5 uses the Fortran 90 matrix multiplication routine.

```# \$CLASSHG/codes/particles/rotate5.f90

program rotate5

implicit none
integer, parameter :: n = 150000   ! number of particles
integer :: nsteps = 40             ! number of rotation steps
integer :: i, j
real :: t1, t2, pi, dtheta, theta, sdt, cdt

real, dimension(n) :: x, y, alpha  ! arrays of length n
real, dimension(n) :: x1, y1, r    ! arrays of length n
real, dimension(2,n) :: xy         ! 2 by n matrix for xy values
real, dimension(2,2) :: Rot        ! 2 by 2 rotation matrix

call cpu_time(t1)                  ! start the CPU timer

pi = 4. * atan(1.)                 ! computes pi to full precision
dtheta = 2. * pi / nsteps          ! change in theta between steps

print *, "Rotating ", n, " particles through ", nsteps, " theta values"

do j=1,n
alpha(j) = (j-1)*2.*pi / n
enddo

r = 1. + 0.5*cos(10*alpha)         ! vector with radius for each alpha
xy(1,:) = 2. + r*cos(alpha)        ! vector of x before rotation
xy(2,:) = r*sin(alpha)             ! vector of y before rotation
Rot(1,1) = cdt
Rot(1,2) = sdt
Rot(2,1) = -sdt
Rot(2,2) = cdt

do i=1,nsteps
xy = matmul(Rot, xy)         ! multiply by rotation matrix
x = xy(1,:)                  ! extract 1st row
y = xy(2,:)                  ! extract 2nd row
enddo

call cpu_time(t2)

print '(" CPU time (sec):", f12.8 )', t2-t1

end program rotate5
```

This slows the code down even further. Moreover, optimization doesn’t help at all!

Without optimization:

```\$ gfortran rotate5.f90 -o rotate5.exe
./rotate5.exe
Rotating       150000  particles through           40  theta values
CPU time (sec):  0.34880498```

With optimization:

```\$ gfortran -O3 rotate5.f90 -o rotate5.exe
./rotate5.exe
Rotating       150000  particles through           40  theta values
CPU time (sec):  2.14158487```