UW AMath High Performance Scientific Computing
 
AMath 483/583 Class Notes
 
Spring Quarter, 2011

Table Of Contents

Previous topic

Rotating particles and Python efficiency

Next topic

Bibliography and further reading

This Page

Rotating particles and Fortran efficiency

These examples solve the simple problem described in Section Particle methods, and follow a nearly identical pattern to the sequence of Python programs discussed in Rotating particles and Python efficiency. Comparing these examples to the ones on that page may be useful in understanding the languages as well as some of the efficiency issues.

For a general introduction to Fortran, see Fortran.

For more about the timings reported here and the computer used, see Timing and profiling computer codes.

This is a simple example of a particle code (see Particle methods) in which a set of particles in 2 space dimensions (the x-y plane) are initialized and then moved around. Here the motion is simply taken to be rotation about the origin. For more about this example see Rotating particles and Python efficiency.

The results below show that in Fortran loops are not necessarily bad. In fact the fastest version on this page is rotate3.f90 which has nested loops. In Python loops are bad because it’s an interpreted language. Using the vectorized version gives considerable speedup because it calls a compiled routine to do the loops. Fortran is a compiled language so doing the loops directly can be faster.

A poorly written code

Here we start with the analog of rotate2.py:

# $CLASSHG/codes/particles/rotate2.f90

program rotate2
   
   implicit none
   integer, parameter :: n = 150000   ! number of particles
   integer :: nsteps = 40             ! number of rotation steps
   integer :: i, j
   real :: t1, t2, pi, dtheta, theta, r, alpha, x1, y1
   real, dimension(n) :: x, y         ! arrays of length n
   
   call cpu_time(t1)                  ! start the CPU timer

   pi = 4. * atan(1.)                 ! computes pi to full precision
   dtheta = 2. * pi / nsteps          ! change in theta between steps
   
   print *, "Rotating ", n, " particles through ", nsteps, " theta values"
   
   do i=1,nsteps
      theta = i*dtheta
      do j=1,n
         alpha = 2. * pi*j/n
         r = 1. + 0.5*cos(10.*alpha)
         x1 = 2. + r*cos(alpha)
         y1 = r*sin(alpha)
         x(j) = cos(theta)*x1 - sin(theta)*y1
         y(j) = sin(theta)*x1 + cos(theta)*y1
      enddo
   enddo
   
   call cpu_time(t2)
   
   print '(" CPU time (sec):", f12.8 )', t2-t1

end program rotate2

The initial timing is:

$ gfortran rotate2.f90 -o rotate2.exe
$ ./rotate2.exe
 Rotating       150000  particles through           40  theta values
 CPU time (sec):  1.03054905

This is much faster than the corresponding Python code (note that we are using 150000 particles rather than 150 here!).

We can speed it up a bit more by turning optimization on:

$ gfortran -O3 rotate2.f90 -o rotate2.exe
$ ./rotate2.exe
 Rotating       150000  particles through           40  theta values
 CPU time (sec):  0.70011801

Eliminating repeated computations

Next is the analog of rotate3.py, in which redudant computations have been taken out of the loops. Here is the Fortran version:

# $CLASSHG/codes/particles/rotate3.f90

program rotate3
   
   implicit none
   integer, parameter :: n = 150000   ! number of particles
   integer :: nsteps = 40             ! number of rotation steps
   integer :: i, j
   real :: t1, t2, pi, dtheta, theta, sinth, costh

   real, dimension(n) :: x, y, alpha  ! arrays of length n
   real, dimension(n) :: x1, y1, r    ! arrays of length n
   
   call cpu_time(t1)                  ! start the CPU timer

   pi = 4. * atan(1.)                 ! computes pi to full precision
   dtheta = 2. * pi / nsteps          ! change in theta between steps
   
   print *, "Rotating ", n, " particles through ", nsteps, " theta values"
   
   do j=1,n
      alpha(j) = (j-1)*2.*pi / n       
      r(j) = 1. + 0.5*cos(10*alpha(j))   ! radius for each alpha
      x1(j) = 2. + r(j)*cos(alpha(j))    ! x value before rotation
      y1(j) = r(j)*sin(alpha(j))         ! y value before rotation
   enddo

   do i=1,nsteps
      theta = i*dtheta
      costh = cos(theta)
      sinth = sin(theta)
      do j=1,n
         x(j) = costh*x1(j) - sinth*y1(j)
         y(j) = sinth*x1(j) + costh*y1(j)
      enddo
   enddo
   
   call cpu_time(t2)
   
   print '(" CPU time (sec):", f12.8 )', t2-t1

end program rotate3

Without optimization:

$ gfortran rotate3.f90 -o rotate3.exe
$ ./rotate3.exe
 Rotating       150000  particles through           40  theta values
 CPU time (sec):  0.07432000

With optimization:

$ gfortran -O3 rotate3.f90 -o rotate3.exe
$ ./rotate3.exe
 Rotating       150000  particles through           40  theta values
 CPU time (sec):  0.03912600

This is much faster than the Python version.

Array operations

Version 4 uses array operations rather than loops.

# $CLASSHG/codes/particles/rotate4.f90

program rotate4
   
   implicit none
   integer, parameter :: n = 150000   ! number of particles
   integer :: nsteps = 40             ! number of rotation steps
   integer :: i, j
   real :: t1, t2, pi, dtheta, theta, sinth, costh

   real, dimension(n) :: x, y, alpha  ! arrays of length n
   real, dimension(n) :: x1, y1, r    ! arrays of length n
   
   call cpu_time(t1)                  ! start the CPU timer

   pi = 4. * atan(1.)                 ! computes pi to full precision
   dtheta = 2. * pi / nsteps          ! change in theta between steps
   
   print *, "Rotating ", n, " particles through ", nsteps, " theta values"
   
   do j=1,n
      alpha(j) = (j-1)*2.*pi / n
   enddo

   r = 1. + 0.5*cos(10*alpha)       ! vector with radius for each alpha
   x1 = 2. + r*cos(alpha)           ! vector of x before rotation
   y1 = r*sin(alpha)                ! vector of y before rotation

   do i=1,nsteps
      theta = i*dtheta
      costh = cos(theta)
      sinth = sin(theta)
      x = cos(theta)*x1 - sin(theta)*y1
      y = sin(theta)*x1 + cos(theta)*y1
   enddo
   
   call cpu_time(t2)
   
   print '(" CPU time (sec):", f12.8 )', t2-t1

end program rotate4

Recall that in Python vectorizing the loops made a huge difference. Here it actually slows down a bit relative to the loops.

Without optimization:

$ gfortran rotate4.f90 -o rotate4.exe
./rotate4.exe
 Rotating       150000  particles through           40  theta values
 CPU time (sec):  0.08140400

With optimization:

$ gfortran -O3 rotate4.f90 -o rotate4.exe
./rotate4.exe
 Rotating       150000  particles through           40  theta values
 CPU time (sec):  0.04200900

Linear algebra version

Finally, Version 5 uses the Fortran 90 matrix multiplication routine.

# $CLASSHG/codes/particles/rotate5.f90

program rotate5
   
   implicit none
   integer, parameter :: n = 150000   ! number of particles
   integer :: nsteps = 40             ! number of rotation steps
   integer :: i, j
   real :: t1, t2, pi, dtheta, theta, sdt, cdt

   real, dimension(n) :: x, y, alpha  ! arrays of length n
   real, dimension(n) :: x1, y1, r    ! arrays of length n
   real, dimension(2,n) :: xy         ! 2 by n matrix for xy values
   real, dimension(2,2) :: Rot        ! 2 by 2 rotation matrix
   
   call cpu_time(t1)                  ! start the CPU timer

   pi = 4. * atan(1.)                 ! computes pi to full precision
   dtheta = 2. * pi / nsteps          ! change in theta between steps
   
   print *, "Rotating ", n, " particles through ", nsteps, " theta values"
   
   do j=1,n
      alpha(j) = (j-1)*2.*pi / n
   enddo

   r = 1. + 0.5*cos(10*alpha)         ! vector with radius for each alpha
   xy(1,:) = 2. + r*cos(alpha)        ! vector of x before rotation
   xy(2,:) = r*sin(alpha)             ! vector of y before rotation
   Rot(1,1) = cdt
   Rot(1,2) = sdt
   Rot(2,1) = -sdt
   Rot(2,2) = cdt

   do i=1,nsteps
      xy = matmul(Rot, xy)         ! multiply by rotation matrix
      x = xy(1,:)                  ! extract 1st row
      y = xy(2,:)                  ! extract 2nd row
   enddo
   
   call cpu_time(t2)
   
   print '(" CPU time (sec):", f12.8 )', t2-t1

end program rotate5

This slows the code down even further. Moreover, optimization doesn’t help at all!

Without optimization:

$ gfortran rotate5.f90 -o rotate5.exe
./rotate5.exe
 Rotating       150000  particles through           40  theta values
 CPU time (sec):  0.34880498

With optimization:

$ gfortran -O3 rotate5.f90 -o rotate5.exe
./rotate5.exe
 Rotating       150000  particles through           40  theta values
 CPU time (sec):  2.14158487