These examples solve the simple problem described in Section Particle methods, and follow a nearly identical pattern to the sequence of Python programs discussed in Rotating particles and Python efficiency. Comparing these examples to the ones on that page may be useful in understanding the languages as well as some of the efficiency issues.
For a general introduction to Fortran, see Fortran.
For more about the timings reported here and the computer used, see Timing and profiling computer codes.
This is a simple example of a particle code (see Particle methods) in which a set of particles in 2 space dimensions (the x-y plane) are initialized and then moved around. Here the motion is simply taken to be rotation about the origin. For more about this example see Rotating particles and Python efficiency.
The results below show that in Fortran loops are not necessarily bad. In fact the fastest version on this page is rotate3.f90 which has nested loops. In Python loops are bad because it’s an interpreted language. Using the vectorized version gives considerable speedup because it calls a compiled routine to do the loops. Fortran is a compiled language so doing the loops directly can be faster.
Here we start with the analog of rotate2.py:
# $CLASSHG/codes/particles/rotate2.f90
program rotate2
implicit none
integer, parameter :: n = 150000 ! number of particles
integer :: nsteps = 40 ! number of rotation steps
integer :: i, j
real :: t1, t2, pi, dtheta, theta, r, alpha, x1, y1
real, dimension(n) :: x, y ! arrays of length n
call cpu_time(t1) ! start the CPU timer
pi = 4. * atan(1.) ! computes pi to full precision
dtheta = 2. * pi / nsteps ! change in theta between steps
print *, "Rotating ", n, " particles through ", nsteps, " theta values"
do i=1,nsteps
theta = i*dtheta
do j=1,n
alpha = 2. * pi*j/n
r = 1. + 0.5*cos(10.*alpha)
x1 = 2. + r*cos(alpha)
y1 = r*sin(alpha)
x(j) = cos(theta)*x1 - sin(theta)*y1
y(j) = sin(theta)*x1 + cos(theta)*y1
enddo
enddo
call cpu_time(t2)
print '(" CPU time (sec):", f12.8 )', t2-t1
end program rotate2
The initial timing is:
$ gfortran rotate2.f90 -o rotate2.exe
$ ./rotate2.exe
Rotating 150000 particles through 40 theta values
CPU time (sec): 1.03054905
This is much faster than the corresponding Python code (note that we are using 150000 particles rather than 150 here!).
We can speed it up a bit more by turning optimization on:
$ gfortran -O3 rotate2.f90 -o rotate2.exe
$ ./rotate2.exe
Rotating 150000 particles through 40 theta values
CPU time (sec): 0.70011801
Next is the analog of rotate3.py, in which redudant computations have been taken out of the loops. Here is the Fortran version:
# $CLASSHG/codes/particles/rotate3.f90
program rotate3
implicit none
integer, parameter :: n = 150000 ! number of particles
integer :: nsteps = 40 ! number of rotation steps
integer :: i, j
real :: t1, t2, pi, dtheta, theta, sinth, costh
real, dimension(n) :: x, y, alpha ! arrays of length n
real, dimension(n) :: x1, y1, r ! arrays of length n
call cpu_time(t1) ! start the CPU timer
pi = 4. * atan(1.) ! computes pi to full precision
dtheta = 2. * pi / nsteps ! change in theta between steps
print *, "Rotating ", n, " particles through ", nsteps, " theta values"
do j=1,n
alpha(j) = (j-1)*2.*pi / n
r(j) = 1. + 0.5*cos(10*alpha(j)) ! radius for each alpha
x1(j) = 2. + r(j)*cos(alpha(j)) ! x value before rotation
y1(j) = r(j)*sin(alpha(j)) ! y value before rotation
enddo
do i=1,nsteps
theta = i*dtheta
costh = cos(theta)
sinth = sin(theta)
do j=1,n
x(j) = costh*x1(j) - sinth*y1(j)
y(j) = sinth*x1(j) + costh*y1(j)
enddo
enddo
call cpu_time(t2)
print '(" CPU time (sec):", f12.8 )', t2-t1
end program rotate3
Without optimization:
$ gfortran rotate3.f90 -o rotate3.exe
$ ./rotate3.exe
Rotating 150000 particles through 40 theta values
CPU time (sec): 0.07432000
With optimization:
$ gfortran -O3 rotate3.f90 -o rotate3.exe
$ ./rotate3.exe
Rotating 150000 particles through 40 theta values
CPU time (sec): 0.03912600
This is much faster than the Python version.
Version 4 uses array operations rather than loops.
# $CLASSHG/codes/particles/rotate4.f90
program rotate4
implicit none
integer, parameter :: n = 150000 ! number of particles
integer :: nsteps = 40 ! number of rotation steps
integer :: i, j
real :: t1, t2, pi, dtheta, theta, sinth, costh
real, dimension(n) :: x, y, alpha ! arrays of length n
real, dimension(n) :: x1, y1, r ! arrays of length n
call cpu_time(t1) ! start the CPU timer
pi = 4. * atan(1.) ! computes pi to full precision
dtheta = 2. * pi / nsteps ! change in theta between steps
print *, "Rotating ", n, " particles through ", nsteps, " theta values"
do j=1,n
alpha(j) = (j-1)*2.*pi / n
enddo
r = 1. + 0.5*cos(10*alpha) ! vector with radius for each alpha
x1 = 2. + r*cos(alpha) ! vector of x before rotation
y1 = r*sin(alpha) ! vector of y before rotation
do i=1,nsteps
theta = i*dtheta
costh = cos(theta)
sinth = sin(theta)
x = cos(theta)*x1 - sin(theta)*y1
y = sin(theta)*x1 + cos(theta)*y1
enddo
call cpu_time(t2)
print '(" CPU time (sec):", f12.8 )', t2-t1
end program rotate4
Recall that in Python vectorizing the loops made a huge difference. Here it actually slows down a bit relative to the loops.
Without optimization:
$ gfortran rotate4.f90 -o rotate4.exe
./rotate4.exe
Rotating 150000 particles through 40 theta values
CPU time (sec): 0.08140400
With optimization:
$ gfortran -O3 rotate4.f90 -o rotate4.exe
./rotate4.exe
Rotating 150000 particles through 40 theta values
CPU time (sec): 0.04200900
Finally, Version 5 uses the Fortran 90 matrix multiplication routine.
# $CLASSHG/codes/particles/rotate5.f90
program rotate5
implicit none
integer, parameter :: n = 150000 ! number of particles
integer :: nsteps = 40 ! number of rotation steps
integer :: i, j
real :: t1, t2, pi, dtheta, theta, sdt, cdt
real, dimension(n) :: x, y, alpha ! arrays of length n
real, dimension(n) :: x1, y1, r ! arrays of length n
real, dimension(2,n) :: xy ! 2 by n matrix for xy values
real, dimension(2,2) :: Rot ! 2 by 2 rotation matrix
call cpu_time(t1) ! start the CPU timer
pi = 4. * atan(1.) ! computes pi to full precision
dtheta = 2. * pi / nsteps ! change in theta between steps
print *, "Rotating ", n, " particles through ", nsteps, " theta values"
do j=1,n
alpha(j) = (j-1)*2.*pi / n
enddo
r = 1. + 0.5*cos(10*alpha) ! vector with radius for each alpha
xy(1,:) = 2. + r*cos(alpha) ! vector of x before rotation
xy(2,:) = r*sin(alpha) ! vector of y before rotation
Rot(1,1) = cdt
Rot(1,2) = sdt
Rot(2,1) = -sdt
Rot(2,2) = cdt
do i=1,nsteps
xy = matmul(Rot, xy) ! multiply by rotation matrix
x = xy(1,:) ! extract 1st row
y = xy(2,:) ! extract 2nd row
enddo
call cpu_time(t2)
print '(" CPU time (sec):", f12.8 )', t2-t1
end program rotate5
This slows the code down even further. Moreover, optimization doesn’t help at all!
Without optimization:
$ gfortran rotate5.f90 -o rotate5.exe
./rotate5.exe
Rotating 150000 particles through 40 theta values
CPU time (sec): 0.34880498
With optimization:
$ gfortran -O3 rotate5.f90 -o rotate5.exe
./rotate5.exe
Rotating 150000 particles through 40 theta values
CPU time (sec): 2.14158487