.. _rotating:

=============================================================
Rotating particles and Python efficiency
=============================================================

All the codes on this page can be found in the directory
$CLASSHG/codes/particles/ (see :ref:`classhg` for information on
obtaining the class repository).

For Fortran versions of these codes, see :ref:`rotating_fortran`.

For more about the timings reported here and the computer used, see
:ref:`timings`.  The main point of the example on this page is to show how
implementing a Python code using a few basic principles can speed it up by a
factor of 2000 relative to a first attempt.  These principles apply equally
well to Matlab codes (see :ref:`numerical_python` for more about Python vs.
Matlab for numerical calculations).

For a general introduction to Python, see :ref:`python`.


A poorly written code
---------------------

These examples solve the simple problem described in Section
:ref:`particle_methods`.

Below is a Python program that computes the rotating particles and plots them
at a number of revolution angles.  Note that this program is **not** written
at all efficiently, in fact it was designed to be nearly as bad as possible
for what it does (though it could be made even worse, see
:ref:`python_memory_alloc`). 


.. literalinclude:: ../codes/particles/rotate1.py
   :language: python
   :linenos:


The file $CLASSHG/codes/particles/rotate_plotter.py is not included here but
can be viewed `here <../../../codes/particles/rotate_plotter.py>`_
and is available in the repository. (See :ref:`pylab` for more about
plotting.)

Running this code at the command line gives::

        $ python rotate1.py
        Rotating 150 particles through 40 theta values
        CPU time:   7.27665200 seconds

It should also produce a sequence of figures similar to `this animation
<../../images/rotate-movie.gif>`_.


The output above shows that running this code took 7.2 seconds of CPU time.
**You should be shocked at how long this took.**

This is partly because plotting takes a lot of CPU time.  
This is an important issue in itself for scientific computing, see
:ref:`visualization`.

For now we do not 
want to be confused by this so let's make a version of the code that leaves
out the plotting and just computes the locations of the particles for each
value of theta.


.. literalinclude:: ../codes/particles/rotate2.py
   :language: python
   :linenos:


This takes much less time than the previous version::

        $ python rotate2.py
        Rotating 150 particles through 40 theta values
        CPU time:   0.26613700 seconds

Note that this is much quicker, but is still painfully slow.  All we are
doing is computing 6000 particle locations on a machine that in priciple is
capable of doing a few hundred million floating point operations per second (see
:ref:`timings`).

There are several major things wrong with this program.  We will make some
improvements by examing each in turn.

Eliminating repeated computations
---------------------------------

One problem with the code above is that the 
trigonometric functions *cos* and *sin* are being
evaluated over and over again with the same arguments in the loops.
Evaluating these functions takes longer than doing a single multiply
or add and it is much better to evaluate the required values once and reuse
them as needed.

Here is an improved version of the code:


.. literalinclude:: ../codes/particles/rotate3.py
   :language: python
   :linenos:


This cuts the time down by a factor of 10 or so::

    $ python rotate3.py
    Rotating 150 particles through 40 theta values
    CPU time:   0.03809200 seconds

This may seem pretty good, but note that this is for only 150 particles.
Real applications often involve millions of particles.  If we scaled this up
to 150,000 particles we'd expect it to take at least 100 times longer, as in
fact we see if we modify *N* in *rotate3.py*::

    $ python rotate3.py   # after modifying N
    Rotating 150000 particles through 40 theta values
    CPU time:  35.73830700 seconds

Vectorizing code and eliminating loops
--------------------------------------

The major problem with the code above is that the inner loop
on *j*, where we loop over all the particles, is written as a loop.  *Loops
should be avoided whenever possible* with interpreted languages like Python
or Matlab.  Instead, it is much better to use vectorized operations that are
applied to the vectors *x* and *y* rather than to individual elements.

An improved version takes the form:


.. literalinclude:: ../codes/particles/rotate4.py
   :language: python
   :linenos:


Note that we have kept *N = 150000* here.  Even so, this code executes in
under 1 second::

    $ python rotate4.py
    Rotating 150000 particles through 40 theta values
    CPU time:   0.27013900 seconds

One might also be able to replace the loop on *theta* values by vectorized
operations to compute all the locations at all times, but if one wants to
add back in the plotting or other operations to be done for each value of
*theta*, this may not be advantageous.  

Linear algebra version
----------------------

It is possible to improve this even further, however, by applying a bit of
linear algebra.  The location of a single particle at each theta can be
computed by taking its location *(x,y)* at the previous value of *theta*,
when viewed as a 2-vector, and multiplying by a *rotation matrix* of the
form 

:math:`R = \left[\begin{array}{cc}\cos(\Delta\theta) & -\sin(\Delta\theta) \\
\sin(\Delta\theta) & \cos(\Delta\theta) \end{array}\right]`.

where :math:`\Delta\theta` is the change in rotation angle, which is called
*dtheta* in the code.

We do not want to loop over all particles doing this, however, or we will be
back to a very slow code because of the loop.  Instead, notice if that we
define a :math:`2\times N` matrix *B* whose first row consists of the *x*
vector and whose second row is the *y* vector (so that the j'th column is
the location vector for the j'th particle), then we can update all particle
locations simultaneously simply by computing the matrix-matrix product *RB*,
where *R* is again the :math:`2\times 2` rotation matrix.  This gives the
following (note that *B* is called *xy* in the code):


.. literalinclude:: ../codes/particles/rotate5.py
   :language: python
   :linenos:


This gives slightly improved timing because Python takes advantage of
special routines that optimize matrix-matrix multiplication::

  $ python rotate5.py
  Rotating 150000 particles through 40 theta values
  CPU time:   0.16931200 seconds

This is about 200 times faster than the *rotate3.py* code, and probably
about 2000 times faster than our original code would have been.

We might now get brave and try increasing N by another factor of 100, using
15 million particles::

    $ python rotate5.py
    Rotating 15000000 particles through 40 theta values
    CPU time:  25.91272300 seconds

This is more like the speed expected from this computer!  Note that doing
the matrix-matrix multiply for the case with *N* particles requires doing
*4N* multiplies and *2N* additions, so a total of about *240N* floating
point operations and hence 3.6 billion operations when *N* is 15 million.
Dividing this by 25 seconds indicates that we have achieved a speed of about
144 Mflops (million floating point operations per second).

Further improvements are possible.  In particular, this code as written is
being executed on  only one core on this dual core laptop, while the other
sits idle.  If we can take advantage of both cores, we may be able to cut
the computing time in half.  See :ref:`parallel` for more discussion of
this.  We will return to this example later in the quarter.