.. _rotating: ============================================================= Rotating particles and Python efficiency ============================================================= All the codes on this page can be found in the directory $CLASSHG/codes/particles/ (see :ref:`classhg` for information on obtaining the class repository). For Fortran versions of these codes, see :ref:`rotating_fortran`. For more about the timings reported here and the computer used, see :ref:`timings`. The main point of the example on this page is to show how implementing a Python code using a few basic principles can speed it up by a factor of 2000 relative to a first attempt. These principles apply equally well to Matlab codes (see :ref:`numerical_python` for more about Python vs. Matlab for numerical calculations). For a general introduction to Python, see :ref:`python`. A poorly written code --------------------- These examples solve the simple problem described in Section :ref:`particle_methods`. Below is a Python program that computes the rotating particles and plots them at a number of revolution angles. Note that this program is **not** written at all efficiently, in fact it was designed to be nearly as bad as possible for what it does (though it could be made even worse, see :ref:`python_memory_alloc`). .. literalinclude:: ../codes/particles/rotate1.py :language: python :linenos: The file $CLASSHG/codes/particles/rotate_plotter.py is not included here but can be viewed `here <../../../codes/particles/rotate_plotter.py>`_ and is available in the repository. (See :ref:`pylab` for more about plotting.) Running this code at the command line gives:: $ python rotate1.py Rotating 150 particles through 40 theta values CPU time: 7.27665200 seconds It should also produce a sequence of figures similar to `this animation <../../images/rotate-movie.gif>`_. The output above shows that running this code took 7.2 seconds of CPU time. **You should be shocked at how long this took.** This is partly because plotting takes a lot of CPU time. This is an important issue in itself for scientific computing, see :ref:`visualization`. For now we do not want to be confused by this so let's make a version of the code that leaves out the plotting and just computes the locations of the particles for each value of theta. .. literalinclude:: ../codes/particles/rotate2.py :language: python :linenos: This takes much less time than the previous version:: $ python rotate2.py Rotating 150 particles through 40 theta values CPU time: 0.26613700 seconds Note that this is much quicker, but is still painfully slow. All we are doing is computing 6000 particle locations on a machine that in priciple is capable of doing a few hundred million floating point operations per second (see :ref:`timings`). There are several major things wrong with this program. We will make some improvements by examing each in turn. Eliminating repeated computations --------------------------------- One problem with the code above is that the trigonometric functions *cos* and *sin* are being evaluated over and over again with the same arguments in the loops. Evaluating these functions takes longer than doing a single multiply or add and it is much better to evaluate the required values once and reuse them as needed. Here is an improved version of the code: .. literalinclude:: ../codes/particles/rotate3.py :language: python :linenos: This cuts the time down by a factor of 10 or so:: $ python rotate3.py Rotating 150 particles through 40 theta values CPU time: 0.03809200 seconds This may seem pretty good, but note that this is for only 150 particles. Real applications often involve millions of particles. If we scaled this up to 150,000 particles we'd expect it to take at least 100 times longer, as in fact we see if we modify *N* in *rotate3.py*:: $ python rotate3.py # after modifying N Rotating 150000 particles through 40 theta values CPU time: 35.73830700 seconds Vectorizing code and eliminating loops -------------------------------------- The major problem with the code above is that the inner loop on *j*, where we loop over all the particles, is written as a loop. *Loops should be avoided whenever possible* with interpreted languages like Python or Matlab. Instead, it is much better to use vectorized operations that are applied to the vectors *x* and *y* rather than to individual elements. An improved version takes the form: .. literalinclude:: ../codes/particles/rotate4.py :language: python :linenos: Note that we have kept *N = 150000* here. Even so, this code executes in under 1 second:: $ python rotate4.py Rotating 150000 particles through 40 theta values CPU time: 0.27013900 seconds One might also be able to replace the loop on *theta* values by vectorized operations to compute all the locations at all times, but if one wants to add back in the plotting or other operations to be done for each value of *theta*, this may not be advantageous. Linear algebra version ---------------------- It is possible to improve this even further, however, by applying a bit of linear algebra. The location of a single particle at each theta can be computed by taking its location *(x,y)* at the previous value of *theta*, when viewed as a 2-vector, and multiplying by a *rotation matrix* of the form :math:`R = \left[\begin{array}{cc}\cos(\Delta\theta) & -\sin(\Delta\theta) \\ \sin(\Delta\theta) & \cos(\Delta\theta) \end{array}\right]`. where :math:`\Delta\theta` is the change in rotation angle, which is called *dtheta* in the code. We do not want to loop over all particles doing this, however, or we will be back to a very slow code because of the loop. Instead, notice if that we define a :math:`2\times N` matrix *B* whose first row consists of the *x* vector and whose second row is the *y* vector (so that the j'th column is the location vector for the j'th particle), then we can update all particle locations simultaneously simply by computing the matrix-matrix product *RB*, where *R* is again the :math:`2\times 2` rotation matrix. This gives the following (note that *B* is called *xy* in the code): .. literalinclude:: ../codes/particles/rotate5.py :language: python :linenos: This gives slightly improved timing because Python takes advantage of special routines that optimize matrix-matrix multiplication:: $ python rotate5.py Rotating 150000 particles through 40 theta values CPU time: 0.16931200 seconds This is about 200 times faster than the *rotate3.py* code, and probably about 2000 times faster than our original code would have been. We might now get brave and try increasing N by another factor of 100, using 15 million particles:: $ python rotate5.py Rotating 15000000 particles through 40 theta values CPU time: 25.91272300 seconds This is more like the speed expected from this computer! Note that doing the matrix-matrix multiply for the case with *N* particles requires doing *4N* multiplies and *2N* additions, so a total of about *240N* floating point operations and hence 3.6 billion operations when *N* is 15 million. Dividing this by 25 seconds indicates that we have achieved a speed of about 144 Mflops (million floating point operations per second). Further improvements are possible. In particular, this code as written is being executed on only one core on this dual core laptop, while the other sits idle. If we can take advantage of both cores, we may be able to cut the computing time in half. See :ref:`parallel` for more discussion of this. We will return to this example later in the quarter.