=====================================
 Profiling :mod:`mmf.math.multigrid`
=====================================

This is a log of the profiling process for :mod:`mmf.math.multigrid`
version 2216.  The file :file:`multigrid_profile.py` was put in a
directory :file:`multigrid/profile` located with the :mod:`multigrid`
module.  The optimized version was committed as svn version 2220, then both
of these were moved to the documentation directory for archival.  The
profile script was modified slightly along the way, so the output from
the current version is not exactly the same as the recorded output.
The detailed profiling code is shown at the bottom of this file.

First we must define a problem to profile.  The main application is a
three-dimensional problem, so we start with that.

.. literalinclude:: multigrid_profile.py
   :pyobject: test_1

Our first run gives the following output::

   $ python multigrid_profile.py test_1

   Profiling function test_1...
   Writing output to _profs/multigrid_test_1.prof
   Done.
            249197 function calls (233037 primitive calls) in 10.630 CPU seconds

      Ordered by: internal time, call count
      List reduced from 71 to 20 due to restriction <20>

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1536    4.198    0.003    7.630    0.005 multigrid_.py:646(T)
       46080    2.887    0.000    3.076    0.000 numeric.py:879(roll)
    9184/224    1.052    0.000    1.388    0.006 multigrid_.py:864(I)
    6624/224    0.700    0.000    0.829    0.004 multigrid_.py:748(R)
        1984    0.324    0.000    0.403    0.000 multigrid_.py:1001(augment)
       20736    0.273    0.000    0.287    0.000 index_tricks.py:487(__getslice__)
       46080    0.189    0.000    0.189    0.000 numeric.py:232(asanyarray)
         640    0.185    0.000    5.190    0.008 multigrid_.py:503(S)
         723    0.152    0.000    0.168    0.000 fromnumeric.py:32(_wrapit)
        9259    0.126    0.000    0.126    0.000 numeric.py:180(asarray)
        1536    0.108    0.000    7.738    0.005 multigrid_.py:600(A)
           1    0.106    0.106   10.606   10.606 multigrid_.py:1344(solve)
       89478    0.066    0.000    0.066    0.000 index_tricks.py:478(__getitem__)
      240/80    0.061    0.000    9.292    0.116 multigrid_.py:1221(v_cycle)
        5952    0.040    0.000    0.040    0.000 numeric.py:943(rollaxis)
        2176    0.033    0.000    0.039    0.000 multigrid_.py:417(dx_inv)
          80    0.019    0.000    0.024    0.000 linalg.py:1185(lstsq)
          80    0.014    0.000    1.016    0.013 multigrid_.py:1203(to_mat)
         882    0.012    0.000    0.180    0.000 fromnumeric.py:1626(prod)
         640    0.010    0.000    0.012    0.000 fromnumeric.py:833(trace)

    Line-data for file .../utils/mmf/math/multigrid/multigrid_.py
      ...
      646: ------ ------     def T(self, x, dx_inv=None):
      715:  0.01%   1536         ndim = self.d
      716:  0.04%   1536         inner = np.s_[...,] + _inner*ndim
      717: ------ ------ 
      718:  0.02%   1536         v_shape = x.shape[:-ndim]   # Shape of vector part
      719:  0.01%   1536         nv = len(v_shape)           # Number of vector dimensions
      720:  0.02%   1536         shape = np.asarray(x.shape[-ndim:]) # Shape of grid part
      721: ------ ------ 
      722: ------ ------ 
      723:  0.48%   1536         ddx = np.zeros(x.shape, dtype=x.dtype)
      724: ------ ------ 
      725:  0.01%   1536         if dx_inv is None:
      726:  0.03%   1536            dx_inv = self.dx_inv(shape)
      727: ------ ------ 
      728:  0.03%   1536         x = self.augment(x)
      729: ------ ------ 
      730:  0.16%   1536         dx_inv_2 = np.dot(dx_inv, dx_inv.T)
      731:  0.05%   6144         for a in xrange(ndim):
      732:  0.19%  18432             for b in xrange(ndim):
      733:  0.10%  13824                 if a == b:
      734: ------ ------                     xab = (- 2*x
      735: ------ ------                            + np.roll(x, -1, axis=nv + a)
      736:  2.62%   4608                            + np.roll(x, 1, axis=nv + a))
      737: ------ ------                 else:
      738:  0.26%   9216                    xa = (np.roll(x, 1, axis=nv + a)
      739:  0.21%   9216                          - np.roll(x, -1, axis=nv + a))
      740:  0.26%   9216                    xab = (np.roll(xa, 1, axis=nv + b)
      741:  0.19%   9216                           - np.roll(xa, -1, axis=nv + b))/4
      742: ------ ------                    
      743: ------ ------                 # Transform to get Laplacian
      744: 10.90%  13824                 ddx += xab[inner]*dx_inv_2[a,b]
      745: ------ ------ 
      746:  0.01%   1536         return ddx
    
    Line-data for file /Library/.../numpy/core/numeric.py
         ...
      879: ------ ------ def roll(a, shift, axis=None):
         ...
      928:  0.43%  46080     a = asanyarray(a)
      929:  0.26%  46080     if axis is None:
      930: ------ ------         n = a.size
      931: ------ ------         reshape = True
      932: ------ ------     else:
      933:  0.48%  46080         n = a.shape[axis]
      934:  0.22%  46080         reshape = False
      935:  0.29%  46080     shift %= n
      936:  3.69%  46080     indexes = concatenate((arange(n-shift,n),arange(n-shift)))
      937: 20.85%  46080     res = a.take(indexes, axis)
      938:  0.24%  46080     if reshape:
      939: ------ ------         return res.reshape(a.shape)
      940: ------ ------     else:
      941:  0.18%  46080         return res

We see that most of the time is spent in `T` and `roll`.  Using the
line-event data we can see the hotspots.  Running once more with `n=5`
demonstrates the same proportions::

   $ python multigrid_profile.py test_1 5
   Profiling function test_1...
   Writing output to _profs/multigrid_test_1.prof
   Done.
            372066 function calls (347195 primitive calls) in 57.850 CPU seconds

      Ordered by: internal time, call count
      List reduced from 72 to 20 due to restriction <20>

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2210   28.526    0.013   48.318    0.022 multigrid_.py:646(T)
       66300   18.209    0.000   18.531    0.000 numeric.py:879(roll)
   13940/340    3.203    0.000    3.407    0.010 multigrid_.py:864(I)
   10540/340    1.866    0.000    2.338    0.007 multigrid_.py:748(R)
        2890    1.507    0.001    1.629    0.001 multigrid_.py:1001(augment)
        1020    1.306    0.001   34.933    0.034 multigrid_.py:503(S)
           1    0.897    0.897   57.689   57.689 multigrid_.py:1344(solve)
        2210    0.740    0.000   49.058    0.022 multigrid_.py:600(A)
     357/102    0.338    0.001   49.085    0.481 multigrid_.py:1221(v_cycle)
       66300    0.323    0.000    0.323    0.000 numeric.py:232(asanyarray)
       13747    0.198    0.000    0.198    0.000 numeric.py:180(asarray)
      138046    0.104    0.000    0.104    0.000 index_tricks.py:478(__getitem__)
       32130    0.078    0.000    0.101    0.000 index_tricks.py:487(__getslice__)
           1    0.068    0.068   57.847   57.847 multigrid_profile.py:28(test_1)
        8670    0.062    0.000    0.062    0.000 numeric.py:943(rollaxis)
        3230    0.054    0.000    0.065    0.000 multigrid_.py:417(dx_inv)
           2    0.053    0.027    0.054    0.027 multigrid_.py:434(x)
          85    0.048    0.001    0.054    0.001 multigrid_.py:1412(_remove_constant)
          17    0.034    0.002   50.641    2.979 multigrid_.py:1263(full_multigrid)
          32    0.033    0.001    0.033    0.001 fromnumeric.py:1492(amax)

We start with line 744 in :file:`multigrid_.py`::

      743: ------ ------                 # Transform to get Laplacian
      744: 10.90%  13824                 ddx += xab[inner]*dx_inv_2[a,b]

Perhaps the problem is that the operations of take and the
multiplication etc. do not happen in place and force copies to be
made.  Noting that the array `xab` is never used again, we change this
to the following using in-place operations and reprofile::

  743: ------ ------                 # Transform to get Laplacian
  744:  6.80%  13824                 xab[inner] *= dx_inv_2[a,b]
  745:  3.65%  13824                 ddx += xab[inner]

6.80% + 3.65% = 10.45%, compared with 10.90% before:  Not much
improvement.  This surprised me and demonstrates the important point
that you must profile to determine performance issues!  Perhaps it is
the indexing that is taking a long time.  We try a few other things
for comparison.  It looks like in-place operations on arrays with
complex indexing are somewhat costly::

  743: ------ ------                 # Transform to get Laplacian
  744:  0.78%  13824                 xab[inner]
  745:  2.70%  13824                 xab *= dx_inv_2[a,b]
  746:  6.67%  13824                 xab[inner] *= 1.0
  747:  3.82%  13824                 ddx += xab[inner]

Here is the best I can do with these lines::

  743: ------ ------                 # Transform to get Laplacian
  744:  2.95%  13824                 xab *= dx_inv_2[a,b]
  745:  4.37%  13824                 ddx += xab[inner]

2.95% + 4.37% = 7.32%: Not great but better than nothing.
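
As a minimal sketch of the two variants compared above (not the
module's code), the following checks that scaling `xab` in place and
then adding the inner view gives the same result as the original
one-liner.  The array sizes and the scalar `c`, which stands in for
`dx_inv_2[a,b]`, are assumptions made for illustration.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   x_aug = rng.standard_normal((34, 34, 34))  # augmented array (hypothetical size)
   inner = (slice(1, -1),) * 3                # strips one boundary layer per axis
   c = 0.25                                   # stands in for dx_inv_2[a, b]

   # Variant 1: the original line.  xab[inner]*c allocates a temporary.
   ddx1 = np.zeros((32, 32, 32))
   xab = x_aug.copy()
   ddx1 += xab[inner] * c

   # Variant 2: scale the whole array in place, then add the inner view.
   # This overwrites xab, which is only safe because xab is not reused.
   ddx2 = np.zeros((32, 32, 32))
   xab = x_aug.copy()
   xab *= c
   ddx2 += xab[inner]

   assert np.allclose(ddx1, ddx2)

The second variant does slightly more arithmetic (it also scales the
boundary layer), but it works on contiguous memory.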

.. note:: The timings are not completely accurate as they
   depend on CPU load etc. and vary from run to run.  One can either
   average several runs, or just treat the results cautiously,
   acknowledging a fair amount of error.  Make another profiling run
   if things look suspicious.  We have done that here and included the
   most "consistent" runs.
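
One way to get a feel for the run-to-run variation is to repeat a
small timing several times with the standard :mod:`timeit` module.
This is only a sketch: the function `kernel` is a hypothetical
stand-in for whatever line is being timed, and the array size is an
assumption.

.. code-block:: python

   import timeit
   import numpy as np

   x = np.random.default_rng(3).standard_normal((34, 34, 34))

   def kernel():
       # Hypothetical stand-in for the operation being timed.
       return np.roll(x, 1, axis=0)

   # Five independent measurements of 20 calls each.
   times = timeit.repeat(kernel, number=20, repeat=5)
   best = min(times)
   mean = sum(times) / len(times)
   spread = (max(times) - min(times)) / mean
   print("best %.4fs  mean %.4fs  spread %.0f%%" % (best, mean, 100 * spread))

If the spread is comparable to the difference between two variants,
the comparison should be treated as inconclusive.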

Another problem is that the indexing is done each step in the loop,
which occurs 9 times.  Let's try making `ddx` bigger and just
extracting the result at the end.  Here is the present set of
timings::

    $ python multigrid_profile.py test_1
    Profiling function test_1...
    Writing output to _profs/multigrid_test_1.prof
    Done.
             249197 function calls (233037 primitive calls) in 10.354 CPU seconds

       Ordered by: internal time, call count
       List reduced from 71 to 20 due to restriction <20>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1536    3.898    0.003    7.428    0.005 multigrid_.py:646(T)
        46080    2.909    0.000    3.111    0.000 numeric.py:879(roll)
     9184/224    1.104    0.000    1.327    0.006 multigrid_.py:865(I)
     6624/224    0.733    0.000    0.898    0.004 multigrid_.py:749(R)
         1984    0.365    0.000    0.475    0.000 multigrid_.py:1002(augment)

      723:  0.51%   1536         ddx = np.zeros(x.shape, dtype=x.dtype)
         ...
      728:  0.03%   1536         x = self.augment(x)
      729: ------ ------ 
      730:  0.16%   1536         dx_inv_2 = np.dot(dx_inv, dx_inv.T)
      731:  0.05%   6144         for a in xrange(ndim):
      732:  0.18%  18432             for b in xrange(ndim):
      733:  0.10%  13824                 if a == b:
      734: ------ ------                     xab = (- 2*x
      735: ------ ------                            + np.roll(x, -1, axis=nv + a)
      736:  2.70%   4608                            + np.roll(x, 1, axis=nv + a))
      737: ------ ------                 else:
      738:  0.26%   9216                    xa = (np.roll(x, 1, axis=nv + a)
      739:  0.21%   9216                          - np.roll(x, -1, axis=nv + a))
      740:  0.27%   9216                    xab = (np.roll(xa, 1, axis=nv + b)
      741:  0.20%   9216                           - np.roll(xa, -1, axis=nv + b))/4
      742: ------ ------                    
      743: ------ ------                 # Transform to get Laplacian
      744:  2.97%  13824                 xab *= dx_inv_2[a,b]
      745:  4.44%  13824                 ddx += xab[inner]
      746: ------ ------ 
      747:  0.01%   1536         return ddx

followed by the new timings with a larger `ddx`::

    $ python multigrid_profile.py test_1
       ...
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1536    3.735    0.002    7.263    0.005 multigrid_.py:646(T)
        46080    2.915    0.000    3.112    0.000 numeric.py:879(roll)
     9184/224    1.205    0.000    1.359    0.006 multigrid_.py:865(I)
     6624/224    0.734    0.000    0.891    0.004 multigrid_.py:749(R)

        ...
     726:  0.03%   1536         x = self.augment(x)
     727: ------ ------ 
     728:  0.57%   1536         ddx = np.zeros(x.shape, dtype=x.dtype)
     729: ------ ------ 
     730:  0.17%   1536         dx_inv_2 = np.dot(dx_inv, dx_inv.T)
     731:  0.05%   6144         for a in xrange(ndim):
     732:  0.19%  18432             for b in xrange(ndim):
     733:  0.11%  13824                 if a == b:
     734: ------ ------                     xab = (- 2*x
     735: ------ ------                            + np.roll(x, -1, axis=nv + a)
     736:  2.71%   4608                            + np.roll(x, 1, axis=nv + a))
     737: ------ ------                 else:
     738:  0.28%   9216                    xa = (np.roll(x, 1, axis=nv + a)
     739:  0.22%   9216                          - np.roll(x, -1, axis=nv + a))
     740:  0.28%   9216                    xab = (np.roll(xa, 1, axis=nv + b)
     741:  0.20%   9216                           - np.roll(xa, -1, axis=nv + b))/4
     742: ------ ------                    
     743: ------ ------                 # Transform to get Laplacian
     744:  2.98%  13824                 xab *= dx_inv_2[a,b]
     745:  2.72%  13824                 ddx += xab
     746: ------ ------ 
     747:  0.10%   1536         return ddx[inner]
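
The idea of deferring the slice can be sketched independently of the
module.  Here the nine arrays in `terms` are random stand-ins for the
nine `xab` arrays (an assumption for illustration); the point is only
that slicing once at the end gives the same answer as slicing in every
iteration.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(1)
   # Nine hypothetical stand-ins for the xab arrays of one call to T.
   terms = [rng.standard_normal((10, 10, 10)) for _ in range(9)]
   inner = (slice(1, -1),) * 3

   # Slice inside the loop: nine strided reads.
   ddx_small = np.zeros((8, 8, 8))
   for xab in terms:
       ddx_small += xab[inner]

   # Accumulate on the full augmented shape and slice once at the end.
   ddx_big = np.zeros((10, 10, 10))
   for xab in terms:
       ddx_big += xab
   result = ddx_big[inner]

   assert np.allclose(ddx_small, result)

Note that `result` is a non-contiguous view that keeps the larger
array alive, which may matter to callers that expect a contiguous
array.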

This is better now and we are approaching the point where the roll
operations are significant, so let's try looking at these.  Note that
in order to implement the boundary conditions we have augmented the
array and extract the inner entries upon return.  Thus, we can afford
to have errors in the outer dimensions.  As such, we don't really need
a full roll: we just need to shift the array.  Let's try doing this
with indexing (noting that we may have to undo the previous
optimizations!).
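
To see why shifted slices can replace the rolls, here is a small
two-dimensional sketch (the array shape is an assumption, and this is
not the module's code).  The wrapped-around entries produced by
`np.roll` only ever land in the boundary layer, which is discarded, so
both forms agree on the inner region.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(2)
   x = rng.standard_normal((8, 9))   # already augmented by one layer per side
   inner = (slice(1, -1), slice(1, -1))

   # Second difference along axis 0: full roll, then discard the boundary.
   d2_roll = (np.roll(x, -1, axis=0) + np.roll(x, 1, axis=0) - 2 * x)[inner]
   # Same stencil with shifted slices (views, no index gather).
   d2_slice = x[2:, 1:-1] + x[:-2, 1:-1] - 2 * x[1:-1, 1:-1]

   # Mixed difference along axes 0 and 1.
   xa = np.roll(x, 1, axis=0) - np.roll(x, -1, axis=0)
   dab_roll = ((np.roll(xa, 1, axis=1) - np.roll(xa, -1, axis=1)) / 4)[inner]
   dab_slice = (x[2:, 2:] + x[:-2, :-2] - x[2:, :-2] - x[:-2, 2:]) / 4

   assert np.allclose(d2_roll, d2_slice)
   assert np.allclose(dab_roll, dab_slice)

In the mixed term the two "diagonal" corners enter with a plus sign and
the two "anti-diagonal" corners with a minus sign.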

.. note::  When making a major change like this, one must
   run one's unit tests.  This was a tricky change and it took a few
   tries to get it right.  I found some other subtle potential bugs in
   the process.

Here are the results::

    $ python multigrid_profile.py test_1
    Profiling function test_1...
    Writing output to _profs/multigrid_test_1.prof
    Done.
             282989 function calls (266829 primitive calls) in 7.208 CPU seconds
    
       Ordered by: internal time, call count
       List reduced from 69 to 20 due to restriction <20>
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1536    3.708    0.002    4.356    0.003 multigrid_.py:646(T)
     9184/224    1.111    0.000    1.274    0.006 multigrid_.py:916(I)
     6624/224    0.739    0.000    0.903    0.004 multigrid_.py:800(R)
         1984    0.365    0.000    0.477    0.000 multigrid_.py:1053(augment)
       173958    0.231    0.000    0.231    0.000 index_tricks.py:478(__getitem__)
          640    0.190    0.000    2.941    0.005 multigrid_.py:503(S)

      646: ------ ------     def T(self, x, dx_inv=None):
         ...
      715:  0.02%   1536         ndim = self.d
      716:  0.03%   1536         v_shape = x.shape[:-ndim]   # Shape of vector part
      717:  0.02%   1536         nv = len(v_shape)           # Number of vector dimensions
      718:  0.04%   1536         shape = np.asarray(x.shape[-ndim:]) # Shape of grid part
      719: ------ ------ 
      720:  0.06%   1536         inner = np.s_[:,]*nv + np.s_[1:-1,]*ndim
      721: ------ ------ 
      722:  0.01%   1536         if dx_inv is None:
      723:  0.03%   1536            dx_inv = self.dx_inv(shape)
      724: ------ ------ 
      725:  0.72%   1536         ddx = np.zeros(x.shape, dtype=x.dtype)
      726: ------ ------ 
      727:  0.05%   1536         x = self.augment(x)
      728: ------ ------ 
      729:  0.24%   1536         dx_inv_2 = np.dot(dx_inv, dx_inv.T)
      730:  0.07%   6144         for a in xrange(ndim):
      731:  0.24%  18432             for b in xrange(ndim):
      732:  0.15%  13824                 if a == b:
      733: ------ ------                     # Index shifts
      734:  0.23%   4608                     a_left, a_right = list(inner), list(inner)
      735:  0.20%   4608                     a_left[nv + a] = np.s_[:-2,][0]
      736:  0.13%   4608                     a_right[nv + a] = np.s_[2:]
      737: ------ ------ 
      738: 12.03%   4608                     xab = x[a_left]  + x[a_right] - 2*x[inner]
      739: ------ ------                 else:
      740: ------ ------                     # Index shifts.  The b_shifts should leave the
      741: ------ ------                     # other indices alone as the a-shifts will extract
      742: ------ ------                     # these.  Likewise, the a-shifts should leave the
      743: ------ ------                     # index b alone.
      744:  0.49%   9216                     a_left, a_right = list(inner), list(inner)
      745:  0.40%   9216                     a_left[nv + a] = np.s_[:-2,][0]
      746:  0.24%   9216                     a_right[nv + a] = np.s_[2:]
      747:  1.04%   9216                     a_left[nv + b] = np.s_[:]
      748:  0.20%   9216                     a_right[nv + b] = np.s_[:]
      749: ------ ------ 
      750:  0.26%   9216                     b_left = list(len(a_left)*np.s_[:,])
      751:  0.25%   9216                     b_right = list(len(a_left)*np.s_[:,])                    
      752:  0.19%   9216                     b_left[nv + b] = np.s_[:-2,][0]
      753:  0.21%   9216                     b_right[nv + b] = np.s_[2:]
      754: ------ ------ 
      755: 10.21%   9216                     xa = (x[b_right] - x[b_left])
      756: 15.12%   9216                     xab = (xa[a_right] - xa[a_left])/4
      757: ------ ------                    
      758: ------ ------                 # Transform to get Laplacian
      759:  3.64%  13824                 xab *= dx_inv_2[a,b]
      760:  3.26%  13824                 ddx += xab
      761: ------ ------ 
      762:  0.01%   1536         return ddx
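
The index bookkeeping in the listing above can be sketched with a small
helper.  The name `shifted_index` is hypothetical and not part of the
module.  The sketch uses explicit `slice` objects, which should avoid
the need for the `np.s_[:-2,][0]` form, and it returns a tuple because
current NumPy versions expect a tuple (not a list) for
multidimensional indexing.

.. code-block:: python

   import numpy as np

   def shifted_index(inner, axis, shift):
       """Return a copy of `inner` with `axis` shifted by -1, 0 or +1 cells."""
       idx = list(inner)
       idx[axis] = {-1: slice(None, -2),
                    0: slice(1, -1),
                    1: slice(2, None)}[shift]
       return tuple(idx)

   nv, ndim = 1, 3                    # one vector dimension, three grid dimensions
   inner = np.s_[:,] * nv + np.s_[1:-1,] * ndim
   x = np.arange(2 * 6 * 6 * 6, dtype=float).reshape(2, 6, 6, 6)

   left = x[shifted_index(inner, nv + 0, -1)]
   right = x[shifted_index(inner, nv + 0, +1)]

   assert left.shape == right.shape == x[inner].shape == (2, 4, 4, 4)
   assert np.array_equal(right, x[:, 2:, 1:-1, 1:-1])

Because every shifted view has the same shape as `x[inner]`, the
stencil terms can be added directly without a final extraction step.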

In order to compare various ways of doing the axpy operation `B +=
a*A` we added some special tests to the profiling code.  In principle
using the BLAS axpy operation should work well, but the slow part of
the code is the indexing.  The test functions are:

.. literalinclude:: multigrid_profile.py
   :pyobject: test_axpy0

.. literalinclude:: multigrid_profile.py
   :pyobject: test_axpy1

.. literalinclude:: multigrid_profile.py
   :pyobject: test_axpy2

.. literalinclude:: multigrid_profile.py
   :pyobject: test_axpy3

with the following results::

    $ python multigrid_profile.py None test_axpy
    Profiling test_axpy() with profiler None...
    Writing output to _profs/multigrid_test_axpy().prof
    test_axpy0(A,B,inds,tmp,axpy) took 2.45949s
    test_axpy1(A,B,inds,tmp,axpy) took 1.94218s
    test_axpy2(A,B,inds,tmp,axpy) took 2.57161s
    test_axpy3(A,B,inds,tmp,axpy) took 3.35713s
    Done

We have a clear winner with the axpy operations after a copy.

.. note:: It is a bit tricky to use the axpy operation: you must make
   sure that everything is ravelled, and that the arrays are
   contiguous etc.  It will work without, but the performance will be
   worse than the old code.  One useful check is that the output array
   has the same base B::

      C = axpy(A,B,...)
      assert C.base is B
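
As a hedged sketch of the "copy, then axpy" pattern (not the module's
code), the helper below copies a strided view into a contiguous
scratch array before calling axpy.  The name `add_scaled` is
hypothetical, SciPy's `get_blas_funcs` is assumed to be available, and
a plain NumPy fallback is included only so that the sketch runs
without SciPy.

.. code-block:: python

   import numpy as np

   try:
       from scipy.linalg import get_blas_funcs
   except ImportError:       # fallback so the sketch runs without SciPy
       get_blas_funcs = None

   def add_scaled(view, acc, scale, tmp):
       """Accumulate acc += scale*view, where acc is flat and contiguous."""
       tmp[...] = view       # copy the strided view into contiguous storage
       if get_blas_funcs is None:
           acc += scale * tmp.ravel()
           return acc
       axpy, = get_blas_funcs(['axpy'], [tmp, acc])
       return axpy(tmp.ravel(), acc, a=scale)

   x = np.random.default_rng(4).standard_normal((6, 6))
   acc = np.zeros(16)        # flat accumulator, same dtype as tmp
   tmp = np.empty((4, 4))    # contiguous scratch array
   out = add_scaled(x[1:-1, 1:-1], acc, 0.5, tmp)

   # The result should live in the accumulator's memory, not in a copy.
   assert np.shares_memory(out, acc)
   assert np.allclose(out.reshape(4, 4), 0.5 * x[1:-1, 1:-1])

The `np.shares_memory` check plays the same role as the `base` check
above: if it fails, a copy was made somewhere and the benefit of axpy
is likely lost.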

Here is the optimized code.  We have increased the performance by
almost a factor of 2 (from 10.630 to 5.983 CPU seconds)::

    $ python multigrid_profile.py test_1
    Profiling test_1() with profiler mmf...
    Writing output to _profs/multigrid_test_1().prof
    Done
             289134 function calls (272974 primitive calls) in 5.983 CPU seconds

       Ordered by: internal time, call count
       List reduced from 71 to 20 due to restriction <20>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1536    2.396    0.002    3.170    0.002 multigrid_.py:648(T)
     9184/224    1.106    0.000    1.263    0.006 multigrid_.py:937(I)
     6624/224    0.737    0.000    0.906    0.004 multigrid_.py:821(R)
         1984    0.370    0.000    0.482    0.000 multigrid_.py:1074(augment)
       173958    0.229    0.000    0.229    0.000 index_tricks.py:478(__getitem__)
        62208    0.183    0.000    0.262    0.000 index_tricks.py:487(__getslice__)
          640    0.179    0.000    2.040    0.003 multigrid_.py:505(S)
        10795    0.160    0.000    0.160    0.000 numeric.py:180(asarray)
         1536    0.107    0.000    3.277    0.002 multigrid_.py:602(A)
            1    0.104    0.104    5.960    5.960 multigrid_.py:1417(solve)
         5952    0.069    0.000    0.069    0.000 numeric.py:943(rollaxis)
       240/80    0.061    0.000    5.227    0.065 multigrid_.py:1294(v_cycle)
         1536    0.050    0.000    0.050    0.000 __init__.py:22(get_blas_funcs)
         2259    0.042    0.000    0.086    0.000 fromnumeric.py:32(_wrapit)
         2176    0.038    0.000    0.045    0.000 multigrid_.py:419(dx_inv)
         2418    0.029    0.000    0.115    0.000 fromnumeric.py:1626(prod)
           80    0.020    0.000    0.027    0.000 linalg.py:1185(lstsq)
           80    0.015    0.000    0.959    0.012 multigrid_.py:1276(to_mat)
          640    0.010    0.000    0.012    0.000 fromnumeric.py:833(trace)
           16    0.009    0.001    5.687    0.355 multigrid_.py:1336(full_multigrid)


      717:  0.03%   1536         ndim = self.d
      718:  0.02%   1536         x_shape = x.shape
      719:  0.02%   1536         v_shape = x_shape[:-ndim]   # Shape of vector part
      720:  0.02%   1536         nv = len(v_shape)           # Number of vector dimensions
      721:  0.04%   1536         shape = np.asarray(x_shape[-ndim:]) # Shape of grid part
      722: ------ ------ 
      723:  0.07%   1536         inner = np.s_[:,]*nv + np.s_[1:-1,]*ndim
      724: ------ ------ 
      725:  0.01%   1536         if dx_inv is None:
      726:  0.04%   1536            dx_inv = self.dx_inv(shape)
      727: ------ ------ 
      728:  0.04%   1536         ddx = np.zeros(np.prod(x_shape), dtype=x.dtype)
      729:  0.18%   1536         tmp = np.empty(x_shape, dtype=float)
      730: ------ ------ 
      731:  0.05%   1536         axpy, = get_blas_funcs(['axpy'], [ddx, ddx])
      732: ------ ------         # axpy(A,B, A.size, a) computes B -> a*A + B
      733: ------ ------ 
      734:  0.11%   1536         x = self.augment(x)
      735: ------ ------ 
      736:  0.29%   1536         dx_inv_2 = np.dot(dx_inv, dx_inv.T)
      737: ------ ------         
      738:  0.09%   6144         for a in xrange(ndim):
      739:  0.26%  18432             for b in xrange(ndim):
      740:  0.18%  13824                 if a == b:
      741: ------ ------                     # Index shifts
      742:  0.22%   4608                     al, ar = list(inner), list(inner)
      743:  0.19%   4608                     al[nv + a] = np.s_[:-2,][0]
      744:  0.14%   4608                     ar[nv + a] = np.s_[2:]
      745: ------ ------ 
      746:  2.59%   4608                     tmp[::] = x[al]
      747:  0.97%   4608                     ddx = axpy(tmp.ravel(), ddx, a=dx_inv_2[a,b])
      748:  1.92%   4608                     tmp[::] = x[ar]
      749:  0.84%   4608                     ddx = axpy(tmp.ravel(), ddx, a=dx_inv_2[a,b])
      750:  1.90%   4608                     tmp[::] = x[inner]
      751:  1.00%   4608                     ddx = axpy(tmp.ravel(), ddx, a=-2*dx_inv_2[a,b])
      752: ------ ------ 
      753: ------ ------                 else:
      754: ------ ------                     # Index shifts.  The b_shifts should leave the
      755: ------ ------                     # other indices alone as the a-shifts will extract
      756: ------ ------                     # these.  Likewise, the a-shifts should leave the
      757: ------ ------                     # index b alone.
      758:  0.29%   9216                     al_bl = list(inner)
      759:  0.37%   9216                     al_bl[nv + a] = np.s_[:-2,][0] # Bug in s_
      760:  0.21%   9216                     al_bl[nv + b] = np.s_[:-2,][0] # Bug in s_
      761: ------ ------ 
      762:  0.25%   9216                     al_br = list(inner)
      763:  0.23%   9216                     al_br[nv + a] = np.s_[:-2,][0] # Bug in s_
      764:  1.43%   9216                     al_br[nv + b] = np.s_[2:]
      765: ------ ------ 
      766:  0.25%   9216                     ar_bl = list(inner)
      767:  0.27%   9216                     ar_bl[nv + a] = np.s_[2:]
      768:  2.06%   9216                     ar_bl[nv + b] = np.s_[:-2,][0] # Bug in s_
      769: ------ ------ 
      770:  0.26%   9216                     ar_br = list(inner)
      771:  0.26%   9216                     ar_br[nv + a] = np.s_[2:]
      772:  0.24%   9216                     ar_br[nv + b] = np.s_[2:]
      773: ------ ------ 
      774:  3.92%   9216                     tmp[::] = x[ar_br]
      775:  1.22%   9216                     ddx = axpy(tmp.ravel(), ddx, a=dx_inv_2[a,b]/4)
      776:  3.85%   9216                     tmp[::] = x[al_bl]
      777:  1.11%   9216                     ddx = axpy(tmp.ravel(), ddx, a=dx_inv_2[a,b]/4)
      778:  3.81%   9216                     tmp[::] = x[ar_bl]
      779:  1.16%   9216                     ddx = axpy(tmp.ravel(), ddx, a=-dx_inv_2[a,b]/4)
      780:  3.80%   9216                     tmp[::] = x[al_br]
      781:  1.13%   9216                     ddx = axpy(tmp.ravel(), ddx, a=-dx_inv_2[a,b]/4)
      782: ------ ------ 
      783:  0.08%   1536         return ddx.reshape(x_shape)

The slow part is still the indexing to perform the shifts.  It seems
we could probably gain about a factor of 3 or so if we coded this
directly in C++.

I considered cleaning up `I` and `R`, but they are kind of tricky and
the current code is about as clean and symmetric as it can be, so I
think I will leave it for now.  Again, this could be pretty easily done
in C++, so let's wait for that.

Details
=======
We profile with the following class that allows us to choose between
the :mod:`hotshot` profiler, the :mod:`cProfile` profiler, and our own
:mod:`mmf.utils.mmf_profile` profiler based on :mod:`hotshot` but including
analysis of the line-event data.

The following class does the dirty work:

.. literalinclude:: multigrid_profile.py
   :pyobject: Profile
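
For readers without access to the class above, a table with the same
ordering as the ones in this log can be produced with the standard
:mod:`cProfile` and :mod:`pstats` modules.  This is a minimal sketch,
not the `Profile` class: the function `work` is a hypothetical
workload, and the per-line percentages shown earlier are not produced
by this approach.  Note also that :mod:`hotshot` is no longer
available in Python 3.

.. code-block:: python

   import cProfile
   import io
   import pstats

   def work(n=200):
       # Hypothetical workload standing in for test_1.
       return sum(i * i for i in range(n))

   profiler = cProfile.Profile()
   profiler.enable()
   work()
   profiler.disable()

   # Sort by internal time, then call count, and keep at most 20 rows.
   stream = io.StringIO()
   stats = pstats.Stats(profiler, stream=stream)
   stats.sort_stats('time', 'calls').print_stats(20)
   report = stream.getvalue()
   print(report)

The `tottime` column of the report is the internal time and `cumtime`
includes time spent in callees, as in the listings above.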