This post is the continuation of the previous one on: CPU starvation

I needed to rewrite a part of the code and it was a good opportunity to change the algorithm of the variance calculation. I was using the naive implementation: the first pass is the calculation of the mean, the second pass the calculation of the variance. Because I cannot store the data, they were calculated twice. Yes, doing twice more Fourier transforms is not efficient…

I found an online algorithm that can gives me both the mean and variance at the same in one pass (online variance calculation). The result is of course a big improvement but the program was still suffering from heavy starvation:

if(locali/=1) then delta=fftinout-average/(locali-1) else delta=fftinout end if average=average+fftinout sigmas = sigmas + delta*(fftinout - average/locali)

The problem here is that data are reused and do not stay in cache. delta, fftinout, average and sigmas are big 3D-arrays. When delta is computed, it does not fit into the cache, fftinout is send away to make room which is just used a line below and have to be pulled back from RAM…

What I did is doing some manual tiling. Usually the compiler do it automatically but in this case, gfortran was not smart enough even with -floop-block -floop-strip-mine. The result is that each stride is staying in cache which avoid moving data back and forth:

do kk=1,ubound(fftinout,3) do jj=1,ubound(fftinout,2) delta=fftinout(:,jj,kk)-tempfact*average(:,jj,kk) average(:,jj,kk)=average(:,jj,kk)+fftinout(:,jj,kk) sigmas(:,jj,kk) = sigmas(:,jj,kk) +delta*& & (fftinout(:,jj,kk) - average(:,jj,kk)/locali) end do end do

The result is amazing, I had a 20% speedup just with this change. You just need to be smarter than the compiler. And I save a bit of memory, delta is just a stride now.

New result on my 2.7GHz, 4 cores computer with 4GB@800MHz memory:

cores | real(s) | user(s) | loop(ms) |

4 | 64 | 233 | 19(76) |

3 | 72 | 204 | 21.7(65.1) |

2 | 93 | 180 | 28.7(57.4) |

1 | 171 | 171 | 55 |

Previous results were for half the work, I was skipping the average. The current code is now faster than half the job in the previous version.