Loop tiling

This post is the continuation of the previous one on: CPU starvation

I needed to rewrite a part of the code and it was a good opportunity to change the algorithm of the variance calculation. I was using the naive implementation: the first pass is the calculation of the mean, the second pass the calculation of the variance. Because I cannot store the data, they were calculated twice. Yes, doing twice more Fourier transforms is not efficient…

I found an online algorithm that can gives me both the mean and variance at the same in one pass (online variance calculation). The result is of course a big improvement but the program was still suffering from heavy starvation:

if(locali/=1) then
    delta=fftinout-average/(locali-1)
else
    delta=fftinout
end if
average=average+fftinout
sigmas = sigmas + delta*(fftinout - average/locali)

The problem here is that data are reused and do not stay in cache. delta, fftinout, average and sigmas are big 3D-arrays. When delta is computed, it does not fit into the cache, fftinout is send away to make room which is just used a line below and have to be pulled back from RAM…

What I did is doing some manual tiling. Usually the compiler do it automatically but in this case, gfortran was not smart enough even with -floop-block -floop-strip-mine. The result is that each stride is staying in cache which avoid moving data back and forth:

do kk=1,ubound(fftinout,3)
    do jj=1,ubound(fftinout,2)
        delta=fftinout(:,jj,kk)-tempfact*average(:,jj,kk)
        average(:,jj,kk)=average(:,jj,kk)+fftinout(:,jj,kk)
        sigmas(:,jj,kk) = sigmas(:,jj,kk) +delta*&
        &   (fftinout(:,jj,kk) - average(:,jj,kk)/locali)
    end do
end do

The result is amazing, I had a 20% speedup just with this change. You just need to be smarter than the compiler. And I save a bit of memory, delta is just a stride now.

New result on my 2.7GHz, 4 cores computer with 4GB@800MHz memory:

cores real(s) user(s) loop(ms)
4 64 233 19(76)
3 72 204 21.7(65.1)
2 93 180 28.7(57.4)
1 171 171 55

Previous results were for half the work, I was skipping the average. The current code is now faster than half the job in the previous version.

Comments are closed.