This post is the continuation of the previous one on: CPU starvation
I needed to rewrite a part of the code and it was a good opportunity to change the algorithm of the variance calculation. I was using the naive implementation: the first pass is the calculation of the mean, the second pass the calculation of the variance. Because I cannot store the data, they were calculated twice. Yes, doing twice more Fourier transforms is not efficient…
I found an online algorithm that can gives me both the mean and variance at the same in one pass (online variance calculation). The result is of course a big improvement but the program was still suffering from heavy starvation:
if(locali/=1) then delta=fftinout-average/(locali-1) else delta=fftinout end if average=average+fftinout sigmas = sigmas + delta*(fftinout - average/locali)
The problem here is that data are reused and do not stay in cache. delta, fftinout, average and sigmas are big 3D-arrays. When delta is computed, it does not fit into the cache, fftinout is send away to make room which is just used a line below and have to be pulled back from RAM…
What I did is doing some manual tiling. Usually the compiler do it automatically but in this case, gfortran was not smart enough even with -floop-block -floop-strip-mine. The result is that each stride is staying in cache which avoid moving data back and forth:
do kk=1,ubound(fftinout,3) do jj=1,ubound(fftinout,2) delta=fftinout(:,jj,kk)-tempfact*average(:,jj,kk) average(:,jj,kk)=average(:,jj,kk)+fftinout(:,jj,kk) sigmas(:,jj,kk) = sigmas(:,jj,kk) +delta*& & (fftinout(:,jj,kk) - average(:,jj,kk)/locali) end do end do
The result is amazing, I had a 20% speedup just with this change. You just need to be smarter than the compiler. And I save a bit of memory, delta is just a stride now.
New result on my 2.7GHz, 4 cores computer with 4GB@800MHz memory:
Previous results were for half the work, I was skipping the average. The current code is now faster than half the job in the previous version.