Valgrind and kcachegrind, spot bottlenecks

I have written a few routines in fortran, optimisation is not really necessary but it always nice to have something running fast. Finding bottlenecks in the execution of your favorite home made software can be tricky.

The routine I wanted to improve is about structure factor calculations. It does one thing: get the reflections from a fcf file, other parameters from the cif file and calculate the structure factor. The first version was in python. The advantage in python, you don’t have to worry about the computing so you can focus on the science. On my test file (4000 reflections), it took 5 seconds. Not to bad for python although the code was very ugly and unmaintainable.

I then ported the all thing in fortran and get something about 4s. At this stage, I used valgrind to profile the code:
$ valgrind --tool=callgrind --dump-instr=yes \
--simulate-cache=yes --collect-jumps=yes \
/home/pascal/etudes/uu.nl/svn/Fcalc/trunk/Fcalc l0124a-4

Then, I use kcachegrind to read the profile. You can get such a map:

Kcachegrind gui

The bigger the area is, the longer it takes for the routine to execute. You can also get the number of time a routine is called. The routine named dhkl is used 751872 times which is not normal. It should be used 4000 times, the same as the number of reflections. Effectively, by looking at my code. This function was too deep in a nested loop with a couple of other functions! The execution time goes down to 2sec.

An other example, is the routine kronecker, it has been called 751872 times. This routine is used on the 3×3 matrix representing a symmetry operation. It’s not suppose to be used that much. I could save the result somewhere to avoid recalculation but it seems it does not improve anything…

However, there is still the big main loop over the strucfact routine. It is typically where openmp becomes handy. The portion of code is:

[codesyntax lang="fortran"]

do i=1,hklsize
fcalctable(i)=structfact(hkltable(1,i),hkltable(2,i),hkltable(3,i))
end do

[/codesyntax]

No complicated shared variables, competition, synchronisation…. each step is completely independent. You just need to rewrite this bit like this:
[codesyntax lang="fortran"]

!$OMP PARALLEL shared(hkltable, fcalctable) private(mythread,inparallel)
!$	  inparallel=OMP_in_parallel()
!$    mythread = OMP_GET_THREAD_NUM()
if(inparallel) then
print *, 'Loop parallelised, thread number: ',mythread
end if
!$OMP DO
do i=1,hklsize
fcalctable(i)=structfact(hkltable(1,i),hkltable(2,i),hkltable(3,i))
end do
!$OMP END DO
!$OMP END PARALLEL

[/codesyntax]

Compile your source code with the -fopenmp flag and on my dual core processor, I have an expected speedup of two. The execution time is now below 1sec.

The new profile is:

kcachegrind result after optimisation

One thought on “Valgrind and kcachegrind, spot bottlenecks