CPU starvation

A few weeks ago, I discovered that one of my programs has some CPU starvation issues. A CPU is starved when it sits waiting for data. The most common cause is high load on the disk, but it can also happen that the RAM is too slow. As I am using only a few MB on a 4GB system, disk swapping could not be the cause here.

My first hint was that the computing speed was not scaling up with the number of cores used in the newest version of the program. An old version did not have this problem; I even tested it on 48 cores. But the new version is much more efficient. Here are the results of the program running on different numbers of cores.

Cores: number of cores used
real: real clock time in seconds
user: user clock time in seconds
loop: average time in ms to execute one cycle of a do loop in my program. In parentheses, the one-core-equivalent loop time (loop time multiplied by the number of cores)

cores  real  user  loop
4      70    246   22 (88)
3      77    207   24.3 (72.9)
2      90    171   28.3 (56.6)
1      146   146   47.3

Of course these results alone are not enough; problems with parallel execution can have many sources. These results were obtained on my home computer, an Intel Q9505 (2800MHz) CPU equipped with 667MHz DDR2 memory (dual channel).

I have a similar computer at work: an Intel Q9400 (2666MHz) CPU equipped with 800MHz DDR2 memory (dual channel). The fact is, my code runs slower at home (i.e. with the faster CPU, but the slower memory) than at work. This was a clear indication that my CPU was not running at full speed.

After searching a bit, I found two tools that would help me: perf and oprofile.

Here is the result of the command # perf top:

As already noted in previous posts (http://blog.debroglie.net/2011/07/26/code-optimisation/), complex exponentials are quite numerous in my code and show up here as sincos; they represent 6.7% of the time. The clear_page_c function is more worrying: it is a kernel function related to memory management. A web search did not reveal much more information.

The command # perf stat -p 425 -v -d gives more detailed information about the program's execution (425 was the PID of the program).

It revealed that the LLC-load-misses rate is 9.89%. This means that nearly 10% of the loads reaching the last-level cache (the 6MB L2 on this CPU) miss, forcing the CPU to wait for the data to come from RAM. I also ran perf with the program running on different numbers of cores (OpenMP has an environment variable, OMP_NUM_THREADS, to control this). In the case above, all 4 cores of my CPU were used. The missed-cache rate increased with the number of cores: the workload was exactly the same, but with several cores doing different jobs simultaneously, the stress on the memory became greater and greater. It would be interesting to run the same test on the same architecture but with a Q9550 CPU, which has 12MB of L2 cache instead of the 6MB of the Q9505.

The next tool is oprofile. Similar to perf, it is based on recent kernel performance-tracing features. It comes with a nice GUI where you can choose the events to follow; I chose the "LLC_*" events. The GUI is launched via # oprof_start.

The opreport command gives you the results, which are honestly quite difficult to interpret. But I still managed to see that two "functions" get cache misses: the main program edensgrid (I really should put the CPU-intensive part in its own subroutine…) and libfftw3f.

I fed the results into kcachegrind using the command $ opreport -gdf | op2calltree. It creates a bunch of files, each of which can be loaded into kcachegrind. I loaded the one I was interested in: oprof.out.edensgrid. No need to look at the fftw3 one: this library is highly optimised, and there is no room for improvement there. When the program is compiled with debugging symbols and the source code is available, kcachegrind will link the results to the source code.

The locations of the cache misses are the line sigmas=sigmas+(fftinout-datadiff)**2 and, for a minor part, the initialisation of the array, fftinout=0.0_fftkind, together with the inversion fftinout=-fftinout.

As for the solution, that is another story… I have no idea at the moment. I can probably work out a solution for the fftinout assignments by tweaking the fillfftarray subroutine: I guess I can remove the zero initialisation and the inversion. The sigmas business would need a better understanding of the CPU internals to find a better way to calculate it, if a solution exists.

A note about oprofile and perf vs valgrind:
I think that valgrind (cachegrind) can also give this kind of information, but as everything is simulated, it is slow, and I am not sure OpenMP code goes well with it. oprofile and perf have very little overhead: they just sample the program while it runs at full speed, without artefacts.