Jan
20

Autovectorization

Vectorization is a nice concept in optimisation. It allows simulaneous calculations using special units in the processor. It is working only on special cases. I tried to apply to structure factors calculations. Fortran does not allow you to write vector code as you would do in C. You can only rely on autovectorization from the compiler by exposing known patterns.

Actually, each structure factor is calculated one by one by each thread. I modified the function to process them by batch allowing the compiler to do a better job.

  1. Current version. 3 threads, structure factors calculated one by one.

[pascal@vinci ediff]$ perf stat -v -d  ../../trunk-r607/edensgrid --nopeaks  xu5015_shelxl
 Calculating structure factors
100.00 % done, remaining:   0:00min, elapsed:   0:01min, rate:   13.3 us/loop
101507 reflections processed in   1.4 s (201 atoms)
 Performance counter stats for '../../trunk-r607/edensgrid --nopeaks xu5015_shelxl':
 
       7210,750407 task-clock                #    1,664 CPUs utilized
               926 context-switches          #    0,000 M/sec
                18 CPU-migrations            #    0,000 M/sec
             7 404 page-faults               #    0,001 M/sec
    20 294 956 191 cycles                    #    2,815 GHz                     [24,85%]
      stalled-cycles-frontend
      stalled-cycles-backend
    27 473 509 707 instructions              #    1,35  insns per cycle         [37,42%]
     4 759 056 611 branches                  #  659,995 M/sec                   [37,49%]
        96 886 241 branch-misses             #    2,04% of all branches         [37,76%]
     8 209 247 003 L1-dcache-loads           # 1138,473 M/sec                   [25,22%]
       166 461 565 L1-dcache-load-misses     #    2,03% of all L1-dcache hits   [25,24%]
        88 992 738 LLC-loads                 #   12,342 M/sec                   [25,09%]
           710 262 LLC-load-misses           #    0,80% of all LL-cache hits    [24,87%]
 
       4,333275750 seconds time elapsed

  1. New version. 3 threads, structure factors calculated by batches of 32 of them.

[pascal@vinci ediff]$ perf stat -v -d  ../../trunk/edensgrid --nopeaks  xu5015_shelxl
 Calculating structure factors
100.00 % done, remaining:   0:00min, elapsed:   0:01min, rate:   11.1 us/loop
101507 reflections processed in   1.1 s (201 atoms)
        3.6175E+07 reflections*atoms*s^-1
 
 Performance counter stats for '../../trunk/edensgrid --nopeaks xu5015_shelxl':
 
       6510,431679 task-clock                #    1,584 CPUs utilized
               904 context-switches          #    0,000 M/sec
                20 CPU-migrations            #    0,000 M/sec
             6 750 page-faults               #    0,001 M/sec
    18 147 024 815 cycles                    #    2,787 GHz                     [25,20%]
      stalled-cycles-frontend
      stalled-cycles-backend
    26 969 494 651 instructions              #    1,49  insns per cycle         [37,78%]
     4 751 540 537 branches                  #  729,835 M/sec                   [37,76%]
        76 868 617 branch-misses             #    1,62% of all branches         [37,75%]
     7 161 160 309 L1-dcache-loads           # 1099,952 M/sec                   [25,11%]
        36 037 316 L1-dcache-load-misses     #    0,50% of all L1-dcache hits   [25,07%]
        16 708 632 LLC-loads                 #    2,566 M/sec                   [24,88%]
           568 664 LLC-load-misses           #    3,40% of all LL-cache hits    [24,99%]
 
       4,110610561 seconds time elapsed

The result is an increase of around 15-25% in speed. L1 data cache misses dropped a lot and I have more instructions per cycle executed. The pressure is pushed back on the L2 cache. Not all portions have been vectorized as I would have expect, some more insight using the flag -ftree-vectorizer-verbose will help.

Best results have been obtained with batches of 16 or 32 structure factors.

Result from the compiler analysis confirmed that more parts have been vectorized. However, the same code compiled without vectorization is still 10% faster. Meaning that a bit more is going on.

Before:

modules/functions.f90:922: note: LOOP VECTORIZED.
modules/functions.f90:922: note: LOOP VECTORIZED.
modules/functions.f90:905: note: vectorized 2 loops in function.

After:

modules/functions.f90:852: note: LOOP VECTORIZED.
modules/functions.f90:842: note: LOOP VECTORIZED.
modules/functions.f90:834: note: LOOP VECTORIZED.
modules/functions.f90:839: note: LOOP VECTORIZED.
modules/functions.f90:825: note: LOOP VECTORIZED.
modules/functions.f90:827: note: LOOP VECTORIZED.
modules/functions.f90:782: note: vectorized 6 loops in function.
 
edensgrid.F90:226: note: LOOP VECTORIZED.
edensgrid.F90:226: note: LOOP VECTORIZED.
edensgrid.F90:207: note: vectorized 2 loops in function.

I am now able to process nearly 50 millions reflections*atoms per second. On a structure with about 200 atoms that’s more than 100k reflections per seconds. The results I obtained are very close to the ones obtained by Vincent Favre-Nicolin (pdf) but I am using a much more complicated formula suitable for small molecule crystallography.

I don’t know if anyone would need such a fast calculation but they can always get in touch with me.

Permanent link to this article: http://blog.debroglie.net/2012/01/20/autovectorizatio/

Jan
04

Platon nightly bug

A bug happened on the last December release of platon-nightly which was dated from 2012 instead of 2011. I forced the update (I increase the epoch tag) for the January release so that it will updated as usual.

Permanent link to this article: http://blog.debroglie.net/2012/01/04/platon-nightly-bug/

Dec
13

Social integration – comments

I changed a few things in the settings of the blog. You have to be logged in to comment. No worry, you don’t need to create an account on Debroglie you can just login via facebook, twitter, google or linkedin. You will see an icon on the login page.

Permanent link to this article: http://blog.debroglie.net/2011/12/13/social-integration-comments/

Dec
10

Plotting libraries

I have edited this post on the 14th of December 2011. I added a few things in matplolib and plplot section.

There are numerous choices to plot diagrams. Depending of the functionality you need and the language you are developing in, the choice might be difficult.

I have found six possible libraries: gnuplot, matplotlib, pgplot, plplot, dislin and R. Each of them have different requirements. Matplotlib is dedicated to python while plplot have numerous bindings. Their capabilities are also different, gnuplot does not seem to superimpose graphs very easily. Licensing is also important dislin and pgplot have the most restrictive license: it is free to use only for non-commercial application. In each libraries, except R and dislin, I tried to draw a Fourier Map (picture) superimposed by a contour plot.

My data are stored in a 2D array. It seems that this cause some troubles to some libraries.

gnuplot

I could not find how to superimpose a contour plot. I only get an image. Also, I don’t know if I can apply a transformation to the coordinates in order to map the array indices to real coordinates.

matplotlib

Matplotlib can only be used in python. The combination with python makes it very easy to try quickly a few thing. The documentation is well written although not very consistent in its conventions.

#!/bin/python
 
import numpy
import matplotlib.pyplot as plt
from pylab import *
 
data=numpy.loadtxt("data.matrix", dtype=float32)
cdict = {
'red'  :  ((0., 1., 1.), (0.5, 1.0, 1.0), (1., 0.4, 0.4)),
'green'((0., 0.4, 0.4), (0.5, 1.0, 1.0), (1., 1., 1.)),
'blue'((0., 0.4, 0.4), (0.5, 1.0, 1.0), (1., 0.4, 0.4))
}
my_cmap = matplotlib.colors.LinearSegmentedColormap('my_colormap', cdict, 1024)
 
imshow(data[0:577,:], cmap=my_cmap)
contour(data[0:577,:],  colors='#00aa00', levels=(0.1,0.2,0.3,0.4,0.5))
contour(data[0:577,:],  colors='blue', levels=[0.0])
contour(data[0:577,:],  colors='#aa0000', linestyles='dashed', levels=(-0.1,-0.2,-0.3,-0.4,-0.5))
show()

pgplot

PGPLOT is used with low level languages Fortran and C. It’s an advantage or a disadvantage depending on what you want. My program is written in Fortran, so it’s good. However it makes the programming more difficult. The two main disadvantages are the license which I am not sure is compatible with the GPL and updates: the last version was released in 2001. I manage to get exactly what I wanted with this library, the documentation is really good. The only problem I have in about the layout: the resolution is limited and hard-coded, the placement of objects is troublesome.

plplot

On the paper, this one my favourite. It’s LGPL, there are bindings for all kinds of languages including Fortran and python, it is actively developed with several commits every month and there are a large choices of output formats. I have two major problems: the documentation is really not sufficient and there are bugs (contour plot bug). However, with a few tricks I managed to get a working example.

There is also a bug in the linear gradient function in RGB. The workaround is to used HLS space.
There is no function to draw a circle, I used a polygon instead.

Although plplot looks promising, there are serious issues to fix.

Permanent link to this article: http://blog.debroglie.net/2011/12/10/plotting-libraries/

Oct
28

Loop tiling

This post is the continuation of the previous one on: CPU starvation

I needed to rewrite a part of the code and it was a good opportunity to change the algorithm of the variance calculation. I was using the naive implementation: the first pass is the calculation of the mean, the second pass the calculation of the variance. Because I cannot store the data, they were calculated twice. Yes, doing twice more Fourier transforms is not efficient…

I found an online algorithm that can gives me both the mean and variance at the same in one pass (online variance calculation). The result is of course a big improvement but the program was still suffering from heavy starvation:

if(locali/=1) then
    delta=fftinout-average/(locali-1)
else
    delta=fftinout
end if
average=average+fftinout
sigmas = sigmas + delta*(fftinout - average/locali)

The problem here is that data are reused and do not stay in cache. delta, fftinout, average and sigmas are big 3D-arrays. When delta is computed, it does not fit into the cache, fftinout is send away to make room which is just used a line below and have to be pulled back from RAM…

What I did is doing some manual tiling. Usually the compiler do it automatically but in this case, gfortran was not smart enough even with -floop-block -floop-strip-mine. The result is that each stride is staying in cache which avoid moving data back and forth:

do kk=1,ubound(fftinout,3)
    do jj=1,ubound(fftinout,2)
        delta=fftinout(:,jj,kk)-tempfact*average(:,jj,kk)
        average(:,jj,kk)=average(:,jj,kk)+fftinout(:,jj,kk)
        sigmas(:,jj,kk) = sigmas(:,jj,kk) +delta*&
        &   (fftinout(:,jj,kk) - average(:,jj,kk)/locali)
    end do
end do

The result is amazing, I had a 20% speedup just with this change. You just need to be smarter than the compiler. And I save a bit of memory, delta is just a stride now.

New result on my 2.7GHz, 4 cores computer with 4GB@800MHz memory:

cores real(s) user(s) loop(ms)
4 64 233 19(76)
3 72 204 21.7(65.1)
2 93 180 28.7(57.4)
1 171 171 55

Previous results were for half the work, I was skipping the average. The current code is now faster than half the job in the previous version.

Permanent link to this article: http://blog.debroglie.net/2011/10/28/loop-tiling/

Oct
25

CPU starvation

A few weeks ago, I discovered that one of my programs has some cpu starvation issues. A cpu is starved when is waiting for the data. The most common cases are due high load on the disk but it can happen that even the RAM can be too slow. As I am just using a few MB on a 4GB system, that could not be the case.

My first hint was that the computing speed was not scaling up with the number of cores used on the newest version of the program. An old version did not have these problem. I even tested on 48 cores. But the new version version is much more efficient. Here is the results of the program running on a different numbers of cores.

Cores: number of cores used
real: real clock time in seconds
user: user clock time in seconds
loop: average time to execute a cycle of a do loop in my program in ms. In parenthesis, the one core equivalent loop time

cores real user loop
4 70 246 22(88)
3 77 207 24.3(72.9)
2 90 171 28.3(56.6)
1 146 146 47.3

Of course these results alone are not enough, problems from parallelism execution can have many sources. These results were obtained (home computer) on a intel q9505 (2800MHz) cpu equipped with 667MHz DDR2 memory (dual channel).

I have a similar computer at work: a intel q9400 (2666MHz) cpu equipped with 800MHz DDR2 memory (dual channel). The fact is my code is running slower at home (ie with the faster cpu) than at work. This was clearly the indication that my cpu was not running at full speed.

After searching a bit, I found two tools that would help me:

Here is the result of the command # perf top:

As already noted in previous posts (http://blog.debroglie.net/2011/07/26/code-optimisation/), complex exponentials are quite numerous and can be seen here as sincos. They represent 6.7% of the time. The clear_page_c function is more worrying, it is a kernel function related to the control of memory. A web search did not reveal any more information.

The command # perf stat -p 425 -v -d is given more detailed informations about the program execution (425 was the pid of the program).

It revealed that the LLC-load-misses is 9.89%. It means that 10% of the time, the cpu won’t find any data in the L2 cache forcing him to wait for the data to come from the ram memory. I also run perf when the program is running on a different number of cores. In the case above, the 4 cores on my cpu were used. The missed cache rate was increasing with the number of cores (openmp has an environment variable: OMP_NUM_THREADS to do this). The workload was exactly the same but by doing simultaneously different jobs, the stress on the memory got more and more important. It would be interesting to run the same test on the same architecture but with a q9500 cpu. It has 12MB of L2 cache instead of 6MB for the q9505.

The next tool is oprofile. Similar to perf, it is based on recent kernel features about tracing performance. It comes with a nice gui where you can choose the parameter to follow, I choose the “LLc_*” events. The gui is launched via # oprof_start gui.

The opreport command gives you the results, honestly quite difficult to interpret. But I still manage to see that two “functions” got missed cache: the main program edensgrid (I really should put the cpu intensive part in its own subroutine…) and libfftw3f.

I feed the result into kcachegrind using the command: $ opreport -gdf | op2calltree. It creates a bunch of files, each one can loaded into kcachegrind. I loaded the one I was interested in: oprof.out.edensgrid. No need to look to the fftw3 one, this library is highly optimised, there is no room to improvement here. When compiled with the debugging symbols and with the source code available, kcachegrind will link the results to the source code.

The locations of missed cache are: sigmas=sigmas+(fftinout-datadiff)**2 and the initialisation of the array for a minor part: fftinout=0.0_fftkind with the inversion fftinout=-fftinout.

About the solution, it’s another story… I have at the moment no idea. I can probably work out a solution for the fftinout assignation by tweaking the fillfftarray subroutine. I guess I can remove the zero initalisation and the inversion. The sigmas business would need a better understanding of the cpu internals to find a better way to calculate if a solution exist.

A note about oprofile and perf vs valgrind:
I think that valgrind can also give this kind of information but as everything is virtualised, it’s slow and I am not sure if openmp code goes well with it. oprofile and perf have very little overhead and just probe the program while it’s running at full speed without any artefacts.

Permanent link to this article: http://blog.debroglie.net/2011/10/25/cpu-starvation/

Oct
19

Need for a position

Again, I am looking for a new position starting 2012.

So if you are looking for a crystallographer, fortran/python writer and a linux guru. I am free. The location does not really matter as long as it’s an English or French speaking country. I can manage with spanish/italian though.

You can contact me at: pascal22p@parois.net

Permanent link to this article: http://blog.debroglie.net/2011/10/19/need-for-a-position/

Sep
20

Platon-nightly

The automatic build is broken since a few days, a yum update broke a few things. It should be fixed by now.

Permanent link to this article: http://blog.debroglie.net/2011/09/20/platon-nightly-2/

Sep
20

tonto-chem update

New rpms of tonto-chem for centos 6, fedora 13-15. Centos 5 is missing, I have some troubles.

There are builds for the serial version and mpi versions using openmpi, mpich2 or mvapich2. All versions can be installed simulatenously and a bash script has been added to launch the software loading the correct modules.

Permanent link to this article: http://blog.debroglie.net/2011/09/20/tonto-chem-update/

Aug
19

Updates

I updated Fox and drawxtl today for fedora 13, 14, 15 and centos 5, 6.

For platon, use the rpm from debroglie nightly.

Update: Vincent gave me the solution for centos 5.

Permanent link to this article: http://blog.debroglie.net/2011/08/19/updates-2/

Older posts «

Performance Optimization WordPress Plugins by W3 EDGE