As part of a study on the standard deviation of the electron density using Monte Carlo simulations, I needed a fast way to calculate structure factors. I achieved this goal a while ago, but I have kept improving things whenever a new idea comes up.

I already talked about it a few times:

My algorithm is no longer memory bound and the CPU is not starving anymore. Since I am calculating reflections in batches, memory access is more efficient. This is a nice side effect of the modifications I have made. I checked it by running the program at two different CPU frequencies and with different numbers of cores; it all scales up nicely. I tried up to 4 threads, the limit of my current CPU.

At the moment I am reaching 60-75M reflections × atoms per second per core. On a 200-atom structure with 100k reflections, the result comes in less than 600 ms.
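For context, the quantity being computed is the standard structure factor sum F(h) = Σ_j f_j · exp(2πi h·x_j). Here is a minimal NumPy sketch of the batched evaluation; the function name, the batch handling and the constant-per-atom scattering factors are my simplifications, not the library's actual interface:

```python
import numpy as np

def structure_factors(hkl, xyz, f, batch=1024):
    """F(h) = sum_j f_j * exp(2*pi*i * h.x_j), evaluated batch-wise.

    hkl : (Nrefl, 3) Miller indices
    xyz : (Natom, 3) fractional coordinates
    f   : (Natom,)  scattering factors (taken constant per atom here)
    """
    out = np.empty(len(hkl), dtype=complex)
    for start in range(0, len(hkl), batch):
        h = hkl[start:start + batch]            # one batch of reflections
        phase = 2.0 * np.pi * (h @ xyz.T)       # (B, Natom) phase matrix
        out[start:start + batch] = (f * np.exp(1j * phase)).sum(axis=1)
    return out
```

With a single atom at the origin every reflection reduces to F = f, which makes a handy sanity check.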

The Valgrind profile is shown below.

Compared to cctbx it is almost 4 times faster. However, cctbx is much more versatile, which is probably why it is slower. This efficiency is only possible with optimised trigonometric and BLAS functions; without them, the calculation is a bit slower than cctbx. Since BLAS is only efficient on big vectors and matrices, I had to increase the number of reflections processed per batch from 128 to 1024.

The interesting thing about this batch processing with BLAS functions is that it should translate well to a GPU implementation. However, Fortran bindings are not widely available yet, and BLAS functions might not be as efficient there as on the CPU. The number of reflections processed simultaneously would probably need to be increased further. In the end it should only pay off with several hundred thousand reflections, a range found mainly in protein crystallography.

It turns out this could be useful to others, so I am in the process of releasing just this part as a library to let more people play with it. The high performance relies on the AMD Core Math Library, both for the vectorized trigonometric function (vrsa_sincos – 64-bit only!) and the BLAS functions (mainly SAXPY). It should be possible to use MKL as well.
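To illustrate how the sincos and SAXPY pieces fit together, here is a rough Python sketch using SciPy's BLAS wrappers as a stand-in for ACML (the function and the per-atom loop are illustrative, not the library's actual code): for each atom, a vectorized sincos produces the cosines and sines over the whole reflection batch, and two SAXPY calls accumulate them into the real and imaginary parts.

```python
import numpy as np
from scipy.linalg.blas import saxpy

def sf_batch_saxpy(hkl, xyz, f):
    """Accumulate one batch of reflections atom by atom with SAXPY.

    A += f_j * cos(phase_j) and B += f_j * sin(phase_j), where phase_j
    holds the phases of atom j over the whole reflection batch.
    """
    n = len(hkl)
    A = np.zeros(n, dtype=np.float32)  # real-part accumulator
    B = np.zeros(n, dtype=np.float32)  # imaginary-part accumulator
    for j in range(xyz.shape[0]):
        phase = (2.0 * np.pi * (hkl @ xyz[j])).astype(np.float32)
        c, s = np.cos(phase), np.sin(phase)  # stand-in for vrsa_sincos
        A = saxpy(c, A, a=float(f[j]))       # A += f_j * c
        B = saxpy(s, B, a=float(f[j]))       # B += f_j * s
    return A + 1j * B
```

The larger the batch, the more work each SAXPY call does per invocation, which is why growing the batch size from 128 to 1024 helps.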

For now the interface is very crude and uses several arrays. Object-oriented data has not been implemented: the data would no longer be contiguous, which could hurt efficiency. It is difficult to say whether the difference would be significant.

The project is here:

libFc repository

The code provides both single precision and double precision routines, accessible via generic interfaces. Conditional compilation makes it possible to select the vector math library and to switch to plain BLAS functions.
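Python has neither generic interfaces nor conditional compilation, but the idea can be sketched as two thin precision-specific wrappers around one parameterized implementation (hypothetical names, not the actual libFc interface):

```python
import numpy as np

def _sf(hkl, xyz, f, real_t):
    """One implementation, parameterized by the working precision."""
    phase = 2.0 * np.pi * (np.asarray(hkl, real_t) @ np.asarray(xyz, real_t).T)
    return (np.asarray(f, real_t) * np.exp(1j * phase)).sum(axis=1)

def sf_single(hkl, xyz, f):
    """Plays the role of the single precision (REAL(4)) specific routine."""
    return _sf(hkl, xyz, f, np.float32)

def sf_double(hkl, xyz, f):
    """Plays the role of the double precision (REAL(8)) specific routine."""
    return _sf(hkl, xyz, f, np.float64)
```

In Fortran the dispatch is resolved at compile time by the generic interface, so callers use one name for both precisions.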