Speed up code with SIMD instructions

I previously talked the use of optimised trigonometric function through the use of the acml library. Trigonometric functions are coded in hardware through the x87 extension but it works on 80bits which is not necessary all the time. So software implementation can be a lot faster.

Two major problems are present with the acml vector library I am actually using:

  • it only works on 64bit
  • It’s not open source

I was looking for an alternative and found the sse_mathfun library in c. It’s written by Julien Pommier. However only single precision routines were present. But recently I found sleef libray (Naoki Shibata). It contains routines in both single and double precision. And on top of it, it is also optimised for the new avx instructions in recent CPUs.

These optimised libraries heavily exploit SIMD instructions (Single Instruction Multiple Data). Certain operations can be done simultaneously on a certain number of data. For example, using SSE2, registers are 128bits wide it means that you can work 4 single precision or 2 double precision numbers simultaneously. The new avx registers are 256bits so 8 single precision or 4 double precision numbers. In theory you can have a speed up of 8, 4 or 2 depending on the precision and the registers used.

Of course, because these instructions are working on vectors, it is necessary to write compatible code to exploit vector operations on their full length, i.e. the code need to be parallelised similarly to openmp.

Application to my structure factors library

The first obvious problem is that in Fortran it is not possible to access such low level functions. It is necessary to do it in C. Fortunately, the ISO_C_BINDING module provides access to named constants that represent kind type parameters of data representations compatible with C types. On the fortran side, I then need to write compatible interfaces with the c code. However, the functions from the sleef library operate on vectors, not directly accessible from fortran. More over, the length is fixed. So, it is necessary to write a wrapper in c compatible with fortran which bind the low level trigonometric functions and the fortran code.

I chose to write c functions accepting an array of any length and returning an array of the same length with the math function applied.

As an example, for the exponential function I have on the fortran side the following interface (I placed it in a module for easier re-use):

module sse_wrapper
use, intrinsic :: ISO_C_BINDING
  subroutine sse_exp_array(n, input, output) bind(c)
    use, intrinsic :: ISO_C_BINDING
    implicit none
    real(kind=c_double), dimension(*), intent(in) :: input
    real(kind=c_double), dimension(*), intent(out) :: output
    integer(c_int), value, intent(in) :: n
  end subroutine
end interface
end module

The corresponding function in c is:

#include "sleefsseavx.h"
void sse_exp_array(int inputsize, double input[] , double output[])
    vdouble sse_values __attribute__((aligned (16)));
    unsigned int i, j, chunk;
    double temp[VECTLENDP] __attribute__((aligned (16)));
    chunk = inputsize / VECTLENDP;
    if ((chunk) != 0)
        for (i = 0; i < chunk; i++)
            sse_values = vloadu(&input[VECTLENDP*i]);
            vstoreu(&output[VECTLENDP*i], xexp(sse_values));
    if ((inputsize % VECTLENDP) != 0)
        for (i = inputsize - inputsize % VECTLENDP; i < inputsize; i++)
        sse_values = vloadu(&temp[0]);
        vstoreu(&temp[0], xexp(sse_values));
        for (i = inputsize - inputsize % VECTLENDP; i < inputsize; i++)

The code above is working as a glue between the code in fortran and the exponential function from the sleef library (xexp).

I have not run any benchmark but it is much faster than calling directly exp(values) in fortran.

Also alignmement can be important. when data are aligned on 16bytes for sse2 and 32bytes for avx, perfomance can be much higher. The probleme is that in fortran there is no way force a certain alignment. It seems that on my 64bit linux, gfortran seems to align allocatable arrays on 16bytes boundaries. In the example above, unaligned version of the sse functions are used.

Not everything need to be manually vectorised, many simple patterns can be vectorized automatically by the compiler (-ftree-vectorize with gcc suite) especially whole array operations. The most common exceptions are the mathematical functions (cos, exp…) and the add-multiply operation (a=a+y*b). The future avx2 will add fused add/multiply operation for the second case which could be supported one day in the compiler.

The add-multiply operation is very common in linear algebra. It is part of lapack where optimised library exist, so it can be used instead of writting a vector version yourself.


The sincos function in single precision is missing in sleef.
Fixed in sleef 2.31

Leave a Reply