Help vectorization of the code

The gfortran compiler is able to automatically vectorize parts of the code. The flag “-ftree-vectorize” enables auto-vectorization, and diagnostics are printed at compile time with the flag “-ftree-vectorizer-verbose=x” (where x is a number from 1 to 9). It is then possible to manually vectorize the parts where the compiler failed, or to rewrite the code in a form that the compiler can understand.

It is also possible to look at the assembly output to see whether SIMD instructions are used. The assembly output is generated with the flag “-S” during compilation.

As an example, let’s look at the line (Fortran):

tempc=cosvalues*cacheatomicfactorscomplex(hklindex:hklindex+n-1, uniqueatomindex(j))

gfortran reports the line as not vectorized:

1377: not vectorized: not suitable for gather D.26824_1220 = REALPART_EXPR <*cacheatomicfactorscomplex.0_164[D.14002_886]>;

Apparently the use of strides here is not compatible with vectorization. Remark: by default the array cacheatomicfactorscomplex is 16-byte aligned, but the strided elements are not!
Update, 19th of April 2013: the use of strides is fine as long as the accessed elements are contiguous in memory. Because the variable is of complex type, this is not the case for the real part of the array: real and imaginary parts are interleaved, so the real parts sit at a stride of two. The solution I used was to work on real arrays, splitting the complex components into two separate arrays.

The assembly output is:

        .loc 1 1377 0 is_stmt 0 discriminator 2
        movss   -4(%r13,%rax,4), %xmm0  # MEM[base: cosvalues_212, index: D.27776_1618, step: 4, offset: -4B], D.13998
        addq    $1, %rax        #, S.891
        movss   4(%rcx), %xmm2  # MEM[base: D.27777_1610, offset: 4B], D.26827
        mulss   %xmm0, %xmm2    # D.13998, D.26827
        mulss   (%rcx), %xmm0   # MEM[base: D.27777_1610, offset: 0B], tmp1403
        addq    %rsi, %rcx      # D.27770, ivtmp.6414
        movss   %xmm2, 4(%rdx)  # D.26827, MEM[base: D.27779_1517, offset: 4B]
        movss   %xmm0, (%rdx)   # tmp1403, MEM[base: D.27779_1517, offset: 0B]
        addq    $8, %rdx        #, ivtmp.6417
        cmpq    %rax, %rbp      # S.891, prologue_after_cost_adjust.5995
        jge     .L1561  #,

The first remark is that although the compiler reports the loop as not vectorized, xmm? registers (128-bit SSE2 registers) are used. The reason is that these registers can also be used in scalar mode, i.e. operating only on the first element. These instructions can be recognised by the letter “s” in the penultimate position: movss, mulss…

A vectorized loop should look like this:

       .loc 1 1395 0 discriminator 2
        mulps   %xmm3, %xmm0    # tmp1456, vect_var_.5919
        mulps   %xmm3, %xmm2    # tmp1456, vect_var_.5918
        movaps  %xmm2, %xmm3    # vect_var_.5918, tmp1461
        unpckhps        %xmm0, %xmm2    # vect_var_.5919, tmp1462
        unpcklps        %xmm0, %xmm3    # vect_var_.5919, tmp1461
        movlps  %xmm2, (%rax)   # tmp1462, MEM[base: D.27734_1010, offset: 0B]
        movlps  %xmm3, -16(%rax)        # tmp1461, MEM[base: D.27734_1010, offset: -16B]
        movhps  %xmm3, -8(%rax) # tmp1461, MEM[base: D.27734_1010, offset: -16B]
        movhps  %xmm2, 8(%rax)  # tmp1462, MEM[base: D.27734_1010, offset: 0B]
        addq    $32, %rax       #, ivtmp.6388

The SIMD instructions are movaps, mulps, unpckhps… The letter “p” (packed) in the penultimate position refers to the full length of the register, while the final letter “s” refers to single precision.

I tried writing an SSE version in C of the example above, but it did not work out: I got a version slower than the scalar one. It seems easier to rewrite the code in a more compiler-friendly form than to write an SSE version manually.

5 thoughts on “Help vectorization of the code”

  1. It would be easier to convert the most CPU-intensive parts into OpenCL kernels.
    We cannot expect the auto-vectorizers in C or Fortran compilers to dramatically improve in a few years.

  2. I ran a few tests already, but in small-molecule crystallography we do not have enough computation to do. Most of the time is spent on memory transfers; we have very little computation compared to the amount of data.

    Well, I tried from a Fortran interface. When I am more used to low-level C programming I can try again. The algorithm should be rethought: there are some conditional branches, and I think it is not really suitable for the GPU.

    The other problem, more practical, is the very expensive hardware. With just a low-end CPU I can already do plenty of things, but for graphics cards the cheap ones are limited, when they have one at all. It requires an investment where I cannot see the result.

    The auto-vectorizer might not improve a lot, but Intel and AMD continue to improve SIMD. The new fused multiply-add instructions arriving with AVX2 will be very useful, as will the 256-bit AVX registers. They will push back the point where a GPU is needed, especially for medium-sized problems.

    The calculations I am playing with spend about 70-80% of their time in exp and sin/cos functions, so with your SSE functions it was quite an improvement.

  3. Hi, I came across your post about plotting libraries, but commenting is closed there.
    Interesting post. One library missing there is Chaco.
    I’ve read opinions that it has some advantages over the most popular matplotlib for interactive plots, but I haven’t tried it yet.