The compiler gfortran is able to automatically vectorize part of the code. The flag "-ftree-vectorize" enables auto-vectorization, and some diagnostics are written on the screen at compile time with the flag "-ftree-vectorizer-verbose=x" (x is a number from 1 to 9). It is then possible to manually vectorize the parts where the compiler failed, or to rewrite the code in a form that the compiler can understand (see http://gcc.gnu.org/projects/tree-ssa/vectorization.html).
It is also possible to look at the assembly output to see whether SIMD instructions are used. The assembly output is generated with the flag “-S” during compilation.
As an example, let’s look at the line (Fortran):
gfortran reports the line as not vectorized:
```
1377: not vectorized: not suitable for gather D.26824_1220 = REALPART_EXPR <*cacheatomicfactorscomplex.0_164[D.14002_886]>;
```
Apparently the use of strides here is not compatible with vectorization. Remark: by default the array cacheatomicfactorscomplex is 16-byte aligned, but the strided accesses are not!
Update, 19th of April 2013: the use of strides is fine as long as the accesses are contiguous in memory. Because the variable is of complex type, this is not the case for the real part of the array. The solution I used was to work on real arrays, by splitting the complex components into two separate arrays.
The assembly output is:
```
.LBB1800:
	.loc 1 1377 0 is_stmt 0 discriminator 2
	movss	-4(%r13,%rax,4), %xmm0	# MEM[base: cosvalues_212, index: D.27776_1618, step: 4, offset: -4B], D.13998
	addq	$1, %rax	#, S.891
	movss	4(%rcx), %xmm2	# MEM[base: D.27777_1610, offset: 4B], D.26827
	mulss	%xmm0, %xmm2	# D.13998, D.26827
	mulss	(%rcx), %xmm0	# MEM[base: D.27777_1610, offset: 0B], tmp1403
	addq	%rsi, %rcx	# D.27770, ivtmp.6414
	movss	%xmm2, 4(%rdx)	# D.26827, MEM[base: D.27779_1517, offset: 4B]
	movss	%xmm0, (%rdx)	# tmp1403, MEM[base: D.27779_1517, offset: 0B]
	addq	$8, %rdx	#, ivtmp.6417
	cmpq	%rax, %rbp	# S.891, prologue_after_cost_adjust.5995
	jge	.L1561	#,
```
The first remark is that although the compiler reports the loop as not vectorized, xmm? registers (128-bit SSE2 registers) are used. The reason is that these registers can also be used in scalar mode, i.e. operating only on the first element. Such instructions can be recognised by the letter "s" in the penultimate position of the mnemonic: movss, mulss…
A vectorized loop looks like this:
```
	.loc 1 1395 0 discriminator 2
	mulps	%xmm3, %xmm0	# tmp1456, vect_var_.5919
	mulps	%xmm3, %xmm2	# tmp1456, vect_var_.5918
	movaps	%xmm2, %xmm3	# vect_var_.5918, tmp1461
	unpckhps	%xmm0, %xmm2	# vect_var_.5919, tmp1462
	unpcklps	%xmm0, %xmm3	# vect_var_.5919, tmp1461
	movlps	%xmm2, (%rax)	# tmp1462, MEM[base: D.27734_1010, offset: 0B]
	movlps	%xmm3, -16(%rax)	# tmp1461, MEM[base: D.27734_1010, offset: -16B]
	movhps	%xmm3, -8(%rax)	# tmp1461, MEM[base: D.27734_1010, offset: -16B]
	movhps	%xmm2, 8(%rax)	# tmp1462, MEM[base: D.27734_1010, offset: 0B]
	addq	$32, %rax	#, ivtmp.6388
```
The SIMD instructions are movaps, mulps, unpckhps… The letter "p" (packed) in the penultimate position indicates that the instruction operates on the full length of the register, while the final letter "s" indicates single precision.
I tried to write an SSE version in C of the example above, but it did not pay off: the version I obtained was slower than the scalar one. It seems easier to rewrite the code in a form that is friendlier to the compiler than to write an SSE version by hand.