autovectorization: “not vectorized: not suitable for gather”

The cryptic warning “not vectorized: not suitable for gather” comes from gfortran's auto-vectorization report. For a long time, gfortran reported this failed optimization on a very simple mathematical operation in my code. I finally tracked down the cause today, so here is a description of the problem together with a solution.

Let's start with a very simple program:

program gather
implicit none
 
real, dimension(:,:), allocatable :: datadata
real, dimension(100) :: chunk
 
allocate(datadata(1024,5))
chunk=datadata(121:220,3) ! line 8
datadata(121:220,2)=chunk ! line 9
print *, sum(datadata)
 
end program

If you compile it with auto-vectorization enabled:

gfortran  -O2 -ftree-vectorize -ftree-vectorizer-verbose=3 -g  before.f90 -o before

You will notice that lines 8 and 9 are vectorized. No surprise there: array assignment is one of the patterns recognised by the vectorizer. Now let us move lines 8 and 9 into a subroutine inside a module.

Module vecmod
contains
subroutine test(input)
    implicit none
    real, dimension(:,:), intent(inout) :: input
    real, dimension(100) :: chunk
 
    chunk=input(121:220,3)
    input(121:220,2)=chunk
end subroutine
end module

When compiling the module, you will see that the two lines no longer get vectorized! Gfortran reports that the data are not suitable for gather, which means that the data are not contiguous in memory.
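
For reference, the module can be compiled on its own with the same flags as before (I assume here that the file is called vecmod.f90, which matches the report below):

gfortran -O2 -ftree-vectorize -ftree-vectorizer-verbose=3 -c vecmod.f90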

Analyzing loop at vecmod.f90:8

8: ===== analyze_loop_nest =====
8: === vect_analyze_loop_form ===
8: === get_loop_niters ===
8: ==> get_loop_niters:100
8: === vect_analyze_data_refs ===

8: get vectype with 4 units of type real(kind=4)
8: vectype: vector(4) real(kind=4)
8: not vectorized: not suitable for gather D.1909_36 = *input.0_9[D.1908_35];

8: bad data references.

The reason is that the compiler has no idea whether the data passed to the subroutine are contiguous or not. The dummy argument is assumed-shape, so the actual argument could be, for instance, an array section with a non-unit stride, or a pointer to such a non-contiguous slice. The first example was vectorized because the array was allocatable, and an allocatable array is always contiguous in memory. The solution is to declare the dummy argument as contiguous in the subroutine. Note, however, that not all compilers will accept this, as contiguous is a Fortran 2008 feature.
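
To see why the compiler has to be pessimistic, here is a small caller of my own (a sketch, not part of the original example) that passes both a contiguous array and a strided, non-contiguous section to the same assumed-shape dummy:

program caller
    use vecmod, only: test
    implicit none
    real, dimension(2048, 5) :: big

    big = 1.0
    call test(big)               ! contiguous: the whole array
    call test(big(1:2047:2, :))  ! non-contiguous: every other row, strided in memory
    print *, sum(big)
end program

Both calls are perfectly legal, so without further information the compiler has to generate code that works for arbitrary strides. With the contiguous attribute added to the dummy argument, the module becomes: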

Module vecmod
contains
subroutine test(input)
    implicit none
    real, dimension(:,:), contiguous, intent(inout) :: input
    real, dimension(100) :: chunk
 
    chunk=input(121:220,3)
    input(121:220,2)=chunk
end subroutine
end module

Gfortran now reports both lines as vectorized:

8: vect_model_load_cost: unaligned supported by hardware.
8: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
8: vect_model_store_cost: aligned.
8: vect_get_data_access_cost: inside_cost = 3, outside_cost = 0.
8: vect_model_load_cost: unaligned supported by hardware.
8: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
8: vect_model_store_cost: aligned.
8: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
8: Cost model analysis: 
  Vector inside of loop cost: 3
  Vector outside of loop cost: 0
  Scalar iteration cost: 2
  Scalar outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1

8:   Profitability threshold = 3


Vectorizing loop at vecmod.f90:8

8: LOOP VECTORIZED.
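
One caveat worth adding (my own note, not part of the compiler output): the contiguous attribute does not forbid callers from passing strided sections, it only lets the compiler assume unit stride inside the subroutine. If a caller does pass a non-contiguous section, the compiler has to copy the data into a contiguous temporary on entry and copy it back on exit, which has its own cost. Reusing the hypothetical caller sketched earlier:

! Still legal with the contiguous dummy, but now typically implies a
! copy of the strided section into a contiguous temporary on entry,
! and a copy back on exit for an intent(inout) argument.
call test(big(1:2047:2, :))

! No copy needed: the actual argument is already contiguous.
call test(big)

So the attribute pays off as long as the hot-path callers pass data that is already contiguous.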

Performance-wise, this can be a huge improvement. The theoretical speed-up on Sandy Bridge (AVX instructions) is 8× in single precision, since a 256-bit AVX register holds eight 32-bit floats. Using perf, I compared the two versions on my library calculating structure factors. First, the non-contiguous version (~100k structure factors calculated in 90 ms):

       |              tempr=cosvalues*cacheatomicfactorsreal(hklindex:hklindex+n-1, uniqueatomindex(j))
       |              tempi=cosvalues*cacheatomicfactorsimag(hklindex:hklindex+n-1, uniqueatomindex(j))
  0,88 |20e8:   vmovss (%rdx),%xmm1
 11,31 |        add    %rcx,%rdx
  0,03 |        vmulss -0x4(%r10,%rax,4),%xmm1,%xmm1
  3,42 |        vmovss %xmm1,-0x4(%r12,%rax,4)
 10,88 |        add    $0x1,%rax
  0,01 |        cmp    %rax,%r11
       |      ? jge    20e8

cacheatomicfactorsreal and cacheatomicfactorsimag are my dummy arguments, not declared as contiguous. In the assembly you can see only scalar instructions (add, vmulss, vmovss). The numbers in the first column are the percentage of execution time attributed to each instruction. The perf report below is after adding the contiguous attribute:

       |              tempr=cosvalues*cacheatomicfactorsreal(hklindex:hklindex+n-1, uniqueatomindex(j))
       |              tempi=cosvalues*cacheatomicfactorsimag(hklindex:hklindex+n-1, uniqueatomindex(j))
  0,61 |22a0:   vmovup (%rsi,%rax,1),%xmm1
  2,27 |        add    $0x1,%rcx
       |        vinser $0x1,0x10(%rsi,%rax,1),%ymm1,%ymm1
  1,04 |        vmulps (%rbx,%rax,1),%ymm1,%ymm1
  2,98 |        vmovup %xmm1,(%r9,%rax,1)
  0,89 |        vextra $0x1,%ymm1,0x10(%r9,%rax,1)
  2,01 |        add    $0x20,%rax
  0,07 |        cmp    %r11,%rcx
       |      ? jb     22a0

The execution time for the structure factors is now 70 ms; compared to the previous 90 ms, that is roughly a 30% improvement, and I only changed two lines of code. The vmulps instruction is a vector instruction operating on a 256-bit %ymm register (AVX), which shows that the code was indeed vectorized.

Auto-vectorization works well most of the time, but it is important to give the compiler as much information as possible to guide it.
