Autovectorization

Vectorization is a nice optimisation concept: it lets the processor perform several calculations simultaneously using its SIMD units, but it only works in special cases. I tried to apply it to structure factor calculations. Fortran does not let you write explicit vector code the way you can in C with intrinsics; you can only rely on autovectorization by the compiler, by exposing patterns it recognises.
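
For context, each structure factor is, in its simplest kinematic form, a sum over all the atoms of the cell (the formula I actually use is much more complicated, but it keeps the same shape):

F(hkl) = \sum_{j=1}^{N_{atoms}} f_j(hkl) \, \exp\left[ 2\pi i \, (h x_j + k y_j + l z_j) \right]

so there are two natural loops to expose to the vectorizer: the loop over the atoms of one reflection, or a loop over a batch of reflections.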

Currently, each thread calculates its structure factors one by one. I modified the function to process them in batches, which gives the compiler a much better chance to do its job.
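
Here is a minimal sketch of the idea, with hypothetical names (sf_batch, hkl, xyz, f0, fcalc) and a deliberately simplified model (one scattering factor per atom, no symmetry, none of the extra terms of the real formula): the innermost loop runs over a small fixed-size batch of reflections, with contiguous accumulators and no dependency between iterations, which is exactly the kind of pattern the autovectorizer recognises.

! Sketch only: hypothetical names, simplified physics, no symmetry.
subroutine sf_batch(nref, natom, hkl, xyz, f0, fcalc)
  implicit none
  integer, parameter   :: batch = 32        ! 16 or 32 gave the best results
  real,    parameter   :: twopi = 6.2831853
  integer, intent(in)  :: nref, natom
  real,    intent(in)  :: hkl(nref,3)       ! h,k,l stored reflection-major (unit stride below)
  real,    intent(in)  :: xyz(3,natom)      ! fractional coordinates
  real,    intent(in)  :: f0(natom)         ! scattering factor of each atom
  complex, intent(out) :: fcalc(nref)
  real    :: phase, re(batch), im(batch)
  integer :: i, j, k, n

  do i = 1, nref, batch
    n = min(batch, nref - i + 1)
    re(1:n) = 0.0
    im(1:n) = 0.0
    do j = 1, natom
      ! Innermost loop over the batch of reflections: contiguous loads and
      ! stores, no dependency between iterations -> vectorizable.
      do k = 1, n
        phase = twopi * ( hkl(i+k-1,1)*xyz(1,j) &
                        + hkl(i+k-1,2)*xyz(2,j) &
                        + hkl(i+k-1,3)*xyz(3,j) )
        re(k) = re(k) + f0(j)*cos(phase)
        im(k) = im(k) + f0(j)*sin(phase)
      end do
    end do
    fcalc(i:i+n-1) = cmplx(re(1:n), im(1:n))
  end do
end subroutine sf_batch

The atom data is also reused across the whole batch instead of being reloaded for every single reflection, which is probably where part of the drop in L1 misses comes from.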

  1. Current version: 3 threads, structure factors calculated one by one.
[pascal@vinci ediff]$ perf stat -v -d  ../../trunk-r607/edensgrid --nopeaks  xu5015_shelxl
 Calculating structure factors
100.00 % done, remaining:   0:00min, elapsed:   0:01min, rate:   13.3 us/loop 
101507 reflections processed in   1.4 s (201 atoms)
 Performance counter stats for '../../trunk-r607/edensgrid --nopeaks xu5015_shelxl':
 
       7210,750407 task-clock                #    1,664 CPUs utilized          
               926 context-switches          #    0,000 M/sec                  
                18 CPU-migrations            #    0,000 M/sec                  
             7 404 page-faults               #    0,001 M/sec                  
    20 294 956 191 cycles                    #    2,815 GHz                     [24,85%]
     <not counted> stalled-cycles-frontend 
     <not counted> stalled-cycles-backend  
    27 473 509 707 instructions              #    1,35  insns per cycle         [37,42%]
     4 759 056 611 branches                  #  659,995 M/sec                   [37,49%]
        96 886 241 branch-misses             #    2,04% of all branches         [37,76%]
     8 209 247 003 L1-dcache-loads           # 1138,473 M/sec                   [25,22%]
       166 461 565 L1-dcache-load-misses     #    2,03% of all L1-dcache hits   [25,24%]
        88 992 738 LLC-loads                 #   12,342 M/sec                   [25,09%]
           710 262 LLC-load-misses           #    0,80% of all LL-cache hits    [24,87%]
 
       4,333275750 seconds time elapsed
  2. New version: 3 threads, structure factors calculated in batches of 32.
[pascal@vinci ediff]$ perf stat -v -d  ../../trunk/edensgrid --nopeaks  xu5015_shelxl
 Calculating structure factors
100.00 % done, remaining:   0:00min, elapsed:   0:01min, rate:   11.1 us/loop 
101507 reflections processed in   1.1 s (201 atoms)
        3.6175E+07 reflections*atoms*s^-1
 
 Performance counter stats for '../../trunk/edensgrid --nopeaks xu5015_shelxl':
 
       6510,431679 task-clock                #    1,584 CPUs utilized          
               904 context-switches          #    0,000 M/sec                  
                20 CPU-migrations            #    0,000 M/sec                  
             6 750 page-faults               #    0,001 M/sec                  
    18 147 024 815 cycles                    #    2,787 GHz                     [25,20%]
     <not counted> stalled-cycles-frontend 
     <not counted> stalled-cycles-backend  
    26 969 494 651 instructions              #    1,49  insns per cycle         [37,78%]
     4 751 540 537 branches                  #  729,835 M/sec                   [37,76%]
        76 868 617 branch-misses             #    1,62% of all branches         [37,75%]
     7 161 160 309 L1-dcache-loads           # 1099,952 M/sec                   [25,11%]
        36 037 316 L1-dcache-load-misses     #    0,50% of all L1-dcache hits   [25,07%]
        16 708 632 LLC-loads                 #    2,566 M/sec                   [24,88%]
           568 664 LLC-load-misses           #    3,40% of all LL-cache hits    [24,99%]
 
       4,110610561 seconds time elapsed

The result is a speed increase of around 15-25%. L1 data cache misses dropped a lot and more instructions are executed per cycle; the pressure is pushed back onto the L2 cache. Not every part has been vectorized as I would have expected, so some more insight using the -ftree-vectorizer-verbose flag should help.

Best results have been obtained with batches of 16 or 32 structure factors.

The compiler analysis confirmed that more parts have been vectorized. However, the same batched code compiled without vectorization is still about 10% faster than the original version, meaning that a bit more than vectorization is going on (the improved cache behaviour probably accounts for part of it).

Before:

modules/functions.f90:922: note: LOOP VECTORIZED.
modules/functions.f90:922: note: LOOP VECTORIZED.
modules/functions.f90:905: note: vectorized 2 loops in function.

After:

modules/functions.f90:852: note: LOOP VECTORIZED.
modules/functions.f90:842: note: LOOP VECTORIZED.
modules/functions.f90:834: note: LOOP VECTORIZED.
modules/functions.f90:839: note: LOOP VECTORIZED.
modules/functions.f90:825: note: LOOP VECTORIZED.
modules/functions.f90:827: note: LOOP VECTORIZED.
modules/functions.f90:782: note: vectorized 6 loops in function.
 
edensgrid.F90:226: note: LOOP VECTORIZED.
edensgrid.F90:226: note: LOOP VECTORIZED.
edensgrid.F90:207: note: vectorized 2 loops in function.

I am now able to process nearly 50 million reflections*atoms per second. On a structure with about 200 atoms, that is more than 100,000 reflections per second. The results I obtained are very close to the ones obtained by Vincent Favre-Nicolin (pdf), but I am using a much more complicated formula suitable for small-molecule crystallography.

I don’t know whether anyone needs such fast calculations, but if so, they can always get in touch with me.

4 thoughts on “Autovectorization”

  1. I replaced the builtin exp function with the optimised, vectorized sincos and exp counterparts from ACML (vrsa_sincosf and vrsa_expf). I also added a switch to single precision calculations.
    Finally, I used the inversion centre, when present, to speed up the calculation (it saves half of the computations; see the note after the comments).

    The result: from 1.1 s it goes down to 240 ms.


    [pascal@vinci trunk]$ ./edensgrid --nopeaks xu5015_shelxl
    Calculating structure factors
    100.00 % done, remaining: 0:00min, elapsed: 0:00min, rate: 2.4 us/loop
    101507 reflections processed in 239 ms (201 atoms)
    1.7074E+08 reflections*atoms*s^-1

    I compared it to cctbx: I am 3.5 times faster. I took the idea of the inversion centre from them. However, the FFT method from cctbx is 20 times faster.

  2. Pingback: High perfomance structure factors calculations » Debroglie's repository

  3. Hi Pascal, very nice blog!
    I have a question about the perf tool: all the metrics it provides from the CPU counters, like the LLC*, CPI, etc. Are they measured for the whole CPU (so you get OS noise), or can perf filter per process, or even better per thread?
    Regards, Marc
    Mode french on
    super blog !
    Mode french off

  4. With perf, it depends.
    If you simply use perf top the counters are global; with the perf stat command they should be specific to the process you launch.

    Otherwise you have OProfile, which can filter on everything and gives you per-process results.
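
For reference, here is the inversion-centre shortcut mentioned in the first comment, in the ideal case where anomalous dispersion is neglected. If for every atom at position \mathbf{r}_j there is an equivalent atom at -\mathbf{r}_j, the two exponentials combine into a cosine and the imaginary part cancels:

F(\mathbf{h}) = \sum_{j \in \mathrm{half}} f_j \left[ e^{2\pi i\,\mathbf{h}\cdot\mathbf{r}_j} + e^{-2\pi i\,\mathbf{h}\cdot\mathbf{r}_j} \right] = 2 \sum_{j \in \mathrm{half}} f_j \cos(2\pi\,\mathbf{h}\cdot\mathbf{r}_j)

so only half of the atoms have to be visited, and only a cosine, no sine, needs to be evaluated for each of them.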
