Vectorization is an appealing optimisation technique. It performs several calculations simultaneously using the SIMD units of the processor, but it only works in specific cases. I tried to apply it to the structure factor calculations. Fortran does not let you write explicit vector code the way you can in C, so you can only rely on the compiler's autovectorization by exposing patterns it knows how to handle.
Currently, each thread calculates structure factors one at a time. I modified the function to process them in batches, which lets the compiler do a better job.
- Current version. 3 threads, structure factors calculated one by one.
[pascal@vinci ediff]$ perf stat -v -d ../../trunk-r607/edensgrid --nopeaks xu5015_shelxl
Calculating structure factors
100.00 % done, remaining: 0:00min, elapsed: 0:01min, rate: 13.3 us/loop
101507 reflections processed in 1.4 s (201 atoms)
Performance counter stats for '../../trunk-r607/edensgrid --nopeaks xu5015_shelxl':
7210,750407 task-clock # 1,664 CPUs utilized
926 context-switches # 0,000 M/sec
18 CPU-migrations # 0,000 M/sec
7 404 page-faults # 0,001 M/sec
20 294 956 191 cycles # 2,815 GHz [24,85%]
stalled-cycles-frontend
stalled-cycles-backend
27 473 509 707 instructions # 1,35 insns per cycle [37,42%]
4 759 056 611 branches # 659,995 M/sec [37,49%]
96 886 241 branch-misses # 2,04% of all branches [37,76%]
8 209 247 003 L1-dcache-loads # 1138,473 M/sec [25,22%]
166 461 565 L1-dcache-load-misses # 2,03% of all L1-dcache hits [25,24%]
88 992 738 LLC-loads # 12,342 M/sec [25,09%]
710 262 LLC-load-misses # 0,80% of all LL-cache hits [24,87%]
4,333275750 seconds time elapsed
- New version. 3 threads, structure factors calculated in batches of 32.
[pascal@vinci ediff]$ perf stat -v -d ../../trunk/edensgrid --nopeaks xu5015_shelxl
Calculating structure factors
100.00 % done, remaining: 0:00min, elapsed: 0:01min, rate: 11.1 us/loop
101507 reflections processed in 1.1 s (201 atoms)
3.6175E+07 reflections*atoms*s^-1
Performance counter stats for '../../trunk/edensgrid --nopeaks xu5015_shelxl':
6510,431679 task-clock # 1,584 CPUs utilized
904 context-switches # 0,000 M/sec
20 CPU-migrations # 0,000 M/sec
6 750 page-faults # 0,001 M/sec
18 147 024 815 cycles # 2,787 GHz [25,20%]
stalled-cycles-frontend
stalled-cycles-backend
26 969 494 651 instructions # 1,49 insns per cycle [37,78%]
4 751 540 537 branches # 729,835 M/sec [37,76%]
76 868 617 branch-misses # 1,62% of all branches [37,75%]
7 161 160 309 L1-dcache-loads # 1099,952 M/sec [25,11%]
36 037 316 L1-dcache-load-misses # 0,50% of all L1-dcache hits [25,07%]
16 708 632 LLC-loads # 2,566 M/sec [24,88%]
568 664 LLC-load-misses # 3,40% of all LL-cache hits [24,99%]
4,110610561 seconds time elapsed
The result is a speed increase of around 15-25%. L1 data cache misses dropped sharply (from 2.03% to 0.50% of loads) and more instructions are executed per cycle (1.49 instead of 1.35). The pressure is pushed back onto the L2 cache. Not all portions have been vectorized as I would have expected; more insight from the -ftree-vectorizer-verbose flag will help.
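The derived figures can be recomputed from the raw counters of the two perf runs above. The small C sketch below does only that arithmetic. The struct and function names are hypothetical, and the values are copied from the perf output and from the "reflections processed in ... s" lines.

```c
#include <assert.h>
#include <stdio.h>

/* Counters copied from the two perf runs shown above. */
struct perf_run {
    const char *label;
    double cycles;
    double instructions;
    double l1_loads;
    double l1_misses;
    double sf_seconds;   /* time reported for the structure factors */
};

const struct perf_run run_before = {
    "one by one", 20294956191.0, 27473509707.0,
    8209247003.0, 166461565.0, 1.4
};
const struct perf_run run_after = {
    "batches of 32", 18147024815.0, 26969494651.0,
    7161160309.0, 36037316.0, 1.1
};

/* Ratio of two counters; the denominator must be positive. */
double ratio(double num, double den)
{
    assert(den > 0.0);
    return num / den;
}

/* Instructions per cycle. */
double ipc(const struct perf_run *r)
{
    return ratio(r->instructions, r->cycles);
}

/* L1 data cache misses as a percentage of L1 loads. */
double l1_miss_percent(const struct perf_run *r)
{
    return 100.0 * ratio(r->l1_misses, r->l1_loads);
}

void report(const struct perf_run *r)
{
    printf("%-14s IPC %.2f, L1 miss rate %.2f%%\n",
           r->label, ipc(r), l1_miss_percent(r));
}
```

Calling report on both runs reproduces the IPC and L1 miss rates that perf prints, and ratio(run_before.sf_seconds, run_after.sf_seconds) gives the speedup of the structure factor step alone, about 1.27 with the rounded timings above.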
Best results have been obtained with batches of 16 or 32 structure factors.
The compiler's analysis confirmed that more parts have been vectorized. However, the same code compiled without vectorization is still 10% faster, so something more than vectorization is going on.
Before:
modules/functions.f90:922: note: LOOP VECTORIZED.
modules/functions.f90:922: note: LOOP VECTORIZED.
modules/functions.f90:905: note: vectorized 2 loops in function.
After:
modules/functions.f90:852: note: LOOP VECTORIZED.
modules/functions.f90:842: note: LOOP VECTORIZED.
modules/functions.f90:834: note: LOOP VECTORIZED.
modules/functions.f90:839: note: LOOP VECTORIZED.
modules/functions.f90:825: note: LOOP VECTORIZED.
modules/functions.f90:827: note: LOOP VECTORIZED.
modules/functions.f90:782: note: vectorized 6 loops in function.
edensgrid.F90:226: note: LOOP VECTORIZED.
edensgrid.F90:226: note: LOOP VECTORIZED.
edensgrid.F90:207: note: vectorized 2 loops in function.
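For readers unfamiliar with these reports, the loops flagged as vectorized are typically simple counted loops over contiguous arrays whose iterations do not depend on each other. The C sketch below is not taken from edensgrid and its function names are hypothetical. It contrasts such a loop with one that carries a dependency from one iteration to the next. The restrict qualifier expresses in C the no-aliasing guarantee that Fortran already assumes for dummy arguments. With GCC, a report like the one above comes from -ftree-vectorizer-verbose on older releases, and newer releases provide -fopt-info-vec for the same purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations over contiguous arrays. restrict promises the
   compiler that x and y do not overlap, so this loop is a typical
   candidate for autovectorization. */
void scale_add(size_t n, double a,
               const double *restrict x, double *restrict y)
{
    assert(n == 0 || x != y);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Loop-carried dependency: iteration i needs the result of iteration
   i - 1, which generally prevents straightforward vectorization. */
void running_sum(size_t n, double *y)
{
    for (size_t i = 1; i < n; i++)
        y[i] += y[i - 1];
}
```

Restructuring a calculation into batches is one way to turn the second kind of loop nest into the first: the dependency stays in the outer loop while the inner loop over the batch remains independent.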
I am now able to process nearly 50 million reflections*atoms per second. On a structure with about 200 atoms, that is more than 100k reflections per second. The results are very close to those obtained by Vincent Favre-Nicolin (pdf), but I am using a much more complicated formula suitable for small-molecule crystallography.
I don’t know whether anyone needs such a fast calculation, but anyone who does can always get in touch with me.










