A while ago, I discussed code optimisation using valgrind:

Spot bottlenecks. At the end I obtained something like 1 s for about 4000 reflections, meaning roughly 250-500 µs per reflection.

In the meantime, I needed a much faster version: I am running Monte Carlo simulations, which are very demanding.

Here is the old code:

The part on the right, manipulating strings, was dealing with the symmetry operations. This was really ugly: a bad habit from associative arrays in Python/PHP 🙂 String operations are very slow compared to numeric ones. I am using integer indices now, problem solved.
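To illustrate the point (a Python sketch, not the original Fortran; the dictionaries and names here are purely illustrative), looking up symmetry operations by string key pays for hashing and string comparison on every access in the hot loop, while an integer index is a direct array access:

```python
# Sketch: string-keyed lookup vs integer indexing in a hot loop.
# The tables and names below are illustrative, not from the real code.
import timeit

symm_by_name = {"x,y,z": 0, "-x,-y,z": 1, "x,-y,-z": 2, "-x,y,-z": 3}
symm_list = [0, 1, 2, 3]

def lookup_by_string():
    total = 0
    for name in ("x,y,z", "-x,-y,z", "x,-y,-z", "-x,y,-z"):
        total += symm_by_name[name]   # hash + string comparison each time
    return total

def lookup_by_index():
    total = 0
    for idx in (0, 1, 2, 3):
        total += symm_list[idx]       # direct array indexing
    return total

t_str = timeit.timeit(lookup_by_string, number=100_000)
t_int = timeit.timeit(lookup_by_index, number=100_000)
print(f"string keys: {t_str:.3f}s, integer indices: {t_int:.3f}s")
```

Both loops compute the same thing; only the addressing scheme differs, and the integer version is the cheaper one inside a loop executed millions of times.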

The matmul_r4 in the middle deals with the expansion of the coordinates into P1. This expansion is constant for every reflection; at the time I did not bother avoiding it, even though the result is exactly the same each time. Now I am caching the result of the first calculation for reuse. I also merged the matrix operations by using stacked vectors, so I can use a fast BLAS/LAPACK library (ACML) without too much overhead.
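The stacking idea can be sketched in pure Python (a stand-in for the DGEMM calls in the Fortran; matrices and values are made up): instead of applying the 3×3 symmetry rotation R to each atom's coordinate vector separately, stack all coordinates as columns of a 3×N matrix, seed the output with the translation, and do a single multiply — exactly what DGEMM with beta = 1 does:

```python
# Sketch of batching: one matrix product replaces N matrix-vector products.
def matmul(A, B):
    """Plain matrix product of A (m x k) and B (k x n) as nested lists."""
    m, kk, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(kk)) for j in range(n)]
            for i in range(m)]

R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]       # a 90-degree rotation about z
t = [0.5, 0.0, 0.5]                          # translation part of the symop

coords = [[0.1, 0.2], [0.3, 0.4], [0.25, 0.75]]  # 3 x N, one column per atom

# One by one: R @ x_j + t for each atom j
one_by_one = [[sum(R[i][p] * coords[p][j] for p in range(3)) + t[i]
               for j in range(2)] for i in range(3)]

# Batched: one matrix product, then add the translation to every column
# (DGEMM does this in one call when the output is pre-filled with t, beta=1)
prod = matmul(R, coords)
batched = [[prod[i][j] + t[i] for j in range(2)] for i in range(3)]

print(one_by_one == batched)   # → True
```

With a real BLAS the batched form also benefits from vectorisation and cache blocking, which is where the speed-up comes from.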

The Kronecker product was also quite demanding. I am now caching the result at the beginning of the program.
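The reason this cache works is worth spelling out (a Python sketch with made-up values; the real code keeps these 9×9 operators in `cachekronecker`): under a symmetry rotation R, the anisotropic displacement tensor transforms as U' = R U Rᵀ, which in vectorised (row-major) form is vec(U') = (R ⊗ R) vec(U). The Kronecker product R ⊗ R depends only on the symmetry operation, never on the reflection, so it can be computed once at start-up:

```python
# Sketch: cache R kron R once, then transforming U is a single 9x9 multiply.
def kron(A, B):
    """Kronecker product of two 3x3 matrices as a 9x9 nested list."""
    return [[A[i // 3][j // 3] * B[i % 3][j % 3] for j in range(9)]
            for i in range(9)]

R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]                 # example rotation
U = [[0.02, 0.001, 0.0],                               # example U tensor
     [0.001, 0.03, 0.002],
     [0.0, 0.002, 0.01]]

# Direct tensor transform: U' = R U R^T
RU = [[sum(R[i][p] * U[p][j] for p in range(3)) for j in range(3)] for i in range(3)]
Up = [[sum(RU[i][p] * R[j][p] for p in range(3)) for j in range(3)] for i in range(3)]

# Cached 9x9 operator applied to vec(U) (row-major flattening)
K = kron(R, R)                      # computed once per symmetry operation
vecU = [U[i][j] for i in range(3) for j in range(3)]
vecUp = [sum(K[r][c] * vecU[c] for c in range(9)) for r in range(9)]

print(all(abs(vecUp[3 * i + j] - Up[i][j]) < 1e-12
          for i in range(3) for j in range(3)))   # → True
```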

Now the result: 70 ms for 5700 reflections. That is about 13 µs/reflection, 20 times faster. This is on a quad-core processor, so about 52 µs on a single thread. Furthermore, the code is doing slightly more than the old one: using the matrix of constraints, variable allocations…

Here is the result in kcachegrind:

This version is now fully optimised. The large majority of the CPU time is spent in Fortran intrinsics such as exponential and atan2; the rest is really minor. My only remaining option would be faster trigonometric functions. I tried the fastexp in the ACML library but did not see any improvement. Anyway, this is now fast enough.
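To see why exp and atan2 dominate, here is the per-atom kernel sketched in Python (values are made up; `contribution` is an illustrative stand-in for the inner loop): each symmetry copy adds |f|·exp(i·(2π(hx + ky + lz) + arg f)), which is algebraically just f·exp(2πi(hx + ky + lz)) — one atan2, one abs and one complex exp per term, and nothing else of comparable cost:

```python
# Sketch of the per-reflection kernel: the transcendental calls are the
# irreducible cost once everything else is cached.
import cmath
import math

twopi = 2.0 * math.pi

def contribution(f, h, k, l, xyz):
    """One atom's term, written in the polar form used by the Fortran."""
    x, y, z = xyz
    phase = twopi * (h * x + k * y + l * z)
    phi = phase + math.atan2(f.imag, f.real)   # arg(f) via atan2
    return abs(f) * cmath.exp(1j * phi)        # |f| * e^(i*phi)

f = complex(6.7, 0.3)                 # scattering factor including f''
a = contribution(f, 1, 2, 0, (0.1, 0.25, 0.4))
b = f * cmath.exp(1j * twopi * (1 * 0.1 + 2 * 0.25 + 0 * 0.4))
print(abs(a - b) < 1e-12)             # → True: the two forms agree
```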

Fortran code:

`[codesyntax lang="fortran"]`

```
! Code under the GPL v3 license
complex(kind=8) function structfact(h, k, l, cifatomiccoordinates, cifatomicuaniso, &
                                    newcoordinatespool, newanisopool, f2)
  use unitcellmod
  use datamod, only: cifatomnametable, f1f2table, wavelength, listsymm, pi, twopi2, twopi
  use datamod, only: numberofatoms, listsymsize, cachekronecker
  use datamod, only: cachesymmatrix, cachetranslation, cifsiteatomicnumber
  implicit none
  integer, dimension(3,3) :: sym
  double precision, dimension(3,1) :: translation
  double precision, dimension(2,1) :: f1f2
  integer :: i, j, atomindex
  integer, intent(in) :: h, k, l
  logical, intent(in) :: f2
  complex(kind=8) :: oldaf, af
  double precision, dimension(:,:), intent(in) :: cifatomiccoordinates
  double precision, dimension(:,:), intent(in) :: cifatomicuaniso
  ! data to cache some results
  real(kind=8), dimension(:,:,:), allocatable, intent(inout) :: newcoordinatespool, newanisopool
  real(kind=8) :: tempr1, temph2cela2, tempk2celb2, templ2celc2, hkab, lkbc, hlac
  real(kind=8) :: phi, z, anisoscale
  real(kind=8) :: resolution

  ! lazy initialisation of the caches: built once, inside a critical
  ! section so that only one thread performs the expansion into P1
  if (.not. (allocated(newcoordinatespool) .and. allocated(newanisopool))) then
    !$OMP CRITICAL
    if (.not. (allocated(newcoordinatespool) .and. allocated(newanisopool))) then
      allocate(newcoordinatespool(3, numberofatoms, listsymsize))
      allocate(newanisopool(9, numberofatoms, listsymsize))
      do i = 1, listsymsize
        sym = cachesymmatrix(i,:,:)
        translation = cachetranslation(i,:,:)
        do j = 1, numberofatoms
          newcoordinatespool(:,j,i) = translation(:,1)
        end do
        ! coordinates: R*x + t for all atoms in one DGEMM (beta=1 keeps t)
        call DGEMM('N', 'N', 3, numberofatoms, 3, 1.0D0, real(sym,8), 3, &
             & cifatomiccoordinates(1:3,1:numberofatoms), 3, 1.0D0, &
             & newcoordinatespool(:,:,i), 3)
        ! anisotropic U: (R kron R)*vec(U) for all atoms in one DGEMM
        call DGEMM('N', 'N', 9, numberofatoms, 9, 1.0D0, real(cachekronecker(:,:,i),8), 9, &
             & cifatomicuaniso(1:9,1:numberofatoms), 9, 0.0D0, newanisopool(:,:,i), 9)
      end do
    end if
    !$OMP END CRITICAL
  end if

  structfact = cmplx(0.0, 0.0, 8)
  ! temporary variables to avoid unnecessary recomputation
  temph2cela2 = h**2 * rcela**2
  tempk2celb2 = k**2 * rcelb**2
  templ2celc2 = l**2 * rcelc**2
  hkab = h*k * rcela*rcelb
  lkbc = l*k * rcelb*rcelc
  hlac = h*l * rcela*rcelc
  resolution = dataresolution(dhkl(h, k, l))
  do j = 1, numberofatoms
    atomindex = cifsiteatomicnumber(j)
    ! f' and f''
    f1f2 = f1f2table(1:2, atomindex:atomindex)
    tempr1 = cifatomiccoordinates(6,j) * &
         & exp(real(-4.0, 8) * twopi2 * cifatomiccoordinates(5,j) * resolution**2) / &
         & cifatomiccoordinates(4,j)
    ! saving the atomic factor for later use
    oldaf = atomicfactor(atomindex, resolution, f1f2, f2)
    do i = 1, listsymsize
      af = oldaf
      ! anisotropic scale factor
      ! U components: 11 22 33 23 13 12
      anisoscale = exp(-twopi2 * (newanisopool(1,j,i)*temph2cela2 + &
           & newanisopool(5,j,i)*tempk2celb2 + &
           & newanisopool(9,j,i)*templ2celc2 + &
           & 2.0d0*newanisopool(2,j,i)*hkab + &
           & 2.0d0*newanisopool(6,j,i)*lkbc + &
           & 2.0d0*newanisopool(3,j,i)*hlac))
      phi = twopi * (real(h,8)*newcoordinatespool(1,j,i) + &
           & real(k,8)*newcoordinatespool(2,j,i) + &
           & real(l,8)*newcoordinatespool(3,j,i))
      phi = phi + atan2(aimag(af), real(af,8))
      phi = mod(phi, real(twopi,8))
      z = abs(af)
      af = z * exp(cmplx(0.0, 1.0, 8) * phi) * tempr1 * anisoscale
      structfact = structfact + af
    end do
  end do
  return
end function
```

`[/codesyntax]`
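A side note on the cache initialisation above: it uses a check-lock-recheck sequence (double-checked locking) around `!$OMP CRITICAL`, so threads skip the lock entirely once the cache exists, and only one thread pays the construction cost. A minimal Python sketch of the same pattern, with illustrative names:

```python
# Sketch of lazy initialisation with double-checked locking, as done with
# !$OMP CRITICAL in the Fortran. Names here are illustrative.
import threading

_cache = None
_lock = threading.Lock()

def expensive_build():
    """Stand-in for expanding coordinates and U tensors into P1."""
    return {"built": True}

def get_cache():
    global _cache
    if _cache is None:              # fast path: no locking once initialised
        with _lock:
            if _cache is None:      # re-check: another thread may have won
                _cache = expensive_build()
    return _cache

print(get_cache()["built"])         # → True
```

The second `if` inside the lock is essential: without it, two threads that both pass the first check would build the cache twice.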

I replaced exp with fastexp for a test (8000 reflections):

normal exp: 200 ms/loop, fast exp: 150 ms/loop

No multithreading used.
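For completeness, the ms/loop numbers above can be reproduced with a minimal single-threaded harness like the following sketch (Python stand-in; the loop body and values are illustrative, not ACML's fastexp):

```python
# Sketch: measuring ms/loop for a kernel dominated by exp calls.
import math
import timeit

xs = [i * 1e-3 for i in range(1000)]

def loop_exp():
    """One 'loop': an exp-heavy pass over the data, as in the hot kernel."""
    return sum(math.exp(-x) for x in xs)

repeats = 100
t = timeit.timeit(loop_exp, number=repeats)
print(f"{t / repeats * 1e3:.3f} ms/loop")
```

Swapping `math.exp` for a candidate fast approximation in `loop_exp` gives a like-for-like comparison of the two figures.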