# Code optimisation

A while ago, I discussed a bit about code optimisation using valgrind:
Spot bottlenecks. At the end I obtained something like 1sec for about 4000 reflections. Meaning about 250-500us/refl.

In the meantime, I needed a much faster version. I am using Monte Carlo simulation which is very demanding.

Here is the old code:

Original Fcalc subroutine

The part on the right about string was dealing with the symmetry operations. This was really ugly. A bad habbit from associated array in pythons/php 🙂 Strings operations are very slow compared to number. I am using integer indices now, problem solved.

The matmul_r4 in the middle is dealing with the expansion of the coordinate into P1. But this expansion is constant for every reflection at this time I did not bother avoiding this expansion for each reflection although the result is exactly the same. Now I am caching the result of the first calculation for reuse. I also merged the matrices operations by using stacked vectors so I can use a fast blas/lapack library (acml) without too much overhead.

The Kronecker product was also quite demanding. I am now caching the result at the beginning of the program.

Now the result: 70ms for 5700 reflections. That’s 13us/refl, 20 times faster. On a quad core processor, so 52us on one thread. Furthermore the code is doing slightly more things than the old one using the matrix of constraints, variables allocations…

Here is the result in kcachegrind:

Fcalc subroutine fully optimised

This version is now fully optimised. The large majority of CPU is used computing fortran intrinsics like exponential and atan2. The rest is really minor. My only chance now would be to use faster trigonometric functions. I tried the fastexp in the acml library but I did not see any improvement. Anyway, this is now fast enough.

Fortran code:
[codesyntax lang="fortran"]

 ! Code under the GPL v3 license complex(kind=8) function structfact(h,k,l, cifatomiccoordinates, cifatomicuaniso, newcoordinatespool, newanisopool, f2) use unitcellmod use datamod, ONLY: cifatomnametable, f1f2table, wavelength, listsymm, pi, twopi2, twopi use datamod, ONLY: numberofatoms, listsymsize, cachekronecker use datamod, only: cachesymmatrix, cachetranslation,cifsiteatomicnumber implicit none integer, dimension(3,3) :: sym double precision, dimension(3,1) :: translation double precision, dimension(2,1) :: f1f2 integer i,j, atomindex integer, intent(in) :: h,k,l logical, intent(in) :: f2 complex(kind=8) oldaf, af double precision, dimension(:,:), intent(in) :: cifatomiccoordinates double precision, dimension(:,:), intent(in) :: cifatomicuaniso ! data to cache some results real(kind=8), dimension(:,:,:), allocatable, intent(inout) :: newcoordinatespool, newanisopool real(kind=8) :: tempr1, temph2cela2, tempk2celb2, templ2celc2, hkab, lkbc, hlac real(kind=8) :: phi, z, anisoscale real(kind=8) :: resolution if(.not. (allocated(newcoordinatespool) .and. allocated(newanisopool))) then !$OMP CRITICAL if(.not. (allocated(newcoordinatespool) .and. allocated(newanisopool))) then allocate(newcoordinatespool(3,numberofatoms,listsymsize)) allocate(newanisopool(9,numberofatoms,listsymsize)) do i=1,listsymsize sym=cachesymmatrix(i,:,:) translation=cachetranslation(i,:,:) do j=1,numberofatoms newcoordinatespool(:,j,i)=translation(:,1) end do call DGEMM('N','N',3,numberofatoms,3,1.0D0,real(sym,8),3,& & cifatomiccoordinates(1:3,1:numberofatoms),3,1.0D0,& & newcoordinatespool(:,:,i),3) !print *, newcoordinatespool(1:3,1,i) !newaniso=matmul(cachekronecker(:,:,i),cifatomicuaniso(1:9,2:2)) !print *, newaniso(1:3,1) call DGEMM('N','N',9,numberofatoms,9,1.0D0,real(cachekronecker(:,:,i),8),9,& & cifatomicuaniso(1:9,1:numberofatoms),9,0.0D0,newanisopool(:,:,i),9) !print *, newanisopool(:,2,i) end do end if !$OMP END CRITICAL end if structfact=cmplx(0.0,0.0,8) ! temp variable to avoid unnecessary calculations temph2cela2=h**2*rcela**2 tempk2celb2=k**2*rcelb**2 templ2celc2=l**2*rcelc**2 hkab=h*k*rcela*rcelb lkbc=l*k*rcelb*rcelc hlac=h*l*rcela*rcelc resolution=dataresolution(dhkl(h,k,l)) Do j=1,numberofatoms atomindex=cifsiteatomicnumber(j) ! f' and f'' f1f2=f1f2table(1:2,atomindex:atomindex) tempr1=cifatomiccoordinates(6,j)*& & exp(real(-4.0, 8)*twopi2*cifatomiccoordinates(5,j)*(resolution)**2)/& & cifatomiccoordinates(4,j) ! saving the atomic factor for later use oldaf=atomicfactor(atomindex, resolution, f1f2, f2) do i=1,listsymsize af = oldaf ! #anisotropic scale factor ! # 11 22 33 23 13 12 anisoscale=exp(-twopi2*(newanisopool(1,j,i)*temph2cela2 + & & newanisopool(5,j,i)*tempk2celb2 + & & newanisopool(9,j,i)*templ2celc2 + & & 2.0d0*newanisopool(2,j,i)*hkab + & & 2.0d0*newanisopool(6,j,i)*lkbc + & & 2.0d0*newanisopool(3,j,i)*hlac)) phi = twopi*(real(h,8)*newcoordinatespool(1,j,i)+ & & real(k,8)*newcoordinatespool(2,j,i)+ & & real(l,8)*newcoordinatespool(3,j,i)) phi=phi+atan2(aimag(af),real(af,8)) phi=mod(phi,real(twopi,8)) z=abs(af) af=z*exp(cmplx(0.0,1.0,8)*phi)*tempr1*anisoscale structfact=structfact+af enddo enddo return end function 

[/codesyntax] 

## One thought on “Code optimisation”

1. I replaced exp by fastexp for a test (8000 reflections):
normal exp: 200ms/loop, fast exp: 150ms/loop