Code optimisation

A while ago, I discussed a bit about code optimisation using valgrind:
Spot bottlenecks. At the end I obtained something like 1sec for about 4000 reflections. Meaning about 250-500us/refl.

In the meantime, I needed a much faster version. I am using Monte Carlo simulation which is very demanding.

Here is the old code:

Original Fcalc subroutine


The part on the right about string was dealing with the symmetry operations. This was really ugly. A bad habbit from associated array in pythons/php 🙂 Strings operations are very slow compared to number. I am using integer indices now, problem solved.

The matmul_r4 in the middle is dealing with the expansion of the coordinate into P1. But this expansion is constant for every reflection at this time I did not bother avoiding this expansion for each reflection although the result is exactly the same. Now I am caching the result of the first calculation for reuse. I also merged the matrices operations by using stacked vectors so I can use a fast blas/lapack library (acml) without too much overhead.

The Kronecker product was also quite demanding. I am now caching the result at the beginning of the program.

Now the result: 70ms for 5700 reflections. That’s 13us/refl, 20 times faster. On a quad core processor, so 52us on one thread. Furthermore the code is doing slightly more things than the old one using the matrix of constraints, variables allocations…

Here is the result in kcachegrind:

Fcalc subroutine fully optimised

This version is now fully optimised. The large majority of CPU is used computing fortran intrinsics like exponential and atan2. The rest is really minor. My only chance now would be to use faster trigonometric functions. I tried the fastexp in the acml library but I did not see any improvement. Anyway, this is now fast enough.

Fortran code:
[codesyntax lang="fortran"]

! Code under the GPL v3 license

complex(kind=8) function structfact(h,k,l, cifatomiccoordinates, cifatomicuaniso, newcoordinatespool, newanisopool, f2)
use unitcellmod
use datamod, ONLY: cifatomnametable, f1f2table, wavelength, listsymm, pi, twopi2, twopi
use datamod, ONLY: numberofatoms, listsymsize, cachekronecker
use datamod, only: cachesymmatrix, cachetranslation,cifsiteatomicnumber
implicit none

integer, dimension(3,3) :: sym
double precision, dimension(3,1) :: translation
double precision, dimension(2,1) :: f1f2
integer i,j, atomindex
integer, intent(in) :: h,k,l
logical, intent(in) :: f2
complex(kind=8) oldaf, af
double precision, dimension(:,:), intent(in) :: cifatomiccoordinates
double precision, dimension(:,:), intent(in) :: cifatomicuaniso
! data to cache some results
real(kind=8), dimension(:,:,:), allocatable, intent(inout) :: newcoordinatespool, newanisopool
real(kind=8) :: tempr1, temph2cela2, tempk2celb2, templ2celc2, hkab, lkbc, hlac
real(kind=8) :: phi, z, anisoscale
real(kind=8) :: resolution

if(.not. (allocated(newcoordinatespool) .and. allocated(newanisopool))) then
    !$OMP CRITICAL
    if(.not. (allocated(newcoordinatespool) .and. allocated(newanisopool))) then
        allocate(newcoordinatespool(3,numberofatoms,listsymsize))
        allocate(newanisopool(9,numberofatoms,listsymsize))
        do i=1,listsymsize
            sym=cachesymmatrix(i,:,:)
            translation=cachetranslation(i,:,:)

            do j=1,numberofatoms
                newcoordinatespool(:,j,i)=translation(:,1)
            end do
            call DGEMM('N','N',3,numberofatoms,3,1.0D0,real(sym,8),3,&
            &   cifatomiccoordinates(1:3,1:numberofatoms),3,1.0D0,&
            &   newcoordinatespool(:,:,i),3)
            !print *, newcoordinatespool(1:3,1,i)

            !newaniso=matmul(cachekronecker(:,:,i),cifatomicuaniso(1:9,2:2))
            !print *, newaniso(1:3,1)
            call DGEMM('N','N',9,numberofatoms,9,1.0D0,real(cachekronecker(:,:,i),8),9,&
            &   cifatomicuaniso(1:9,1:numberofatoms),9,0.0D0,newanisopool(:,:,i),9)
            !print *, newanisopool(:,2,i)
        end do
    end if
    !$OMP END CRITICAL 
end if

structfact=cmplx(0.0,0.0,8)
! temp variable to avoid unnecessary calculations
temph2cela2=h**2*rcela**2
tempk2celb2=k**2*rcelb**2
templ2celc2=l**2*rcelc**2
hkab=h*k*rcela*rcelb
lkbc=l*k*rcelb*rcelc
hlac=h*l*rcela*rcelc
resolution=dataresolution(dhkl(h,k,l))

Do j=1,numberofatoms
    atomindex=cifsiteatomicnumber(j)
!        f' and f''
    f1f2=f1f2table(1:2,atomindex:atomindex)
        
    tempr1=cifatomiccoordinates(6,j)*&
    &   exp(real(-4.0, 8)*twopi2*cifatomiccoordinates(5,j)*(resolution)**2)/&
    &   cifatomiccoordinates(4,j)
        
 
!       saving the atomic factor for later use
    oldaf=atomicfactor(atomindex, resolution, f1f2, f2)

    do i=1,listsymsize
        af = oldaf

!		#anisotropic scale factor
!		# 11 22 33 23 13 12
        anisoscale=exp(-twopi2*(newanisopool(1,j,i)*temph2cela2 +     &
&           newanisopool(5,j,i)*tempk2celb2 +     &
&           newanisopool(9,j,i)*templ2celc2 +     &
&           2.0d0*newanisopool(2,j,i)*hkab + &
&           2.0d0*newanisopool(6,j,i)*lkbc + &
&           2.0d0*newanisopool(3,j,i)*hlac))
  
        phi = twopi*(real(h,8)*newcoordinatespool(1,j,i)+  &
&           real(k,8)*newcoordinatespool(2,j,i)+          &
&           real(l,8)*newcoordinatespool(3,j,i))


        phi=phi+atan2(aimag(af),real(af,8))
        phi=mod(phi,real(twopi,8))
        
        z=abs(af)
        af=z*exp(cmplx(0.0,1.0,8)*phi)*tempr1*anisoscale
        structfact=structfact+af
    enddo
enddo
return
end function

[/codesyntax]

One thought on “Code optimisation

  1. I replaced exp by fastexp for a test (8000 reflections):
    normal exp: 200ms/loop, fast exp: 150ms/loop

    No multithreading used.

Leave a Reply