Sleef is an optimised library for computing mathematical functions in double precision: trigonometric functions (sin, cos, tan, sincos), inverse trigonometric functions (asin, acos, atan, atan2), exponential and logarithmic functions (exp, log, pow, exp2, exp10, expm1, log10, log1p), hyperbolic and inverse hyperbolic functions (sinh, cosh, tanh, asinh, acosh, atanh), and a few others (cbrt, ilogb, ldexp). Some of them are also available in single precision. I use this library to calculate structure factors. This post shows the advantage of choosing your libraries carefully when doing high-performance calculations. Like FFTW for the Fourier transform, I think Sleef is one of the fastest open source libraries in its field.
This post also prepares the next one, where I'll go deeper into optimisation very close to the CPU level.
I ran a small benchmark on an Asus Zenbook Prime notebook equipped with an Intel Core i5-3317U. The reason I am using my notebook is that its CPU supports the AVX instruction set (256-bit registers). The benchmark, written in C, calculates 10^8 cosine values in single precision and reports the execution time. It runs under Windows 7 64-bit.
The source code was compiled with mingw32, with the -O3 flag turned on.
| Version | Function (source file) | Time (ms) |
|---|---|---|
| Pure C (-ffast-math) | cos | 11000 |
| Sleef, scalar | xcosf (sleef.c) | 6000 |
| Sleef, scalar (-ffast-math) | xcosf (sleef.c) | 4000 |
| Sleef, SSE2 | xcosf (sleefsseacvx.c) | 800 |
| Sleef, AVX | xcosf (sleefsseacvx.c) | 450 |
The pure C version is by far the slowest; even the scalar C version of Sleef is much faster. There are, however, some precautions to take before using such high-performance libraries: their domain of application might be reduced, or their accuracy is usually worse. The -ffast-math flag allows GCC to violate some ANSI or IEEE rules and/or specifications in the interest of optimising code for speed.
The SSE version of Sleef is more than 4 times faster than the pure C version. The theoretical speed-up is 4, since SSE registers are 128 bits wide and can hold 4 floats at a time. This better-than-expected result could be due to conditional branches in the pure C version that are absent from the SSE version.
The AVX version is a bit less than two times faster than the SSE one. As AVX registers (256 bits) are twice as wide as SSE registers, this is exactly what one would expect.
These very fast functions and optimisations can be very useful, but they may apply to only a small part of the code, which mitigates the speed-up of the overall application. A function that is 100 times faster but accounts for only 1% of the application's execution time is not going to improve anything. In the case of structure factor calculations, trigonometric and exponential functions represent more than 50% of the execution time, which is why they are good candidates. Moreover, many structure factors are usually calculated, and since they are independent of one another, the problem parallelises well both at a high level (OpenMP) and at a low level (SSE, AVX).
To conclude, using optimised functions where necessary can pay off greatly, and vectorised functions can give you a performance boost essentially for free.