As part of my new position in Oxford, I have a new computer: an 8-core Intel Xeon E5-2665 with 16 GB of memory. That is enough cores to test how well my library scales in a multi-threaded environment.
The result is really good. With 8 threads or fewer, the scaling is not as perfect as it theoretically should be (a speed-up of 6.4 on 8 threads), but hyper-threading improves the speed a little more, to a final result of 96% of the theoretical speed-up (7.7 out of 8, with 16 threads). Between 8 and 16 threads, however, the results are a bit unstable, probably because threads move between cores. The good scaling is clearly a consequence of the fact that no synchronization is needed between the threads. The OpenMP overhead is also minimal, because OpenMP is used at a very high level in the code.
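To illustrate what "no synchronization" and "OpenMP at a very high level" can look like, here is a minimal C++ sketch. It is not the library's actual code: the `Atom` and `Reflection` types, the `structure_factors` function and the simplified formula are assumptions made for the example. The point is only that a single `parallel for` sits on the outermost loop over reflections and that each iteration writes to its own output element.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Hypothetical types for the sketch; the real library will differ.
struct Atom { double x, y, z, f; };   // fractional coordinates, constant scattering factor
struct Reflection { int h, k, l; };   // Miller indices

// Simplified stand-in kernel: for each reflection, sum a contribution from every atom.
std::vector<std::complex<double>> structure_factors(
    const std::vector<Reflection>& refl, const std::vector<Atom>& atoms)
{
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<std::complex<double>> out(refl.size());
    const long n = static_cast<long>(refl.size());

    // One parallel region around the outermost loop. Iteration i only writes
    // out[i], so no lock, atomic or critical section is needed.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        std::complex<double> sum(0.0, 0.0);   // private to the iteration
        for (const Atom& a : atoms) {
            const double phase =
                two_pi * (refl[i].h * a.x + refl[i].k * a.y + refl[i].l * a.z);
            sum += a.f * std::complex<double>(std::cos(phase), std::sin(phase));
        }
        out[i] = sum;
    }
    return out;
}
```

Compiled with OpenMP enabled (for example `-fopenmp` with GCC), the number of threads can be set with the `OMP_NUM_THREADS` environment variable. Without OpenMP the pragma is ignored and the loop simply runs serially.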
Threads | Wall clock time (ms) | Throughput (reflections·atoms·s^-1) | Speedup |
---|---|---|---|
1 | 2300 | 1.4 × 10^8 | 1 |
2 | 1300 | 2.5 × 10^8 | 1.8 |
3 | 849 | 3.8 × 10^8 | 2.7 |
4 | 635 | 5.0 × 10^8 | 3.6 |
5 | 536 | 6.0 × 10^8 | 4.3 |
6 | 457 | 7.0 × 10^8 | 5.0 |
7 | 406 | 7.9 × 10^8 | 5.7 |
8 | 359 | 9.0 × 10^8 | 6.4 |
9 | 374 | 8.5 × 10^8 | 6.1 |
10 | 336 | 9.6 × 10^8 | 6.8 |
11 | 370 | 8.7 × 10^8 | 6.2 |
16 | 305 | 1.0 × 10^9 | 7.7 |
~150k reflections, 1000 atoms
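For reference, the speedup column and the efficiency quoted in the text can be recomputed from the wall clock times with a small helper. This is a sketch with names of my own choosing (`Scaling`, `scaling_from_times`). It assumes the first entry is the single-thread baseline and that the theoretical maximum is capped at the number of physical cores, since hyper-threads do not add real cores.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Scaling {
    int threads;
    double speedup;     // baseline time / time with this many threads
    double efficiency;  // speedup / min(threads, physical cores)
};

// Assumes wall_ms[0] is the 1-thread baseline (hypothetical convention).
std::vector<Scaling> scaling_from_times(const std::vector<int>& threads,
                                        const std::vector<double>& wall_ms,
                                        int physical_cores)
{
    std::vector<Scaling> result;
    const std::size_t n = std::min(threads.size(), wall_ms.size());
    if (n == 0) return result;

    const double baseline = wall_ms[0];
    for (std::size_t i = 0; i < n; ++i) {
        const double speedup = baseline / wall_ms[i];
        // Ideal speedup cannot exceed the number of physical cores.
        const int ideal = std::min(threads[i], physical_cores);
        result.push_back({threads[i], speedup, speedup / ideal});
    }
    return result;
}
```

Calling `scaling_from_times({1, 8, 16}, {2300.0, 359.0, 305.0}, 8)` gives the speedup and efficiency for the 1, 8 and 16 thread rows. The times in the table are rounded, so the recomputed values can differ slightly from the published speedup column.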