Skein, Threefish, and Mono.Simd — Part 4
Monday, October 5th, 2009In my previous article, I took a look at some of the framework considerations and alignment in Mono.Simd. Getting back to my experiment, I’ve been getting respectable results from my SIMD implementation of Threefish256. Threefish256Simd was taking approximately 20% longer than the Fajardo reference implementation, but something had to be holding the SIMD version back.
Looking back to the machine code, I found that even aligned loads via the API from arrays were coming back pretty slowly with a number of extra moves and extraneous temporary variable. So, I eliminated the pre-calculated key schedule altogether in favor of calculating the key permutations on the fly as in Fajardo’s implementation. With some specially crafted local vectors for the key and tweak values, this changed the key permutation lines in my prior listed source to look something like this1:
bB = bB + k1 + t0h; bA = bB + bB + k0 + t1l;
Later subkeys add in a subkey counter as noted in the Skein specification.
With these changes made, I re-ran my tests. This was the optimization I needed. The SIMD-enabled Threefish256 algorithm was now running in less than half the execution time of the reference implementation: 3.8 ns to 9.3 ns. I had been able to significantly beat my hypothesis.
Output of test/timing program2
mgriep@metis:~$ mono-2.4 mono -O=all,-shared Program.exe
Encrypt:
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
Good!
Cross Decrypt:
0000000000000000 0000000000000001 0000000000000002 0000000000000003
0000000000000000 0000000000000001 0000000000000002 0000000000000003
Good!
Speed test of 100000000 iterations with an unroll of 100
Encryption:
Non Simd: 00:15:49.8701366, Average: 00:00:00.0000094
Simd: 00:06:39.0892858, Average: 00:00:00.0000039
Speedup with parallelism = 2: 2.38009430570386
Efficiency: 1.19004715285193
Decryption:
Non Simd: 00:15:40.1476703, Average: 00:00:00.0000094
Simd: 00:06:29.0575232, Average: 00:00:00.0000038
Speedup with parallelism = 2: 2.41647472221403
Efficiency: 1.20823736110702
Overhead: 00:00:00.0026434, Average: 00:00:00
The speedup, approximately 2.4, equates to an efficiency of 120%. This is a super-linear speedup for degree of parallelism equal to two. I assumed parallelism = 2 because two long integer operations can take place simultaneously through the XMM registers. Just moving to the XMM registers incurs some additional overhead, so a super-linear speedup was unexpected.
The extra speedup came from a combination of architecture and the Threefish algorithm. Specifically, Threefish operates on 64-bit words. As my computer was 32-bit, long integer operations are broken up by the main processor into multiple micro-ops and therefore isn’t quite as efficient as on a 64-bit processor.
Thus, by moving the arithmetic out of the integer processor and onto the SSE processor, I was able to gain the benefit of the larger word width. Add to this the parallelism gained by performing two operations at a time, and subtract the overhead from moving data into and out of the XMM registers, and you get the exhibited performance.
While I haven’t tested on the 64-bit platform, I would expect the SIMD version to be faster, but not by as much on the 32-bit platform, since the larger word width would eat some of the efficiency gain from moving onto the SSE processor.
Overall, I was very happy with the experience developing with Mono.Simd. There are several things that I would love to see added or changed (aligned moves by default, alignable objects, the psllq %xmm1,%xmm0 instruction3 exposed). Nonetheless, the API is still evolving, and I expect I’ll find myself contributing a few patches to scratch some of the itches I’ve come across.
I hope that you’ve found this an interesting walkthrough of my first experience with Mono.Simd. If you want to inspect my code in more detail, check out my xpdm Github repository. You’ll find Threefish256Simd under the Xpdm.Security.Cryptography namespace4. I’d love to hear about your experiences with Mono.Simd, if any be had!
- Redundant
LoadAlignedandStoreAlignedcalls have been removed for readability [↩] - This output was created with the Skein v1.1 specification on an EeePC 900A with an x86 Intel Atom processor. [↩]
psllq %xmm1,%xmm0is a packed quadword (long) logical shift left where the components ofXMM0are shifted left the number of bits specified by the respective component inXMM1. [↩]- The code has not yet been updated to use the new Skein v1.2 rotation constants [↩]
-byte alignment on the heap, getting much closer to the ideal of being able to eliminate unaligned moves, albeit at the expense of a greater constraint on the garbage collector.






