Skein, Threefish, and Mono.Simd — Part 4October 5th, 2009 at 10:00
In my previous article, I took a look at some of the framework considerations and alignment in Mono.Simd. Getting back to my experiment, I’ve been getting respectable results from my SIMD implementation of Threefish256. Threefish256Simd was taking approximately 20% longer than the Fajardo reference implementation, but something had to be holding the SIMD version back.
Looking back to the machine code, I found that even aligned loads via the API from arrays were coming back pretty slowly with a number of extra moves and extraneous temporary variable. So, I eliminated the pre-calculated key schedule altogether in favor of calculating the key permutations on the fly as in Fajardo’s implementation. With some specially crafted local vectors for the key and tweak values, this changed the key permutation lines in my prior listed source to look something like this1:
bB = bB + k1 + t0h; bA = bB + bB + k0 + t1l;
Later subkeys add in a subkey counter as noted in the Skein specification.
With these changes made, I re-ran my tests. This was the optimization I needed. The SIMD-enabled Threefish256 algorithm was now running in less than half the execution time of the reference implementation: 3.8 ns to 9.3 ns. I had been able to significantly beat my hypothesis.
Output of test/timing program2
mgriep@metis:~$ mono-2.4 mono -O=all,-shared Program.exe Encrypt: 61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC 61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC Good! Cross Decrypt: 0000000000000000 0000000000000001 0000000000000002 0000000000000003 0000000000000000 0000000000000001 0000000000000002 0000000000000003 Good! Speed test of 100000000 iterations with an unroll of 100 Encryption: Non Simd: 00:15:49.8701366, Average: 00:00:00.0000094 Simd: 00:06:39.0892858, Average: 00:00:00.0000039 Speedup with parallelism = 2: 2.38009430570386 Efficiency: 1.19004715285193 Decryption: Non Simd: 00:15:40.1476703, Average: 00:00:00.0000094 Simd: 00:06:29.0575232, Average: 00:00:00.0000038 Speedup with parallelism = 2: 2.41647472221403 Efficiency: 1.20823736110702 Overhead: 00:00:00.0026434, Average: 00:00:00
The speedup, approximately 2.4, equates to an efficiency of 120%. This is a super-linear speedup for degree of parallelism equal to two. I assumed parallelism = 2 because two long integer operations can take place simultaneously through the
XMM registers. Just moving to the
XMM registers incurs some additional overhead, so a super-linear speedup was unexpected.
The extra speedup came from a combination of architecture and the Threefish algorithm. Specifically, Threefish operates on 64-bit words. As my computer was 32-bit, long integer operations are broken up by the main processor into multiple micro-ops and therefore isn’t quite as efficient as on a 64-bit processor.
Thus, by moving the arithmetic out of the integer processor and onto the SSE processor, I was able to gain the benefit of the larger word width. Add to this the parallelism gained by performing two operations at a time, and subtract the overhead from moving data into and out of the
XMM registers, and you get the exhibited performance.
While I haven’t tested on the 64-bit platform, I would expect the SIMD version to be faster, but not by as much on the 32-bit platform, since the larger word width would eat some of the efficiency gain from moving onto the SSE processor.
Overall, I was very happy with the experience developing with Mono.Simd. There are several things that I would love to see added or changed (aligned moves by default, alignable objects, the
psllq %xmm1,%xmm0 instruction3 exposed). Nonetheless, the API is still evolving, and I expect I’ll find myself contributing a few patches to scratch some of the itches I’ve come across.
I hope that you’ve found this an interesting walkthrough of my first experience with Mono.Simd. If you want to inspect my code in more detail, check out my xpdm Github repository. You’ll find
Threefish256Simd under the
Xpdm.Security.Cryptography namespace4. I’d love to hear about your experiences with Mono.Simd, if any be had!
StoreAlignedcalls have been removed for readability [↩]
- This output was created with the Skein v1.1 specification on an EeePC 900A with an x86 Intel Atom processor. [↩]
psllq %xmm1,%xmm0is a packed quadword (long) logical shift left where the components of
XMM0are shifted left the number of bits specified by the respective component in
- The code has not yet been updated to use the new Skein v1.2 rotation constants [↩]