Skein, Threefish, and Mono.Simd — Part 2
October 2nd, 2009 at 10:30Having established a baseline for the unrolled implementation of Threefish256 on the x86 integer processing unit in my last post, I was ready to re-target the algorithm to the SSE processing instructions via Mono.Simd. My first working implementation of Threefish256 was pretty similar to the non-SIMD version. The main differences were some revisions to the Mix, UnMix, and Rotate functions to inject the Mono SIMD intrinsics. I also added a function that pre-generated the full key schedule.
One key permutation and one round with Mono.Simd
bA = bA + keySchedule[0].GetVectorAligned(0); bB = bB + keySchedule[0].GetVectorAligned(2); Mix(ref bA, ref bB, 14, 16); SwapComponents(ref bB);
In the above, I am able to use the XMM registers well by packing the in-cipher long integers (b0, b1, b2, and b3 from the previous post) into vectors. In the above source, bA is <b0,b2> and bB is <b1,b3>. The components of bB are swapped between rounds according to the permutation schedule while bA‘s components do not need to be switched.
After a good deal of feeling my way around the Mono.Simd API, I struck paydirt and found myself with working, SIMD-enabled encrypt and decrypt functions. Paydirt wasn’t quite so sweet, however. Running my same speed test, I found that my SIMD version was running at half the speed of the non-SIMD version, about 18 ns to 9 ns, respectively.
How could the SIMD version be twice as slow? The answer lie in the generated machine code. Mono provides a great feature for this, Ahead-of-Time Compilation. The Mono AOT generates a shared object .so file with the platform-specific machine code. This can be used for some JIT-heavy executables to speed start-up time, since the bootstrapping has already been done. In my case, it allows me to disassemble the object file to inspect the generated machine code. My tool for the purpose was objdump.
Shell transcript using AOT and objdump to get machine code:
mgriep@metis:~$ mono-2.4 mono --aot -O=all,-shared Xpdm.Security.Cryptography.dll Mono Ahead of Time compiler - compiling assembly /home/mgriep/Xpdm.Security.Cryptography.dll reloc: got at 20 Code: 146706 Info: 359 Ex Info: 405 Class Info: 664 PLT: 29 GOT Info: 370 GOT Info Offsets: 192 GOT: 240 section .text aligned to 151824 from 151820 subsection 1 of .text added at offset 151824 (align: 0) num_sections: 3 dynsym: 34, dynstr size: 432 section .data, size: 24, 18 section .bss, size: 240, f0 section .text, size: 155488, 25f60 Compiled 77 out of 77 methods (100%) Methods without GOT slots: 56 (72%) Direct calls: 2184 (99%) JIT time: 147 ms, Generation time: 10 ms, Assembly+Link time: 4 ms. GOT slot distribution: methodconst: 7 switch: 3 image: 1 vtable: 12 ldstr: 1 delegate_trampoline: 7 mgriep@metis:~$ objdump -x Xpdm.Security.Cryptography.dll.so | grep -A1 Xpdm.Security.Cryptography.Threefish256Simd:Encrypt 000214b0 l F .text 00003974 Xpdm.Security.Cryptography.Threefish256Simd:Decrypt (ulong[],ulong[]) 0001db80 l F .text 0000392b Xpdm.Security.Cryptography.Threefish256Simd:Encrypt (ulong[],ulong[]) mgriep@metis:~$ objdump -d Xpdm.Security.Cryptography.dll.so --start-address=0x0001db80 --stop-address=0x000214b0 ...
The first thing that I noticed is that only a few XMM registers were getting used, usually XMM0 and XMM1. This was unexpected since I was operating on multiple vectors at any given time, and leaving the other 6 registers fallow seemed inefficient. The static utility functions, Mix, UnMix, and Rotate seemed to be getting in the way of using the registers more fully since they would require repositioning the stack pointer, re-loading inside the method, performing the operation, storing, and re-positioning back. In the spirit of removing the unnecessary cruft to get closer to the answer, I fully inlined the utility functions.
Additionally, in the assembly code instructions “between” the lines of source code, there would be a symmetric store and load to/from a stack location and the same XMM register. In an attempt to increase the temporal locality of the vectors, and hopefully reduce the store/loads, I did some reordering of arguments within the source code. As it stood after this finagling, here was the state of my source:
Key permutation and one round after inlining and reordering:
bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB << 16 | bTempB >> (64 - 16);
bTempA = bTempA << 14 | bTempA >> (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));
Neither of these changes had much effect on the overall speed, however. In my next post, I take a deeper look at the generated machine code to find my next optimization target.







