NineOnions and PurpleCarrots

|

The search for nonsense.

Posts Tagged ‘SHA3’

Skein, Threefish, and Mono.Simd — Part 4

Monday, October 5th, 2009

In my previous article, I took a look at some of the framework considerations and alignment in Mono.Simd. Getting back to my experiment, I’ve been getting respectable results from my SIMD implementation of Threefish256. Threefish256Simd was taking approximately 20% longer than the Fajardo reference implementation, but something had to be holding the SIMD version back.

Looking back to the machine code, I found that even aligned loads via the API from arrays were coming back pretty slowly with a number of extra moves and extraneous temporary variable. So, I eliminated the pre-calculated key schedule altogether in favor of calculating the key permutations on the fly as in Fajardo’s implementation. With some specially crafted local vectors for the key and tweak values, this changed the key permutation lines in my prior listed source to look something like this1:

bB = bB + k1 + t0h;
bA = bB + bB + k0 + t1l;

Later subkeys add in a subkey counter as noted in the Skein specification.

With these changes made, I re-ran my tests. This was the optimization I needed. The SIMD-enabled Threefish256 algorithm was now running in less than half the execution time of the reference implementation: 3.8 ns to 9.3 ns. I had been able to significantly beat my hypothesis.

Output of test/timing program2

mgriep@metis:~$ mono-2.4 mono -O=all,-shared Program.exe
Encrypt:
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
Good!
Cross Decrypt:
0000000000000000 0000000000000001 0000000000000002 0000000000000003
0000000000000000 0000000000000001 0000000000000002 0000000000000003
Good!
Speed test of 100000000 iterations with an unroll of 100
Encryption:
Non Simd: 00:15:49.8701366, Average: 00:00:00.0000094
    Simd: 00:06:39.0892858, Average: 00:00:00.0000039
Speedup with parallelism = 2: 2.38009430570386
Efficiency: 1.19004715285193
Decryption:
Non Simd: 00:15:40.1476703, Average: 00:00:00.0000094
    Simd: 00:06:29.0575232, Average: 00:00:00.0000038
Speedup with parallelism = 2: 2.41647472221403
Efficiency: 1.20823736110702
Overhead: 00:00:00.0026434, Average: 00:00:00

The speedup, approximately 2.4, equates to an efficiency of 120%. This is a super-linear speedup for degree of parallelism equal to two. I assumed parallelism = 2 because two long integer operations can take place simultaneously through the XMM registers. Just moving to the XMM registers incurs some additional overhead, so a super-linear speedup was unexpected.

The extra speedup came from a combination of architecture and the Threefish algorithm. Specifically, Threefish operates on 64-bit words. As my computer was 32-bit, long integer operations are broken up by the main processor into multiple micro-ops and therefore isn’t quite as efficient as on a 64-bit processor.

Thus, by moving the arithmetic out of the integer processor and onto the SSE processor, I was able to gain the benefit of the larger word width. Add to this the parallelism gained by performing two operations at a time, and subtract the overhead from moving data into and out of the XMM registers, and you get the exhibited performance.

While I haven’t tested on the 64-bit platform, I would expect the SIMD version to be faster, but not by as much on the 32-bit platform, since the larger word width would eat some of the efficiency gain from moving onto the SSE processor.

Overall, I was very happy with the experience developing with Mono.Simd. There are several things that I would love to see added or changed (aligned moves by default, alignable objects, the psllq %xmm1,%xmm0 instruction3 exposed). Nonetheless, the API is still evolving, and I expect I’ll find myself contributing a few patches to scratch some of the itches I’ve come across.

I hope that you’ve found this an interesting walkthrough of my first experience with Mono.Simd. If you want to inspect my code in more detail, check out my xpdm Github repository. You’ll find Threefish256Simd under the Xpdm.Security.Cryptography namespace4. I’d love to hear about your experiences with Mono.Simd, if any be had!

  1. Redundant LoadAligned and StoreAligned calls have been removed for readability []
  2. This output was created with the Skein v1.1 specification on an EeePC 900A with an x86 Intel Atom processor. []
  3. psllq %xmm1,%xmm0 is a packed quadword (long) logical shift left where the components of XMM0 are shifted left the number of bits specified by the respective component in XMM1. []
  4. The code has not yet been updated to use the new Skein v1.2 rotation constants []

Skein, Threefish, and Mono.Simd — Interlude

Saturday, October 3rd, 2009

Having added in aligned moves into as much of my Threefish implementation as I could in my last article, I thought that I’d take a moment to explain why this is a stable change that shouldn’t cause NullReferenceExceptions at random.

Currently, Mono.Simd is only supported in accelerated mode on the x86 architecture. Support for x64 is in progress (the subject of a 2009 Mono Google Summer of Code project) and other architectures are planned.

Within the Mono framework on x86 and x64, the JIT-compiler makes a guarantee that the stack will be aligned to 16 bytes in each frame. This is good because it meshes well with the requirements of SSE. Depending on the local variables present and temporary variables created by the compiler, the byte-alignment of a block of Vector.. variables can be ensured by adding appropriate padding.

This doesn’t solve for the general case, though. Class or struct members that are Vector..s may end up on the heap, where they may not be aligned, and thus unaligned moves would be necessary without some more defined means to specify alignment.

This lack of a guarantee is not a new problem, especially for people who needed to interact with native SSE code before Mono.Simd. One option is to create a struct with an explicit layout, which is a union with different 4-byte padding combinations, accepting a 12-byte overhead:

[StructLayout(LayoutKind.Explicit)]
public struct AtLeastOneIsAligned {
  [FieldOffset( 0)] public Vector2ul V0;
  [FieldOffset( 4)] public Vector2ul V1;
  [FieldOffset( 8)] public Vector2ul V2;
  [FieldOffset(12)] public Vector2ul V3;
}

You would then need to do some finagling to determine which V* was the aligned vector. It is plausible, but a lot of extra overhead for the convenience.

Of course, it would be much more efficient for the Mono framework to implicitly make the guarantee that local Vector.. variables will be aligned to 16-byte boundaries. Then, when working with local variables, the JIT-compiler could be empowered to emit the more expedient aligned moves. Additionally, local Vector..s can be more efficiently initialized with aligned moves. The framework could go further, adding a class attribute that would guarantee 2^x-byte alignment on the heap, getting much closer to the ideal of being able to eliminate unaligned moves, albeit at the expense of a greater constraint on the garbage collector.

In my final article, I’ll jump back to Threefish256. In that article, I will show one more optimization I was able to make at the C# level, discuss my results compared to the reference, and make a couple more recommendations.

Skein, Threefish, and Mono.Simd — Part 3

Friday, October 2nd, 2009

Now that I had tweaked my base Threefish256 source code as best I could, it was time to look for more inefficiencies in the generated machine code. Here’s where I left the excerpt at the end of the last article:

bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB << 16 | bTempB >> (64 - 16);
bTempA = bTempA << 14 | bTempA >> (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));

One line of interest above, line 3, expands to the following machine code:

   196c0:	66 0f ef c0          	pxor   %xmm0,%xmm0
   196c4:	0f 11 45 b0          	movups %xmm0,-0x50(%ebp)
   196c8:	c7 45 a8 00 00 00 00 	movl   $0x0,-0x58(%ebp)
   196cf:	c7 45 ac 00 00 00 00 	movl   $0x0,-0x54(%ebp)
   196d6:	c7 45 b4 00 00 00 00 	movl   $0x0,-0x4c(%ebp)
   196dd:	c7 45 b0 00 00 00 00 	movl   $0x0,-0x50(%ebp)
   196e4:	c7 45 bc 00 00 00 00 	movl   $0x0,-0x44(%ebp)
   196eb:	c7 45 b8 00 00 00 00 	movl   $0x0,-0x48(%ebp)
   196f2:	0f 10 45 b0          	movups -0x50(%ebp),%xmm0
   196f6:	0f 28 d8             	movaps %xmm0,%xmm3
   196f9:	66 0f 6c dc          	punpcklqdq %xmm4,%xmm3
   196fd:	0f 11 5d c0          	movups %xmm3,-0x40(%ebp)

The first 10 lines in the machine code above don’t seem to do too much. First, XMM0 is zeroed-out and moved to a variable on the stack. Then, the movl instructions zero-out the same variable and the same-sized space above and below it on the stack. Why? I don’t know. Line 9 then reads the variable off the stack (<0,0>) into XMM0. Line 10 then copies this zero to XMM3 and XMM0 is not used until it is clobbered in the next source code line. It isn’t until line 11 that the unpack operation executes.

A large part of me feels that Vector2ul.Zero could have been achieved with a pxor without the extra nine instructions.

Beyond that, however, there is a difference between lines 9 and 10. There is a difference between the two move operations: movups and movaps. Doing a Google search led me to the SIMD Programming Manual for Linux and Windows (shown in the sidebar) a preview of which included this nice tidbit on movups:

This performance overhead is sufficient that it often pays to use the MMX registers rather than the XMM registers if unaligned loads and stores must be used.

Moves with movaps have an alignment restriction; memory operands must be aligned to 16-byte boundaries or a “general protection” fault is raised. Moves with movups don’t have this restriction, but are much slower; so much so that, as the SIMD Manual states, it is often faster to forgo SSE altogether. This speed assertion explained well what I was seeing, and provided me with a new avenue for attack: eliminating as many unaligned moves as possible.

The Mono.Simd API offers the ability to explicitly specify which packed loads and stores should be emitted as aligned through a pair of static methods. Unfortunately, they totally destroy the readability of the code when implemented wholesale. Additionally, the StoreAligned method takes a ref parameter rather than an out parameter, thus requiring that variables be initialized before they can be assigned with an aligned move. The API current does all initializations using unaligned moves, so there is a minimum of one unaligned move per local/temporary vector in a method.

Taking the readability hit, I gave the “aligned move”-theory a try. Additionally, I initialized a local variable, bZero, with <0,0> so that I wouldn’t incur the unnecessary extra nine instructions and unaligned stores everytime I wanted a zero vector. After a whole lot of replacement, here is what the code looked like:

Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bB) + keySchedule[0].GetVectorAligned(2));
Vector2ul.StoreAligned(ref bA, Vector2ul.LoadAligned(ref bB) + Vector2ul.LoadAligned(ref bA) + keySchedule[0].GetVectorAligned(0));
Vector2ul.StoreAligned(ref bTempA, Vector2ul.LoadAligned(ref bZero).UnpackLow(Vector2ul.LoadAligned(ref bB)));
Vector2ul.StoreAligned(ref bTempB, Vector2ul.LoadAligned(ref bZero).UnpackHigh(Vector2ul.LoadAligned(ref bB)));
Vector2ul.StoreAligned(ref bTempB, Vector2ul.LoadAligned(ref bTempB) << 16 | Vector2ul.LoadAligned(ref bTempB) << (64 - 16));
Vector2ul.StoreAligned(ref bTempA, Vector2ul.LoadAligned(ref bTempA) >> 14 | Vector2ul.LoadAligned(ref bTempA) >> (64 - 14));
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bTempA).UnpackHigh(Vector2ul.LoadAligned(ref bTempB)));
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bB) ^ Vector2ul.LoadAligned(ref bA));
Vector2ul.StoreAligned(ref bB, (Vector2ul)(((Vector4ui) Vector2ul.LoadAligned(ref bB)).Shuffle(XYZWtoZWXY)));

Once it was all worked up and ready for test, I fired it up and was immediately hit with:

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object
  at Xpdm.Security.Cryptography.Threefish256Simd.Encrypt (System.UInt64[] input, System.UInt64[] output) [0x00000]
  at Xpdm.Security.Cryptography.Tests.Program.Main () [0x00000]

The meaning of this exception was initially unclear to me, as I was not dereferencing anything that should be null. Then I remembered that an aligned load of an unaligned memory location will generate the same general protection fault as a bad dereference. I figured that this likely had to do with the position of variables on the stack, so I played around adding a couple of dummy variables. I finally got it running using one long integer as a stack buffer.

The speedup I got from these optimizations was significant, but the SIMD-version was still about 20% slower than the Fajardo reference implementation. I had a feeling that I’d be able to eek out a few more performance increases, which I’ll discuss in my next article.

Skein, Threefish, and Mono.Simd — Part 2

Friday, October 2nd, 2009

Having established a baseline for the unrolled implementation of Threefish256 on the x86 integer processing unit in my last post, I was ready to re-target the algorithm to the SSE processing instructions via Mono.Simd. My first working implementation of Threefish256 was pretty similar to the non-SIMD version. The main differences were some revisions to the Mix, UnMix, and Rotate functions to inject the Mono SIMD intrinsics. I also added a function that pre-generated the full key schedule.

One key permutation and one round with Mono.Simd

bA = bA + keySchedule[0].GetVectorAligned(0);
bB = bB + keySchedule[0].GetVectorAligned(2);
Mix(ref bA, ref bB, 14, 16);
SwapComponents(ref bB);

In the above, I am able to use the XMM registers well by packing the in-cipher long integers (b0, b1, b2, and b3 from the previous post) into vectors. In the above source, bA is <b0,b2> and bB is <b1,b3>. The components of bB are swapped between rounds according to the permutation schedule while bA’s components do not need to be switched.

After a good deal of feeling my way around the Mono.Simd API, I struck paydirt and found myself with working, SIMD-enabled encrypt and decrypt functions. Paydirt wasn’t quite so sweet, however. Running my same speed test, I found that my SIMD version was running at half the speed of the non-SIMD version, about 18 ns to 9 ns, respectively.

How could the SIMD version be twice as slow? The answer lie in the generated machine code. Mono provides a great feature for this, Ahead-of-Time Compilation. The Mono AOT generates a shared object .so file with the platform-specific machine code. This can be used for some JIT-heavy executables to speed start-up time, since the bootstrapping has already been done. In my case, it allows me to disassemble the object file to inspect the generated machine code. My tool for the purpose was objdump.

Shell transcript using AOT and objdump to get machine code:

mgriep@metis:~$ mono-2.4 mono --aot -O=all,-shared Xpdm.Security.Cryptography.dll
Mono Ahead of Time compiler - compiling assembly /home/mgriep/Xpdm.Security.Cryptography.dll
reloc: got at 20
Code: 146706 Info: 359 Ex Info: 405 Class Info: 664 PLT: 29 GOT Info: 370 GOT Info Offsets: 192 GOT: 240
section .text aligned to 151824 from 151820
subsection 1 of .text added at offset 151824 (align: 0)
num_sections: 3
dynsym: 34, dynstr size: 432
section .data, size: 24, 18
section .bss, size: 240, f0
section .text, size: 155488, 25f60
Compiled 77 out of 77 methods (100%)
Methods without GOT slots: 56 (72%)
Direct calls: 2184 (99%)
JIT time: 147 ms, Generation time: 10 ms, Assembly+Link time: 4 ms.
GOT slot distribution:
	methodconst: 7
	switch: 3
	image: 1
	vtable: 12
	ldstr: 1
	delegate_trampoline: 7
mgriep@metis:~$ objdump -x Xpdm.Security.Cryptography.dll.so | grep -A1 Xpdm.Security.Cryptography.Threefish256Simd:Encrypt
000214b0 l     F .text	00003974 Xpdm.Security.Cryptography.Threefish256Simd:Decrypt (ulong[],ulong[])
0001db80 l     F .text	0000392b Xpdm.Security.Cryptography.Threefish256Simd:Encrypt (ulong[],ulong[])
mgriep@metis:~$ objdump -d Xpdm.Security.Cryptography.dll.so --start-address=0x0001db80 --stop-address=0x000214b0
...

The first thing that I noticed is that only a few XMM registers were getting used, usually XMM0 and XMM1. This was unexpected since I was operating on multiple vectors at any given time, and leaving the other 6 registers fallow seemed inefficient. The static utility functions, Mix, UnMix, and Rotate seemed to be getting in the way of using the registers more fully since they would require repositioning the stack pointer, re-loading inside the method, performing the operation, storing, and re-positioning back. In the spirit of removing the unnecessary cruft to get closer to the answer, I fully inlined the utility functions.

Additionally, in the assembly code instructions “between” the lines of source code, there would be a symmetric store and load to/from a stack location and the same XMM register. In an attempt to increase the temporal locality of the vectors, and hopefully reduce the store/loads, I did some reordering of arguments within the source code. As it stood after this finagling, here was the state of my source:

Key permutation and one round after inlining and reordering:

bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB << 16 | bTempB >> (64 - 16);
bTempA = bTempA << 14 | bTempA >> (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));

Neither of these changes had much effect on the overall speed, however. In my next post, I take a deeper look at the generated machine code to find my next optimization target.

Skein, Threefish, and Mono.Simd — Part 1

Thursday, October 1st, 2009

Mono.Simd is a developing framework introducing SIMD intrinsics into the .NET/Mono framework.  Ever since Miguel introduced the world to Mono.Simd, I have been very interested in getting my hands dirty with the new API.

SIMD, or Single Instruction Multiple Data, instructions are special instructions that can concurrently perform the same or related functions on multiple data sets, or vectors, e.g. adding four integers to four other integers, respectively, as opposed to summing each pair in sequence. SIMD instructions on x86 processors use special XMM registers. Examples of SIMD instruction sets include the SSE family and AMD’s 3DNow!. SIMD instructions are most effectively put to use processing large data sets that have similar sets of operations repeated across the data set. Such effective applications include graphical rendering and some physics simulations.

Unfortunately for myself, I am mainly a front-end applications developer by trade. Thus, the opportunity to work with low-level intrinsic functions hadn’t come around in the nine months since their announcement, so I deliberately set out to find a little side project wherein I could test out the new instructions.

There is very little out there on the Internet regarding the Mono.Simd API except for Miguel’s blog post, the Mono source code, and the Monodoc. With that in mind, I found a promising candidate for implementing with Mono.Simd in one of the SHA-3 hash algorithm candidates, specifically Skein. The project’s homepage noted that there were already two implementations of the algorithm in .NET, which gave me a good basis to start from.

I chose the in-progress implementation of Skein by Alberto Fajardo as the basis for my work since it looked to be pretty clean and meshed alright with the BCL standards for System.Security.Cryptography with a few exceptions. Fajardo’s implementation is set up so that the Threefish cipher runs are completely unrolled, meaning no loop overhead in a function already a likely candidate for a tight loop. I chose Threefish256 to implement using Mono.Simd due to its relative simplicity. My preliminary tests1 with Threefish256 showed an average speed of 9.2 ns per encrypt operation.

An example of the first subkey permutation and four rounds of Skein 1.2:

Mix(ref b0, ref b1, 14, k0, k1 + t0);
Mix(ref b2, ref b3, 16, k2 + t1, k3);
Mix(ref b0, ref b3, 52);
Mix(ref b2, ref b1, 57);
Mix(ref b0, ref b1, 23);
Mix(ref b2, ref b3, 40);
Mix(ref b0, ref b3,  5);
Mix(ref b2, ref b1, 37);

Based on this, my working hypothesis is that I will be able to get the SIMD version of the function to run in about 75-55% of the time of the non-SIMD version, or 7–5 ns. This series will continue in my next post, when I’ll discuss some of the progress I’ve made and issues I’ve encountered.

  1. Tests were performed on an EeePC 900A, with an Intel Atom x86 32-bit processor and the Mono 2.4 framework. Tests were run for 100,000,000 iterations unrolled in sets of 100 and averaged to get the stated results. []