Skein, Threefish, and Mono.Simd — Part 3
October 2nd, 2009 at 14:00Now that I had tweaked my base Threefish256 source code as best I could, it was time to look for more inefficiencies in the generated machine code. Here’s where I left the excerpt at the end of the last article:
bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB << 16 | bTempB >> (64 - 16);
bTempA = bTempA << 14 | bTempA >> (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));
One line of interest above, line 3, expands to the following machine code:
196c0: 66 0f ef c0 pxor %xmm0,%xmm0 196c4: 0f 11 45 b0 movups %xmm0,-0x50(%ebp) 196c8: c7 45 a8 00 00 00 00 movl $0x0,-0x58(%ebp) 196cf: c7 45 ac 00 00 00 00 movl $0x0,-0x54(%ebp) 196d6: c7 45 b4 00 00 00 00 movl $0x0,-0x4c(%ebp) 196dd: c7 45 b0 00 00 00 00 movl $0x0,-0x50(%ebp) 196e4: c7 45 bc 00 00 00 00 movl $0x0,-0x44(%ebp) 196eb: c7 45 b8 00 00 00 00 movl $0x0,-0x48(%ebp) 196f2: 0f 10 45 b0 movups -0x50(%ebp),%xmm0 196f6: 0f 28 d8 movaps %xmm0,%xmm3 196f9: 66 0f 6c dc punpcklqdq %xmm4,%xmm3 196fd: 0f 11 5d c0 movups %xmm3,-0x40(%ebp)
The first 10 lines in the machine code above don’t seem to do too much. First, XMM0 is zeroed-out and moved to a variable on the stack. Then, the movl instructions zero-out the same variable and the same-sized space above and below it on the stack. Why? I don’t know. Line 9 then reads the variable off the stack (<0,0>) into XMM0. Line 10 then copies this zero to XMM3 and XMM0 is not used until it is clobbered in the next source code line. It isn’t until line 11 that the unpack operation executes.
A large part of me feels that Vector2ul.Zero could have been achieved with a pxor without the extra nine instructions.
Beyond that, however, there is a difference between lines 9 and 10. There is a difference between the two move operations: movups and movaps. Doing a Google search led me to the SIMD Programming Manual for Linux and Windows (shown in the sidebar) a preview of which included this nice tidbit on movups:
This performance overhead is sufficient that it often pays to use the MMX registers rather than the XMM registers if unaligned loads and stores must be used.
Moves with movaps have an alignment restriction; memory operands must be aligned to 16-byte boundaries or a “general protection” fault is raised. Moves with movups don’t have this restriction, but are much slower; so much so that, as the SIMD Manual states, it is often faster to forgo SSE altogether. This speed assertion explained well what I was seeing, and provided me with a new avenue for attack: eliminating as many unaligned moves as possible.
The Mono.Simd API offers the ability to explicitly specify which packed loads and stores should be emitted as aligned through a pair of static methods. Unfortunately, they totally destroy the readability of the code when implemented wholesale. Additionally, the StoreAligned method takes a ref parameter rather than an out parameter, thus requiring that variables be initialized before they can be assigned with an aligned move. The API current does all initializations using unaligned moves, so there is a minimum of one unaligned move per local/temporary vector in a method.
Taking the readability hit, I gave the “aligned move”-theory a try. Additionally, I initialized a local variable, bZero, with <0,0> so that I wouldn’t incur the unnecessary extra nine instructions and unaligned stores everytime I wanted a zero vector. After a whole lot of replacement, here is what the code looked like:
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bB) + keySchedule[0].GetVectorAligned(2));
Vector2ul.StoreAligned(ref bA, Vector2ul.LoadAligned(ref bB) + Vector2ul.LoadAligned(ref bA) + keySchedule[0].GetVectorAligned(0));
Vector2ul.StoreAligned(ref bTempA, Vector2ul.LoadAligned(ref bZero).UnpackLow(Vector2ul.LoadAligned(ref bB)));
Vector2ul.StoreAligned(ref bTempB, Vector2ul.LoadAligned(ref bZero).UnpackHigh(Vector2ul.LoadAligned(ref bB)));
Vector2ul.StoreAligned(ref bTempB, Vector2ul.LoadAligned(ref bTempB) << 16 | Vector2ul.LoadAligned(ref bTempB) << (64 - 16));
Vector2ul.StoreAligned(ref bTempA, Vector2ul.LoadAligned(ref bTempA) >> 14 | Vector2ul.LoadAligned(ref bTempA) >> (64 - 14));
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bTempA).UnpackHigh(Vector2ul.LoadAligned(ref bTempB)));
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bB) ^ Vector2ul.LoadAligned(ref bA));
Vector2ul.StoreAligned(ref bB, (Vector2ul)(((Vector4ui) Vector2ul.LoadAligned(ref bB)).Shuffle(XYZWtoZWXY)));
Once it was all worked up and ready for test, I fired it up and was immediately hit with:
Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object at Xpdm.Security.Cryptography.Threefish256Simd.Encrypt (System.UInt64[] input, System.UInt64[] output) [0x00000] at Xpdm.Security.Cryptography.Tests.Program.Main () [0x00000]
The meaning of this exception was initially unclear to me, as I was not dereferencing anything that should be null. Then I remembered that an aligned load of an unaligned memory location will generate the same general protection fault as a bad dereference. I figured that this likely had to do with the position of variables on the stack, so I played around adding a couple of dummy variables. I finally got it running using one long integer as a stack buffer.
The speedup I got from these optimizations was significant, but the SIMD-version was still about 20% slower than the Fajardo reference implementation. I had a feeling that I’d be able to eek out a few more performance increases, which I’ll discuss in my next article.
Tags: book:isbn=185233794X, Mono, SHA3, SIMD









October 2nd, 2009 at 15.11
this is great stuff! i’ll be keeping an eye out for updates. keep up the good work.
October 2nd, 2009 at 22.12
Thanks. Expect more by the end of the weekend!
November 3rd, 2009 at 17.19
The issue with aligned variables is due to the following:
-Mono uses boehm gc, which doesn’t support allocating 16 bytes aligned memory unless we do it for all objects, which would waste quite some memory. It will be possible to lift this
restriction once mono moves to it’s new moving gc.
-Stack variables aren’t aligned simply because no one finished support for them. It would be great if you could help us with it.
Besides that, good point that Aligned Store operations should take [out] params instead of [ref].
Vector2ul.Zero is not really optimized at all. You can just use new Vector2ul () instead, which would do exactly what you want. It should be quite trivial to do it thou.
The NullPointerException is not the best thing in the world, it would be nice to change the runtime to raise something more appropriate. But, at the same time, it doesn’t cause your program to crash, so it’s not that bad after all.
I’ll try to get some of your suggestions on the 2.8 release of mono.
November 3rd, 2009 at 21.12
Thanks for the reply, Rodrigo!
- I’m excited to see the compacting GC come into existence when it is finally ready.
- That’s probably why the stack alignment was spotty. I was able to get 8-byte alignment consistently, and the code in the arch-specific code looked ready for 16-byte alignment, but obviously is lacking a little. If I’m able to get some free-coding time I’ll take a look into it.
As for the NullPointerException, there may be a way to intercept the fault, and if it is a SIMD instruction with an address, and that address is not aligned, emit a NotAlignedException (new exception type) instead.
I look forward to seeing the development of Mono 2.8 as it progresses. Thanks again for looking in on this.