NineOnions and PurpleCarrots

|

The search for nonsense.

Archive for the ‘Programming’ Category

StackOverflow DevDays: Boston — In Review

Thursday, October 8th, 2009

StackOverflow DevDays day in Boston was very fun, well planned, and well executed, with a nice array of speakers from across a wide swath of the development world.

The day started off with a keynote speech by Joel Spolsky. His talk was very amusing, and included a theory that all work boils down to a use case wherein the goal is to get the end-user laid, or is in furtherance of that end goal. He further went on to discuss the two objectives of simplicity and power, and that whereas they may seem diametrically opposed, there is money to be made in reconciling simplicity (Don’t make me decide on things I don’t care about) and power (I want to be able to do this and that and some from column C).

From there, the main scheduled presenters took over. Ned Batchelder1 gave a good presentation on the Python programming language. Although I have been used to programming in Python for quite some time, I very much enjoyed the talk which used a Python spell checker as a demonstration of the features of the language. The talk moved at a good pace that was still interesting to me as someone who already knew the material, but also didn’t seem to leave anybody out in the code.

After Ned was Dan Pilone2. Dan’s presentation focused on the iPhone development experience, and moved methodically through the most pertinent points, from XCode to Instruments and the Interface Builder, to the application submission process (denoted by slide showing a despairing man in a desert crying out as sand slips from his grasp), to the sales and marketing mentality necessary in Apple’s App Store. Humor tossed in at a few key places made this an enjoyable and solid overview, which is what it was. If you were looking for an in-depth look, then the one-hour format would leave you disappointed, but from outside iPhone development looking in, the presentation fit the time and audience well.

Just before lunch, Joel Spolsky3 retook the stage using 30 minutes to demo the latest version of FogBugz, FogBugz 7, the flagship product of his company, FogCreek. FogBugz includes a new feature in this version, which I’ll briefly mention called Kiln. Kiln is a new integrated and hosted source control4 and code review management service going into beta as of the announcement. If it piques your interest, you can check it out on FogCreek’s website.

After a lunch5 Patrick Hynds and Chris Bowen6 presented on ASP.NET MVC. Their talk was good, though there were points where it sounded like cheerleading rather than a technical talk. To be fair, Chris Bowen is a developer evangelist for Microsoft, so it’s not unexpected, and it could have just been the “view from my seat”. The subject matter was good; I walked away convinced that Microsoft had done quite well with their MVC implementation.

John Resig7 was the next presenter, and the creator of the jQuery JavaScript library. His talk focused around the need for testing in JavaScript. He focused on several different frameworks for testing, as well as different methodologies from in-browser unit testing to mocking out the entire browser with an out-of-browser solution like Env.JS, to behavioral testing and functional testing.

John did a good job impressing the difficulty of testing JavaScript in a manner that spans the majority of all browsers and indicated that, in his opinion, there’s nothing like real-world testing, with a real human. He pointed to Test Swarm as a distributed example of how to get in-browser testing across a wide array of user/browser configurations. Overall, a good spanning overview that didn’t get mired in the gritty details of each framework, but just enough to understand where each had its use.

The last presenter of the day was Miguel de Icaza8 presented on Mono. I spoke with him briefly at lunch, and even then, he was still vacillating as to the subject matter for his talk. The actual talk was mildly haphazard as a result, but the flow worked well with Miguel’s free personality. He took the audience through Mono Tools for Visual Studio, showing remote debugging from a Windows VS session to a program running on a remote SUSE Linux VM. From there, he went a step further, building an RPM of the ASP.NET MVC sample blog engine. Next, he uploaded that RPM to SUSE Studio and generated a VMWare image in ~10 minutes, fully configured, and was running the blog engine in Mono’s ASP.NET MVC.

Miguel also showcased MonoTouch, building a simple program in MonoDevelop on Mac OSX, and demonstrating it in the iPhone simulator. Including lots of pro-Linux banter and some pokes at Richard Stallman, Miguel kept the audience interested and amused, which is exactly what the last presentation in an 8-hour day needs.

On the whole, the StackOverflow DevDays conference was well done. I enjoyed myself even in those areas outside my expertise, and came away having learned something. Be prepared for a few advertisements for FogCreek during the breaks (well put together and enjoyable, but advertisements no less), but as the co-founder of StackOverflow is also a founder of FogCreek, it’s not un-expected. Was it worth my $99? I believe so. Do I recommend you go? If you have the time, and there’s a location near you, then yes; check out the list of scheduled speakers for your event and see if there are any you’d be interested in.

As I mentioned in my preview, I took many pictures of the event, but have not had a chance to process them yet. I hope to do so tonight and have something to post by this evening.

UPDATE: My first batch of pictures is now up. Check them out on my Flickr account and read my photographs article.

  1. Ned’s DevDays blog post, including slides. On Twitter: @NedBat []
  2. On twitter: @danpilone []
  3. On Twitter: @Spolsky []
  4. The only hosted source control option is Mercurial; I didn’t get a chance to ask Joel why only Mercurial, though he assured us there were several reasons. []
  5. Provided lunches included turkey or beef sandwiches or a veggie wrap along with a half-ounce bag of chips, apple, and cookie []
  6. On Twitter: @ChrisBowen []
  7. On Twitter: @jresig []
  8. On Twitter: @migueldeicaza []

Looking Forward — SO DevDays: Boston

Tuesday, October 6th, 2009

Tomorrow, October 7, I’ll be headed to the StackOverflow DevDays in Boston. I’m especially excited to see Joel Spolsky and Miguel de Icaza.

I’ve been a fan of Joel’s insights, even if I don’t always agree. In my senior year of college, I received a copy of his book, The Best Software Writing I as an interview gift. In that book, Joel pulls together some great articles written by a very diverse group of people, all bringing their own insight to the world of software. When I began doing interviews from the “other side of the table”, I also found Smart and Gets Things Done to be a great read. Joel will be presenting on his company’s primary product FogBugz.

At the end of the day, Miguel will be presenting on something to do with Mono. Because Mono is such a “giant universe”, Miguel has solicited feedback as to which Mono topic should be the focus. I’m interested to hear about MonoTouch1 and Mono.Simd. There are a lot of value-added ways in which Mono differentiates itself, yet compliments, the .NET framework from Microsoft; I’m looking forward to hearing about the future of these and more developments.

Other topics to be presented include Python, the iPhone, ASP.NET MVC, and JavaScript testing. Although it’s late, if you feel inclined, tickets can still be obtained for the event ($99 each), and registration is at 0800. WiFi is also being provided free of charge, so I’ll probably post an update during the event.

I’ll be bringing along my camera to take some photos, which I’ll be posting on Flickr after the event2. If you see me around, feel free to introduce yourself or pose for a picture.

UPDATE: My first batch of pictures is now up. Check them out on my Flickr account and read my photographs article.

  1. Think the Mono/.NET framework and C# brought to the iPhone and iPod Touch. []
  2. Right now, you’ll only see a picture of Lucy, my cat, from when she was a kitten. []

Skein, Threefish, and Mono.Simd — Part 4

Monday, October 5th, 2009

In my previous article, I took a look at some of the framework considerations and alignment in Mono.Simd. Getting back to my experiment, I’ve been getting respectable results from my SIMD implementation of Threefish256. Threefish256Simd was taking approximately 20% longer than the Fajardo reference implementation, but something had to be holding the SIMD version back.

Looking back to the machine code, I found that even aligned loads via the API from arrays were coming back pretty slowly with a number of extra moves and extraneous temporary variable. So, I eliminated the pre-calculated key schedule altogether in favor of calculating the key permutations on the fly as in Fajardo’s implementation. With some specially crafted local vectors for the key and tweak values, this changed the key permutation lines in my prior listed source to look something like this1:

bB = bB + k1 + t0h;
bA = bB + bB + k0 + t1l;

Later subkeys add in a subkey counter as noted in the Skein specification.

With these changes made, I re-ran my tests. This was the optimization I needed. The SIMD-enabled Threefish256 algorithm was now running in less than half the execution time of the reference implementation: 3.8 ns to 9.3 ns. I had been able to significantly beat my hypothesis.

Output of test/timing program2

mgriep@metis:~$ mono-2.4 mono -O=all,-shared Program.exe
Encrypt:
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
Good!
Cross Decrypt:
0000000000000000 0000000000000001 0000000000000002 0000000000000003
0000000000000000 0000000000000001 0000000000000002 0000000000000003
Good!
Speed test of 100000000 iterations with an unroll of 100
Encryption:
Non Simd: 00:15:49.8701366, Average: 00:00:00.0000094
    Simd: 00:06:39.0892858, Average: 00:00:00.0000039
Speedup with parallelism = 2: 2.38009430570386
Efficiency: 1.19004715285193
Decryption:
Non Simd: 00:15:40.1476703, Average: 00:00:00.0000094
    Simd: 00:06:29.0575232, Average: 00:00:00.0000038
Speedup with parallelism = 2: 2.41647472221403
Efficiency: 1.20823736110702
Overhead: 00:00:00.0026434, Average: 00:00:00

The speedup, approximately 2.4, equates to an efficiency of 120%. This is a super-linear speedup for degree of parallelism equal to two. I assumed parallelism = 2 because two long integer operations can take place simultaneously through the XMM registers. Just moving to the XMM registers incurs some additional overhead, so a super-linear speedup was unexpected.

The extra speedup came from a combination of architecture and the Threefish algorithm. Specifically, Threefish operates on 64-bit words. As my computer was 32-bit, long integer operations are broken up by the main processor into multiple micro-ops and therefore isn’t quite as efficient as on a 64-bit processor.

Thus, by moving the arithmetic out of the integer processor and onto the SSE processor, I was able to gain the benefit of the larger word width. Add to this the parallelism gained by performing two operations at a time, and subtract the overhead from moving data into and out of the XMM registers, and you get the exhibited performance.

While I haven’t tested on the 64-bit platform, I would expect the SIMD version to be faster, but not by as much on the 32-bit platform, since the larger word width would eat some of the efficiency gain from moving onto the SSE processor.

Overall, I was very happy with the experience developing with Mono.Simd. There are several things that I would love to see added or changed (aligned moves by default, alignable objects, the psllq %xmm1,%xmm0 instruction3 exposed). Nonetheless, the API is still evolving, and I expect I’ll find myself contributing a few patches to scratch some of the itches I’ve come across.

I hope that you’ve found this an interesting walkthrough of my first experience with Mono.Simd. If you want to inspect my code in more detail, check out my xpdm Github repository. You’ll find Threefish256Simd under the Xpdm.Security.Cryptography namespace4. I’d love to hear about your experiences with Mono.Simd, if any be had!

  1. Redundant LoadAligned and StoreAligned calls have been removed for readability []
  2. This output was created with the Skein v1.1 specification on an EeePC 900A with an x86 Intel Atom processor. []
  3. psllq %xmm1,%xmm0 is a packed quadword (long) logical shift left where the components of XMM0 are shifted left the number of bits specified by the respective component in XMM1. []
  4. The code has not yet been updated to use the new Skein v1.2 rotation constants []

Skein, Threefish, and Mono.Simd — Interlude

Saturday, October 3rd, 2009

Having added in aligned moves into as much of my Threefish implementation as I could in my last article, I thought that I’d take a moment to explain why this is a stable change that shouldn’t cause NullReferenceExceptions at random.

Currently, Mono.Simd is only supported in accelerated mode on the x86 architecture. Support for x64 is in progress (the subject of a 2009 Mono Google Summer of Code project) and other architectures are planned.

Within the Mono framework on x86 and x64, the JIT-compiler makes a guarantee that the stack will be aligned to 16 bytes in each frame. This is good because it meshes well with the requirements of SSE. Depending on the local variables present and temporary variables created by the compiler, the byte-alignment of a block of Vector.. variables can be ensured by adding appropriate padding.

This doesn’t solve for the general case, though. Class or struct members that are Vector..s may end up on the heap, where they may not be aligned, and thus unaligned moves would be necessary without some more defined means to specify alignment.

This lack of a guarantee is not a new problem, especially for people who needed to interact with native SSE code before Mono.Simd. One option is to create a struct with an explicit layout, which is a union with different 4-byte padding combinations, accepting a 12-byte overhead:

[StructLayout(LayoutKind.Explicit)]
public struct AtLeastOneIsAligned {
  [FieldOffset( 0)] public Vector2ul V0;
  [FieldOffset( 4)] public Vector2ul V1;
  [FieldOffset( 8)] public Vector2ul V2;
  [FieldOffset(12)] public Vector2ul V3;
}

You would then need to do some finagling to determine which V* was the aligned vector. It is plausible, but a lot of extra overhead for the convenience.

Of course, it would be much more efficient for the Mono framework to implicitly make the guarantee that local Vector.. variables will be aligned to 16-byte boundaries. Then, when working with local variables, the JIT-compiler could be empowered to emit the more expedient aligned moves. Additionally, local Vector..s can be more efficiently initialized with aligned moves. The framework could go further, adding a class attribute that would guarantee 2^x-byte alignment on the heap, getting much closer to the ideal of being able to eliminate unaligned moves, albeit at the expense of a greater constraint on the garbage collector.

In my final article, I’ll jump back to Threefish256. In that article, I will show one more optimization I was able to make at the C# level, discuss my results compared to the reference, and make a couple more recommendations.

Skein, Threefish, and Mono.Simd — Part 3

Friday, October 2nd, 2009

Now that I had tweaked my base Threefish256 source code as best I could, it was time to look for more inefficiencies in the generated machine code. Here’s where I left the excerpt at the end of the last article:

bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB << 16 | bTempB >> (64 - 16);
bTempA = bTempA << 14 | bTempA >> (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));

One line of interest above, line 3, expands to the following machine code:

   196c0:	66 0f ef c0          	pxor   %xmm0,%xmm0
   196c4:	0f 11 45 b0          	movups %xmm0,-0x50(%ebp)
   196c8:	c7 45 a8 00 00 00 00 	movl   $0x0,-0x58(%ebp)
   196cf:	c7 45 ac 00 00 00 00 	movl   $0x0,-0x54(%ebp)
   196d6:	c7 45 b4 00 00 00 00 	movl   $0x0,-0x4c(%ebp)
   196dd:	c7 45 b0 00 00 00 00 	movl   $0x0,-0x50(%ebp)
   196e4:	c7 45 bc 00 00 00 00 	movl   $0x0,-0x44(%ebp)
   196eb:	c7 45 b8 00 00 00 00 	movl   $0x0,-0x48(%ebp)
   196f2:	0f 10 45 b0          	movups -0x50(%ebp),%xmm0
   196f6:	0f 28 d8             	movaps %xmm0,%xmm3
   196f9:	66 0f 6c dc          	punpcklqdq %xmm4,%xmm3
   196fd:	0f 11 5d c0          	movups %xmm3,-0x40(%ebp)

The first 10 lines in the machine code above don’t seem to do too much. First, XMM0 is zeroed-out and moved to a variable on the stack. Then, the movl instructions zero-out the same variable and the same-sized space above and below it on the stack. Why? I don’t know. Line 9 then reads the variable off the stack (<0,0>) into XMM0. Line 10 then copies this zero to XMM3 and XMM0 is not used until it is clobbered in the next source code line. It isn’t until line 11 that the unpack operation executes.

A large part of me feels that Vector2ul.Zero could have been achieved with a pxor without the extra nine instructions.

Beyond that, however, there is a difference between lines 9 and 10. There is a difference between the two move operations: movups and movaps. Doing a Google search led me to the SIMD Programming Manual for Linux and Windows (shown in the sidebar) a preview of which included this nice tidbit on movups:

This performance overhead is sufficient that it often pays to use the MMX registers rather than the XMM registers if unaligned loads and stores must be used.

Moves with movaps have an alignment restriction; memory operands must be aligned to 16-byte boundaries or a “general protection” fault is raised. Moves with movups don’t have this restriction, but are much slower; so much so that, as the SIMD Manual states, it is often faster to forgo SSE altogether. This speed assertion explained well what I was seeing, and provided me with a new avenue for attack: eliminating as many unaligned moves as possible.

The Mono.Simd API offers the ability to explicitly specify which packed loads and stores should be emitted as aligned through a pair of static methods. Unfortunately, they totally destroy the readability of the code when implemented wholesale. Additionally, the StoreAligned method takes a ref parameter rather than an out parameter, thus requiring that variables be initialized before they can be assigned with an aligned move. The API current does all initializations using unaligned moves, so there is a minimum of one unaligned move per local/temporary vector in a method.

Taking the readability hit, I gave the “aligned move”-theory a try. Additionally, I initialized a local variable, bZero, with <0,0> so that I wouldn’t incur the unnecessary extra nine instructions and unaligned stores everytime I wanted a zero vector. After a whole lot of replacement, here is what the code looked like:

Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bB) + keySchedule[0].GetVectorAligned(2));
Vector2ul.StoreAligned(ref bA, Vector2ul.LoadAligned(ref bB) + Vector2ul.LoadAligned(ref bA) + keySchedule[0].GetVectorAligned(0));
Vector2ul.StoreAligned(ref bTempA, Vector2ul.LoadAligned(ref bZero).UnpackLow(Vector2ul.LoadAligned(ref bB)));
Vector2ul.StoreAligned(ref bTempB, Vector2ul.LoadAligned(ref bZero).UnpackHigh(Vector2ul.LoadAligned(ref bB)));
Vector2ul.StoreAligned(ref bTempB, Vector2ul.LoadAligned(ref bTempB) << 16 | Vector2ul.LoadAligned(ref bTempB) << (64 - 16));
Vector2ul.StoreAligned(ref bTempA, Vector2ul.LoadAligned(ref bTempA) >> 14 | Vector2ul.LoadAligned(ref bTempA) >> (64 - 14));
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bTempA).UnpackHigh(Vector2ul.LoadAligned(ref bTempB)));
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bB) ^ Vector2ul.LoadAligned(ref bA));
Vector2ul.StoreAligned(ref bB, (Vector2ul)(((Vector4ui) Vector2ul.LoadAligned(ref bB)).Shuffle(XYZWtoZWXY)));

Once it was all worked up and ready for test, I fired it up and was immediately hit with:

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object
  at Xpdm.Security.Cryptography.Threefish256Simd.Encrypt (System.UInt64[] input, System.UInt64[] output) [0x00000]
  at Xpdm.Security.Cryptography.Tests.Program.Main () [0x00000]

The meaning of this exception was initially unclear to me, as I was not dereferencing anything that should be null. Then I remembered that an aligned load of an unaligned memory location will generate the same general protection fault as a bad dereference. I figured that this likely had to do with the position of variables on the stack, so I played around adding a couple of dummy variables. I finally got it running using one long integer as a stack buffer.

The speedup I got from these optimizations was significant, but the SIMD-version was still about 20% slower than the Fajardo reference implementation. I had a feeling that I’d be able to eek out a few more performance increases, which I’ll discuss in my next article.

Skein, Threefish, and Mono.Simd — Part 2

Friday, October 2nd, 2009

Having established a baseline for the unrolled implementation of Threefish256 on the x86 integer processing unit in my last post, I was ready to re-target the algorithm to the SSE processing instructions via Mono.Simd. My first working implementation of Threefish256 was pretty similar to the non-SIMD version. The main differences were some revisions to the Mix, UnMix, and Rotate functions to inject the Mono SIMD intrinsics. I also added a function that pre-generated the full key schedule.

One key permutation and one round with Mono.Simd

bA = bA + keySchedule[0].GetVectorAligned(0);
bB = bB + keySchedule[0].GetVectorAligned(2);
Mix(ref bA, ref bB, 14, 16);
SwapComponents(ref bB);

In the above, I am able to use the XMM registers well by packing the in-cipher long integers (b0, b1, b2, and b3 from the previous post) into vectors. In the above source, bA is <b0,b2> and bB is <b1,b3>. The components of bB are swapped between rounds according to the permutation schedule while bA‘s components do not need to be switched.

After a good deal of feeling my way around the Mono.Simd API, I struck paydirt and found myself with working, SIMD-enabled encrypt and decrypt functions. Paydirt wasn’t quite so sweet, however. Running my same speed test, I found that my SIMD version was running at half the speed of the non-SIMD version, about 18 ns to 9 ns, respectively.

How could the SIMD version be twice as slow? The answer lie in the generated machine code. Mono provides a great feature for this, Ahead-of-Time Compilation. The Mono AOT generates a shared object .so file with the platform-specific machine code. This can be used for some JIT-heavy executables to speed start-up time, since the bootstrapping has already been done. In my case, it allows me to disassemble the object file to inspect the generated machine code. My tool for the purpose was objdump.

Shell transcript using AOT and objdump to get machine code:

mgriep@metis:~$ mono-2.4 mono --aot -O=all,-shared Xpdm.Security.Cryptography.dll
Mono Ahead of Time compiler - compiling assembly /home/mgriep/Xpdm.Security.Cryptography.dll
reloc: got at 20
Code: 146706 Info: 359 Ex Info: 405 Class Info: 664 PLT: 29 GOT Info: 370 GOT Info Offsets: 192 GOT: 240
section .text aligned to 151824 from 151820
subsection 1 of .text added at offset 151824 (align: 0)
num_sections: 3
dynsym: 34, dynstr size: 432
section .data, size: 24, 18
section .bss, size: 240, f0
section .text, size: 155488, 25f60
Compiled 77 out of 77 methods (100%)
Methods without GOT slots: 56 (72%)
Direct calls: 2184 (99%)
JIT time: 147 ms, Generation time: 10 ms, Assembly+Link time: 4 ms.
GOT slot distribution:
	methodconst: 7
	switch: 3
	image: 1
	vtable: 12
	ldstr: 1
	delegate_trampoline: 7
mgriep@metis:~$ objdump -x Xpdm.Security.Cryptography.dll.so | grep -A1 Xpdm.Security.Cryptography.Threefish256Simd:Encrypt
000214b0 l     F .text	00003974 Xpdm.Security.Cryptography.Threefish256Simd:Decrypt (ulong[],ulong[])
0001db80 l     F .text	0000392b Xpdm.Security.Cryptography.Threefish256Simd:Encrypt (ulong[],ulong[])
mgriep@metis:~$ objdump -d Xpdm.Security.Cryptography.dll.so --start-address=0x0001db80 --stop-address=0x000214b0
...

The first thing that I noticed is that only a few XMM registers were getting used, usually XMM0 and XMM1. This was unexpected since I was operating on multiple vectors at any given time, and leaving the other 6 registers fallow seemed inefficient. The static utility functions, Mix, UnMix, and Rotate seemed to be getting in the way of using the registers more fully since they would require repositioning the stack pointer, re-loading inside the method, performing the operation, storing, and re-positioning back. In the spirit of removing the unnecessary cruft to get closer to the answer, I fully inlined the utility functions.

Additionally, in the assembly code instructions “between” the lines of source code, there would be a symmetric store and load to/from a stack location and the same XMM register. In an attempt to increase the temporal locality of the vectors, and hopefully reduce the store/loads, I did some reordering of arguments within the source code. As it stood after this finagling, here was the state of my source:

Key permutation and one round after inlining and reordering:

bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB << 16 | bTempB >> (64 - 16);
bTempA = bTempA << 14 | bTempA >> (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));

Neither of these changes had much effect on the overall speed, however. In my next post, I take a deeper look at the generated machine code to find my next optimization target.

Skein, Threefish, and Mono.Simd — Part 1

Thursday, October 1st, 2009

Mono.Simd is a developing framework introducing SIMD intrinsics into the .NET/Mono framework.  Ever since Miguel introduced the world to Mono.Simd, I have been very interested in getting my hands dirty with the new API.

SIMD, or Single Instruction Multiple Data, instructions are special instructions that can concurrently perform the same or related functions on multiple data sets, or vectors, e.g. adding four integers to four other integers, respectively, as opposed to summing each pair in sequence. SIMD instructions on x86 processors use special XMM registers. Examples of SIMD instruction sets include the SSE family and AMD’s 3DNow!. SIMD instructions are most effectively put to use processing large data sets that have similar sets of operations repeated across the data set. Such effective applications include graphical rendering and some physics simulations.

Unfortunately for myself, I am mainly a front-end applications developer by trade. Thus, the opportunity to work with low-level intrinsic functions hadn’t come around in the nine months since their announcement, so I deliberately set out to find a little side project wherein I could test out the new instructions.

There is very little out there on the Internet regarding the Mono.Simd API except for Miguel’s blog post, the Mono source code, and the Monodoc. With that in mind, I found a promising candidate for implementing with Mono.Simd in one of the SHA-3 hash algorithm candidates, specifically Skein. The project’s homepage noted that there were already two implementations of the algorithm in .NET, which gave me a good basis to start from.

I chose the in-progress implementation of Skein by Alberto Fajardo as the basis for my work since it looked to be pretty clean and meshed alright with the BCL standards for System.Security.Cryptography with a few exceptions. Fajardo’s implementation is set up so that the Threefish cipher runs are completely unrolled, meaning no loop overhead in a function already a likely candidate for a tight loop. I chose Threefish256 to implement using Mono.Simd due to its relative simplicity. My preliminary tests1 with Threefish256 showed an average speed of 9.2 ns per encrypt operation.

An example of the first subkey permutation and four rounds of Skein 1.2:

Mix(ref b0, ref b1, 14, k0, k1 + t0);
Mix(ref b2, ref b3, 16, k2 + t1, k3);
Mix(ref b0, ref b3, 52);
Mix(ref b2, ref b1, 57);
Mix(ref b0, ref b1, 23);
Mix(ref b2, ref b3, 40);
Mix(ref b0, ref b3,  5);
Mix(ref b2, ref b1, 37);

Based on this, my working hypothesis is that I will be able to get the SIMD version of the function to run in about 75-55% of the time of the non-SIMD version, or 7–5 ns. This series will continue in my next post, when I’ll discuss some of the progress I’ve made and issues I’ve encountered.

  1. Tests were performed on an EeePC 900A, with an Intel Atom x86 32-bit processor and the Mono 2.4 framework. Tests were run for 100,000,000 iterations unrolled in sets of 100 and averaged to get the stated results. []

BarCamp Boston 4

Saturday, April 25th, 2009

I’m writing this from at BarCamp Boston 4. It’s lunch actually, so I have a bit of time to throw a few words down.

The whole concept of a BarCamp (an unConference) is very intriguing. There is no set format, only a scaffolding of session time slots and rooms. What makes BarCamp so special is the participants. Everyone is encouraged to participate, either from organizing a session, or contributing or learning from a session that you attend. Participants create the sessions and run them. Some may be pre-planned, but most are ad-hoc and follow the flow of the audience. The organizer need not be someone wholly knowlegeable either, but rather someone interested in that topic and looking for others with a similar interest. From the BarCamp rules: “Who ever comes, those are the right people. Whatever happens, was what was supposed to happen.”

I’ve only been in two sessions so far; the two pre-lunch sessions. The first session I attended dealt with Git and feature (aka topic) branching, a rather advanced concept which utilizes a branches branch per feature methodology for concurrently developed features that are merged into an integration branch once deemed complete. I found the session interesting, and learned that the presenter is the developer of a wonderful little Ruby application for managing feature-branched repositories: git-wtf.

The second session I went to was lead by a novice iPhone application developer. It wasn’t the most interesting talk, but it was amusing to hear a few stories about developing for the iPhone and attempting to get an application approved to go in Apple’s App Store. He relayed how his search application, Super Search, was rejected when he first submitted it because the tester had searched for c*cksucker and the app dutifully pulled up the Wikipedia page for fellatio. He resubmitted his app, without changing the binary five minutes later, was rejected again, spoke to someone at Apple, and they accepted the app. All without making any changes to the binary from its first submission. It’s similar to the common carrier argument; should the search app be held profane if the user searches for a profane keyword? I think not.

I also had the opportunity to meet Andrew Lewman from the Tor project. I’ve recently been doing some work on a hidden service address generator, named PurpleOnion, and have been mirroring the Tor site at torproj.xpdm.us. He offered me a Tor t-shirt for the work I had done. I have grandiose ideas to turn PurpleOnion into a Mono/.NET managed Tor-compliant onion router; if you’re interested in a .onion address generator, take a look at the PurpleOnion project at Github.

Anyway, lunch is rapping up, and there look to be a few interesting tracks to start off the afternoon. The day has been fun so far. I’ll post more as the day goes on.

Finding a Better Overload

Tuesday, February 3rd, 2009

One of the latest issues I’ve been working on with C5 is to close the loop on a few compilation warnings that are not #pragma warning warnings. The last of these involves CLS compliance. The C5 library currently advertizes itself as CLS-compliant, but it actually is not. The one example that I am currently wrestling with is the IDictionary<K,V>.Find() overloads. There are two overloads for the Find() method. Their signatures are:

bool Find(K key, out V value);
bool Find(ref K key, out V value);

These two methods are unambiguously distinguished in C# because the compiler requires the use of ref or out as the method signature dictates. However, for other languages, such as Boo, this “by reference” argument passing is inferred. The Common Language System dictates that each overload of the same method must differ by more than just the pass-by type of its arguments (CS3006).

Normally this could be remedied by applying a [CLSCompliant(false)] attribute to the method in question and move along, probably creating a new, compliant method and marking the old one obsolete. The confounding issue is that a core C5 principle is programming to interface. Thus, every method (save one or two) on the public classes is implementing a method defined in an interface, including this one. Interfaces do not have the option of having certain methods be non-CLS compliant. It’s all or nothing (CS3010).

This leaves me in a bit of a catch-22. I cannot ensure cross-language accessibility without changing the interface method signature, but that introduces a breaking change. I have a feeling that both method implementations are widely used among C5 users too.

The first overload is apparent as the default use case, but the second overload is useful when your key may be prototyped. For example, in our dictionary we have a Person as the key and a Account as the value. The person has a bunch of information on it, but the equality comparer is only concerned with the personId:

public class Person : IEquatable<Person> {
  private int personId;
  public string Name {get; set;}
  ...

  public bool Equals(Person other) {
    return this.personId.Equals(other.personId);
  }
}

Then if we knew the personId we could construct a prototype of the person and use the Find() method to get not only their account, but the actual person.

var person = new Person(1);
Account account;
if (Accounts.Find(ref person, out account)) {
  Console.WriteLine("Found {0}'s account", person.Name);
} else {
  Console.WriteLine("Account not found");
}

So, with that example in mind, how do I proceed? The current idea I have is to mark the second overload [Obsolete], create a new overload or method, and deal with the non-CLS compliance for a minor release or two before removing the non-compliant overload. Perhaps I should remove the non-compliant overload outright, like a band-aid.

As well, what should the new method signature be? After thinking for a while, here are my two top choices:

bool Find(K key, out KeyValuePair<K,V> pair);
bool FindWithKey(ref K key, out V value);

Which of those two would be better from an API user’s point of view?  My instinct says the second one is most clear, but the first follows the existing naming.  What’s your take?

Unexpected Behavior

Sunday, February 1st, 2009

“0929″ != 929.

This is the premise of a recent bug that I needed to bash. I’m processing a flat file wherein a time is specified in “HHmm” format. The mapping calls for the two left characters to be interpreted as the hour and the two right characters to be interpreted as the minute. This works great for times such as “2204″ or “1039″, but was failing miserably for some reason. The exception that was raised stated the DateTime was not in the correct format. I looked into the mapping file and saw a proper time “0929″ equivalent to 9:29 AM. In debugging, however, the hour being passed to the DateTime constructor was 92. This was the obvious source of the error.
Looking back at the mapping file, I learned that the mapping software we are using transparently converts the “0929″ string into the number 929 before passing it along the pipe. Converted back into a string for the LEFT(2) and RIGHT(2) gave a nonsense time of 92:29. This type of we-hid-the-details-from-you bugs are absolutely irritating. The workaround was relatively simple, just adding a few more controls into the mapping, but it shouldn’t have been necessary. The datum was a string value and had no sense being converted to an integer before actually being turned into a DateTime value.