# Archive for October, 2009

## When Eight Bits Just Isn’t Enough

Thursday, October 15th, 2009

There was a time when one bit per pixel was enough. Back then, there was only one color channel with only one value for each pixel: either on or off. One bit per pixel was sufficient. Next came progressively more graduated gray-scale, up to the 256 levels provided by a one-byte, eight-bit value. This was about the period that I got my first computer, an Apple IIc in all of its green-display glory.

Mario in 8-bit Color

Then color came along: red, green, and blue. Now those 8 bits were rationed out; three to green and red and two to blue in the common case1. Then came 16-bit color, followed by 24-bit color. Now red, green, and blue were on equal footing, each holding 8 bits of information. Soon after came the cake-is-a-lie 32-bit color, which rarely adds anything more than 24-bit color2.

This brings us to today, where most computer monitors advertise the 16.7 million colors afforded by 24 bits of color data. For most of us, this is plenty, as the human eye is estimated to be able to distinguish only about 10 million colors. Thus, 8 bits per channel is generally sufficient for displays, which is why there haven’t been any big pushes to extend that range3. The issue is no longer one of how many colors but of how broad a gamut of colors.

Illustration Credit: Mysid

The sRGB Color Space

The standard gamut used in most displays is called sRGB. sRGB is a rather limited gamut but one with such wide acceptance that, in the absence of any information to the contrary, it is the assumed default for images. It’s a form of least common denominator for electronic displays. For most uses where a computer display is the means of consumption, sRGB is sufficient. However, sRGB lacks the ability to represent saturated yellows, greens, and cyans. This shortcoming is readily apparent when it comes to printing images. Printers work in a different gamut, CMYK. Some colors that are in sRGB aren’t in CMYK and vice versa. The more saturated blues and yellows that CMYK can print but that sRGB can’t represent can leave images of the sky and flowers a little under-saturated.

Wikipedia Public Domain

The Adobe Wide Gamut RGB Color Space

In order to handle the wider range of colors, alternative color spaces with wider gamuts must be used. Examples include Adobe RGB (1998), ProPhoto RGB4, and Adobe Wide Gamut RGB. These gamuts allow even more colors to be represented.

Or do they… In an analog system, sure: just as the interval $[0, 2]$ strictly contains the interval $[0, 1]$ even though both have the same uncountable cardinality, so too Adobe Wide Gamut RGB contains all of sRGB plus saturated colors that sRGB cannot represent, even though both gamuts contain uncountably many colors.

However, in the digital format, each component is chunked into quanta. For 8 bits, there are 256 distinct values. When you expand the gamut, the number of representable colors remains exactly the same (256 per channel), but the distance between quanta of color is larger. Think of it as taking a 24-inch ruler and stretching it. The ruler covers more distance, but there are still only 24 sections on it. There is just more distance between each hash mark.
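
To make the stretched-ruler analogy concrete, here is a small Python sketch (the gamut widths are arbitrary illustrative numbers, not measured values): with a fixed bit depth, widening the gamut leaves the number of representable levels unchanged and only widens the gap between adjacent levels.

```python
def step_size(gamut_width, bits=8):
    """Distance between adjacent representable values in one channel."""
    levels = 2 ** bits          # 8 bits -> 256 distinct values
    return gamut_width / (levels - 1)

narrow_step = step_size(1.0)    # a narrow gamut, normalized to width 1.0
wide_step = step_size(1.4)      # a hypothetical gamut 40% wider

# Same 256 levels either way; the quanta are just 40% farther apart.
assert wide_step > narrow_step
```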

Image Credit: Derrick Coetzee

Posterization from conversion to 4-bit (16 color) palette

When this extended gamut is invariably cast back into a smaller gamut, such as sRGB, the lack of color data resolution causes a phenomenon called posterization. If you’ve ever used 8-bit mode during a remote session on your 24-bit desktop, you’ve seen this effect, which is somewhat similar to a rounding error.

One way to combat this posterization is to increase the resolution of the color data, e.g. from 8 to 16 bits per channel. With 256 times as many hash marks on the color ruler, the color gamut can easily be expanded to twice its size without causing the banding effect.
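
A rough way to see the benefit, sketched in Python with made-up numbers (the 0.3 sample value and the doubled gamut width are arbitrary): quantize a color value after mapping it into a wider gamut, decode it again, and compare the round-trip error at 8 versus 16 bits per channel.

```python
def roundtrip_error(value, gamut_scale, bits):
    """Map `value` (0..1 in the narrow gamut) into a gamut `gamut_scale`
    times as wide, quantize it, decode it back, and measure the error."""
    levels = 2 ** bits - 1
    code = round(value / gamut_scale * levels)   # encode into the wider gamut
    decoded = code / levels * gamut_scale        # decode back to the narrow range
    return abs(decoded - value)

# With twice the gamut width, 16 bits keeps the error far below 8 bits:
err8 = roundtrip_error(0.3, 2.0, 8)
err16 = roundtrip_error(0.3, 2.0, 16)
```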

As most digital cameras still capture in sRGB and output to 8-bit JPEG, the common photographer has nothing to worry about, but for the more advanced photographer who wishes to have more control over their output, a larger gamut, such as Adobe RGB, and greater resolution are more important. My camera, the Canon Rebel T1i (pictured below), offers output in a RAW format5. Each pixel of the sensor has 14 bits of resolution, so it makes little sense to use a working format with fewer bits per channel. Thus, after importing the RAW file, my working format is a 16-bit TIFF in Adobe RGB. The TIFF format is an uncompressed beast, but it ensures that I don’t lose any of that color data in translation.

Getting the results that you want can be remarkably uncomplicated. Most photographers don’t worry about any of this, as nearly all decision making, aside from composition, happens inside the camera. But when more control is desired, the entire workflow needs to be insulated against introducing error so that the end result is precisely what the photographer intended. And it’s at that point that eight bits really isn’t enough.

1. The reason being that green is the primary color to which human eyes are most sensitive, and small changes in green are more easily noticed than in red or blue. This fact is further utilized by many steganographic programs which embed information in image files. []
2. There are some 32-bit color implementations that use 30 bits of color (10 bits per channel), but most commonly the extra 8 bits are used for non-color data or padding. In fact, Windows 7 will support up to 48-bit color, but no graphics cards support it yet; only a few graphics cards currently support 10-bit-per-channel color; and fewer monitors still can display the 1 billion+ colors available with 30-bit color. []
3. Though, as mentioned in the previous footnote, Windows 7 is breaching that ceiling. Read more about the advance of digital color resolution on Wikipedia []
4. Which uses an ‘imaginary’ blue as one of its primary colors. []
5. A proprietary variant of the TIFF format; each manufacturer often has its own proprietary extensions. []

## The Square and Compasses

Monday, October 12th, 2009

The Woburn Masonic Apartments

This Saturday is Square and Compasses Day in Massachusetts. S&C Day is a state-wide open house organized by Freemasons across Massachusetts to offer the public the opportunity to learn more about the fraternity.

My lodge, Mount Horeb Lodge in Woburn, will be holding its open house from 0900 to 1500 on Saturday, October 17. If you happen to live in the area, I encourage you to come out and see what there is to see. We’ll have the entire building open, and members from three lodges will be there giving tours around and about the lodge. There is a ton of history within the Masons; my lodge was established in 1855, and we have some great historians who can explain much of what we have done and continue to do.

In addition to tours of the Masonic lodge, Children’s Hospital of Boston will be outside taking blood donations for the children in their care. The more that donate, the more lives are saved. One donation can save the lives of as many as three people! You can schedule an appointment, if you’d like, by calling 781-933-8220 and pressing 2, but walk-ins are also much appreciated.

Whether you want to find out more about the Masons, eat a hot dog, help save a life, or just jump on a bouncy castle, join me at the Woburn Masonic Apartments at 17 Arlington Road in Woburn. If you don’t live near Woburn, check out the Grand Lodge website to find a lodge near you. Everything is free, and you’ll certainly learn something new for your time. Feel free to contact me for more information at marcus@griep.us.

## StackOverflow DevDays: Boston — The Pictures

Friday, October 9th, 2009

SO DevDays

I’ve finally uploaded a first batch of pictures from the StackOverflow DevDays in Boston. Out of over 350 pictures taken, only 20 made it to this cut, with several not-quite-as-good pictures that I may look at adding later. If you see any you’d like, let me know, and I’ll get a proper copy to you.

Taking photos at the event was very difficult. For human eyes, everything was fine, but for a camera’s lens, there was very little light. I didn’t use flash during the event, as a common courtesy to both presenters and audience in an already low-light environment. It made me really wish that I had a 50mm f/1.4 lens like this one, though; too bad such wide-aperture lenses are so expensive!

Light source differences between two subjects in one picture cause color problems.

The other environmental difficulty for photography was all of the different light source types. There were fluorescent lights in the main area (bluish), recessed incandescent bulbs (reddish-orange), and the projector light (near 6500K, or close to sunlight). So, in the picture at right, for example, the main subject, Joel Spolsky, was standing under incandescent lights, which gives him a reddish tinge. When correcting Joel’s white shirt back to white, the slides in the background, having a neutral tint before correction, end up tending toward blue after.
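
The correction can be sketched as simple per-channel scaling (a von Kries-style adjustment; all pixel values here are hypothetical illustrations, not measurements from the photo):

```python
def white_balance(pixel, reference_white):
    """Scale each channel so that reference_white maps to neutral (1, 1, 1)."""
    return tuple(c / w for c, w in zip(pixel, reference_white))

# A white shirt under incandescent light picks up a reddish cast:
shirt = (1.0, 0.85, 0.60)
assert white_balance(shirt, shirt) == (1.0, 1.0, 1.0)  # back to neutral

# The same correction applied to a slide pixel that was already neutral
# boosts its blue channel the most, tinting the slides blue:
slide = (0.8, 0.8, 0.8)
corrected = white_balance(slide, shirt)  # blue ends up the largest channel
```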

Regardless, I’m happy with the photos I did get. Again, feel free to check them out. Here’s the link in case you didn’t catch it earlier!

My StackOverflow DevDays Flickr Set

## StackOverflow DevDays: Boston — In Review

Thursday, October 8th, 2009

StackOverflow DevDays in Boston was very fun, well planned, and well executed, with a nice array of speakers from across a wide swath of the development world.

The day started off with a keynote speech by Joel Spolsky. His talk was very amusing and included a theory that all work boils down to a use case wherein the goal is to get the end-user laid, or is in furtherance of that end goal. He went on to discuss the two objectives of simplicity and power, and that while they may seem diametrically opposed, there is money to be made in reconciling simplicity (Don’t make me decide on things I don’t care about) and power (I want to be able to do this and that and some from column C).

From there, the main scheduled presenters took over. Ned Batchelder1 gave a good presentation on the Python programming language. Although I have been programming in Python for quite some time, I very much enjoyed the talk, which used a Python spell checker as a demonstration of the features of the language. The talk moved at a good pace that was still interesting to me as someone who already knew the material, but also didn’t seem to leave anybody out in the code.

After Ned was Dan Pilone2. Dan’s presentation focused on the iPhone development experience and moved methodically through the most pertinent points: from Xcode to Instruments and Interface Builder, to the application submission process (denoted by a slide showing a despairing man in a desert crying out as sand slips from his grasp), to the sales and marketing mentality necessary in Apple’s App Store. Humor tossed in at a few key places made this an enjoyable and solid overview. If you were looking for an in-depth treatment, the one-hour format would leave you disappointed, but from outside iPhone development looking in, the presentation fit the time and audience well.

Just before lunch, Joel Spolsky3 retook the stage, using 30 minutes to demo the latest version of FogBugz, FogBugz 7, the flagship product of his company, FogCreek. This version of FogBugz includes a new feature called Kiln, an integrated, hosted source control4 and code review management service going into beta as of the announcement. If it piques your interest, you can check it out on FogCreek’s website.

After lunch5, Patrick Hynds and Chris Bowen6 presented on ASP.NET MVC. Their talk was good, though there were points where it sounded like cheerleading rather than a technical talk. To be fair, Chris Bowen is a developer evangelist for Microsoft, so it’s not unexpected, and it could have just been the “view from my seat”. The subject matter was good; I walked away convinced that Microsoft had done quite well with their MVC implementation.

John Resig7, creator of the jQuery JavaScript library, was the next presenter. His talk focused on the need for testing in JavaScript. He covered several different testing frameworks, as well as different methodologies: from in-browser unit testing, to mocking out the entire browser with an out-of-browser solution like Env.JS, to behavioral and functional testing.

John did a good job of impressing upon the audience the difficulty of testing JavaScript in a manner that spans the majority of browsers, and indicated that, in his opinion, there’s nothing like real-world testing with a real human. He pointed to Test Swarm as a distributed example of how to get in-browser testing across a wide array of user/browser configurations. Overall, it was a good spanning overview that didn’t get mired in the gritty details of each framework, giving just enough to understand where each had its use.

The last presenter of the day, Miguel de Icaza8, presented on Mono. I spoke with him briefly at lunch, and even then, he was still vacillating as to the subject matter for his talk. The actual talk was mildly haphazard as a result, but the flow worked well with Miguel’s freewheeling personality. He took the audience through Mono Tools for Visual Studio, showing remote debugging from a Windows VS session to a program running on a remote SUSE Linux VM. From there, he went a step further, building an RPM of the ASP.NET MVC sample blog engine. Next, he uploaded that RPM to SUSE Studio and, in about 10 minutes, generated a fully configured VMware image running the blog engine on Mono’s ASP.NET MVC.

Miguel also showcased MonoTouch, building a simple program in MonoDevelop on Mac OS X and demonstrating it in the iPhone simulator. Including lots of pro-Linux banter and some pokes at Richard Stallman, Miguel kept the audience interested and amused, which is exactly what the last presentation in an 8-hour day needs.

On the whole, the StackOverflow DevDays conference was well done. I enjoyed myself even in those areas outside my expertise and came away having learned something. Be prepared for a few advertisements for FogCreek during the breaks (well put together and enjoyable, but advertisements nonetheless), but as the co-founder of StackOverflow is also a founder of FogCreek, it’s not unexpected. Was it worth my \$99? I believe so. Do I recommend you go? If you have the time, and there’s a location near you, then yes; check out the list of scheduled speakers for your event and see if there are any you’d be interested in.

As I mentioned in my preview, I took many pictures of the event, but have not had a chance to process them yet. I hope to do so tonight and have something to post by this evening.

UPDATE: My first batch of pictures is now up. Check them out on my Flickr account and read my photographs article.

1. Ned’s DevDays blog post, including slides. On Twitter: @NedBat []
4. The only hosted source control option is Mercurial; I didn’t get a chance to ask Joel why only Mercurial, though he assured us there were several reasons. []
5. Provided lunches included a turkey or beef sandwich or a veggie wrap, along with a half-ounce bag of chips, an apple, and a cookie []

## Wait, Wait… There’s a Blog for That!

Wednesday, October 7th, 2009

If you’re like me, then you listen to a lot of NPR. Every Saturday at 1100, my radio turns on to listen to the crazy mechanics “Click & Clack” on Car Talk, followed by Wait, Wait… Don’t Tell Me! at noon1. Wait, Wait…, a weekly news quiz show, is a great way to pull some humor out of both the mundane and the off-beat news of the week, wherein everyone’s ultimate goal is to get Carl Kasell’s voice on their home answering machine. Yesterday, Peter Sagal, the host of the show, announced that Wait, Wait… has decided to enter the blogosphere.

We’ve been following this “web-log” thing for a while… waiting to see if it was just another one of those internet2 fads, like dancing babies or keyboard cats or histories of dance. Dances. But it seems to be sticking, so we here at Wait, Wait thought it would be worth a try.

I’m intrigued. It’ll be another outlet for news humor that for one reason or another didn’t make it to air; a supplement between doses of Peter Sagal, Mo Rocca, Paula Poundstone, Amy Dickinson, and all the other panelists.

You can also keep up with the Wait, Wait… crew on Twitter via @waitwait3.

1. I’m a member of NPR member station WBUR 90.9 Boston []
2. Sic. The pedant in me wants to point out that Internet should be capitalized. []

## Looking Forward — SO DevDays: Boston

Tuesday, October 6th, 2009

Tomorrow, October 7, I’ll be headed to the StackOverflow DevDays in Boston. I’m especially excited to see Joel Spolsky and Miguel de Icaza.

I’ve been a fan of Joel’s insights, even if I don’t always agree. In my senior year of college, I received a copy of his book, The Best Software Writing I, as an interview gift. In that book, Joel pulls together some great articles written by a very diverse group of people, all bringing their own insight to the world of software. When I began doing interviews from the “other side of the table”, I also found Smart and Gets Things Done to be a great read. Joel will be presenting on his company’s primary product, FogBugz.

At the end of the day, Miguel will be presenting on something to do with Mono. Because Mono is such a “giant universe”, Miguel has solicited feedback as to which Mono topic should be the focus. I’m interested to hear about MonoTouch1 and Mono.Simd. There are a lot of value-added ways in which Mono differentiates itself from, yet complements, the .NET framework from Microsoft; I’m looking forward to hearing about the future of these and more developments.

Other topics to be presented include Python, the iPhone, ASP.NET MVC, and JavaScript testing. Although it’s late, if you feel inclined, tickets can still be obtained for the event (\$99 each), and registration is at 0800. WiFi is also being provided free of charge, so I’ll probably post an update during the event.

I’ll be bringing along my camera to take some photos, which I’ll be posting on Flickr after the event2. If you see me around, feel free to introduce yourself or pose for a picture.

UPDATE: My first batch of pictures is now up. Check them out on my Flickr account and read my photographs article.

1. Think the Mono/.NET framework and C# brought to the iPhone and iPod Touch. []
2. Right now, you’ll only see a picture of Lucy, my cat, from when she was a kitten. []

## Skein, Threefish, and Mono.Simd — Part 4

Monday, October 5th, 2009

In my previous article, I took a look at some of the framework considerations and alignment in Mono.Simd. Getting back to my experiment, I’ve been getting respectable results from my SIMD implementation of Threefish256. Threefish256Simd was taking approximately 20% longer than the Fajardo reference implementation, but something had to be holding the SIMD version back.

Looking back at the machine code, I found that even aligned loads from arrays via the API were coming back pretty slowly, with a number of extra moves and extraneous temporary variables. So, I eliminated the pre-calculated key schedule altogether in favor of calculating the key permutations on the fly, as in Fajardo’s implementation. With some specially crafted local vectors for the key and tweak values, this changed the key permutation lines in my prior listed source to look something like this1:

```
bB = bB + k1 + t0h;
bA = bB + bA + k0 + t1l;
```

Later subkeys add in a subkey counter as noted in the Skein specification.

With these changes made, I re-ran my tests. This was the optimization I needed. The SIMD-enabled Threefish256 algorithm was now running in less than half the execution time of the reference implementation: 3.8 ns to 9.3 ns. I had been able to significantly beat my hypothesis.

Output of test/timing program2

```
mgriep@metis:~$ mono-2.4 mono -O=all,-shared Program.exe
Encrypt:
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
Good!
Cross Decrypt:
0000000000000000 0000000000000001 0000000000000002 0000000000000003
0000000000000000 0000000000000001 0000000000000002 0000000000000003
Good!
Speed test of 100000000 iterations with an unroll of 100
Encryption:
Non Simd: 00:15:49.8701366, Average: 00:00:00.0000094
Simd: 00:06:39.0892858, Average: 00:00:00.0000039
Speedup with parallelism = 2: 2.38009430570386
Efficiency: 1.19004715285193
Decryption:
Non Simd: 00:15:40.1476703, Average: 00:00:00.0000094
Simd: 00:06:29.0575232, Average: 00:00:00.0000038
Speedup with parallelism = 2: 2.41647472221403
Efficiency: 1.20823736110702
```

The speedup, approximately 2.4, equates to an efficiency of 120%. This is a super-linear speedup for degree of parallelism equal to two. I assumed parallelism = 2 because two long integer operations can take place simultaneously through the `XMM` registers. Just moving to the `XMM` registers incurs some additional overhead, so a super-linear speedup was unexpected.
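
For clarity, the speedup and efficiency figures in the log can be reproduced with a couple of lines; here is a quick Python sketch using the encryption times from the output above:

```python
def speedup(t_serial, t_parallel):
    """Classic parallel speedup: serial time over parallel time."""
    return t_serial / t_parallel

def efficiency(s, degree_of_parallelism):
    """Speedup normalized by the degree of parallelism."""
    return s / degree_of_parallelism

# Encryption times from the log: 00:15:49.8701366 and 00:06:39.0892858
t_nonsimd = 15 * 60 + 49.8701366
t_simd = 6 * 60 + 39.0892858

s = speedup(t_nonsimd, t_simd)   # approximately 2.3801, as reported above
e = efficiency(s, 2)             # approximately 1.1900, i.e. 119-120%
```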

The extra speedup came from a combination of architecture and the Threefish algorithm. Specifically, Threefish operates on 64-bit words. As my computer was 32-bit, long integer operations are broken up by the main processor into multiple micro-ops and therefore aren’t quite as efficient as on a 64-bit processor.

Thus, by moving the arithmetic out of the integer processor and onto the SSE processor, I was able to gain the benefit of the larger word width. Add to this the parallelism gained by performing two operations at a time, and subtract the overhead from moving data into and out of the `XMM` registers, and you get the exhibited performance.

While I haven’t tested on a 64-bit platform, I would expect the SIMD version to still be faster there, though not by as much as on the 32-bit platform, since the native 64-bit word width would eat into some of the efficiency gained by moving onto the SSE unit.

Overall, I was very happy with the experience developing with Mono.Simd. There are several things that I would love to see added or changed (aligned moves by default, alignable objects, the `psllq %xmm1,%xmm0` instruction3 exposed). Nonetheless, the API is still evolving, and I expect I’ll find myself contributing a few patches to scratch some of the itches I’ve come across.

I hope that you’ve found this an interesting walkthrough of my first experience with Mono.Simd. If you want to inspect my code in more detail, check out my xpdm Github repository. You’ll find `Threefish256Simd` under the `Xpdm.Security.Cryptography` namespace4. I’d love to hear about your experiences with Mono.Simd, if any be had!

1. Redundant `LoadAligned` and `StoreAligned` calls have been removed for readability []
2. This output was created with the Skein v1.1 specification on an EeePC 900A with an x86 Intel Atom processor. []
3. `psllq %xmm1,%xmm0` is a packed quadword (long) logical shift left where the components of `XMM0` are shifted left the number of bits specified by the respective component in `XMM1`. []
4. The code has not yet been updated to use the new Skein v1.2 rotation constants []

## Skein, Threefish, and Mono.Simd — Interlude

Saturday, October 3rd, 2009

Having added aligned moves to as much of my Threefish implementation as I could in my last article, I thought that I’d take a moment to explain why this is a stable change that shouldn’t cause `NullReferenceException`s at random.

Currently, Mono.Simd is only supported in accelerated mode on the x86 architecture. Support for x64 is in progress (the subject of a 2009 Mono Google Summer of Code project) and other architectures are planned.

Within the Mono framework on x86 and x64, the JIT-compiler makes a guarantee that the stack will be aligned to 16 bytes in each frame. This is good because it meshes well with the requirements of SSE. Depending on the local variables present and temporary variables created by the compiler, the byte-alignment of a block of `Vector..` variables can be ensured by adding appropriate padding.

This doesn’t solve the general case, though. Class or struct members that are `Vector..`s may end up on the heap, where they may not be aligned, and thus unaligned moves would be necessary without some more defined means to specify alignment.

This lack of a guarantee is not a new problem, especially for people who needed to interact with native SSE code before Mono.Simd. One option is to create a struct with an explicit layout: a union of the same vector at each possible 4-byte offset, accepting 12 bytes of overhead:

```[StructLayout(LayoutKind.Explicit)]
public struct AtLeastOneIsAligned {
[FieldOffset( 0)] public Vector2ul V0;
[FieldOffset( 4)] public Vector2ul V1;
[FieldOffset( 8)] public Vector2ul V2;
[FieldOffset(12)] public Vector2ul V3;
}
```

You would then need to do some finagling to determine which `V*` is the aligned vector. It is workable, but a lot of extra overhead for the convenience.

Of course, it would be much more efficient for the Mono framework to implicitly make the guarantee that local `Vector..` variables will be aligned to 16-byte boundaries. Then, when working with local variables, the JIT-compiler could be empowered to emit the more expedient aligned moves. Additionally, local `Vector..`s can be more efficiently initialized with aligned moves. The framework could go further, adding a class attribute that would guarantee $2^x$-byte alignment on the heap, getting much closer to the ideal of being able to eliminate unaligned moves, albeit at the expense of a greater constraint on the garbage collector.

In my final article, I’ll jump back to Threefish256. In that article, I will show one more optimization I was able to make at the C# level, discuss my results compared to the reference, and make a couple more recommendations.

## Skein, Threefish, and Mono.Simd — Part 3

Friday, October 2nd, 2009

Now that I had tweaked my base Threefish256 source code as best I could, it was time to look for more inefficiencies in the generated machine code. Here’s where I left the excerpt at the end of the last article:
```
bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB << 16 | bTempB >> (64 - 16);
bTempA = bTempA << 14 | bTempA >> (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));
```

One line of interest above, line 3 (`bTempA = Vector2ul.Zero.UnpackLow(bB);`), expands to the following machine code:

```
196c0:	66 0f ef c0          	pxor   %xmm0,%xmm0
196c4:	0f 11 45 b0          	movups %xmm0,-0x50(%ebp)
196c8:	c7 45 a8 00 00 00 00 	movl   $0x0,-0x58(%ebp)
196cf:	c7 45 ac 00 00 00 00 	movl   $0x0,-0x54(%ebp)
196d6:	c7 45 b4 00 00 00 00 	movl   $0x0,-0x4c(%ebp)
196dd:	c7 45 b0 00 00 00 00 	movl   $0x0,-0x50(%ebp)
196e4:	c7 45 bc 00 00 00 00 	movl   $0x0,-0x44(%ebp)
196eb:	c7 45 b8 00 00 00 00 	movl   $0x0,-0x48(%ebp)
196f2:	0f 10 45 b0          	movups -0x50(%ebp),%xmm0
196f6:	0f 28 d8             	movaps %xmm0,%xmm3
196f9:	66 0f 6c dc          	punpcklqdq %xmm4,%xmm3
196fd:	0f 11 5d c0          	movups %xmm3,-0x40(%ebp)
```

The first 10 lines in the machine code above don’t seem to do too much. First, `XMM0` is zeroed-out and moved to a variable on the stack. Then, the `movl` instructions zero-out the same variable and the same-sized space above and below it on the stack. Why? I don’t know. Line 9 then reads the variable off the stack (`<0,0>`) into `XMM0`. Line 10 then copies this zero to `XMM3` and `XMM0` is not used until it is clobbered in the next source code line. It isn’t until line 11 that the unpack operation executes.

A large part of me feels that `Vector2ul.Zero` could have been achieved with a `pxor` without the extra nine instructions.

Beyond that, however, there is a difference between lines 9 and 10: the two move operations, `movups` and `movaps`, are not the same. A Google search led me to the SIMD Programming Manual for Linux and Windows (shown in the sidebar), a preview of which included this nice tidbit on `movups`:

This performance overhead is sufficient that it often pays to use the MMX registers rather than the XMM registers if unaligned loads and stores must be used.

Moves with `movaps` have an alignment restriction; memory operands must be aligned to 16-byte boundaries or a “general protection” fault is raised. Moves with `movups` don’t have this restriction, but are much slower; so much so that, as the SIMD Manual states, it is often faster to forgo SSE altogether. This speed assertion explained well what I was seeing, and provided me with a new avenue for attack: eliminating as many unaligned moves as possible.

The Mono.Simd API offers the ability to explicitly specify which packed loads and stores should be emitted as aligned through a pair of static methods. Unfortunately, they totally destroy the readability of the code when implemented wholesale. Additionally, the `StoreAligned` method takes a `ref` parameter rather than an `out` parameter, thus requiring that variables be initialized before they can be assigned with an aligned move. The API currently does all initializations using unaligned moves, so there is a minimum of one unaligned move per local/temporary vector in a method.

Taking the readability hit, I gave the “aligned move” theory a try. Additionally, I initialized a local variable, `bZero`, with `<0,0>` so that I wouldn’t incur the unnecessary extra nine instructions and unaligned stores every time I wanted a zero vector. After a whole lot of replacement, here is what the code looked like:
```
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bB) + keySchedule[0].GetVectorAligned(2));
Vector2ul.StoreAligned(ref bTempB, Vector2ul.LoadAligned(ref bTempB) << 16 | Vector2ul.LoadAligned(ref bTempB) >> (64 - 16));
Vector2ul.StoreAligned(ref bTempA, Vector2ul.LoadAligned(ref bTempA) << 14 | Vector2ul.LoadAligned(ref bTempA) >> (64 - 14));
```

Once it was all worked up and ready for test, I fired it up and was immediately hit with:

```Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object
at Xpdm.Security.Cryptography.Threefish256Simd.Encrypt (System.UInt64[] input, System.UInt64[] output) [0x00000]
at Xpdm.Security.Cryptography.Tests.Program.Main () [0x00000]
```

The meaning of this exception was initially unclear to me, as I was not dereferencing anything that should be null. Then I remembered that an aligned load of an unaligned memory location generates the same general protection fault as a bad dereference. I figured that this likely had to do with the position of variables on the stack, so I played around with adding a couple of dummy variables. I finally got it running using one long integer as a stack buffer.

The speedup I got from these optimizations was significant, but the SIMD version was still about 20% slower than the Fajardo reference implementation. I had a feeling that I’d be able to eke out a few more performance increases, which I’ll discuss in my next article.

## Skein, Threefish, and Mono.Simd — Part 2

Friday, October 2nd, 2009

Having established a baseline for the unrolled implementation of Threefish256 on the x86 integer processing unit in my last post, I was ready to re-target the algorithm to the SSE instructions via Mono.Simd. My first working SIMD implementation of Threefish256 was pretty similar to the non-SIMD version. The main differences were some revisions to the `Mix`, `UnMix`, and `Rotate` functions to inject the Mono SIMD intrinsics. I also added a function that pre-generated the full key schedule.

One key permutation and one round with Mono.Simd

```bA = bA + keySchedule[0].GetVectorAligned(0);
bB = bB + keySchedule[0].GetVectorAligned(2);
Mix(ref bA, ref bB, 14, 16);
SwapComponents(ref bB);
```

In the above, I am able to use the `XMM` registers well by packing the in-cipher long integers (`b0`, `b1`, `b2`, and `b3` from the previous post) into vectors. In the above source, `bA` is `<b0,b2>` and `bB` is `<b1,b3>`. The components of `bB` are swapped between rounds according to the permutation schedule, while `bA`’s components do not need to be switched.

After a good deal of feeling my way around the Mono.Simd API, I struck paydirt and found myself with working, SIMD-enabled encrypt and decrypt functions. Paydirt wasn’t quite so sweet, however. Running my same speed test, I found that my SIMD version was running at half the speed of the non-SIMD version, about 18 ns to 9 ns, respectively.

How could the SIMD version be twice as slow? The answer lay in the generated machine code. Mono provides a great feature for inspecting it: Ahead-of-Time (AOT) compilation. The Mono AOT compiler generates a shared object (`.so`) file containing the platform-specific machine code. This is normally used to speed start-up of JIT-heavy executables, since the bootstrapping has already been done. In my case, it allows me to disassemble the object file and inspect the generated machine code. My tool for the purpose was `objdump`.

Shell transcript using AOT and `objdump` to get machine code:

```mgriep@metis:~$ mono-2.4 mono --aot -O=all,-shared Xpdm.Security.Cryptography.dll
Mono Ahead of Time compiler - compiling assembly /home/mgriep/Xpdm.Security.Cryptography.dll
reloc: got at 20
Code: 146706 Info: 359 Ex Info: 405 Class Info: 664 PLT: 29 GOT Info: 370 GOT Info Offsets: 192 GOT: 240
section .text aligned to 151824 from 151820
subsection 1 of .text added at offset 151824 (align: 0)
num_sections: 3
dynsym: 34, dynstr size: 432
section .data, size: 24, 18
section .bss, size: 240, f0
section .text, size: 155488, 25f60
Compiled 77 out of 77 methods (100%)
Methods without GOT slots: 56 (72%)
Direct calls: 2184 (99%)
JIT time: 147 ms, Generation time: 10 ms, Assembly+Link time: 4 ms.
GOT slot distribution:
methodconst: 7
switch: 3
image: 1
vtable: 12
ldstr: 1
delegate_trampoline: 7
mgriep@metis:~$ objdump -x Xpdm.Security.Cryptography.dll.so | grep -A1 Xpdm.Security.Cryptography.Threefish256Simd:Encrypt
000214b0 l     F .text	00003974 Xpdm.Security.Cryptography.Threefish256Simd:Decrypt (ulong[],ulong[])
0001db80 l     F .text	0000392b Xpdm.Security.Cryptography.Threefish256Simd:Encrypt (ulong[],ulong[])
...
```

The first thing that I noticed is that only a few `XMM` registers were getting used, usually `XMM0` and `XMM1`. This was unexpected, since I was operating on multiple vectors at any given time, and leaving the other six registers fallow seemed inefficient. The static utility functions `Mix`, `UnMix`, and `Rotate` seemed to be getting in the way of fuller register use: each call required repositioning the stack pointer, re-loading the vectors inside the method, performing the operation, storing the results, and re-positioning the stack on return. In the spirit of removing unnecessary cruft to get closer to the answer, I fully inlined the utility functions.

Additionally, in the machine instructions “between” the lines of source code, there would be a symmetric store and load to and from the same stack location and `XMM` register. In an attempt to increase the temporal locality of the vectors, and hopefully reduce the stores and loads, I reordered some arguments within the source code. As it stood after this finagling, here was the state of my source:

Key permutation and one round after inlining and reordering:
```bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB << 16 | bTempB >> (64 - 16);
bTempA = bTempA << 14 | bTempA >> (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));
```

Neither of these changes had much effect on the overall speed, however. In my next post, I take a deeper look at the generated machine code to find my next optimization target.