<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>NineOnions and PurpleCarrots &#187; Mono</title>
	<atom:link href="http://blog.xpdm.us/category/programming/mono-programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.xpdm.us</link>
	<description>The search for nonsense.</description>
	<lastBuildDate>Tue, 27 Jul 2010 21:26:20 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Skein, Threefish, and Mono.Simd &#8212; Part 4</title>
		<link>http://blog.xpdm.us/2009/10/05/skein-threefish-and-mono-simd-part-4/</link>
		<comments>http://blog.xpdm.us/2009/10/05/skein-threefish-and-mono-simd-part-4/#comments</comments>
		<pubDate>Mon, 05 Oct 2009 14:00:01 +0000</pubDate>
		<dc:creator>Marcus Griep</dc:creator>
				<category><![CDATA[Mono]]></category>
		<category><![CDATA[SHA3]]></category>
		<category><![CDATA[SIMD]]></category>

		<guid isPermaLink="false">http://blog.xpdm.us/?p=114</guid>
		<description><![CDATA[In my previous article, I took a look at some of the framework considerations and alignment in Mono.Simd. Getting back to my experiment, I&#8217;ve been getting respectable results from my SIMD implementation of Threefish256. Threefish256Simd was taking approximately 20% longer than the Fajardo reference implementation, but something had to be holding the SIMD version back. Looking [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://blog.xpdm.us/2009/10/03/skein-threefish-and-mono-simd-interlude/" title='Skein, Threefish, and Mono.Simd — Interlude'>previous article</a>, I took a look at some of the framework considerations and alignment in Mono.Simd. Getting back to my experiment, I&#8217;ve been getting respectable results from my <acronym title="Single Instruction, Multiple Data">SIMD</acronym> implementation of Threefish256. Threefish256Simd was taking approximately 20% longer than the Fajardo reference implementation, but something had to be holding the <acronym title="Single Instruction, Multiple Data">SIMD</acronym> version back.</p>
<p>Looking back at the machine code, I found that even aligned loads via the <acronym title="Application Programming Interface">API</acronym> from arrays were coming back pretty slowly, with a number of extra moves and extraneous temporary variables. So, I eliminated the pre-calculated key schedule altogether in favor of calculating the key permutations on the fly as in Fajardo&#8217;s implementation. With some specially crafted local vectors for the key and tweak values, this changed the key permutation lines in my previously listed source to look something like this<sup>1</sup>:</p>
<pre class="brush: csharp; light: true;">
bB = bB + k1 + t0h;
bA = bB + bA + k0 + t1l;
</pre>
<p>Later subkeys add in a subkey counter as noted in the <a href="http://www.skein-hash.info/sites/default/files/skein1.2.pdf">Skein specification</a>.</p>
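<p>As a rough scalar illustration of what calculating the key permutations on the fly involves, here is a minimal sketch of the Threefish-256 subkey rule from the Skein specification. The class and method names are hypothetical and appear in neither Fajardo&#8217;s code nor my <acronym title="Single Instruction, Multiple Data">SIMD</acronym> version. The sketch assumes the key has already been extended to five words and the tweak to three words, as the specification describes.</p>
<pre class="brush: csharp;">
static class SubkeySketch
{
    // Hypothetical helper: returns the four words of subkey number s.
    // k holds the five extended key words and t the three extended tweak
    // words, both assumed to be prepared as the specification describes.
    public static ulong[] Subkey(ulong[] k, ulong[] t, int s)
    {
        unchecked
        {
            return new ulong[]
            {
                k[s % 5],
                k[(s + 1) % 5] + t[s % 3],
                k[(s + 2) % 5] + t[(s + 1) % 3],
                k[(s + 3) % 5] + (ulong) s  // the subkey counter
            };
        }
    }
}
</pre>
<p>For the first subkey the counter is zero, so the four words reduce to <code>k0</code>, <code>k1 + t0</code>, <code>k2 + t1</code>, and <code>k3</code>.</p>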
<p>With these changes made, I re-ran my tests. This was the optimization I needed. The <acronym title="Single Instruction, Multiple Data">SIMD</acronym>-enabled Threefish256 algorithm was now running in less than half the execution time of the reference implementation: 3.8 <acronym title="nanoseconds">ns</acronym> to 9.3 <acronym title="nanoseconds">ns</acronym>. I had been able to significantly beat my hypothesis.</p>
<p><small>Output of test/timing program<sup>2</sup></small></p>
<pre class="brush: plain; class-name: with-caption; light: true; wrap-lines: false;">
mgriep@metis:~$ mono-2.4 mono -O=all,-shared Program.exe
Encrypt:
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
61773FAC03A37433 B1EFF698BC88802B 22A47D9E7D7F8005 1EC6162F214FB0EC
Good!
Cross Decrypt:
0000000000000000 0000000000000001 0000000000000002 0000000000000003
0000000000000000 0000000000000001 0000000000000002 0000000000000003
Good!
Speed test of 100000000 iterations with an unroll of 100
Encryption:
Non Simd: 00:15:49.8701366, Average: 00:00:00.0000094
    Simd: 00:06:39.0892858, Average: 00:00:00.0000039
Speedup with parallelism = 2: 2.38009430570386
Efficiency: 1.19004715285193
Decryption:
Non Simd: 00:15:40.1476703, Average: 00:00:00.0000094
    Simd: 00:06:29.0575232, Average: 00:00:00.0000038
Speedup with parallelism = 2: 2.41647472221403
Efficiency: 1.20823736110702
Overhead: 00:00:00.0026434, Average: 00:00:00
</pre>
<p>The speedup, approximately 2.4, equates to an efficiency of 120%. This is a super-linear speedup for degree of parallelism equal to two. I assumed parallelism = 2 because two long integer operations can take place simultaneously through the <code>XMM</code> registers. Just moving to the <code>XMM</code> registers incurs some additional overhead, so a super-linear speedup was unexpected.</p>
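<p>For reference, the speedup and efficiency figures in the output follow from two simple ratios. This is a minimal sketch of that arithmetic with hypothetical names, not the actual code of my test program:</p>
<pre class="brush: csharp;">
using System;

static class SpeedupSketch
{
    // Speedup = time of the baseline run / time of the optimized run.
    public static double Speedup(TimeSpan baseline, TimeSpan optimized)
    {
        return (double) baseline.Ticks / optimized.Ticks;
    }

    // Efficiency = speedup / assumed degree of parallelism.
    public static double Efficiency(double speedup, int parallelism)
    {
        return speedup / parallelism;
    }
}
</pre>
<p>Feeding in the encryption totals from the output above (<code>TimeSpan.Parse("00:15:49.8701366")</code> and <code>TimeSpan.Parse("00:06:39.0892858")</code>) with a parallelism of 2 gives a speedup of about 2.38 and an efficiency of about 1.19, matching the printed values.</p>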
<p>The extra speedup came from a combination of architecture and the Threefish algorithm. Specifically, Threefish operates on 64-bit words. As my computer was 32-bit, long integer operations are broken up by the main processor into multiple micro-ops and therefore aren&#8217;t quite as efficient as on a 64-bit processor.</p>
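<p>To make that cost concrete, here is a sketch of a single 64-bit addition expressed only with 32-bit operations, roughly the kind of work a 32-bit integer unit has to do. It is an illustration with a hypothetical name, written for this explanation, and not what the Mono <acronym title="Just-in-Time">JIT</acronym> literally emits.</p>
<pre class="brush: csharp;">
static class WideAddSketch
{
    // Adds two 64-bit values using only 32-bit halves and an explicit carry.
    public static ulong Add64ViaHalves(ulong a, ulong b)
    {
        unchecked
        {
            uint aLo = (uint) a, aHi = (uint) (a &gt;&gt; 32);
            uint bLo = (uint) b, bHi = (uint) (b &gt;&gt; 32);

            uint lo = aLo + bLo;                  // low halves, may wrap around
            uint carry = lo &lt; aLo ? 1u : 0u;      // wrap-around means a carry out
            uint hi = aHi + bHi + carry;          // high halves plus the carry

            return ((ulong) hi &lt;&lt; 32) | lo;
        }
    }
}
</pre>
<p>One logical addition becomes two additions plus carry handling, and a 64-bit rotation needs even more shuffling between halves, whereas the <code>XMM</code> registers handle each 64-bit lane in one instruction.</p>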
<p>Thus, by moving the arithmetic out of the integer processor and onto the <acronym title="Streaming SIMD Extensions">SSE</acronym> processor, I was able to gain the benefit of the larger word width. Add to this the parallelism gained by performing two operations at a time, and subtract the overhead from moving data into and out of the <code>XMM</code> registers, and you get the exhibited performance.</p>
<p>While I haven&#8217;t tested on the 64-bit platform, I would expect the <acronym title="Single Instruction, Multiple Data">SIMD</acronym> version to be faster, but not by as much as on the 32-bit platform, since the larger word width would eat some of the efficiency gain from moving onto the <acronym title="Streaming SIMD Extensions">SSE</acronym> processor.</p>
<p>Overall, I was very happy with the experience developing with Mono.Simd. There are several things that I would love to see added or changed (aligned moves by default, alignable objects, the <code>psllq %xmm1,%xmm0</code> instruction<sup>3</sup> exposed). Nonetheless, the <acronym title="Application Programming Interface">API</acronym> is still evolving, and I expect I&#8217;ll find myself contributing a few patches to scratch some of the itches I&#8217;ve come across.</p>
<p>I hope that you&#8217;ve found this an interesting walkthrough of my first experience with Mono.Simd. If you want to inspect my code in more detail, check out my <a href="http://github.com/neoeinstein/xpdm/tree/simd-test">xpdm Github repository</a>. You&#8217;ll find <code>Threefish256Simd</code> under the <code>Xpdm.Security.Cryptography</code> namespace<sup>4</sup>. I&#8217;d love to hear about your experiences with Mono.Simd, if any be had!</p><ol class="footnotes"><li id="footnote_0_114" class="footnote">Redundant <code>LoadAligned</code> and <code>StoreAligned</code> calls have been removed for readability</li><li id="footnote_1_114" class="footnote">This output was created with the Skein v1.1 specification on an EeePC 900A with an x86 Intel Atom processor.</li><li id="footnote_2_114" class="footnote"><code>psllq %xmm1,%xmm0</code> is a packed quadword (long) logical shift left where the components of <code>XMM0</code> are shifted left the number of bits specified by the respective component in <code>XMM1</code>.</li><li id="footnote_3_114" class="footnote">The code has not yet been updated to use the new Skein v1.2 rotation constants</li></ol>]]></content:encoded>
			<wfw:commentRss>http://blog.xpdm.us/2009/10/05/skein-threefish-and-mono-simd-part-4/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Skein, Threefish, and Mono.Simd &#8212; Interlude</title>
		<link>http://blog.xpdm.us/2009/10/03/skein-threefish-and-mono-simd-interlude/</link>
		<comments>http://blog.xpdm.us/2009/10/03/skein-threefish-and-mono-simd-interlude/#comments</comments>
		<pubDate>Sat, 03 Oct 2009 14:00:16 +0000</pubDate>
		<dc:creator>Marcus Griep</dc:creator>
				<category><![CDATA[Mono]]></category>
		<category><![CDATA[SHA3]]></category>
		<category><![CDATA[SIMD]]></category>

		<guid isPermaLink="false">http://blog.xpdm.us/?p=105</guid>
		<description><![CDATA[Having added in aligned moves into as much of my Threefish implementation as I could in my last article, I thought that I&#8217;d take a moment to explain why this is a stable change that shouldn&#8217;t cause NullReferenceExceptions at random. Currently, Mono.Simd is only supported in accelerated mode on the x86 architecture. Support for x64 is [...]]]></description>
			<content:encoded><![CDATA[<p>Having added aligned moves to as much of my Threefish implementation as I could in my <a href="http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-3/" title='Skein, Threefish, and Mono.Simd — Part 3'>last article</a>, I thought that I&#8217;d take a moment to explain why this is a stable change that shouldn&#8217;t cause <code>NullReferenceException</code>s at random.</p>
<p>Currently, Mono.Simd is only supported in accelerated mode on the x86 architecture. Support for x64 is in progress (the subject of a 2009 Mono <a href="http://code.google.com/soc/">Google Summer of Code</a> <a href="http://www.mono-project.com/StudentProjects#Mono.Simd_ports" title="Mono.Simd ports">project</a>) and other architectures are planned.</p>
<p>Within the Mono framework on x86 and x64, the <acronym title="Just-in-Time">JIT</acronym>-compiler makes a guarantee that the stack will be aligned to 16 bytes in each frame. This is good because it meshes well with the requirements of <acronym title="Streaming SIMD Extensions">SSE</acronym>. Depending on the local variables present and temporary variables created by the compiler, the byte-alignment of a block of <code>Vector..</code> variables can be ensured by adding appropriate padding.</p>
<p>This doesn&#8217;t solve for the general case, though. Class or struct members that are <code>Vector..</code>s may end up on the heap, where they may not be aligned, and thus unaligned moves would be necessary without some more defined means to specify alignment.</p>
<p>This lack of a guarantee is not a new problem, especially for people who needed to <a href="http://www.bluebytesoftware.com/blog/2007/01/23/GuaranteeingCLRDataAlignmentAtNByteBoundaries.aspx" title="Joe Duffy's Weblog: Guaranteeing CLR Data Alignment At N-Byte Boundaries">interact with native <acronym title="Streaming SIMD Extensions">SSE</acronym> code</a> before Mono.Simd. One option is to create a struct with an explicit layout, which is a union with different 4-byte padding combinations, accepting a 12-byte overhead:</p>
<pre class="brush: csharp;">
[StructLayout(LayoutKind.Explicit)]
public struct AtLeastOneIsAligned {
  [FieldOffset( 0)] public Vector2ul V0;
  [FieldOffset( 4)] public Vector2ul V1;
  [FieldOffset( 8)] public Vector2ul V2;
  [FieldOffset(12)] public Vector2ul V3;
}
</pre>
<p>You would then need to do some finagling to determine which <code>V*</code> was the aligned vector. It is plausible, but a lot of extra overhead for the convenience.</p>
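<p>The finagling amounts to a bit of modular arithmetic. Here is a minimal sketch with a hypothetical helper name, assuming you have already obtained the struct&#8217;s base address (for example inside an <code>unsafe</code>/<code>fixed</code> block, not shown) and that the base address is at least 4-byte aligned:</p>
<pre class="brush: csharp;">
static class AlignmentSketch
{
    // Returns the byte offset (0, 4, 8 or 12) from baseAddress to the next
    // 16-byte boundary, assuming baseAddress is non-negative and 4-byte aligned.
    public static int OffsetToAligned(long baseAddress)
    {
        return (int) ((16 - baseAddress % 16) % 16);
    }
}
</pre>
<p>An offset of 0 selects <code>V0</code>, 4 selects <code>V1</code>, 8 selects <code>V2</code>, and 12 selects <code>V3</code>. The result only stays valid while the struct does not move, which is part of why this approach is awkward on a garbage-collected heap.</p>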
<p>Of course, it would be much more efficient for the Mono framework to implicitly make the guarantee that local <code>Vector..</code> variables will be aligned to 16-byte boundaries. Then, when working with local variables, the <acronym title="Just-in-Time">JIT</acronym>-compiler could be empowered to emit the more expedient aligned moves. Additionally, local <code>Vector..</code>s can be more efficiently initialized with aligned moves. The framework could go further, adding a class attribute that would guarantee <img src="http://blog.xpdm.us/wp-content/plugins/easy-latex/cache/tex_4d772f5e96465271fa5435c1f501ec43.png" title="2^x" style="vertical-align:-20%;" class="tex" alt="2^x" />-byte alignment on the heap, getting much closer to the ideal of being able to eliminate unaligned moves, albeit at the expense of a greater constraint on the garbage collector.</p>
<p>In my <a href="http://blog.xpdm.us/2009/10/05/skein-threefish-and-mono-simd-part-4/" title='Skein, Threefish, and Mono.Simd — Part 4'>final article</a>, I&#8217;ll jump back to Threefish256. In that article, I will show one more optimization I was able to make at the C# level, discuss my results compared to the reference, and make a couple more recommendations.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.xpdm.us/2009/10/03/skein-threefish-and-mono-simd-interlude/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Skein, Threefish, and Mono.Simd &#8212; Part 3</title>
		<link>http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-3/</link>
		<comments>http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-3/#comments</comments>
		<pubDate>Fri, 02 Oct 2009 18:00:45 +0000</pubDate>
		<dc:creator>Marcus Griep</dc:creator>
				<category><![CDATA[Mono]]></category>
		<category><![CDATA[book:isbn=185233794X]]></category>
		<category><![CDATA[SHA3]]></category>
		<category><![CDATA[SIMD]]></category>

		<guid isPermaLink="false">http://blog.xpdm.us/?p=92</guid>
		<description><![CDATA[Now that I had tweaked my base Threefish256 source code as best I could, it was time to look for more inefficiencies in the generated machine code. Here&#8217;s where I left the excerpt at the end of the last article: bB = bB + keySchedule[0].GetVectorAligned(2); bA = bB + bA + keySchedule[0].GetVectorAligned(0); bTempA = Vector2ul.Zero.UnpackLow(bB); bTempB [...]]]></description>
			<content:encoded><![CDATA[<p>Now that I had tweaked my base Threefish256 source code as best I could, it was time to look for more inefficiencies in the generated machine code. Here&#8217;s where I left the excerpt at the end of the <a href="http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-2/" title='Skein, Threefish, and Mono.Simd — Part 2'>last article</a>:</p>
<pre class="brush: csharp; highlight: [3];">
bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB &lt;&lt; 16 | bTempB &gt;&gt; (64 - 16);
bTempA = bTempA &lt;&lt; 14 | bTempA &gt;&gt; (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));
</pre>
<p>One line of interest above, line 3, expands to the following machine code:</p>
<pre class="brush: plain;">
   196c0:	66 0f ef c0          	pxor   %xmm0,%xmm0
   196c4:	0f 11 45 b0          	movups %xmm0,-0x50(%ebp)
   196c8:	c7 45 a8 00 00 00 00 	movl   $0x0,-0x58(%ebp)
   196cf:	c7 45 ac 00 00 00 00 	movl   $0x0,-0x54(%ebp)
   196d6:	c7 45 b4 00 00 00 00 	movl   $0x0,-0x4c(%ebp)
   196dd:	c7 45 b0 00 00 00 00 	movl   $0x0,-0x50(%ebp)
   196e4:	c7 45 bc 00 00 00 00 	movl   $0x0,-0x44(%ebp)
   196eb:	c7 45 b8 00 00 00 00 	movl   $0x0,-0x48(%ebp)
   196f2:	0f 10 45 b0          	movups -0x50(%ebp),%xmm0
   196f6:	0f 28 d8             	movaps %xmm0,%xmm3
   196f9:	66 0f 6c dc          	punpcklqdq %xmm4,%xmm3
   196fd:	0f 11 5d c0          	movups %xmm3,-0x40(%ebp)
</pre>
<p>The first 10 lines in the machine code above don&#8217;t seem to do too much. First, <code>XMM0</code> is zeroed-out and moved to a variable on the stack. Then, the <code>movl</code> instructions zero-out the same variable and the same-sized space above and below it on the stack. Why? I don&#8217;t know. Line 9 then reads the variable off the stack (<code>&lt;0,0&gt;</code>) into <code>XMM0</code>. Line 10 then copies this zero to <code>XMM3</code> and <code>XMM0</code> is not used until it is clobbered in the next source code line. It isn&#8217;t until line 11 that the unpack operation executes.</p>
<p>A large part of me feels that <code>Vector2ul.Zero</code> could have been achieved with a <code>pxor</code> without the extra nine instructions.</p>
<p>Beyond that, however, there is a difference between lines 9 and 10: they use two different move operations, <code>movups</code> and <code>movaps</code>. A Google search led me to the <em><acronym title="Single Instruction, Multiple Data">SIMD</acronym> Programming Manual for Linux and Windows</em> (shown in the sidebar), a preview of which included this nice tidbit on <code>movups</code>:</p>
<blockquote><p>This performance overhead is sufficient that it often pays to use the MMX registers rather than the XMM registers if unaligned loads and stores must be used.</p></blockquote>
<p>Moves with <code>movaps</code> have an alignment restriction; memory operands must be aligned to 16-byte boundaries or a &#8220;general protection&#8221; fault is raised. Moves with <code>movups</code> don&#8217;t have this restriction, but are much slower; so much so that, as the <em><acronym title="Single Instruction, Multiple Data">SIMD</acronym> Manual</em> states, it is often faster to forgo <acronym title="Streaming SIMD Extensions">SSE</acronym> altogether. This speed assertion explained well what I was seeing, and provided me with a new avenue for attack: eliminating as many unaligned moves as possible.</p>
<p>The Mono.Simd <acronym title="Application Programming Interface">API</acronym> offers the ability to explicitly specify which packed loads and stores should be emitted as aligned through a pair of static methods. Unfortunately, they totally destroy the readability of the code when implemented wholesale. Additionally, the <code>StoreAligned</code> method takes a <code>ref</code> parameter rather than an <code>out</code> parameter, thus requiring that variables be initialized before they can be assigned with an aligned move. The <acronym title="Application Programming Interface">API</acronym> currently does all initializations using unaligned moves, so there is a minimum of one unaligned move per local/temporary vector in a method.</p>
<p>Taking the readability hit, I gave the &#8220;aligned move&#8221;-theory a try. Additionally, I initialized a local variable, <code>bZero</code>, with <code>&lt;0,0&gt;</code> so that I wouldn&#8217;t incur the unnecessary extra nine instructions and unaligned stores every time I wanted a zero vector. After a whole lot of replacement, here is what the code looked like:</p>
<pre class="brush: csharp; wrap-lines: false;">
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bB) + keySchedule[0].GetVectorAligned(2));
Vector2ul.StoreAligned(ref bA, Vector2ul.LoadAligned(ref bB) + Vector2ul.LoadAligned(ref bA) + keySchedule[0].GetVectorAligned(0));
Vector2ul.StoreAligned(ref bTempA, Vector2ul.LoadAligned(ref bZero).UnpackLow(Vector2ul.LoadAligned(ref bB)));
Vector2ul.StoreAligned(ref bTempB, Vector2ul.LoadAligned(ref bZero).UnpackHigh(Vector2ul.LoadAligned(ref bB)));
Vector2ul.StoreAligned(ref bTempB, Vector2ul.LoadAligned(ref bTempB) &lt;&lt; 16 | Vector2ul.LoadAligned(ref bTempB) &gt;&gt; (64 - 16));
Vector2ul.StoreAligned(ref bTempA, Vector2ul.LoadAligned(ref bTempA) &lt;&lt; 14 | Vector2ul.LoadAligned(ref bTempA) &gt;&gt; (64 - 14));
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bTempA).UnpackHigh(Vector2ul.LoadAligned(ref bTempB)));
Vector2ul.StoreAligned(ref bB, Vector2ul.LoadAligned(ref bB) ^ Vector2ul.LoadAligned(ref bA));
Vector2ul.StoreAligned(ref bB, (Vector2ul)(((Vector4ui) Vector2ul.LoadAligned(ref bB)).Shuffle(XYZWtoZWXY)));
</pre>
<p>Once it was all worked up and ready for test, I fired it up and was immediately hit with:</p>
<pre class="brush: plain; light: true; wrap-lines: false;">
Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object
  at Xpdm.Security.Cryptography.Threefish256Simd.Encrypt (System.UInt64[] input, System.UInt64[] output) [0x00000]
  at Xpdm.Security.Cryptography.Tests.Program.Main () [0x00000]
</pre>
<p>The meaning of this exception was initially unclear to me, as I was not dereferencing anything that should be null. Then I remembered that an aligned load of an unaligned memory location will generate the same general protection fault as a bad dereference. I figured that this likely had to do with the position of variables on the stack, so I played around adding a couple of dummy variables. I finally got it running using one long integer as a stack buffer.</p>
<p>The speedup I got from these optimizations was significant, but the <acronym title="Single Instruction, Multiple Data">SIMD</acronym>-version was still about 20% slower than the Fajardo reference implementation. I had a feeling that I&#8217;d be able to eke out a few more performance increases, which I&#8217;ll discuss in my <a href="http://blog.xpdm.us/2009/10/03/skein-threefish-and-mono-simd-interlude/" title='Skein, Threefish, and Mono.Simd — Interlude'>next article</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-3/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Skein, Threefish, and Mono.Simd &#8212; Part 2</title>
		<link>http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-2/</link>
		<comments>http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-2/#comments</comments>
		<pubDate>Fri, 02 Oct 2009 14:30:45 +0000</pubDate>
		<dc:creator>Marcus Griep</dc:creator>
				<category><![CDATA[Mono]]></category>
		<category><![CDATA[SHA3]]></category>
		<category><![CDATA[SIMD]]></category>

		<guid isPermaLink="false">http://blog.xpdm.us/?p=58</guid>
		<description><![CDATA[Having established a baseline for the unrolled implementation of Threefish256 on the x86 integer processing unit in my last post, I was ready to re-target the algorithm to the SSE processing instructions via Mono.Simd. My first working implementation of Threefish256 was pretty similar to the non-SIMD version. The main differences were some revisions to the Mix, [...]]]></description>
			<content:encoded><![CDATA[<p>Having established a baseline for the unrolled implementation of Threefish256 on the x86 integer processing unit in my <a href="http://blog.xpdm.us/2009/10/01/skein-threefish-and-mono-simd-part-1/" title='Skein, Threefish, and Mono.Simd — Part 1'>last post</a>, I was ready to re-target the algorithm to the <acronym title="Streaming SIMD Extensions">SSE</acronym> processing instructions via Mono.Simd. My first working implementation of Threefish256 was pretty similar to the non-<acronym title="Single Instruction, Multiple Data">SIMD</acronym> version. The main differences were some revisions to the <code>Mix</code>, <code>UnMix</code>, and <code>Rotate</code> functions to inject the Mono <acronym title="Single Instruction, Multiple Data">SIMD</acronym> intrinsics. I also added a function that pre-generated the full key schedule.</p>
<p><small>One key permutation and one round with Mono.Simd</small></p>
<pre class="brush: csharp; class-name: with-caption;">
bA = bA + keySchedule[0].GetVectorAligned(0);
bB = bB + keySchedule[0].GetVectorAligned(2);
Mix(ref bA, ref bB, 14, 16);
SwapComponents(ref bB);
</pre>
<p>In the above, I am able to use the <code>XMM</code> registers well by packing the in-cipher long integers (<code>b0</code>, <code>b1</code>, <code>b2</code>, and <code>b3</code> from the previous post) into vectors. In the above source, <code>bA</code> is <code>&lt;b0,b2&gt;</code> and <code>bB</code> is <code>&lt;b1,b3&gt;</code>. The components of <code>bB</code> are swapped between rounds according to the permutation schedule while <code>bA</code>&#8217;s components do not need to be switched.</p>
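<p>To show why this packing works, here is a scalar sketch in which plain two-element arrays stand in for the <code>Vector2ul</code> values. The class name and the array-based signatures are hypothetical and do not use the Mono.Simd <acronym title="Application Programming Interface">API</acronym>; the point is only to illustrate the lane layout and the swap between rounds.</p>
<pre class="brush: csharp;">
static class LaneSketch
{
    // Rotate a 64-bit word left by n bits (0 &lt; n &lt; 64).
    public static ulong RotateLeft(ulong x, int n)
    {
        return (x &lt;&lt; n) | (x &gt;&gt; (64 - n));
    }

    // a holds &lt;b0,b2&gt; and b holds &lt;b1,b3&gt;. Each lane gets the same
    // add, rotate, xor sequence; only the rotation amount differs per lane.
    public static void MixLanes(ulong[] a, ulong[] b, int rot0, int rot1)
    {
        unchecked
        {
            a[0] += b[0];
            a[1] += b[1];
        }
        b[0] = RotateLeft(b[0], rot0) ^ a[0];
        b[1] = RotateLeft(b[1], rot1) ^ a[1];
    }

    // Swap the lanes of b so the next round pairs b0 with b3 and b2 with b1.
    public static void SwapComponents(ulong[] b)
    {
        ulong tmp = b[0];
        b[0] = b[1];
        b[1] = tmp;
    }
}
</pre>
<p>Calling <code>MixLanes(a, b, 14, 16)</code>, then <code>SwapComponents(b)</code>, then <code>MixLanes(a, b, 52, 57)</code> performs the same work as the first four scalar <code>Mix</code> calls from the previous post, minus the key injection.</p>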
<p>After a good deal of feeling my way around the Mono.Simd <acronym title="Application Programming Interface">API</acronym>, I struck paydirt and found myself with working, <acronym title="Single Instruction, Multiple Data">SIMD</acronym>-enabled encrypt and decrypt functions. Paydirt wasn&#8217;t quite so sweet, however. Running my same speed test, I found that my <acronym title="Single Instruction, Multiple Data">SIMD</acronym> version was running at half the speed of the non-<acronym title="Single Instruction, Multiple Data">SIMD</acronym> version, about 18 <acronym title="nanoseconds">ns</acronym> to 9 <acronym title="nanoseconds">ns</acronym>, respectively.</p>
<p>How could the <acronym title="Single Instruction, Multiple Data">SIMD</acronym> version be twice as slow? The answer lay in the generated machine code. Mono provides a great feature for this, <a href="http://www.mono-project.com/AOT">Ahead-of-Time Compilation</a>. The Mono <acronym title="Ahead-of-Time">AOT</acronym> compiler generates a shared object <code>.so</code> file with the platform-specific machine code. This can be used for some <acronym title="Just-in-Time">JIT</acronym>-heavy executables to speed start-up time, since the bootstrapping has already been done. In my case, it allows me to disassemble the object file to inspect the generated machine code. My tool for the purpose was <code>objdump</code>.</p>
<p><small>Shell transcript using <acronym title="Ahead-of-Time">AOT</acronym> and <code>objdump</code> to get machine code:</small></p>
<pre class="brush: plain; class-name: with-caption; collapse: true; light: false; toolbar: true; wrap-lines: false;">
mgriep@metis:~$ mono-2.4 mono --aot -O=all,-shared Xpdm.Security.Cryptography.dll
Mono Ahead of Time compiler - compiling assembly /home/mgriep/Xpdm.Security.Cryptography.dll
reloc: got at 20
Code: 146706 Info: 359 Ex Info: 405 Class Info: 664 PLT: 29 GOT Info: 370 GOT Info Offsets: 192 GOT: 240
section .text aligned to 151824 from 151820
subsection 1 of .text added at offset 151824 (align: 0)
num_sections: 3
dynsym: 34, dynstr size: 432
section .data, size: 24, 18
section .bss, size: 240, f0
section .text, size: 155488, 25f60
Compiled 77 out of 77 methods (100%)
Methods without GOT slots: 56 (72%)
Direct calls: 2184 (99%)
JIT time: 147 ms, Generation time: 10 ms, Assembly+Link time: 4 ms.
GOT slot distribution:
	methodconst: 7
	switch: 3
	image: 1
	vtable: 12
	ldstr: 1
	delegate_trampoline: 7
mgriep@metis:~$ objdump -x Xpdm.Security.Cryptography.dll.so | grep -A1 Xpdm.Security.Cryptography.Threefish256Simd:Encrypt
000214b0 l     F .text	00003974 Xpdm.Security.Cryptography.Threefish256Simd:Decrypt (ulong[],ulong[])
0001db80 l     F .text	0000392b Xpdm.Security.Cryptography.Threefish256Simd:Encrypt (ulong[],ulong[])
mgriep@metis:~$ objdump -d Xpdm.Security.Cryptography.dll.so --start-address=0x0001db80 --stop-address=0x000214b0
...
</pre>
<p>The first thing that I noticed is that only a few <code>XMM</code> registers were getting used, usually <code>XMM0</code> and <code>XMM1</code>. This was unexpected since I was operating on multiple vectors at any given time, and leaving the other 6 registers fallow seemed inefficient. The static utility functions, <code>Mix</code>, <code>UnMix</code>, and <code>Rotate</code> seemed to be getting in the way of using the registers more fully since they would require repositioning the stack pointer, re-loading inside the method, performing the operation, storing, and re-positioning back. In the spirit of removing the unnecessary cruft to get closer to the answer, I fully inlined the utility functions.</p>
<p>Additionally, in the assembly code instructions &#8220;between&#8221; the lines of source code, there would be a symmetric store and load to/from a stack location and the same <code>XMM</code> register. In an attempt to increase the temporal locality of the vectors, and hopefully reduce the store/loads, I did some reordering of arguments within the source code. As it stood after this finagling, here was the state of my source:</p>
<p><small>Key permutation and one round after inlining and reordering:</small></p>
<pre class="brush: csharp; class-name: with-caption;">
bB = bB + keySchedule[0].GetVectorAligned(2);
bA = bB + bA + keySchedule[0].GetVectorAligned(0);
bTempA = Vector2ul.Zero.UnpackLow(bB);
bTempB = Vector2ul.Zero.UnpackHigh(bB);
bTempB = bTempB &lt;&lt; 16 | bTempB &gt;&gt; (64 - 16);
bTempA = bTempA &lt;&lt; 14 | bTempA &gt;&gt; (64 - 14);
bB = bTempA.UnpackHigh(bTempB);
bB = bB ^ bA;
bB = (Vector2ul)(((Vector4ui) bB).Shuffle(XYZWtoZWXY));
</pre>
<p>Neither of these changes had much effect on the overall speed, however. In my <a href="http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-3/" title='Skein, Threefish, and Mono.Simd — Part 3'>next post</a>, I take a deeper look at the generated machine code to find my next optimization target.</p>]]></content:encoded>
			<wfw:commentRss>http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Skein, Threefish, and Mono.Simd &#8212; Part 1</title>
		<link>http://blog.xpdm.us/2009/10/01/skein-threefish-and-mono-simd-part-1/</link>
		<comments>http://blog.xpdm.us/2009/10/01/skein-threefish-and-mono-simd-part-1/#comments</comments>
		<pubDate>Thu, 01 Oct 2009 22:00:17 +0000</pubDate>
		<dc:creator>Marcus Griep</dc:creator>
				<category><![CDATA[Mono]]></category>
		<category><![CDATA[SHA3]]></category>
		<category><![CDATA[SIMD]]></category>

		<guid isPermaLink="false">http://blog.xpdm.us/?p=15</guid>
		<description><![CDATA[Mono.Simd is a developing framework introducing SIMD intrinsics into the .NET/Mono framework.  Ever since Miguel introduced the world to Mono.Simd, I have been very interested in getting my hands dirty with the new API. SIMD, or Single Instruction Multiple Data, instructions are special instructions that can concurrently perform the same or related functions on multiple [...]]]></description>
			<content:encoded><![CDATA[<p>Mono.Simd is a developing framework introducing <acronym title="Single Instruction, Multiple Data">SIMD</acronym> intrinsics into the .NET/Mono framework.  Ever since Miguel <a title="Mono's SIMD Support: Making Mono safe for Gaming" href="http://tirania.org/blog/archive/2008/Nov-03.html">introduced the world to Mono.Simd</a>, I have been very interested in getting my hands dirty with the new <acronym title="Application Programming Interface">API</acronym>.</p>
<p><a title="Wikipedia: Single Instruction Multiple Data" href="http://en.wikipedia.org/wiki/SIMD"><acronym title="Single Instruction, Multiple Data">SIMD</acronym></a>, or Single Instruction Multiple Data, instructions are special instructions that can concurrently perform the same or related functions on multiple data sets, or vectors, e.g. adding four integers to four other integers, respectively, as opposed to summing each pair in sequence. <acronym title="Single Instruction, Multiple Data">SIMD</acronym> instructions on x86 processors use special <code>XMM</code> registers. Examples of <acronym title="Single Instruction, Multiple Data">SIMD</acronym> instruction sets include the <a title="Wikipedia: Streaming SIMD Extensions" href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions"><acronym title="Streaming SIMD Extensions">SSE</acronym> family</a> and AMD&#8217;s <a title="Wikipedia: 3DNow!" href="http://en.wikipedia.org/wiki/3DNow!">3DNow!</a>. <acronym title="Single Instruction, Multiple Data">SIMD</acronym> instructions are most effectively put to use processing large data sets that have similar sets of operations repeated across the data set. Such effective applications include graphical rendering and some physics simulations.</p>
<p>Unfortunately for me, I am mainly a front-end applications developer by trade. Thus, the opportunity to work with low-level intrinsic functions hadn&#8217;t come around in the nine months since their announcement, so I deliberately set out to find a little side project wherein I could test out the new instructions.</p>
<p>There is very little out there on the Internet regarding the Mono.Simd <acronym title="Application Programming Interface">API</acronym> except for Miguel&#8217;s blog post, the Mono source code, and the <a title="Monodoc: Mono.Simd API" href="http://go-mono.com/docs/index.aspx?tlink=0%40N%3aMono.Simd">Monodoc</a>. With that in mind, I found a promising candidate for implementing with Mono.Simd in one of the SHA-3 hash algorithm candidates, specifically <a title="Skein Project Page" href="http://www.skein-hash.info">Skein</a>. The project&#8217;s homepage noted that there were already two implementations of the algorithm in .NET, which gave me a good basis to start from.</p>
<p>I chose the in-progress implementation of <a title="Skeinfish at Google Code" href="http://code.google.com/p/skeinfish/">Skein by Alberto Fajardo</a> as the basis for my work since it looked to be pretty clean and meshed alright with the <acronym title="Base Class Libraries">BCL</acronym> standards for <code>System.Security.Cryptography</code> with a few exceptions. Fajardo&#8217;s implementation is set up so that the Threefish cipher runs are completely unrolled, meaning no loop overhead in a function already a likely candidate for a tight loop. I chose Threefish256 to implement using Mono.Simd due to its relative simplicity. My preliminary tests<sup>1</sup> with Threefish256 showed an average speed of 9.2 <acronym title="nanoseconds">ns</acronym> per encrypt operation.</p>
<p><small class="code-caption">An example of the first subkey permutation and four rounds of Skein 1.2:</small></p>
<pre class="brush: csharp; class-name: with-caption;">
Mix(ref b0, ref b1, 14, k0, k1 + t0);
Mix(ref b2, ref b3, 16, k2 + t1, k3);
Mix(ref b0, ref b3, 52);
Mix(ref b2, ref b1, 57);
Mix(ref b0, ref b1, 23);
Mix(ref b2, ref b3, 40);
Mix(ref b0, ref b3,  5);
Mix(ref b2, ref b1, 37);
</pre>
<p>Based on this, my working hypothesis is that I will be able to get the <acronym title="Single Instruction, Multiple Data">SIMD</acronym> version of the function to run in about 75-55% of the time of the non-<acronym title="Single Instruction, Multiple Data">SIMD</acronym> version, or 7&#8211;5 <acronym title="nanoseconds">ns</acronym>. This series will continue in my <a href="http://blog.xpdm.us/2009/10/02/skein-threefish-and-mono-simd-part-2/" title='Skein, Threefish, and Mono.Simd — Part 2'>next post</a>, when I&#8217;ll discuss some of the progress I&#8217;ve made and issues I&#8217;ve encountered.</p><ol class="footnotes"><li id="footnote_0_15" class="footnote">Tests were performed on an EeePC 900A, with an Intel Atom x86 32-bit processor and the Mono 2.4 framework. Tests were run for 100,000,000 iterations unrolled in sets of 100 and averaged to get the stated results.</li></ol>]]></content:encoded>
			<wfw:commentRss>http://blog.xpdm.us/2009/10/01/skein-threefish-and-mono-simd-part-1/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
