<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Andreas Gal</title>
	<atom:link href="http://andreasgal.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://andreasgal.com</link>
	<description>Project Scientist, University of California, Irvine</description>
	<pubDate>Tue, 10 Jun 2008 17:51:18 +0000</pubDate>
	<generator>http://wordpress.org/?v=MU</generator>
	<language>en</language>
			<item>
		<title>Trace-Trees FAQ</title>
		<link>http://andreasgal.com/2008/06/02/trace-trees-faq/</link>
		<comments>http://andreasgal.com/2008/06/02/trace-trees-faq/#comments</comments>
		<pubDate>Mon, 02 Jun 2008 09:01:26 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.wordpress.com/?p=67</guid>
		<description><![CDATA[Dave Roberts sent me a couple of questions about trace trees after he saw our work mentioned on Steve Yegge&#8217;s blog. I figured my answers might be interesting to more people than just Dave. 
Most of your papers on trace-trees just describe the behavior of the technique with respect to a single trace tree. That is, [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Dave Roberts sent me a couple of questions about trace trees after he saw our work mentioned on <a href="http://steve-yegge.blogspot.com/2008/05/dynamic-languages-strike-back.html">Steve Yegge&#8217;s blog</a>. I figured my answers might be interesting to more people than just Dave. </p>
<blockquote><p>Most of your papers on trace-trees just describe the behavior of the technique with respect to a single trace tree. That is, as described, you basically find the first inner loop in the program and then trace and compile that, extending it as you find other paths that branch from it. That&#8217;s fine, but how does the system behave with respect to large programs that have many such loops? I&#8217;m assuming that you&#8217;re compiling loops in many methods across a large such program. Are you saving the trace results across all that activity? In other words, if you find a hot loop in method A, then when you finally exit that method and later find a hot loop in method B, do you throw away the work you did for method A and recreate it later, or are you building up bits of compiled code throughout the long-term program run? I assume the latter, but didn&#8217;t really know.</p></blockquote>
<p>Our code is in a bytecode format and initially runs through an interpreter. In principle, each bytecode instruction can be the anchor for a trace tree. The code is interpreted until a particular potential anchor becomes &#8220;hot&#8221; enough to host a tree. At that point we record a trace, compile and execute it, and subsequently try to extend the tree whenever we side-exit from it. We only grow the tree with traces that connect back to the loop header the tree is anchored at, either along a direct path through the loop or along a path through some outer loop. This is not always possible, e.g. if two loops are nested inside another loop. In that case we have to generate nested trees, where an outer tree calls the inner trees: we can&#8217;t easily form a path through the inner and outer loop at the same time, because we would get stuck looping in the other inner loop and the trace would get very long. We use various abort conditions to restrict the maximum size of a trace we attach to a tree. With an unlimited trace length the entire program would eventually attach to each tree we start, which defeats the purpose: we want each tree to represent one hot code region.</p>
<blockquote><p>Assuming you&#8217;re building up bits of code long-term, are there any issues reentering the compiled code from the interpreter when you next execute method A? The papers always describe entering the compiled code as an act that happens right after you record the trace and compile it, but they don&#8217;t really talk about the issues of reentering the same code later. How is this done?</p></blockquote>
<p>Yes, we compile the trace (or tree) and then re-enter it every time the interpreter runs across its anchor point. In our language (JVML) the bytecode is statically typed: at each point in the program (that is, for each bytecode instruction) every variable (local variable slot or stack slot) has exactly one type. The recorded trace is compiled for that fixed type assignment and knows how to pull the values from the interpreter stack and local variable frame. Constant values are detected by the optimizer and embedded directly in the trace instead of being read from the interpreter frame. One could even speculate on certain values. If a boolean value in the local variable frame has been true for N iterations, we could re-compile the tree assuming that value is always true, and insert a guard that ensures this specialized tree is only executed if that slot really contains a boolean true value.</p>
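<p>To make the value speculation idea concrete, here is a minimal sketch in C. It is not our actual compiler output: the Frame layout, the slot numbers and the function names are all made up for illustration, and plain C functions stand in for generated machine code. The guard at the entry checks the speculated slot and falls back to the generic tree if the assumption doesn&#8217;t hold.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interpreter frame: slot 0 holds a loop bound, slot 1 a boolean. */
enum { SLOT_COUNT = 0, SLOT_FLAG = 1, NUM_SLOTS = 2 };

typedef struct {
    int locals[NUM_SLOTS];
} Frame;

/* Stands in for the generic compiled tree: tests the flag on every iteration. */
static int generic_tree(const Frame *f) {
    int sum = 0;
    for (int i = 0; i < f->locals[SLOT_COUNT]; i++)
        sum += f->locals[SLOT_FLAG] ? 2 : 1;
    return sum;
}

/* Stands in for the tree re-compiled under the assumption "flag is true":
 * the per-iteration test has been folded away. */
static int specialized_tree(const Frame *f) {
    int sum = 0;
    for (int i = 0; i < f->locals[SLOT_COUNT]; i++)
        sum += 2;
    return sum;
}

/* Entry point used when the interpreter reaches the anchor. The guard makes
 * sure the specialized tree only runs if the speculation actually holds. */
static int enter_tree(const Frame *f, bool *took_fast_path) {
    assert(f != NULL && took_fast_path != NULL);
    *took_fast_path = (f->locals[SLOT_FLAG] == 1);   /* the guard */
    return *took_fast_path ? specialized_tree(f) : generic_tree(f);
}
```

<p>Calling enter_tree with a frame whose flag slot is 1 takes the specialized path; any other value takes the generic one, so a wrong guess only costs the check at the entry.</p>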
<blockquote>
<div>What about the case where method A contains a loop and calls method B in the loop? Method B also has a loop inside it. Perhaps like the following, in pseudo-Java code:</div>
<pre>public int methodA(int a) {
    // complex way of calculating a^3
    int sum = 0;
    for (int i = 0; i &lt; a; i++) {
        sum += methodB(a);
    }
    return sum;
}

public int methodB(int b) {
    // complex way of calculating b^2
    int sum = 0;
    for (int i = 0; i &lt; b; i++) {
        sum += b;
    }
    return sum;
}</pre>
<div>You would expect the system to detect the loop in B first and compile that. When B gets called again from A, you would expect the interpreter to re-enter the compiled code.</div>
<div>At some point, however, the system will detect the loop in A and then trace and compile that. When that happens, the trace starting in A would inline B, right? And while it&#8217;s tracing through the inlined B, does it just ignore the fact that there is already a compiled trace for the loop in B, unrolling it because it doesn&#8217;t return to the loop head in A? If the trace gets too long, because the loop in B might be much larger than in A, then the trace aborts. Is there a way to make the trace starting in A recognize that it has reached a spot where there is already an old trace in B, and the right behavior might be to somehow incorporate that previous trace instead of completely unrolling the loop in B?</div>
</blockquote>
<div>You hit the nail on the head. That&#8217;s exactly what we do :) We call this &#8220;nested trace trees&#8221; and it&#8217;s Michael Bebenita&#8217;s brainchild. In my original dissertation work I only traced through and compiled the inner loop. The rest of the code was interpreted. As long as the inner loop is a lot hotter than the outer code calling it, this still gives a decent speedup. But in certain cases this of course fails. Michael extended this approach as follows. The inner loop is usually hotter and will trigger a tree being recorded for the inner loop. Eventually the outer loop triggers a tree to be recorded starting at its own header. We follow the trace inside the invoked method and then detect that we have reached a point where we already have a tree (the inner tree). Instead of following the inner tree (which, as you pointed out, we wouldn&#8217;t be able to record without excessive unrolling), we call it (literally call it, like a method call). There are actually two ways to do this call. Either we compile the outer tree and the inner tree together, teaching the inner tree to directly read the values from the registers and spill locations in which the outer tree holds its context values (we call this welding), or we spill all values the inner tree needs from the outer tree onto the stack and then use a more generic invocation mechanism. The latter allows the machine code generated for the inner tree to be reused (saving code space), while the former approach is faster. The nested trace tree construct permits a number of optimizations to be communicated between trees, e.g. whether values that a tree gets handed from an outer tree escape the tree, allowing global analysis and optimization.</div>
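<div>As a rough illustration of the second (generic invocation) variant, the following C sketch models the two trees for the methodA/methodB example as plain functions. The InnerContext struct and the function names are invented for this sketch; real trees are machine code and the spill area lives on the native stack.</div>

```c
#include <assert.h>

/* Hypothetical spill area through which the outer tree hands values
 * to the inner tree and receives the result. */
typedef struct {
    int b;    /* input: loop bound of the inner loop */
    int sum;  /* output: result computed by the inner loop */
} InnerContext;

/* Stands in for the compiled inner tree (the loop in methodB). */
static void inner_tree(InnerContext *ctx) {
    int sum = 0;
    for (int i = 0; i < ctx->b; i++)
        sum += ctx->b;
    ctx->sum = sum;
}

/* Stands in for the compiled outer tree (the loop in methodA). */
static int outer_tree(int a) {
    assert(a >= 0);
    int sum = 0;
    for (int i = 0; i < a; i++) {
        InnerContext ctx = { a, 0 };  /* spill the values the inner tree needs */
        inner_tree(&ctx);             /* call the inner tree instead of unrolling it */
        sum += ctx.sum;
    }
    return sum;
}
```

<div>Because inner_tree only talks to its caller through the context, the same code could be called from any outer tree. Welding would instead compile both loops as one unit and skip the spill.</div>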
<blockquote>
<div>Otherwise it seems like:</div>
<div>  a. you could waste a lot of time trying to keep tracing the loop starting in A and have B blow out the length of your trace buffer. Since tracing is slower than simply interpreting, this would be a net loss in speed.</div>
<div>  b. if you try to unroll another loop fully, even if it doesn&#8217;t result in your trace buffer length being exceeded, it&#8217;s a good way to get very long traces, but the compiled speed of those traces may not be much faster than calling the compiled code in B anyway.</div>
</blockquote>
<div>You are correct. Long traces and excessive &#8220;outerlining&#8221; (inlining of outer loop parts) rarely pay off, mostly because the outer loop parts are less hot than the inner paths, but now they compete for the same register resources as the inner paths. </div>
<blockquote>
<div>  c. it would then seem that loops that occur higher up in the call tree would get pretty large generally, which would bloat things up overall. either that or they wouldn&#8217;t get compiled at all because the traces would all be too long, which means you&#8217;d spend a lot more time doing interpreting.</div>
</blockquote>
<div>Yes. We are currently playing with the parameters, and never outerlining at all and only nesting trees seems to be almost as fast as outerlining in most cases.</div>
<blockquote>
<div>After how many iterations of a loop do you start tracing? You probably don&#8217;t want to do it after 1 loop, but you probably don&#8217;t want to wait until 50 or 100 either. Are we talking small, single-digit numbers here, or 10 or 20 times through the loop?</div>
</blockquote>
<div>We use 2-3 digit numbers to start a tree. The Tamarin Tracing team is using even smaller numbers (low 2 digits). It&#8217;s basically a function of how much overhead compilation incurs vs. interpretation. Tamarin&#8217;s interpreter is really slow (being worked on intensively though), so they try to compile as early as possible.</div>
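<div>The trigger itself is cheap. A minimal sketch, assuming one counter per potential anchor and a threshold of 100 (both the table size and the threshold are picked for illustration, not taken from our VM):</div>

```c
#include <assert.h>
#include <stdbool.h>

#define HOT_THRESHOLD 100  /* assumed value; tuned against compilation cost in practice */
#define MAX_ANCHORS   64   /* assumed size of the counter table */

/* One hit counter per potential anchor (e.g. per loop header). */
static unsigned hit_count[MAX_ANCHORS];

/* Called by the interpreter each time it crosses a potential anchor.
 * Returns true on the crossing at which the counter reaches the threshold,
 * which is the point where recording of a tree would start. */
static bool anchor_became_hot(unsigned anchor_id) {
    assert(anchor_id < MAX_ANCHORS);
    return ++hit_count[anchor_id] == HOT_THRESHOLD;
}
```

<div>A slower interpreter shifts the break-even point, so a VM in Tamarin&#8217;s situation would pick a smaller HOT_THRESHOLD.</div>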
<blockquote>
<div>
<div>You talked about tree folding in this recent blog post. Have you guys written anything about that, or is it too new? It would be interesting to understand the complexity of trying to fold the trees back. One of the nice things about the original trace tree algorithm was that it was relatively simple in concept: just trace a tree and then run a simplified form of SSA over it to compile it.</div>
<div><a href="http://andreasgal.com/2008/02/28/tree-folding/">http://andreasgal.com/2008/02/28/tree-folding/</a></div>
</div>
</blockquote>
<div>A paper on folding is planned for CGO, and we plan on submitting a paper on nested trace trees to PLDI. We were spectacularly unsuccessful selling our trace-compilation work at either venue in the past though, so we will publish the papers in parallel as a technical report. Just check the tech report section of my publications shortly after the respective deadlines. We will also have a submission for VEE. There are a lot of conferences coming up over the summer, and we have a lot of unpublished research piled up.</div>
<blockquote>
<div>Does tree folding complicate your SSA analysis considerably?</div>
</blockquote>
<div>No, it&#8217;s a pre-pass that happens right after a trace is added to a tree. It&#8217;s the only destructive (tree-modifying) optimization. It starts with the old tree state and the new trace, and produces a tree that merges traces as much as possible. That new tree then replaces the old tree. The representation is largely unchanged and the folding implementation doesn&#8217;t touch any of the backend code. The biggest issue with folding is that we have to run (side-exit) along most paths of a deeply branchy code area until everything has been folded, so we get quite a few compilation runs. The nasty 3D Cube example from SunSpider (a JavaScript benchmark) requires some 63 compiler runs for a fairly compact source code loop with nested if-statements inside. Our compiler is very fast though, so this might be tolerable.</div>
<blockquote>
<div>About 8 years ago, we looked at using Insignia&#8217;s GeodeVM in a commercial embedded project I was working on. Their VM was really quite fast. I remember them saying that they would try to identify hot pieces of code and would compile those to native code, but that they would do that on a sub-method basis. I think you mentioned Geode in one of the papers as related work. Do you know what they do versus your trace-tree technique?</div>
</blockquote>
<div>I know about Insignia&#8217;s work only from marketing material and through third party gossip. From what I understand, Insignia uses a bytecode-to-native-code compiler to compile all of the bytecode to native code and then compresses the entire compilation result using gzip. The code is fast to execute, but at the same time pretty compact since it&#8217;s stored in a compressed format. In other words, it&#8217;s a Java VM for embedded systems, similar to my first implementation of a JVM trace compiler, but otherwise largely unrelated as far as the actual approach is concerned. If anyone from Insignia wants to correct me, please go ahead :)</div>
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2008/06/02/trace-trees-faq/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Maxine VM</title>
		<link>http://andreasgal.com/2008/05/22/maxine-vm/</link>
		<comments>http://andreasgal.com/2008/05/22/maxine-vm/#comments</comments>
		<pubDate>Thu, 22 May 2008 23:05:05 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.wordpress.com/?p=65</guid>
		<description><![CDATA[We spent the 2nd day today with the Maxine VM group at Sun Labs to get familiar with the new meta-circular Java VM they are building. Maxine is completely implemented in Java, and makes extensive use of the Java language. Maxine bootstraps itself using its own optimizing compiler which understands how to specifically optimize Maxine [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>We spent the 2nd day today with the Maxine VM group at Sun Labs to get familiar with the new meta-circular Java VM they are building. Maxine is completely implemented in Java, and makes extensive use of the Java language. Maxine bootstraps itself using its own optimizing compiler, which understands how to specifically optimize Maxine VM paradigms and idioms, allowing Bernd and his team to use the full wealth of Java&#8217;s language features without incurring excessive runtime cost. Things like Pointers and Addresses are abstracted as objects, and the optimizing compiler knows how to turn them into machine words eventually, so you get the best of both worlds: neat abstractions, and raw performance. If you haven&#8217;t done so already, you should definitely check out <a href="https://www28.cplan.com/cc191/sessions_catalog.jsp?ilc=191-1&amp;ilg=english&amp;isort=&amp;isort_type=&amp;is=yes&amp;icriteria1=+&amp;icriteria2=+&amp;icriteria9=&amp;icriteria8=maxine&amp;icriteria3=">Maxine VM</a>. The source code will be released under an open source license some time soon (early June). </p>
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2008/05/22/maxine-vm/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Cell VM</title>
		<link>http://andreasgal.com/2008/05/22/cell-vm/</link>
		<comments>http://andreasgal.com/2008/05/22/cell-vm/#comments</comments>
		<pubDate>Thu, 22 May 2008 20:53:11 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.wordpress.com/?p=64</guid>
		<description><![CDATA[We finally managed to publish a paper on CellVM, our Java VM running Java threads on the Synergistic Processing Units (SPUs) of the Cell Processor. The paper will be presented at the 2008 Workshop on Cell Systems and Applications.]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>We finally managed to publish a paper on <a href="http://andreasgal.files.wordpress.com/2008/05/paper.pdf">CellVM</a>, our Java VM running Java threads on the Synergistic Processing Units (SPUs) of the Cell Processor. The paper will be presented at the <a href="http://www.research.ibm.com/cell/workshop/cfp.pdf">2008 Workshop on Cell Systems and Applications</a>.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2008/05/22/cell-vm/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Recording Traces in Spidermonkey</title>
		<link>http://andreasgal.com/2008/05/21/recording-traces-in-spidermonkey/</link>
		<comments>http://andreasgal.com/2008/05/21/recording-traces-in-spidermonkey/#comments</comments>
		<pubDate>Thu, 22 May 2008 03:14:19 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.wordpress.com/?p=56</guid>
		<description><![CDATA[Michael B. and I met yesterday with Brendan and a bunch of other people from Mozilla to talk about the integration of a tracing engine (Tamarin) into Spidermonkey (the JavaScript VM in Firefox). Spidermonkey has been highly optimized over the past decade, in particular in the area of its memory and object model (i.e. property [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Michael B. and I met yesterday with Brendan and a bunch of other people from Mozilla to talk about the integration of a tracing engine (Tamarin) into Spidermonkey (the JavaScript VM in Firefox). Spidermonkey has been highly optimized over the past decade, in particular in the area of its memory and object model (i.e. property tree), and the interpreter itself (lots of specialized fat opcodes to reduce dispatch overhead). Thus, it seems a good idea to retain the Spidermonkey interpreter and JavaScript source-to-opcode translator and add a tracer to the Spidermonkey interpreter instead of throwing away all the optimizations that went into Spidermonkey over time.</p>
<p>While very well suited for fast interpretation, Spidermonkey&#8217;s fat opcodes are not well suited for the intermediate representation we want to record. Instead, for the IR we want to have a much more low level representation that exposes all the type checking and specialization that happens in the fat opcodes. Take Spidermonkey&#8217;s JSOP_ADD opcode, for example:</p>
<p><a href="http://andreasgal.files.wordpress.com/2008/05/sm1.png"><img class="alignnone size-full wp-image-60" src="http://andreasgal.files.wordpress.com/2008/05/sm1.png?w=400&h=587" alt="" width="400" height="587" /></a></p>
<p>Apart from the nasty xml hack at the top, what this opcode mostly does is go along a chain of a bunch of type checks and conversions, and then perform the actual addition of 2 numbers, or a string concatenation in case either element is a string.</p>
<p>SpiderMonkey actually always performs a double addition and doesn&#8217;t specialize to integer additions, but even without this optimization JSOP_ADD contains a large amount of specialization decisions that must be exposed at the IR level in order for a tracer to be able to eliminate this overhead.</p>
<p>Since the interpreter code makes heavy use of macros to abstract certain intrinsic operations, we could use those very abstractions as IR instructions. FETCH_OPND, for example, fetches an operand from the stack (essentially returns &#8220;sp[n]&#8221;). This would be an ideal IR instruction since it&#8217;s low level enough to be compiled (well, actually it&#8217;s a no-op for the compiler since this turns into a register operation).</p>
<p>Ignoring for a moment the ugly xml code in between, the next operation is VALUE_TO_PRIMITIVE, which is actually a composite macro that itself calls a bunch of low-level macros. We don&#8217;t want to touch such composite macros; instead, we are interested in the low-level macros they eventually invoke.</p>
<p><a href="http://andreasgal.files.wordpress.com/2008/05/sm2.png"><img class="alignnone size-full wp-image-62" src="http://andreasgal.files.wordpress.com/2008/05/sm2.png?w=400&h=124" alt="" width="400" height="124" /></a></p>
<p>Here we can see that the macro invokes JSVAL_IS_PRIMITIVE to check whether the value is already a primitive, in which case the value is returned (in the last parameter vp); otherwise the default value is obtained from the object.</p>
<p>JSVAL_IS_PRIMITIVE is again a nice primitive to record since it basically just checks whether certain bit patterns apply.</p>
<p>The next question is how do we refactor the interpreter to be able to insert all the recording hooks we need.</p>
<p>In our JamVM-based trace compiler we hooked into the interpreter by modifying the macros JamVM uses to implement each bytecode. Similarly, in Spidermonkey we could try to hook into the low-level macros (primitives). In FETCH_OPND, for example, we could insert a call to the recorder right after reading the value from the stack:</p>
<blockquote><p>#define FETCH_OPND(n)   (record_FETCH_OPND(n), sp[n])</p></blockquote>
<p>This approach works well for JamVM where each instruction we record performs a set of defined actions on the stack and local variables, and the recording functions can track those actions on some abstract stack and local variable array. A JVML push instruction, for example, pushes a value on the operand stack. The recording function (let&#8217;s call it record_PUSH) performs the same action on the abstract stack, but instead of the value it pushes the address of the IR node that generated the value PUSH is putting on the stack.</p>
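<p>A minimal sketch of such an abstract stack in C might look as follows. The IrNode layout, the fixed-size pools and record_ADD are assumptions made up for this example; only the name record_PUSH comes from the text above, and none of this is the actual JamVM recorder code.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IR node: either a constant or the sum of two earlier nodes. */
typedef enum { IR_CONST, IR_ADD } IrOp;

typedef struct IrNode {
    IrOp op;
    int imm;                      /* constant value, used by IR_CONST only */
    struct IrNode *left, *right;  /* producers of the operands, IR_ADD only */
} IrNode;

#define MAX_NODES 32
#define MAX_STACK 16

static IrNode nodes[MAX_NODES];
static size_t node_count;
static IrNode *abstract_stack[MAX_STACK];  /* holds producers, not values */
static size_t abstract_sp;

static IrNode *new_node(IrOp op, int imm, IrNode *l, IrNode *r) {
    assert(node_count < MAX_NODES);
    IrNode *n = &nodes[node_count++];
    n->op = op;
    n->imm = imm;
    n->left = l;
    n->right = r;
    return n;
}

/* Mirrors a push of a constant: the IR node goes on the abstract stack. */
static void record_PUSH(int value) {
    assert(abstract_sp < MAX_STACK);
    abstract_stack[abstract_sp++] = new_node(IR_CONST, value, NULL, NULL);
}

/* Mirrors an integer add: pop the two producers, push the node for the sum. */
static void record_ADD(void) {
    assert(abstract_sp >= 2);
    IrNode *r = abstract_stack[--abstract_sp];
    IrNode *l = abstract_stack[--abstract_sp];
    abstract_stack[abstract_sp++] = new_node(IR_ADD, 0, l, r);
}
```

<p>After record_PUSH(2), record_PUSH(3) and record_ADD(), the abstract stack holds a single IR_ADD node whose operands point at the two constants, which is exactly the dataflow the compiler needs.</p>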
<p>In case of FETCH_OPND, however, we don&#8217;t have a complete overview of the state, because the value FETCH_OPND reads is returned and then assigned to a variable (i.e. rval). While we could track rval and lval, this would also require modifying FETCH_OPND to indicate which variable it writes to, otherwise the recording function can&#8217;t tell which abstract equivalent of the variable it has to update.</p>
<p>Instead, I think it makes more sense to slightly change the interface of FETCH_OPND. Instead of returning a value, the output variable gets passed in as last argument (SpiderMonkey already does this for most macros):</p>
<blockquote><p>#define FETCH_OPND(n, x) x = sp[n];</p>
<p>FETCH_OPND(-1, rval)</p>
<p>FETCH_OPND(-2, lval) </p></blockquote>
<p>The code generated by this macro is still identical to the original approach, but we can now hook in a recorder much more easily. The recording version of FETCH_OPND would look like this:</p>
<blockquote><p>#define FETCH_OPND(n, x) x = sp[n]; \</p>
<p>record_FETCH_OPND(x, &amp;x);</p></blockquote>
<p>Essentially, we just append the recorder invocation to the existing macro. In addition, we can now record the result of the operation (x), as well as the address of the location we store to (&amp;x), which is essentially the name of the value (think of SSA names here). Using this name we can uniquely identify any future use of the value using a hash table of name to instruction that generated the value (in this case that instruction would be identified by consulting the abstract stack inside record_FETCH_OPND and figuring out who put the value onto the stack).</p>
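<p>Here is a rough sketch of that name-to-instruction mapping. Everything in it is an assumption for illustration: the table is a small linear array standing in for a real hash table, producers are plain integer ids, and record_FETCH_OPND takes the stack index n instead of the value so it can consult the abstract stack. The macro is also wrapped in do/while(0), which keeps a multi-statement macro safe inside an unbraced if.</p>

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NAMES 16

/* Hypothetical name table: maps the address of an interpreter variable
 * (its "name") to the id of the IR instruction that produced its value. */
typedef struct {
    const void *name;
    int producer;
} NameEntry;

static NameEntry names[MAX_NAMES];
static size_t name_count;

static void bind_name(const void *name, int producer) {
    for (size_t i = 0; i < name_count; i++) {
        if (names[i].name == name) {   /* rebinding: the newest producer wins */
            names[i].producer = producer;
            return;
        }
    }
    assert(name_count < MAX_NAMES);
    names[name_count].name = name;
    names[name_count].producer = producer;
    name_count++;
}

/* Returns the producer id bound to name, or -1 if the name is unknown. */
static int lookup_name(const void *name) {
    for (size_t i = 0; i < name_count; i++)
        if (names[i].name == name)
            return names[i].producer;
    return -1;
}

/* Abstract stack of producer ids, indexed relative to its top like sp[n]. */
static int abstract_stack[8];
static int *abstract_sp = abstract_stack;

static void record_FETCH_OPND(int n, const void *name) {
    bind_name(name, abstract_sp[n]);
}

/* Recording variant of FETCH_OPND; expects a variable sp in scope,
 * like the interpreter macro it is modeled on. */
#define FETCH_OPND(n, x) \
    do { (x) = sp[n]; record_FETCH_OPND((n), &(x)); } while (0)
```

<p>With two values on both stacks, FETCH_OPND(-1, rval) followed by FETCH_OPND(-2, lval) loads the operands as before, and lookup_name(&amp;rval) or lookup_name(&amp;lval) then yields the instruction that produced each of them.</p>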
<p>JSVAL_IS_PRIMITIVE can be transformed similarly. However, in contrast to FETCH_OPND, it&#8217;s actually not located in jsinterp.c. Instead, it&#8217;s buried somewhere deep inside of jsapi.h. To make this work well we would have to gather all primitives in one place where we can annotate them with recording code. Some common naming for primitives would be nice too. There is no need to actually remove the JSVAL_IS_PRIMITIVE code from jsapi.h, but jsinterp.c should use a second set of macros that map the primitives to their implementation:</p>
<blockquote><p>jsprim.h:</p>
<p>#define PRIM_FETCH_OPND(n, x) \</p>
<p>  FETCH_OPND(n, x)</p>
<p>#define PRIM_JSVAL_IS_PRIMITIVE(v, x) \</p>
<p>  x = JSVAL_IS_PRIMITIVE(v)</p></blockquote>
<p>To generate the tracing equivalent (jstrace.h) we could probably resort to some automated code manipulation, i.e. translate every primitive by appending a call to a recording macro with identical signature:</p>
<blockquote><p>jstrace.h:</p>
<p>#define PRIM_FETCH_OPND(n, x) FETCH_OPND(n, x); \</p>
<p>  RECORD_FETCH_OPND(n, x)</p>
<p>#define PRIM_JSVAL_IS_PRIMITIVE(v, x) x = JSVAL_IS_PRIMITIVE(v); \</p>
<p>  RECORD_JSVAL_IS_PRIMITIVE(v, x)</p></blockquote>
<p>Transforming jsinterp.c like this will require a major code shakeup, however I think there is an argument to be made that this would actually improve readability and increase modularity even without tracing in mind. Also, if done right this entire macro refactoring business should not affect the actual underlying code at all. One could even do an automated code comparison of the code every time an opcode is rewritten to use primitives, since no code has to change in the non-recording case despite all the macro magic at the source level.</p>
<p>I will try to hack this up for a few opcodes to see what it looks like.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2008/05/21/recording-traces-in-spidermonkey/feed/</wfw:commentRss>
	
		<media:content url="http://andreasgal.files.wordpress.com/2008/05/sm1.png" medium="image" />

		<media:content url="http://andreasgal.files.wordpress.com/2008/05/sm2.png" medium="image" />
	</item>
		<item>
		<title>Bernd&#8217;s Challenge</title>
		<link>http://andreasgal.com/2008/05/19/bernds-challenge/</link>
		<comments>http://andreasgal.com/2008/05/19/bernds-challenge/#comments</comments>
		<pubDate>Tue, 20 May 2008 05:34:03 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.wordpress.com/?p=55</guid>
		<description><![CDATA[Michael and I spent the day at Sun Labs in Menlo Park. Bernd and his group are currently porting Maxine to MacOS X amongst others, and ran into the horror that is Mac OS X&#8217;s/Darwin&#8217;s ptrace implementation. When the Maxine VM image boots up, a debugger (inspector) uses ptrace to connect to it and observe the [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Michael and I spent the day at Sun Labs in Menlo Park. Bernd and his group are currently porting Maxine to Mac OS X, among other platforms, and ran into the horror that is Mac OS X&#8217;s/Darwin&#8217;s ptrace implementation. When the Maxine VM image boots up, a debugger (inspector) uses ptrace to connect to it and observe the address at which the VM image is loaded (using mmap). Most sane OSes support some sort of system call tracing, which makes this trivial.</p>
<p>Mac OS X is a different story. ptrace is totally broken on Mac OS X, and for most functionality (like peek/poke the subject address space or reading the content of registers) one has to resort to using Mach. Even worse, Mac OS X&#8217;s kernel (xnu) doesn&#8217;t support any form of system call tracing (except ktrace, which writes system call info directly to a file). Bernd mentioned a bronze statue to be placed at Sun Labs for the person who gets Maxine&#8217;s ptrace monitoring code to work on Mac OS X ;) Here is my entry for that contest: <a href="http://www.ics.uci.edu/~gal/syscall.c">syscall.c</a> <a href="http://www.ics.uci.edu/~gal/hello.c">hello.c</a></p>
<p>As mentioned before, Darwin doesn&#8217;t support system call tracing so we simply single step through the code. This is of course pretty slow (ballpark factor 10,000), but Maxine&#8217;s startup code is pretty compact so it should still be manageable. hello.c is a test case that allocates 0x88000 bytes using mmap. syscall.c traces through it in about a second. The mmap is recognized by scanning for RAX=0xc5 (mmap syscall) and a sufficiently large size for the mmap (Maxine&#8217;s VM image is very large, uniquely identifying the mmap call as the intended one). If both conditions hold we set a flag and check the result of the mmap syscall after we step over it. RAX then contains the address mmap mapped the file to. Since we are still in single stepping mode, the client program (Maxine loader) is suspended, and using the image address the image can be analyzed and the proper breakpoints can be set to take control of the subject VM. Once everything is ready to go, PT_CONTINUE can be used to resume execution (until a breakpoint is hit).</p>
<p>Note: Make sure to run syscall.c as either super user, or assign the file to the proper group. The latest Darwin kernel is picky about using Mach syscalls from unprivileged executables.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2008/05/19/bernds-challenge/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Assembler 2.0</title>
		<link>http://andreasgal.com/2008/05/12/assembler-20/</link>
		<comments>http://andreasgal.com/2008/05/12/assembler-20/#comments</comments>
		<pubDate>Mon, 12 May 2008 18:25:16 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.wordpress.com/?p=54</guid>
		<description><![CDATA[Our backend is currently using the Maxwell Assembler framework. It&#8217;s a very neat system that generates an assembler and disassembler for a number of platforms (IA32 and AMD64/IA32E, SPARC, etc). Maxwell uses a generative approach and uses a built-in description to generate a huge (700k+ bytecode) class file that contains one method for each instruction [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Our backend is currently using the <a href="https://maxwellassembler.dev.java.net/" target="_blank">Maxwell Assembler</a> framework. It&#8217;s a very neat system that generates an assembler and disassembler for a number of platforms (IA32 and AMD64/IA32E, SPARC, etc). Maxwell uses a generative approach and uses a built-in description to generate a huge (700k+ bytecode) class file that contains one method for each instruction in all its possible formats. To assemble instructions the backend then invokes one of those 65,000+ methods to emit the corresponding machine instruction.</p>
<p>This approach has a number of advantages. First of all, Maxwell Assembler is very strictly typed. It&#8217;s almost impossible to assemble an instruction with incorrect register parameters since registers are typed not only by register type (GeneralRegister vs XMMRegister) but also by use (IndirectRegister vs vanilla GeneralRegister). This is nice and avoids a lot of issues with mixed-up order of arguments (i.e. what do I specify first, the base register or the index register?).</p>
<p>Maxwell Assembler&#8217;s approach is also pretty fast since each method is highly specialized and basically just shuffles and shifts bits around and emits them to a stream.</p>
<p>Anyway, I think it&#8217;s time to get rid of it and upgrade to something that works better in our specific environment. Why?</p>
<p>Well, two reasons: Maxwell is HUGE, and DOG SLOW. At first glance, the latter seems to contradict what I wrote two sentences ago, but it&#8217;s really a combination of the former (Maxwell is a ton of code, 2MB compressed bytecode) and the fact that we use it in a dynamic setting. The Maxwell/Maxine system Maxwell Assembler was designed for is a statically precompiled virtual machine, and it doesn&#8217;t have to dynamically load and link the assembler framework. We do, since our VM runs on top of Hotspot. And that&#8217;s where we fell out of love with Maxwell. It takes a good 750ms on a 2GHz Core 2 Duo to instantiate the Maxwell Assembler for the first time. That really kills our performance numbers for benchmarks with short runtime.</p>
<p>As <a href="http://michael.bebenita.com/?q=blog/1">Michael</a> suggested we could try to strip out the 64500 or so methods of the RawAssembler we don&#8217;t use and keep Maxwell, but I never much liked the one method per instruction and format interface and I will try to explain why.</p>
<p>The x86 architecture is crazy irregular, but it still has some significant regularity to it that can be exploited by a backend to reduce code duplication in the code generator itself. For example, our current backend has to implement the following idiom repeatedly for all possible arithmetic operations:</p>
<blockquote><p>if (is-immediate(right(x)))<br />
    if (is-8bit-value(immediate(right(x))))<br />
         emit(register(left(x)), (byte) immediate(right(x)))<br />
    else<br />
         emit(register(left(x)), immediate(right(x)))<br />
else<br />
    if (is-register(left(x)))<br />
        if (is-register(right(x)))<br />
            emit(register(left(x)), register(right(x)))<br />
        else<br />
            if (is-8bit-value(offset(right(x)))) <br />
                emit(register(left(x)), RBP, (byte)offset(right(x)))<br />
            else <br />
                emit(register(left(x)), RBP, offset(right(x)))<br />
    else<br />
        &#8230;. </p></blockquote>
<p>Actually, in reality the idiom is even more complicated because the left side is not always a register. Instead, for each case we have to check whether it&#8217;s in memory and whether the offset fits into 8 bits. I also left out the second part of the last if cascade (if the left side is in memory). Anyway, the point is that using Maxwell requires huge ugly nested if cascades to select the proper matching method to emit the instruction with. And since each instruction has its separate methods for that, this code has to be duplicated for each IR instruction (i.e. ADD and SUB).</p>
<p>This entire exercise is particularly pointless because at the machine level all these address encodings are expressed using the same ModRM/SIB encoding. So instead of generating thousands of individual methods for them, I propose using a high-level interface that delays the decision what operands to use (memory or register) until after we pick what instruction we want to have. We do this using an Operand interface:</p>
<blockquote><p>public<span> </span>interface<span> Operand {<br />
<span>    public</span> OperandType getOperandType();<br />
<span>    public</span> Address getAddress();<br />
<span>    public</span> Register getRegister();<br />
}</span></p></blockquote>
<div>The emit functions now can take generic arguments and decide what to do with them:</div>
<blockquote>
<div>
<p><span>void</span> emit(CodeBuffer code, <span>int</span> rm_r, <span>int</span> r_rm, DataSize size, Operand left, Operand right) {<br />
<span>    if</span> (right.getOperandType() == OperandType.<span>Register</span>)<br />
        emitModRMSIBFormat(code, rm_r, size, left, right.getRegister());<br />
<span>    else</span> {<br />
<span>        assert</span> left.getOperandType() == OperandType.<span>Register</span>;<br />
        emitModRMSIBFormat(code, r_rm, size, right, left.getRegister());<br />
    }<br />
}</p>
</div>
</blockquote>
<div>Two things are noteworthy here. First, this one emitter function covers the same functionality as several thousand Maxwell methods, because we can deal with all combinations of Operands (except immediates, which we encode separately; I will explain why in a minute). Second, and this is the downside, we are no longer as strictly typed as Maxwell. It&#8217;s possible to pass two operands to this method that are both addresses, at which point an assertion is raised. Maxwell would have caught this at compile time.</div>
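<p>To make the ModRM point concrete, here is a minimal sketch of how one opcode plus a ModRM byte covers both the register and the memory form of an operand. This is not our actual emitter: the class and method names (ModRMSketch, encodeRegReg, encodeRegMem) are made up for illustration, and the sketch assumes 32-bit operands, registers 0 to 7, no REX prefix and no SIB byte.</p>

```java
// Hypothetical sketch; names are illustrative, not the Hotpath assembler's API.
class ModRMSketch {
    /** Packs the three ModRM fields: mod (2 bits), reg (3 bits), rm (3 bits). */
    static int modrm(int mod, int reg, int rm) {
        return ((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7);
    }

    /** opcode + ModRM for a register-direct r/m operand (mod = 11b). */
    static byte[] encodeRegReg(int opcode, int reg, int rm) {
        return new byte[] { (byte) opcode, (byte) modrm(3, reg, rm) };
    }

    /** opcode + ModRM + displacement for [base + disp]; uses disp8 when it fits. */
    static byte[] encodeRegMem(int opcode, int reg, int base, int disp) {
        if ((base & 7) == 4) {
            // rm = 100b selects a SIB byte, which this sketch does not cover.
            throw new IllegalArgumentException("base register needs a SIB byte");
        }
        if (disp >= -128 && disp <= 127) {
            return new byte[] { (byte) opcode, (byte) modrm(1, reg, base), (byte) disp };
        }
        return new byte[] { (byte) opcode, (byte) modrm(2, reg, base),
            (byte) disp, (byte) (disp >> 8), (byte) (disp >> 16), (byte) (disp >> 24) };
    }

    public static void main(String[] args) {
        // ADD ECX, [EBP-40] using opcode 0x03 (ADD r32, r/m32)
        for (byte b : encodeRegMem(0x03, 1, 5, -40)) System.out.printf("%02X ", b);
        System.out.println();
    }
}
```

<p>The same two helpers work for any opcode that takes a ModRM operand, which is why a single generic emitter can stand in for a whole family of per-format methods.</p>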
<div></div>
<div>With these unified emitter functions we can now assemble instructions from templates:</div>
<blockquote>
<div>
<p>ALU_Template <span>ADD</span>  = <span>new</span> ALU_Template(0x05, 0x81, 0x83, 0, 0x01, 0x03);<br />
ALU_Template <span>SUB</span>  = <span>new</span> ALU_Template(0x2D, 0x81, 0x83, 5, 0x29, 0x2B);<br />
ALU_Template <span>AND</span>  = <span>new</span> ALU_Template(0x25, 0x81, 0x83, 4, 0x21, 0x23);<br />
ALU_Template <span>OR</span>   = <span>new</span> ALU_Template(0x0D, 0x81, 0x83, 1, 0x09, 0x0B);<br />
ALU_Template <span>XOR</span>  = <span>new</span> ALU_Template(0x35, 0x81, 0x83, 6, 0x31, 0x33);<br />
ALU_Template <span>CMP</span>  = <span>new</span> ALU_Template(0x3D, 0x81, 0x83, 7, 0x39, 0x3B);</p>
</div>
</blockquote>
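<p>As a rough illustration of what such a template can do with its six numbers, here is a sketch that picks the shortest encoding for a register/immediate operation. The reading of the constructor arguments is an assumption based on the standard x86 opcode map (accumulator imm32 opcode, r/m imm32 opcode, r/m imm8 opcode, the /digit opcode extension, the "r/m, reg" opcode and the "reg, r/m" opcode); the class AluTemplateSketch and its method emitRegImm are hypothetical and only handle 32-bit register-direct operands.</p>

```java
// Hypothetical sketch; field meanings are assumed from the x86 opcode map.
class AluTemplateSketch {
    final int accImm32, rmImm32, rmImm8, ext, rmReg, regRm;

    AluTemplateSketch(int accImm32, int rmImm32, int rmImm8, int ext, int rmReg, int regRm) {
        this.accImm32 = accImm32; this.rmImm32 = rmImm32; this.rmImm8 = rmImm8;
        this.ext = ext; this.rmReg = rmReg; this.regRm = regRm;
    }

    static final AluTemplateSketch ADD = new AluTemplateSketch(0x05, 0x81, 0x83, 0, 0x01, 0x03);
    static final AluTemplateSketch SUB = new AluTemplateSketch(0x2D, 0x81, 0x83, 5, 0x29, 0x2B);

    /** Encodes "OP reg, imm" for a 32-bit register (0..7), preferring the shortest form. */
    byte[] emitRegImm(int reg, int imm) {
        // mod = 11b (register direct), reg field = opcode extension, rm = target register
        int modrm = 0xC0 | (ext << 3) | (reg & 7);
        if (imm >= -128 && imm <= 127) {
            // sign-extended 8-bit immediate form
            return new byte[] { (byte) rmImm8, (byte) modrm, (byte) imm };
        }
        if (reg == 0) {
            // short accumulator form (EAX), no ModRM byte needed
            return new byte[] { (byte) accImm32,
                (byte) imm, (byte) (imm >> 8), (byte) (imm >> 16), (byte) (imm >> 24) };
        }
        return new byte[] { (byte) rmImm32, (byte) modrm,
            (byte) imm, (byte) (imm >> 8), (byte) (imm >> 16), (byte) (imm >> 24) };
    }

    public static void main(String[] args) {
        for (byte b : SUB.emitRegImm(2, 0x1234)) System.out.printf("%02X ", b);
        System.out.println();
    }
}
```

<p>Because all six ALU operations share the same layout, the selection logic is written once and the templates only differ in their opcode bytes.</p>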
<div>For added convenience, our register enums (GeneralRegister and XMMRegister) also implement the Operand interface and of course always indicate that they are register operands. This way we can directly pass registers to the emitter functions, without having to box them into an Operand object. Here are a couple of examples on how our assembler can be used:</div>
<blockquote>
<div>
<p><span> </span>CodeBuffer code = <span>new</span> IndirectCodeBuffer(0x1000, 2048);<br />
<span>ADD</span>.emit(code, DataSize.<span>LONG</span>, <span>RAX</span>, 0x1234);<br />
<span>ADD</span>.emit(code, DataSize.<span>INT</span>, <span>new</span> Address(<span>FS</span>, <span>RBP</span>, <span>R12</span>, <span>SCALE_4</span>, -50), <span>RBP</span>);<br />
Label label2 = <span>new</span> Label();<br />
code.setLabel(label2);<br />
Label label = <span>new</span> Label();<br />
<span>ADD</span>.emit(code, DataSize.<span>LONG</span>, <span>new</span> Address(label), <span>RAX</span>);<br />
<span>JC</span>.emit(code, Condition.<span>NP</span>, BranchDistance.<span>BYTE</span>, label2); <span>/* short branch */<br />
<span>MOV</span>.emit(code, DataSize.<span>LONG</span>, <span>R12</span>, 0x1234567822345678L);<br />
<span>MOV</span>.emit(code, DataSize.<span>INT</span>, <span>new</span> Address(<span>RSP</span>), 0x12345678);<br />
<span>PUSH</span>.emit(code, 0x12);<br />
<span>PUSH</span>.emit(code, 0x1234);<br />
<span>PUSH</span>.emit(code, 0x12345678);<br />
Label label3 = <span>new</span> Label();<br />
code.setLabel(label3);<br />
<span>SHL</span>.emit(code, DataSize.<span>LONG</span>, <span>RAX</span>, 1);<br />
/* if virtual registers implement the Operand interface they can be used directly */<br />
VirtualRegister vr1 = <span>new</span> VirtualRegister(GeneralRegister.<span>RCX</span>);<br />
VirtualRegister vr2 = <span>new</span> VirtualRegister(-40); <span>/* [RSP-40] */<br />
<span>SUB</span>.emit(code, DataSize.<span>LONG</span>, vr1, vr2); /* SUB RCX, [RSP-40] */<br />
<span>JMP</span>.emit(code, label3);<br />
<span>JMP</span>.emit(code, vr2);<br />
<span>CALL</span>.emit(code, label3);<br />
<span>CALL</span>.emit(code, vr2);<br />
<span>code.setLabel(label); </span>/* labels are generated like in Maxwell */<br />
code.emit64(0);<br />
code.resolve();</span></span></p>
</div>
</blockquote>
<p>While we can directly pass virtual register objects to the assembler now, the backend still has to understand and compensate for irregularities such as not being able to assemble with two addresses as operands. Trying to hide this from the backend would be pointless since high-level decisions have to be made how to deal with it (higher level than an assembler anyway).</p>
<p>But as you can see it&#8217;s a fairly concise interface that generates highly compact code (optimal encoding of 8/32bit immediates and offsets), is almost as safe to use as Maxwell, and all that at an expense of about 800 lines of Java code.</p>
<p>Last but not least, our assembler is also very fast, and it can write directly to memory instead of to a byte stream (DirectCodeBuffer). We also split the CodeBuffer from the Assembler itself, so we don&#8217;t even have to instantiate the Assembler again just because we want to emit code to different code buffers.</p>
<p>As a final word of defense for Maxwell, it would be possible to implement such an interface on top of Maxwell&#8217;s million-method API (and Bernd did suggest this), but at least for our system that doesn&#8217;t solve the dynamic loading issue, so I think replacing Maxwell is the way to go.</p>
<p>We will keep the Maxwell disassembler for the time being though, because its output is beautiful and its slowness irrelevant. I will add a fast disassembler to the code base eventually though because we need to perform some static analysis on machine code and Maxwell&#8217;s disassembler is way too slow for that as well (unfortunately).</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2008/05/12/assembler-20/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tree Folding</title>
		<link>http://andreasgal.com/2008/02/28/tree-folding/</link>
		<comments>http://andreasgal.com/2008/02/28/tree-folding/#comments</comments>
		<pubDate>Fri, 29 Feb 2008 00:54:21 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.wordpress.com/?p=53</guid>
		<description><![CDATA[At our last meeting with the Spidermonkey developer team up at Mozilla Edwin Smith from Adobe gave a presentation about pathological cases for trace trees. Amongst others Ed talked about the massive tree that gets recorded for John Conway&#8217;s Game of Life algorithm. It essentially consists of a nested loop over a matrix that counts for [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>At our last meeting with the Spidermonkey developer team up at Mozilla, Edwin Smith from Adobe gave a presentation about pathological cases for trace trees. Among other things, Ed talked about the massive tree that gets recorded for John Conway&#8217;s <a href="http://www.bitstorm.org/gameoflife/">Game of Life</a> algorithm. It essentially consists of a nested loop over a matrix that counts for each cell the number of neighboring cells that are alive. The trivial implementation of this is a series of consecutive if statements inside the loop body, each incrementing a counter if the cell coordinates are within range and the cell is alive. To address this issue, we have now added a new tree transformation to our trace tree compiler: tree folding. The idea of tree folding is to reconnect traces that split off from another trace but differ only minimally in the code they execute. A simple example for this is calculating the minimum of two values:
<pre>for (n = 0; n &lt; 5000; ++n) a = min(b,c)</pre>
<p>The trace tree for this loop consists of 2 traces. The primary trace covers one side of the min condition (i.e. b &lt; c), and whenever b is &gt;= c, the other (secondary) trace is recorded. The goal of tree folding is to eliminate the need to have 2 traces in this scenario. While 2 traces seem reasonable here, just imagine using several min functions inside a loop body. The tree would fan out quickly and contain hundreds of traces.</p>
<p>The tree folding algorithm is implemented as a short compiler pipeline that gets invoked every time we add a trace to the tree. The tree is serialized into a sequence of instructions (read about tree serialization <a href="http://andreasgal.com/2007/06/20/tree-filtering-ordering-of-filters/" target="_blank">here</a>), which is then passed through the IdentifySpliceSites and TreeFolding filters. At the end of this pipeline is a TreeBuilder that re-constitutes a trace tree, which is substituted for the original tree if we actually folded traces in it.</p>
<p>To identify sliceable trace parts (strands), we look through the instruction sequence and scan for guards that have a secondary trace attached. To be sliceable, both traces (primary and attached secondary) must be followed by the same merge point, which is a guard or the end of the loop. Between the split point (where we will start merging) and the merge point (the common guard we expect both traces to hit) we can execute arbitrary instructions as long as they don&#8217;t change the memory state (STORE, SynchronizationInstruction, AllocationInstruction). When we find such a guard, we mark it as a split point and the next pass will eliminate it and replace it with conditional move instructions.</p>
<p>The reason that we can only merge traces at a second &#8220;common&#8221; guard is that we need to compare the state of all variables on both traces at the merge point. Guards contain a complete snapshot of the variable state (InterpreterState object), because we need to restore the VM state in case of a side exit on that guard. When merging the two strands, we emit the code from both strands into the primary trace, compare the states in the two merge guards, and for every difference between them we issue a conditional move instruction at the end of the code of both strands to merge the states into one state. For the remainder of the primary trace we have to replace all references to the &#8220;left&#8221; (primary) strand with a reference to the appropriate conditional move, since we execute both strands in parallel now and the right side could be the valid strand. The conditional moves use the same condition as the guard to pick which values to copy. The guard instruction in the primary trace is completely eliminated, and so is the secondary trace attached to it and any trace attached to that. We call this tree pruning.</p>
<p>It&#8217;s important to note that we might prune a code path from the tree that is then no longer represented in the tree and will have to be re-recorded. Consider the Game of Life code. It&#8217;s a series of if statements, and when we merge the first if statement, the secondary trace will have a long code path for the additional 7 ifs in it. We only copy the code for the first if into the primary trace (until the merge guard); the rest of the trace is thrown away and will be re-recorded when we side exit on the 2nd remaining if. Through 8 iterations eventually all the if statements are inlined into the primary trace.</p>
<p>The advantage of this approach is that we can do hierarchical inlining. If an if statement contains another if statement, the inner if statement is folded first and becomes a series of conditional moves, which then in a second pass can be folded into the outer if statement. This is particularly important because complex if clauses are essentially a series of nested ifs and can thus be folded using the algorithm described above. For this iterative folding, the folding pipeline has to be re-executed as long as the tree changes. In most cases, at most 2 iterations are necessary to obtain a tree that cannot be further simplified.</p>
<p>For the details of the algorithm take a look at the source code of our compiler, which is available from our <a href="http://hotpath.org/" target="_blank">Hotpath Project</a> website.</p>
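<p>The state-merging step can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the code from our compiler: the snapshot at each merge guard is modeled as an array of symbolic value names (one slot per variable), and the class TreeFoldingSketch, the method fold and the textual CMOV notation are all made up for this example.</p>

```java
// Hypothetical sketch of merging two guard snapshots with conditional moves.
class TreeFoldingSketch {
    /**
     * Compares the variable snapshots at the merge guard of the primary and the
     * secondary strand. Slots that agree are kept; for every slot that differs a
     * conditional move on the (eliminated) split guard's condition is emitted.
     */
    static String[] fold(String cond, String[] primary, String[] secondary) {
        String[] merged = new String[primary.length];
        for (int i = 0; i < primary.length; i++) {
            merged[i] = primary[i].equals(secondary[i])
                ? primary[i]
                : "CMOV(" + cond + ", " + primary[i] + ", " + secondary[i] + ")";
        }
        return merged;
    }

    public static void main(String[] args) {
        // Loop body: a = min(b, c). Slots are the variables a, b, c.
        String[] primary   = { "b", "b", "c" }; // strand taken when b < c: a holds b
        String[] secondary = { "c", "b", "c" }; // strand taken otherwise:  a holds c
        String[] merged = fold("b LT c", primary, secondary);
        System.out.println(String.join(" | ", merged));
    }
}
```

<p>Only the slot for a differs between the two snapshots, so a single conditional move replaces the guard and the whole secondary trace.</p>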
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2008/02/28/tree-folding/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Conditional Moves in the IR</title>
		<link>http://andreasgal.com/2008/02/15/conditional-moves-in-the-ir/</link>
		<comments>http://andreasgal.com/2008/02/15/conditional-moves-in-the-ir/#comments</comments>
		<pubDate>Sat, 16 Feb 2008 04:50:58 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.wordpress.com/?p=52</guid>
		<description><![CDATA[Today we added conditional move instructions (CMOVs) to the intermediate representation (IR). They are closely related to GUARDs, but with two additional arguments that represent the two optional values the CMOV will pick from. MIN, MAX, ABS were previously modeled as separate and explicit instructions in the IR. Instead, these are now expressed through CMOV:

MIN(a, b)	CMOV LE, [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Today we added conditional move instructions (CMOVs) to the intermediate representation (IR). They are closely related to GUARDs, but with two additional arguments that represent the two optional values the CMOV will pick from. MIN, MAX, ABS were previously modeled as separate and explicit instructions in the IR. Instead, these are now expressed through CMOV:
<ul>
<li>MIN(a, b)<span class="Apple-tab-span" style="white-space:pre;">	</span>CMOV LE, a, b, a, b    </li>
<li>MAX(a, b)<span class="Apple-tab-span" style="white-space:pre;">	</span>CMOV GE, a, b, a, b</li>
<li>ABS(a)<span class="Apple-tab-span" style="white-space:pre;">		</span>n = NEG(a), CMOV GE, a, n, a, n</li>
</ul>
<p>In the backend CMOVs are implemented using machine-level conditional moves, or in case of floating-point MIN/MAX we match the form CMOV x, a, b, c, d with (x == LE) and (a == c) and (b == d) and emit it as an x86 MINSS instruction.</p>
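<p>The mappings in the list above can be checked with a tiny model of the instruction. This is a sketch of the semantics as I read them from the list, not our IR classes: the operand order (condition, the two compared values, then the value picked when the condition holds and the value picked otherwise) is an assumption, and CmovSketch and its methods are hypothetical names.</p>

```java
// Hypothetical model of the CMOV IR instruction's semantics on ints.
class CmovSketch {
    enum Cond { LE, GE }

    /** CMOV cond, a, b, x, y: yields x if "a cond b" holds, otherwise y. */
    static int cmov(Cond cond, int a, int b, int x, int y) {
        boolean holds = (cond == Cond.LE) ? (a <= b) : (a >= b);
        return holds ? x : y;
    }

    static int min(int a, int b) { return cmov(Cond.LE, a, b, a, b); }

    static int max(int a, int b) { return cmov(Cond.GE, a, b, a, b); }

    static int abs(int a) {
        int n = -a;                        // n = NEG(a)
        return cmov(Cond.GE, a, n, a, n);  // pick a when a >= -a, else -a
    }

    public static void main(String[] args) {
        System.out.println(min(3, 7) + " " + max(3, 7) + " " + abs(-5));
    }
}
```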
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2008/02/15/conditional-moves-in-the-ir/feed/</wfw:commentRss>
		</item>
		<item>
		<title>More Precise Load Propagation</title>
		<link>http://andreasgal.com/2007/11/08/more-precise-load-propagation/</link>
		<comments>http://andreasgal.com/2007/11/08/more-precise-load-propagation/#comments</comments>
		<pubDate>Fri, 09 Nov 2007 05:24:53 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.com/2007/11/08/more-precise-load-propagation/</guid>
		<description><![CDATA[We are currently formalizing our escape analysis algorithm for trace-trees with the help of Christian Stork. As part of this work Chris observed that we are overly pessimistic when deciding which loads we can eliminate through load propagation. So far we first ran an escape analysis to find out which allocations are contained (are not [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>We are currently formalizing our escape analysis algorithm for trace-trees with the help of Christian Stork. As part of this work, Chris observed that we are overly pessimistic when deciding which loads we can eliminate through load propagation. Until now, we first ran an escape analysis to find out which allocations are contained (are not leaked), and then performed load propagation only for loads from these local allocations.</p>
<p>This is sub-optimal for several reasons. First of all, the escape analysis pass has to detect two scenarios that permit values to escape our loop scope. On the one hand, we have to track stores to non-captured memory (memory escape); on the other hand, we have to flag values escaping into future loop iterations (loop escape). Both scenarios cause the allocation to be flagged as escaped. However, only the memory escape situation actually inhibits load propagation. While loop-escaping allocations cannot be optimized through allocation hoisting, there is no reason not to perform load propagation on them.</p>
<p>Secondly, and independently of the above observation, there is also no need to stop load propagation altogether just because a reference escapes to memory somewhere in the loop. Due to the tree shape of our intermediate representation, we can actually keep performing load propagation until we see a reference escaping to memory, at which point we have to inhibit any further load propagation for this reference or any reference stored in the associated objects (and recursively all references stored there). This essentially kills load propagation for the rest of this trace, but any trace &#8220;higher up&#8221; in the tree is not affected.</p>
<p>It&#8217;s important to kill not only the current reference, but any other allocation that can escape through it as well. For this, every time we see a load we explicitly check whether the base reference is another load and follow this chain of loads all the way to the source. If any of these hops is an escaped object, we cannot perform load propagation. Previously we were able to cache this lookup. The slightly worse performance is well worth it in this case, though.</p>
<p>In addition to being more precise, the new load propagation pass is also no longer dependent on escape analysis, and we can move it to a much earlier point in the compilation pipeline. By the time escape analysis runs, almost all loads have already been eliminated, which also increases the quality of the escape analysis results.</p>
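<p>The pass described above can be illustrated with a minimal sketch over a single linear trace. This is not our actual compiler code: the instruction classes <code>Store</code>, <code>Load</code> and <code>Escape</code>, the function <code>propagate_loads</code>, and the assumption that every name is assigned exactly once (SSA-style) are all hypothetical and chosen only for illustration. The tree structure is not modeled; the sketch only shows that loads are forwarded until a reference escapes to memory, and that the kill follows stored references and chains of loads.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """base.field = src"""
    base: str
    field: str
    src: str


@dataclass(frozen=True)
class Load:
    """dst = base.field"""
    dst: str
    base: str
    field: str


@dataclass(frozen=True)
class Escape:
    """ref is written to non-captured memory (memory escape)"""
    ref: str


def propagate_loads(trace):
    """Return {load dst: forwarded value} for every load that can be eliminated.

    Assumes each name is defined at most once, so load chains have no cycles.
    """
    heap = {}        # (base, field) maps to the value most recently stored there
    load_base = {}   # load dst maps to the base reference it was loaded from
    forwarded = {}   # load dst maps to the value that replaces the load
    escaped = set()  # references that have escaped to memory so far

    def resolve(name):
        # Look through eliminated loads to the value that replaced them.
        while name in forwarded:
            name = forwarded[name]
        return name

    def reaches_escaped(name):
        # Follow the chain of loads back to its source; any escaped hop blocks.
        while name is not None:
            if name in escaped or resolve(name) in escaped:
                return True
            name = load_base.get(name)
        return False

    def mark_escaped(name):
        # Kill the reference and, recursively, every reference stored in it.
        work = [resolve(name)]
        while work:
            ref = work.pop()
            if ref in escaped:
                continue
            escaped.add(ref)
            work.extend(v for (b, _), v in heap.items() if b == ref)

    for ins in trace:
        if isinstance(ins, Store):
            if reaches_escaped(ins.base):
                mark_escaped(ins.src)  # stored into memory that already leaked
            else:
                heap[(resolve(ins.base), ins.field)] = resolve(ins.src)
        elif isinstance(ins, Load):
            load_base[ins.dst] = ins.base
            key = (resolve(ins.base), ins.field)
            if not reaches_escaped(ins.base) and key in heap:
                forwarded[ins.dst] = heap[key]
        elif isinstance(ins, Escape):
            mark_escaped(ins.ref)
    return forwarded


trace = [
    Store("a", "f", "b"),    # a.f = b
    Store("b", "g", "one"),  # b.g = one
    Load("t", "a", "f"),     # t = a.f, forwarded to b
    Load("u", "t", "g"),     # u = t.g, forwarded to one
    Escape("a"),             # a leaks, and b leaks through a.f
    Load("v", "b", "g"),     # no longer propagated
    Load("w", "a", "f"),     # no longer propagated
]
print(propagate_loads(trace))  # prints {'t': 'b', 'u': 'one'}
```

<p>In this sketch the two loads before the escape are still eliminated, while the loads after it are left alone. The recursive marking in <code>mark_escaped</code> covers references stored in the escaped object, and the walk in <code>reaches_escaped</code> covers references that were themselves obtained through a load from an object that later escaped.</p>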
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2007/11/08/more-precise-load-propagation/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Improvements to the Singleton Analysis Pass</title>
		<link>http://andreasgal.com/2007/09/20/improvements-to-the-singleton-analysis-pass/</link>
		<comments>http://andreasgal.com/2007/09/20/improvements-to-the-singleton-analysis-pass/#comments</comments>
		<pubDate>Thu, 20 Sep 2007 20:19:05 +0000</pubDate>
		<dc:creator>Andreas</dc:creator>
		
		<category><![CDATA[Trace Compilation]]></category>

		<guid isPermaLink="false">http://andreasgal.com/2007/09/20/improvements-to-the-singleton-analysis-pass/</guid>
		<description><![CDATA[Mason found a problem with the Singleton Analysis Pass. Objects that were allocated and then written to an array were not correctly flagged as non-singleton. Besides this rather obvious bug I also fixed the way context slots are handled. When trying to write up a formal proof of the algorithm for the PLDI paper I [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Mason found a problem with the Singleton Analysis Pass. Objects that were allocated and then written to an array were not correctly flagged as non-singleton. Besides this rather obvious bug, I also fixed the way context slots are handled. When trying to write up a formal proof of the algorithm for the PLDI paper, I realized that a non-eliminated load from a context kills any and all allocations that flow into that context. Previously, I was checking at the end of each trace whether allocations flow into a context and killed the allocation there. However, this is not sufficient. Instead, we have to collect for each context all allocations flowing into it along the loop edge and then kill all of these allocations if we observe a load from the context slot. To do this in linear time, we collect a set of allocations flowing into each context and a list of loads that touch contexts, and then resolve and kill allocations after we have seen the entire tree.</p>
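<p>The collect-then-resolve scheme can be sketched as follows. This is a simplified illustration, not the actual pass: the event tuples <code>("flow", alloc, slot)</code> and <code>("load", slot)</code> and the function name <code>non_singleton_allocations</code> are assumptions made up for this example, and a trace tree is flattened into a plain list of traces.</p>

```python
from collections import defaultdict


def non_singleton_allocations(traces):
    """Return the allocations killed by loads from context slots.

    Each trace is a list of events (hypothetical encoding):
      ("flow", alloc, slot)  alloc flows into context slot along the loop edge
      ("load", slot)         a non-eliminated load from context slot
    """
    flows = defaultdict(set)  # context slot maps to allocations flowing into it
    loaded_slots = []         # context slots touched by non-eliminated loads

    # First pass: only collect facts, in one sweep over the whole tree.
    for trace in traces:
        for event in trace:
            if event[0] == "flow":
                _, alloc, slot = event
                flows[slot].add(alloc)
            elif event[0] == "load":
                loaded_slots.append(event[1])

    # Second pass: resolve after the entire tree has been seen.
    killed = set()
    for slot in set(loaded_slots):
        killed |= flows.get(slot, set())
    return killed


traces = [
    [("load", "s0"), ("flow", "A", "s0")],     # loads s0, then A flows into s0
    [("flow", "B", "s0"), ("flow", "C", "s1")],  # side trace, no loads at all
]
print(sorted(non_singleton_allocations(traces)))  # prints ['A', 'B']
```

<p>Allocation <code>B</code> is killed even though the trace it sits on never loads from <code>s0</code>, because a load on a different trace can observe it in a later iteration. <code>C</code> survives because nothing ever loads from <code>s1</code>.</p>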
<p>The reason that a load kills the singleton property is that it might see a previous state instead of a freshly initialized value (i.e. 0/0.0/null). The LoadElimination phase eliminates almost all same-iteration reads from newly allocated objects, so as long as an object is only used within the current iteration, this will not inhibit the allocation from being detected as singleton. Only reading from it in the next iteration prevents this: LoadElimination doesn&#8217;t try to eliminate loads from context slots, because it doesn&#8217;t track which allocations flow into context slots.</p>
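<p>A small, self-contained illustration of why a next-iteration read matters, under the assumption that treating an allocation as singleton means reusing one instance across iterations instead of allocating a fresh one each time. The class <code>Cell</code> and both functions are hypothetical; <code>prev</code> stands in for a context slot that carries a reference across the loop edge.</p>

```python
class Cell:
    def __init__(self):
        self.x = 0  # freshly initialized value


def run_fresh(n):
    """Reference behavior: a new object is allocated in every iteration."""
    prev = None  # loop-carried reference, standing in for a context slot
    seen = []
    for i in range(n):
        cur = Cell()
        cur.x = i
        if prev is not None:
            seen.append(prev.x)  # load through the context slot
        prev = cur
    return seen


def run_reused(n):
    """Unsound variant: the allocation is treated as singleton and reused."""
    shared = Cell()
    prev = None
    seen = []
    for i in range(n):
        cur = shared  # same instance every iteration
        cur.x = i
        if prev is not None:
            seen.append(prev.x)  # observes state from the wrong iteration
        prev = cur
    return seen


print(run_fresh(4))   # prints [0, 1, 2]
print(run_reused(4))  # prints [1, 2, 3]
```

<p>The two variants only differ because the loop reads the object again in the following iteration. If <code>prev.x</code> were never loaded, reusing the instance would be unobservable, which is why the load from the context slot is what has to kill the singleton property.</p>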
</div>]]></content:encoded>
			<wfw:commentRss>http://andreasgal.com/2007/09/20/improvements-to-the-singleton-analysis-pass/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>