Trace-Trees FAQ

Dave Roberts sent me a couple of questions about trace trees after he saw our work mentioned on Steve Yegge’s blog. I figured my answers might be interesting to more people than just Dave. 

Most of your papers on trace-trees just describe the behavior of the technique with respect to a single trace tree. That is, as described, you basically find the first inner loop in the program and then trace and compile that, extending it as you find other paths that branch from it. That’s fine, but how does the system behave with respect to large programs that have many such loops? I’m assuming that you’re compiling loops in many methods across a large such program. Are you saving the trace results across all that activity? In other words, if you find a hot loop in method A, then when you finally exit that method and later find a hot loop in method B, do you throw away the work you did for method A and recreate it later, or are you building up bits of compiled code throughout the long-term program run? I assume the latter, but didn’t really know.

Our code initially runs in a bytecode interpreter. In principle, each bytecode can be the anchor of a trace tree. The code is interpreted until a particular potential anchor becomes “hot” enough to host a tree. At that point we record a trace, compile and execute it, and subsequently try to extend the tree whenever we side-exit from it. We only grow the tree with traces that connect back to the loop header the tree is anchored at, either through a direct path through the loop or through some path that goes through an outer loop. This is not always possible, e.g. if two loops are nested inside an outer loop, at which point we have to generate nested trees where an outer tree calls the inner trees (since we can’t easily form a path through an inner and the outer loop at the same time: we would get stuck looping in the other inner loop and the trace would get very long). We use various abort conditions to restrict the maximum size of a trace we are willing to attach to a tree. With an unlimited trace length the entire program would eventually attach to each tree we start, which is counterproductive. We want each tree to represent one hot code region.
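A minimal sketch of the hot-anchor detection described above: the interpreter bumps a counter for each potential anchor on every backward branch, and when a counter crosses a threshold, trace recording kicks in. All names here (HotLoopDetector, HOT_THRESHOLD, onBackwardBranch) are illustrative, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of hot-loop anchor detection. An interpreter would
// call onBackwardBranch() each time it jumps back to a potential anchor;
// when the anchor becomes hot, trace recording starts.
public class HotLoopDetector {
    static final int HOT_THRESHOLD = 100;  // e.g., a 2-3 digit trigger count
    final Map<Integer, Integer> hitCounts = new HashMap<>();

    // Returns true exactly once, when the anchor crosses the threshold
    // and a trace should be recorded starting at this bytecode address.
    boolean onBackwardBranch(int anchorPc) {
        int hits = hitCounts.merge(anchorPc, 1, Integer::sum);
        return hits == HOT_THRESHOLD;
    }
}
```

The per-anchor counter is what makes “each bytecode can be the anchor of a trace tree” cheap in practice: cold anchors only ever pay for a counter increment.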

Assuming you’re building up bits of code long-term, are there any issues reentering the compiled code from the interpreter when you next execute method A? The papers always describe entering the compiled code as an act that happens right after you record the trace and compile it, but they don’t really talk about the issues of reentering the same code later. How is this done?

Yes, we compile the trace (or tree) and then re-enter it every time the interpreter runs across its anchor point. In our language (JVML) the bytecode is statically typed in the sense that at each point in the program (i.e. for each bytecode instruction) all variables (local variable slots and stack slots) have one unique type. The recorded trace is compiled with that fixed type distribution and knows how to pull the values from the interpreter stack and local variable frame. Constant values are detected by the optimizer and embedded directly in the trace instead of being read from the interpreter frame. One could even speculate on certain values: if we see a boolean value in the local variable frame being true for N iterations, we could re-compile the tree assuming that value is always true, and then insert a guard that ensures this specialized tree is only executed if that slot really contains a boolean true value.
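The value-speculation idea above can be sketched as follows. This is a toy model, not the real machinery: SideExit, runSpecialized, and interpretSlowPath are invented names, and the “compiled trace” is just a Java method whose guard throws to simulate a side exit back to the interpreter.

```java
// Hypothetical sketch of value speculation with a guard: a trace is
// recompiled assuming a boolean slot is always true; the guard side-exits
// to the general (interpreted) path when the assumption fails.
public class SpeculatedTrace {
    static class SideExit extends RuntimeException {}

    // Specialized fast path, compiled under the assumption flagSlot == true:
    // the branch on the flag has been removed from the loop body.
    static int runSpecialized(boolean flagSlot, int x) {
        if (!flagSlot) throw new SideExit();  // guard: speculation failed
        return x + 1;                         // fast path only
    }

    // Entry point: try the specialized trace, fall back on a side exit.
    static int execute(boolean flagSlot, int x) {
        try {
            return runSpecialized(flagSlot, x);
        } catch (SideExit e) {
            return interpretSlowPath(flagSlot, x);
        }
    }

    // General, unspecialized semantics the interpreter would execute.
    static int interpretSlowPath(boolean flagSlot, int x) {
        return flagSlot ? x + 1 : x - 1;
    }
}
```

The point of the guard is that the speculation is safe: the specialized code can only ever run under the condition it was compiled for.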

What about the case where method A contains a loop and calls method B in the loop. Method B also has a loop inside it. Perhaps like the following, in pseudo-Java code:
 
public int methodA(int a) {
    // complex way of calculating a^3
    int sum = 0;
    for (int i = 0; i < a; i++) {
        sum += methodB(a);
    }
    return sum;
}
 
public int methodB(int b) {
    // complex way of calculating b^2
    int sum = 0;
    for (int i = 0; i < b; i++) {
        sum += b;
    }
    return sum;
}
 
You would expect the system to detect the loop in B first and compile that. When B gets called again from A, you would expect the interpreter to re-enter the compiled code.
 
At some point, however, the system will detect the loop in A and then trace and compile that. When that happens, the trace starting in A would inline B, right? And while it’s tracing through the inlined B, does it just ignore the fact that there is already a compiled trace for the loop in B, unrolling it because it doesn’t return to the loop head in A? If the trace gets too long, because the loop in B might be much larger than in A, then the trace aborts. Is there a way to make the trace starting in A recognize that it has reached a spot where there is already an old trace in B, and that the right behavior might be to somehow incorporate that previous trace instead of completely unrolling the loop in B?
You hit the nail on the head. That’s exactly what we do :) We call this “nested trace trees” and it’s Michael Bebenita’s brainchild. In my original dissertation work I only traced through and compiled the inner loop; the rest of the code was interpreted. As long as the inner loop is a lot hotter than the outer code calling it, this still gives a decent speedup, but in certain cases it of course fails. Michael extended the approach as follows. The inner loop is usually hotter and will trigger a tree being recorded for it. Eventually the outer loop triggers a tree to be recorded starting at its own header. We follow the trace into the invoked method and then detect that we have reached a point where we already have a tree (the inner tree). Instead of following the inner tree (which, as you pointed out, we wouldn’t be able to record without excessive unrolling), we call it (literally call it, like a method call). There are actually two ways to do this call. Either we compile the outer tree and the inner tree together, teaching the inner tree to read its context values directly from the registers and spill locations where the outer tree holds them (we call this welding), or we spill all values the inner tree needs from the outer tree onto the stack and use a more generic invocation mechanism. The latter allows the machine code generated for the inner tree to be reused (saving code space), while the former approach is faster. The nested trace tree construct also permits a number of optimizations to be communicated between trees, e.g. whether values that a tree gets handed from an outer tree escape the tree, allowing global analysis and optimization.
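A sketch of what the nested-tree shape means for the methodA/methodB example above, assuming the generic-invocation variant (an actual VM would call compiled machine code; here each “tree” is just a Java method, so innerTree and outerTree are illustrative stand-ins):

```java
// Hypothetical sketch of nested trace trees: the outer tree does not
// unroll methodB's loop; it invokes the separately compiled inner tree
// like a method call. (Welding would instead share registers directly.)
public class NestedTrees {
    // Inner tree: the compiled trace for methodB's loop (computes b^2).
    static int innerTree(int b) {
        int sum = 0;
        for (int i = 0; i < b; i++) sum += b;
        return sum;
    }

    // Outer tree: the compiled trace for methodA's loop (computes a^3).
    // Where tracing reached the inner tree's anchor, it emitted a call
    // instead of continuing to record through the inner loop.
    static int outerTree(int a) {
        int sum = 0;
        for (int i = 0; i < a; i++) sum += innerTree(a);
        return sum;
    }
}
```

Keeping innerTree as a separate unit is what lets its machine code be reused by any other tree that reaches the same anchor.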
Otherwise it seems like:
  a. you could waste a lot of time trying to keep tracing the loop starting in A and have B blow out the length of your trace buffer. Since tracing is slower than simply interpreting, this would be a net loss in speed.
  b. if you try to unroll another loop fully, even if it doesn’t result in your trace buffer length being exceeded, it’s a good way to get very long traces, but the compiled speed of those traces may not be much faster than calling the compiled code in B anyway.
You are correct. Long traces and excessive “outerlining” (inlining of outer loop parts) rarely pay off, mostly because the outer loop parts are less hot than the inner paths, but now they compete for the same register resources as the inner paths. 
  c. it would then seem that loops that occur higher up in the call tree would get pretty large generally, which would bloat things up overall. either that or they wouldn’t get compiled at all because the traces would all be too long, which means you’d spend a lot more time doing interpreting.
Yes. We are currently playing with the parameters; never outerlining at all and only nesting trees seems to be almost as fast as outerlining.
After how many iterations of a loop do you start tracing? You probably don’t want to do it after 1 loop, but you probably don’t want to wait until 50 or 100 either. Are we talking small, single-digit numbers here, or 10 or 20 times through the loop?
We use 2-3 digit numbers to start a tree. The Tamarin Tracing team uses even smaller numbers (low 2 digits). It’s basically a function of how much overhead compilation incurs versus interpretation. Tamarin’s interpreter is really slow (though it is being worked on intensively), so they try to compile as early as possible.
You talked about tree folding in this recent blog post. Have you guys written anything about that, or is it too new? It would be interesting to understand the complexity of trying to fold the trees back. One of the nice things about the original trace tree algorithm was that it was relatively simple in concept: just trace a tree and then run a simplified form of SSA over it to compile it.
A paper on folding is planned for CGO, and we plan on submitting a paper on nested trace trees to PLDI. We were spectacularly unsuccessful selling our trace-compilation work at either venue in the past, though, so we will also publish the papers in parallel as technical reports. Just check the tech report section of my publications shortly after the respective deadlines. We will also have a submission for VEE. There are a lot of conferences coming up over the summer, and we have a lot of unpublished research piled up.
Does tree folding complicate your SSA analysis considerably?
No, it’s a pre-pass that happens right after a trace is added to a tree. It’s the only destructive, tree-modifying optimization. It starts with the old tree state and the new trace and produces a tree that merges traces as much as possible. That new tree then replaces the old tree. The representation is largely unchanged and the folding implementation doesn’t touch any of the backend code. The biggest issue with folding is that we have to run along (and side-exit from) most paths of a deeply branchy code area until everything has been folded, so we get quite a few compilation runs. The nasty 3D Cube example from SunSpider (a JavaScript benchmark) requires some 63 compiler runs for a fairly compact source code loop with nested if-statements inside. Our compiler is very fast, though, so this might be tolerable.
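To make the folding payoff concrete, here is a deliberately tiny toy model, not the actual algorithm: traces are modeled as lists of opcode names, and folding is reduced to sharing the common tail that two traces have after a branch rejoins, so the tail is stored once instead of being duplicated per trace. FoldingSketch, sharedSuffix, and foldedSize are invented names.

```java
import java.util.List;

// Toy model of the folding payoff: two traces through an if-then-else
// diverge at a guard but share the code after the join point. Folding
// stores that shared tail once instead of once per trace.
public class FoldingSketch {
    // Length of the longest common suffix of two traces.
    static int sharedSuffix(List<String> a, List<String> b) {
        int n = 0;
        while (n < a.size() && n < b.size()
                && a.get(a.size() - 1 - n).equals(b.get(b.size() - 1 - n))) {
            n++;
        }
        return n;
    }

    // Operations the folded tree holds for two traces: the shared tail
    // is counted once, the diverging prefixes are kept separately.
    static int foldedSize(List<String> a, List<String> b) {
        return a.size() + b.size() - sharedSuffix(a, b);
    }
}
```

With k nested if-statements in a loop body, an unfolded tree duplicates the join-point code along up to 2^k traces, which is why branchy code like the 3D Cube loop needs so many incremental folding/compilation runs before the tree converges.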
About 8 years ago, we looked at using Insignia’s GeodeVM in a commercial embedded project I was working on. Their VM was really quite fast. I remember them saying that they would try to identify hot pieces of code and would compile those to native code, but that they would do that on a sub-method basis. I think you mentioned Geode in one of the papers as related work. Do you know what they do versus your trace-tree technique?
I know about Insignia’s work only from marketing material and third-party gossip. From what I understand, Insignia uses a bytecode-to-native-code compiler to compile all of the bytecode to native code and then compresses the entire compilation result using gzip. The code is fast to execute but at the same time pretty compact, since it’s stored in a compressed format. In other words, it’s a Java VM for embedded systems, similar to my first implementation of a JVM trace compiler, but otherwise largely unrelated as far as the actual approach is concerned. If anyone from Insignia wants to correct me, please go ahead :)