Dave Roberts sent me a couple of questions about trace trees after he saw our work mentioned on Steve Yegge’s blog. I figured my answers might be interesting to more people than just Dave.
Most of your papers on trace-trees just describe the behavior of the technique with respect to a single trace tree. That is, as described, you basically find the first inner loop in the program and then trace and compile that, extending it as you find other paths that branch from it. That’s fine, but how does the system behave with respect to large programs that have many such loops? I’m assuming that you’re compiling loops in many methods across a large such program. Are you saving the trace results across all that activity? In other words, if you find a hot loop in method A, then when you finally exit that method and later find a hot loop in method B, do you throw away the work you did for method A and recreate it later, or are you building up bits of compiled code throughout the long-term program run? I assume the latter, but didn’t really know.
Our code initially runs through an interpreter in a bytecode format. In principle, each bytecode can be the anchor for a trace tree. The code is interpreted until a particular potential anchor becomes “hot” enough to host a tree. At that point we will record a trace and execute it and then subsequently try to extend the tree whenever we side-exit from it. We only grow the tree with traces that connect back to the same loop header the tree is anchored at, either through a direct path through the loop, or some path going through some outer loop. This is not always possible, i.e. if 2 loops are nested inside a loop, at which point we have to generate nested trees where an outer tree calls the inner trees (since we can’t easily form a path through the inner and outer loop at the same time, we would get stuck looping in the other inner loop and the trace would get very long). We use various abort conditions to restrict the maximum size of a trace we want to attach to a tree. With an unlimited trace length the entire program would eventually attach to each tree we start, which is counter-intuitive. We want each tree to represent one hot code region.
Assuming you’re building up bits of code long-term, are there any issues reentering the compiled code from the interpreter when you next execute method A? The papers always describe entering the compiled code as an act that happens right after you record the trace and compile it, but they don’t really talk about the issues of reentering the same code later. How is this done.
Yes, we compile the trace (or tree) and then re-enter it every time the interpreter runs across its anchor point. In our language (JVML) the bytecode is statically typed in that at each point in the program (so for each bytecode instruction) all variables (local variable slots and stack slots) have one unique type. The recorded and compiled trace is compiled with that fixed type distribution and knows how to pull the values from the interpreter stack and local variable frame. Constant values are detected by the optimized and directly embedded in the trace instead of reading them from the interpreter frame. One could even speculate on certain values. Once you see a boolean value in the local variable frame being true for N iterations we could just re-compile the tree assuming that vlaue is always true, and then insert a guard that ensures that this specialized tree is only executed if that slot really contains a boolean true value.
What about the case where method A contains a loop and calls method B in the loop. Method B also has a loop inside it. Perhaps like the following, in pseudo-Java code:public int methodA(int a) {// complex way of calculating a^3sum = 0;for (i = 0; i < a; i++) {sum += methodB(a);}return sum;}public int methodB(int b) {// complex way of calculating b^2sum = 0;for (i = 0; i < b; i++) {sum += b;}return sum;}You would expect the system to detect the loop in B first and compile that. When B gets called again from A, you would expect the interpreter to re-enter the compiled code.At some point, however, the system will detect the loop in A and then trace and compile that. When that happens, the trace starting in A would inline B, right? And while it’s tracing through the inlined B, does it just ignore the fact that there is already a compiled trace for the loop in B, unrolling it because it doesn’t return to the loop head in A? If the trace gets too long, because the loop in B might be much larger than in A, then the trace aborts. Is there a way to make the trace starting in A recognize that it has reached a spot where there is already an old trace in B, and the right behavior might be to somehow incorporate that previous trace instead of completely unrolling the loop in B.
Otherwise it seems like:a. you could waste a lot of time trying to keep tracing the loop starting in A and have B blow out the length of your trace buffer. Since tracing is slower than simply interpreting, this would be a net loss in speed.b. if you try to unroll another loop fully, even if it doesn’t result in your trace buffer length being exceeded, it’s a good way to get very long traces, but the compiled speed of those traces may not be much faster than calling the compiled code in B anyway.
c. it would then seem that loops that occur higher up in the call tree would get pretty large generally, which would bloat things up overall. either that or they wouldn’t get compiled at all because the traces would all be too long, which means you’d spend a lot more time doing interpreting.
After how many iterations of a loop do you start tracing? You probably don’t want to do it after 1 loop, but you probably don’t want to wait until 50 or 100 either. Are we talking small, single-digit numbers here, or 10 or 20 times through the loop?
You talked about tree folding in this recent blog post. Have you guys written anything about that, or is it too new? It would be interesting to understand the complexity of trying to fold the trees back. One of the nice things about the original trace tree algorithm was that it was relatively simple in concept: just trace a tree and then run a simplified form of SSA over it to compile it.
Does tree folding complicate your SSA analysis considerably?
About 8 years ago, we looked at using Insignia’s GeodeVM in a commercial embedded project I was working on. Their VM was really quite fast. I remember them saying that they would try to identify hot pieces of code and would compile those to native code, but that they would do that on a sub-method basis. I think you mentioned Geode in one of the papers as related work. Do you know what they do versus your trace-tree technique?
Pingback: Schrep’s Blog » Blog Archives » What can you do when your browser is 7 times faster?
Pingback: Firefox speeds up; has competitors looking for Rocket Boosts | Startup Meme
Pingback: 網路資訊雜誌 » 讓Firefox 3.1超快的祕密武器:TraceMonkey
Pingback: 兔眼看天下 » 讓Firefox 3.1超快的祕密武器:TraceMonkey
Pingback: TraceMonkey e IE 8 « dezone
Pingback: McColley.net » Blog Archive » Why TraceMonkey is Going to Blow Your Web Browsing Mind [Firefox]
Pingback: Why TraceMonkey is Going to Blow Your Web Browsing Mind - Softsaurus.org
Pingback: TraceMonkey: Another Curious George « Web Under Construction
Pingback: Snabbare Javascript, tacka Adobe