TraceMonkey vs V8

Update: I got a lot of comments on my post. I am trying to answer them as they come in, so check back after you leave a comment.

Brendan Eich and Mike Shaver have posted an update on our progress on TraceMonkey. There has been a lot of buzz around Google’s new Chrome browser and its V8 JavaScript VM. Some voices have claimed that V8 is several times faster than TraceMonkey. We did some head-to-head comparisons, and these claims don’t match our observations.

We used Apple’s SunSpider benchmarks for our tests. Depending on the OS and machine configuration, we are 1.18x to 1.28x faster than V8. Since V8 is currently available only for Windows, we didn’t perform any tests on Mac OS X and Linux, both of which we already support. Our latest builds also work on ARM, by the way.

I am sure you can derive different results by tweaking the benchmarks or designing entirely new custom benchmarks altogether, but since SunSpider has been used fairly intensively over the past two years to measure the evolution of JavaScript performance in Safari, Firefox, Opera, and IE, I think SunSpider is probably the most reliable cross-platform benchmarking tool at this point (which isn’t to say that it’s a particularly good one; it’s just the best we have right now).

Speaking of IE, our tests also indicate that we are about 15 times faster than IE 7, and about 5 times faster than the IE 8 beta, on the SunSpider aggregate scores.

If you want to give TraceMonkey a try, take a look at our nightly builds. You can enable the JIT in the about:config settings. The nightly builds are certainly not ready for widespread use yet, but we have improved stability significantly since our initial preview release. Firefox with TraceMonkey enabled is now my default browser, and I am writing this post with it.
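For reference, the pref to flip is (from memory; the name may still change before a release):

javascript.options.jit.content = true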

Tracing the Web

We have landed!

For the last two months I have been working with Mozilla on a just-in-time compiler for the JavaScript engine in Firefox, and a few hours ago this project (codenamed TraceMonkey) has landed in the main Firefox development tree.

TraceMonkey is a trace-based JIT compiler, and it pushes the envelope on JavaScript performance. On average, we speed up Apple’s popular SunSpider benchmarks by a factor of 4.6 over the last release of Firefox. The overall runtime of SunSpider improved by about 1.83x (parts of SunSpider exercise things like the regular expression engine, which is outside the scope of JIT compilation, hence the smaller overall speedup). For the SunSpider ubench suite, which focuses on core JavaScript language features, we achieve a speedup of 22x. Whichever metric you choose to apply, Firefox now has the fastest JavaScript engine in the world.

TraceMonkey Performance relative to Firefox 3.0

Mike Schroepfer put together a demo showing the real-world performance impact of TraceMonkey. You should also check out Brendan Eich’s and Mike Shaver’s blog post about TraceMonkey, as well as David Anderson’s updates on our 64-bit x86 port (yes, we do 64-bit!).

Dynamic Compilation with Traces

Traditional just-in-time compilers (like Sun’s HotSpot VM) are very similar in design and structure to static compilers (like GCC). They observe which methods get executed frequently and translate a method into native machine code once a certain threshold has been reached. While such methods often contain performance-critical parts (such as loops), they also often contain slow paths and non-loopy code that barely, if at all, contributes to the runtime of the method. A whole-method compiler, however, always has to analyze and translate the entire method, even if parts of it are not particularly “compilation-worthy”.

Trace-based compilation takes a very different approach. We monitor the interpretation of bytecode instructions by the virtual machine and scan for frequently taken backward branches, which are an indicator of loops in the underlying program. Once we identify such a loop start point, we follow the interpreter as it executes the program and record the sequence of bytecode instructions that get executed along the way. Since we start at a loop header, the interpreter will eventually return to this entry point once it has completed an iteration through the loop. The resulting linear sequence of instructions is what we call a trace.
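As a rough sketch of what this monitoring might look like (the identifiers and the threshold below are illustrative, not the actual TraceMonkey source), the interpreter bumps a counter every time it takes a backward branch:

#include <stdint.h>

enum { HOT_LOOP_THRESHOLD = 100 };   /* illustrative; see the FAQ below for real ballparks */

typedef struct Bytecode Bytecode;
struct Bytecode { int op; Bytecode *target; uint32_t hits; };

extern int  have_tree(Bytecode *anchor);        /* compiled tree for this anchor? */
extern void execute_tree(Bytecode *anchor);     /* enter the native code */
extern void start_recording(Bytecode *anchor);  /* begin recording a trace */

/* called by the interpreter whenever it takes a branch */
static void on_branch(Bytecode *pc, Bytecode *target) {
    if (target < pc) {                   /* backward branch: likely a loop header */
        if (have_tree(target))
            execute_tree(target);        /* re-enter previously compiled code */
        else if (++target->hits >= HOT_LOOP_THRESHOLD)
            start_recording(target);     /* the loop is hot: record a trace */
    }
}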

Traces represent a single iteration through a loop and can span multiple methods and program modules. If a function is invoked from inside a loop, we follow the function call and inline the instructions executed inside the called method. Function calls themselves are never actually recorded; we merely verify at runtime that the same conditions that caused that function to be activated still hold.

Trace Trees and Nested Trace Trees

TraceMonkey uses a particular flavor of trace-based compilation which I described in my dissertation: Trace Trees. Loops often consist of more than a single performance-relevant path, and Trace Trees allow us to capture all of these paths and organize them in a tree-shaped data structure which can be compiled trace by trace, yet produces a globally optimized result for the entire loop. To deal with nested loops, these trees can also be nested inside each other, with outer loop trees calling the inner loop trees.

Control-Flow Graph representation of a loop with a nested condition.

Trace Tree for the code shown in the Control-Flow Graph. Traces are recorded starting at the loop header (A) and connect back to A after completing an iteration.

A particular advantage of Trace Trees is that they always represent a loop, and thus enter-function-frame and leave-function-frame operations are always balanced as long as we stay on trace. Thus, we can completely optimize away the overhead of function calls. As long as we stay on trace (which in the case of a loop we usually do for many iterations), we don’t construct and destroy function frames; we simply execute the inlined trace we recorded. Function frames for inlined calls are constructed only if we detect that we have to leave the trace (for example, because we reached the end of the loop).
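To illustrate (conceptual C, not actual TraceMonkey output): for a loop like for (i = 0; i < n; i++) sum += square(i);, the compiled trace inlines square’s body, and no frame for square is ever built while we stay on trace:

/* Conceptual shape of the compiled trace; square() is fully inlined,
   so no function frame is constructed or destroyed on trace. */
int trace_body(int i, int n, int sum) {
loop:
    {
        int tmp = i * i;        /* body of square(i), inlined into the trace */
        sum += tmp;             /* accumulated in a machine register */
        i += 1;
        if (i < n) goto loop;   /* stay on trace for the next iteration */
    }
    return sum;                 /* leaving the trace: frames get rebuilt here if needed */
}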

Type Specialization

Trace Trees by their very nature are the result of a control-flow speculation. We speculate that loops tend to execute the same sequence of instructions over and over, which is usually true for many applications. In TraceMonkey we go a step further and also speculate on types.

JavaScript, in contrast to Java or C/C++, is a dynamically typed language. Variables are declared by name only, and their type is determined automatically once a value is assigned to them. Assigning a value of a different type to a variable changes the type of the variable on the fly to match the new value’s type. Executing such dynamically typed code has traditionally been fairly expensive. Type specialization eliminates much of this overhead.

Mike Shaver ran some benchmarks comparing the performance of simple loops written in JavaScript and C. Our JIT generates code that is roughly equivalent in performance to unoptimized C code (gcc -O0). We achieve this through aggressive type speculation: whenever we see a program assign only integers to a variable, for example, we specialize the generated machine code to hold that variable in an integer machine register. Guards in the traces ensure that the type doesn’t unexpectedly change, in which case we leave the trace and let the interpreter handle this (unexpected and often infrequent) case.
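Schematically (my notation; side_exit() is a placeholder for transferring control back to the interpreter, not a real API), the guard at the top of such an integer-specialized trace is just a cheap tag test before the value is unboxed into a register:

/* Sketch of a type guard for an integer-specialized variable. */
if (!JSVAL_IS_INT(v))            /* cheap tag-bit test on the boxed value */
    side_exit();                 /* speculation failed: interpreter takes over */
int sum_reg = JSVAL_TO_INT(v);   /* from here on, a plain machine integer */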

Type specialization removes much of the principal overhead associated with dynamically typed languages, and as we further improve our JIT we expect to get fairly close to the performance of statically typed languages such as Java or C.

Traces Everywhere

Our work on TraceMonkey was done in close collaboration with Adobe’s Tamarin Tracing project. In fact, TraceMonkey and Tamarin Tracing share the same core tracing backend (nanojit), which was contributed by Adobe. Adobe has been criticized in the last few months for the slow performance of Tamarin Tracing on untyped JavaScript code. However, Tamarin Tracing is first and foremost a JIT compiler for ActionScript, a typed JavaScript dialect. While Tamarin Tracing does run untyped code, it is not particularly optimized (yet) for this task.

TraceMonkey shows the full potential of Adobe’s nanojit backend when combined with a VM that was specifically designed and optimized for untyped JavaScript code (SpiderMonkey), and we expect much of our work to make its way into nanojit and Tamarin Tracing.

People

TraceMonkey was a tremendous effort by a large group of extremely talented people. Many of the recent advances in the areas of nested trees, aggressive type speculation, and runtime type inference are based on work done by graduate students at our research group at UC Irvine (Michael Bebenita, Mason Chang, Marcelo Contra, Gregor Wagner, and others). Our research was generously funded by a grant from the National Science Foundation (Principal Investigator Professor Michael Franz, Program Director Dr. Helen Gill) as well as grants and donations from Microsoft, Sun Microsystems, Intel, and, last but not least, Mozilla.

For me, it has been an amazing opportunity to spend the last two months here at Mozilla, turning our research ideas into actual product code. It’s hard to describe what it feels like to work alongside people like Brendan Eich, the inventor of JavaScript, or Mike Shaver, Mozilla’s new VP of Engineering and long-time JavaScript VM veteran. And even the interns around here are rocket scientists. David Anderson, one of Mozilla’s interns, wrote a complete 64-bit backend for us over the summer, making TraceMonkey the first JavaScript JIT capable of targeting x86-64.

TraceMonkey was developed in close collaboration with Edwin Smith and his Tamarin Tracing team at Adobe, and the web at large owes Adobe a great deal of gratitude for open-sourcing the Tamarin and Tamarin Tracing VMs, allowing Mozilla to build TraceMonkey on top of Tamarin Tracing’s nanojit backend. nanojit is a small and highly efficient trace-based just-in-time compiler backend that is language-agnostic and highly portable, and I think it has a bright future. It has just landed in Firefox, and hopefully we will see it pop up in a future release of Adobe’s Flash Player soon.

The Road Ahead

Landing in the central Firefox repository was a big step for us, but there is definitely a lot of work ahead of us. We are now at the point where we trace a lot of code in benchmarks and on the web, but there is a lot more coverage we will add over time.

Also, we are far from having exhausted the potential of trace compilation, and we plan to add many features and optimizations over the next few months. Our current speedups are just the beginning of what’s possible:

  • Improve register allocation and code generation in nanojit.
  • Runtime analysis of builtins (machine code) to reduce spill overhead of builtin calls (Gregor Wagner from UCI did some work on this recently.)
  • Bring performance of the ARM backend up to par with x86 and x86-64 backends and add a PowerPC backend (joint work with Adobe).
  • Add tree-recompilation and parallel compilation (based on our prior work on Parallel Dynamic Compilation, Mohammad Haghighat from Intel has been looking into this for nanojit).
  • Add more advanced trace optimization techniques like Tree Folding, Load Propagation and Escape Analysis.

Our goal is to eventually close the performance gap between JavaScript and traditional desktop languages, and we believe that for many applications this will be possible.

In parallel to our work with Mozilla on JavaScript performance, we also have a number of exciting tracing-related projects going on at UC Irvine. Mason Chang, one of our graduate students, is currently working with Adobe on the Tamarin Tracing VM, adding context threading and trace visualization. Michael Bebenita from UCI is currently interning with Sun Microsystems and has been making great progress integrating our Java trace compiler into Maxine, and we plan on switching to Maxine as our main research platform for Java compilation. Alexander Yermolovich (also UC Irvine) is working with Adobe this summer on an exciting project involving fast execution of rich dynamic content that Adobe will hopefully announce to the public soon.

If you are interested in their work, check out their blogs (linked from my website). For further reading material on traces and trace compilation, you can also take a look at my earlier blog posts on this topic.

Update: Mason Chang did some benchmarks comparing TraceMonkey to Apple’s WebKit/SquirrelFish VM. Looks like we are on average 2.5x faster than SquirrelFish (about 15% faster on total runtime).

I am looking for a tenure-track faculty position for Fall 2009 to continue my research on virtual machines, dynamic compilation and type-safe languages.

Trace-Trees FAQ

Dave Roberts sent me a couple of questions about trace trees after he saw our work mentioned on Steve Yegge’s blog. I figured my answers might be interesting to more people than just Dave. 

Most of your papers on trace-trees just describe the behavior of the technique with respect to a single trace tree. That is, as described, you basically find the first inner loop in the program and then trace and compile that, extending it as you find other paths that branch from it. That’s fine, but how does the system behave with respect to large programs that have many such loops? I’m assuming that you’re compiling loops in many methods across a large such program. Are you saving the trace results across all that activity? In other words, if you find a hot loop in method A, then when you finally exit that method and later find a hot loop in method B, do you throw away the work you did for method A and recreate it later, or are you building up bits of compiled code throughout the long-term program run? I assume the latter, but didn’t really know.

Our code initially runs through an interpreter in a bytecode format. In principle, each bytecode can be the anchor of a trace tree. The code is interpreted until a particular potential anchor becomes “hot” enough to host a tree. At that point we record a trace, execute it, and then subsequently try to extend the tree whenever we side-exit from it. We only grow the tree with traces that connect back to the same loop header the tree is anchored at, either through a direct path through the loop or through some path going through an outer loop. This is not always possible, e.g. if two loops are nested inside an outer loop, at which point we have to generate nested trees where an outer tree calls the inner trees (since we can’t easily form a path through both an inner loop and the outer loop at the same time; we would get stuck looping in the other inner loop and the trace would get very long). We use various abort conditions to restrict the maximum size of a trace we are willing to attach to a tree. With an unlimited trace length, the entire program would eventually attach to each tree we start, which is counterproductive: we want each tree to represent one hot code region.
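In pseudo-C, this recording policy boils down to something like the following (structure and names are mine, purely illustrative):

enum { MAX_TRACE_LENGTH = 4096 };   /* illustrative abort threshold */

typedef struct Bytecode Bytecode;
typedef struct Tree  { Bytecode *anchor; } Tree;
typedef struct Trace { Tree *tree; int length; } Trace;

extern void append(Trace *t, Bytecode *pc);
extern void abort_recording(Trace *t);
extern void finish_and_compile(Trace *t);

void record_instruction(Trace *t, Bytecode *pc) {
    append(t, pc);                   /* add the executed bytecode to the trace */
    if (++t->length > MAX_TRACE_LENGTH)
        abort_recording(t);          /* too long: keep each tree one hot region */
    else if (pc == t->tree->anchor)
        finish_and_compile(t);       /* loop closed: attach the trace to the tree */
}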

Assuming you’re building up bits of code long-term, are there any issues reentering the compiled code from the interpreter when you next execute method A? The papers always describe entering the compiled code as an act that happens right after you record the trace and compile it, but they don’t really talk about the issues of reentering the same code later. How is this done?

Yes, we compile the trace (or tree) and then re-enter it every time the interpreter runs across its anchor point. In our language (JVML), the bytecode is statically typed in the sense that at each point in the program (i.e. for each bytecode instruction) all variables (local variable slots and stack slots) have one unique type. The recorded trace is compiled with that fixed type distribution and knows how to pull the values from the interpreter stack and local variable frame. Constant values are detected by the optimizer and directly embedded in the trace instead of being read from the interpreter frame. One could even speculate on certain values: once we see a boolean value in the local variable frame being true for N iterations, we could re-compile the tree assuming that value is always true, and then insert a guard that ensures that this specialized tree is only executed if that slot really contains a boolean true value.
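As a hypothetical illustration of that last point (value speculation as just described, not something we currently emit), the re-compiled tree would start with a guard like:

/* Hypothetical value guard: the tree was re-compiled assuming the
   boolean in local slot 3 is always true.  TRUE_VALUE stands for the
   VM's boxed 'true'; side_exit() returns control to the interpreter. */
if (frame->locals[3] != TRUE_VALUE)
    side_exit();                 /* assumption violated: leave the tree */
/* ... code compiled with locals[3] folded to the constant true ... */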

What about the case where method A contains a loop and calls method B in the loop? Method B also has a loop inside it. Perhaps like the following, in pseudo-Java code:
 
public int methodA(int a) {
    // complex way of calculating a^3
    int sum = 0;
    for (int i = 0; i < a; i++) {
        sum += methodB(a);
    }
    return sum;
}

public int methodB(int b) {
    // complex way of calculating b^2
    int sum = 0;
    for (int i = 0; i < b; i++) {
        sum += b;
    }
    return sum;
}
 
You would expect the system to detect the loop in B first and compile that. When B gets called again from A, you would expect the interpreter to re-enter the compiled code.
 
At some point, however, the system will detect the loop in A and then trace and compile that. When that happens, the trace starting in A would inline B, right? And while it’s tracing through the inlined B, does it just ignore the fact that there is already a compiled trace for the loop in B, unrolling it because it doesn’t return to the loop head in A? If the trace gets too long, because the loop in B might be much larger than in A, then the trace aborts. Is there a way to make the trace starting in A recognize that it has reached a spot where there is already an old trace in B, and that the right behavior might be to somehow incorporate that previous trace instead of completely unrolling the loop in B?
You hit the nail on the head. That’s exactly what we do :) We call this “nested trace trees”, and it’s Michael Bebenita’s brainchild. In my original dissertation work I only traced through and compiled the inner loop; the rest of the code was interpreted. As long as the inner loop is a lot hotter than the outer code calling it, this still gives a decent speedup, but in certain cases it of course fails.

Michael extended this approach as follows. The inner loop is usually hotter and will trigger a tree being recorded for the inner loop. Eventually the outer loop triggers a tree to be recorded starting at its own header. We follow the trace inside the invoked method and then detect that we have reached a point where we already have a tree (the inner tree). Instead of following the inner tree (which, as you pointed out, we wouldn’t be able to record without excessive unrolling), we call it (literally call it, like a method call).

There are actually two ways to do this call. Either we compile the outer tree and the inner tree together, teaching the inner tree to directly read the values from the registers and spill locations in which the outer tree holds its context values (we call this welding), or we spill all values the inner tree needs from the outer tree onto the stack and then use a more generic invocation mechanism. The latter allows the machine code generated for the inner tree to be reused (saving code space), while the former approach is faster. The nested trace tree construct also permits a number of optimizations to be communicated between trees, e.g. whether values that a tree gets handed from an outer tree escape the tree, allowing global analysis and optimization.
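A sketch of the generic (non-welded) invocation mechanism might look like this (types and names are illustrative, not our actual code):

/* Generic nested-tree call: spill the inner tree's live-in values to a
   stack area, then call the inner tree's compiled entry point.  Welding
   instead compiles both trees together, so the inner tree reads the
   outer tree's registers and spill slots directly. */

enum { MAX_LIVE_IN = 64 };           /* illustrative limit */

typedef struct SideExit SideExit;
typedef SideExit *(*TreeEntry)(double *live_in);

SideExit *call_inner_tree(TreeEntry inner, double *outer_values, int n) {
    double spill[MAX_LIVE_IN];
    for (int i = 0; i < n; i++)
        spill[i] = outer_values[i];  /* hand live-ins from the outer tree over */
    return inner(spill);             /* reusable compiled code for the inner loop */
}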
Otherwise it seems like:
  a. you could waste a lot of time trying to keep tracing the loop starting in A and have B blow out the length of your trace buffer. Since tracing is slower than simply interpreting, this would be a net loss in speed.
  b. if you try to unroll another loop fully, even if it doesn’t result in your trace buffer length being exceeded, it’s a good way to get very long traces, but the compiled speed of those traces may not be much faster than calling the compiled code in B anyway.
You are correct. Long traces and excessive “outerlining” (inlining of outer-loop parts) rarely pay off, mostly because the outer-loop parts are less hot than the inner paths, yet they compete for the same register resources as the inner paths.
  c. it would then seem that loops that occur higher up in the call tree would generally get pretty large, which would bloat things up overall. Either that, or they wouldn’t get compiled at all because the traces would all be too long, which means you’d spend a lot more time interpreting.
Yes. We are currently playing with the parameters, and never outerlining at all and only nesting trees seems to be almost as fast as outerlining.
After how many iterations of a loop do you start tracing? You probably don’t want to do it after one iteration, but you probably don’t want to wait until 50 or 100 either. Are we talking small, single-digit numbers here, or 10 or 20 times through the loop?
We use two- to three-digit numbers to start a tree. The Tamarin Tracing team uses even smaller numbers (low two digits). It’s basically a function of how much overhead compilation incurs versus interpretation. Tamarin’s interpreter is really slow (though it is being worked on intensively), so they try to compile as early as possible.
You talked about tree folding in this recent blog post. Have you guys written anything about that, or is it too new? It would be interesting to understand the complexity of trying to fold the trees back. One of the nice things about the original trace tree algorithm was that it was relatively simple in concept: just trace a tree and then run a simplified form of SSA over it to compile it.
A paper on folding is planned for CGO, and we plan on submitting a paper on nested trace trees to PLDI. We were spectacularly unsuccessful selling our trace-compilation work at either venue in the past, though, so we will publish the papers in parallel as technical reports. Just check the tech report section of my publications shortly after the respective deadlines. We will also have a submission for VEE. There are a lot of conferences coming up over the summer, and we have a lot of unpublished research piled up.
Does tree folding complicate your SSA analysis considerably?
No, it’s a pre-pass that happens right after a trace is added to a tree. It’s the only destructive (tree-modifying) optimization. It starts with the old tree state and the new trace, and it produces a tree that merges the traces as much as possible. That new tree then replaces the old tree. The representation is largely unchanged, and the folding implementation doesn’t touch any of the backend code. The biggest issue with folding is that we have to run (side-exit) along most paths of a deeply branchy code area until everything has been folded, so we get quite a few compilation runs. The nasty 3D Cube example from SunSpider (a JavaScript benchmark) requires some 63 compiler runs for a fairly compact source code loop with nested if-statements inside. Our compiler is very fast, though, so this might be tolerable.
About 8 years ago, we looked at using Insignia’s GeodeVM in a commercial embedded project I was working on. Their VM was really quite fast. I remember them saying that they would try to identify hot pieces of code and would compile those to native code, but that they would do that on a sub-method basis. I think you mentioned Geode in one of the papers as related work. Do you know what they do versus your trace-tree technique?
I know about Insignia’s work only from marketing material and through third-party gossip. From what I understand, Insignia uses a bytecode-to-native-code compiler to compile all of the bytecode to native code and then compresses the entire compilation result using gzip. The code is fast to execute, but at the same time pretty compact, since it’s stored in a compressed format. In other words, it’s a Java VM for embedded systems, similar in that respect to my first implementation of a JVM trace compiler, but otherwise largely unrelated as far as the actual approach is concerned. If anyone from Insignia wants to correct me, please go ahead :)

Maxine VM

We spent the second day today with the Maxine VM group at Sun Labs, getting familiar with the new meta-circular Java VM they are building. Maxine is completely implemented in Java and makes extensive use of the Java language. Maxine bootstraps itself using its own optimizing compiler, which understands how to specifically optimize Maxine VM paradigms and idioms, allowing Bernd and his team to use the full wealth of Java’s language features without incurring excessive runtime cost. Things like pointers and addresses are abstracted as objects, and the optimizing compiler knows how to turn them into machine words eventually, so you get the best of both worlds: neat abstractions and raw performance. If you haven’t done so already, you should definitely check out Maxine VM. The source code will be released under an open source license some time soon (early June).

Recording Traces in SpiderMonkey

Michael B. and I met yesterday with Brendan and a bunch of other people from Mozilla to talk about the integration of a tracing engine (Tamarin) into SpiderMonkey (the JavaScript VM in Firefox). SpiderMonkey has been highly optimized over the past decade, in particular in the area of its memory and object model (e.g. the property tree) and the interpreter itself (lots of specialized fat opcodes to reduce dispatch overhead). Thus, it seems a good idea to retain the SpiderMonkey interpreter and JavaScript source-to-opcode translator and add a tracer to the SpiderMonkey interpreter, instead of throwing away all the optimizations that went into SpiderMonkey over time.

While very well suited for fast interpretation, SpiderMonkey’s fat opcodes are not well suited as the intermediate representation we want to record. Instead, for the IR we want a much more low-level representation that exposes all the type checking and specialization that happens inside the fat opcodes. Take SpiderMonkey’s JSOP_ADD opcode, for example:
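The actual implementation is too long to quote in full; heavily condensed and paraphrased (this is not the verbatim SpiderMonkey source), its shape is roughly:

/* JSOP_ADD, condensed and paraphrased -- not the verbatim source */
case JSOP_ADD:
    rval = FETCH_OPND(-1);
    lval = FETCH_OPND(-2);
    /* (E4X/xml special case elided -- the "nasty xml hack") */
    VALUE_TO_PRIMITIVE(cx, lval, JSTYPE_VOID, &lval);
    VALUE_TO_PRIMITIVE(cx, rval, JSTYPE_VOID, &rval);
    if (JSVAL_IS_STRING(lval) || JSVAL_IS_STRING(rval)) {
        /* convert both operands to strings and concatenate */
        str = js_ConcatStrings(cx, js_ValueToString(cx, lval),
                                   js_ValueToString(cx, rval));
        STORE_OPND(-1, STRING_TO_JSVAL(str));
    } else {
        /* convert both operands to doubles and add */
        VALUE_TO_NUMBER(cx, lval, d1);
        VALUE_TO_NUMBER(cx, rval, d2);
        d = d1 + d2;
        STORE_NUMBER(cx, -1, d);
    }
    break;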

Apart from the nasty xml hack at the top, what this opcode mostly does is walk along a chain of type checks and conversions and then perform the actual addition of two numbers, or a string concatenation in case either operand is a string.

SpiderMonkey actually always performs a double addition and doesn’t specialize to integer additions, but even without this optimization, JSOP_ADD contains a large number of specialization decisions that must be exposed at the IR level in order for a tracer to be able to eliminate this overhead.

Since the interpreter code makes heavy use of macros to abstract certain intrinsic operations, we could use those very abstractions as IR instructions. FETCH_OPND, for example, fetches an operand from the stack (essentially it returns “sp[n]”). This would be an ideal IR instruction, since it’s low-level enough to be compiled (well, actually it’s a no-op for the compiler, since it turns into a register operation).

Ignoring for a moment the ugly xml code in between, the next operation is VALUE_TO_PRIMITIVE, which is actually a composite macro that itself calls a bunch of low-level macros. We don’t want to touch such composite macros; instead, we are interested in the low-level macros these macros eventually invoke.

Here we can see that the macro invokes JSVAL_IS_PRIMITIVE to check whether the value is already a primitive, in which case the value is returned (in the last parameter, vp); otherwise the default value is obtained from the object.

JSVAL_IS_PRIMITIVE is again a nice primitive to record, since it basically just checks a few bit patterns.
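From memory (the exact constants may differ between SpiderMonkey versions, so treat this as an approximation rather than the real jsapi.h), the jsval tag encoding makes this a pair of cheap bit tests:

/* Approximate jsval tagging; paraphrased, not the verbatim header. */
#define JSVAL_TAGMASK         0x7              /* tag lives in the low bits */
#define JSVAL_OBJECT          0x0              /* objects are tagged 0 */
#define JSVAL_NULL            ((jsval)0)       /* null is the zero object */
#define JSVAL_TAG(v)          ((v) & JSVAL_TAGMASK)
#define JSVAL_IS_OBJECT(v)    (JSVAL_TAG(v) == JSVAL_OBJECT)
#define JSVAL_IS_NULL(v)      ((v) == JSVAL_NULL)
#define JSVAL_IS_PRIMITIVE(v) (!JSVAL_IS_OBJECT(v) || JSVAL_IS_NULL(v))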

The next question is how we refactor the interpreter so that we can insert all the recording hooks we need.

In our JamVM-based trace compiler, we hooked into the interpreter by modifying the macros JamVM uses to implement each bytecode. Similarly, in SpiderMonkey we could try to hook into the low-level macros (primitives). In FETCH_OPND, for example, we could insert a call to the recorder right after reading the value from the stack:

#define FETCH_OPND(n)   (record_FETCH_OPND(n), sp[n])

This approach works well for JamVM, where each instruction we record performs a set of well-defined actions on the stack and local variables, and the recording functions can track those actions on an abstract stack and local variable array. A JVML push instruction, for example, pushes a value onto the operand stack. The recording function (let’s call it record_PUSH) performs the same action on the abstract stack, but instead of the value, it pushes onto the abstract stack the address of the IR node that generated the value PUSH is putting on the stack.
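Concretely, the recorder’s shadow state might look like this (a sketch; the names are mine):

/* The abstract stack mirrors the operand stack, but holds pointers to
   the IR instructions that produced each value rather than the values. */

enum { MAX_STACK = 1024 };           /* illustrative limit */

typedef struct IRNode IRNode;

static IRNode *abstract_stack[MAX_STACK];
static int     abstract_sp = 0;

/* the interpreter executed PUSH; mirror it with the producing IR node */
static void record_PUSH(IRNode *producer) {
    abstract_stack[abstract_sp++] = producer;
}

/* the interpreter consumed a stack value; look up which IR node made it */
static IRNode *record_POP(void) {
    return abstract_stack[--abstract_sp];
}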

In the case of FETCH_OPND, however, we don’t have a complete overview of the state, because the value FETCH_OPND reads is returned and then assigned to a variable (e.g. rval). While we could track rval and lval, this would also require modifying FETCH_OPND to indicate which variable it writes to; otherwise, the recording function can’t tell which abstract equivalent of the variable it has to update.

I think it makes more sense to slightly change the interface of FETCH_OPND: instead of returning a value, the output variable gets passed in as the last argument (SpiderMonkey already does this for most macros):

#define FETCH_OPND(n, x) x = sp[n];

FETCH_OPND(-1, rval)

FETCH_OPND(-2, lval) 

The code generated by this macro is still identical to the original approach, but we can now hook in a recorder much more easily. The recording version of FETCH_OPND would look like this:

#define FETCH_OPND(n, x) x = sp[n]; \
  record_FETCH_OPND(x, &x);

Essentially, we just append the recorder invocation to the existing macro. In addition, we can now record the result of the operation (x) as well as the address of the location we store to (&x), which is essentially the name of the value (think of SSA names here). Using this name, we can uniquely identify any future use of the value through a hash table mapping names to the instructions that generated them (in this case, that instruction would be identified by consulting the abstract stack inside record_FETCH_OPND and figuring out who put the value onto the stack).

JSVAL_IS_PRIMITIVE can be transformed similarly; however, in contrast to FETCH_OPND, it’s actually not located in jsinterp.c. Instead, it’s buried somewhere deep inside jsapi.h. To make this work well, we would have to gather all primitives in one place where we can annotate them with recording code. Some common naming scheme for primitives would be nice, too. There is no need to actually remove the JSVAL_IS_PRIMITIVE code from jsapi.h, but jsinterp.c should use a second set of macros that map the primitives to their implementation:

jsprim.h:

#define PRIM_FETCH_OPND(n, x) \
  FETCH_OPND(n, x)

#define PRIM_JSVAL_IS_PRIMITIVE(v, x) \
  x = JSVAL_IS_PRIMITIVE(v)

To generate the tracing equivalents (jstrace.h), we could probably resort to some automated code manipulation, e.g. translating every primitive by appending a call to a recording macro with an identical signature:

jstrace.h:

#define PRIM_FETCH_OPND(n, x) FETCH_OPND(n, x); \
  RECORD_FETCH_OPND(n, x)

#define PRIM_JSVAL_IS_PRIMITIVE(v, x) x = JSVAL_IS_PRIMITIVE(v); \
  RECORD_JSVAL_IS_PRIMITIVE(v, x)

Transforming jsinterp.c like this will require a major code shakeup; however, I think there is an argument to be made that this would actually improve readability and increase modularity, even without tracing in mind. Also, if done right, this entire macro refactoring business should not affect the actual generated code at all. One could even do an automated comparison of the generated code every time an opcode is rewritten to use primitives, since no code has to change in the non-recording case despite all the macro magic at the source level.

I will try to hack this up for a few opcodes to see what it looks like.

Bernd’s Challenge

Michael and I spent the day at Sun Labs in Menlo Park. Bernd and his group are currently porting Maxine to Mac OS X, among other platforms, and ran into the horror that is Mac OS X’s/Darwin’s ptrace implementation. When the Maxine VM image boots up, a debugger (the inspector) uses ptrace to connect to it and observe the address at which the VM image is loaded (using mmap). Most sane OSes support some sort of system call tracing, which makes this very trivial.

Mac OS X is a different story. ptrace is totally broken on Mac OS X, and for most functionality (like peeking/poking the subject address space or reading the contents of registers) one has to resort to using Mach. Even worse, Mac OS X’s kernel (xnu) doesn’t support any form of system call tracing (except ktrace, which writes system call info directly to a file). Bernd mentioned a bronze statue to be placed at Sun Labs for the person who gets Maxine’s ptrace monitoring code to work on Mac OS X ;) Here is my entry for that contest: syscall.c hello.c

As mentioned before, Darwin doesn’t support system call tracing, so we simply single-step through the code. This is of course pretty slow (ballpark factor 10,000), but Maxine’s startup code is pretty compact, so it should still be manageable. hello.c is a test case that allocates 0x88000 bytes using mmap; syscall.c traces through it in about a second. The mmap is recognized by scanning for RAX=0xc5 (the mmap syscall number) and a sufficiently large size argument (Maxine’s VM image is very large, uniquely identifying the mmap call as the intended one). If both conditions hold, we set a flag and check the result of the mmap syscall after we step over it: RAX then contains the address mmap mapped the file to. Since we are still in single-stepping mode, the client program (the Maxine loader) is suspended, and using the image address, the image can be analyzed and the proper breakpoints can be set to take control of the subject VM. Once everything is ready to go, PT_CONTINUE can be used to resume execution (until a breakpoint is hit).
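The core of syscall.c boils down to a single-stepping loop along these lines (a simplified sketch: error handling and the Mach task/thread setup are omitted, and BIG_MAPPING is a made-up threshold, not a value from the real code):

#include <stdio.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <mach/mach.h>

#define SYS_MMAP     0xc5                  /* BSD syscall number for mmap */
#define BIG_MAPPING  (64 * 1024 * 1024)    /* "sufficiently large" mapping */

void step_until_mmap(pid_t pid, thread_act_t thread) {
    x86_thread_state64_t state;
    mach_msg_type_number_t count;
    int status, pending_mmap = 0;

    for (;;) {
        ptrace(PT_STEP, pid, (caddr_t)1, 0);    /* execute one instruction */
        waitpid(pid, &status, 0);
        count = x86_THREAD_STATE64_COUNT;       /* registers come from Mach, */
        thread_get_state(thread, x86_THREAD_STATE64,  /* not from ptrace */
                         (thread_state_t)&state, &count);
        if (pending_mmap) {                     /* we just stepped over mmap */
            printf("image mapped at 0x%llx\n",
                   (unsigned long long)state.__rax);
            break;                  /* now set breakpoints and PT_CONTINUE */
        }
        /* about to enter mmap with a large size argument (2nd arg, RSI)?
           (mask off the syscall-class bits in the high part of RAX) */
        if ((state.__rax & 0xffffff) == SYS_MMAP && state.__rsi >= BIG_MAPPING)
            pending_mmap = 1;
    }
}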

Note: Make sure to run syscall.c either as the superuser or with the file assigned to the proper group. The latest Darwin kernel is picky about using Mach syscalls from unprivileged executables.