Archive for May, 2008

Maxine VM

May 22, 2008

We spent the 2nd day today with the Maxine VM group at Sun Labs to get familiar with the new meta-circular Java VM they are building. Maxine is completely implemented in Java, and makes extensive use of the Java language. Maxine bootstraps itself using its own optimizing compiler which understands how to specifically optimize Maxine VM paradigms and idioms, allowing Bernd and his team to use the full wealth of Java’s language features without incurring excessive runtime cost. Things like Pointers and Addresses are abstracted as objects, and the optimizing compilers knows how to turn them into machine words eventually, so you get the best of both worlds: neat abstractions, and raw performance. If you haven’t done so already, you should definitively check out Maxine VM. The source code will be released under an open source license some time soon (early June). 

Cell VM

May 22, 2008

We finally managed to publish a paper on CellVM, our Java VM running Java threads on the Synergistic Processing Units (SPUs) of the Cell Processor. The paper will be presented at the 2008 Workshop on Cell Systems and Applications.

Recording Traces in Spidermonkey

May 21, 2008

Michael B. and I met yesterday with Brendan and a bunch of other people from Mozilla to talk about the integration of a tracing engine (Tamarin) into Spidermonkey (the JavaScript VM in Firefox). Spidermonkey has been highly optimized over the past decade, in particular in the area of its memory and object model (i.e. property tree), and the interpreter itself (lots of specialized fat opcodes to reduce dispatch overhead). Thus, it seems a good idea to retain the Spidermonkey interpreter and JavaScript source-to-opcode translator and add a tracer to the Spidermonkey interpreter instead of throwing away all the optimizations that went into Spidermonkey over time.

While very well suited for fast interpretation, Spidermonkey’s fat opcodes are not well suited for the intermediate representation we want to record. Instead, for the IR we want to have a much more low level representation that exposes all the type checking and specialization that happens in the fat opcodes. Take Spidermonkey’s JSOP_ADD opcode, for example:

Apart from the nasty xml hack at the top, what this opcode mostly does is go along a chain of a bunch of type checks and conversions, and then perform the actual addition of 2 numbers, or a string concatenation in case either element is a string.

SpiderMonkey actually always performs a double addition and doesn’t specialize to integer additions, but even without this optimization JSOP_ADD contains a large amount of specialization decisions that must be exposed at the IR level in order for a tracer to be able to eliminate this overhead.

Since the interpreter code makes heavy use of macros to abstract certain intrinsic operations, we could use those very abstractions as IR instructions. FETCH_OPND, for example, fetches an operand from the stack (essentially returns “sp[n]“). This would be an ideal IR instruction since its low level enough to be compiled (well, actually its a no-op for the compiler since this turns into a register operation).

Ignoring for a moment the ugly xml code in between, the next operation is VALUE_TO_PRIMITIVE, which is actually a composite macro that itself calls a bunch of low-level macros. Such composite macros we don’t want to touch, instead we are interested in the low-level macros these macros eventually invoke.

Here we can see that the macro invokes JSVAL_IS_PRIMITIVE to check whether the value is already a primitive in which case the value is return (in the last parameter vp), otherwise the default value is obtained from the object.

JSVAL_IS_PRIMITIVE is again a nice primitive to record since it basically just checks for some bit patterns to apply. 

The next question is how do we refactor the interpreter to be able to insert all the recording hooks we need.

In our JamVM-based trace compiler we hooked into the interpreter by modifying the macros JamVM uses to implement each bytecode. Similarly, in Spidermonkey we could try to hook into the low-level macros (primitives). In FETCH_OPND, for example, we could insert a call to the recorder right after reading the value from the stack:

#define FETCH_OPND(n)   (record_FETCH_OPND(n), sp[n])

This approach works well for JamVM where each instruction we record performs a set of defined actions on the stack and local variables, and the recording functions can track those actions on some abstract stack and local variable array. A JVML push instruction, for example, pushes a value on the operand stack. The recording function (lets call it record_PUSH) performs the same action on the abstract stack, but instead of the value it pushes the address of the IR node on the abstract stack that generated the value PUSH is putting on the stack.

In case of FETCH_OPND, however, we don’t have a complete overview of the state, because the value FETCH_OPND reads is returned and then assigned to a variable (i.e. rval). While we could track rval and lval, this would also require modifying FETCH_OPND to indicate which value they write to, otherwise the recording function can’t tell which abstract equivalent of the variable it has to update. 

Instead, I think it makes more sense to slightly change the interface of FETCH_OPND. Instead of returning a value, the output variable gets passed in as last argument (SpiderMonkey already does this for most macros):

#define FETCH_OPND(n, x) x = sp[n];

FETCH_OPND(-1, rval)

FETCH_OPND(-2, lval) 

The code generated by this macro is still identical to the original approach, but we can now hook in a recorder much more easily. The recording version of FETCH_OPND would look like this:

#define FETCH_OPND(n, x) x = sp[n]; \

record_FETCH_OPND(x, &x);

Essentially, we just append the recorder invocation to the existing macro. In addition, we can now record the result of the operation (x), as well as the address of the location we store to (&x), which is essentially the name of the value (think of SSA names here). Using this name we can uniquely identify any future use of the value using a hash table of name to instruction that generated the value (in this case that instruction would be identified by consulting the abstract stack inside record_FETCH_OPND and figure out who put the value onto the stack).

JSVAL_IS_PRIMITIVE can be transformed similarly, however, in contrast to FETCH_OPND its actually not located in jsinterp.c. Instead, its buried somewhere deep inside of jsapi.h. To make this work well we would have to gather all primitives in one place where we can annotate them with recording code. Some common naming for primitives would be nice too. There is no need to actually remove the JSVAL_IS_PRIMITIVE code from jsapi.h, but jsinterp.h should use a second set of macros that map the primitives to their implementation:

jsprim.h:

#define PRIM_FETCH_OPND(n, x) \

  FETCH_OPND(n, x)

#define PRIM_JSVAL_IS_PRIMITIVE(v, x) \

  x = JSVAL_IS_PRIMITIVE(v) 

To generate the tracing equivalent (jstrace.h) we could probably resort to some automated code manipulation, i.e. translate every primitive by appending a call to a recording macro with identical signature:

jstrace.h:

#define PRIM_FETCH_OPND(n, x) FETCH_OPND(n, x); \

  RECORD_FETCH_OPND(n, x)

#define PRIM_JSVAL_IS_PRIMITIVE(v, x) x = JSVAL_IS_PRIMITIVE(v);

  RECORD_JSVAL_IS_PRIMITIVE(v, x)

Transforming jsinterp.c like this will require a major code shakeup, however I think there is an argument to be made that this would actually improve readability and increase modularity even without tracing in mind. Also, if done right this entire macro refactoring business should not affect the actual underlying code at all. One could even do an automated code comparison of the code every time an opcode is rewritten to use primitives, since no code has to change in the non-recording case despite all the macro magic at the source level.

I will try to hack this up for a few opcodes to see what it looks like.

Bernd’s Challenge

May 19, 2008

Michael and I spent the day at Sun Labs in Menlo Park. Bernd and his group are currently porting Maxine to MacOS X amongst others, and ran into the horror that is Mac OS X’s/Darwin’s ptrace implementation. When the Maxine VM image boots up, a debugger (inspector) uses ptrace to connect to it and observe the address at which the VM image is loaded (using mmap). Most sane OS’s support some sort of system call tracing, which makes since very trivial.

Mac OS X is a different story. ptrace is totally broken on Mac OS X, and for most functionality (like peek/poke the subject address space or reading the content of registers) one has to resort to using Mach. Even worse, Mac OS X’s kernel (xnu) doesn’t support any form of system call tracing (except ktrace, which writes system call info directly to a file). Bernd mentioned a bronze statue to be placed at Sun Labs for the person who gets Maxine’s ptrace monitoring code to work on Mac OS X ;) Here is my entry for that contest: syscall.c hello.c

As mentioned before, Darwin doesn’t support system call tracing so we simply single step through the code. This is of course pretty slow (ballpark factor 10,000), but Maxine’s startup code is pretty compact so it should be still manageable. hello.c is a test case that allocates 0×88000 bytes using mmap. syscall.c traces through it in about a second. The mmap is recognized by scanning for RAX=0xc5 (mmap syscall) and a sufficiently large size for the mmap (Maxine’s VM image is very large, uniquely identifying the mmap call as the intended one). If both conditions hold we set a flag and check the result of the mmap syscall after we step over it. RAX contains the address mmap mapped the file to. Since we are still in single stepping mode, the client program (Maxine loader) is suspended and using the image address the image can be analyzed and the proper breakpoints can be set to take control of the subject VM. Once everything is ready to go PT_CONTINUE can be used to resume execution (until a breakpoint is hit).

Note: Make sure to run syscall.c as either super user, or assign the file to the proper group. The latest Darwin kernel is picky about using Mach syscalls from unprivileged executables.

 

Assembler 2.0

May 12, 2008

Our backend is currently using the Maxwell Assembler framework. Its a very neat system that generates an assembler and disassembler for a number of platforms (IA32 and AMD64/IA32E, SPARC, etc). Maxwell uses a generative approach and uses a built-in description to generate a huge (700k+ bytecode) class file that contains one method for each instruction in all its possible formats. To assembler instructions the backend then invokes one of those 65,000+ methods to emit the corresponding machine instruction.

This approach has a number of advantages. First of all, Maxwell Assembler is very strictly typed. Its almost impossible to assembler an instruction with incorrect register parameters since registers are typed by register type (GeneralRegister vs XMMRegister) but also by use (IndirectRegister vs vanilla GeneralRegister). This is nice and avoids a lot of issues with mixed-up order of arguments (i.e. what do I specify first, the base register or the index register?).

Maxwell Assembler’s approach is also pretty fast since each method is highly specialized and basically just shuffles and shifts bits around and emits them to a stream.

Anyway, I think its time to get rid of it and upgrade to something that works better in our specific environment. Why?

Well, two reasons: Maxwell is HUGE, and DOG SLOW. At the first glimpse, the latter slightly contradicts what I wrote two sentence ago, but its really a combination of the former (Maxwell is a ton of code, 2MB compressed bytecode) and the fact that we use it in a dynamically setting. The Maxwell/Maxine system Maxwell Assembler was designed for is a statically precompiled virtual machine, and it doesn’t have to dynamically load and link the assembler framework. We do since our VM runs on top of Hotspot. And thats where we fell out of love with Maxwell. It takes a good 750ms on a 2GHz Core 2 Duo to instantiate the Maxwell Assembler for the first time. That really kills our performance numbers for benchmarks with short runtime.

As Michael suggested we could try to strip out the 64500 or so methods of the RawAssembler we don’t use and keep Maxwell, but I never much liked the one method per instruction and format interface and I will try to explain why.

The x86 architecture is crazy irregular, but it still has some significant regularity to it that can be exploited by a backend to reduce code duplication in the code generator itself. For example, our current backend has to implement the following idiom repeatedly for all possible arithmetic operations:

if (is-immediate(right(x)))
    if (is-8bit-value(immediate(right(x))))
         emit(register(left(x)), (byte) immediate(right(x))))
    else
         emit(register(left(x)), immediate(right(x))))
else
    if (is-register(left(x)))
        if (is-register(right(x)))
            emit(register(left(x)), register(right(x)))
        else
            if (is-8bit-value(offset(right(x)))) 
                emit(register(left(x)), RBP, (byte)offset(right(x)))
            else 
                emit(register(left(x)), RBP, offset(right(x)))
    else
        …. 

Actually, in reality the idiom is even more complicated because the left side is not always a register. Instead for each case we have to check whether its in memory and whether the offset fits into 8 bits. I also left out the second part of the last if cascade (if the left side is in memory). Anyway, the point is that using Maxwell requires huge ugly nested if cascades to select the proper matching method to emit the instruction with. And since each instruction has its separate methods for that, this code has to be duplicated for each IR instruction (i.e. ADD and SUB).

This entire exercise is particularly pointless because at the machine level all this address encodings are expressed using the same ModRM/SIB encoding. So instead of generating thousands of individual methods for them, I propose using a high-level interface instead that delays the decision what operands to use (memory or register) until after we pick what instruction we want to have. We do this using an Operand interface:

public interface Operand {
    public OperandType getOperandType();
    public Address getAddress();
    public Register getRegister();
}

The emit functions now can take generic arguments and decide what to do with them:

void emit(CodeBuffer code, int rm_r, int r_rm, DataSize size, Operand left, Operand right) {
    if (right.getOperandType() == OperandType.Register)
        emitModRMSIBFormat(code, rm_r, size, left, right.getRegister());
    else {
        assert left.getOperandType() == OperandType.Register;
        emitModRMSIBFormat(code, r_rm, size, right, left.getRegister());
    }
}

Two things are noteworthy here. First, this one emitter function covers the same functionality as several thousand Maxwell methods, because we can deal with all combinations of Operands (except immediate, those we encode separately and I explain in a minute why). Second, and this is the downside, we are no longer as strictly typed as Maxwell. Its possible to pass two operands to this method that both are addresses, at which point an assertion is raised. Maxwell would have caught this at compile time.
With these unified emitter functions we can now assembler instructions from templates:

ALU_Template ADD  = new ALU_Template(0×05, 0×81, 0×83, 0, 0×01, 0×03);
ALU_Template SUB  = new ALU_Template(0×2D, 0×81, 0×83, 5, 0×29, 0×2B);
ALU_Template AND  = new ALU_Template(0×25, 0×81, 0×83, 4, 0×21, 0×23);
ALU_Template OR   = new ALU_Template(0×0D, 0×81, 0×83, 1, 0×09, 0×0B);
ALU_Template XOR  = new ALU_Template(0×35, 0×81, 0×83, 6, 0×31, 0×33);
ALU_Template CMP  = new ALU_Template(0×3D, 0×81, 0×83, 7, 0×39, 0×3B);

For added convenience, our register enums (GeneralRegister and XMMRegister) also implement the Operand interface and of course always indicate that they are register operands. This way we can directly pass registers to the emitter functions, without having to box them into an Operand object. Here are a couple of examples on how our assembler can be used:

CodeBuffer code = new IndirectCodeBuffer(0×1000, 2048);
ADD.emit(code, DataSize.LONG, RAX, 0×1234);
ADD.emit(code, DataSize.INTnew Address(FS, RBP, R12, SCALE_4, -50), RBP);
Label label2 = new Label();
code.setLabel(label2);
Label label = new Label();
ADD.emit(code, DataSize.LONG, new Address(label), RAX);
JC.emit(code, Condition.NP, BranchDistance.BYTE, label2); /* short branch */
MOV.emit(code, DataSize.LONG, R12, 0×1234567822345678L);
MOV.emit(code, DataSize.INT, new Address(RSP), 0×12345678);
PUSH.emit(code, 0×12);
PUSH.emit(code, 0×1234);
PUSH.emit(code, 0×12345678);
Label label3 = new Label();
code.setLabel(label3);
SHL.emit(code, DataSize.LONG, RAX, 1);
/* if virtual registers implement the Operand interface they can be used directly */
VirtualRegister vr1 = new VirtualRegister(GeneralRegister.RCX);
VirtualRegister vr2 = new VirtualRegister(-40); /* [RSP-40] */
SUB.emit(code, DataSize.LONG, vr1, vr2); /* SUB RCX, [RSP-40] */
JMP.emit(code, label3);
JMP.emit(code, vr2);
CALL.emit(code, label3);
CALL.emit(code, vr2);
code.setLabel(label); /* labels are generate like in maxwell */
code.emit64(0);
code.resolve();

While we can directly pass virtual register objects to the assembler now, the backend still has to understand and compensate for irregularities such as not being able to assemble with two addresses as operands. Trying to hide this from the backend would be pointless since high-level decisions have to be made how to deal with it (higher level than an assembler anyway).

But as you can see its a fairly concise interface that generates highly compact code (optimal encoding of 8/32bit immediates and offsets), is almost as safe to use as Maxwell, and all that at an expense of about 800 lines of Java code.

Last but not least, our assembler is also very fast, and it can write directly to memory instead of to a byte stream (DirectCodeBuffer). We also split the CodeBuffer from the Assembler itself, so we don’t even have to instantiate the Assembler again just because we want to emit code to different code buffers.

As a final word of defense for Maxwell, it would be possible to implement such an interface on top of Maxwell’s million-method API (and Bernd did suggest this), but at least for our system that doesn’t solve the dynamic loading issue, so I think replacing Maxwell is the way to go.

We will keep the Maxwell disassembler for the time being though, because its output is beautiful and its slowness irrelevant. I will add a fast disassembler to the code base eventually though because we need to perform some static analysis on machine code and Maxwell’s disassembler is way too slow for that as well (unfortunately).