Assembler 2.0

Our backend is currently using the Maxwell Assembler framework. It's a very neat system that generates an assembler and disassembler for a number of platforms (IA32 and AMD64/IA32E, SPARC, etc). Maxwell uses a generative approach: from a built-in description it generates a huge (700k+ of bytecode) class file that contains one method for each instruction in all its possible formats. To assemble an instruction, the backend then invokes one of those 65,000+ methods to emit the corresponding machine instruction.

This approach has a number of advantages. First of all, the Maxwell Assembler is very strictly typed. It's almost impossible to assemble an instruction with incorrect register parameters, since registers are typed not only by register kind (GeneralRegister vs XMMRegister) but also by use (IndirectRegister vs vanilla GeneralRegister). This is nice and avoids a lot of issues with mixed-up argument order (e.g. what do I specify first, the base register or the index register?).

Maxwell Assembler’s approach is also pretty fast since each method is highly specialized and basically just shuffles and shifts bits around and emits them to a stream.

Anyway, I think it's time to get rid of it and upgrade to something that works better in our specific environment. Why?

Well, two reasons: Maxwell is HUGE, and DOG SLOW. At first glance the latter seems to contradict what I wrote two sentences ago, but it's really a combination of the former (Maxwell is a ton of code, 2MB of compressed bytecode) and the fact that we use it in a dynamic setting. The Maxwell/Maxine system the Maxwell Assembler was designed for is a statically precompiled virtual machine, and it doesn't have to dynamically load and link the assembler framework. We do, since our VM runs on top of Hotspot. And that's where we fell out of love with Maxwell. It takes a good 750ms on a 2GHz Core 2 Duo to instantiate the Maxwell Assembler for the first time. That really kills our performance numbers for benchmarks with short runtimes.

As Michael suggested, we could try to strip out the 64,500 or so methods of the RawAssembler we don't use and keep Maxwell, but I never much liked the one-method-per-instruction-and-format interface, and I will try to explain why.

The x86 architecture is crazy irregular, but it still has some significant regularity to it that can be exploited by a backend to reduce code duplication in the code generator itself. For example, our current backend has to implement the following idiom repeatedly for all possible arithmetic operations:

if (is-immediate(right(x)))
    if (is-8bit-value(immediate(right(x))))
         emit(register(left(x)), (byte) immediate(right(x)))
    else
         emit(register(left(x)), immediate(right(x)))
else
    if (is-register(left(x)))
        if (is-register(right(x)))
            emit(register(left(x)), register(right(x)))
        else
            if (is-8bit-value(offset(right(x)))) 
                emit(register(left(x)), RBP, (byte)offset(right(x)))
            else 
                emit(register(left(x)), RBP, offset(right(x)))
    else
        …. 

Actually, in reality the idiom is even more complicated, because the left side is not always a register; for each case we also have to check whether it's in memory and whether the offset fits into 8 bits. I also left out the second part of the last if cascade (if the left side is in memory). Anyway, the point is that using Maxwell requires huge, ugly nested if cascades to select the proper matching method to emit the instruction with. And since each instruction has its own separate methods for that, this code has to be duplicated for each IR instruction (e.g. ADD and SUB).

This entire exercise is particularly pointless because at the machine level all these address encodings are expressed using the same ModRM/SIB encoding. So instead of generating thousands of individual methods for them, I propose using a high-level interface that delays the decision of which operands to use (memory or register) until after we have picked which instruction we want to emit. We do this using an Operand interface:

public interface Operand {
    public OperandType getOperandType();
    public Address getAddress();
    public Register getRegister();
}
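
For illustration, here is one way a register enum could satisfy this interface so that registers can be passed to the emitters directly (a sketch under the assumption that Register is an interface the enum can also implement; the real classes likely differ):

public enum GeneralRegister implements Operand, Register {
    RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
    R8, R9, R10, R11, R12, R13, R14, R15;

    public OperandType getOperandType() { return OperandType.Register; }
    public Register getRegister() { return this; } /* a register is its own register operand */
    public Address getAddress() {
        throw new UnsupportedOperationException("register operands have no address");
    }
}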

The emit functions can now take generic arguments and decide what to do with them:

void emit(CodeBuffer code, int rm_r, int r_rm, DataSize size, Operand left, Operand right) {
    if (right.getOperandType() == OperandType.Register)
        /* right is a register: use the "r/m, r" opcode form, left becomes the r/m operand */
        emitModRMSIBFormat(code, rm_r, size, left, right.getRegister());
    else {
        /* otherwise left must be the register side: use the "r, r/m" opcode form */
        assert left.getOperandType() == OperandType.Register;
        emitModRMSIBFormat(code, r_rm, size, right, left.getRegister());
    }
}
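
To give an idea of what happens underneath, here is a deliberately simplified sketch of an emitModRMSIBFormat-style helper. This is not our actual implementation: the encoding() accessor, the Address getters and the emit8/emit32 CodeBuffer methods are assumed names, and REX prefixes, operand-size handling, the SIB byte (needed e.g. for RSP/R12 bases or scaled indices) and RIP-relative addressing are all omitted.

void emitModRMSIBFormat(CodeBuffer code, int opcode, DataSize size, Operand rm, Register reg) {
    /* size: operand-size prefix / REX.W handling is omitted in this sketch */
    code.emit8(opcode);
    if (rm.getOperandType() == OperandType.Register) {
        /* mod = 11: register-direct operand */
        code.emit8(0xC0 | (reg.encoding() << 3) | rm.getRegister().encoding());
    } else {
        Address addr = rm.getAddress();
        int disp = addr.getDisplacement();
        if (disp >= -128 && disp <= 127) {
            /* mod = 01: [base + disp8] */
            code.emit8(0x40 | (reg.encoding() << 3) | addr.getBase().encoding());
            code.emit8(disp);
        } else {
            /* mod = 10: [base + disp32] */
            code.emit8(0x80 | (reg.encoding() << 3) | addr.getBase().encoding());
            code.emit32(disp);
        }
    }
}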

Two things are noteworthy here. First, this one emitter function covers the same functionality as several thousand Maxwell methods, because it can deal with all combinations of Operands (except immediates, which we encode separately; I'll explain why in a minute). Second, and this is the downside, we are no longer as strictly typed as Maxwell. It's possible to pass two operands to this method that are both addresses, at which point an assertion is raised. Maxwell would have caught this at compile time.

With these unified emitter functions we can now assemble instructions from templates:

ALU_Template ADD  = new ALU_Template(0x05, 0x81, 0x83, 0, 0x01, 0x03);
ALU_Template SUB  = new ALU_Template(0x2D, 0x81, 0x83, 5, 0x29, 0x2B);
ALU_Template AND  = new ALU_Template(0x25, 0x81, 0x83, 4, 0x21, 0x23);
ALU_Template OR   = new ALU_Template(0x0D, 0x81, 0x83, 1, 0x09, 0x0B);
ALU_Template XOR  = new ALU_Template(0x35, 0x81, 0x83, 6, 0x31, 0x33);
ALU_Template CMP  = new ALU_Template(0x3D, 0x81, 0x83, 7, 0x39, 0x3B);
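
For reference, here is my reading of what those six constructor arguments encode, matched against the x86 opcode tables; the parameter names are made up and the real class certainly carries more logic:

class ALU_Template {
    final int accImmOpcode; /* e.g. 0x05 for ADD: short form "op eAX, imm32" */
    final int imm32Opcode;  /* 0x81: group-1 "op r/m, imm32" */
    final int imm8Opcode;   /* 0x83: group-1 "op r/m, imm8" (sign-extended) */
    final int modrmDigit;   /* the /digit selecting the operation within the 0x81/0x83 group */
    final int rm_r;         /* e.g. 0x01 for ADD: "op r/m, r" */
    final int r_rm;         /* e.g. 0x03 for ADD: "op r, r/m" */

    ALU_Template(int accImmOpcode, int imm32Opcode, int imm8Opcode,
                 int modrmDigit, int rm_r, int r_rm) {
        this.accImmOpcode = accImmOpcode;
        this.imm32Opcode = imm32Opcode;
        this.imm8Opcode = imm8Opcode;
        this.modrmDigit = modrmDigit;
        this.rm_r = rm_r;
        this.r_rm = r_rm;
    }
    /* emit(...) would dispatch to the shared emitter above for register/memory operands
       and to the 0x81/0x83 forms for immediates, picking imm8 whenever the value fits. */
}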

For added convenience, our register enums (GeneralRegister and XMMRegister) also implement the Operand interface and of course always indicate that they are register operands. This way we can pass registers directly to the emitter functions, without having to box them into an Operand object. Here are a couple of examples of how our assembler can be used:

CodeBuffer code = new IndirectCodeBuffer(0x1000, 2048);
ADD.emit(code, DataSize.LONG, RAX, 0x1234);
ADD.emit(code, DataSize.INT, new Address(FS, RBP, R12, SCALE_4, -50), RBP);
Label label2 = new Label();
code.setLabel(label2);
Label label = new Label();
ADD.emit(code, DataSize.LONG, new Address(label), RAX);
JC.emit(code, Condition.NP, BranchDistance.BYTE, label2); /* short branch */
MOV.emit(code, DataSize.LONG, R12, 0x1234567822345678L);
MOV.emit(code, DataSize.INT, new Address(RSP), 0x12345678);
PUSH.emit(code, 0x12);
PUSH.emit(code, 0x1234);
PUSH.emit(code, 0x12345678);
Label label3 = new Label();
code.setLabel(label3);
SHL.emit(code, DataSize.LONG, RAX, 1);
/* if virtual registers implement the Operand interface they can be used directly */
VirtualRegister vr1 = new VirtualRegister(GeneralRegister.RCX);
VirtualRegister vr2 = new VirtualRegister(-40); /* [RSP-40] */
SUB.emit(code, DataSize.LONG, vr1, vr2); /* SUB RCX, [RSP-40] */
JMP.emit(code, label3);
JMP.emit(code, vr2);
CALL.emit(code, label3);
CALL.emit(code, vr2);
code.setLabel(label); /* labels are generated like in Maxwell */
code.emit64(0);
code.resolve();

While we can now pass virtual register objects directly to the assembler, the backend still has to understand and compensate for irregularities such as not being able to assemble an instruction with two addresses as operands. Trying to hide this from the backend would be pointless, since decisions about how to deal with it have to be made at a higher level than the assembler anyway.

But as you can see, it's a fairly concise interface that generates highly compact code (optimal encoding of 8/32-bit immediates and offsets), is almost as safe to use as Maxwell, and all that at the expense of about 800 lines of Java code.

Last but not least, our assembler is also very fast, and it can write directly to memory instead of to a byte stream (DirectCodeBuffer). We also split the CodeBuffer from the Assembler itself, so we don’t even have to instantiate the Assembler again just because we want to emit code to different code buffers.

As a final word of defense for Maxwell, it would be possible to implement such an interface on top of Maxwell’s million-method API (and Bernd did suggest this), but at least for our system that doesn’t solve the dynamic loading issue, so I think replacing Maxwell is the way to go.

We will keep the Maxwell disassembler for the time being though, because its output is beautiful and its slowness is irrelevant there. I will eventually add a fast disassembler to the code base as well, because we need to perform some static analysis on machine code and Maxwell's disassembler is unfortunately way too slow for that too.

Tree Folding

At our last meeting with the SpiderMonkey developer team up at Mozilla, Edwin Smith from Adobe gave a presentation about pathological cases for trace trees. Amongst other things, Ed talked about the massive tree that gets recorded for John Conway's Game of Life algorithm. It essentially consists of a nested loop over a matrix that counts, for each cell, the number of neighboring cells that are alive. The trivial implementation of this is a series of consecutive if statements inside the loop body, each incrementing a counter if the cell coordinates are within range and the cell is alive. To address this issue, we have now added a new tree transformation to our trace tree compiler: tree folding. The idea of tree folding is to reconnect traces that split off another trace but only have a minimal difference with respect to the code executed. A simple example for this is calculating the minimum of two values:

for (n = 0; n < 5000; ++n) a = min(b,c)

The trace tree for this loop consists of two traces: one primary trace that covers one side of the min condition (i.e. b < c), and a secondary trace that is recorded whenever b >= c. The goal of tree folding is to eliminate the need to have two traces in this scenario. While two traces seem reasonable here, just imagine using several min functions inside a loop body. The tree would fan out quickly and contain hundreds of traces.

The tree folding algorithm is implemented as a short compiler pipeline that gets invoked every time we add a trace to the tree. The tree is serialized into a sequence of instructions (read about tree serialization here), which is then passed through the IdentifySpliceSites and TreeFolding filters. At the end of this pipeline sits a TreeBuilder that re-constitutes a trace tree, which is substituted for the original tree if we actually folded traces in it. To identify sliceable trace parts (strands), we look through the instruction sequence and scan for guards that a secondary trace is attached to. To be sliceable, both traces (primary and attached secondary) must be followed by the same merge point, which is a guard or the end of the loop. Between the split point (where we will start merging) and the merge point (the common guard we expect both traces to hit) we can execute arbitrary instructions as long as they don't change the memory state (STORE, SynchronizationInstruction, AllocationInstruction). When we find such a guard, we mark it as a split point and the next pass will eliminate it and replace it with conditional move instructions.

The reason that we can only merge traces at a second "common" guard is that we need to compare the state of all variables on both traces at the merge point. Guards contain a complete snapshot of the variable state (an InterpreterState object), because we need to restore the VM state in case of a side exit on that guard. When merging the two strands, we emit the code from both strands into the primary trace, compare the states in the two merge guards, and for every difference between them we issue a conditional move instruction at the end of the code of both strands to merge the states into one state. For the remainder of the primary trace we have to replace all references to the "left" (primary) strand with a reference to the appropriate conditional move, since we now execute both strands in parallel and the right side could be the valid strand. The conditional moves use the same condition as the guard to pick which values to copy. The guard instruction in the primary trace is completely eliminated, and so is the secondary trace attached to it and any trace attached to that. We call this tree pruning.

It's important to note that we might prune a code path from the tree that is then no longer represented in the tree and will have to be re-recorded. Consider the Game of Life code. It's a series of if statements, and when we merge the first if statement, the secondary trace will have a long code path for the additional 7 ifs in it. We only copy the code for the first if into the primary trace (until the merge guard); the rest of the trace is thrown away and will be re-recorded when we side exit on the 2nd remaining if. Over 8 iterations eventually all the if statements are inlined into the primary trace.

The advantage of this approach is that we can do hierarchical inlining. If an if statement contains another if statement, the inner if statement is folded first and becomes a series of conditional moves, which can then in a second pass be folded into the outer if statement. This is particularly important because complex if clauses are essentially a series of nested ifs and can thus be folded using the algorithm described above. For this iterative folding, the folding pipeline has to be re-executed as long as the tree changes. In most cases, at most 2 iterations are necessary to obtain a tree that cannot be further simplified.

For the details of the algorithm take a look at the source code of our compiler, which is available from our Hotpath Project website.
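
Before you dive into the source, here is the min example from above written out as plain Java rather than as IR; this only illustrates the shape of the folded trace, it is not compiler code:

int foldedMinLoopBody(int a, int b, int c) {
    int left  = b;                /* strand of the primary trace (recorded for b < c) */
    int right = c;                /* strand of the former secondary trace (b >= c) */
    a = (b < c) ? left : right;   /* the conditional move that replaced the guard */
    return a;
}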

Conditional Moves in the IR

Today we added conditional move instructions (CMOVs) to the intermediate representation (IR). They are closely related to GUARDs, but carry two additional arguments that represent the two values the CMOV will pick from. MIN, MAX, and ABS were previously modeled as separate, explicit instructions in the IR. These are now expressed through CMOV:

  • MIN(a, b) → CMOV LE, a, b, a, b
  • MAX(a, b) → CMOV GE, a, b, a, b
  • ABS(a)    → n = NEG(a); CMOV GE, a, n, a, n
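
To spell out the semantics, here is the same mapping as plain Java (using a hypothetical cmov helper, not our IR classes):

static int cmov(boolean condition, int ifTrue, int ifFalse) { return condition ? ifTrue : ifFalse; }

static int min(int a, int b) { return cmov(a <= b, a, b); }              /* CMOV LE, a, b, a, b */
static int max(int a, int b) { return cmov(a >= b, a, b); }              /* CMOV GE, a, b, a, b */
static int abs(int a)        { int n = -a; return cmov(a >= n, a, n); }  /* n = NEG(a); CMOV GE, a, n, a, n */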

In the backend, CMOVs are implemented using machine-level conditional moves; for floating-point MIN/MAX we additionally match the form CMOV x, a, b, c, d with (x == LE) and (a == c) and (b == d) and emit it as an x86 MINSS instruction.

More Precise Load Propagation

We are currently formalizing our escape analysis algorithm for trace trees with the help of Christian Stork. As part of this work Chris observed that we are overly pessimistic when deciding which loads we can eliminate through load propagation. So far we first ran an escape analysis to find out which allocations are contained (are not leaked), and then performed load propagation only for loads from these local allocations.

This is sub-optimal for several reasons. First of all, the escape analysis pass has to detect two scenarios that permit values to escape our loop scope: on the one hand we have to track stores to non-captured memory (memory escape), and on the other we have to flag values escaping into future loop iterations (loop escape). Both scenarios cause the allocation to be flagged as escaped. However, only the memory escape situation actually inhibits load propagation. While loop-escaping allocations cannot be optimized through allocation hoisting, there is no reason not to perform load propagation on them.

Secondly, and independently of the above observation, there is also no need to stop load propagation altogether just because a reference escapes to memory somewhere in the loop. Due to the tree shape of our intermediate representation we can actually keep performing load propagation until we see a reference escaping to memory, at which point we have to inhibit any further load propagation for this reference or any reference stored in the associated object (and recursively all references stored there). This essentially kills load propagation for the rest of this trace, but any trace "higher up" in the tree is not affected.

It's important to kill not only the current reference, but any other allocation that can escape through it as well. For this, every time we see a load we explicitly check whether the base reference is another load and follow this chain of loads all the way to the source. If any of these hops is an escaped object, we cannot perform load propagation. Previously we were able to cache this lookup; the slightly worse performance is well worth it in this case though.

In addition to being more precise, the new load propagation pass is also no longer dependent on escape analysis, and we can move it to a much earlier point in the compilation pipeline. During escape analysis almost all loads are now already eliminated, which also increases the quality of the escape analysis results.
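
A minimal sketch of that chain walk, with hypothetical stand-ins for our IR classes (Instruction, LoadInstruction, the escapedToMemory set are assumed names):

boolean mayPropagate(LoadInstruction load, Set<Instruction> escapedToMemory) {
    Instruction base = load.getBase();
    while (base instanceof LoadInstruction) {
        if (escapedToMemory.contains(base)) return false;   /* an intermediate reference escaped */
        base = ((LoadInstruction) base).getBase();           /* hop to the next base reference */
    }
    return !escapedToMemory.contains(base);                  /* finally check the source itself */
}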

Improvements to the Singleton Analysis Pass

Mason found a problem with the Singleton Analysis pass: objects that were allocated and then written to an array were not correctly flagged as non-singleton. Besides this rather obvious bug, I also fixed the way context slots are handled. When trying to write up a formal proof of the algorithm for the PLDI paper I realized that a non-eliminated load from a context kills any and all allocations that flow into that context. Previously I was checking at the end of each trace whether allocations flow into a context and killed the allocation there. However, this is not sufficient. Instead, we have to collect for each context all allocations flowing into it along the loop edge and then kill all of those allocations if we observe a load from the context slot. To do this in linear time we collect a set of allocations flowing into each context and a list of loads that touch contexts, and then resolve and kill allocations after we have seen the entire tree.
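
A hedged sketch of that resolution step, again with hypothetical class names (Allocation, ContextSlot) rather than the real ones:

Set<Allocation> resolveKills(Map<ContextSlot, Set<Allocation>> flowsIntoSlot,
                             List<ContextSlot> slotsWithLoads) {
    Set<Allocation> killed = new HashSet<>();
    for (ContextSlot slot : slotsWithLoads)
        killed.addAll(flowsIntoSlot.getOrDefault(slot, Collections.emptySet()));
    return killed;  /* these allocations lose their singleton property */
}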

The reason that a load kills the singleton property is that it might see a previous state instead of a freshly initialized value (i.e. 0/0.0/null). The LoadElimination phase eliminates almost all same-iteration reads from newly allocated objects, so as long as an object is only used within the current iteration this will not prevent the allocation from being detected as singleton. Only reading from it in the next iteration prevents this (because LoadElimination doesn't try to eliminate loads from context slots, since it doesn't track which allocations flow into context slots).

Singleton Allocation Analysis

Today I finished the Singleton Allocation Analysis phase, which was the last missing building block to enable Invariant Allocation Hoisting (moving NEWs out of loops, which is our equivalent of escape-analysis-based stack allocation). The SAA pass goes over the tree in forward order and builds an interference graph between context slots and allocation instructions. For this we track all live allocation instructions in a set and add that set to the interference set of a context slot every time it is read. The resulting interference set denotes, for each slot, the allocation instructions that context slot interferes with.
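
A hedged sketch of that construction, with hypothetical IR classes (Instruction, Allocation, ContextRead, ContextSlot are stand-ins): walk the serialized tree in forward order, keep the set of currently live allocation instructions, and whenever a context slot is read, add that whole set to the slot's interference set.

Map<ContextSlot, Set<Allocation>> buildInterference(List<Instruction> tree) {
    Set<Allocation> live = new HashSet<>();
    Map<ContextSlot, Set<Allocation>> interference = new HashMap<>();
    for (Instruction insn : tree) {
        if (insn instanceof Allocation)
            live.add((Allocation) insn);              /* a new instance becomes live */
        else if (insn instanceof ContextRead)
            interference.computeIfAbsent(((ContextRead) insn).getSlot(), s -> new HashSet<>())
                        .addAll(live);                /* the slot interferes with all live allocations */
    }
    return interference;
}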

To decide which allocation is singleton (only one instance of it lives in the loop) we have to check two properties. First, if an allocation flows back along the backedge, the context slot it flows into must not flow into another context slot along the backedge in the subsequent iteration. This ensures that we don't "carry over" instances into future loop iterations. In a second step we then check, for each context slot that an allocation flows into along the backedge, whether that context slot interferes with said allocation (using the interference graph above). If there is an interference, it means that the newly allocated object and the object allocated in the previous iteration are alive at the same time, which precludes the allocation site from being singleton.
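
Put together, the decision might look roughly like this (a sketch only; the interference map comes from the construction above, and slotFlowsAlongBackedge is a hypothetical map describing which context slots flow into other slots along the backedge):

boolean isSingleton(Allocation alloc, ContextSlot slot,
                    Map<ContextSlot, Set<Allocation>> interference,
                    Map<ContextSlot, Set<ContextSlot>> slotFlowsAlongBackedge) {
    /* 1) the slot must not carry the instance over into yet another slot in the next iteration */
    if (!slotFlowsAlongBackedge.getOrDefault(slot, Collections.emptySet()).isEmpty()) return false;
    /* 2) the slot the allocation flows into must not interfere with that same allocation */
    return !interference.getOrDefault(slot, Collections.emptySet()).contains(alloc);
}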

Allocations can be moved out of the loop if they are singleton and captured. I updated the Invariant Code Motion phase to do so accordingly. This should significantly speed up our JavaScript JIT which dynamically allocates a ton of small objects inside loops.

sun.misc.Unsafe fixed in J9

John Duimovich from IBM contacted me shortly after I posted here about the bug we found in J9's implementation of the sun.misc.Unsafe interface. We were able to narrow down the issue to memory accesses through sun.misc.Unsafe with an odd offset and provided this description to John the same night. The next morning John and the J9 team identified and fixed the problem (a wrong code path was taken based on the least significant bit). It took IBM less than 24h to fix this issue, measured from the time I mentioned it on my blog. That's pretty impressive, I would say :)

John also reported a lively discussion amongst the J9 team about whether using sun.misc.Unsafe is a good idea in the first place. The main reason for us to use it was that we independently optimize null, type, and bounds checks, and we wanted the HostVM's JIT not to emit these checks when our optimizer had already ensured them. Both Hotspot and J9 breaking on the generated code forced us to reconsider. We are now using regular xALOAD/xASTORE and GETFIELD/PUTFIELD/GETSTATIC/PUTSTATIC instructions to access memory instead of sun.misc.Unsafe. This unfortunately incurs quite some overhead. If we are able to obtain a developer snapshot of J9 that contains the fix described above, we might be able to benchmark with J9. In parallel we are trying to nail down the issue in Hotspot. If neither option is available by the deadline we will go with the pure bytecode backend, which is not ideal but better than nothing.