Assembler 2.0

Our backend is currently using the Maxwell Assembler framework. Its a very neat system that generates an assembler and disassembler for a number of platforms (IA32 and AMD64/IA32E, SPARC, etc). Maxwell uses a generative approach and uses a built-in description to generate a huge (700k+ bytecode) class file that contains one method for each instruction in all its possible formats. To assembler instructions the backend then invokes one of those 65,000+ methods to emit the corresponding machine instruction.

This approach has a number of advantages. First of all, Maxwell Assembler is very strictly typed. Its almost impossible to assembler an instruction with incorrect register parameters since registers are typed by register type (GeneralRegister vs XMMRegister) but also by use (IndirectRegister vs vanilla GeneralRegister). This is nice and avoids a lot of issues with mixed-up order of arguments (i.e. what do I specify first, the base register or the index register?).

Maxwell Assembler’s approach is also pretty fast since each method is highly specialized and basically just shuffles and shifts bits around and emits them to a stream.

Anyway, I think its time to get rid of it and upgrade to something that works better in our specific environment. Why?

Well, two reasons: Maxwell is HUGE, and DOG SLOW. At the first glimpse, the latter slightly contradicts what I wrote two sentence ago, but its really a combination of the former (Maxwell is a ton of code, 2MB compressed bytecode) and the fact that we use it in a dynamically setting. The Maxwell/Maxine system Maxwell Assembler was designed for is a statically precompiled virtual machine, and it doesn’t have to dynamically load and link the assembler framework. We do since our VM runs on top of Hotspot. And thats where we fell out of love with Maxwell. It takes a good 750ms on a 2GHz Core 2 Duo to instantiate the Maxwell Assembler for the first time. That really kills our performance numbers for benchmarks with short runtime.

As Michael suggested we could try to strip out the 64500 or so methods of the RawAssembler we don’t use and keep Maxwell, but I never much liked the one method per instruction and format interface and I will try to explain why.

The x86 architecture is crazy irregular, but it still has some significant regularity to it that can be exploited by a backend to reduce code duplication in the code generator itself. For example, our current backend has to implement the following idiom repeatedly for all possible arithmetic operations:

if (is-immediate(right(x)))
    if (is-8bit-value(immediate(right(x))))
         emit(register(left(x)), (byte) immediate(right(x))))
    else
         emit(register(left(x)), immediate(right(x))))
else
    if (is-register(left(x)))
        if (is-register(right(x)))
            emit(register(left(x)), register(right(x)))
        else
            if (is-8bit-value(offset(right(x)))) 
                emit(register(left(x)), RBP, (byte)offset(right(x)))
            else 
                emit(register(left(x)), RBP, offset(right(x)))
    else
        …. 

Actually, in reality the idiom is even more complicated because the left side is not always a register. Instead for each case we have to check whether its in memory and whether the offset fits into 8 bits. I also left out the second part of the last if cascade (if the left side is in memory). Anyway, the point is that using Maxwell requires huge ugly nested if cascades to select the proper matching method to emit the instruction with. And since each instruction has its separate methods for that, this code has to be duplicated for each IR instruction (i.e. ADD and SUB).

This entire exercise is particularly pointless because at the machine level all this address encodings are expressed using the same ModRM/SIB encoding. So instead of generating thousands of individual methods for them, I propose using a high-level interface instead that delays the decision what operands to use (memory or register) until after we pick what instruction we want to have. We do this using an Operand interface:

public interface Operand {
    public OperandType getOperandType();
    public Address getAddress();
    public Register getRegister();
}

The emit functions now can take generic arguments and decide what to do with them:

void emit(CodeBuffer code, int rm_r, int r_rm, DataSize size, Operand left, Operand right) {
    if (right.getOperandType() == OperandType.Register)
        emitModRMSIBFormat(code, rm_r, size, left, right.getRegister());
    else {
        assert left.getOperandType() == OperandType.Register;
        emitModRMSIBFormat(code, r_rm, size, right, left.getRegister());
    }
}

Two things are noteworthy here. First, this one emitter function covers the same functionality as several thousand Maxwell methods, because we can deal with all combinations of Operands (except immediate, those we encode separately and I explain in a minute why). Second, and this is the downside, we are no longer as strictly typed as Maxwell. Its possible to pass two operands to this method that both are addresses, at which point an assertion is raised. Maxwell would have caught this at compile time.
With these unified emitter functions we can now assembler instructions from templates:

ALU_Template ADD  = new ALU_Template(0x05, 0x81, 0x83, 0, 0x01, 0x03);
ALU_Template SUB  = new ALU_Template(0x2D, 0x81, 0x83, 5, 0x29, 0x2B);
ALU_Template AND  = new ALU_Template(0x25, 0x81, 0x83, 4, 0x21, 0x23);
ALU_Template OR   = new ALU_Template(0x0D, 0x81, 0x83, 1, 0x09, 0x0B);
ALU_Template XOR  = new ALU_Template(0x35, 0x81, 0x83, 6, 0x31, 0x33);
ALU_Template CMP  = new ALU_Template(0x3D, 0x81, 0x83, 7, 0x39, 0x3B);

For added convenience, our register enums (GeneralRegister and XMMRegister) also implement the Operand interface and of course always indicate that they are register operands. This way we can directly pass registers to the emitter functions, without having to box them into an Operand object. Here are a couple of examples on how our assembler can be used:

CodeBuffer code = new IndirectCodeBuffer(0x1000, 2048);
ADD.emit(code, DataSize.LONG, RAX, 0x1234);
ADD.emit(code, DataSize.INTnew Address(FS, RBP, R12, SCALE_4, -50), RBP);
Label label2 = new Label();
code.setLabel(label2);
Label label = new Label();
ADD.emit(code, DataSize.LONG, new Address(label), RAX);
JC.emit(code, Condition.NP, BranchDistance.BYTE, label2); /* short branch */
MOV.emit(code, DataSize.LONG, R12, 0x1234567822345678L);
MOV.emit(code, DataSize.INT, new Address(RSP), 0x12345678);
PUSH.emit(code, 0x12);
PUSH.emit(code, 0x1234);
PUSH.emit(code, 0x12345678);
Label label3 = new Label();
code.setLabel(label3);
SHL.emit(code, DataSize.LONG, RAX, 1);
/* if virtual registers implement the Operand interface they can be used directly */
VirtualRegister vr1 = new VirtualRegister(GeneralRegister.RCX);
VirtualRegister vr2 = new VirtualRegister(-40); /* [RSP-40] */
SUB.emit(code, DataSize.LONG, vr1, vr2); /* SUB RCX, [RSP-40] */
JMP.emit(code, label3);
JMP.emit(code, vr2);
CALL.emit(code, label3);
CALL.emit(code, vr2);
code.setLabel(label); /* labels are generate like in maxwell */
code.emit64(0);
code.resolve();

While we can directly pass virtual register objects to the assembler now, the backend still has to understand and compensate for irregularities such as not being able to assemble with two addresses as operands. Trying to hide this from the backend would be pointless since high-level decisions have to be made how to deal with it (higher level than an assembler anyway).

But as you can see its a fairly concise interface that generates highly compact code (optimal encoding of 8/32bit immediates and offsets), is almost as safe to use as Maxwell, and all that at an expense of about 800 lines of Java code.

Last but not least, our assembler is also very fast, and it can write directly to memory instead of to a byte stream (DirectCodeBuffer). We also split the CodeBuffer from the Assembler itself, so we don’t even have to instantiate the Assembler again just because we want to emit code to different code buffers.

As a final word of defense for Maxwell, it would be possible to implement such an interface on top of Maxwell’s million-method API (and Bernd did suggest this), but at least for our system that doesn’t solve the dynamic loading issue, so I think replacing Maxwell is the way to go.

We will keep the Maxwell disassembler for the time being though, because its output is beautiful and its slowness irrelevant. I will add a fast disassembler to the code base eventually though because we need to perform some static analysis on machine code and Maxwell’s disassembler is way too slow for that as well (unfortunately).