Archive for August, 2007

sun.misc.Unsafe fixed in J9

August 30, 2007

John Duimovich from IBM contacted me shortly after I posted here about the bug we found in J9’s implementation of the sun.misc.Unsafe interface. We were able to narrow down the issue to memory accesses though sun.misc.Unsafe with an odd offset and provided this description to John the same night. The next morning John and the J9 team identified and fixed the problem (a wrong code path was taken based on the least significant bit). It took IBM less than 24h to fix this issue, measured from the time I mentioned it on my blog. Thats pretty impressive I would say :)

John also reported of a vivid discussion amongst the J9 team whether using sun.misc.Unsafe is a good idea in the first place. The main reason for us to use it was that we independently optimize null, type and bounds checks and we wanted the HostVM’s JIT to not emit these checks if our optimizer already ensured them. Both, Hotspot and J9, breaking on the generated code forced us to reconsider. We are now using regular xALOAD/xASTORE and GET/PUT/FIELD/STATIC instructions to access memory instead of sun.misc.Unsafe. This incurs quite some overhead, unfortunately. If we are able to obtain a developer snapshot of J9 that contains the fix described above, we might be able to benchmark with J9. In parallel we are trying to nail down the issue in Hotspot. If neither option is available by the deadline we will go with the pure bytecode backend, which is not ideal but better than nothing.

sun.misc.Unsafe broken in Hotspot (and J9)

August 28, 2007

We went through a great deal of pain trying to get sun.misc.Unsafe to work properly in our VM. We use it extensively for the interpretative backends (JPL and BPL). It turns out that the C2 compiler (server JIT) of Hotspot generates incorrect machine code for sun.misc.Unsafe.putByte in certain conditions. We searched the Hotspot bug database and couldn’t find any entries on this issue yet. Bernd offered to put us in touch with someone from the C2 engineering team if we can reduce the problem enough to post it to the bug database. Right now the issue is reproducible by running a certain test case through our VM. To file a report we have to find out under what conditions exactly C2 fails.

We have observed the same bug on our main platform (Apple’s Hotspot 6.0preview and 5.0 ports), but also on Linux (Sun’s Hotspot 6.0). For the latter 32-bit and 64-bit x86 are affected. IBM’s J9 VM that ships with 5.0 and 6.0earlyaccess does not break, which increases our confidence that our use of sun.misc.Unsafe is correct, and the bug is in C2. On the downside, even J9 seems to break in certain cases (i.e. accessing a large array via sun.misc.Unsafe). To make a long story short, we had to part with the idea of using sun.misc.Unsafe. It works in the Hotspot interpreter and C1, but is pretty slow. Its broken in Hotspot C2, which is the only JIT that is reasonably fast when it comes to sun.misc.Unsafe. Even IBM’s J9 is pretty slow since it probably doesn’t support sun.misc.Unsafe as builtin intrensics.

We have revamped our backend to use xALOAD and GET/PUT/FIELD/STATIC for memory access. This does introduce some extra overhead since null and typechecks are now done redundantly, once by our guard, and then again by the memory access operation.

While rewriting the backend we also stumbled over a gigantic security loophole. By injecting a public version of sun.reflect.MagicAccessorImpl into the classpath before the standard library JDK jars we are able to selectively disable the verifier for certain classes. We don’t think that this works for applets since class loading is controlled more thoroughly in that environment.

Trace-wide Register Caching and Long Integers on x86

August 19, 2007

With its tiny register set of 7 32-bit integer registers, long integers are particularily difficult to implement efficiently on the x86. I tried a couple different implementations and settled for the following approach for now: A first level register allocator assigns virtual registers to instructions in the trace as we go bottom up to the root in reverse recording order. Each virtual register corresponds to an 8-byte cell on the stack addressed via ESP. Virtual registers are cached in physical registers in a round-robin scheme. An age counter is used to determine the oldest register when we need to spill one. This simplifies the code generator logic since there will be no competition for recently allocated registers (except in case of explicit spilling of specific registers, in which case a recently allocated register might be spilled again).
Just as we ensure that virtual registers match across trace boundaries, we have to do the same for the register cache that maps virtual registers to physical registers. For this I use a preference mechanism. If a virtual register needs a physical register and it has a preference register, we try to put it back into that register. Such a preference can be caused by the instruction form (i.e. a division always produces the result in EAX or EDX), or it can be that an instruction and the corresponding virtual register was assigned a particular register in the other trace. In this case we try to get it back to the same register by the time we arrive at the merge point. If that failed we need to spill misaligned virtual/physical register pairs, and they are reloaded into the right register.
For longs I settled on 3 sets of explicit register pairs (EAX:EDX, EBX:EBP, ESI:EDI) instead of arbitrary register pairs. This smaller set of registers reduces the number of potential combinations and increases the chance that we will manage to end up with the same cache configuration at merge points, reducing the amount of spill code. Also, the observant reader will have noticed that ECX is absent from these sets. This was chosen to aid code generation for 64-bit shifts, which need the shift amount to be in ECX. By never using ECX for long integers ECX can always safely be spilled without clobbering the 64-bit value itself, which would force a reload (we can’t operate on memory results in our compiler as a code generation simplification).

x86 backend

August 17, 2007

We have started working on a x86 backend for our compiler. Initially the backend will emit 32-bit x86 code, because our default VM (Apple’s Java 5 VM) is 32-bit only. Even the preview of Java 6 seems to be 32-bit only on Mac OSX, which is pretty disappointing for a supposedly 64-bit OS. We will switch over to 64-bit generation once we found a better host VM. We are still trying to figure out the best approach to do register allocation on traces with the very limited set of physical registers available on x86 (especially 32-bit). Our current backend assigns a virtual register to each instruction the trace using a simple virtual register allocator. On top of this first level allocation scheme we run a second register allocator that keeps up to 7 (EAX, ECX, EDX, EBX, EBP, ESI, EDI) virtual registers in physical registers. As for an x86 assembler, after initial attempts to roll our own we decided to use the Maxwell Assembler System from Sun Research. I don’t like the one-function-one-instruction-format API of Maxwell, but its a good start and since its open source maybe we will end up re-writing it to fit our needs.