Trace-wide Register Caching and Long Integers on x86

With its tiny register set of 7 32-bit integer registers, long integers are particularily difficult to implement efficiently on the x86. I tried a couple different implementations and settled for the following approach for now: A first level register allocator assigns virtual registers to instructions in the trace as we go bottom up to the root in reverse recording order. Each virtual register corresponds to an 8-byte cell on the stack addressed via ESP. Virtual registers are cached in physical registers in a round-robin scheme. An age counter is used to determine the oldest register when we need to spill one. This simplifies the code generator logic since there will be no competition for recently allocated registers (except in case of explicit spilling of specific registers, in which case a recently allocated register might be spilled again).
Just as we ensure that virtual registers match across trace boundaries, we have to do the same for the register cache that maps virtual registers to physical registers. For this I use a preference mechanism. If a virtual register needs a physical register and it has a preference register, we try to put it back into that register. Such a preference can be caused by the instruction form (i.e. a division always produces the result in EAX or EDX), or it can be that an instruction and the corresponding virtual register was assigned a particular register in the other trace. In this case we try to get it back to the same register by the time we arrive at the merge point. If that failed we need to spill misaligned virtual/physical register pairs, and they are reloaded into the right register.
For longs I settled on 3 sets of explicit register pairs (EAX:EDX, EBX:EBP, ESI:EDI) instead of arbitrary register pairs. This smaller set of registers reduces the number of potential combinations and increases the chance that we will manage to end up with the same cache configuration at merge points, reducing the amount of spill code. Also, the observant reader will have noticed that ECX is absent from these sets. This was chosen to aid code generation for 64-bit shifts, which need the shift amount to be in ECX. By never using ECX for long integers ECX can always safely be spilled without clobbering the 64-bit value itself, which would force a reload (we can’t operate on memory results in our compiler as a code generation simplification).