A tale of three code generators

Posted 20 Feb 2009 at 17:00 UTC (updated 21 Feb 2009 at 10:51 UTC) by aph Share This

At FOSDEM 2009, Gary Benson of Red Hat presented Shark. (Slides at http://gbenson.net/wp-content/uploads/2009/02/fosdem-2009.pdf) Shark is a port of OpenJDK that uses LLVM to do JIT code generation. While Shark is pretty fast when compared with OpenJDK's C++ interpreter, it's still quite a lot slower than gcj. gcj is a fairly straightforward bytecode->native compiler and doesn't use many of the Java-specific optimizations in HotSpot, so I was of the opinion that Shark and gcj ought to be similar in speed. So, I wanted to find out why Shark was slower than gcj.

There are two approaches to studying optimization: the large and the small. The large involves benchmarking and profiling substantial blocks of code, often entire applications. The small involves much smaller chunks of code, inspecting them one instruction at a time to discover what's going on. One has to be careful, of course, not to read too much into a single example, but it's often instructive.

This is an exercise in the small. The very first method compiled by HotSpot when the VM starts is often String.hashCode(). The speed of hashCode() seems to follow the same pattern as the benchmarks Gary Benson presented with repect to speed: Shark is slowest, then gcj, then HotSpot. Also, hashCode() is used a lot, so it has an important effect on execution time. Finally, it's so small that it's possible to study the generated code in great detail.

Here's the source of String.hashCode():

    public int hashCode() {
        int h = hash;
        if (h == 0) {
            int off = offset;
            char val[] = value;
            int len = count;

for (int i = 0; i < len; i++) { h = 31*h + val[off++]; } hash = h; } return h; }

And here's the bytecode:

public int hashCode();
   0:   aload_0
   1:   getfield        #54; //Field hash:I
   4:   istore_1
   5:   iload_1
   6:   ifne    58
   9:   aload_0
   10:  getfield        #2; //Field offset:I
   13:  istore_2
   14:  aload_0
   15:  getfield        #4; //Field value:[C
   18:  astore_3
   19:  aload_0
   20:  getfield        #3; //Field count:I
   23:  istore  4
   25:  iconst_0
   26:  istore  5

28: iload 5 30: iload 4 32: if_icmpge 53 35: bipush 31 37: iload_1 38: imul 39: aload_3 40: iload_2 41: iinc 2, 1 44: caload 45: iadd 46: istore_1 47: iinc 5, 1 50: goto 28

53: aload_0 54: iload_1 55: putfield #54; //Field hash:I 58: iload_1 59: ireturn

While the entry and exit sequences of the method may be of some interest, they are unlikely to dominate the execution time. The loop from index 28-50 is where the real action is.

So, let's start with the code that Shark generates for this loop on AMD64:

i = 0:

0x7fa557064235:	xor    %ebp,%ebp

h = 31 * h:

0x7fa557064237:	mov    0x8(%rsp),%ecx
0x7fa55706423b:	lea    (%rcx,%rbp,1),%ecx
0x7fa55706423e:	imul   $0x1f,%r13d,%r13d

check that val is non-null:

0x7fa557064242:	test   %rdx,%rdx
0x7fa557064245:	jne    0x7fa5570642bf	not_zero45

If val is null, throw a NullPointerException. Don't worry about exactly what all this generated code does, but here it is:

0x7fa55706424b:	lea    0x10(%rbx),%rcx
0x7fa55706424f:	mov    0x10(%rsp),%r13
0x7fa557064254:	mov    %rcx,0x0(%r13)
0x7fa557064258:	mov    %rdi,0x20(%rbx)
0x7fa55706425c:	add    $0x23,%r12
0x7fa557064263:	mov    %r12,0x30(%rbx)
0x7fa557064267:	mov    %rax,0x70(%rbx)
0x7fa55706426b:	mov    %rdx,0x58(%rbx)
0x7fa55706426f:	mov    (%r15),%rax
0x7fa557064272:	mov    %rax,0x1d0(%r14)	last_Java_sp_addr46
0x7fa557064279:	mov    $0x3e4,%edx
0x7fa55706427e:	mov    $0x121d7b0,%rsi
0x7fa557064288:	mov    %r14,%rdi
0x7fa55706428b:	callq  0x7fa557063ff0
0x7fa557064290:	movq   $0x0,0x1d0(%r14)
0x7fa55706429b:	mov    0x8(%r14),%rax
0x7fa55706429f:	movq   $0x0,0x8(%r14)
0x7fa5570642a7:	mov    %rax,0x18(%rbx)
0x7fa5570642ab:	mov    %rax,0x8(%r14)
0x7fa5570642af:	mov    (%r15),%rax
0x7fa5570642b2:	lea    0x38(%rax),%rdx
0x7fa5570642b6:	mov    %rdx,0x0(%r13)
0x7fa5570642ba:	jmpq   0x7fa557064192	exception

Make sure that off is < val.length, throw an OutOfBoundsException if it isn't:

0x7fa5570642bf:	cmp    0x10(%rdx),%ecx	val.length, off
0x7fa5570642c2:	jae    0x7fa5570643d0	out_of_bounds

Load val[off]:

0x7fa5570642c8:	movzwl 0x18(%rdx,%rcx,2),%ecx

h += val[off]:

0x7fa5570642cd:	add    %ecx,%r13d

We're about to do a backward branch to the top of the loop. At this point we need to check to see if we need to move to a safepoint. A safepoint is the way that the JVM brings all threads to a stop for some reason, usually garbage collection.

Check to see if we should move to a safepoint:

0x7fa5570642d0:	mov    $0x17f8320,%rcx
0x7fa5570642da:	cmpl   $0x1,(%rcx)
0x7fa5570642dd:	jne    0x7fa55706433e	safepointed61

Move to a safepoint. We're going to save all live pointers in the Zero stack. We need to do this because the garbage collector needs to trace all live objects, so we put these references where the GC can see them.

0x7fa5570642e3:	lea    0x8(%rbx),%rcx
0x7fa5570642e7:	mov    0x10(%rsp),%rsi
0x7fa5570642ec:	mov    %rcx,(%rsi)
0x7fa5570642ef:	mov    %rdi,0x20(%rbx)
0x7fa5570642f3:	lea    0x25(%r12),%rcx

Save all the live references on the Zero stack:

0x7fa5570642f8:	mov    %rcx,0x30(%rbx)
0x7fa5570642fc:	mov    %rax,0x70(%rbx)
0x7fa557064300:	mov    %rdx,0x58(%rbx)

And make the call:

0x7fa557064304:	mov    (%r15),%rax
0x7fa557064307:	mov    %rax,0x1d0(%r14)
0x7fa55706430e:	mov    %r14,%rdi
0x7fa557064311:	callq  0x7fa557063fe0	SafepointSynchronize__block

Now we need to reload all the references that we saved on the Zero stack, because the GC might have moved some objects. We also check to see if SafepointSynchronize has raised an exception:

0x7fa557064316:	movq   $0x0,0x1d0(%r14)
0x7fa557064321:	mov    0x8(%r14),%rcx
0x7fa557064325:	lea    0x8(%r14),%rsi
0x7fa557064329:	test   %rcx,%rcx
0x7fa55706432c:	mov    0x58(%rbx),%rdx
0x7fa557064330:	mov    0x70(%rbx),%rax
0x7fa557064334:	mov    0x20(%rbx),%rdi
0x7fa557064338:	jne    0x7fa55706441a	exception24

Nowe we can increment the loop counter and continue:


0x7fa55706433e:	inc    %ebp

0x7fa557064340: cmp 0xc(%rsp),%ebp 0x7fa557064344: jl 0x7fa557064237 loop

And that's it.

Now for the gcj version:

First check that val is non-null. gcj doesn't usually do null pointer checks, relying instead on a segfault on the memory access, but I enabled soft null pointer checks in order to compare gcj more accurately with Shark:

        testq   %rdx, %rdx
        je      .L161

Do a bounds check for offset > length:

        movl    8(%rdx), %r9d
        leaq    12(%rdx), %r10
        cmpl    %esi, %r9d
        jbe     .L170

Load offset+1 into edx:

        leal    1(%rsi), %edx

h = 0:

        xorl    %eax, %eax

i = 0:

        xorl    %ecx, %ecx

Jump into the middle of the loop at .L166:

        jmp     .L166


Bounds check:

        cmpl    %edx, %r9d
        jbe     .L165

h = 31*h. Note that gcj has used the identity a*31 == a*32-a:

        movl    %eax, %esi
        sall    $5, %esi
        subl    %eax, %esi
        movl    %esi, %eax

tmp = val[off++]:

        movl    %edx, %esi
        addl    $1, %edx

        movslq  %esi,%rsi
        addl    $1, %ecx
        movzwl  (%r10,%rsi,2), %esi

See if i < len:

        cmpl    %ecx, %r8d

h += tmp:

        leal    (%rsi,%rax), %eax
        jg      .L164

hash = h:

        movl    %eax, 32(%rdi)
        addq    $8, %rsp
        call    _Jv_ThrowNullPointerException@PLT
        movl    %esi, %edx
        movl    %edx, %edi
        call    _Jv_ThrowBadArrayIndex@PLT

gcj is obviously generating much better code here, and we need to look at why. Firstly, gcj doesn't use safepoints: the Boehm garbage collector that gcj uses scans conservatively and never moves anything, so we don't have any safepoints. Also, gcj has hoisted the null pointer check outside the loop, which is good, but it's left the array bounds check inside the loop, which is bad. I think this is a gcj bug.

Finally, the HotSpot server JIT:

  0x00007fb6a98eed01: mov    %ebp,%r8d
  0x00007fb6a98eed04: sub    %r10d,%r8d

0x00007fb6a98eed07: cmp %r11d,%r8d 0x00007fb6a98eed0a: cmovg %r11d,%r8d 0x00007fb6a98eed0e: sub %r9d,%r8d 0x00007fb6a98eed11: and $0xfffffffffffffffc,%r8d 0x00007fb6a98eed15: add %r9d,%r8d 0x00007fb6a98eed18: cmp %r8d,%r9d 0x00007fb6a98eed1b: jge 0x00007fb6a98eed6a

0x00007fb6a98eed20: mov %r9d,%ebx 0x00007fb6a98eed23: add %r10d,%ebx 0x00007fb6a98eed26: movzwl 0x18(%rdi,%rbx,2),%edx 0x00007fb6a98eed2b: mov %eax,%ecx 0x00007fb6a98eed2d: shl $0x5,%ecx 0x00007fb6a98eed30: sub %eax,%ecx 0x00007fb6a98eed32: add %edx,%ecx 0x00007fb6a98eed34: movslq %ebx,%rbx 0x00007fb6a98eed37: movzwl 0x1a(%rdi,%rbx,2),%eax 0x00007fb6a98eed3c: movzwl 0x1e(%rdi,%rbx,2),%edx 0x00007fb6a98eed41: movzwl 0x1c(%rdi,%rbx,2),%ebx 0x00007fb6a98eed46: mov %ecx,%esi 0x00007fb6a98eed48: shl $0x5,%esi 0x00007fb6a98eed4b: sub %ecx,%esi 0x00007fb6a98eed4d: add %eax,%esi 0x00007fb6a98eed4f: add $0x4,%r9d 0x00007fb6a98eed53: mov %esi,%ecx 0x00007fb6a98eed55: shl $0x5,%ecx 0x00007fb6a98eed58: sub %esi,%ecx 0x00007fb6a98eed5a: add %ebx,%ecx 0x00007fb6a98eed5c: mov %ecx,%eax 0x00007fb6a98eed5e: shl $0x5,%eax 0x00007fb6a98eed61: sub %ecx,%eax 0x00007fb6a98eed63: add %edx,%eax 0x00007fb6a98eed65: cmp %r8d,%r9d 0x00007fb6a98eed68: jl 0x00007fb6a98eed20

0x00007fb6a98eed6a: cmp %r11d,%r9d 0x00007fb6a98eed6d: jge 0x00007fb6a98eed92

0x00007fb6a98eed6f: nop 0x00007fb6a98eed70: mov %r9d,%r8d 0x00007fb6a98eed73: add %r10d,%r8d 0x00007fb6a98eed76: mov %eax,%ecx 0x00007fb6a98eed78: shl $0x5,%ecx 0x00007fb6a98eed7b: sub %eax,%ecx 0x00007fb6a98eed7d: cmp %ebp,%r8d 0x00007fb6a98eed80: jae 0x00007fb6a98eeda6

0x00007fb6a98eed70: mov %r9d,%r8d 0x00007fb6a98eed73: add %r10d,%r8d 0x00007fb6a98eed76: mov %eax,%ecx 0x00007fb6a98eed78: shl $0x5,%ecx 0x00007fb6a98eed7b: sub %eax,%ecx 0x00007fb6a98eed7d: cmp %ebp,%r8d 0x00007fb6a98eed80: jae 0x00007fb6a98eeda6

0x00007fb6a98eed82: movzwl 0x18(%rdi,%r8,2),%eax 0x00007fb6a98eed88: add %ecx,%eax 0x00007fb6a98eed8a: inc %r9d 0x00007fb6a98eed8d: cmp %r11d,%r9d 0x00007fb6a98eed90: jl 0x00007fb6a98eed70

Wow! What is going on here?

It looks pretty hairy, but really it's quite simple: HotSpot has unrolled the inner loop four times. Otherwise, the code here is pretty much the same as that generated by gcj, which doesn't unroll loops by default.

But there is no code in the loop to move to a safepoint. Why is that? HotSpot detects counted loops and removes safepoints from them, because we know that they will finish in finite time. So, HotSpot only inserts safepoints in backward branches that are unbounded.

To conclude, saving and re-loading all the registers at a safepoint has some unpleasant consequences. In particular, it means that null pointer checks cannot be hoisted outside a loop, because LLVM has no way to know that a safepoint won't replace a valid pointer with NULL. Also, LLVM has no way to know that a safepoint won't change the loop limit, so LLVM can't find counted loops.

So, it's fair to say that the presence of SafepointSynchronize in a loop breaks most loop optimizations. It's pretty clear that if loop optimizations are to be done at all, safepoints must be eliminated.

However, it might be possible to persuade LLVM that a safepoint will never set a live reference to NULL, eliminating a lot of null pointer checks. I suspect this will be less rewarding than eliminating safepoints, though.

So, what's to be done? Loop optimizations are probably the most significant part of any optimizing compiler, so it would be a major boost to have them in Shark. LLVM can presumably detect counted loops, so a Java-specific LLVM pass could remove unnecessary safepoints in loops. It's worth investigating. Loop detection in the Shark front end is probably possible, but it might be a lot of work.

PS, posted 3 Mar 2009 at 14:54 UTC by aph » (Master)

In the development builds of OpenJDK it's possible to control the generation of loop safepoints with -XX:-UseLoopSafepoints. So, I did the experiment of generating the code again.

i = 0:

0x7f6f8b1cc17f:	xor    %r10d,%r10d

h = 31 * h:

0x7f6f8b1cc182:	lea    (%r9,%r10,1),%ecx
0x7f6f8b1cc186:	imul   $0x1f,%r13d,%r13d

check that val is non-null:

0x7f6f8b1cc18a:	test   %r8,%r8
0x7f6f8b1cc18d:	jne    0x7f6f8b1cc204

If val is null, throw a NullPointerException.

0x7f6f8b1cc193:	lea    0x10(%rbx),%rcx
0x7f6f8b1cc197:	mov    %rcx,(%r12)
0x7f6f8b1cc19b:	mov    %rdi,0x20(%rbx)
0x7f6f8b1cc19f:	add    $0x22,%rsi
0x7f6f8b1cc1a6:	mov    %rsi,0x30(%rbx)
0x7f6f8b1cc1aa:	mov    %rax,0x70(%rbx)
0x7f6f8b1cc1ae:	mov    %r8,0x58(%rbx)
0x7f6f8b1cc1b2:	mov    (%r15),%rax
0x7f6f8b1cc1b5:	mov    %rax,0x1a0(%r14)
0x7f6f8b1cc1bc:	mov    $0x984520,%eax
0x7f6f8b1cc1c1:	mov    $0x3e4,%edx
0x7f6f8b1cc1c6:	mov    $0xf4aa60,%rsi
0x7f6f8b1cc1d0:	mov    %r14,%rdi
0x7f6f8b1cc1d3:	callq  *%rax             
0x7f6f8b1cc1d5:	movq   $0x0,0x1a0(%r14)
0x7f6f8b1cc1e0:	mov    0x8(%r14),%rax
0x7f6f8b1cc1e4:	movq   $0x0,0x8(%r14)
0x7f6f8b1cc1ec:	mov    %rax,0x18(%rbx)
0x7f6f8b1cc1f0:	mov    %rax,0x8(%r14)
0x7f6f8b1cc1f4:	mov    (%r15),%rax
0x7f6f8b1cc1f7:	lea    0x38(%rax),%r8
0x7f6f8b1cc1fb:	mov    %r8,(%r12)
0x7f6f8b1cc1ff:	jmpq   0x7f6f8b1cc15c     exception

Make sure that off is < val.length, throw an OutOfBoundsException if it isn't:

0x7f6f8b1cc204:	cmp    0x10(%r8),%ecx
0x7f6f8b1cc208:	jae    0x7f6f8b1cc2a7

Load val[off]:

0x7f6f8b1cc20e:	movzwl 0x18(%r8,%rcx,2),%ecx

h += val[off]:

0x7f6f8b1cc214:	add    %ecx,%r13d

Now we can increment the loop counter and continue:


0x7f6f8b1cc217:	inc    %r10d
0x7f6f8b1cc21a:	cmp    %edx,%r10d
0x7f6f8b1cc21d:	jl     0x7f6f8b1cc182	loop

The very strange thing about this is that LLVM is not using the opportunity to hoist the null pointer check out of the loop. Also, the code repeatedly reads val.length from memory, despite the fact that the address of val is in %r8 and nothing ever writes to it. I suspect that the call to SharkRuntime__throw_NullPointerException may be responsible. This unfortunately means that even though the code is better, it isn't vry much better.

PPS, posted 3 Mar 2009 at 15:54 UTC by aph » (Master)

I ran SPEC jvm98 against Shark with and without loop safepoints enabled. Sadly, the results confirm what I feared: there is no measurable performance difference.

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!

Share this page