There are two approaches to studying optimization: the large and the
small. The large involves benchmarking and profiling substantial
blocks of code, often entire applications. The small involves much
smaller chunks of code, inspecting them one instruction at a time to
discover what's going on. One has to be careful, of course, not to
read too much into a single example, but it's often instructive.
This is an exercise in the small. The very first method compiled
by HotSpot when the VM starts is often String.hashCode(). The speed
of hashCode() seems to follow the same pattern as the benchmarks Gary
Benson presented with repect to speed: Shark is slowest, then gcj,
then HotSpot. Also, hashCode() is used a lot, so it has an important
effect on execution time. Finally, it's so small that it's possible
to study the generated code in great detail.
Here's the source of String.hashCode():
public int hashCode() {
int h = hash;
if (h == 0) {
int off = offset;
char val[] = value;
int len = count;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return h;
}
And here's the bytecode:
public int hashCode();
Code:
0: aload_0
1: getfield #54; //Field hash:I
4: istore_1
5: iload_1
6: ifne 58
9: aload_0
10: getfield #2; //Field offset:I
13: istore_2
14: aload_0
15: getfield #4; //Field value:[C
18: astore_3
19: aload_0
20: getfield #3; //Field count:I
23: istore 4
25: iconst_0
26: istore 5
28: iload 5
30: iload 4
32: if_icmpge 53
35: bipush 31
37: iload_1
38: imul
39: aload_3
40: iload_2
41: iinc 2, 1
44: caload
45: iadd
46: istore_1
47: iinc 5, 1
50: goto 28
53: aload_0
54: iload_1
55: putfield #54; //Field hash:I
58: iload_1
59: ireturn
While the entry and exit sequences of the method may be of some
interest, they are unlikely to dominate the execution time. The loop
from index 28-50 is where the real action is.
So, let's start with the code that Shark generates for this loop on
AMD64:
i = 0:
0x7fa557064235: xor %ebp,%ebp
h = 31 * h:
loop:
0x7fa557064237: mov 0x8(%rsp),%ecx
0x7fa55706423b: lea (%rcx,%rbp,1),%ecx
0x7fa55706423e: imul $0x1f,%r13d,%r13d
check that val is non-null:
0x7fa557064242: test %rdx,%rdx
0x7fa557064245: jne 0x7fa5570642bf not_zero45
If val is null, throw a NullPointerException. Don't worry about
exactly what all this generated code does, but here it is:
0x7fa55706424b: lea 0x10(%rbx),%rcx
0x7fa55706424f: mov 0x10(%rsp),%r13
0x7fa557064254: mov %rcx,0x0(%r13)
0x7fa557064258: mov %rdi,0x20(%rbx)
0x7fa55706425c: add $0x23,%r12
0x7fa557064263: mov %r12,0x30(%rbx)
0x7fa557064267: mov %rax,0x70(%rbx)
0x7fa55706426b: mov %rdx,0x58(%rbx)
0x7fa55706426f: mov (%r15),%rax
0x7fa557064272: mov %rax,0x1d0(%r14) last_Java_sp_addr46
0x7fa557064279: mov $0x3e4,%edx
0x7fa55706427e: mov $0x121d7b0,%rsi
0x7fa557064288: mov %r14,%rdi
0x7fa55706428b: callq 0x7fa557063ff0
SharkRuntime__throw_NullPointerException
0x7fa557064290: movq $0x0,0x1d0(%r14)
0x7fa55706429b: mov 0x8(%r14),%rax
0x7fa55706429f: movq $0x0,0x8(%r14)
0x7fa5570642a7: mov %rax,0x18(%rbx)
0x7fa5570642ab: mov %rax,0x8(%r14)
0x7fa5570642af: mov (%r15),%rax
0x7fa5570642b2: lea 0x38(%rax),%rdx
0x7fa5570642b6: mov %rdx,0x0(%r13)
0x7fa5570642ba: jmpq 0x7fa557064192 exception
Make sure that off is < val.length, throw an
OutOfBoundsException if
it isn't:
not_zero45:
0x7fa5570642bf: cmp 0x10(%rdx),%ecx val.length, off
0x7fa5570642c2: jae 0x7fa5570643d0 out_of_bounds
Load val[off]:
0x7fa5570642c8: movzwl 0x18(%rdx,%rcx,2),%ecx
h += val[off]:
0x7fa5570642cd: add %ecx,%r13d
We're about to do a backward branch to the top of the loop. At this
point we need to check to see if we need to move to a safepoint. A
safepoint is the way that the JVM brings all threads to a stop for
some reason, usually garbage collection.
Check to see if we should move to a safepoint:
0x7fa5570642d0: mov $0x17f8320,%rcx
0x7fa5570642da: cmpl $0x1,(%rcx)
0x7fa5570642dd: jne 0x7fa55706433e safepointed61
Move to a safepoint. We're going to save all live pointers in the
Zero stack. We need to do this because the garbage collector needs to
trace all live objects, so we put these references where the GC can
see them.
0x7fa5570642e3: lea 0x8(%rbx),%rcx
0x7fa5570642e7: mov 0x10(%rsp),%rsi
0x7fa5570642ec: mov %rcx,(%rsi)
0x7fa5570642ef: mov %rdi,0x20(%rbx)
0x7fa5570642f3: lea 0x25(%r12),%rcx
Save all the live references on the Zero stack:
0x7fa5570642f8: mov %rcx,0x30(%rbx)
0x7fa5570642fc: mov %rax,0x70(%rbx)
0x7fa557064300: mov %rdx,0x58(%rbx)
And make the call:
0x7fa557064304: mov (%r15),%rax
0x7fa557064307: mov %rax,0x1d0(%r14)
0x7fa55706430e: mov %r14,%rdi
0x7fa557064311: callq 0x7fa557063fe0 SafepointSynchronize__block
Now we need to reload all the references that we saved on the Zero
stack, because the GC might have moved some objects. We also check to
see if SafepointSynchronize has raised an exception:
0x7fa557064316: movq $0x0,0x1d0(%r14)
0x7fa557064321: mov 0x8(%r14),%rcx
0x7fa557064325: lea 0x8(%r14),%rsi
0x7fa557064329: test %rcx,%rcx
0x7fa55706432c: mov 0x58(%rbx),%rdx
0x7fa557064330: mov 0x70(%rbx),%rax
0x7fa557064334: mov 0x20(%rbx),%rdi
0x7fa557064338: jne 0x7fa55706441a exception24
Nowe we can increment the loop counter and continue:
i++:
safepointed61:
0x7fa55706433e: inc %ebp
0x7fa557064340: cmp 0xc(%rsp),%ebp
0x7fa557064344: jl 0x7fa557064237 loop
And that's it.
Now for the gcj version:
First check that val is non-null. gcj doesn't usually do null
pointer
checks, relying instead on a segfault on the memory access, but I
enabled soft null pointer checks in order to compare gcj more
accurately with Shark:
testq %rdx, %rdx
je .L161
Do a bounds check for offset > length:
movl 8(%rdx), %r9d
leaq 12(%rdx), %r10
cmpl %esi, %r9d
jbe .L170
Load offset+1 into edx:
leal 1(%rsi), %edx
h = 0:
xorl %eax, %eax
i = 0:
xorl %ecx, %ecx
Jump into the middle of the loop at .L166:
jmp .L166
.L164:
Bounds check:
cmpl %edx, %r9d
jbe .L165
h = 31*h. Note that gcj has used the identity a*31 == a*32-a:
movl %eax, %esi
sall $5, %esi
subl %eax, %esi
movl %esi, %eax
tmp = val[off++]:
movl %edx, %esi
addl $1, %edx
.L166:
.L167:
movslq %esi,%rsi
addl $1, %ecx
movzwl (%r10,%rsi,2), %esi
See if i < len:
cmpl %ecx, %r8d
h += tmp:
leal (%rsi,%rax), %eax
jg .L164
.L160:
hash = h:
movl %eax, 32(%rdi)
.L158:
addq $8, %rsp
ret
.L161:
call _Jv_ThrowNullPointerException@PLT
.L170:
movl %esi, %edx
.L165:
movl %edx, %edi
call _Jv_ThrowBadArrayIndex@PLT
gcj is obviously generating much better code here, and we need
to look
at why. Firstly, gcj doesn't use safepoints: the Boehm garbage
collector that gcj uses scans conservatively and never moves anything,
so we don't have any safepoints. Also, gcj has hoisted the null
pointer check outside the loop, which is good, but it's left the array
bounds check inside the loop, which is bad. I think this is a gcj
bug.
Finally, the HotSpot server JIT:
0x00007fb6a98eed01: mov %ebp,%r8d
0x00007fb6a98eed04: sub %r10d,%r8d
0x00007fb6a98eed07: cmp %r11d,%r8d
0x00007fb6a98eed0a: cmovg %r11d,%r8d
0x00007fb6a98eed0e: sub %r9d,%r8d
0x00007fb6a98eed11: and $0xfffffffffffffffc,%r8d
0x00007fb6a98eed15: add %r9d,%r8d
0x00007fb6a98eed18: cmp %r8d,%r9d
0x00007fb6a98eed1b: jge 0x00007fb6a98eed6a
0x00007fb6a98eed20: mov %r9d,%ebx
0x00007fb6a98eed23: add %r10d,%ebx
0x00007fb6a98eed26: movzwl 0x18(%rdi,%rbx,2),%edx
0x00007fb6a98eed2b: mov %eax,%ecx
0x00007fb6a98eed2d: shl $0x5,%ecx
0x00007fb6a98eed30: sub %eax,%ecx
0x00007fb6a98eed32: add %edx,%ecx
0x00007fb6a98eed34: movslq %ebx,%rbx
0x00007fb6a98eed37: movzwl 0x1a(%rdi,%rbx,2),%eax
0x00007fb6a98eed3c: movzwl 0x1e(%rdi,%rbx,2),%edx
0x00007fb6a98eed41: movzwl 0x1c(%rdi,%rbx,2),%ebx
0x00007fb6a98eed46: mov %ecx,%esi
0x00007fb6a98eed48: shl $0x5,%esi
0x00007fb6a98eed4b: sub %ecx,%esi
0x00007fb6a98eed4d: add %eax,%esi
0x00007fb6a98eed4f: add $0x4,%r9d
0x00007fb6a98eed53: mov %esi,%ecx
0x00007fb6a98eed55: shl $0x5,%ecx
0x00007fb6a98eed58: sub %esi,%ecx
0x00007fb6a98eed5a: add %ebx,%ecx
0x00007fb6a98eed5c: mov %ecx,%eax
0x00007fb6a98eed5e: shl $0x5,%eax
0x00007fb6a98eed61: sub %ecx,%eax
0x00007fb6a98eed63: add %edx,%eax
0x00007fb6a98eed65: cmp %r8d,%r9d
0x00007fb6a98eed68: jl 0x00007fb6a98eed20
0x00007fb6a98eed6a: cmp %r11d,%r9d
0x00007fb6a98eed6d: jge 0x00007fb6a98eed92
0x00007fb6a98eed6f: nop
0x00007fb6a98eed70: mov %r9d,%r8d
0x00007fb6a98eed73: add %r10d,%r8d
0x00007fb6a98eed76: mov %eax,%ecx
0x00007fb6a98eed78: shl $0x5,%ecx
0x00007fb6a98eed7b: sub %eax,%ecx
0x00007fb6a98eed7d: cmp %ebp,%r8d
0x00007fb6a98eed80: jae 0x00007fb6a98eeda6
0x00007fb6a98eed70: mov %r9d,%r8d
0x00007fb6a98eed73: add %r10d,%r8d
0x00007fb6a98eed76: mov %eax,%ecx
0x00007fb6a98eed78: shl $0x5,%ecx
0x00007fb6a98eed7b: sub %eax,%ecx
0x00007fb6a98eed7d: cmp %ebp,%r8d
0x00007fb6a98eed80: jae 0x00007fb6a98eeda6
0x00007fb6a98eed82: movzwl 0x18(%rdi,%r8,2),%eax
0x00007fb6a98eed88: add %ecx,%eax
0x00007fb6a98eed8a: inc %r9d
0x00007fb6a98eed8d: cmp %r11d,%r9d
0x00007fb6a98eed90: jl 0x00007fb6a98eed70
Wow! What is going on here?
It looks pretty hairy, but really it's quite simple: HotSpot has
unrolled the inner loop four times. Otherwise, the code here is
pretty much the same as that generated by gcj, which doesn't unroll
loops by default.
But there is no code in the loop to move to a safepoint. Why is
that?
HotSpot detects counted loops and removes safepoints from them,
because we know that they will finish in finite time. So, HotSpot
only inserts safepoints in backward branches that are unbounded.
To conclude, saving and re-loading all the registers at a safepoint
has some unpleasant consequences. In particular, it means that null
pointer checks cannot be hoisted outside a loop, because LLVM has no
way to know that a safepoint won't replace a valid pointer with NULL.
Also, LLVM has no way to know that a safepoint won't change the loop
limit, so LLVM can't find counted loops.
So, it's fair to say that the presence of SafepointSynchronize in a
loop breaks most loop optimizations. It's pretty clear that if loop
optimizations are to be done at all, safepoints must be eliminated.
However, it might be possible to persuade LLVM that a safepoint will
never set a live reference to NULL, eliminating a lot of null pointer
checks. I suspect this will be less rewarding than eliminating
safepoints, though.
So, what's to be done? Loop optimizations are probably the most
significant part of any optimizing compiler, so it would be a major
boost to have them in Shark. LLVM can presumably detect counted
loops, so a Java-specific LLVM pass could remove unnecessary
safepoints in loops. It's worth investigating. Loop detection in the
Shark front end is probably possible, but it might be a lot of work.