Older blog entries for gary (starting at number 254)

10 Jun 2010 »

Shark build passes TCK

An IcedTea build of OpenJDK using Shark passed the Java SE 6 TCK today. Fedora 12, x86_64, LLVM 2.6, icedtea6-7674917fa451. Dr Fun is here!

Syndicated 2010-06-10 15:08:48 from gbenson.net

7 May 2010 »

Zero and Shark in IcedTea

Over the past few months I’ve been working on Shark in it’s own forest. This has allowed me to track upstream HotSpot (and the goal is to upstream Shark, so it’s the correct place to base it) but it’s meant that the Shark (and Zero) in IcedTea6 are old. I’m trying to update Zero and Shark in Icedtea6, but it’s a nightmare.

Zero in IcedTea6 has the ARM interpreter which can’t go upstream. Upstream has all the JSR 292 stuff which can’t go downstream. Between these two are fixes that need synchronizing, in both directions. One of the fixes, 6939182 (aka PR icedtea/324) requires the ARM interpreter to be updated before it can be committed, so that needs keeping separate… except that the changes to Shark to support that one are pretty invasive and hard to strip out.

It’s a real mess. I’m pretty close to giving up on it.

Syndicated 2010-05-07 08:37:10 from gbenson.net

21 Apr 2010 »

New stack overflow code

It’s funny, but I’d kind of thought that the version of Zero that got upstreamed was pretty much cut and dried. It hadn’t had any real changes for months, after all, and the S/390 compatibility work required was so minimal—in terms of code changed, if not in terms of time taken!—that I didn’t think there was much more to do. But in this last couple of months I’ve made a ton of fixes, including one big change that’s upstream (improved stack overflow handling) and one that’s hopefully heading that way (some frame anchor fixes).

The new stack overflow code is pretty awesome (if I say so myself!) The old code jumped through some pretty bizarre hoops in order to throw the exception from the exact frame of the method being called. This is more or less how it happens in “classic” HotSpot, which decides which method you’re in from the ABI stack, but wasn’t workable for Zero which might have run out of Zero stack with no room to put the frame of the method from which you want to throw the exception. Argh! So I wrote this look-ahead detection thing, to throw the exception when you still had some stack to put your frame on, which worked, but was easily fooled. A valid Java method could have a frame far larger than the default thread stack size on most systems, so the lookahead worked on the assumption that a frame probably wouldn’t be that big. There were a number of places where the stack could overflow, in which case you’d hit an Unimplemented and the VM would bomb out. Not good, and not very secure either.

The new code is awesome. I realised that it doesn’t really matter exactly which frame the StackOverflowError got thrown from, only that it got thrown. I wrote a little method, ZeroStack::overflow_check, which is inline and can be called from anywhere—any VM state, etc. It does the check, and throws the exception, via ZeroStack::handle_overflow. That one is pretty complicated, but I did say I wanted the overflow check to be callable from anywhere, so it has to set up the frame anchor if it isn’t already, and cope with whatever thread state it finds itself in. The best thing about ZeroStack::handle_overflow? You can call it from Shark, which let me blow away all the ugly stack overflow code I was griping about in my last post.

Those Unimplementeds in the stack overflow detector were my least favourite thing about Zero, my guilty secret (in as much as open source code published on the internet is secret). And now they’re gone! No more ugly code in Zero! Until a few days later, when I discovered some stuff in the frame anchor that was reversed. But that’s for another blog…

Syndicated 2010-04-21 11:23:57 from gbenson.net

24 Mar 2010 »

Stack overflow detection

I added stack overflow detection to Shark last week. It works, but it made me realise just how odd Zero’s overflow detection is. Knowing what I know now, I think I can make it cleaner, better and faster, for Zero and Shark. That’s my project for this afternoon…

Syndicated 2010-03-24 13:26:16 from gbenson.net

12 Mar 2010 »

Shark

I’m back on Shark, after a four month hiatus. A minor milestone: it can build itself again.

Syndicated 2010-03-12 09:29:02 from gbenson.net

28 Oct 2009 »

JNI wrapper compilation

I now have a version of Shark with a basic implementation JNI wrapper compilation. Sadly I can’t say if it’s faster or not yet because it’s totally unstable!

The problem is this. When HotSpot wishes to compile a normal (interpreted) method, the thread initiating the compile simply adds it to a queue and carries on doing whatever it was it was doing. A separate thread, the compiler thread, loops over this queue, compiling methods one at a time. This means there’s only ever the one thread making LLVM calls, and everything is rosy.

When HotSpot wishes to compile a native (JNI) method, the thread initiating the compile bypasses the queue and does it immediately. It acquires a lock, the adapter hander library lock, so there’s only every one native method being compiled at once, but in the meantime the compiler thread is in all likelihood busy compiling some normal method or another, so there’s two threads making LLVM calls. LLVM doesn’t like this, not LLVM prior to 2.6 at any rate, and even then not without being written with a separate LLVMContext for each thread.

The obvious fix for this is for Shark to acquire a lock before compiling either a normal or a native method, ensuring that only one thread is calling into LLVM at once. This doesn’t work, however, as the compiler thread runs _thread_in_native. The benefit of this is that the compiler thread does not have to halt for safepoints (and the rest of the VM doesn’t have to wait for the compiler thread to halt) but the drawback of this is that threads running _thread_in_native may not own locks. You can’t make the compiler thread run other than _thread_in_native, not without losing the large chunk of the server compiler that Shark shares, and you can’t hack a lock in there anyway (by using a pthread mutex, say, rather than a HotSpot lock) because it’ll deadlock the first time a safepoint occurs when the compiler thread holds the lock (compiling a normal method) and a Java thread is blocking trying to take it (to compile a native one).

I’ve been circling around this issue for a couple of days now, but it looks like the only solution is to require LLVM 2.6 and rearrange everything to use separate contexts.

Syndicated 2009-10-28 10:57:11 from gbenson.net

15 Oct 2009 »

Zero, now available upstream

So the two halves of Zero are upstream!

They’re in different forests; there won’t be anywhere you can hg fclone from and get a buildable Zero until the two forests get promoted. But upstream is upstream!

Thank you Tom Rodriguez, Tim Bell, Andrew Hughes, and everyone else who made this happen :D

Syndicated 2009-10-15 13:56:19 from gbenson.net

9 Oct 2009 »

Long overdue update

It’s been a while; time for a catchup!

June and July I mostly spent cleaning up Shark. HotSpot’s existing JITs, client and server, both inline pointers to objects in the native code they generate. These pointers need to be visible to the garbage collector, both so it knows the objects are live and so it can rewrite the pointers if it moves the object. This is trivial for client and server, as they both have access to the native code they generate: each method’s code is accompanied by a list of pointer locations within it. Shark, on the other hand, has no access to its generated native code other than knowing its address and size. Pointers can’t be inlined in Shark – it can’t tell the garbage collector where they are – so Shark had to load all garbage collected object pointers from other places, generally wherever the interpreter stored them.

This caused no end of problems. Aside from requiring more loads than the other JITs (at least one per object, and sometimes three or four) Shark had to mirror huge chunks of the interpreter. It had to cope with objects that were loaded in the VM (so the compiler could see them) but not cached in the interpreter (where the compiled code could see them). Because Shark wasn’t behaving like the other compilers, HotSpot’s compiler support layer would break in all kinds of exciting and imaginative ways. Finally, the method used by the server JIT to optimize interface calls to virtual calls and virtual calls to direct calls could not be used. Aside from the obvious speedup, Shark can only inline direct calls, so reducing virtual and interface calls to direct calls exposes them to the inliner. Calls in Shark have a lot of overhead, so this would have been a big win.

Sometime in May I figured out how to fake inlined object pointers. HotSpot’s compiler interface expects the compiler to generate native code into a CodeBuffer. Shark, of course, uses LLVM, which generates code into a buffer it allocates. Shark had a HotSpot code buffer, but it didn’t do a lot with it. Now, every time Shark has an object pointer to inline, it writes it into the HotSpot code buffer where the garbage collector can see it. The generated code then loads the object pointer from the code buffer whenever it needs it. The pointer is still not inlined – there’s still a load required – but now it’s always only one load. Not a big speedup in itself, but it meant the remaining interpreterisms could be removed, which fixed the support layer breakages and allowed me to copy the interface-virtual-direct call optimization code more or less directly from the server compiler. Everything got a lot more stable, a lot more clean, and a little bit faster in the bargain.

During August I began the (long!) process of preparing Zero for submission to OpenJDK proper. It took some time to get started, but the patch has now gone through a couple of cycles of being reviewed by the HotSpot team: the code has been reformatted, the build system has been almost completely rewritten, and a bunch of other things got changed. It’s still ongoing, but the HotSpot part of the patch seems close to acceptance and the much smaller remainder will hopefully be reviewed soon. I’ve been ramping up my testing with each step: this one bootstrapped and built itself on 32-bit x86, x86_64 and 32-bit PowerPC, and has bootstrapped itself and is in the process of building itself on 64-bit PowerPC and 64-bit zSeries.

Also in August, Ed Nevill released his assembler interpreter for ARM. It replaces part of Zero with hand-crafted assembly language, making OpenJDK 2-8 times faster on that platform.

After the Zero patch is accepted, my next task will be getting Zero certified on 64-bit zSeries. I won’t have a lot of time for Shark until that’s done, but I have one last thing I want to do before I step aside for a couple of months. Xerxes Rånby posted some benchmarks of Zero, Shark, and the assembler interpreter on ARM; Shark is gratifyingly faster than everything on five of the tests, but considerably slower than the assembler interpreter on the other four. I’m not happy with that!

On the tests where it’s slower, Shark is showing very little improvement over Zero, which suggests that these benchmarks are not spending a lot of time interpreting bytecode (which Shark would have compiled and made faster). I suspect these benchmarks are spending a lot of time in JNI calls. Back in February, Ed Nevill posted some profilies he had made to figure out why some interpreter improvements he had made had had very little effect; those profiles seemed to imply that the VM was spending a lot of its time setting up JNI calls. Zero uses libffi for this, and we at Red Hat have long suspected that libffi is slow.

HotSpot’s JITs have the capability to “compile” JNI methods. This sounds odd, as JNI methods are already native code; what’s actually getting compiled is the interface between the JVM and the native JNI code. If Shark could compile JNI methods, whenever HotSpot found a hot JNI method it would be able to replace its generic, one-size-fits-all interface code (using libffi) with an LLVM-generated interface custom built specifically for that method. I’m going to spend a week or so making Shark able to compile these methods, before I descend into zSeries TCK hell…

Syndicated 2009-10-09 16:20:37 from gbenson.net

10 Jun 2009 »

First Shark self-builds

Xerxes Rånby and I simultaneously decided to try building Shark with Shark today… and both worked!

Syndicated 2009-06-10 14:37:10 from gbenson.net

29 May 2009 »

Instrumenting Zero and Shark

Every so often I find myself adding little bits of code to Zero or Shark, to figure out obscure bugs or to see whether working on some optimization or another is worthwhile. I did it again today, and thought I’d write a little tutorial.

The first versions of Shark implemented a lot of things the same way as the interpreter, ie slowly. Since February I’ve been slowly replacing these interpreter-isms with implementations that are more compiler-like, and today there’s only one left: invokeinterface. The reason I left it until last is that it’s the biggest and the ugliest: it’ll no doubt be a pig to do, and quite frankly I don’t really want to do it. To see if I could get away with not bothering with it, I decided to instrument Shark so I could run SPECjvm98 and have it print out the number of times Shark-compiled code executed an invokeinterface for every benchmark.

First, I needed somewhere to store the counter. I decided to put it in the individual thread’s JavaThread objects as they’re easy to get at from both C++ and Shark, and they’re thread-specific so you don’t have to worry about locking.

diff -r 4cc0bc87aef4 ports/hotspot/src/os_cpu/linux_zero/vm/thread_linux_zero.hpp
--- a/ports/hotspot/src/os_cpu/linux_zero/vm/thread_linux_zero.hpp	Fri May 29 12:46:07 2009 +0100
+++ b/ports/hotspot/src/os_cpu/linux_zero/vm/thread_linux_zero.hpp	Fri May 29 14:44:04 2009 +0100
@@ -32,6 +32,25 @@
     _top_zero_frame = NULL;
   }

+ private:
+  int _interface_call_count;
+
+ public:
+  int interface_call_count() const
+  {
+    return _interface_call_count;
+  }
+  void set_interface_call_count(int interface_call_count)
+  {
+    _interface_call_count = interface_call_count;
+  }
+
+ public:
+  static ByteSize interface_call_count_offset()
+  {
+    return byte_offset_of(JavaThread, _interface_call_count);
+  }
+
  public:
   ZeroStack *zero_stack()
   {

So we have the field itself, a getter and setter to access it from C++, and a static method to expose the offset of the field in the thread object to Shark. Next we need to make Shark update the counter:

diff -r 4cc0bc87aef4 ports/hotspot/src/share/vm/shark/sharkTopLevelBlock.cpp
--- a/ports/hotspot/src/share/vm/shark/sharkTopLevelBlock.cpp	Fri May 29 12:46:07 2009 +0100
+++ b/ports/hotspot/src/share/vm/shark/sharkTopLevelBlock.cpp	Fri May 29 14:44:04 2009 +0100
@@ -985,6 +985,16 @@
 // Interpreter-style interface call lookup
 Value* SharkTopLevelBlock::get_interface_callee(SharkValue *receiver)
 {
+  Value *count_addr = builder()->CreateAddressOfStructEntry(
+    thread(),
+    JavaThread::interface_call_count_offset(),
+    PointerType::getUnqual(SharkType::jint_type()));
+  builder()->CreateStore(
+    builder()->CreateAdd(
+      builder()->CreateLoad(count_addr),
+      LLVMValue::jint_constant(1)),
+    count_addr);
+
   SharkConstantPool constants(this);
   Value *cache = constants.cache_entry_at(iter()->get_method_index());

We’re almost ready to add the SPECjvm98-specific bits now, but there’s one thing left. Some of the benchmarks are multithreaded, but we have one counter per thread; we need a way to set and get the counters from all running threads. HotSpot has some code to iterate over all the threads in the VM, but it’s all private to the Threads class. Not to worry though, we’ll just stick it in there:

diff -r 4cc0bc87aef4 openjdk-ecj/hotspot/src/share/vm/runtime/thread.hpp
--- a/openjdk-ecj/hotspot/src/share/vm/runtime/thread.hpp	Fri May 29 12:46:07 2009 +0100
+++ b/openjdk-ecj/hotspot/src/share/vm/runtime/thread.hpp	Fri May 29 14:44:04 2009 +0100
@@ -1669,6 +1669,9 @@
   // Deoptimizes all frames tied to marked nmethods
   static void deoptimized_wrt_marked_nmethods();

+ public:
+  static void reset_interface_call_counts();
+  static int  interface_call_counts_total();
 };

diff -r 4cc0bc87aef4 openjdk-ecj/hotspot/src/share/vm/runtime/thread.cpp
--- a/openjdk-ecj/hotspot/src/share/vm/runtime/thread.cpp	Fri May 29 12:46:07 2009 +0100
+++ b/openjdk-ecj/hotspot/src/share/vm/runtime/thread.cpp	Fri May 29 14:44:04 2009 +0100
@@ -3828,6 +3828,21 @@
   }
 }

+void Threads::reset_interface_call_counts()
+{
+  ALL_JAVA_THREADS(thread) {
+    thread->set_interface_call_count(0);
+  }
+}
+
+int Threads::interface_call_counts_total()
+{
+  int total = 0;
+  ALL_JAVA_THREADS(thread) {
+    total += thread->interface_call_count();
+  }
+  return total;
+}

 // Lifecycle management for TSM ParkEvents.
 // ParkEvents are type-stable (TSM).

Now we’re ready to add some SPECjvm98-specific code. A quick poke around in SPECjvm98 brings up the method spec.harness.ProgramRunner::runOnce as a likely place to hook ourselves in. This will be run by the interpreter — Shark won’t compile it as it’s only called a few times — so we put our code into the C++ interpreter’s normal entry which is the bit that executes bytecode methods:

diff -r 4cc0bc87aef4 ports/hotspot/src/cpu/zero/vm/cppInterpreter_zero.cpp
--- a/ports/hotspot/src/cpu/zero/vm/cppInterpreter_zero.cpp	Fri May 29 12:46:07 2009 +0100
+++ b/ports/hotspot/src/cpu/zero/vm/cppInterpreter_zero.cpp	Fri May 29 14:44:04 2009 +0100
@@ -42,6 +42,26 @@
   JavaThread *thread = (JavaThread *) THREAD;
   ZeroStack *stack = thread->zero_stack();

+  char *benchmark = NULL;
+  {
+    ResourceMark rm;
+    const char *name = method->name_and_sig_as_C_string();
+    if (strstr(name, “spec.harness.ProgramRunner.runOnce(”) == name) {
+      intptr_t *locals = stack->sp() + method->size_of_parameters() - 1;
+      if (LOCALS_INT(5) == 100) {
+        Threads::reset_interface_call_counts();
+
+        name = LOCALS_OBJECT(1)->klass()->klass_part()->name()->as_C_string();
+        const char *limit = name + strlen(name);
+        while (*(–limit) != ‘/’);
+        const char *start = limit;
+        while (*(–start) != ‘/’);
+        start++;
+        benchmark = strndup(start, limit - start);
+      }
+    }
+  }
+
   // Adjust the caller’s stack frame to accomodate any additional
   // local variables we have contiguously with our parameters.
   int extra_locals = method->max_locals() - method->size_of_parameters();
@@ -59,6 +79,12 @@

   // Execute those bytecodes!
   main_loop(0, THREAD);
+
+  if (benchmark) {
+    tty->print_cr(”%s: %d interface calls”,
+                  benchmark, Threads::interface_call_counts_total());
+    free(benchmark);
+  }
 }

 void CppInterpreter::main_loop(int recurse, TRAPS)

This looks a little messy, but what it’s basically doing is spotting calls to spec.harness.ProgramRunner::runOnce and extracting the name of the benchmark from its arguments. It is complicated by the fact that SPECjvm98 intersperses full runs with hidden tenth-speed ones, so we use the speed argument (in LOCALS_INT(5)) to ignore the hidden runs.

Now we’re ready to run the benchmarks and see what happens:

Looks like making invokeinterface faster is worthwhile after all! Now all I have to do is do it ;)

Syndicated 2009-05-29 15:15:04 from gbenson.net

245 older entries...