Older blog entries for gary (starting at number 247)

9 Oct 2009 »

Long overdue update

It’s been a while; time for a catchup!

June and July I mostly spent cleaning up Shark. HotSpot’s existing JITs, client and server, both inline pointers to objects in the native code they generate. These pointers need to be visible to the garbage collector, both so it knows the objects are live and so it can rewrite the pointers if it moves the object. This is trivial for client and server, as they both have access to the native code they generate: each method’s code is accompanied by a list of pointer locations within it. Shark, on the other hand, has no access to its generated native code other than knowing its address and size. Pointers can’t be inlined in Shark – it can’t tell the garbage collector where they are – so Shark had to load all garbage collected object pointers from other places, generally wherever the interpreter stored them.

This caused no end of problems. Aside from requiring more loads than the other JITs (at least one per object, and sometimes three or four) Shark had to mirror huge chunks of the interpreter. It had to cope with objects that were loaded in the VM (so the compiler could see them) but not cached in the interpreter (where the compiled code could see them). Because Shark wasn’t behaving like the other compilers, HotSpot’s compiler support layer would break in all kinds of exciting and imaginative ways. Finally, the method used by the server JIT to optimize interface calls to virtual calls and virtual calls to direct calls could not be used. Aside from the obvious speedup, Shark can only inline direct calls, so reducing virtual and interface calls to direct calls exposes them to the inliner. Calls in Shark have a lot of overhead, so this would have been a big win.

Sometime in May I figured out how to fake inlined object pointers. HotSpot’s compiler interface expects the compiler to generate native code into a CodeBuffer. Shark, of course, uses LLVM, which generates code into a buffer it allocates. Shark had a HotSpot code buffer, but it didn’t do a lot with it. Now, every time Shark has an object pointer to inline, it writes it into the HotSpot code buffer where the garbage collector can see it. The generated code then loads the object pointer from the code buffer whenever it needs it. The pointer is still not inlined – there’s still a load required – but now it’s always only one load. Not a big speedup in itself, but it meant the remaining interpreterisms could be removed, which fixed the support layer breakages and allowed me to copy the interface-virtual-direct call optimization code more or less directly from the server compiler. Everything got a lot more stable, a lot more clean, and a little bit faster in the bargain.

During August I began the (long!) process of preparing Zero for submission to OpenJDK proper. It took some time to get started, but the patch has now gone through a couple of cycles of being reviewed by the HotSpot team: the code has been reformatted, the build system has been almost completely rewritten, and a bunch of other things got changed. It’s still ongoing, but the HotSpot part of the patch seems close to acceptance and the much smaller remainder will hopefully be reviewed soon. I’ve been ramping up my testing with each step: this one bootstrapped and built itself on 32-bit x86, x86_64 and 32-bit PowerPC, and has bootstrapped itself and is in the process of building itself on 64-bit PowerPC and 64-bit zSeries.

Also in August, Ed Nevill released his assembler interpreter for ARM. It replaces part of Zero with hand-crafted assembly language, making OpenJDK 2-8 times faster on that platform.

After the Zero patch is accepted, my next task will be getting Zero certified on 64-bit zSeries. I won’t have a lot of time for Shark until that’s done, but I have one last thing I want to do before I step aside for a couple of months. Xerxes Rånby posted some benchmarks of Zero, Shark, and the assembler interpreter on ARM; Shark is gratifyingly faster than everything on five of the tests, but considerably slower than the assembler interpreter on the other four. I’m not happy with that!

On the tests where it’s slower, Shark is showing very little improvement over Zero, which suggests that these benchmarks are not spending a lot of time interpreting bytecode (which Shark would have compiled and made faster). I suspect these benchmarks are spending a lot of time in JNI calls. Back in February, Ed Nevill posted some profilies he had made to figure out why some interpreter improvements he had made had had very little effect; those profiles seemed to imply that the VM was spending a lot of its time setting up JNI calls. Zero uses libffi for this, and we at Red Hat have long suspected that libffi is slow.

HotSpot’s JITs have the capability to “compile” JNI methods. This sounds odd, as JNI methods are already native code; what’s actually getting compiled is the interface between the JVM and the native JNI code. If Shark could compile JNI methods, whenever HotSpot found a hot JNI method it would be able to replace its generic, one-size-fits-all interface code (using libffi) with an LLVM-generated interface custom built specifically for that method. I’m going to spend a week or so making Shark able to compile these methods, before I descend into zSeries TCK hell…

Syndicated 2009-10-09 16:20:37 from gbenson.net

10 Jun 2009 »

First Shark self-builds

Xerxes Rånby and I simultaneously decided to try building Shark with Shark today… and both worked!

Syndicated 2009-06-10 14:37:10 from gbenson.net

29 May 2009 »

Instrumenting Zero and Shark

Every so often I find myself adding little bits of code to Zero or Shark, to figure out obscure bugs or to see whether working on some optimization or another is worthwhile. I did it again today, and thought I’d write a little tutorial.

The first versions of Shark implemented a lot of things the same way as the interpreter, ie slowly. Since February I’ve been slowly replacing these interpreter-isms with implementations that are more compiler-like, and today there’s only one left: invokeinterface. The reason I left it until last is that it’s the biggest and the ugliest: it’ll no doubt be a pig to do, and quite frankly I don’t really want to do it. To see if I could get away with not bothering with it, I decided to instrument Shark so I could run SPECjvm98 and have it print out the number of times Shark-compiled code executed an invokeinterface for every benchmark.

First, I needed somewhere to store the counter. I decided to put it in the individual thread’s JavaThread objects as they’re easy to get at from both C++ and Shark, and they’re thread-specific so you don’t have to worry about locking.

diff -r 4cc0bc87aef4 ports/hotspot/src/os_cpu/linux_zero/vm/thread_linux_zero.hpp
--- a/ports/hotspot/src/os_cpu/linux_zero/vm/thread_linux_zero.hpp	Fri May 29 12:46:07 2009 +0100
+++ b/ports/hotspot/src/os_cpu/linux_zero/vm/thread_linux_zero.hpp	Fri May 29 14:44:04 2009 +0100
@@ -32,6 +32,25 @@
     _top_zero_frame = NULL;
   }

+ private:
+  int _interface_call_count;
+
+ public:
+  int interface_call_count() const
+  {
+    return _interface_call_count;
+  }
+  void set_interface_call_count(int interface_call_count)
+  {
+    _interface_call_count = interface_call_count;
+  }
+
+ public:
+  static ByteSize interface_call_count_offset()
+  {
+    return byte_offset_of(JavaThread, _interface_call_count);
+  }
+
  public:
   ZeroStack *zero_stack()
   {

So we have the field itself, a getter and setter to access it from C++, and a static method to expose the offset of the field in the thread object to Shark. Next we need to make Shark update the counter:

diff -r 4cc0bc87aef4 ports/hotspot/src/share/vm/shark/sharkTopLevelBlock.cpp
--- a/ports/hotspot/src/share/vm/shark/sharkTopLevelBlock.cpp	Fri May 29 12:46:07 2009 +0100
+++ b/ports/hotspot/src/share/vm/shark/sharkTopLevelBlock.cpp	Fri May 29 14:44:04 2009 +0100
@@ -985,6 +985,16 @@
 // Interpreter-style interface call lookup
 Value* SharkTopLevelBlock::get_interface_callee(SharkValue *receiver)
 {
+  Value *count_addr = builder()->CreateAddressOfStructEntry(
+    thread(),
+    JavaThread::interface_call_count_offset(),
+    PointerType::getUnqual(SharkType::jint_type()));
+  builder()->CreateStore(
+    builder()->CreateAdd(
+      builder()->CreateLoad(count_addr),
+      LLVMValue::jint_constant(1)),
+    count_addr);
+
   SharkConstantPool constants(this);
   Value *cache = constants.cache_entry_at(iter()->get_method_index());

We’re almost ready to add the SPECjvm98-specific bits now, but there’s one thing left. Some of the benchmarks are multithreaded, but we have one counter per thread; we need a way to set and get the counters from all running threads. HotSpot has some code to iterate over all the threads in the VM, but it’s all private to the Threads class. Not to worry though, we’ll just stick it in there:

diff -r 4cc0bc87aef4 openjdk-ecj/hotspot/src/share/vm/runtime/thread.hpp
--- a/openjdk-ecj/hotspot/src/share/vm/runtime/thread.hpp	Fri May 29 12:46:07 2009 +0100
+++ b/openjdk-ecj/hotspot/src/share/vm/runtime/thread.hpp	Fri May 29 14:44:04 2009 +0100
@@ -1669,6 +1669,9 @@
   // Deoptimizes all frames tied to marked nmethods
   static void deoptimized_wrt_marked_nmethods();

+ public:
+  static void reset_interface_call_counts();
+  static int  interface_call_counts_total();
 };

diff -r 4cc0bc87aef4 openjdk-ecj/hotspot/src/share/vm/runtime/thread.cpp
--- a/openjdk-ecj/hotspot/src/share/vm/runtime/thread.cpp	Fri May 29 12:46:07 2009 +0100
+++ b/openjdk-ecj/hotspot/src/share/vm/runtime/thread.cpp	Fri May 29 14:44:04 2009 +0100
@@ -3828,6 +3828,21 @@
   }
 }

+void Threads::reset_interface_call_counts()
+{
+  ALL_JAVA_THREADS(thread) {
+    thread->set_interface_call_count(0);
+  }
+}
+
+int Threads::interface_call_counts_total()
+{
+  int total = 0;
+  ALL_JAVA_THREADS(thread) {
+    total += thread->interface_call_count();
+  }
+  return total;
+}

 // Lifecycle management for TSM ParkEvents.
 // ParkEvents are type-stable (TSM).

Now we’re ready to add some SPECjvm98-specific code. A quick poke around in SPECjvm98 brings up the method spec.harness.ProgramRunner::runOnce as a likely place to hook ourselves in. This will be run by the interpreter — Shark won’t compile it as it’s only called a few times — so we put our code into the C++ interpreter’s normal entry which is the bit that executes bytecode methods:

diff -r 4cc0bc87aef4 ports/hotspot/src/cpu/zero/vm/cppInterpreter_zero.cpp
--- a/ports/hotspot/src/cpu/zero/vm/cppInterpreter_zero.cpp	Fri May 29 12:46:07 2009 +0100
+++ b/ports/hotspot/src/cpu/zero/vm/cppInterpreter_zero.cpp	Fri May 29 14:44:04 2009 +0100
@@ -42,6 +42,26 @@
   JavaThread *thread = (JavaThread *) THREAD;
   ZeroStack *stack = thread->zero_stack();

+  char *benchmark = NULL;
+  {
+    ResourceMark rm;
+    const char *name = method->name_and_sig_as_C_string();
+    if (strstr(name, “spec.harness.ProgramRunner.runOnce(”) == name) {
+      intptr_t *locals = stack->sp() + method->size_of_parameters() - 1;
+      if (LOCALS_INT(5) == 100) {
+        Threads::reset_interface_call_counts();
+
+        name = LOCALS_OBJECT(1)->klass()->klass_part()->name()->as_C_string();
+        const char *limit = name + strlen(name);
+        while (*(–limit) != ‘/’);
+        const char *start = limit;
+        while (*(–start) != ‘/’);
+        start++;
+        benchmark = strndup(start, limit - start);
+      }
+    }
+  }
+
   // Adjust the caller’s stack frame to accomodate any additional
   // local variables we have contiguously with our parameters.
   int extra_locals = method->max_locals() - method->size_of_parameters();
@@ -59,6 +79,12 @@

   // Execute those bytecodes!
   main_loop(0, THREAD);
+
+  if (benchmark) {
+    tty->print_cr(”%s: %d interface calls”,
+                  benchmark, Threads::interface_call_counts_total());
+    free(benchmark);
+  }
 }

 void CppInterpreter::main_loop(int recurse, TRAPS)

This looks a little messy, but what it’s basically doing is spotting calls to spec.harness.ProgramRunner::runOnce and extracting the name of the benchmark from its arguments. It is complicated by the fact that SPECjvm98 intersperses full runs with hidden tenth-speed ones, so we use the speed argument (in LOCALS_INT(5)) to ignore the hidden runs.

Now we’re ready to run the benchmarks and see what happens:

Looks like making invokeinterface faster is worthwhile after all! Now all I have to do is do it ;)

Syndicated 2009-05-29 15:15:04 from gbenson.net

27 May 2009 »

Zero and Shark article

This past month or so I’ve been working on an article about Zero and Shark for java.net. It went live today, so if you fancy a little primer on what Zero and Shark are and how they work then head over there and check it out :)

Syndicated 2009-05-27 16:07:09 from gbenson.net

22 Apr 2009 »

Debugging the C++ interpreter

Every so often I find myself wanting to add debug printing to the C++ interpreter for specific methods. I can never remember how I did it the last time and have to figure it out all over again, so here’s how:

diff -r 4d8381231af6 openjdk-ecj/hotspot/src/share/vm/interpreter/bytecodeInterpreter.cpp
--- a/openjdk-ecj/hotspot/src/share/vm/interpreter/bytecodeInterpreter.cpp  Tue Apr 21 09:50:43 2009 +0100
+++ b/openjdk-ecj/hotspot/src/share/vm/interpreter/bytecodeInterpreter.cpp  Wed Apr 22 11:13:35 2009 +0100
@@ -555,6 +555,15 @@
          topOfStack stack_base(),
          “Stack top out of range”);

+  bool interesting = false;
+  if (istate->msg() != initialize) {
+    ResourceMark rm;
+    if (!strcmp(istate->method()->name_and_sig_as_C_string(),
+                “spec.benchmarks._202_jess.jess.Rete.FindDeffunction(Ljava/lang/String;)Lspec/benchmarks/_202_jess/jess/Deffunction;”)) {
+      interesting = true;
+    }
+  }
+
   switch (istate->msg()) {
     case initialize: {
       if (initialized++) ShouldNotReachHere(); // Only one initialize call

The trick is getting the fully-qualified name of the method right: the method name contains dots, but the class names in its signature contain slashes. You’re there once you have that down.

Syndicated 2009-04-22 10:27:19 from gbenson.net

8 Apr 2009 »

I’m not dead

I haven’t blogged for a while. I’ve been working on Shark’s performance, walking through the native code generated for critical methods and looking at what’s happening. There’s several cases where I can see that some piece of code is unnecessary, but translating that into a way that Shark can see it’s unnecessary is non-trivial. I’m thinking I may need to separate the code generation, adding an intermediate layer between the typeflow and the LLVM IR so I can add things which are maybe necessary and then remove them if not. It all seems a bit convoluted — bytecode → typeflow → new intermediate → LLVM IR → native — but the vast bulk of the Shark’s time is spent in the last step so a bit more overhead to create simpler LLVM IR should speed up compilation as well as the runtime.

None of this has been particularly bloggable, but I wanted to point out two exiting things that are happening in Shark land. Robert Schuster and Xerxes Rånby have been busy getting Shark to run on ARM, and Neale Ferguson has started porting LLVM to zSeries with the intention of getting Shark running there. I expected to see Shark on ARM sooner or later, but Shark on zSeries came completely out of the blue. I’m really looking forward to seeing that happen!

Syndicated 2009-04-08 10:31:46 from gbenson.net

20 Feb 2009 »

Good news and bad news

Bad news first. The drop in speed between the Zeros in IcedTea6 1.3 and 1.4 doesn’t seem to come from Zero itself. I did a build of IcedTea6 1.4 with everything in ports/hotspot/src/*cpu reverted to 1.3, and the speed loss remained. It must be something to do with the newer HotSpot, or some other patch that got added or changed. I don’t really want to spend any more time on this than I have, so we’ll just have to live with it.

I’ve not come to any conclusions as to the difference in speed between the native-layer C++ interpreter and Zero either. It’s not the unaligned access stuff I mentioned: I ran some benchmarks, but the results were ambiguous. It may be libffi, but again, I don’t want to spend more time on this…

The good news is that I’ve been checking the Zero sources for SCA cover, emailing various people, and there’s only one tiny easily-removable bit I’m unsure about. I spent the morning preparing and submitting the first of the patches that will be required, the core build patch, which will hopefully be well received.

Syndicated 2009-02-20 14:53:12 from gbenson.net

18 Feb 2009 (updated 19 Feb 2009 at 15:30 UTC) »

Benchmarks

Advogato can't display this entry, read it here instead.

Syndicated 2009-02-18 17:19:29 from gbenson.net

16 Feb 2009 »

State of the world

Well, in case you missed it, Zero passed the TCK! Specifically, the latest OpenJDK packages in Fedora 10 for 32- and 64-bit PowerPC passed the Java SE 6 TCK and are compatible with the Java SE 6 platform. I’ve been working toward this since sometime in November — the sharp-eyed amongst you may have noticed the steady stream of obscure fixes I’ve been committing — and the final 200 hour marathon finished at 5pm on FOSDEM Saturday, less than 24 hours before my talk. It was pretty stressful, and I took the week off to recover!

Needless to say, none of this could have happened without the rest of the OpenJDK team here at Red Hat getting it to pass on the other platforms. Special thanks must go to Lillian for managing the release. She got the blame for a lot of what went wrong, and it’s only fair she should get the credit for what went right.

Of course, all of this wasn’t just so I’d have something exciting to announce at FOSDEM. In a way it validates the decision we took at Red Hat to focus on Zero rather than using Cacao or another VM. By using as much OpenJDK code as possible — Zero builds are 99% HotSpot — we get as much OpenJDK goodness as possible, including the “correctness” of the code. Zero’s speed can make it a standing joke, but I’d like to use these passes to emphasize that Zero isn’t just a neat hack — it’s production quality code that hasn’t been optimized yet. I’ve written fast code and I’ve written correct code, and in my experience it’s easier in the long run to make correct code fast than it is to make fast code correct. The TCK isn’t everything, naturally, but the fact that it’s possible to pass it using Zero builds gives us a firm foundation for future work.

So, what now for me? Well, in the medium term I want to restart work on Shark, but there’s a couple of things for Zero I want to look at while they’re fresh in my mind. The first is speed. As an interpreter Zero will never be “fast”, but in my FOSDEM slides I used some of Andrew Haley’s benchmarks that show Zero as significantly slower than the template interpreter on x86_64. Furthermore, Robert Schuster mentioned in his talk that the Zero in IcedTea 1.4 was significantly slower than the Zero in IcedTea 1.3. I’m not going to spend a great deal of time on it, but I’d like to do a bit of benchmarking and profiling to check that nothing stupid is happening.

The other thing I want to do for Zero is to get it into upstream HotSpot. This is going to require a lot of non-fun stuff — tidying, a bit of rethinking, and an SCA audit.

Finally, Inside Zero and Shark, the articles I’ve been writing. I didn’t mention it at the time, but I was writing them while my TCK runs were in progress, to keep me sane! I do plan to continue them, but they’ll likely be a little more sporadic now I’m starting the fun stuff again. Watch this space!

Syndicated 2009-02-16 13:34:44 from gbenson.net

2 Feb 2009 »

Inside Zero and Shark: The call stub and the frame manager

Now that we have all that stack stuff out of the way we can get into the core of Zero itself, the replacements for the parts of HotSpot that were originally written in assembly language.

I’ve mentioned already that the bridge from the VM into Java code is the call stub. In Zero, this is the function StubGenerator::call_stub, in stubGenerator_zero.cpp, and its job is really simple. In the previous article I explained how when a method is entered it finds its arguments at the top of the stack, at the end of the previous method’s frame. Well, for the first frame of all there is no previous frame, so the call stub’s job is to create one. If you look in the crash dump I linked in the previous article, you’ll see that right at the bottom of the stack trace is a short frame that’s different from all the others. This is the entry frame, the frame the call stub made. You can see the code that built it in EntryFrame::build, right at the bottom of stubGenerator_zero.cpp.

Once the entry frame is created, the call stub invokes the method by jumping to its entry point. If the method we’re calling hasn’t been JIT compiled then the entry point will be pointing at one of the interpreter’s method entries. These, along with the call stub, are the bits that are written in assembly language in classic HotSpot.

There are several different method entries — Zero has five, classic HotSpot has fourteen! — but most of this is optimization, and you can do pretty much everything with just two: an entry point for normal (bytecode) methods, and an entry for native (JNI) methods. In this article I’m going to talk about the normal entry.

In the C++ interpreter, the normal entry is split into two parts. The larger of the two is the bytecode interpreter. This is written in C++, the function BytecodeInterpreter::run in bytecodeInterpreter.cpp, and it does the bulk of the work. The other part is the frame manager. In non-Zero HotSpot this is written in assembly language, and it handles the various stack manipulations that cannot be performed from within C++. Zero’s frame manager is, of course, written in C++; it’s the function CppInterpreter::normal_entry in cppInterpreter_zero.cpp.

The frame manager performs tasks for the bytecode interpreter, so you might expect the bytecode interpreter call the frame manager wherever it needs to adjust the stack. It can’t work this way, however; the interleaved stack in classic HotSpot means that once you’re inside the bytecode interpreter the bytecode interpreter’s ABI frame lies on top of the stack, blocking any access to the Java frames beneath. To cope with this, the code is essentially written inside out, with the frame manager calling the bytecode interpreter, and the bytecode interpreter returning to the frame manager whenever it needs something done.

The way it works is this. On entering a method, we start off in the frame manager. The frame manager extends the caller’s frame to accomodate any extra locals, then creates a new frame for the callee. The frame manager then calls the bytecode interpreter with a method_entry message.

Now we’re inside the bytecode interpreter, which executes bytecodes one-by-one until it reaches something it cannot handle. Say it arrives at a method call instruction. In classic HotSpot, the bytecode interpreter’s ABI frame is blocking the top of the stack, so if the bytecode interpreter were to handle the call itself the callee wouldn’t be able to extend its caller’s frame to accomodate its extra locals. The frame manager has to to handle this, so the bytecode interpreter returns with a call_method message.

Now we’re back in the frame manager again; the bytecode interpreter’s frame has been removed, and the Java frame at the top of the stack. The arguments to the call were set up by the bytecode interpreter, so all the frame manager has to do is jump to the callee’s entry point. When the callee returns, the frame manager returns control to the bytecode interpreter by calling it with a method_resume message, and the bytecode interpreter continues from where it was when it issued the call_method.

This process is repeated every time a method call is required. Once the bytecode interpreter is finished with a method, it returns to the frame manager with a return_from_method or a throwing_exception message. The frame manager then removes the method’s frame, copies the result into it’s caller’s frame if necessary, and return to its caller.

The frame manager exists because the bytecode interpreter’s frame blocks the stack in classic HotSpot. In Zero, Java frames live on the Zero stack, which is separate from the ABI stack. Why then does Zero need a separate frame manager? The answer is that it doesn’t — it would be perfectly possible to rewrite the bytecode interpreter to stand alone. That, however, is the issue: you’d have to rewrite the bytecode interpreter, making significant modifications to the existing HotSpot code. That runs counter to the design philosophy of Zero, which aims for it to slot into HotSpot with minimal modification. It could be done, but we didn’t do it.

That pretty well sums up the normal entry. Next time I’ll talk about the other essential method entry, the one that handles JNI methods.

Syndicated 2009-02-02 13:49:28 from gbenson.net

238 older entries...