15 Oct 2007 lupus   » (Master)

Memory savings with magic trampolines in Mono

Mono is a JIT compiler and as such it compiles a method only when needed: the moment the execution flow requires the method to execute. This mode of execution greatly improves startup time of applications and is implemented with a simple trick: when a method call is compiled, the generated native code can't transfer execution to the callee's native code address, because the callee hasn't been compiled yet. Instead the call goes through a magic trampoline: this chunk of code knows which method is going to be executed, so it compiles it and jumps to the generated code.

The way the trampoline knows which method to compile is pretty simple: for each method a small specific trampoline is created that will pass the pointer to the method to execute to the real worker, the magic trampoline.
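The split between the specific trampolines and the magic trampoline can be sketched in plain C. This is only a simulation under stated assumptions: the names `Method`, `SpecificTrampoline`, `magic_trampoline` and `call_through` are hypothetical and are not Mono's, and "compiling" is faked by storing an existing function pointer where a real JIT would emit native code.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*NativeCode)(int);

/* Hypothetical stand-in for a method known to the runtime. */
typedef struct Method {
    const char *name;
    NativeCode  compiled;      /* NULL until the method has been "compiled" */
    NativeCode  body;          /* stands in for the code a JIT would generate */
    int         compile_count; /* how many times the fake JIT ran */
} Method;

/* The shared worker: compile the method on first use, return its code. */
NativeCode magic_trampoline(Method *m)
{
    if (m->compiled == NULL) {
        m->compiled = m->body; /* a real JIT would emit native code here */
        m->compile_count++;
    }
    return m->compiled;
}

/* A specific trampoline carries nothing but the method pointer. In real
 * native code it would be a few bytes that load this pointer and jump to
 * the magic trampoline. */
typedef struct SpecificTrampoline {
    Method *method;
} SpecificTrampoline;

/* What a call through a specific trampoline amounts to. */
int call_through(SpecificTrampoline *t, int arg)
{
    NativeCode code = magic_trampoline(t->method);
    return code(arg);
}

int twice(int x) { return 2 * x; }

/* Calls one method twice through its trampoline and returns how many
 * times it was compiled. */
int demo_lazy_compile(void)
{
    Method m = { "twice", NULL, twice, 0 };
    SpecificTrampoline t = { &m };
    int a = call_through(&t, 21);
    int b = call_through(&t, 4);
    assert(a == 42 && b == 8);
    return m.compile_count;
}
```

The sketch leaves out the patching step: here every call still goes through the trampoline, and only the compilation is skipped the second time.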

Different architectures implement the specific trampoline in different ways, but each tries to keep it as small as possible: many trampolines are generated, so together they use quite a bit of memory.

Mono in svn has quite a few improvements in this area compared to Mono 1.2.5, which was released just a few weeks ago. I'll try to detail the major changes below.

The first change is related to how the memory for the specific trampolines is allocated: this is executable memory, so it is not allocated with malloc but with a custom allocator called the Mono Code Manager. Since the code manager is used primarily for methods, it allocates chunks of memory that are aligned to multiples of 8 or 16 bytes depending on the architecture: this allows the CPU to fetch the instructions faster. But the specific trampolines are not performance critical (we'll spend lots of time JITting the method anyway), so they can tolerate a smaller alignment. Considering that most trampolines are allocated one after the other and that on most architectures they are 10 or 12 bytes long, this change alone saved about 25% of the memory used (they used to be aligned up to 16 bytes).

To give a rough idea of how many trampolines are generated I'll give a few examples:

  • MonoDevelop startup creates about 21 thousand trampolines
  • IronPython 2.0 running a benchmark creates about 17 thousand trampolines
  • a "hello, world" style program creates about 800

In the first case this change saved more than 80 KB of memory (plus about the same again, because reviewing the code also allowed me to fix a related overallocation issue).
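The arithmetic behind these figures can be checked with a small C sketch. Two inputs are assumptions, not facts from the code base: a 12-byte trampoline (the text says 10 or 12 bytes on most architectures) and a new alignment of 4 bytes (the text only says "smaller"). The helper names are made up for illustration.

```c
#include <stddef.h>

/* Round size up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t size, size_t align)
{
    return (size + align - 1) & ~(align - 1);
}

/* Bytes used by `count` trampolines of `tramp_size` bytes each, allocated
 * one after the other at the given alignment. */
size_t trampoline_bytes(size_t count, size_t tramp_size, size_t align)
{
    return count * align_up(tramp_size, align);
}

/* Bytes saved by moving from old_align to new_align. */
size_t alignment_saving(size_t count, size_t tramp_size,
                        size_t old_align, size_t new_align)
{
    return trampoline_bytes(count, tramp_size, old_align)
         - trampoline_bytes(count, tramp_size, new_align);
}
```

With these assumed inputs, `alignment_saving(21000, 12, 16, 4)` gives 84000 bytes out of 336000, which is 25% and a little over 80 KB, in line with the MonoDevelop numbers.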

So reducing the size of the trampolines is great, but it's really not possible to reduce them much further in size, if at all. The next step is trying not to create them in the first place.

There are two primary ways a trampoline is generated: a direct call to the method is made, or a virtual table slot is filled with a trampoline for the case when the method is invoked using a virtual call. I'll note here that in both cases, after compiling the method, the magic trampoline makes the needed changes so that the trampoline is not executed again and execution goes directly to the newly compiled code. In the direct call case the callsite is changed so that the branch or call instruction transfers control to the new address. In the virtual call case the magic trampoline changes the virtual table slot directly.

The sequence of instructions used by the JIT to implement a virtual call is well known, and the magic trampoline (inspecting the registers and the code sequence) can easily find the virtual table slot that was used for the invocation. The idea then is: if we know the virtual table slot, we also know the method that is supposed to be compiled and executed, since each vtable slot is assigned a unique method by the class loader. This simple fact allows us to use a completely generic trampoline in the virtual table slots, avoiding the creation of many method-specific trampolines.
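A minimal C simulation of the generic vtable trampoline idea follows. It rests on assumptions: the `Class` layout, the `slot_method` table standing in for the class loader's slot-to-method assignment, and all function names are hypothetical. In particular, the slot index is passed as an explicit argument here, whereas the real magic trampoline recovers it by inspecting registers and the call sequence.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Class Class;
typedef int (*Code)(Class *klass, int slot, int arg);

struct Class {
    Code vtable[4];      /* what a virtual call jumps through */
    Code slot_method[4]; /* loader metadata: the one method assigned to each slot */
    int  compiles;       /* how many methods were "compiled" so far */
};

/* One trampoline shared by every vtable slot: map the slot to its method,
 * "compile" it, patch the slot, then run the method. */
int generic_vtable_trampoline(Class *klass, int slot, int arg)
{
    Code compiled = klass->slot_method[slot]; /* a real JIT would compile here */
    klass->compiles++;
    klass->vtable[slot] = compiled;           /* later calls skip the trampoline */
    return compiled(klass, slot, arg);
}

/* A virtual call: an indirect call through the vtable slot. */
int virtual_call(Class *klass, int slot, int arg)
{
    return klass->vtable[slot](klass, slot, arg);
}

int add_one(Class *k, int slot, int arg) { (void)k; (void)slot; return arg + 1; }
int negate(Class *k, int slot, int arg)  { (void)k; (void)slot; return -arg; }

/* Exercises two slots and returns the number of compilations performed. */
int demo_generic_trampoline(void)
{
    Class k = {
        { generic_vtable_trampoline, generic_vtable_trampoline },
        { add_one, negate },
        0
    };
    assert(virtual_call(&k, 0, 41) == 42); /* compiles slot 0 and patches it */
    assert(k.vtable[0] == add_one);
    assert(virtual_call(&k, 0, 1) == 2);   /* goes straight to add_one */
    assert(virtual_call(&k, 1, 5) == -5);  /* compiles slot 1 */
    return k.compiles;
}
```

Both slots start out pointing at the same function, so no per-method trampoline is needed for them, which is the source of the saving described above.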

In the cases above, the number of generated trampolines goes from 21000 to 7700 for MonoDevelop (saving 160 KB of memory), from 17000 to 5400 for the IronPython case and from 800 to 150 for the hello world case.

I'll describe more optimizations (both already committed and forthcoming) in the next blog posts.
