Mono is a JIT compiler and as such it compiles a method only when needed: the moment the execution flow requires the method to execute. This mode of execution greatly improves the startup time of applications and is implemented with a simple trick: when a method call is compiled, the generated native code can't transfer execution to the callee's native code address, because the callee hasn't been compiled yet. Instead the call goes through a magic trampoline: this chunk of code knows which method is going to be executed, so it compiles the method and jumps to the generated code.
The way the trampoline knows which method to compile is pretty simple: for each method a small specific trampoline is created that will pass the pointer to the method to execute to the real worker, the magic trampoline.
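The two-level scheme above can be sketched as a toy model in Python. This is only an illustration, not Mono's code: `jit_compile`, `magic_trampoline` and `make_specific_trampoline` are hypothetical names, and "native code" is just a Python callable. Unlike the real runtime, this model keeps routing every call through the stub.

```python
# Toy model of lazy JIT compilation through trampolines.
# All names are hypothetical; this is not how Mono is implemented.

compiled = {}        # method name -> "native code" (a Python callable)
compile_log = []     # records each compilation, in order


def jit_compile(name, body):
    """Stand-in for the JIT: 'compiles' a method and records that it did."""
    compile_log.append(name)
    return body


def magic_trampoline(name, body, *args):
    """The shared worker: compile the method if needed, then jump to it."""
    if name not in compiled:
        compiled[name] = jit_compile(name, body)
    return compiled[name](*args)


def make_specific_trampoline(name, body):
    """A tiny per-method stub that only knows which method to hand over."""
    def specific_trampoline(*args):
        return magic_trampoline(name, body, *args)
    return specific_trampoline


# Creating the trampoline compiles nothing.
square = make_specific_trampoline("square", lambda x: x * x)
assert compile_log == []

# The first call goes through the magic trampoline and triggers compilation.
assert square(7) == 49
assert compile_log == ["square"]

# A second call reuses the already compiled code.
assert square(3) == 9
assert compile_log == ["square"]
```

The point of the split is that the per-method stub carries only one piece of information (which method), so it can stay tiny, while all the real work lives in a single shared routine.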
Different architectures implement this trampoline in different ways, but each aims to keep it as small as possible: many trampolines are generated, so together they use quite a bit of memory.
Mono in svn has quite a few improvements in this area compared to Mono 1.2.5, which was released just a few weeks ago. I'll try to detail the major changes below.
The first change is related to how the memory for the specific trampolines is allocated: this is executable memory, so it is not allocated with malloc but with a custom allocator called the Mono Code Manager. Since the code manager is used primarily for methods, it allocates chunks of memory that are aligned to multiples of 8 or 16 bytes depending on the architecture: this allows the CPU to fetch the instructions faster. But the specific trampolines are not performance critical (we'll spend lots of time JITting the method anyway), so they can tolerate a smaller alignment. Considering that most trampolines are allocated one after the other and that on most architectures they are 10 or 12 bytes long, this change alone saved about 25% of the memory used (they used to be aligned up to 16 bytes).
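The 25% figure can be checked with a bit of arithmetic. The sketch below assumes the new alignment is 4 bytes, which the post does not state; `align_up` and `footprint` are hypothetical helpers and the count of 1000 trampolines is arbitrary.

```python
def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def footprint(count, size, alignment):
    """Bytes used by `count` trampolines allocated one after the other."""
    return count * align_up(size, alignment)


# Hypothetical workload: 1000 trampolines of 12 bytes each.
old = footprint(1000, 12, 16)   # each trampoline padded to 16 bytes
new = footprint(1000, 12, 4)    # assumed smaller alignment of 4 bytes
saving = (old - new) / old

assert old == 16000
assert new == 12000
assert saving == 0.25

# Under the same assumption a 10-byte trampoline also occupies 12 bytes.
assert align_up(10, 4) == 12
```

With 4-byte alignment both the 10-byte and the 12-byte trampolines end up occupying 12 bytes instead of 16, which matches the quoted saving of about 25%.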
To give a rough idea of how many trampolines are generated, here are a few examples:
- MonoDevelop startup creates about 21 thousand trampolines
- IronPython 2.0 running a benchmark creates about 17 thousand trampolines
- a "hello, world" style program creates about 800
So reducing the size of the trampolines is great, but it's really not possible to reduce them much further in size, if at all. The next step is trying just not to create them.
There are two primary ways a trampoline is generated: a direct call to the method is made, or a virtual table slot is filled with a trampoline for the case when the method is invoked using a virtual call. I'll note here that in both cases, after compiling the method, the magic trampoline makes the needed changes so that the trampoline is not executed again and execution goes directly to the newly compiled code. In the direct call case the callsite is changed so that the branch or call instruction transfers control to the new address. In the virtual call case the magic trampoline changes the virtual table slot directly.
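The direct-call case can be modeled as a call site whose branch target gets rewritten after the first call. This is a rough sketch with hypothetical names (`CallSite`, `make_trampoline`); a real runtime patches machine instructions, not a Python attribute.

```python
# Toy model of call-site patching. Hypothetical names; not Mono's code.

class CallSite:
    """A direct call instruction, modeled as a mutable branch target."""

    def __init__(self, target):
        self.target = target

    def call(self, *args):
        # The target receives the call site so a trampoline can patch it.
        return self.target(self, *args)


def make_trampoline(body, log):
    """Build a trampoline that compiles `body` and patches the caller."""
    def trampoline(site, *args):
        log.append("compile")              # stand-in for JIT compilation

        def compiled(_site, *a):           # the "native code"
            return body(*a)

        site.target = compiled             # patch the call instruction
        return compiled(site, *args)
    return trampoline


log = []
site = CallSite(make_trampoline(lambda x: x + 1, log))

assert site.call(1) == 2       # goes through the trampoline, compiles, patches
assert site.call(5) == 6       # goes straight to the compiled code
assert log == ["compile"]      # the trampoline ran exactly once
```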
The sequence of instructions used by the JIT to implement a virtual call is well known, and the magic trampoline (inspecting the registers and the code sequence) can easily get the virtual table slot that was used for the invocation. The idea here then is: if we know the virtual table slot, we also know the method that is supposed to be compiled and executed, since each vtable slot is assigned a unique method by the class loader. This simple fact allows us to use a completely generic trampoline in the virtual table slots, avoiding the creation of many method-specific trampolines.
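Here is a toy model of the idea, with hypothetical names (`GENERIC`, `VTable`, `call_virtual`). In the real runtime the slot is recovered by inspecting registers and the call sequence; in this sketch it is simply passed in as an argument.

```python
# Toy model of a generic vtable trampoline. Hypothetical names throughout.

class GenericTrampoline:
    """Marker type for the single trampoline shared by all vtable slots."""


GENERIC = GenericTrampoline()


class VTable:
    def __init__(self, slot_methods):
        # Class-loader metadata: slot index -> (method name, method body).
        self.slot_methods = slot_methods
        # Every slot starts out pointing at the same generic trampoline,
        # so no per-method trampoline is allocated.
        self.slots = [GENERIC] * len(slot_methods)
        self.compiled = []

    def call_virtual(self, slot, *args):
        target = self.slots[slot]
        if target is GENERIC:
            # Knowing the slot is enough to find the method: look it up,
            # "compile" it, and patch the slot so later calls go direct.
            name, body = self.slot_methods[slot]
            self.compiled.append(name)
            self.slots[slot] = body
            target = body
        return target(*args)


vt = VTable([("ToString", lambda: "obj"), ("GetHashCode", lambda: 42)])

assert vt.call_virtual(1) == 42
assert vt.compiled == ["GetHashCode"]
assert vt.slots[0] is GENERIC          # untouched slot still shares the stub

assert vt.call_virtual(1) == 42        # second call skips the trampoline
assert vt.compiled == ["GetHashCode"]
```

However many slots a vtable has, the model allocates exactly one trampoline object, which is where the memory saving comes from.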
In the cases above, the number of generated trampolines goes from 21000 to 7700 for MonoDevelop (saving 160 KB of memory), from 17000 to 5400 for the IronPython case and from 800 to 150 for the hello world case.
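As a sanity check on these numbers, the snippet below recomputes the savings assuming 12 bytes per trampoline (an assumption taken from the "10 or 12 bytes" mentioned earlier; the actual size depends on the architecture). Only the MonoDevelop figure is stated in the post; the other two are simply derived under the same assumption.

```python
TRAMPOLINE_BYTES = 12  # assumed size of one specific trampoline


def saved_bytes(before, after, size=TRAMPOLINE_BYTES):
    """Memory no longer spent on trampolines that are never created."""
    return (before - after) * size


# Trampoline counts quoted in the post: (before, after).
counts = {
    "MonoDevelop": (21000, 7700),
    "IronPython": (17000, 5400),
    "hello world": (800, 150),
}

savings = {name: saved_bytes(b, a) for name, (b, a) in counts.items()}

assert savings["MonoDevelop"] == 159600   # roughly the 160 KB quoted
assert savings["IronPython"] == 139200
assert savings["hello world"] == 7800

for name, saved in savings.items():
    print(f"{name}: about {saved / 1000:.1f} KB saved")
```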
I'll describe more optimizations (both already committed and forthcoming) in the next blog posts.