Older blog entries for wingo (starting at number 396)

inside javascriptcore's low-level interpreter

Good day, hackers! And hello to the rest of you, too, though I fear that this article isn't for you. In the vertical inches that follow, we're going to nerd out with JavaScriptCore's new low-level interpreter. So for those of you that are still with me, refill that coffee cup, and get ready for a look into a lovely hack!

hot corn, cold corn

Earlier this year, JavaScriptCore got a new interpreter, the LLInt. ("Low-level interpreter", you see.). As you probably know, JavaScriptCore is the JavaScript implementation of the WebKit project. It's used in web browsers like Safari and Epiphany, and also offers a stable API for embedding in non-web applications.

In this article, we'll walk through some pieces of the LLInt. But first, we need to describe the problem that the LLInt solves.

So, what is the fundamental problem of implementing JavaScript? Of course there are lots of things you could say here, but I'm going to claim that in the context of web browsers, it's the tension between the need to optimize small amounts of hot code, while minimizing overhead on large amounts of cold code.

Web browsers are like little operating systems unto themselves, into which you as a user install and uninstall hundreds of programs a day. The vast majority of code that a web browser sees is only executed once or twice, so for cold code, the name of the game is to do as little work as possible.

For cold code, a good bytecode interpreter can beat even a very fast native compiler. Bytecode can be more compact than executable code, and quicker to produce.

All of this is a long introduction into what the LLInt actually is: it's a better bytecode interpreter for JSC.

context

Before the introduction of the LLInt, the so-called "classic interpreter" was just a C++ function with a bunch of labels. Every kind of bytecode instruction had a corresponding label in the function.

As a hack, in the classic interpreter, bytecode instructions are actually wide enough to hold a pointer. The opcode itself takes up an entire word, and the arguments also take up entire words. It is strange to have such bloated bytecode, but there is one advantage, in that the opcode word can actually store the address of the label that implements the opcode. This is called direct threading, and presumably its effects on the branch prediction table are sufficiently good so as to be a win.

Anyway, it means that Interpreter.cpp has a method that looks like this:

// vPC is pronounced "virtual program counter".
JSValue Interpreter::execute(void **vPC)
{
  goto *vPC;
//...
  // add dst op1 op2: Add two values together.
op_add:
  {
    int dst = (int)vPC[1];
    int op1 = (int)vPC[2];
    int op2 = (int)vPC[3];

    fp[dst] = add(fp[op1], fp[op2]);

    vPC += 4;
    goto *vPC;
  }

  // jmp offset: Unconditional branch.
op_jmp:
  {
    ptrdiff_t offset = (ptrdiff_t)vPC[1];
    vPC += offset;
    goto *vPC;
  }
//...
}

It's a little different in practice, but the essence is there.

OK, so what's the problem? Well, readers who have been with me for a while might recall my article on static single assignment form, used as an internal representation by compilers like GCC and LLVM. One conclusion that I had was that SSA is a great first-order IR, but that its utility for optimizing higher-order languages is less clear. However, this computed-goto thing that I showed above (goto *vPC) is a form of higher-order programming!

Indeed, as the GCC internals manual notes:

Computed jumps contain edges to all labels in the function referenced from the code. All those edges have EDGE_ABNORMAL flag set. The edges used to represent computed jumps often cause compile time performance problems, since functions consisting of many taken labels and many computed jumps may have very dense flow graphs, so these edges need to be handled with special care.

Basically, GCC doesn't do very well at compiling interpreter loops. At -O3, it does get everything into registers on my machine, but it residualizes 54 KB of assembly, whereas I only have 64 KB of L1 instruction cache. Many other machines just have 32 KB of L1 icache, so for those machines, this interpreter is a lose. If I compile with -Os, I get it down to 32 KB, but that's still a tight fit.

Besides that, implementing an interpreter from C++ means that you can't use the native stack frame to track the state of a computation. Instead, you have two stacks: one for the interpreter, and one for the program being interpreted. Keeping them in sync has an overhead. Consider what GCC does for op_jmp at -O3:

op_jmp:
    mov    %rbp,%rax
    ; int currentOffset = vPC - bytecode + 1
    sub    0x58(%rbx),%rax
    sar    $0x3,%rax
    add    $0x1,%eax
    ; callFrame[CurrentOffset] = currentOffset
    mov    %eax,-0x2c(%r11)

    ; ptrdiff_t offset = (ptrdiff_t)vPC[1]
    movslq 0x8(%rbp),%rax

    ; vPC += offset
    lea    0x0(%rbp,%rax,8),%rbp

    ; goto *vPC
    mov    0x0(%rbp),%rcx
    jmpq   *%rcx

First there is this strange prelude, which effectively stores the current vPC into an address on the stack. To be fair to GCC, this prelude is part of the DEFINE_OPCODE macro, and is not an artifact of compilation. Its purpose is to let other parts of the runtime see where a computation is. I tried, but I can't convince myself that it is necessary to do this at the beginning of every instruction, so maybe this is just some badness that should be fixed, if the classic interpreter is worth keeping.

The rest of the opcode is similar to the version that the LLInt produces, as we will see, but it is less compact.

the compiler to make the code you want

The goal of compilation is to produce good code. It can often be a good technique to start from the code you would like to have, and then go backwards and see how to produce it. It's also a useful explanatory technique: we can take a look at the machine code of the LLInt, and use it to put the LLInt's source in context.

In that light, here is the code corresponding to op_jmp, as part of the LLInt:

llint_op_jmp:
    ; offset += bytecode[offset + 1]
    add    0x8(%r10,%rsi,8),%esi
    ; jmp *bytecode[offset]
    jmpq   *(%r10,%rsi,8)

That's it! Just 9 bytes. There is the slight difference that the LLInt moves around an offset instead of a pointer into the bytecode. The whole LLInt is some 14 KB, which fits quite confortably into icache, even on mobile devices.

This assembly code is part of the LLInt as compiled for my platform, x86-64. The source of the LLInt is written in a custom low-level assembly language. Here is the source for the jmp opcode:

_llint_op_jmp:
    traceExecution()
    dispatchInt(8[PB, PC, 8])

You can define macros in this domain-specific language. Here's the definition of dispatchInt:

macro dispatchInt(advance)
    addi advance, PC
    jmp [PB, PC, 8]
end

At this point it's pretty clear how the source corresponds to the assembly. The WebKit build system will produce a version of the LLInt customized for the target platform, and link that interpreter into the JavaScriptCore library. There are backend code generators for ARMv7, X86-32, and X86-64. These interpreters mostly share the same source, modulo the representation differences for JSValue on 32-bit and 64-bit systems.

Incidentally, the "offlineasm" compiler that actually produces the native code is written in Ruby. It's clear proof that Ruby can be fast, as long as you don't load it at runtime ;-)

but wait, there's more

We take for granted that low-level programs should be able to determine how their data is represented. C and C++ are low-level languages, to the extent that they offer this kind of control. But what is not often remarked is that for a virtual machine, its code is also data that needs to be examined at runtime.

I have written before about JSC's optimizing DFG compiler. In order to be able to optimize a computation that is in progress, or to abort an optimization because a speculation failed -- in short, to be able to use OSR to tier up or down -- you need to be able to replace the current stack frame. Well, with the classic interpreter written in C++, you can't do that, because you have no idea what's in your stack frame! In contrast, with the LLInt, JSC has precise control over the stack layout, allowing it to tier up and down directly from the interpreter.

This can obviate the need for the baseline JIT. Currently, WebKitGTK+ and Apple builds still include the baseline, quick-and-dirty JIT, as an intermediate tier between the LLInt and the DFG JIT. I don't have sure details about about long-term plans, but I would speculate that recent work on the DFG JIT by Filip Pizlo and others has had the goal of obsoleting the baseline JIT.

Note that in order to tier directly to the optimizing compiler, you need type information. Building the LLInt with the DFG optimizer enabled causes the interpreter to be instrumented to record value profiles. These profiles record the types of values seen by instructions that load and store values from memory. Unlike V8, which stores this information in executable code as part of the inline caches, in the LLInt these value profiles are in non-executable memory.

Spiritually speaking, I am all for run-time code generation. But it must be admitted that web browsers are a juicy target, and while it is unfair to make safe languages pay for bugs in C++ programs, writable text is an attack surface. Indeed in some environments, JIT compilation is prohibited by the operating system -- and in that sort of environment, a fast interpreter like the LLInt is a big benefit.

There are a few more LLInt advantages, like being able to use the CPU's overflow flags, doing better register allocation, and retaining less garbage (JSC scans the stack conservatively). But the last point I want to mention is memory: an interpreter doesn't have to generate executable code just to display a web page. Obviously, you need to complement the LLInt with an optimizing compiler for hot code, but laziness up front can improve real-world page load times.

in your webs, rendering your kittens

So that's the LLInt: a faster bytecode interpreter for JSC.

I wrote this post because I finally got around to turning on the LLInt in the WebKitGTK+ port a couple weeks ago. That means that in September, when GNOME 3.6 comes out, not only will you get an Epiphany with the one-process-per-tab WebKit2 backend, you also get cheaper and faster JavaScript with the LLInt. Those of you that want the advance experience can try it out with last Monday's WebKitGTK+ 1.9.4 release.

Syndicated 2012-06-27 11:25:30 from wingolog

dltool mines dwarf

This is going to sound like quite a yak-shave, but here goes: I was spending some research time here at Igalia working on a new virtual machine for Guile when I got interested by DWARF, the debugging format used in many UNIX systems (GNU, the BSDs, Mac OS). For various reasons, we needed a new debugging format, and DWARF seemed suitable.

I was going to need tools that would both produce and consume DWARF information, so I started with the parser. Of course there is a ready source of DWARF information on a GNU system, in all of the shared libraries. So I wrote a tool to snarf information out of those libraries. It is suprising how much there is!

dltool

Enough of the strange introduction! Let's try it out:

$ dltool print-one libc wmemchr
(subprogram
  (name "wmemchr")
  (prototyped #t)
  (type (pointer-type
          (byte-size 8)
          (type (named-type-reference (typedef "wchar_t")))))
  (formal-parameter
    (name "s")
    (type (pointer-type
            (byte-size 8)
            (type (const-type
                    (type (named-type-reference (typedef "wchar_t"))))))))
  (formal-parameter
    (name "c")
    (type (named-type-reference (typedef "wchar_t"))))
  (formal-parameter
    (name "n")
    (type (named-type-reference (typedef "size_t")))))

Nifty, no? It even prints the names of the formal parameters. The structure of the printout mirrors the structure of the DWARF information quite closely, so for a detailed reference you should refer to the DWARF standard. Note though that the "type" child of a "subprogram" (procedure, function, etc) denotes its return type. If it's missing, the function does not return a value.

In this case we see that wmemchr returns a pointer to a wchar_t, and we see that the pointer is 8 bytes wide. We don't see the definition of wchar_t though. You can pass a --depth=N to see more information, recursively, but it's most useful to simply have the tool print out a tree of needed declarations, this time with the print-decls command instead of print-one:

$ dltool print-decls libc wmemchr
(typedef
  (name "wchar_t")
  (type (base-type
          (byte-size 4)
          (encoding signed)
          (name "int"))))
(typedef
  (name "size_t")
  (type (base-type
          (byte-size 8)
          (encoding unsigned)
          (name "long unsigned int"))))
(subprogram
  (name "wmemchr")
  ...)

Here the body of the subprogram is the same as before. If I leave off the wmemchr argument, I get all the declarations that are publically exported by libc, and that have debugging information.

locating debug information

In the examples above, I'm just passing the basename of the library. dltool is smart enough to look in the library paths if needed, via parsing /etc/ld.so.conf and $LD_LIBRARY_PATH.

These days it is not so common for a stock GNU distribution to have debugging symbols installed for its libraries. Instead the debugging information is packaged separately, and a .gnu_debuglink section is left in the main binary (library, or executable, or both -- have you ran /lib/ld-linux.so.2 lately?). dltool can deal with that.

The only thing it doesn't handle is compressed debug info (.zdebug_info sections). Hopefully these sections go away at some point though: they prevent debuggers from mapping the debugging information directly, they don't save any distribution cost (packages being zipped already), and have no other runtime impact, as they debuginfo isn't normally loaded.

c++?!?!

Oddly enough, dltool works with C++ too. In this case, I point it at a debugging version of the JavaScriptCore library, looking for a definition of the JSC::JSCell class.

$ dltool print-one --grovel .libs/libjavascriptcoregtk-3.0.so JSC::JSCell
(class-type
  (specification
    (named-type-reference
      (structure-type "JSCell")
      (namespace "JSC")))
  (byte-size 16)
  (member
    (name "TypedArrayStorageType")
    (type (const-type
            (type (named-type-reference
                    (enumeration-type "TypedArrayType")
                    (namespace "JSC")))))
    (declaration #t)
    (const-value 0))
  (member
    (name "m_classInfo")
    (type (pointer-type
            (byte-size 8)
            (type (const-type
                    (type (named-type-reference
                            (structure-type "ClassInfo")
                            (namespace "JSC")))))))
    (data-member-location ((plus-uconst 0)))
    (accessibility private))
 ...)

In this case I used the --grovel option to indicate that I wanted to traverse all the debugging entries to find the one I wanted, instead of looking in the symbol table. That was necessary in this case because type definitions aren't present in a symbol table. The actual result is some 1500 lines long, with member variables and functions, and of course it's not complete as it's just a print-one output.

If the data-member-declaration entry looks a little odd to you, that's because it is odd: it's a sequence of opcodes for a virtual machine that computes locations of values. DWARF specifies a couple of little machines like this. For class members, there's usually only the one plus-uconst opcode, but with bitfields, different opcode sequences are possible.

challenges

Making dltool was quite an interesting hack. Oddly, writing the DWARF parser was the easy part. I had an ELF parser and linker already, and although quite flexible, DWARF is regular and well-specified.

One tricky part was making dltool fast, while not consuming too much memory. There can be a lot of data in a debugging object. For example, the debugging version of libwebkitgtk-3.0.so is bigger than 1GB. In a number of cases, we have to trade off the cost of re-parsing versus the cost of storing parsed data in memory. dltool can indeed handle WebKit, which is something I'm proud of, though printing all the exported declarations takes 2 GB of memory and some three or four minutes. More modest tasks like describing libcairo only take a second or two.

Another tricky part was determining which types are actually the same. In C and C++, typically you compile each file on its own, and then link them together. Each time you compile a file, the compiler loads the headers that the file needs and creates instances of those types, internal to the compiler. The debugging information that the compiler writes can contain as many (or more) definitions of (for example) size_t as there are compilation units that reference it. dltool needs to unify those types that are the same, not only to prevent multiple type declarations, but also to prevent divergence when processing recursive types.

C++'s nested namespaces also presented a problem. DWARF information is organized in trees, though they can also be reference each other by offset. However, there are no up-pointers in the tree nodes. It seems that you need to pre-traverse the node trees to determine the ranges of the namespaces so you can determine the right context for a node at any given offset. C does not have this issue, partly because it is less expressive, and partly due to some DWARF design decisions.

dltool doesn't currently demangle C++ names, for what that's worth.

Oddly, C++ templates were not very much of a problem. They simply create debugging entries with really crazy names, that's all.

"automatic" ffi?

I think dltool is pretty cool. Given an opaque information-ful object, I always want some tool to spread information on a screen in a structured way. So it's an interesting tool for binary spelunkers.

There are some obvious applications, though. One is to use the information that dltool finds to automatically generate bindings from C (or even C++) libraries to your language of choice. You can use a dltool run to generate the bindings at "compile-time", or save it to a file and generate bindings at run-time with node-ffi or ctypes or whatever the kids use these days.

Of course in Guile, besides importing the dlhacks module and creating bindings at runtime, we can get the best situation by running the dltool groveller from within a macro. That way we pull the DWARF groveller into our language, without having the module depend on dlhacks or on the presence of debugging information at runtime.

There are some limitations, though. DWARF is a flexible and extensible format, but it doesn't currently get all of the information you would want in a binding. For example there's no way to tell the C compiler to serialize an association between "struct ClutterActor" and a GType named "ClutterActor". You can't tell it that the return value of wcsdup is actually owned by the caller. You can't annotate pointer arguments to indicate that they are "out" or "in/out" or whatever. But this is not a failing of DWARF: it is a current limitation of our toolchain.

Actually it's also a limitation of C -- I mean, everything you would want to say about a variable or a function or whatever is, loosely speaking, the type of that object. We can't really change C's type system, but we can improve our toolchain by allowing for GCC to recognize and plumb through more relevant __attribute__ annotations.

all right, all ready

Anyway, that's the thing! The tool itself requires Guile 2.0. If you're running Debian, just install guile-2.0, then:

git clone git://gitorious.org/guile-dlhacks/guile-dlhacks.git
cd guile-dlhacks
autoreconf -vif && ./configure && make
./env dltool help

Visit the gitorious page for more, and find me on #guile for more.

Happy spelunking!

Syndicated 2012-06-19 15:55:52 from wingolog

rococo, and then rubble

Why is it that we in GNOME came to view GLib as the lowest level of our stack?

Forget, for the moment, about such small barbarisms as gint32, and even gint. Forget things like gnome-keyring versus GnuPG. Let us even pass over GObject as a whole. Today I had a thought that was new to me. Let's talk about DWARF.

DWARF describes ABIs. It is ubiquitous: it is on GNU systems as well as Mac and BSD, and could be on Windows if we put it there. It's extensible. It is mostly controlled by like-minded free software people.

So why is it that we invented GIR?

Discuss.

Syndicated 2012-06-08 14:31:38 from wingolog

10 years of wingolog

Greetings, friends!

This piece is even more navel-gazing than most, so if that's not your bag, surf on, dear surfer, surf on.

Tomorrow ten years ago I wrote:

What up ya'll. I've been looking at advogato recently, and seeing my co-hackers thomasvs, Uraeus, and hadess posting here all the time made me jealous, so here I am too.

summertime... and the living is easy.

Ten years! There are few things that I have done consistently for ten years, but typing at this end of the intertube appears to be one of them.

Although there is some juice to squeeze out of the history of this thing, I don't think it's interesting enough. So instead in the rest of this essay I'll just say whatever comes to my mind.

on identity

As you might have gathered, I don't like defining myself. I prefer to let what I do give you enough material so that you can conclude what you want. But in the absence of auto-definition, over time I feel that you are defining me.

Let me be concrete. You are mostly technical people, probably computer programmers. You respect what I write to some degree, and that is very gratifying. But let's say I start writing more about politics and social justice. For me it was much easier to write something political five or ten years ago than it is now. Now, I have to think about my readers, and say something that is both well-reasoned and not offensive, because I have the kind of audience for whom technically correct is the best kind of correct.

I don't know what I think about this. Politically I self-identify as an anarchist. (It's both more and less crazy than it sounds.) Politics are more important than what I do in code, but I'm more hesitant to write about such things, because it's not always the case that I can express it adequately. This may be an advantage to us all, but it is stifling as well.

I told you it was going to be navel-gazing ;-)

on you

Here I am, dropping the second-person plural like it's nothing. Au contraire! I do enjoy writing for an audience. Thank you all for reading: those who started with me at Advogato, those more recent subscribers, and those that just click through occasionally.

Thanks especially to people on Planet GNOME for putting up with me over these last few years. The things I do now are only tangentially related to GNOME, but it has been a great help to my writing to know that there were thousands of people reading. If you ever find it too off-topic, file a bug asking the PGO maintainers to restrict the feed to a particular tag or three.

Thanks also to my commenters. I have learned lots from yall. I especially appreciate the pleasant tone that you have. I apologize for not responding to all of the many good points that you make; sometimes the answer is too long, sometimes I have no excuse. In any case, I put my effort into the articles more than the comment threads. At least that way we mostly avoid back-and-forths; there are better media for that.

For the record, although I don't currently moderate comments, I do delete offensive or spammy comments whenever they appear. This has worked well for me.

on advice

I've done OK with this electrorag. If you write a blog, and aspire to such modest success, I have only this advice: write about interesting things. Write conversationally, and succinctly if possible. Write socially, addressing a community of readers. That's all!

on macro, on micro

This is just as good a time as any to note that I opened a Twitter account recently. It goes completely against my instincts. My blog is self-hosted. I wrote the software that runs the blog. I wrote the compiler that compiles the software that runs the blog, and the compiler is self-hosted!

I feel I have to justify myself here. In the end it is pretty simple: I need fresh chatter, without having an inbox. I feel like my news sources are an echo chamber, and I need to break out of it somehow. Twitter seems appropriate for grabbing links and zeitgeist. RSS seems too much like an inbox. I'll end up trying to self-host and do the identi.ca dance at some point, but for now, Twitter it is. If following is your game, @andywingo is my name.

I still have no idea what to type at that ridiculous 140-character box. When it comes to blogging, I don't know how to do micro :)

onwards

I have no idea what the future holds. Maybe it's a bunch of nerdy compiler articles. Maybe it's cat pictures. We'll see. Thanks for accompanying me this far!

summertime... and the living is easy.

Syndicated 2012-05-30 21:45:45 from wingolog

inline cache applications in scheme

The inline cache is a dynamic language implementation technique that originated in Smalltalk 80 and Self, and made well-known by JavaScript implementations. It is fundamental for getting good JavaScript performance.

a cure for acute dynamic dispatch

A short summary of the way inline caches work is that when you see an operation, like x + y, you don't compile in a procedure call to a generic addition subroutine. Instead, you compile a call to a procedure stub: the inline cache (IC). When the IC is first called, it will generate a new procedure specialized to the particular types that flow through that particular call site. On the next call, if the types are the same, control flows directly to the previously computed implementation. Otherwise the process repeats, potentially resulting in a polymorphic inline cache (one with entries for more than one set of types).

An inline cache is called "inline" because it is specific to a particular call site, not to the operation. Also, adaptive optimization can later inline the stub in place of the call site, if that is considered worthwhile.

Inline caches are a win wherever you have dynamic dispatch: named field access in JavaScript, virtual method dispatch in Java, or generic arithmetic -- and here we get to Scheme.

the skeptical schemer

What is the applicability of inline caches to Scheme? The only places you have dynamic dispatch in Scheme are in arithmetic and in ports.

Let's take arithmetic first. Arithmetic operations in Scheme can operate on number of a wide array of types: fixnums, bignums, single-, double-, or multi-precision floating point numbers, complex numbers, rational numbers, etc. Scheme systems are typically compiled ahead-of-time, so in the absence of type information, you always want to inline the fixnum case and call out [of line] for other cases. (Which line is this? The line of flow control: the path traced by a program counter.) But if you end up doing a lot of floating-point math, this decision can cost you. So inline caches can be useful here.

Similarly, port operations like read-char and write can operate on any kind of port. If you are always writing UTF-8 data to a file port, you might want to be able to inline write for UTF-8 strings and file ports, possibly inlining directly to a syscall. It's probably a very small win in most cases, but a win nonetheless.

These little wins did not convince me that it was worthwhile to use ICs in a Scheme implementation, though. In the context of Guile, they're even less applicable than usual, because Guile is a bytecode-interpreted implementation with a self-hosted compiler. ICs work best when implemented as runtime-generated native code. Although it probably will by the end of the year, Guile doesn't generate native code yet. So I was skeptical.

occam's elf

Somehow, through all of this JavaScript implementation work, I managed to forget the biggest use of inline caches in GNU systems. Can you guess?

The PLT!

You may have heard how this works, but if you haven't, you're in for a treat. When you compile a shared library that has a reference to printf, from the C library, the compiler doesn't know where printf will be at runtime. So even in C, that most static of languages, we have a form of dynamic dispatch: a call to an unknown callee.

When the dynamic linker loads a library at runtime, it could resolve all the dynamic references, but instead of doing that, it does something more clever: it doesn't. Instead, the compiler and linker collude to make the call to printf call a stub -- an inline cache. The first time that stub is called, it will resolve the dynamic reference to printf, and replace the stub with an indirect call to the procedure. In this way we trade off a faster loading time for dynamic libraries at the cost of one indirection per call site, for the inline cache. This stub, this inline cache, is sometimes called the PLT entry. You might have seen it in a debugger or a disassembler or something.

I found this when I was writing an ELF linker for Guile's new virtual machine. More on that at some point in the future. ELF is interesting: I find that if I can't generate good code in the ELF format, I'm generating the wrong kind of code. Its idiosyncrasies remind me of what happens at runtime.

lambda: the ultimate inline cache

So, back to Scheme. Good Scheme implementations are careful to have only one way of calling a procedure. Since the only kind of callable object in the Scheme language is generated by the lambda abstraction, Scheme implementations typically produce uniform code for procedure application: load the procedure, prepare the arguments, and go to the procedure's entry point.

However, if you're already eating the cost of dynamic linking -- perhaps via separately compiled Scheme modules -- you might as well join the operations of "load a dynamically-linked procedure" and "go to the procedure's entry point" into a call to an inline cache, as in C shared libraries. In the cold case, the inline cache resolves the dynamic reference, updates the cache, and proceeds with the call. In the hot case, the cache directly dispatches to the call.

One benefit of this approach is that it now becomes cheap to support other kinds of applicable objects. One can make hash tables applicable, if that makes sense. (Clojure folk seem to think that it does.) Another example would be to more efficiently support dynamic programming idioms, like generic functions. Inline caches in Scheme would allow generic functions to have per-call-site caches instead of per-operation caches, which could be a big win.

It seems to me that this dynamic language implementation technique could allow Guile programmers to write different kinds of programs. The code to generate an inline cache could even itself be controlled by a meta-object protocol, so that the user could precisely control application of her objects. The mind boggles, but pleasantly so!

Thanks to Erik Corry for provoking this thought, via a conversation at JSConf EU last year. All blame to me, of course.

as PLT_HULK would say

NOW THAT'S AN APPLICATION OF AN INLINE CACHE! HA! HA HA!

Syndicated 2012-05-29 08:07:39 from wingolog

list of ian lance taylor's linker articles

I was working on an ELF linker in Scheme recently, and wanted to re-read some of Ian Lance Taylor's 20-part series on ELF linkers and linking. In a brief search of the tubes, a list of the articles didn't come up, so perhaps this short and sweet list will make some future web searcher happy.

The list: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

They articles are well-written and fascinating, so don't click unless you have a few hours to burn.

Syndicated 2012-05-23 19:25:35 from wingolog

stranger in these parts

My JSConf 2012 video is out! Check it out:

The talk is called "Stranger in these parts: A hired gun in the JS corral", and in it I talk about my experiences as a Schemer in the implementation world, with a focus on JavaScriptCore, the JS implementation of the WebKit project.

If you want, you can fetch the slides or the notes. If you were unable to play the video in the browser, you can download it directly (25 minutes, ~80 MB, CC-BY-SA).

Special thanks to the A/V team for the fine recording. My talk was the first one that used the wireless mics, and it turned out there was some intermittent interference. They corrected this in later talks by double-miking the speakers. In my case, it was fortunate that they recorded the room as well, and with (I would imagine) a fair amount of post-processing the sound is perfectly intelligible. Cheers!

Finally, there were a number of other interesting talks whose recordings are starting to come out. I especially liked David Nolen's funhouse of ClojureScript and logic programming. It was pleasantly strange to see him mention Peter Norvig's 1992 book, Paradigms of Artificial Intelligence Programming, because I did too, and I think someone else did as well. Three people mentioning a somewhat obscure 20-year-old book, what are the odds?

I also liked Vyacheslav's amusing talk on V8's optimizing compiler. He actually showed some assembler! Folks that read this webrag might find it interesting. Dan Ingalls' talk was fun too. The ending scene was pretty surreal; be sure to watch all the way through.

Thanks again to the JSConf organizers for the invitation to speak. It was a pleasure to get to hang around the lively and energetic JS community. Happy hacking, all!

Syndicated 2012-05-16 11:35:25 from wingolog

doing it wrong: cse in guile

Greetings, readers! It's been a little while, but not because I haven't been doing anything, or nothing worth writing about. No, the reason I haven't written recently is because the perceived range of my ignorance has been growing faster than the domain of my expertise.

Knowledge may well be something one can dominate, but ignorance must forever be a range, stretching off to a hazy horizon. Climbing the hill of JavaScript implementations has let me see farther out on that plain. I think it's only the existence of the unknown unknowns that can let one muster up the arrogance to write at all.

But back to domains and dominators later. To begin, I note that there is very little in the way of correct, current, and universal folk wisdom as to how to implement a programming language. Compiler hackers are priests of their languages, yes, but their temples are in reality more or less isolated cults, in which the values of one's Gods may be unknown or abhorrent to those of others. Witness the attention paid to loop optimizations in some languages, compared to garbage collection in others, or closures in still others.

In my ecumenical capacity as abbot of Guile and adjuct deacon of JavaScriptCore, sometimes I'm tempted to mix rites: to sprinkle the holy water of lexical scope optimizations on JS, and, apropos of this article, to exorcise common subexpressions in Scheme.

As one might well imagine, the rites of one cult must be adapted to circumstances. I implemented CSE for Guile, but I don't know if it was actually a win. In this article I'll go into what CSE is, how it works in Guile, why it probably won't survive in its present form.

cse: common subexpression elimination

I implemented a source-to-source optimization pass in Guile that eliminates common subexpressions. It actually does both more and less than that: it propagates predicates and eliminates effect-free statements as well, and these latter optimizations are why I implemented the pass in the first place.

Let me give an example. Let's imagine we implement a binary tree in Guile, using the records facility.

(use-modules (srfi srfi-9))

(define-record-type btree
  (make-btree elt left right)
  btree?
  (elt btree-elt)
  (left btree-left)
  (right btree-right))

(define *btree-null* #f)

(define (btree-cons head tail)
  (if (btree? tail)
      (let ((elt (btree-elt tail)))
        (if (< elt head)
            (make-btree elt
                        (btree-left tail)
                        (btree-cons head (btree-right tail)))
            (make-btree elt
                        (btree-cons head (btree-left tail))
                        (btree-right tail))))
      (make-btree head
                  *btree-null*
                  *btree-null*)))

That's enough to illustrate my point, I think. We have the data type, the base case, and a constructor. Of course in Haskell or something with better data types it would be much cleaner, but hey, let's roll with this.

If you look at btree-cons, it doesn't seem to be amenable in its current form to classic common subexpression elimination. People don't tend to write duplicated code. You see that I bound the temporary elt instead of typing (btree-head btree) each time, and that was partly because of typing, and partly out of some inner efficiency puritan, telling me I shouldn't write duplicate expressions. (Cult behavior, again!)

But, note that all of these record abstractions will probably inline, instead of calling out to procedures. (They are implemented as inlinable procedures. Yes, it would be better to have cross-module inlining, but we don't, so this is what we do.) In general, syntactic abstraction in Scheme can lead to duplicate code. Apologies in advance for this eyeball-rending torrent, but here's a listing of what btree-cons reduces to, after expansion and partial evaluation:

(define (btree-cons head tail)
  (if (and (struct? tail)
           (eq? (struct-vtable tail) btree))
      (let ((elt (if (eq? (struct-vtable tail) btree)
                     (struct-ref tail 0)
                     (throw 'wrong-type-arg
                            'btree-elt
                            "Wrong type argument: ~S"
                            (list tail)
                            (list tail)))))
        (if (< elt head)
            (let ((left (if (eq? (struct-vtable tail) btree)
                            (struct-ref tail 1)
                            (throw 'wrong-type-arg
                                   'btree-left
                                   "Wrong type argument: ~S"
                                   (list tail)
                                   (list tail))))
                  (right (btree-cons
                           head
                           (if (eq? (struct-vtable tail) btree)
                               (struct-ref tail 2)
                               (throw 'wrong-type-arg
                                      'btree-right
                                      "Wrong type argument: ~S"
                                      (list tail)
                                      (list tail))))))
              (make-struct/no-tail btree elt left right))
            (let ((left (btree-cons
                          head
                          (if (eq? (struct-vtable tail) btree)
                              (struct-ref tail 1)
                              (throw 'wrong-type-arg
                                     'btree-left
                                     "Wrong type argument: ~S"
                                     (list tail)
                                     (list tail)))))
                  (right (if (eq? (struct-vtable tail) btree)
                             (struct-ref tail 2)
                             (throw 'wrong-type-arg
                                    'btree-right
                                    "Wrong type argument: ~S"
                                    (list tail)
                                    (list tail)))))
              (make-struct/no-tail btree elt left right))))
      (let ((left *btree-null*) (right *btree-null*))
        (make-struct/no-tail btree head left right))))

Again, I'm really sorry about that, and it's not just for your eyes: it's also because that's a crapload of code for what should be a simple operation. It's also redundant! There are 6 checks that btree is in fact a btree, when only one is needed semantically. (Note that the null case is not a btree, of course.)

Furthermore, all of the checks in the first arm of the if are redundant. The code above is what the optimizer produces -- which is, you know, turrible.

So, I thought, we could run a pass over the source that tries to propagate predicates, and then tries to fold predicates whose boolean value we already know.

And that's what I did. Here's what Guile's optimizer does with the function, including the CSE pass:

(define (btree-cons head tail)
  (if (and (struct? tail)
           (eq? (struct-vtable tail) btree))
      (let ((elt (struct-ref tail 0)))
        (if (< elt head)
            (let ((left (struct-ref tail 1))
                  (right (btree-cons head (struct-ref tail 2))))
              (make-struct/no-tail btree elt left right))
            (let ((left (btree-cons head (struct-ref tail 1)))
                  (right (struct-ref tail 2)))
              (make-struct/no-tail btree elt left right))))
      (let ((left *btree-null*) (right *btree-null*))
        (make-struct/no-tail btree head left right))))

This is much better. It's quite close to the source program, except the symbolic references like btree-head have been replaced with indexed references. The type check in the predicate of the if expression propagated to all the other type checks, causing those nested if expressions to fold.

Of course, CSE can also propagate bound lexicals:

(let ((y (car x)))
  (car x))
=> (let ((y (car x)))
     y)

This is the classic definition of CSE.

but is it a win?

I should be quite pleased with the results, except that CSE makes Guile's compiler approximately twice as slow. Granted, in the long run, this should be acceptable: code is usually run many more times than it is compiled. But this is a fairly expensive pass, and yet at the same time it's not as good as it could be.

In order to get to the heart of the matter, I need to explain a little about the implementation. CSE is a post-pass, that runs after partial evaluation (peval). I tried to make it a part of peval, as the two optimizations are synergistic -- oh yes, let's revel in that word -- are you feeling it? -- but it was too complicated in the end. The reason is that in functions like this:

(define (explode btree)
  (unless (btree? btree)
    (error "not a btree" btree))
  (values (btree-head btree)
          (btree-left btree)
          (btree-right btree)))

Here we have a sequence of two expressions. Since the first one bails out if the predicate is false, we should propagate a true predicate past the first expression. This means that running CSE on an expression returns two values: the rewritten expression, given the predicates already seen; and a new set of predicates that the expression asserts. We should be able to use these new assertions to elide the type checks in the second expression. And indeed, Guile can do this.

Perhaps you can see the size of the problem now. CSE is a pass that runs over all code, building up an ordered set of expressions that were evaluated, and in what context. When it sees a new expression in a test context -- as the predicate in an if -- it checks to see if the set contains that expression (or its negation) already, in test context, and if so tries to fold the expression to true or false. Already doing this set lookup and traversal is expensive -- at least N log N in the total size of the program, with some quadratic components in the case an expression is found, and also with high constant factors due to the need for custom hash and equality predicates.

The quadratic factor comes in when walking the set to see if the elimination is valid. Consider:

(if (car x)
    (if (car x) 10 20)
    30)

Here, we should be able to eliminate the second (car x), folding to (if (car x) 10 30). However, in this one:

(if (car x)
    (begin
      (y)
      (if (car x) 10 20))
    30)

If we don't know what (y) does, then we can't fold the second test, because perhaps (y) will change the contents of the pair, x. The information that allows us to make these decisions is effects analysis. For the purposes of Guile's optimizer, (car x) has two dependencies and can cause two effects: it depends on the contents of a mutable value, and on the value of a toplevel (x), and can cause the effect of an unbound variable error when resolving the toplevel, or a type error when accessing its car. Two expressions commute if neither depends on effects that the other causes.

I stole the idea of doing a coarse effects analysis, and representing it as bits in a small integer, from V8. Guile's version is here: effects.scm. The ordered set is a form of global value numbering. See the CSE pass here: cse.scm.

The commute test is fairly cheap, but the set traversal is currently a bit expensive.

and for what?

As I have indicated, the pass does do something useful on real programs, as in the binary tree example. But it does not do all it could, and it's difficult to fix that, for a few reasons.

Unlike traditional CSE, Guile's version of it is interprocedural. Instead of operating on just one basic block or one function, it operates across nested functions as well. However, only some dependencies can commute across a function boundary. For example:

(lambda (x)
  (if (pair? x)
      (let ((y (car x)))
        (lambda ()
          (and (pair? x) (car x))))))

Can the first pair? test propagate to the second expression? It can, because pair? does not depend on the values of mutable data, or indeed on any effect. If it's true once, it will always be true.

But can we replace the second (car x) with y? No, because (car x) has a dependency on mutable data, and because we don't do escape analysis on the closure, we don't let those dependencies commute across a procedure boundary. (In this case, even if we did escape analysis, we'd have the same conclusion.)

However, not all lambda abstractions are closures. Some of them might end up being compiled to labels in the function. Scheme uses syntactically recursive procedures to implement loops, after all. But Guile's CSE does poorly for lambda expressions that are actually labels. The reason is that lexical scope is not a dominator tree.

MLton hacker Stephen Weeks says it better than I do:

Thinking of it another way, both CPS and SSA require that variable definitions dominate uses. The difference is that using CPS as an IL requires that all transformations provide a proof of dominance in the form of the nesting, while SSA doesn't. Now, if a CPS transformation doesn't do too much rewriting, then the partial dominance information that it had from the input tree is sufficient for the output tree. Hence tree splicing works fine. However, sometimes it is not sufficient.

As a concrete example, consider common-subexpression elimination. Suppose we have a common subexpression x = e that dominates an expression y = e in a function. In CPS, if y = e happens to be within the scope of x = e, then we are fine and can rewrite it to y = x. If however, y = e is not within the scope of x, then either we have to do massive tree rewriting (essentially making the syntax tree closer to the dominator tree) or skip the optimization. Another way out is to simply use the syntax tree as an approximation to the dominator tree for common-subexpression elimination, but then you miss some optimization opportunities. On the other hand, with SSA, you simply compute the dominator tree, and can always replace y = e with y = x, without having to worry about providing a proof in the output that x dominates y. (i.e. without putting y in the scope of x)

[MLton-devel] CPS vs SSA

See my article on SSA and CPS for more context.

So that's one large source of lost CSE opportunities, especially in loops.

Another large source of lost opportunities is that the Tree-IL language, which is basically a macro-expanded Scheme, has the same property that Scheme does, that the order of evaluation of operands is unspecified.

Consider the base-case clause of my btree-cons above:

(let ((left *btree-null*) (right *btree-null*))
  (make-struct/no-tail btree head left right))

Now, *btree-null* is a toplevel lookup, that might have an unbound-variable effect. We should be able to eliminate one of them, though. Why doesn't the CSE pass do it? Because in Tree-IL, the order of evaluation of the right-hand-sides of left and right is unspecified. This gives Guile some useful freedoms, but little information for CSE.

This is an instance of a more general problem, that Tree-IL might be too high-level to be useful for CSE. For example, at runtime, not all lexical references are the same -- some are local, and some reference free variables. For mutated free variables, the variable is itself in a box, so to reference it you would load the box into a local and then dereference the box. CSE should allow you to eliminate duplicate loads of the box, even in the case that it can't eliminate duplicate references into the box.

conclusion

It is nice to be able to eliminate the duplicate checks, but not at any price. Currently the bootstrapping time cost is a bit irritating. I have other ideas on how to fix that, but ultimately we probably need to re-implement CSE at some lower level. More on that in a future post. Until then, happy hacking.

Syndicated 2012-05-14 17:07:14 from wingolog

the merry month of ma

or, from the department of self-inflicted injuries

Recently I saw a bunch of errors in my server logs. People were asking for pages on my web site, but only if they were newer than Thu, 08 Ma 2012 22:44:59 GMT. "Ma"? What kind of a month is that? The internets have so many crazy things.

On further investigation, it seemed this was just a case of garbage in, garbage out; my intertube was busted. I was the one returning a Last-Modified with that date. It was invalid, but client software sent it back with the conditional request.

Thinking more on this, though, and on the well-known last-modified hack in which that field can be used as an unblockable cookie, I think I have to share some blame with the clients again.

So, clients using at least Apple-PubSub/65.28, BottomFeeder/4.1, NetNewsWire, SimplePie, Vienna, and Windows-RSS-Platform/2.0 should ask the people that implement their RSS software to only pass a Last-Modified date if it's really a valid date. Implementors of the NetVibes and worio.com bots should also take a look at their HTTP stacks. I don't guess that there's much that you can do with an etag though, for better or for worse.

Previously, in a related department.

Syndicated 2012-03-12 20:27:53 from wingolog

an in-depth look at the performance of guile's web server

What ho, ladies! And what ho, gentlemen! The hack is on and apace. Today, the topic is performance: of Guile and of its web server, in microseconds and kiloinstructions. Brew up a pot of tea; this is a long article.

the problem

I have been poking at Guile's web server recently. To recap, Guile is an implementation of Scheme. It is byte-compiled, and has a set of runtime libraries written in C and, increasingly, in Scheme itself.

Guile includes some modules for dealing with the web: HTTP, clients, servers, and such. It's all written in Scheme. It runs this blog you're reading right now.

To be precise, this web log is served by tekuti, an application written on top of Guile's web server; and actually, in this case there is Apache in front of everything right now, so you're not talking directly to Guile. Perhaps that will change at some point.

But anyway, a few months ago, this blog was really slow to access. That turned out to be mostly due to a bug in tekuti, the blog application, in which generating the "related links" for a post would always end up invoking git. (The database for the blog is implemented in git.) Spawning another process was slow. Fellow hacker and tekuti user Aleix Conchillo FlaquƩ fixed the problem a year ago, but it took a while for it me to finally review and merge it.

So then, at that point, things were tolerable, but I had already contracted the performance bug, so I went on to spend a couple months drilling down, optimizing Guile and its web server -- the layers below tekuti.

10K requests per second: achievement unlocked!

After a lot of tweaking, to the compiler, runtime, HTTP libraries, and to the VM, Guile can now serve over 10000 requests per second on a simple "Hello world" benchmark. This is out of the box, so to speak, if the master branch in git were a box. You just check out Guile, build it using the normal autoreconf -vif & & ./configure && make dance, and then run the example uninstalled:

$ meta/guile examples/web/hello.scm

In another terminal, you can connect directly to the port to see what it does. Paste the first paragraph of this and press return, and you should see the second part:

$ nc localhost 8080
GET / HTTP/1.0
Host: localhost:8080
User-Agent: ApacheBench/2.3
Accept: */*

HTTP/1.0 200 OK
Content-Length: 13
Content-Type: text/plain;charset=utf-8

Hello, World!

I should say a little more about what the server does, and what the test application is. It reads the request, and parses all the headers to native Scheme data types. This is strictly more work than is needed for a simple "pong" benchmark, but it's really useful for applications to have all of the headers parsed for you already. It also filters out bad requests.

Guile then passes the request and request body to a handler, which returns the response and body. This example's handler is very simple:

(define (handler request body)
  (values '((content-type . (text/plain)))
          "Hello, World!"))

It uses a shorthand, that instead of building a proper response object, it just returns an association-list of headers along with the body as a string, and relies on the web server's sanitize-response to produce a response object and encode the body as a bytevector. Again, this is more work for the server, but it's a nice convenience.

Finally, the server writes the response and body to the client, and either closes the port, as in this case for HTTP/1.0 with no keep-alive, or moves the client back to the poll set if we have a persistent connection. The reads and writes are synchronous (blocking), and the web server runs in one thread. I'll discuss this a bit more later.

You can then use ApacheBench to test it out:

$ ab -n 100000 -c100 http://localhost:8080/
Server Software:        
Server Hostname:        localhost
Server Port:            8080

Document Path:          /
Document Length:        13 bytes

Concurrency Level:      100
Time taken for tests:   9.631 seconds
Complete requests:      100000
Failed requests:        0
Write errors:           0
Total transferred:      9200000 bytes
HTML transferred:       1300000 bytes
Requests per second:    10383.03 [#/sec] (mean)
Time per request:       9.631 [ms] (mean)
Time per request:       0.096 [ms] (mean, across all concurrent requests)
Transfer rate:          932.85 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       3
Processing:     1   10   1.7      9      20
Waiting:        1   10   1.7      9      20
Total:          3   10   1.7      9      20

Percentage of the requests served within a certain time (ms)
  50%      9
  66%      9
  75%     12
  80%     12
  90%     12
  95%     12
  98%     13
  99%     13
 100%     20 (longest request)

I give you the whole printout because I find it interesting. There's no huge GC pause anywhere. This laptop has an Intel i7-2620M 2.70GHz CPU. It was bought last year, so it's recent but not fancy. There are two cores, and so part of one core is being used by ApacheBench, and the whole other one used by Guile.

Of course, this is a flawed benchmark. If you really care about this sort of thing, Mark Nottingham wrote a nice article on HTTP benchmarking last year that shows all the ways in which this test is wrong.

However, flawed though it is, this test serves my purposes: to understand the overhead that Guile and its web server imposes on a web application.

So in that light, I'd like to take apart this test and try to understand its performance. I'll look at it from three directions: bottom-up (using low-level profiling), top-down (using a Scheme profiler), and transverse (profiling the garbage collector).

bottom-up: clock cycles, instructions retired

The best way to measure the performance of an application is with a statistical profiler. Statistical profiling samples what's really happening, without perturbing the performance characteristics of your application.

On GNU/Linux systems, we have a few options. None are really easy to use, however. Perf comes the closest. You just run perf record -g guile examples/web/hello.scm, and it records its information. Then you run perf report and dive into the details through the mostly pleasant curses application.

However, for some reason perf could not capture the call graphs, in my tests. My machine is x86-64, which does not include frame pointers in its call frames, so perhaps it was a naive stack-walking algorithm. The associated DWARF information does include the necessary stack-walking data.

Anyway, at least you can get a good handle on what individual functions are hot, and indeed what source lines and instructions take the most time. These lowest-level statistical profilers typically sample based on number of clock cycles, so they correspond to real time very well.

So much for measurement. For actually understanding and improving performance, I find valgrind's callgrind tool much more useful. You run it like this:

valgrind --tool=callgrind --num-callers=20 /path/to/guile examples/web/http.scm

Valgrind can record the execution of your program as it runs, tracking every call and every instruction that is executed. You can then explore the resulting log with the kcachegrind graphical tool. It's the best thing there is for exploring low-level execution of your program.

Now, valgrind is not a statistical profiler. It slows down and distorts the execution of your program, as you would imagine. Beyond those caveats, though, you have to keep two things in mind: firstly, that a count of instructions executed does not correspond directly to clock cycles spent. There are memory latencies, cache misses, branch mispredictions, and a whole host of randomnesses that can affect your runtime. Secondly, valgrind only records the user-space behavior of your program. I'll have more to say on that later.

If I run the "hello.scm" benchmark under valgrind, and hit it with a hundred thousand connections, I can get a pretty good idea what the aggregate behavior of my program is. But I can do better than big-picture, for this kind of test. Given that this test does the same thing a hundred thousand times, I can use valgrind's accurate call and instruction counts to give me precise information on how much a single request takes.

So, looking at the total instruction and call counts, and dividing by the number of requests (100K), I see that when Guile handles one request, the algorithmic breakdown is as follows:

instructions per request percentage of total procedure calls per request
250K 100 (total) -
134K 52.5 bytecode interpretation -
47K 18.5 port buffer allocation 1
18K 7.2 display 1
7.6K 3 allocation within VM (closures, pairs, etc) 83
7.2K 2.84 read-line 5
3.6K 1.40 substring-downcase 3
3.2K 1.24 string->symbol 4
3.1K 1.23 accept 1
2.7K 1.06 substring 11
2.6K 1.03 string-index 8
2.5K 1.00 hashq-ref 14
23K 9 other primitives < 1% each -

Guile's ports can be buffered, like C's stdio FILE* streams, and the web server does turn on buffering. The 18.5% of the time spent allocating port buffers seems like a ripe place for optimization, but digging into it, almost all of the time within those routines is spent in the garbage collector, and most of that marking the heap. Switching to a generational collector could help here, but I'm not sure how much, given that port buffers are probably 4 KB each for input and output, and thus might not fit into a young generation. Marking from more threads at once could help -- more on that in some future essay.

There are some primitives that can be optimized as well, but with the VM taking up 52% of the runtime, and 23% for allocation and the garbage collector, Amdahl's law is against us: making the primitives twice as fast would result only a 15% improvement in throughput.

Turning Amdahl's argument around, we can predict the effect of native compilation on throughput. If Guile 2.2 comes out with native compilation, as it might, and that makes Scheme code run 5 times faster (say), then the 50% of the instructions that are currently in the VM might drop to 10% -- leading to an expected throughput improvement of 67%.

top-down: scheme-level profiling with statprof

But what is going on in the VM? For that, I need a Scheme profiler. Fortunately, Guile comes with one, accessible at the REPL:

$ guile --no-debug
> ,profile (load "examples/web/hello.scm")
%     cumulative   self             
time   seconds     seconds      name
 15.42      1.55      1.55  close-port
 11.61      1.17      1.17  %after-gc-thunk
  6.20      1.65      0.62  setvbuf
  4.29      0.43      0.43  display
  3.34      0.35      0.34  accept
  2.86      0.35      0.29  call-with-error-handling
  2.38      0.24      0.24  hashq-ref
  2.23      9.16      0.22  with-default-trap-handler
  2.07      0.67      0.21  build-response
  1.75      1.17      0.18  sanitize-response
...
Sample count: 629
Total time: 10.055736686 seconds (1.123272351 seconds in GC)

(I use the --no-debug argument to avoid some per-VM-instruction overhead imposed by running Guile interactively; see VM Hooks for more.)

Here I get a really strange result. How is close-port taking all the time? It's implemented as a primitive, not in Scheme, and Valgrind only thought it took 0.40% of the instructions. How is that?

To answer this, we need to remember a couple things. First of all, we recall that Guile's statistical profiler uses setitimer to get signals delivered periodically, after some amount of time spent on the program's behalf, including time spent in the kernel. Valgrind doesn't account for system time. So here we are seeing that close-port is indeed taking time, specifically to flush out the buffered writes. The time is really being spent in the write system call.

So this is good! We know now that we should perhaps look at tuning the kernel to buffer our writes better.

We can also use the profiling data counts to break down the time spent in serving one request from a high level. For example, http-read handles traversing the poll set and accepting connections, and tail-calls read-request to actually read the request. Looking at the cumulative times in the chart tells me that out of each request, the time spent breaks down like this:

microseconds per request procedure
100 (total)
29 poll and accept
16 reading request
0 request handler
12 sanitize-response
8 writing response headers
0 writing response body
15 close-port
20 other

Of course, since this is statistical, there is some uncertainty about the whole thing. Still, it seems sensible enough.

transverse: who is doing all the allocating?

Often when you go to optimize a Scheme program, you find that it's spending a fair amount of time in garbage collection. How do you optimize that? The rookie answer to this is to try to patch the collector to allocate faster, or less frequently, or something. Veterans know that the solution to GC woes is usually just to allocate less. But how do you know what is allocating? GC is a fairly transverse cost, in that it can charge the good parts of your program for the expenses of the bad.

Guile uses the Boehm-Demers-Weiser conservative collector. For what it is, it's pretty good. However its stock configuration does not provide very much insight into the allocation patterns of your program. One approximation that can be made, though, is that the parts of the program that cause garbage collection to run are the parts that are allocating the most.

Based on this insight, Guile includes a statistical profiler that samples when the garbage collector runs. One thing to consider is that GC probably doesn't run very often, so for gcprof tests, one might need to run the test for longer. In this example, I increased the load to a million requests.

$ guile --no-debug
> (use-modules (statprof))
> (gcprof (lambda () (load "examples/web/hello.scm")))
%     cumulative   self             
time   seconds     seconds      name
 86.55     82.75     82.75  setvbuf
  9.37      8.96      8.96  accept
  1.70      1.63      1.63  call-with-error-handling
  1.31      1.25      1.25  %read-line
  0.29      0.28      0.28  substring
  0.24      0.23      0.23  string-downcase
  0.10     91.80      0.09  http-read
  0.10      0.09      0.09  parse-param-list
  0.10      0.09      0.09  write-response
---
Sample count: 2059
Total time: 95.611532937 seconds (9.371730448 seconds in GC)

Here we confirm the result that we saw in the low-level profile: that the setvbuf Scheme procedure, which can cause Guile to allocate buffers for ports, is the primary allocator in this test.

Another interesting question to answer is, "how much are we allocating, anyway?" Using the statistics REPL command, I can see that the 100K requests entailed a total allocation of about 1.35 GB, which divides out to 13.5 KB per request. That sounds reasonable: about 4 KB each for the read and write buffers, and some 4 KB of various other allocation: strings, pairs, the final bytevector for output, etc.

The test incurred 220 stop-the-world garbage collections. So, about 1 out of 450 (0.2%) of requests trigger a GC. The average GC time seems to be about 5 ms (1100 ms / 220 times). That squares fairly well with our last-percentile ApacheBench results.

The total heap size is modest: 14 megabytes. It does not leak memory, thankfully. If I run mem_usage.py on it, I get:

Mapped memory:
               Shared            Private
           Clean    Dirty    Clean    Dirty
    r-xp    1236        0     1028        0  -- Code
    rw-p      24        0       28      160  -- Data
    r--p      60        0     1580      112  -- Read-only data
    ---p       0        0        0        0
    r--s      24        0        0        0
   total    1344        0     2636      272
Anonymous memory:
               Shared            Private
           Clean    Dirty    Clean    Dirty
    r-xp       4        0        0        0
    rw-p       0        0        0    11652  -- Data (malloc, mmap)
    ---p       0        0        0        0
   total       4        0        0    11652
   ----------------------------------------
   total    1348        0     2636    11924

Native ahead-of-time compilation would allow for more shareable, read-only memory. Still, as it is, this seems acceptable.

a quick look at more dynamic tests

That's all well and good, you say, but it's a fairly static test, right?

For that I have a couple of data points. One is the simple SXML debugging test in examples/test/debug-sxml.scm, which simply spits back the headers that it receives, wrapped in an HTML table. The values are printed as the Scheme object that they parse to. Currently, I have a version of this script running on my site. (You can see the headers that Apache adds on, there.)

Testing it as before tells me that Guile can serve 6000 of these pages per second, on my laptop. That's pretty respectable for an entirely dynamic page that hasn't been optimized at all. You can search around the net for comparable tests in other languages; I think you'll find Guile's performance to be very good.

The other point to mention is Tekuti itself. Tekuti includes a caching layer in the application, so after the first request, it's not really a dynamic test. It does check to make sure its caches are fresh on every round, though, by checking the value of the refs/heads/master ref. But still, it is a test of pushing a lot of data; for my test, the page is 50 KB, and Guile still reaches 5700 requests per second on this one core, serving 280 megabytes per second of... well, of my blather, really. But it's a powerful blather-pipe!

related & future work

Here's a similarly flawed test from a year ago of static servers, serving small static files over localhost. Flawed, but it's a lot like this "hello world" test in semantics. We see that nginx gets up to about 20K requests per second, per core. It also does so with a flat memory profile, which is nice.

Guile's bundled web server is currently single-threaded and blocking, which does not make it a good frontend server. There's room for a project to build a proper web server on top of Guile, I think, but I probably won't do it myself. In the meantime though, I do want to offer the possibility for the built-in web server to be multithreaded, with some number of I/O threads and some more limited number of compute threads. I've been testing out some code in that direction -- in fact, this server is running that code -- but as yet Guile's synchronization primitives have too much overhead for it to be a real win. There's more runtime and compiler work to do here.

As far as web servers written in safe languages, it would be remiss to not mention Warp, a Haskell web server. Again, their tests effectively utilize multiple cores, but it seems that 20K per core is the standard.

Unlike Haskell, Guile lacks a proper event manager. I'm not sure whether to work on such a thing; Havoc seems to think it's necessary, and who am I to argue.

Finally, I mention a benchmark of python WSGI servers from a couple years ago. (Is March the month of benchmarks?) The python performance is notably worse; hopefully PyPy has improved things in the meantime. On the other hand, GEvent's use of greenlets is really nice, and makes me envious.

conclusions

Hey, you read the thing! Congratulations to you! Good thing you didn't just skip down to the end :)

If I have a message to send, it's this: that you should consider using Guile to be perfectly acceptable for implementing dynamic web applications with high performance requirements.

It's a modest point, I know. There are all kinds of trade-offs here, but hey, Guile is plucky and still a little bit shy, but would love it if you to ask it to the hack.

If it works for you, boast about it to your colleagues! And if it doesn't, let us know, over at the usual places (guile-user@gnu.org, and #guile on freenode). Happy hacking!

Syndicated 2012-03-08 22:44:59 from wingolog

387 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!