Older blog entries for wingo (starting at number 450)

concurrent ml versus go

Peoples! Lately I've been navigating the guile-ship through waters unknown. This post is something of an echolocation to figure out where the hell this ship is and where it should go.

Concretely, I have been working on getting a nice lightweight concurrency system rolling for Guile. I'll write more about that later, but you can think of it as being modelled on Go, though built as a library. (I had previously described it as "Erlang-like", but that's just not accurate.)

Earlier this year at Curry On this topic was burning in my mind and of course when I saw the language-hacker fam there I had to bend their ears. My targets: Matthew Flatt, the amazing boundary-crossing engineer, hacker, teacher, researcher, and implementor of Racket, and Matthias Felleisen, the godfather of the PLT research family. I saw them sitting together and I thought, you know what, what can they have to say to each other? These people have been talking together for 30 years right? Surely they are actually waiting for some ignorant dude to saunter up to the PL genius bar, right?

So saunter I do, saying, "if someone says to you that they want to build a server that will handle 100K or so simultaneous connections on Racket, what abstraction do you tell them to use? Racket threads?" Apparently: yes. A definitive yes, in the case of Matthias, with a pointer to Robby Findler's paper on kill-safe abstractions; and still a yes from Matthew with the caveat that for the concrete level of concurrency that I described, you'd have to run tests. More fundamentally, I was advised to look at Concurrent ML (on which Racket's concurrency facilities were based), that CML was much better put together than many modern variants like Go.

This was very interesting and new to me. As y'all probably know, I don't have a formal background in programming languages, and although I've read a lot of literature, reading things only makes you aware of the growing dimension of the not-yet-read. Concurrent ML was even beyond my not-yet-read horizon.

So I went back and read a bunch of papers. Turns out Concurrent ML is like Lisp in that it has a tribe and a tightly-clutched history and a diaspora that reimplements it in whatever language they happen to be working in at the moment. Kinda cool, and, um... a bit hard to appreciate in the current-day context when the only good references are papers from 10 or 20 years ago.

However, after reading a bunch of John Reppy papers, here is my understanding of what Concurrent ML is. I welcome corrections; surely I am getting this wrong.

1. CML is like Go, composed of channels and goroutines. (Forgive the modern referent; I assume most folks know Go at this point.)

2. Unlike Go, in CML a channel is never buffered. To make a buffered channel in CML, you spawn a thread that manages a buffer between two channels.

3. Message send and receive operations in CML are built on a lower-level primitive called "events". (send ch x) is instead euivalent to (sync (send-event ch x)). It's like an event is the derivative of a message send with respect to time, or something.

4. Events can be combined and transformed using the choose and wrap combinators.

5. Doing a sync on an event created by choose allows a user to build select in "user-space", as a library. Cool stuff. So this is what events are for.

6. There are separate event type implementations for timeouts, channel send/recv blocking operations, file descriptor blocking operations, syscalls, thread joins, and the like. These are supported by the CML implementation.

7. The early implementations of Concurrent ML were concurrent but not parallel; they did not run multiple "goroutines" on separate CPU cores at the same time. It was only in like 2009 that people started to do CML in parallel. I do not know if this late parallelism has a practical impact on the viability of CML.

ok go

What is the relationship of CML to Go? Specifically, is CML more expressive than Go? (I assume the reverse is not the case, but that would also be an interesting result!)

There are a few languages that only allow you to select over message receives (not sends), but Go's select doesn't have this limitation, so that's not a differentiator.

Some people say that it's nice to have events as the common denominator, but I don't get this argument. If the only event under consideration is message send or receive over a channel, events + choose + sync is the same in expressive power as a built-in select, as far as I can see. If there are other events, then your runtime already has to support them either way, and something like (let ((ch (make-channel))) (spawn-fiber (lambda () (put-message ch exp))) (get-message ch)) should be sufficient for any runtime-supported event in exp, like sleeps or timeouts or thread joins or whatever.

To me it seems like Go has made the right choices here. I do not see the difference, and that's why I wrote all this, is to be shown the error of my ways. Choosing channels, send, receive, and select as the primitives seems to have the same power as SML events.

Let this post be a pentagram on the floor, then, to summon the CML cognoscenti. Well-actuallies are very welcome; hit me up in the comments!

Syndicated 2016-09-20 21:33:41 from wingolog

a simple (local) solution to the pay gap

International Working Women's Day was earlier this month, a day that reminds the world how far it has yet to go to achieve just treatment of women in the workplace. Obviously there are many fronts on which to fight to dismantle patriarchy, and also cissexism, and also transphobia, and also racism, and sometimes it gets a bit overwhelming just to think of a world where people treat each other right.

Against this backdrop, it's surprising that some policies are rarely mentioned by people working on social change. This article is about one of them -- a simple local change that can eliminate the pay gap across all axes of unfair privilege.


OK here it is: just pay everyone in a company the same hourly wage.

That's it!

on simple, on easy

But, you say, that's impossible!

Rich Hickey has this famous talk where he describes one thing as simple and the other as easy. In his narrative, simple is good but hard, and easy is bad but, you know, easy. I enjoy this talk because it's easy (hah!) to just call one thing simple and the other easy and it's codewords for good and bad, and you come across as having the facile prestidigitatory wisdom of a Malcolm Gladwell.

As far as simple, the substance of equal pay is as simple as it gets. And as far as practical implementation goes, it only needs buy-in from one person: your boss could do it tomorrow.

But, you say, a real business would never do this! This is getting closer to the real issues, but not there yet. There are plenty of instances of real businesses that do this. Incidentally, mine is one of them! I do not intend this to be an advertisement for my company, but I have to mention this early because society does its best to implant inside our brains the ideas that certain ideas are possible and certain others are not.

But, you say, this would be terrible for business! Here I think we are almost there. There's a question underneath, if we can manage to phrase it in a more scientific way -- I mean, the ideal sense in which science is a practice of humankind in which we use our limited powers to seek truth, with hypotheses but without prejudice. It might sound a bit pompous to invoke capital-S Science here, but I think few conversations of this kind try to honestly even consider existence proofs in the form of already-existing histories (like the company I work for), much less an unbiased study of the implications of modelling the future on those histories.

Let's assume that you and I want to work for justice, and in this more perfect world, men and women and nonbinary people will have equal pay for equal work, as will all people that lie on all axes of privilege that currently operate in society. If you are with me up to here: great. If not, we don't share a premise so it's not much use to go farther. You can probably skip to the next article in your reading list.

So, then, the questions: first of all, would a flat equal wage within a company actually help people in marginalized groups? What changes would happen to a company if it enacted a flat wage tomorrow? What are its limitations? How could this change come about?

would it help?

Let's take the most basic question first. How would this measure affect people in marginalized groups?

Let us assume that salaries are distributed inversely: the higher salaries are made by fewer people. A lower salary corresponds to more people. So firstly, we are in a situation where the median salary is less than the mean: that if we switched to pay everyone the mean, then most people would see an increase in their salary.

Assuming that marginalized people were evenly placed in a company, that would mean that most would benefit. But we know that is not the case: "marginalized" is the operative term. People are categorized at a lower point than their abilities; people's climb of the organizational hierarchy (and to higher salaries) is hindered by harassment, by undervalued diversity work, and by external structural factors, like institutionalized racism or the burden of having to go through a gender transition. So probably, even if a company touts equal pay within job classifications, the job classifications themselves unfairly put marginalized people lower than white dudes like me. So, proportionally marginalized people would benefit from an equal wage more than most.

Already this plan is looking pretty good: more money going to marginalized people is a necessary step to bootstrap a more just world.

All that said, many (but not most) people from marginalized groups will earn more than the mean. What for them? Some will decide that paying for a more just company as a whole is worth a salary reduction. (Incidentally, this applies to everyone: everyone has their price for justice. It might be 0.1%, it might be 5%, it might be 50%.)

Some, though, will decide it is not worth paying. They will go work elsewhere, probably for even more money (changing jobs being the best general way to advance your salary). I don't blame marginalized folks for getting all they can: more power to them.

From what I can tell, things are looking especially good for marginalized people under a local equal-wage initiative. Not perfect, not in all cases, but generally better.

won't someone think of the dudes

I don't believe in value as a zero-sum proposition: there are many ways in which a more fair world could be more productive, too. But in the short term, a balance sheet must balance. Salary increases in the bottom will come from salary decreases from the top, and the dudebro is top in tech.

We should first note that many and possibly most white men will see their wages increase under a flat-wage scheme, as most people earn below the mean.

Secondly, some men will be willing to pay for justice in the form of equal pay for equal work. An eloquent sales pitch as to what they are buying will help.

Some men would like to pay but have other obligations that a "mean" salary just can't even. Welp, there are lots of jobs out there. We'll give you a glowing recommendation :)

Finally there will be dudes that are fine with the pay gap. Maybe they have some sort of techno-libertarian justification? Oh well! They will find other jobs. As someone who cares about justice, you don't really want to work with these people anyway. Call it "bad culture fit", and treat it as a great policy to improve the composition of your organization.

an aside: what are we here for anyway?

A frequent objection to workplace change comes in the form of a pandering explanation of what companies are for, that corporations are legally obligated to always proceed along the the most profitable path.

I always find it extraordinarily ignorant to hear this parroted by people in tech: it's literally part of the CS canon to learn about the limitations of hill-climbing as an optimization strategy. But on the other hand, I do understand; the power of just-so neoliberal narrative is immense, filling your mind with pat explanations, cooling off your brain into a poorly annealed solid mass.

The funny thing about corporate determinism that it's not even true. Folks who say this have rarely run companies, otherwise they should know better. Loads of corporate decisions are made with a most tenuous link on profitability, and some that probably even go against the profit interest. It's always easy to go in a known-profitable direction, but that doesn't mean it's the only way to go, nor that all the profitable directions are known.

Sometimes this question is framed in the language of "what MyDesignCo really cares about is good design; we're worried about how this measure might affect our output". I respect this question more, because it's more materialist (you can actually answer the question!), but I disagree with the premise. I don't think any company really cares about the product in a significant way. Take the design company as an example. What do you want on your tombstone: "She made good advertisements"??? Don't get me wrong, I like my craft, and I enjoy practicing it with my colleagues. But if on my tombstone they wrote "He worked for justice", and also if there were a heaven, I would be p OK with that. What I'm saying is, you start a company, you have an initial idea, you pivot, whatever, it doesn't matter in the end. What matters is you relationship with life on the planet, and that is the criteria you should use to evaluate what you do.

Beyond all that -- it's amazing how much wrong you can wrap up in a snarky hacker news one-liner -- beyond all that, the concern begs the question by assuming that a flat-wage arrangement is less profitable. People will mention any down-side they can but never an up-side.

possible flat-wage up-sides from a corporate perspective

With that in mind, let's consider some ways that a flat wage can actually improve the commercial fate of a company.

A company with a flat wage already has a marketing point that they can use to attract people that care about this sort of thing. It can make your company stand out from the crowd and attract good people.

The people you attract will know you're doing the flat-wage thing, and so will be predisposed to want to work together. This can increase productivity. It also eliminates some material sources of conflict between different roles in an organization. You would still need "human resources" people but they would need to spend less time on mitigating the natural money-based conflicts that exist in other organizations.

Another positive side relates to the ability of the company to make collective sacrifices. For example a company that is going through harder times can collectively decide not to raise wages or even to lower them, rather than fire people. Obviously this outcome depends on the degree to which people feel responsible for the organization, which is incomplete without a feeling of collective self-management as in a cooperative, but even in a hierarchical organization these effects can be felt.

Incidentally a feeling of "investment" in the organization is another plus. When you work in a company in which compensation depends on random factors that you can't see, you always wonder if you're being cheated out of your true value. If everyone is being paid the same you know that everyone's interest in improving company revenue is aligned with their own salary interest -- you can't gain by screwing someone else over.

limitations of a flat wage at improving justice

All that said, paying all workers/partners/employees the same hourly wage is not a panacea for justice. It won't dismantle patriarchy overnight. It won't stop domestic violence, and it won't stop the cops from killing people of color. It won't stop microagressions or harassment in the workplace, and in some ways if there are feelings of resentment, it could even exacerbate them. It won't arrest attrition of marginalized people from the tech industry, and it won't fix hiring. Enacting the policy in a company won't fix the industry as a whole, even if all companies enacted it, as you would still have different wages at different companies. It won't fix the situation outside of the tech industry; a particularly egregious example being that in almost all places, cleaning staff are hired via subcontracts and not as employees. And finally, it won't resolve class conflict at work: the owner still owns. There are still pressures on the owner to keep the whole balance sheet secret, even if the human resources side of things is transparent.

All that said, these are mainly ways in which an equal wage policy is incomplete. A step in the right direction, on a justice level, but incomplete. In practice though the objections you get will be less related to justice and more commercial in nature. Let's take a look at some of them.

commercial challenges to a flat wage

Having everyone paid the same makes it extraordinarily difficult to hire people that are used to being paid on commission, like sales people. Sales people drive Rolexes and wear Mercedes. It is very, very tough to hire good sales people on salary. At my work we have had some limited success hiring, and some success growing technical folks into sales roles, but this compensation package will hinder your efforts to build and/or keep your sales team.

On the other hand, having the same compensation between sales and engineering does eliminate some of the usual sales-vs-product conflicts of interest.

Another point it that if you institute a flat-wage policy, you will expect to lose some fraction of your highly-skilled workers, as many of these are more highly paid. There are again some mitigations but it's still a reality. Perhaps more perniciously, you will have greater difficulties hiring senior people: you literally can't get into a bidding war with a competitor over a potential hire.

On the flip side, a flat salary can make it difficult to hire more junior positions. There are many theories here but I think that a company is healthy when it has a mix of experiences, that senior folks and junior folks bring different things to the table. But if your flat wage is higher than the standard junior wage, then your potential junior hires are now competing against more senior people -- internally it will be hard to keep a balance between different experiences.

Indeed junior workers that you already have are now competing at their wage level with potential hires that might be more qualified in some way. An unscrupulous management could fire those junior staff members and replace them with more senior candidates. An equal wage policy does not solve internal class conflicts; you need to have equal ownership and some form of workplace democracy for that.

You could sort people into pay grades, but in many ways this would formalize injustice. Marginalized people are by definition not equally distributed across pay grades.

Having a flat wage also removes a standard form of motivation, that your wage is always rising as you get older. It could be that after 5 years in a job, maybe your wages went up because the company's revenues went up, but they're still the same as a new hire's -- how do you feel about that? It's a tough question. I think an ever-rising wage has a lot of negative aspects, including decreasing the employability of older workers, but it's deeply rooted in tech culture at least.

Another point is motivation of people within the same cadre. Some people are motivated by bonuses, by performing relatively well compared to their peers. This wouldn't be an option in an organization with a purely flat wage. Does it matter? I do not know.

work with me tho

As the prophet Pratchett said, "against one perfect moment, the centuries beat in vain". There are some definite advantages to a flat wage within a company: it's concrete, it can be immediately enacted, it solves some immediate problems in a local way. Its commercial impact is unclear, but the force of narrative can bowl over many concerns in that department: what's important is to do the right thing. Everybody knows that!

As far as implementation, I see three-and-a-half ways this could happen in a company.

The first is that equal pay could be a founding principle of the company. This was mostly the case in the company I work for (and operate, and co-own equally with the other 40 or so partners). I wasn't a founder of the company, and the precise set of principles and policies has changed over the 15 years of the company's life, but it's more obvious for this arrangement to continue from a beginning than to change from the normal pay situation.

The second is, the change could come from the top down. Some CEOs get random brain waves and this happens. In this case, the change is super-easy to make: you proclaim the thing and it's done. As a person who has had to deal with cash-flow and payroll and balance sheets, I can tell you that this considerably simplifies HR from a management perspective.

The third is via collective action. This only works if workers are able to organize and can be convinced to be interested in justice in this specific way. In some companies, a worker's body might simply be able to negotiate this with management -- e.g., we try it out for 6 months and see. In most others you'd probably need to unionize and strike.

Finally, if this practice were more wider-spread in a sector, it could be that it just becomes "best practice" in some way -- that company management could be shamed into doing it, or it could just be the way things are done.


Many of these points are probably best enacted in the context of a worker-owned cooperative, where you can do away with the worker-owner conflict at the same time. But still, they are worth thinking of in a broader context, and worth evaluating in the degree to which they work for (or against) justice in the workplace. But enough blathering from me today :) Happy hacking!

Syndicated 2016-03-24 21:49:19 from wingolog

a lambda is not (necessarily) a closure


Greets, folks! Check it out: Guile had a whole track devoted to it at FOSDEM this year. OK, so it was only half a day, but there were like a dozen talks! And the room was full all the morning! And -- get this -- I had nothing to do with its organization! I think we can credit the Guix project with the recent surge of interest in Guile; fully half the talks were from people excited about using Guix to solve their problems. Thanks very, very much to Pjotr Prins for organizing the lovely event.

I gave a talk on how the Guile 2.2 compiler and virtual machine could change the way people program. Happily, the video recording came out OK! Video below (or here if that doesn't work), and slides here.

The time was super-limited though and I wasn't able to go into the detail that I'd like. So, dear readers, here we are, with a deeper look on lambda representation in Guile.

a lambda is not (necessarily) a closure

What is this?

(lambda (a b) (+ a b))

If you answer, "it's a lambda expression", you're right! You're also right if you say it's a function -- I mean, lambda makes a function, right? There are lots of things that you could say that would be right, including silly things like "twenty-two characters set in an awkward typeface".

But if you said "it's a closure" -- well you're right in general I guess, like on a semantic what-does-it-mean level, but as far as how Guile represents this thing at run-time, hoo boy are there a number of possibilities, and a closure is just one of them. This article dives into the possibilities, with the goal being to help you update your mental model of "how much do things cost".

In Guile, a lambda expression can be one of the following things at run-time:

  1. Gone

  2. Inlined

  3. Contified

  4. Code pointer

  5. Closure

Let's look into these one-by-one.

lambda: gone

If Guile can prove that a lambda expression is never reached, it won't be present at run-time. The main way this happens is via partial evaluation, but later passes can do this too. In the most basic example, consider the lambda bound to f by this let expression.

(let ((f (lambda ()

Guile has an ,optimize command that can be run at the REPL to show the effect of partial evaluation on your code. These days it's a bit out of date in a way -- it can't show what CPS-based optimization will do to your code -- but for our purposes here it will transform the expression to the following code:

(let ((f (lambda ()
=> 42

So the lambda is gone, big whoop. The interesting thing though is that this happens concurrently with other things that partial evaluation does, so the lambda goes away in this expression too:

(let ((launch? #f)
      (f (lambda ()
  (if launch? (f) 'just-kidding))
=> 'just-kidding

lambda: inlined

The other trick that partial evaluation can do with lambda expressions is inlining. Re-taking the example above, if we change launch? to #t, the branch folds the other way and the application (f) inlines:

(let ((launch? #t)
      (f (lambda ()
  (if launch? (f) 'just-kidding))
=> (let ((launch? #t)
         (f (lambda ()
     (if #t (f) 'just-kidding))
=> (let ((launch? #t)
         (f (lambda ()
=> (let ((launch? #t)
         (f (lambda ()
     ((lambda () (launch-the-missiles!))))
=> (let ((launch? #t)
         (f (lambda ()
=> (launch-the-missiles!)

Here again the lambda is gone, but not because it was unreachable, but because it was inlined into its use. I showed some intermediate steps as well, just so you get a feel about how partial evaluation works. The inlining step is illustrated by the fourth transformation, where the lambda application went away, replaced by its body.

Partial evaluation can also unroll many kinds of recursion:

(letrec ((lp (lambda (n)
               (if (zero? n)
                   (+ n (lp (1- n)))))))
  (lp 5))
=> 15

The partial evaluator in Guile 2.2 is more or less unchanged from the one in Guile 2.0, so you get these benefits on old Guile as well. Building a good intuition as to what the partial evaluator will do is important if you want to get the best performance out of Guile. Use the ,optimize command at the REPL to see the effects of partial evaluation on any given expression.

lambda: contified

So, here we step into the unknown, in the sense that from here on out, these optimizations are new in Guile 2.2. Unfortunately, they can be hard to see as they aren't really representable in terms of source-to-source transformations over Scheme programs. Consider this program:

(define (count-down n)
  (define loop
    (lambda (n out)
      (let ((out (cons n out)))
        (if (zero? n)
            (loop (1- n) out)))))
  (loop n '()))

It's a little loop that builds a list of integers. The lambda in this loop, bound to loop, will be contified into the body of count-down.

To see that this is the case, we have to use a new tool, ,disassemble (abbreviated ,x). This takes a procedure and prints its bytecode. It can be hard to understand, so I'm going to just point out some "shapes" of disassembly that you can recognize.

> ,x count-down
Disassembly of #<procedure count-down (n)> at #x9775a8:

  10    (cons 2 1 2)
  11    (br-if-u64-=-scm 0 1 #f 5) ;; -> L2
  14    (sub/immediate 1 1 1)
  15    (br -5)                    ;; -> L1

I've snipped the disassembly to the interesting part. The first thing to notice is that there's just one procedure here: only one time that ,x prints "Disassembly of ...". That means that the lambda was eliminated somehow, either because it was dead or inlined, as described above, or because it was contified. It wasn't dead; we can see that from looking at the ,optimize output, which doesn't significantly change the term. It wasn't inlined either; again, ,optimize can show you this, but consider that because partial evaluation can't determine when the loop would terminate, it won't find a point at which it can stop unrolling the loop. (In practice what happens though is that it tries, hits an effort or code growth limit, then aborts the inlining attempt.)

However, what we see in the disassembly is the body of the loop: we cons something onto a list (the cons), check if a two numbers are equal (br-if-u64-=-scm), and if they are we jump out of the loop (L2). Otherwise we subtract 1 from a number (sub/immediate) and loop (br to L1). That is the loop. So what happened?

Well, if inlining is copying, then contification is rewiring. Guile's compiler was able to see that although it couldn't inline the loop function, it could see all of loop's callers, and that loop always returned to the same "place". (Another way to say this is that loop is always called with the same continuation.) The compiler was then able to incorporate the body of loop into count-down, rewiring calls to loop to continue to loop's beginning, and rewriting returns from loop to proceed to the continuation of the loop call.

a digression on language

These words like "contification" and "continuation" might be unfamiliar to you, and I sympathize. If you know of a better explanation of contification, I welcome any links you might have. The name itself comes from a particular formulation of the intermediate language used in Guile, the so-called "CPS" language. In this language, you convert a program to make it so it never returns: instead, each sub-expression passes its values to its continuation via a tail call. Each continuation is expressed as a lambda expression. See this article for an intro to CPS and how it relates to things you might know like SSA.

Transforming a program into CPS explodes it into a bunch of little lambdas: every subexpression gets its own. You would think this would be a step backwards, if your goal is to eliminate closures in some way. However it's possible to syntactically distinguish between lambda expressions which are only ever used as continuations and those that are used as values. Let's call the former kind of lambda a cont and the latter a function. A cont-lambda can be represented at run-time as a label -- indeed, the disassembly above shows this. It turns out that all lambda expressions introduced by the CPS transformation are conts. Conts form a first-order flow graph, and are basically the same as SSA basic blocks. If you're interested in this kind of thing, see Andrew Kennedy's great paper, Compiling with Continuations, Continued, and see also CPS soup for more on how this evolved in Guile 2.2.

I say all this to give you a vocabulary. Functions that are present in the source program start life as being treated as function-lambdas. Contification takes function-lambda values and turns then into cont-lambda labels, if it can. That's where the name "contification" comes from. For more on contification, see MLton's page on its contification pass, linking to the original paper that introduces the concept.

and we're back

Contification incorporates the body of a function into the flow graph of its caller. Unlike inlining, contification is always an optimization: it never causes code growth, and it enables other optimizations by exposing first-order control flow. (It's easier for the compiler to reason about first-order loops than it is to reason about control flow between higher-order functions.)

Contification is a reliable optimization. If a function's callers are always visible to the compiler, and the function is always called with the same continuation, it will be contified. These are two fairly simple conditions that you can cultivate your instincts to detect and construct.

Contification can also apply to mutually recursive functions, if as a group they are all always called with the same continuation. It's also an iterative process, in the sense that contifying one set of functions can expose enough first-order control flow that more contification opportunities become apparent.

It can take a while to get a feel for when this optimization applies. You have to have a feel for what a continuation is, and what it means for a function's callers to all be visible to the compiler. However, once you do internalize these conditions, contification is something you can expect Guile's compiler to do to your code.

lambda: code pointer

The next representation a lambda might have at run-time is as a code pointer. In this case, the function fails the conditions for contification, but we still avoid allocating a closure.

Here's a little example to illustrate the case.

(define (thing)
  (define (log what)
    (format #t "Very important log message: ~a\n" what))
  (log "ohai")
  (log "kittens")
  (log "donkeys"))

In this example, log is called with three different continuations, so it's not eligible for contification. Unfortunately, this example won't illustrate anything for us because the log function is so small that partial evaluation will succeed in inlining it. (You could determine this for yourself by using ,optimize.) So let's make it bigger, to fool the inliner:

(define (thing)
  (define (log what)
    (format #t "Very important log message: ~a\n" what)
    ;; If `log' is too short, it will be inlined.  Make it bigger.
    (format #t "Did I ever tell you about my chickens\n")
    (format #t "I was going to name one Donkey\n")
    (format #t "I always wanted a donkey\n")
    (format #t "In the end we called her Raveonette\n")
    (format #t "Donkey is not a great name for a chicken\n")
    (newline) (newline) (newline) (newline) (newline))
  (log "ohai")
  (log "kittens")
  (log "donkeys"))

Now if we disassembly it, we do get disassembly for two different functions:

,x thing
Disassembly of #<procedure thing ()> at #x97d704:

Disassembly of log at #x97d754:

So, good. We defeated the inliner. Let's look closer at the disassembly of the outer function.

,x thing
Disassembly of #<procedure thing ()> at #x97d704:
  12    (call-label 3 2 8)              ;; log at #x97d754

Here we see that instead of the generic call instruction, we have the specific call-label instruction which calls a procedure whose code is at a known offset from the calling function.

call-label is indeed a cheaper call than the full call instruction that has to check that the callee is actually a function and so on. But that's not the real optimization here. If all callers of a function are known -- and by this time, you're starting to catch the pattern, I think -- if all callers are known, then the procedure does not need to exist as a value at run-time.

This affords a number of optimization opportunities. Theoretically there are many -- all call sites can be specialized to the specific callee. The callee can have an optimized calling convention that doesn't have anything to do with the generic convention. Effect analysis can understand the side effects and dependencies of the callee in a more precise way. The compiler can consider unboxing some arguments and return values, if it finds that useful.

In Guile though, there's only one real optimization that we do, and that is related to free variables. Currently in Guile, all procedures have a uniform calling convention, in which the procedure being called (the callee) is itself passed as the zeroeth argument, and then the arguments follow on the stack. The function being called accesses its free variables through that zeroeth argument. If however there is no need for the procedure to be represented as a value, we are free to specialize that zeroeth argument.

So, consider a well-known procedure like log above. (By "well-known", we mean that all of log's callers are known.) Since log doesn't actually have any lexically bound free variables, we can just pass in anything as argument zero when invoking it. In practice we pass #f, because it happens to be an easy value to make.

(Why isn't format treated as a free variable in log? Because there is special support from the linker for lazily initializing the locations of variables imported from other modules or defined at the top level instead of within a lexical contour. In short: only variables that are (a) used within the lambda and (b) defined within a let or similar count towards a lambda's free variables.)

For a well-known procedure with only one free variable, we can pass in that free variable as the zeroeth argument. Internally to the function, we rewrite references to that free variable to reference argument 0 instead. This is a neat hack because we can have a lambda with a free variable but which results in no allocation at run-time.

Likewise if there are two free variables -- and this is starting to sound like Alice's restaurant, isn't it -- well we do have to pass in their values to the procedure, but we don't have to build an actual closure object with a tag and a code pointer and all. Pairs happen to be small and have some fast paths in Guile, so we use that. References to the free variables get internally rewritten to be car or cdr of argument 0.

For three or more free variables, we do the same, but with a vector.

For a final trick, a set of mutually recursive procedures whose callers are all known can share the object that collects their free variables. We collect the union of the free variables of all of the procedures, and pack them into a specialized representation as above.

Note that for well-known procedures, all variables that are free in the lambda are also free in the caller; that's why the 1-free-variable substitution works. The lambda is bound in a scope that dominates its callers, but its free variables dominate the lambda so they dominate the callers too. For that reason in this case we could choose to do lambda lifting instead, with no penalty: instead of bundling up the free variables in a heap object, we could pass them as arguments. Dybvig claims this is not a great idea because it increases register pressure. That could be true, but I haven't seen the numbers. Anyway, we do the flat closure thing, so we pack the free vars into data.

All these ideas came pretty much straight from the great Optimizing Closures in O(0) Time by Andrew Keep et al.

lambda: closure

OK! So you have a lambda whose callees are not all visible to the compiler. You need to reify the procedure as a value. That reified procedure-as-value is a closure: an object with a tag, a code pointer, and an array of free variables.

Of course, if the procedure has no free variables, you just have the tag and the code pointer... and because Scheme is semantically squirrely when it comes to the result of (eqv? (lambda () 10) (lambda () 10)) (it's unspecified: lambda expressions don't have identity), we can statically allocate the closure in the binary, as a constant.

Otherwise we do allocate the heap object.

Note however that if a group of mutually recursive procedures has just one entry that is not "well-known", then that procedure clique can share one closure object.

lambda: it's complicated

In summary, a lambda is an abstraction that has many concrete representations. Guile will choose the cheapest representation that it can. If you need to eke out even more performance from your program, having a good mental model of how the abstract maps to the concrete will help you know where to focus your efforts, and what changes might be helpful. Good luck, and happy hacking!

Syndicated 2016-02-08 10:12:42 from wingolog

guile compiler tasks

Hey! We released Guile 2.1.2, including the unboxing work, and we fixed the slow bootstrap problem by shipping pre-built bootstraps in tarballs. A pretty OK solution in my opinion; check it out!

future work

At this point I think I'm happy with Guile's compiler and VM, enough for now. There is a lot more work to do but it's a good point at which to release a stable series. There will probably be a number of additional pre-releases, but not any more significant compiler/VM work that must be done before a release.

However, I was talking with Guilers at FOSDEM last weekend and we realized that although we do a pretty good job at communicating the haps in compiler-land, we don't do a good job at sharing a roadmap or making it possible for other folks to join the hack. And indeed, it's been difficult to do so while things were changing so much: I had to get things right in my head before joining in the confusion of other people's heads.

In that spirit I'd like to share a list of improvements that it would be nice to make at some point. If you take one of these tasks, be my guest: find me on IRC (wingo on freenode) and let me know, and I'll help as I am able. You need to be somewhat independent; I'm not offering a proper mentoring or anything, more like office hours or something, where you come with the problem you are having and I commiserate and give context/background/advice as I am able.

So with that out of the way, here's a huge list of stuff! Following this, more details on each one.

  1. stripping binaries

  2. full source in binaries

  3. cps in in binaries

  4. linking multiple modules together

  5. linking a single executable

  6. instruction explosion

  7. elisp optimizations

  8. prompt removal

  9. basic register allocation

  10. optimal register allocation

  11. unboxed record fields

  12. textual CPS

  13. avoiding arity checks

  14. unboxed calls and returns

  15. module-level inlining

  16. cross-module inlining

As a bonus, in the end I'll give some notes on native compilation. But first, the hacks!

stripping binaries

Guile uses ELF as its object file format, and currently includes source location information as DWARF data. On space-constrained devices this might be too much. Your task: add a hack to the linker that can strip existing binaries. Read Ian Lance Taylor's linker articles for more background, if you don't know things about linkers yet.

full source in binaries

Wouldn't it be nice if the ELF files that Guile generates actually included the source as well as the line numbers? We could do that, in a separate strippable ELF section. This point is like the reverse of the previous point :)

cps in in binaries

We could also include the CPS IR in ELF files too. This would enable some kinds of link-time optimization and cross-module inlining. You'd need to define a binary format for CPS, like LLVM bitcode or so. Neat stuff :)

linking multiple modules together

Currently in Guile, just about every module is a separate .go file. Loading a module will cause a few stat calls and some seeks and reads and all that. Wouldn't it be nice if you could link together all the .go files that were commonly used into one object? Again this is a linker hack, but it needs support from the run-time as well: when the run-time goes to load a file, it should first check in a registry if that file has been logically provided by some other file. We'd be able to de-duplicate constant data from various modules. However there is an initialization phase when loading a .go file which effectively performs all the relocations needed by constants that need a fix-up at load-time; see the ELF article I linked to above for more. For some uses, it would be OK to produce one relocation/initialization procedure. For others, if you expected to only load a fraction of the modules in a .go file, it would be a lose on startup time,
so you would probably need to support lazy relocation when a module is first loaded.

Anyway, your task would be to write a linker hack that loads a bunch of .go files, finds the relocations in them, de-duplicates the constants, and writes out a combined .go file that includes a table of files contained in it. Good luck :) This hack would work great for Emacs, where it's effectively a form of unexec that doesn't actually rely on unexec.

linking a single executable

In the previous task, you could end up with the small guile binary that links to libguile (or your binary linking to libguile), and then a .go file containing all the modules you are interestd in. It sure would be nice to be able to link those together into just one binary, or at least to link the .go into the Guile binary. If the Guile is statically linked itself, you would have a statically linked application. If it's dynamically linked, it would remain dynamically linked. Again, a linker hack, but one that could provide a nicer way to distribute Guile binaries.

instruction explosion

Now we get more to the compiler side of things. Currently in Guile's VM there are instructions like vector-ref. This is a little silly: there are also instructions to branch on the type of an object (br-if-tc7 in this case), to get the vector's length, and to do a branching integer comparison. Really we should replace vector-ref with a combination of these test-and-branches, with real control flow in the function, and then the actual ref should use some more primitive unchecked memory reference instruction. Optimization could end up hoisting everything but the primitive unchecked memory reference, while preserving safety, which would be a win. But probably in most cases optimization wouldn't manage to do
this, which would be a lose overall because you have more instruction dispatch.

Well, this transformation is something we need for native compilation anyway. I would accept a patch to do this kind of transformation on the master branch, after version 2.2.0 has forked. In theory this would remove most all high level instructions from the VM, making the bytecode closer to a virtual CPU, and likewise making it easier for the compiler to emit native code as it's working at a lower level.

elisp optimizations

Guile implements Emacs Lisp, and does so well. However it hasn't been the focus of a lot of optimization. Emacs has a lot of stuff going on on its side, and so have we, so we haven't managed to replace the Elisp interpreter in Emacs with one written in Guile, though Robin Templeton has brought us a long way forward. We need someone to do both the integration work but also to poke the compiler and make sure it's a clear win.

prompt removal

It's pretty natural to use delimited continuations when compiling some kind of construct that includes a break statement to Guile, whether that compiler is part of Elisp or just implemented as a Scheme macro. But, many instances of prompts can be contified, resulting in no overhead at run-time. Read up on contification and contify the hell out of some prompts!

basic register allocation

Guile usually tries its best to be safe-for-space: only the data which might be used in the future of a program is kept alive, and the rest is available for garbage collection. Notably, this applies to function arguments, temporaries, and lexical variables: if a value is dead, the GC can collect it and re-use its space. However this isn't always what you want. Sometimes you might want to have all variables that are in scope to be available, for better debugging. Your task would be to implement a "slot allocator" (which is really register allocation) that keeps values alive in the parts of the programs that they dominate.

optimal register allocation

On the other hand, our slot allocator -- which is basically register allocation, but for stack slots -- isn't so great. It does OK but you can often end up shuffling values in a loop, which is the worst. Your task would be to implement a proper register allocator: puzzle-solving, graph-coloring, iterative coalescing, something that really tries to do a good job. Good luck!

unboxed record fields

Guile's "structs", on which records are implemented, support unboxed values, but these values are untyped, not really integrated with the record layer, and always boxed in the VM. Your task would be to design a language facility that allows us to declare records with typed fields, and to store unboxed values in those fields, and to cause access to their values to emit boxing/unboxing instructions around them. The optimizer will get rid of those boxing/unboxing instructions if it can. Good luck!

textual CPS

The CPS language is key to all compiler work in Guile, but it doesn't have a nice textual form like LLVM IR does. Design one, and implement a parser and an unparser!

avoiding arity checks

If you know the procedure you are calling, like if it's lexically visible, then if you are calling it with the right number of arguments you can skip past the argument check and instead do a call-label directly into the body. Would be pretty neat!

unboxed calls and returns

Likewise if a function's callers are all known, it might be able to unbox its arguments or return value, if that's a good idea. Tricky! You could start with a type inference pass or so, and maybe that could produce some good debugging feedback too.

module-level inlining

Guile currently doesn't inline anything that's not lexically visible. Unfortunately this restriction extends to top-level definitions in a module: they are treated as mutable and so never inlined/optimized/etc. Probably we need to change the semantics here such that a module can be compiled as a unit, and all values which are never mutated can be assumed to be constant. Probably you also want a knob to turn off this behavior, but really you can always re-compile and re-load a module as a whole if re-loading a function at run-time doesn't work because it was inlined. Anyway. Some semantic work here, but some peval work as well. Be careful!

cross-module inlining

Likewise Guile currently doesn't inline definitions from other modules. However for small functions this really hurts. Guile should probably serialize tree-il for small definitions in .go files, and allow peval to speculatively inline imported definitions. This is related to the previous point and has some semantic implications.

bobobobobobonus! native compilation

Thinking realistically, native compilation is the next step. We have the object file format, cool. We will need the ability to call out from machine code in .go files to run-time functions, so we need to enhance the linker, possibly even with things like PLT/GOT sections to avoid dirtying too many pages. We need to lower the CPS even further, to get closer to some kind of machine model, then go specific, with an assembler for each architecture. The priority in the beginning will be simplicity and minimal complexity; good codegen will come later. This is obviously the most attractive thing but it's also the most tricky, design-wise. I want to do at least part of this, so though you can't have it all, you are welcome to help :)

That's it for now. I'll amend the post with more things as and when I think of them. Comments welcome too, as always. Happy hacking!

Syndicated 2016-02-04 21:38:05 from wingolog

talks i would like to give in 2016

Every year I feel like I'm trailing things in a way: I hear of an amazing conference with fab speakers, but only after the call for submissions had closed. Or I see an event with exactly the attendees I'd like to schmooze with, but I hadn't planned for it, and hey, maybe I could have even spoke there.

But it's a new year, so let's try some new things. Here's a few talks I would love to give this year.

building languages on luajit

Over the last year or two my colleagues and I have had good experiences compiling in, on, and under LuaJIT, and putting those results into production in high-speed routers. LuaJIT has some really interesting properties as a language substrate: it has a tracing JIT that can punch through abstractions, it has pretty great performance, and it has a couple of amazing escape hatches that let you reach down to the hardware in the form of the FFI and the DynASM assembly generator. There are some challenges too. I can tell you about them :)

try guile for your next project!

This would be a talk describing Guile, what it's like making programs with it, and the kind of performance you can expect out of it. If you're a practicing programmer who likes shipping small programs that work well, are fun to write, and run with pretty good performance, I think Guile can be a great option.

I don't get to do many Guile talks because hey, it's 20 years old, so we don't get the novelty effect. Still, I judge a programming language based on what you can do with it, and recent advances in the Guile implementation have expanded its scope significantly, allowing it to handle many problem sizes that it couldn't before. This talk will be a bit about the language, a bit about the implementation, and a bit about applications or problem domains.

compiling with persistent data structures

As part of Guile's recent compiler improvements, we switched to a somewhat novel intermediate language. It's continuation-passing-style, but based on persistent data structures. Programming with it is interesting and somewhat different than other intermediate languages, and so this would be a talk describing the language and what it's like to work in it. Definitely a talk for compiler people, by a compiler person :)

a high-performance networking with luajit talk

As I mentioned above, my colleagues and I at work have been building really interesting things based on LuaJIT. In particular, using the Snabb Switch networking toolkit has let us build an implementation of a "lightweight address family translation router" -- the internet-facing component of an IPv4-as-a-service architecture, built on an IPv6-only network. Our implementation flies.

It sounds a bit specialized, and it is, but this talk could go two ways.

One version of this talk could be for software people that aren't necessarily networking specialists, describing the domain and how with Snabb Switch, LuaJIT, compilers, and commodity x86 components, we are able to get results that compete well with offerings from traditional networking vendors. Building specialized routers and other network functions in software is an incredible opportunity for compiler folks.

The other version would be more for networking people. We'd explain the domain less and focus more on architecture and results, and look more ahead to challenges of 100Gb/s ports.

let me know!

I'll probably submit some of these to a few conferences, but if you run an event and would like me to come over and give one of these talks, I would be flattered :) Maybe that set of people is empty, but hey, it's worth a shot. Probably contact via the twitters has the most likelihood of response.

There are some things you need to make sure are covered before reaching out, of course. It probably doesn't need repeating in 2016, but make sure that you have a proper code of conduct, and that that you'll be able to put in the time to train your event staff to create that safe space that your attendees need. Getting a diverse speaker line-up is important to me too; conferences full of white dudes like me are not only boring but also serve to perpetuate an industry full of white dudes. If you're reaching out, reach out to woman and people of color too, and let me know that you're working on it. This old JSConf EU post has some ideas too. Godspeed, and happy planning!

Syndicated 2016-01-21 11:59:18 from wingolog

unboxing in guile

Happy snowy Tuesday, hackfolk! I know I said in my last dispatch that I'd write about Lua soon, but that article is still cooking. In the meantime, a note on Guile and unboxing.

on boxen, on blitzen

Boxing is a way for a programming language implementation to represent a value.

A boxed value is the combination of a value along with a tag providing some information about the value. Both the value and the tag take up some space. The value can be thought to be inside a "box" labelled with the tag and containing the value.

A value's tag can indicate whether the value's bits should be interpreted as an unsigned integer, as a double-precision floating-point number, as an array of words of a particular data type, and so on. A tag can also be used for other purposes, for example to indicate whether a value is a pointer or an "immediate" bit string.

Whether values in a programming language are boxed or not is an implementation consideration. It can be the case that in languages with powerful type systems that a compiler can know what the representation of all values are in all parts of all programs, and so boxing is never needed. However, it's much easier to write a garbage collector if values have a somewhat uniform representation, with tag bits to tell the GC how to trace any pointers that might be contained in the object. Tags can also carry run-time type information needed by a dynamically typed language like Scheme or JavaScript, to allow for polymorphic predicates like number? or pair?.

Boxing all of the values in a program can incur significant overhead in space and in time. For example, one way to implement boxes is to allocate space for the tag and the value on the garbage-collected heap. A boxed value would then be referred to via a pointer to the corresponding heap allocation. However, most memory allocation systems align their heap allocations on word-sized boundaries, for example on 8-byte boundaries. That means that the low 3 bits of a heap allocation will always be zero. If you make a bit string whose low 3 bits are not zero, it cannot possibly be a valid pointer. In that case you can represent some types within the set of bit strings that cannot be valid pointers. These values are called "immediates", as opposed to "heap objects". In Guile, we have immediate representations for characters, booleans, some special values, and a subset of the integers. Alternately, a programming language implementation can represent values as double-precision floating point numbers, and shove pointers into the space of the NaN values. And for heap allocations, some systems can associate one tag with a whole page of values, minimizing per-value boxing overhead.

The goal of these optimizations is to avoid heap allocation for some kinds of boxes. While most language implementations have good garbage collectors that make allocation fairly cheap, the best way to minimize allocation cost is to refrain from it entirely.

In Guile's case, we currently use a combination of low-bit tagging for immediates, including fixnums (a subset of the integers), and tagged boxes on the heap for everything else, including floating-point numbers.

Boxing floating-point numbers obviously incurs huge overhead on floating-point math. You have to consider that each intermediate value produced by a computation will result in the allocation of another 8 bytes for the value and 4 or 8 bytes for the tag. Given that Guile aligns allocations on 8-byte boundaries, the result is a 16-byte allocation in either case. Consider this loop to sum the doubles in a bytevector:

(use-modules (rnrs bytevectors))
(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))

Each trip through the loop is going to allocate not one but two heap floats: one to box the result of bytevector-ieee-double-native-ref (whew, what a mouthful), and one for the sum. If we have a bytevector of 10 million elements, that will be 320 megabytes of allocation. Guile can allocate short-lived 16-byte allocations at about 900 MB/s on my machine, so summing this vector is going to take at least 350ms, just for the allocation. Indeed, without unboxing I measure this loop at 580ms for a 10 million element vector:

> (define v (make-f64vector #e10e6 1.0))
> ,time (f64-sum v)
$1 = 1.0e7
;; 0.580114s real time, 0.764572s run time.  0.268305s spent in GC.

The run time is higher than the real time due to parallel marking. I think in this case, allocation has even higher overhead because it happens outside the bytecode interpreter. The add opcode has a fast path for small integers (fixnums), and if it needs to work on flonums it calls out to a C helper. That C helper doesn't have a pointer to the thread-local freelist so it has to go through a more expensive allocation path.

Anyway, in the time that Guile takes to fetch one f64 value from the vector and add it to the sum, the CPU ticked through some 150 cycles, so surely we can do better than this.

unboxen, unblitzen

Let's take a look again at the loop to see where the floating-point allocations are produced.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (+ sum (bytevector-ieee-double-native-ref v i)))

It turns out there's no reason for the loquatiously-named bytevector-ieee-double-native-ref to return a boxed number. It's a monomorphic function that is well-known to the Guile compiler and virtual machine, and it even has its own opcode. In Guile 2.0 and until just a couple months ago in Guile 2.2, this function did box its return value, but that was because the virtual machine had no facility for unboxed values of any kind.

To allow bytevector-ieee-double-native-ref to return an unboxed double value, the first item of business was then to support unboxed values in Guile's VM. Looking forward to unboxed doubles, we made a change such that all on-stack values are 64 bits wide, even on 32-bit systems. (For simplicity, all locals in Guile take up the same amount of space. For the same reason, fetching 32-bit floats also unbox to 64-bit doubles.)

We also made a change to Guile's "stack maps", which are data structures that tell the garbage collector which locals are live in a stack frame. There is a stack map recorded at every call in a procedure, to be used when an activation is pending on the stack. Stack maps are stored in a side table in a separate section of the compiled ELF library. Live values are traced by the garbage collector, and dead values are replaced by a special "undefined" singleton. The change we made was to be able to indicate that live values were boxed or not, and if they were unboxed, what type they were (e.g. unboxed double). Knowing the type of locals helps the debugger to print values correctly. Currently, all unboxed values are immediates, so the GC doesn't need to trace them, but it's conceivable that we could have unboxed pointers at some point. Anyway, instead of just storing one bit (live or dead) per local in the stack map, we store two, and reserve one of the bit patterns to indicate that
the local is actually an f64 value.

But the changes weren't done then: since we had never had unboxed locals, there were quite a few debugging-related parts of the VM that assumed that we could access the first slot in an activation to see if it was a procedure. This dated from a time in Guile where slot 0 would always be the procedure being called, but the check is bogus ever since Guile 2.2 allowed local value slots corresponding to the closure or procedure arguments to be re-used for other values, if the closure or argument was dead. Another nail in the coffin of procedure-in-slot-0 was driven by closure optimizations, in which closures whose callees are all visible could specialize the representation of their closure in non-standard ways. It took a while, but unboxing f64 values flushed out these bogus uses of slot 0.

The next step was to add boxing and unboxing operations to the VM (f64->scm and scm->f64, respectively). Then we changed bytevector-ieee-double-native-ref to return an unboxed value and then immediately box it via f64->scm. Similarly for bytevector-ieee-double-native-set!, we unbox the value via scm->f64, potentially throwing a type error. Unfortunately our run-time type mismatch errors got worse; although the source location remains the same, scm->f64 doesn't include the reason for the unboxing. Oh well.

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
              (+ sum boxed))

When we lower Tree-IL to CPS, we insert the needed f64->scm and scm->f64 boxing and unboxing operations around bytevector accesses. Cool. At this point we have a system with unboxed f64 values, but which is slower than the original version because every f64 bytevector access involves two instructions instead of one, although the instructions themselves together did the same amount of work. However, telling the optimizer about these instructions could potentially eliminate some of them. Let's keep going and see where we get.

Let's attack the other source of boxes, the accumulation of the sum. We added some specialized instuctions to the virtual machine to support arithmetic over unboxed values. Doing this is potentially a huge win, because not only do you avoid allocating a box for the result, you also avoid the type checks on the incoming values. So we add f64+, f64-, and so on.

Unboxing the + to f64+ is a tricky transformation, and relies on type analysis. Our assumption is that if type analysis indicates that we are in fact able to replace a generic arithmetic instruction with a combination of operand unboxing, unboxed arithmetic, and a boxing operation, then we should do it. Separating out the boxes and the monomorphic arithmetic opens the possibility to remove the resulting box, and possibly remove the unboxing of operands too. In this case, we run an optimization pass and end up with something like:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
               (f64+ (scm->f64 sum)
                     (scm->f64 boxed)))))

Scalar replacement via fabricated expressions will take the definition of boxed as (f64->scm f64) and fabricate a definition of f64 as (scm->f64 boxed), which propagates down to the f64+ so we get:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i))
                  (boxed (f64->scm f64)))
               (f64+ (scm->f64 sum)

Dead code elimination can now kill boxed, so we end up with:

(define (f64-sum v)
  (let lp ((i 0) (sum 0.0))
    (if (< i (bytevector-length v))
        (lp (+ i 8)
            (let ((f64 (bytevector-ieee-double-native-ref v i)))
               (f64+ (scm->f64 sum)

Voilà, we removed one allocation. Yay!

As we can see from the residual code, we're still left with one f64->scm boxing operation. That expression is one of the definitions of sum, one of the loop variables. The other definition is 0.0, the starting value. So, after specializing arithmetic operations, we go through the set of multiply-defined variables ("phi" variables) and see what we can do to unbox them.

A phi variable can be unboxed if all of its definitions are unboxable. It's not always clear that you should unbox, though. For example, maybe you know via looking at the definitions for the value that it can be unboxed as an f64, but all of its uses are boxed. In that case it could be that you throw away the box when unboxing each definition, only to have to re-create them anew when using the variable. You end up allocating twice as much instead of not at all. It's a tricky situation. Currently we assume a variable with multiple definitions should only be unboxed if it has an unboxed use. The initial set of unboxed uses is the set of operands to scm->f64. We iterate this set to a fixed point: unboxing one phi variable could cause others to be unbox as well. As a heuristic, we only require one unboxed use; it could be there are other uses that are boxed, and we could indeed hit that pessimal double-allocation case. Oh well!

In this case, the intermediate result looks something like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (let ((f64 (bytevector-ieee-double-native-ref v i)))
                  (f64+ (scm->f64 sum-box)

After the scalar replacement and dead code elimination passes, we end up with something more like:

(define (f64-sum v)
  (let lp ((i 0) (sum (scm->f64 0.0)))
    (let ((sum-box (f64->scm sum)))
      (if (< i (bytevector-length v))
          (lp (+ i 8)
              (f64+ sum
                    (bytevector-ieee-double-native-ref v i)))

Well this is looking pretty good. There's still a box though. Really we should sink this to the exit, but as it happens there's something else that accidentally works in our favor: loop peeling. By peeling the first loop iteration, we create a control-flow join at the loop exit that defines a phi variable. That phi variable is subject to the same optimization, sinking the box down to the join itself. So in reality the result looks like:

(define (f64-sum v)
  (let ((i 0)
        (sum (scm->f64 0.0))
        (len (bytevector-length v)))
     (if (< i len)
         (let ((i (+ i 8))
               (sum (f64+ sum
                          (bytevector-ieee-double-native-ref v i))))
           (let lp ((i i) (sum sum))
             (if (< i len)
                 (lp (+ i 8)
                     (f64+ sum (bytevector-ieee-double-native-ref v i)))

As you can see, the peeling lifted the length computation up to the top too, which is a bonus. We should probably still implement allocation sinking, especially for loops for which peeling isn't an option, but the current status often works well. Running f64-sum on a 10-million-element packed double array goes down from 580ms to 99ms, or to some 25 or 30 CPU cycles per element, and of course no time in GC. Considering that this loop still has the overhead of bytecode interpretation and cache misses, I think we're doing A O K.


It used to be that using packed bytevectors of doubles was an easy way to make your program slower using types (thanks to Sam Tobin-Hochstadt for that quip). The reason is that although a packed vector of doubles uses less memory, every access to it has to allocate a new boxed number. Compare to "normal" vectors where sure, it uses more memory, but fetching an element fetches an already-boxed value. Now with the unboxing optimization, this situation is properly corrected... in most cases.

The major caveat is that for unboxing to work completely, each use of a potentially-unboxable value has to have an alternate implementation that can work on unboxed values. In our example above, the only use was f64+ (which internally is really called fadd), so we win. Writing an f64 to a bytevector can also be unboxed. Unfortunately, bytevectors and simple arithmetic are currently all of the unboxable operations. We'll implement more over time, but it's a current limitation.

Another point is that we are leaning heavily on the optimizer to remove the boxes when it can. If there's a bug or a limitation in the optimizer, it could be the box stays around needlessly. It happens, hopefully less and less but it does happen. To be sure you get the advantages, you need to time the code and see if it's spending significant time in GC. If it is, then you need to disassemble your code to see where that's happening. It's not a very nice thing, currently. The Scheme-like representations I gave above were written by hand; the CPS intermediate language is much more verbose than that.

Another limitation is that function arguments and return values are always boxed. Of course, the compiler can inline and contify a lot of functions, but that means that to use abstraction, you need to build up a mental model of what the inliner is going to do.

Finally, it's not always obvious to the compiler what the type of a value is, and that necessarily limits unboxing. For example, if we had started off the loop by defining sum to be 0 instead of 0.0, the result of the loop as a whole could be either an exact integer or an inexact real. Of course, loop peeling mitigates this to an extent, unboxing sum within the loop after the first iteration, but it so happens that peeling also prevents the phi join at the loop exit from being unboxed, because the result from the peeled iteration is 0 and not 0.0. In the end, we are unable to remove the equivalent of sum-box, and so we still allocate once per iteration. Here is a clear case where we would indeed need allocation sinking.

Also, consider that in other contexts the type of (+ x 1.0) might actually be complex instead of real, which means that depending on the type of x it might not be valid to unbox this addition. Proving that a number is not complex can be non-obvious. That's the second way that fetching a value from a packed vector of doubles or floats is useful: it's one of the rare times that you know that a number is real-valued.

on integer, on fixnum

That's all there is to say about floats. However, when doing some benchmarks of the floating-point unboxing, one user couldn't reproduce some of the results: they were seeing huge run-times for on a microbenchmark that repeatedly summed the elements of a vector. It turned out that the reason was that they were on a 32-bit machine, and one of the loop variables used in the test was exceeding the fixnum range. Recall that fixnums are the subset of integers that fit in an immediate value, along with their tag. Guile's fixnum tag is 2 bits, and fixnums have a sign bit, so the most positive fixnum on a 32-bit machine is 229—1, or around 500 million. It sure is a shame not to be able to count up to #xFFFFFFFF without throwing an allocation party!

So, we set about seeing if we could unbox integers as well in Guile. Guile's compiler has a lot more visibility as to when something is an integer, compared to real numbers. Anything used as an index into a vector or similar data structure must be an exact integer, and any query as to the length of a vector or a string or whatever is also an integer.

Note that knowing that a value is an exact integer is insufficient to unbox it: you have to also know that it is within the range of your unboxed integer data type. Here we take advantage of the fact that in Guile, type analysis also infers ranges. So, cool. Because the kinds of integers that can be used as indexes and lengths are all non-negative, our first unboxed integer type is u64, the unsigned 64-bit integers.

If Guile did native compilation, it would always be a win to unbox any integer operation, if only because you would avoid polymorphism or any other potential side exit. For bignums that are within the unboxable range, the considerations are similar to the floating-point case: allocation costs dominate, so unboxing is almost always a win, provided that you avoid double-boxing. Eliminating one allocation can pay off a lot of instruction dispatch.

For fixnums, though, things are not so clear. Immediate tagging is such a cheap way of boxing that in an interpreter, the extra instructions you introduce could outweigh any speedup from having faster operations.

In the end, I didn't do science and I decided to just go ahead and unbox if I could. We are headed towards native compilation, this is a necessary step along that path, and what the hell, it seemed like a good idea at the time.

Because there are so many more integers in a typical program than floating-point numbers, we had to provide unboxed integer variants of quite a number of operations. Of course we could unconditionally require unboxed arguments to vector-ref, string-length and so on, but in addition to making u64 variants of arithmetic, we also support bit operations like logand and such. Unlike the current status with floating point numbers, we can do test-and-branch over unboxed u64 comparisons, and we can compare u64 values to boxed SCM values.

In JavaScript, making sure an integer is unboxed is easy: you just do val | 0. The bit operation | truncates the value to a uint32. In Guile though, we have arbitrary-precision bit operations, so although (logior val 0) would assert that val is an integer, it wouldn't necessarily mean that it's unboxable.

Instead, the Guile idiom for making sure you have an unboxed integer in a particular range should go like this:

(define-inlinable (check-uint-range x mask)
  (let ((x* (logand x mask)))
    (unless (= x x*)
      (error "out of range" x))

A helper like this is useful to assert that an argument to a function is of a particular type, especially given that arguments to functions are always boxed and treated as being of unknown type. The logand asserts that the value is an integer, and the comparison asserts that it is within range.

For example, if we want to implement a function that does modular 8-bit addition, it can go like:

(define-inlinable (check-uint8 x)
  (check-uint-range x #xff))
(define-inlinable (truncate-uint8 x)
  (logand x #xff))
(define (uint8+ x y)
  (truncate-uint8 (+ (check-uint8 x) (check-uint8 y))))

If we disassemble this function, we get something like:

Disassembly of #<procedure uint8+ (x y)> at #xa8d0f8:

   0    (assert-nargs-ee/locals 3 2)    ;; 5 slots (2 args)
   1    (scm->u64/truncate 4 3)
   2    (load-u64 1 0 255)
   5    (ulogand 4 4 1)
   6    (br-if-u64-=-scm 4 3 #f 17)     ;; -> L1
;; [elided code to throw an error if x is not in range]
  23    (scm->u64/truncate 3 2)
  24    (ulogand 3 3 1)
  25    (br-if-u64-=-scm 3 2 #f 18)     ;; -> L2
;; [elided code to throw an error if y is not in range]
  43    (uadd 4 4 3)
  44    (ulogand 4 4 1)
  45    (u64->scm 3 4)
  46    (return-values 2)               ;; 1 value

The scm->u64/truncate instructions unbox an integer, but truncating it to the u64 range. They are used when we know that any additional bits won't be used, as in this case where we immediately do a logand of the unboxed value. All in all it's not a bad code sequence; there are two possible side exits for each argument (not an integer signalled by the unboxing, and out of range signalled by the explicit check), and no other run-time dispatch. For now I think we can be pretty happy with the code.

That's about it for integer unboxing. We also support unboxed signed 64-bit integers, mostly for use as operands or return values from bytevector-s8-ref and similar unboxed accessors on bytevectors. There are fewer operations that have s64 variants, though, compared to u64 variants.


Up until now in Guile, it could be that you might have to avoid Scheme if you needed to do some kinds of numeric computation. Unboxing floating-point and integer numbers makes it feasible to do more computation in Scheme instead of having to rely in inflexible C interfaces. At the same time, as a Scheme hacker I feel much more free knowing that I can work on 64-bit integers without necessarily allocating bignums. I expect this optimization to have a significant impact on the way I program, and what I program. We'll see where this goes, though. Until next time, happy hacking :)

Syndicated 2016-01-19 11:57:53 from wingolog

the half strap: self-hosting and guile

or, "why does building guile take so friggin long"

Happy new year's, hackfolk! I don't know about y'all, but I'm feeling pretty good about 2016. Let's make some cool stuff!

Today's article is about Guile and how it builds itself. It's a Scheme implementation mostly written in Scheme, so how it would go about doing that isn't straightforward. And although the performance of Guile is pretty great these days, a user's first experience with it will probably be building it, which is a process that takes approximately forever. Seriously. On this newish laptop with an i7-5600U CPU and four cores it takes like 45 minutes. On older machines it can take even longer. What gives?

Well, fictional reader, it's a good question. I'm glad you asked! Before getting to the heart of the matter, I summarize a bit of background information.

and then nothing turned itself inside out

Guile is mostly written in Scheme. Some parts of it are written in C -- some runtime routines, some supporting libraries (the garbage collector, unicode support, arbitrary precision arithmetic), and the bytecode interpreter. The first phase when building Guile is to take the system's C compiler -- a program that takes C source code and produces native machine code -- and use it to build libguile, the part of Guile that is written in C.

The next phase is to compile the parts of Guile written in Scheme. Currently we compile to bytecode which is then interpreted by libguile, but this discussion would be the same if we compiled Scheme to native code instead of bytecode.

There's a wrinkle, though: the Scheme compiler -- the program that takes a Scheme program and produces bytecode -- is written in Scheme. When we built libguile, we could use the system's C compiler. But the system has no Scheme compiler, so how do we do?

The answer is that in addition to a Scheme compiler, Guile also includes a Scheme interpreter. We use the interpreter to load the Scheme compiler, and then use the compiler to produce bytecode from Scheme.

There's another wrinkle, though, and I bet you can guess what it is :) The Scheme interpreter is also written in Scheme. It used to be that Guile's Scheme interpreter was written in C, but that made it impossible to tail-call between compiled and interpreted code. So some six years ago, I rewrote the interpreter in Scheme.

As I mention in that article, Guile actually has two Scheme interpreters: the one in Scheme and one in C that is only used to compile the one in Scheme, and never used again. The bootstrap interpreter written in C avoids the problem with tail calls to compiled code because when it runs, there is no compiled code.

So in summary, Guile's build has the following general phases:

  1. The system C compiler builds libguile.

  2. The bootstrap C interpreter in libguile loads the Scheme compiler and builds eval.go from eval.scm. (Currently .go is the extension for compiled Guile code. The extension predates the Go language. Probably we switch to .so at some point, though.)

  3. The Scheme interpreter from eval.go loads the Scheme compiler and compiles the rest of the Scheme code in Guile, including the Scheme compiler itself.

In the last step, Guile compiles each file in its own process, allowing for good parallelization. This also means that as the compiler builds, the compiler itself starts running faster because it can use the freshly built .go files instead having to use the interpreter to load the source .scm files.

so what's slow?

Building libguile is not so slow; it takes about a minute on my laptop. Could be faster, but it's fine.

Building eval.go is slow, but at two and half minutes it's bearable.

Building the rest of the Scheme code is horribly slow though, and for me takes around 40 or 50 minutes. What is going on?

The crucial difference between building libguile and building the .go files is that when we build libguile, we use the C compiler, which is itself a highly optimized program. When we build .go files, we use the Scheme compiler, which hasn't yet been compiled! Indeed if you rebuild all the Scheme code using a compiled Scheme compiler instead of an interpreted Scheme compiler, you can rebuild all of Guile in about 5 minutes. (Due to the way the Makefile dependencies work, the easiest way to do this if you have a built Guile is rm bootstrap/ice-9/eval.go && make -jN.)

The story is a bit complicated by parallelism, though. Usually if you do a make -j4, you will be able to build 4 things at the same time, taking advantage of 4 cores (if you have them). However Guile's Makefile rules are arranged in such a way that the initial eval.go compile is done serially, when nothing else is running. This is because the bootstrap interpreter written in C uses C stack space as temporary storage. It could be that when compiling bigger files, the C interpreter might run out of stack, and with C it's hard to detect exactly how much stack you have. Indeed, sometimes we get reports of strange bootstrap failures that end up being because Guile was built with -O0 and the compiler decided to use much more stack space than we usually see. We try to fix these, usually by raising the static stack limits that Guile's C interpreter imposes, but we certainly don't want a limitation in the bootstrap interpreter to affect the internal structure of the rest of Guile. The
bootstrap interpreter's only job is to load the compiler and build eval.go, and isn't tested in any other way.

So eval.go is build serially. After that, compilation can proceed in parallel, but goes more slowly before speeding up. To explain that, I digress!

a digression on interpreters

When Scheme code is loaded into Guile from source, the process goes like this:

  1. Scheme code is loaded from disk or wherever as a stream of bytes.

  2. The reader parses that byte stream into S-expressions.

  3. The expander runs on the S-expressions, expanding macros and lowering Scheme code to an internal language called "Tree-IL".

Up to here, the pipeline is shared between the interpreter and the compiler. If you're compiling, Guile will take the Tree-IL, run the partial evaluator on it, lower to CPS, optimize that CPS, and then emit bytecode. The next time you load this file, Guile will just mmap in the .go file and skip all of the other steps. Compilation is great!

But if you are interpreting, a few more things happen:

  1. The memoizer does some analysis on the Tree-IL and turns variable references into two-dimensional (depth, offset) references on a chained environment. See the story time article for more; scroll down about halfway for the details. The goal is to do some light compilation on variable access so that the interpreter will have to do less work, and also prevent closures from hanging on to too much data; this is the "flat closure" optimization, for the interpreter.

  2. The interpreter "compiles" the code to a chain of closures. This is like the classic direct-threading optimization, but for a tree-based interpreter.

The closure-chaining strategy of the interpreter is almost exactly as in described in SICP's analyze pass. I came up with it independently, but so did Jonathan Rees in 1982 and Marc Feeley in 1986, so I wasn't surprised when I found the prior work!

Back in 2009 when we switched to the eval-in-Scheme, we knew that it would result in a slower interpreter. This is because instead of the interpreter being compiled to native code, it was compiled to bytecode. Also, Guile's Scheme compiler wasn't as good then, so we knew that we were leaving optimizations on the floor. Still, the switch to an evaluator in Scheme enabled integration of the compiler, and we thought that the interpreter speed would improve with time. I just took a look and with this silly loop:

(let lp ((n 0)) (if (< n #e1e7) (lp (1+ n))))

Guile 1.8's interpreter written in C manages to run this in 1.1 seconds. Guile 2.0's interpreter written in Scheme and compiled to the old virtual machine does it in 16.4 seconds. Guile 2.1.1's interpreter, with the closure-chaining optimization, a couple of peephole optimizations in the interpreter, and compiled using the better compiler and VM from Guile 2.2, manages to finish in 2.4 seconds. So we are definitely getting better, and by the time we compile eval.scm to native code I have no doubt that we will be as good as the old C implementation. (Of course, when compiled to Guile 2.2's VM, the loop finishes in 55 milliseconds, but comparing a compiler and an interpreter is no fair.)

The up-shot for bootstrap times is that once the interpreter is compiled, the build currently runs a little slower, because the compiled eval.go interpreter is a bit slower than the bootstrap interpreter in libguile.

bottom up, top down

Well. Clearly I wanted to share a thing with you about interpreters; thank you for following along :) The salient point is that Guile's interpreter is now pretty OK, though of course not as good as the compiler. Still, Guile 2.0 builds in 12 minutes, while Guile 2.2 builds in 40 or 50, and Guile 2.2 has a faster interpreter. What's the deal?

There are a few factors at play but I think the biggest is that Guile 2.2's compiler is simply much more sophisticated than Guile 2.0's compiler. Just loading it up at bootstrap-time takes longer than loading Guile 2.0's compiler, because there's more code using more macro abstractions than in Guile 2.0. The expander has to do more work, and the evaluator has to do more work. A compiler is a program that runs on programs, and interpreting a bigger program is going to be slower than interpreting a smaller program.

It's a somewhat paradoxical result: to make programs run faster, we needed a better compiler, but that better compiler is bigger, and so it bootstraps from source more slowly. Some of the improvements to generated code quality were driven by a desire to have the compiler run faster, but this only had the reverse effect on bootstrap time.

Unfortunately, Guile 2.2's compiler also runs slow when it's fully compiled: compiling one largeish module in Guile 2.2 compared to 2.0 takes 10.7 seconds instead of 1.9. (To reproduce, ,time (compile-file "module/ice-9/psyntax-pp.scm") from a Guile 2.0 or 2.2 REPL.) How can we explain this?

Understanding this question has taken me some time. If you do a normal profile of the code using statprof, you get something like this:

> ,profile (compile-file "module/ice-9/psyntax-pp.scm")
%     cumulative   self             
time   seconds     seconds  procedure
 12.41      1.61      1.61  language/cps/intmap.scm:393:0:intmap-ref
  6.35      1.05      0.82  vector-copy
  5.92     13.09      0.77  language/cps/intset.scm:467:5:visit-branch
  5.05      0.71      0.65  language/cps/intmap.scm:183:0:intmap-add!
  4.62      1.40      0.60  language/cps/intset.scm:381:2:visit-node
  3.61      0.93      0.47  language/cps/intset.scm:268:0:intset-add
  3.46      0.49      0.45  language/cps/intset.scm:203:0:intset-add!
  3.17      1.01      0.41  language/cps/intset.scm:269:2:adjoin
  3.03      1.46      0.39  language/cps/intmap.scm:246:2:adjoin

("Cumulative seconds" can be greater than the total number of seconds for functions that have multiple activations live on the stack.)

These results would seem to unequivocally indicate that the switch to persistent data structures in the new compiler is to blame. This is a somewhat disheartening realization; I love working with the new data structures. They let me write better code and think about bigger things.

Seeing that most of the time is spent in intmap and intset manipulations, I've tried off and on over the last few months to speed them up. I tried at one point replacing hot paths with C -- no speedup, so I threw it away. I tried adding an alternate intmap implementation that, for transient packed maps, would store the map as a single vector; no significant speedup, binned it. I implemented integer unboxing in the hopes that it would speed up the results; more about that in another missive. I stared long and hard at the generated code, looking for opportunities to improve it (and did make some small improvements). Even when writing this article, the results are such a shame that I put the article on hold for a couple weeks while I looked into potential improvements, and managed to squeak out another 10%.

In retrospect, getting no speedup out of C hot paths should have been a hint.

For many years, a flat statistical profile with cumulative/self timings like the one I show above has been my go-to performance diagnostic. Sometimes it does take a bit of machine sympathy to understand, though; when you want to know what's calling a hot function, usually you look farther down the list for functions that don't have much self time but whose cumulative time matches the function you're interested in. But this approach doesn't work for hot functions that are called from many, many places, as is the case with these fundamental data structure operations.

Indeed at one point I built a tool to visualize statistical stack samples, the idea being you often want to see how a program gets to its hot code. This tool was useful but its output could be a bit overwhelming. Sometimes you'd have to tell it to generate PDF instead of PNG files because the height of the image exceeded Cairo's internal limits. The tool also had too many moving pieces to maintain. Still, the core of the idea was a good one, and I incorporated the non-graphical parts of it into Guile proper, where they sat unused for a few years.

Fast-forward to now, where faced with this compiler performance problem, I needed some other tool to help me out. It turns out that in the 2.0 to 2.2 transition, I had to rewrite the profiler's internals anyway to deal with the new VM. The old VM could identify a frame's function by the value in local slot 0; the new one has to look up from instruction pointer values. Because this lookup can be expensive, the new profiler just writes sampled instruction pointer addresses into an array for later offline analysis, eventual distilling to a flat profile. It turns out that this information is exactly what's needed to do a tree profile like I did in chartprof. I had to add cycle detection to prevent the graphs from being enormous, but cycle detection makes much more sense in a tree output than in a flat profile. The result, distilled a bit:

> ,profile (compile-file "module/ice-9/psyntax-pp.scm") #:display-style tree
100.0% read-and-compile at system/base/compile.scm:208:0
  99.4% compile at system/base/compile.scm:237:0
    99.4% compile-fold at system/base/compile.scm:177:0
      75.3% compile-bytecode at language/cps/compile-bytecode.scm:568:0
        73.8% lower-cps at language/cps/compile-bytecode.scm:556:0
          41.1% optimize-higher-order-cps at language/cps/optimize.scm:86:0
          29.9% optimize-first-order-cps at language/cps/optimize.scm:106:0
          1.5% convert-closures at language/cps/closure-conversion.scm:814:0
      20.5% emit-bytecode at language/cps/compile-bytecode.scm:547:0
        18.5% visit-branch at language/cps/intmap.scm:514:5
          18.5% #x7ff420853318 at language/cps/compile-bytecode.scm:49:15
            18.5% compile-function at language/cps/compile-bytecode.scm:83:0
              18.5% allocate-slots at language/cps/slot-allocation.scm:838:0
      3.6% compile-cps at language/tree-il/compile-cps.scm:1071:0
        2.5% optimize at language/tree-il/optimize.scm:31:0
        0.6% cps-convert/thunk at language/tree-il/compile-cps.scm:924:0
        0.4% fix-letrec at language/tree-il/fix-letrec.scm:213:0
  0.6% compile-fold at system/base/compile.scm:177:0
    0.6% save-module-excursion at ice-9/boot-9.scm:2607:0
      0.6% #x7ff420b95254 at language/scheme/compile-tree-il.scm:29:3

I've uploaded the full file here, for the curious Guile hacker.

So what does it mean? The high-order bit is that we spend some 70% of the time in the optimizer. Indeed, running the same benchmark but omitting optimizations gets a much more respectable time:

$ time meta/uninstalled-env \
  guild compile -O0 module/ice-9/psyntax-pp.scm -o /tmp/foo.go
wrote `/tmp/foo.go'

real	0m3.050s
user	0m3.404s
sys	0m0.060s

One of the results of this investigation was that we should first compile the compiler with -O0 (no optimizations), then compile the compiler with -O2 (with optimizations). This change made it into the 2.1.1 release a couple months ago.

We also spend around 18.5% of time in slot allocation -- deciding what local variable slots to allocate to CPS variables. This takes time because we do a precise live variable analysis over the CPS, which itself has one variable for every result value and a label for every program point. Then we do register allocation, but in a way that could probably be optimized better. Perhaps with -O0 we should use a different strategy to allocate slots: one which preserves the values of variables that are available but dead. This would actually be an easier allocation task. An additional 1.5% is spent actually assembling the bytecode.

Interestingly, partial evaluation, CPS conversion, and a couple of other small optimizations together account for only 3.6% of time; and reading and syntax expansion account for only 0.6% of time. This is good news at least :)

up in the trees, down in the weeds

Looking at the top-down tree profile lets me see that the compiler is spending most of its time doing things that the Guile 2.0 compiler doesn't do: loop optimizations, good slot allocations, and so on. To an extent, then, it's to be expected that the Guile 2.2 compiler is slower. This also explains why the C fast-paths weren't so effective at improving performance: the per-operation costs for the already pretty low and adding C implementations wasn't enough of a speedup to matter. The problem was not that intmap-ref et al were slow, it was that code was calling them a lot.

Improving the optimizer has been a bit challenging, not least due to the many axes of "better". Guile's compiler ran faster before the switch to "CPS soup" and persistent data structures, but it produced code that ran slower because I wasn't able to write the optimizations that I would have liked. Likewise, Guile 2.0's compiler ran faster, because it did a worse job. But before switching to CPS soup, Guile's compiler also used more memory, because per-program-point and per-variable computations were unable to share space with each other.

I think the top-down profiler has given me a better point of view in this instance, as I can reason about what I'm doing on a structural level, which I wasn't able to understand from the flat profile. Still, it's possible to misunderstand the performance impact of leaf functions when they are spread all over a tree, and for that reason I think we probably need both kinds of profilers.

In the case of Guile's compiler I'm not sure that I'll change much at this point. We'll be able to switch to native compilation without a fundamental compiler rewrite. But spending most of the time in functions related to data structures still seems pretty wrong to me on some deep level -- what if the data structures were faster? What if I wrote the code in some other way that didn't need the data structures so much? It gnaws at me. It gnaws and gnaws.

the half strap

Unfortunately, while compiling Scheme to native code will probably speed up the compiler, it won't necessarily speed up the bootstrap. I think the compiler has some 800 KB of source code right now, and let's say that we're able to do native compilation with 1200 KB. So 50% more code, but probably the result is two to ten times faster on average: a win, in terms of compiler speed, when compiled. But for bootstrap time, because in the beginning of the bootstrap most of the compiler isn't compiled, it could well be a slowdown.

This is the disadvantage of bootstrapping from an interpreter -- the more compiler you write, the slower your strap.

Note that this is different from the case where you bootstrap from a compiled Scheme compiler. In our case we do a half-bootstrap, first building an interpreter in C, compiling the interpreter in Scheme, then bootstrapping off that.

It's a common trope in compiler development where the heroic, farsighted compiler hacker refuses to add optimizations unless they make the compiler bootstrap faster. Dybvig says as much in his "History of Chez Scheme" paper. Well, sure -- if you're willing to accept complete responsibility for bootstrapping. From my side, I'm terrified that I could introduce some error in a binary that could reproduce itself worm-like into all my work and it make it impossible to change anything. You think I jest, but the Sanely Bootstrappable Common Lisp papers instilled me with fear. Want to change your tagging scheme? You can't! Want to experiment with language, start programming using features from your own dialect? You can't! No, thank you. I value my sanity more than that.

Incidentally, this also answers a common question people have: can I use some existing Guile to compile a new Guile? The answer is tricky. You can if the two Guiles implement the same language and virtual machine. Guile-the-language is fairly stable. However, due to the way that the VM and the compiler are co-developed, some of the compiler is generated from data exported by libguile. If that information happens to be the same on your Guile, then yes, it's possible. Otherwise no. For this reason it's not something we describe, besides cross-compilers from the same version. Just half strap: it takes a while but it's fairly fool-proof.

and that's it!

Thanks for reading I guess. Good jobbies! Next time, some words on Lua. Until then, happy strapping!

Syndicated 2016-01-11 21:51:51 from wingolog

embracing conway's law

Most of you have heard of "Conway's Law", the pithy observation that the structure of things that people build reflects the social structure of the people that build them. The extent to which there is coordination or cohesion in a system as a whole reflects the extent to which there is coordination or cohesion among the people that make the system. Interfaces between components made by different groups of people are the most fragile pieces. This division goes down to the inner life of programs, too; inside it's all just code, but when a program starts to interface with the outside world we start to see contracts, guarantees, types, documentation, fixed programming or binary interfaces, and indeed faults as well: how many bug reports end up in an accusation that team A was not using team B's API properly?

If you haven't heard of Conway's law before, well, welcome to the club. Inneresting, innit? And so thought I until now; a neat observation with explanatory power. But as aspiring engineers we should look at ways of using these laws to build systems that take advantage of their properties.

in praise of bundling

Most software projects depend on other projects. Using Conway's law, we can restate this to say that most people depend on things built by other people. The Chromium project, for example, depends on many different libraries produced by many different groups of people. But instead of requiring the user to install each of these dependencies, or even requiring the developer that works on Chrome to have them available when building Chrome, Chromium goes a step further and just includes its dependencies in its source repository. (The mechanism by which it does this isn't a direct inclusion, but since it specifies the version of all dependencies and hosts all code on Google-controlled servers, it might as well be.)

Downstream packagers like Fedora bemoan bundling, but they ignore the ways in which it can produce better software at lower cost.

One way bundling can improve software quality is by reducing the algorithmic complexity of product configurations, when expressed as a function of its code and of its dependencies. In Chromium, a project that bundles dependencies, the end product is guaranteed to work at all points in the development cycle because its dependency set is developed as a whole and thus uniquely specified. Any change to a dependency can be directly tested against the end product, and reverted if it causes regressions. This is only possible because dependencies have been pulled into the umbrella of "things the Chromium group is responsible for".

Some dependencies are automatically pulled into Chrome from their upstreams, like V8, and some aren't, like zlib. The difference is essentially social, not technical: the same organization controls V8 and Chrome and so can set the appropriate social expectations and even revert changes to upstream V8 as needed. Of course the goal of the project as a whole has technical components and technical considerations, but they can only be acted on to the extent they are socially reified: without a social organization of the zlib developers into the Chromium development team, Chromium has no business automatically importing zlib code, because the zlib developers aren't testing against Chromium when they make a release. Bundling zlib into Chromium lets the Chromium project buffer the technical artifacts of the zlib developers through the Chromium developers, thus transferring responsibility to Chromium developers as well.

Conway's law predicts that the interfaces between projects made by different groups of people are the gnarliest bits, and anyone that has ever had to maintain compatibility with a wide range of versions of upstream software has the scar tissue to prove it. The extent to which this pain is still present in Chromium is the extent to which Chromium, its dependencies, and the people that make them are not bound tightly enough. For example, making a change to V8 which results in a change to Blink unit tests is a three-step dance: first you commit a change to Blink giving Chromium a heads-up about new results being expected for the particular unit tests, then you commit your V8 change, then you commit a change to Blink marking the new test result as being the expected one. This process takes at least an hour of human interaction time, and about 4 hours of wall-clock time. This pain would go away if V8 were bundled directly into Chromium, as you could make the whole change at once.

forking considered fantastic

"Forking" sometimes gets a bad rap. Let's take the Chromium example again. Blink forked from WebKit a couple years ago, and things have been great in both projects since then. Before the split, the worst parts in WebKit were the abstraction layers that allowed Google and Apple to use the dependencies they wanted (V8 vs JSC, different process models models, some other things). These abstraction layers were the reified software artifacts of the social boundaries between Google and Apple engineers. Now that the social division is gone, the gnarly abstractions are gone too. Neither group of people has to consider whether the other will be OK with any particular change. This eliminates a heavy cognitive burden and allows both projects to move faster.

As a pedestrian counter-example, Guile uses the libltdl library to abstract over the dynamic loaders of different operating systems. (Already you are probably detecting the Conway's law keywords: uses, library, abstract, different.) For years this library has done the wrong thing while trying to do the right thing, ignoring .dylib's but loading .so's on Mac (or vice versa, I can't remember), not being able to specify soversions for dependencies, throwing a stat party every time you load a library because it grovels around for completely vestigial .la files, et cetera. We sent some patches some time ago but the upstream project is completely unmaintained; the patches haven't been accepted, users build with whatever they have on their systems, and though we could try to take over upstream it's a huge asynchronous burden for something that should be simple. There is a whole zoo of concepts we don't need here and Guile would have done better to include libltdl into its source tree, or even to have forgone libltdl and just written our own thing.

Though there are costs to maintaining your own copy of what started as someone else's work, people who yammer on against forks usually fail to recognize their benefits. I think they don't realize that for a project to be technically cohesive, it needs to be socially cohesive as well; anything else is magical thinking.

not-invented-here-syndrome considered swell

Likewise there is an undercurrent of smarmy holier-than-thou moralism in some parts of the programming world. These armchair hackers want you to believe that you are a bad person if you write something new instead of building on what has already been written by someone else. This too is magical thinking that comes from believing in the fictional existence of a first-person plural, that there is one "we" of "humanity" that is making linear progress towards the singularity. Garbage. Conway's law tells you that things made by different people will have different paces, goals, constraints, and idiosyncracies, and the impedance mismatch between you and them can be a real cost.

Sometimes these same armchair hackers will shake their heads and say "yeah, project Y had so much hubris and ignorance that they didn't want to bother understanding what X project does, and they went and implemented their own thing and made all their own mistakes." To which I say, so what? First of all, who are you to judge how other people spend their time? You're not in their shoes and it doesn't affect you, at least not in the way it affects them. An armchair hacker rarely understands the nature of value in an organization (commercial or no). People learn more when they write code than when they use it or even when they read it. When your product has a problem, where will you find the ability to fix it? Will you file a helpless bug report or will you be able to fix it directly? Assuming your software dependencies model some part of your domain, are you sure that their models are adequate for your purpose, with the minimum of useless abstraction? If the answer is "well, I'm sure they know what they're doing" then if your organization survives a few years you are certain to run into difficulties here.

One example. Some old-school Mozilla folks still gripe at Google having gone and created an entirely new JavaScript engine, back in 2008. This is incredibly naïve! Google derives immense value from having JS engine expertise in-house and not having to coordinate with anyone else. This control also gives them power to affect the kinds of JavaScript that gets written and what goes into the standard. They would not have this control if they decided to build on SpiderMonkey, and if they had built on SM, they would have forked by now.

As a much more minor, insignificant, first-person example, I am an OK compiler hacker now. I don't consider myself an expert but I do all right. I got here by making a bunch of mistakes in Guile's compiler. Of course it helps if you get up to speed using other projects like V8 or what-not, but building an organization's value via implementation shouldn't be discounted out-of-hand.

Another point is that when you build on someone else's work, especially if you plan on continuing to have a relationship with them, you are agreeing up-front to a communications tax. For programmers this cost is magnified by the degree to which asynchronous communication disrupts flow. This isn't to say that programmers can't or shouldn't communicate, of course, but it's a cost even in the best case, and a cost that can be avoided by building your own.

When you depend on a project made by a distinct group of people, you will also experience churn or lag drag, depending on whether the dependency changes faster or slower than your project. Depending on LLVM, for example, means devoting part of your team's resources to keeping up with the pace of LLVM development. On the other hand, depending on something more slow-moving can make it more difficult to work with upstream to ensure that the dependency actually suits your use case. Again, both of these drag costs are magnified by the asynchrony of communicating with people that probably don't share your goals.

Finally, for projects that aim to ship to end users, depending on people outside your organization exposes you to risk. When a security-sensitive bug is reported on some library that you use deep in your web stack, who is responsible for fixing it? If you are responsible for the security of a user-facing project, there are definite advantages for knowing who is on the hook for fixing your bug, and knowing that their priorities are your priorities. Though many free software people consider security to be an argument against bundling, I think the track record of consumer browsers like Chrome and Firefox is an argument in favor of giving power to the team that ships the product. (Of course browsers are terrifying security-sensitive piles of steaming C++! But that choice was made already. What I assert here is that they do well at getting security fixes out to users in a timely fashion.)

to use a thing, join its people

I'm not arguing that you as a software developer should never use code written by other people. That is silly and I would appreciate if commenters would refrain from this argument :)

Let's say you have looked at the costs and the benefits and you have decided to, say, build a browser on Chromium. Or re-use pieces of Chromium for your own ends. There are real costs to doing this, but those costs depend on your relationship with the people involved. To minimize your costs, you must somehow join the community of people that make your dependency. By joining yourself to the people that make your dependency, Conway's law predicts that the quality of your product as a whole will improve: there will be fewer abstraction layers as your needs are taken into account to a greater degree, your pace will align with the dependency's pace, and colleagues at Google will review for you because you are reviewing for them. In the case of Opera, for example, I know that they are deeply involved in Blink development, contributing significantly to important areas of the browser that are also used by Chromium. We at Igalia do this too; our most successful customers are those who are able to work the most closely with upstream.

On the other hand, if you don't become part of the community of people that makes something you depend on, don't be surprised when things break and you are left holding both pieces. How many times have you heard someone complain the "project A removed an API I was using"? Maybe upstream didn't know you were using it. Maybe they knew about it, but you were not a user group they cared about; to them, you had no skin in the game.

Foundations that govern software projects are an anti-pattern in many ways, but they are sometimes necessary, born from the need for mutually competing organizations to collaborate on a single project. Sometimes the answer for how to be able to depend on technical work from others is to codify your social relationship.

hi haters

One note before opening the comment flood: I know. You can't control everything. You can't be responsible for everything. One way out of the mess is just to give up, cross your fingers, and hope for the best. Sure. Fine. But know that there is no magical first-person-plural; Conway's law will apply to you and the things you build. Know what you're actually getting when you depend on other peoples' work, and know what you are paying for it. One way or another, pay for it you must.

Syndicated 2015-11-09 13:48:51 from wingolog

two paths, one peak: a view from below on high-performance language implementations

Ohmigod it's November. Time flies amirite. Eck-setra. These are not actually my sentiments but sometimes I do feel like a sloth or a slow loris, grasping out at quarter-speed. Once I get a hold it's good times, but hoo boy. The tech world churns and throws up new languages and language implementations every year and how is it that in 2015, some 20 years after the project was started, Guile still doesn't do native compilation?

Though I've only been Guiling for the last 10 years or so, this article aims to plumb those depths; and more than being an apology or a splain I want to ponder the onward journey from the here and the now. I was going to write something like "looking out from this peak to the next higher peak" but besides being a cliché that's exactly what I don't mean to do. In Guile performance work I justify my slow loris grip by a mistrust of local maxima. I respect and appreciate the strategy of going for whatever gains you the most in the short term, especially in a commercial context, but with a long view maybe this approach is a near win but a long lose.

That's getting ahead of myself; let's get into this thing. We started byte-compiling Guile around 2008 or so. Guile is now near to native compilation. Where are we going with this thing?

short term: template jit

The obvious next thing to do for Guile would be to compile its bytecodes to machine code using a template JIT. This strategy just generates machine code for each bytecode instruction without regard to what comes before or after. It's dead simple. Guile's bytecode is quite well-suited to this technique, even, in the sense that an instruction doesn't correspond to much code. As Guile has a register-based VM, its instructions will also specialize well against their operands when compiled to native code. The only global state that needs to be carried around at runtime is the instruction pointer and the stack pointer, both of which you have already because of how modern processors work.

Incidentally I have long wondered why CPython doesn't have a template JIT. Spiritually I am much more in line with the PyPy project but if I were a CPython maintainer, I would use a template JIT on the bytecodes I already have. Using a template JIT preserves the semantics of bytecode, including debugging and introspection. CPython's bytecodes are at a higher level than Guile's though, with many implicit method/property lookups (at least the last time I looked at them), and so probably you would need to add inline caches as well; but no biggie. Why aren't the CPython people doing this? What is their long-term perf story anyway -- keep shovelling C into the extension furnace? Lose to PyPy?

In the case of Guile we are not yet grasping in this direction because we don't have (direct) competition from PyPy :) But also there are some problems with a template JIT. Once you internalize the short-term mentality of a template JIT you can get stuck optimizing bytecode, optimizing template JIT compilation, and building up a baroque structure that by its sheer mass may prevent you from ever building The Right Thing. You will have to consider how a bytecode-less compilation pipeline interacts with not only JITted code but also bytecode, because it's a lose to do a template JIT for code that is only executed once.

This sort of short-term thinking is what makes people also have to support on-stack replacement (OSR), also known as hot loop transfer. The basic idea is that code that executes often has to be JITted to go fast, but you can't JIT everything because that would be slow. So you wait to compile a function until it's been called a few times; fine. But with loops it could be that a function is called just once but a loop in the function executes many times. You need to be able to "tier up" to the template JIT from within a loop. This complexity is needed at the highest performance level, but if you choose to do a template JIT you're basically committing to implementing OSR early on.

Additionally the implementation of a template JIT compiler is usually a bunch of C or C++ code. It doesn't make sense to include a template JIT in a self-hosted system that compiles to bytecode, because it would be sad to have the JIT not be written in the source language (Guile Scheme in our case).

Finally in Scheme we have tail-call and delimited continuation considerations. Currently in Guile all calls happen in the Guile bytecode interpreter, which makes tail calls easy -- the machine frame stays the same and we just have to make a tail call on the Scheme frame. This is fine because we don't actually control the machine frame (the C frame) of the bytecode interpreter itself -- the C compiler just does whatever it does. But how to tail call between the bytecode interpreter and JIT-compiled code? You'd need to add a trampoline beneath both the C interpreter and any entry into compiled code that would trampoline to the other implementation, depending on how the callee "returns". And how would you capture stack slices with delimited continuations? It's possible (probably -- I don't know how to reinstate a delimited continuation with both native and interpreted frames), but something of a headache, and is it really necessary?

if you compile ahead-of-time anyway...

The funny thing about CPython is that, like Guile, it is actually an ahead-of-time compiler. While the short-term win would certainly be to add a template JIT, because the bytecode is produced the first time a script is run and cached thereafter, you might as well compile the bytecode to machine code ahead-of-time too and skip the time overhead of JIT compilation on every run. In a template JIT, the machine code is only a function of the bytecode (assuming the template JIT doesn't generate code that depends on the shape of the heap).

Compiling native code ahead of time also saves on memory usage, because you can use file-backed mappings that can be lazily paged in and shared between multiple processes, and when these pages are in cache that translates also to faster startup too.

But if you're compiling bytecode ahead of time to native code, what is the bytecode for?

(not) my beautiful house

At some point you reach a state where you have made logical short-term decisions all the way and you end up with vestigial organs of WTF in your language runtime. Bytecode, for example. A bytecode interpreter written in C. Object file formats for things you don't care about. Trampolines. It's time to back up and consider just what it is that we should be building.

The highest-performing language implementations will be able to compile together the regions of code in which a program spends most of its time. Ahead-of-time compilers can try to predict these regions, but you can't always know what the behavior of a program will be. A program's run-time depends on its inputs, and program inputs are late-bound.

Therefore these highest-performing systems will use some form of adaptive optimization to apply run-time JIT compilation power on whatever part of a program turns out to be hot. This is the peak performance architecture, and indeed in the climb to a performant language implementation, there is but one peak that I know of. The question becomes, how to get there? What path should I take, with the priorities I have and the resources available to me, which lets me climb the farthest up the hill while always leaving the way clear to the top?

guile's priorities

There are lots of options here, and instead of discussing the space as a whole I'll just frame the topic with some bullets. Here's what I want out of Guile:

  1. The project as a whole should be pleasing to hack on. As much of the system as possible should be written in Scheme, as little as possible in C or assembler, and dependencies on outside projects should be minimized.

  2. Guile users should be able to brag about startup speed to their colleagues. We are willing to trade away some peak throughput for faster startup, if need be.

  3. Debuggability is important -- a Guile hacker will always want to be able to get stack traces with actual arguments and local variable values, unless they stripped their compiled Guile binaries, which should be possible as well. But we are willing to give up some debuggability to improve performance and memory use. In the same way that a tail call replaces the current frame in its entirety, we're willing to lose values of dead variables in stack frames that are waiting on functions to return. We're also OK with other debuggability imprecisions if the performance gains are good enough. With macro expansion, Scheme hackers expect a compilation phase; spending time transforming a program via ahead-of-time compilation is acceptable.

Call it the Guile Implementor's Manifesto, or the manifesto of this implementor at least.

beaucoup bucks

Of course if you have megabucks and ace hackers, then you want to dial back on the compromises: excellent startup time but also source-level debugging! The user should be able to break on any source position: the compiler won't even fold 1 + 1 to 2. But to get decent performance you need to be able to tier up to an optimizing compiler soon, and soon in two senses: soon after starting the program, but also soon after starting your project. It's an intimidating thing to build when you are just starting on a language implementation. You need to be able to tier down too, at least for debugging and probably for other reasons too. This strategy goes in the right direction, performance-wise, but it's a steep ascent. You need experienced language implementors, and they are not cheap.

The usual strategy for this kind of implementation is to write it all in C++. The latency requirements are too strict to do otherwise. Once you start down this road, you never stop: your life as an implementor is that of a powerful, bitter C++ wizard.

The PyPy people have valiently resisted this trend, writing their Python implementation in Python itself, but they concede to latency by compiling their "translated interpreter" into C, which then obviously can't itself be debugged as Python code. It's self-hosting, but staged into C. Ah well. Still, a most valiant, respectable effort.

This kind of language implementation usually has bytecode, as it's a convenient reification of the source semantics, but it doesn't have to. V8 is a good counterexample, at least currently: it treats JavaScript source code as the canonical representation of program semantics, relying on its ability to re-parse source text to an AST in the same way every time as needed. V8's first-tier implementation is actually a simple native code compiler, generated from an AST walk. But things are moving in the bytecode direction in the V8 world, reluctantly, so we should consider bytecode as the backbone of the beaucoup-bucks language implementation.

shoestring slim

If you are willing to relax on source-level debugging, as I am in Guile, you can simplify things substantially. You don't need bytecode, and you don't need a template JIT; in the case of Guile, probably the next step in Guile's implementation is to replace the bytecode compiler and interpreter with a simple native code compiler. We can start with the equivalent of a template JIT, but without the bytecode, and without having to think about the relationship between compiled and (bytecode-)interpreted code. (Guile still has a traditional tree-oriented interpreter, but it is actually written in Scheme; that is a story for another day.)

There's no need to stop at a simple compiler, of course. Guile's bytecode compiler is already fairly advanced, with interprocedural optimizations like closure optimization, partial evaluation, and contification, as well as the usual loop-invariant code motion, common subexpression elimination, scalar replacement, unboxing, and so on. Add register allocation and you can have quite a fine native compiler, and you might even beat the fabled Scheme compilers on the odd benchmark. They'll call you plucky: high praise.

There's a danger in this strategy though, and it's endemic in the Scheme world. Our compilers are often able to do heroic things, but only on the kinds of programs they can fully understand. We as Schemers bend ourselves to the will of our compilers, writing only the kinds of programs our compilers handle well. Sometimes we're scared to fold, preferring instead to inline the named-let iteration manually to make sure the compiler can do its job. We fx+ when we should +; we use tagged vectors when we should use proper data structures. This is déformation professionelle, as the French would say. I gave a talk at last year's Scheme workshop on this topic. PyPy people largely don't have this problem, for example; their langauge implementation is able to see through abstractions at run-time to produce good code, but using adaptive optimization instead of ahead-of-time trickery.

So, an ahead-of-time compiler is perhaps a ridge, but it is not the peak. No amount of clever compilation will remove the need for an adaptive optimizer, and indeed too much cleverness will stunt the code of your users. The task becomes, how to progress from a decent AOT native compiler to a system with adaptive optimization?

Here, as far as I know, we have a research problem. In Guile we have mostly traced the paths of history, re-creating things that existed before. As Goethe said, quoted in the introduction to The Joy of Cooking: "That which thy forbears have bequeathed to thee, earn it anew if thou wouldst possess it." But finally we find here something new, or new-ish: I don't know of good examples of AOT compilers that later added adaptive optimization. Do you know of any, dear reader? I would be delighted to know.

In the absence of a blazed trail to the top, what I would like to do is to re-use the AOT compiler to do dynamic inlining. We might need to collect type feedback as well, though inlining is the more important optimization. I think we can serialize the compiler's intermediate representation into a special section in the ELF object files that Guile produces. A background thread or threads can monitor profiling information from main threads. If a JIT thread decides two functions should be inlined, it can deserialize compiler IR and run the standard AOT compiler. We'd need a bit of mutability in the main program in which to inject such an optimization; an inline cache would do. If we need type feedback, we can make inline caches do that job too.

All this is yet a ways off. The next step for Guile, after the 2.2 release, is a simple native compiler, then register allocation. Step by step.

but what about llvmmmmmmmmmmmmm

People always ask about LLVM. It is an excellent compiler backend. It's a bit big, and maybe you're OK with that, or maybe not; whatever. Using LLVM effectively depends on your ability to deal with churn and big projects. But if you can do that, swell, you have excellent code generation. But how does it help you get to the top? Here things are less clear. There are a few projects using LLVM effectively as a JIT compiler, but that is a very recent development. My hubris, desire for self-hosting, and lack of bandwidth for code churn makes it so that I won't use LLVM myself but I have no doubt that a similar strategy to that which I outline above could work well for LLVM. Serialize the bitcode into your object files, make it so that you can map all optimization points to labels in that bitcode, and you have the ability to do some basic dynamic inlining. Godspeed!


If you're interested, I gave a talk a year ago on the state of JavaScript implementations, and how they all ended up looking more or less the same. This common architecture was first introduced by Self; languages implementations in this category include HotSpot and any of the JavaScript implementations.

Some notes on how PyPy produces interpreters from RPython.

and so I bid you good night

Guile's compiler has grown slowly, in tow of my ballooning awareness of ignorance and more slowly inflating experience. Perhaps we could have done the native code compilation thing earlier, but I am happy with our steady progress over the last five years or so. We had to scrap one bytecode VM and one or two compiler intermediate representations, and though that was painful I think we've done pretty well as far as higher-order optimizations go. If we had done native compilation earlier, I can't but think the inevitably wrong decisions we would have made on the back-end would have prevented us from having the courage to get the middle-end right. As it is, I see the way to the top, through the pass of ahead-of-time compilation and thence to a dynamic inliner. It will be some time before we get there, but that's what I signed up for :) Onward!

Syndicated 2015-11-03 23:47:10 from wingolog

type folding in guile

A-hey hey hey, my peeps! Today's missive is about another optimization pass in Guile that we call "type folding". There's probably a more proper name for this, but for the moment we go with "type folding" as it's shorter than "abstract constant propagation, constant folding, and branch folding based on flow-sensitive type and range analysis".

on types

A word of warning to the type-system enthusiasts among my readers: here I'm using "type" in the dynamic-languages sense, to mean "a property about a value". For example, whether a value is a vector or a pair is a property of that value. I know that y'all use that word for other purposes, but there are other uses that do not falute so highly, and it's in the more pedestrian sense that I'm interested here.

To back up a bit: what are the sources of type information in dynamic languages? In Guile, there are three ways the compiler can learn about a value's type.

One source of type information is the compiler's knowledge of the result types of expressions in the language, especially constants and calls to the language's primitives. For example, in the Scheme definition (define y (vector-length z)), we know that y is a non-negative integer, and we probably also know a maximum value for y too, given that vectors have a maximum size.

Conditional branches with type predicates also provide type information. For example, in consider this Scheme expression:

(lambda (x)
  (if (pair? x)
      (car x)
      (error "not a pair" x)))

Here we can say that at the point of the (car x) expression, x is definitely a pair. Conditional branches are interesting because they add a second dimension to type analysis. The question is no longer "what is the type of all variables", but "what is the type of all variables at all points in the program".

Finally, we have the effect of argument type checks in function calls. For example in the (define y (vector-length z)) definition, after (vector-length z) has been evaluated, we know that z is indeed a vector, because if it weren't, then the primitive call would raise an exception.

In summary, the information that we would like to have is what type each variable has at each program point (label). This information comes from where the variables are defined (the first source of type information), conditional branches and control-flow joins (the second source), and variable use sites that imply type checks (the third). It's a little gnarly but in essence it's a classic flow analysis. We treat the "type" of a variable as a set of possible types. A solution to the flow equations results in a set of types for each variable at each label. We use the intmap data structures to share space between the solution at different program points, resulting in an O(n log n) space complexity.

In Guile we also solve for the range of values a variable may take, at the same time as solving for type. I started doing this as part of the Compost hack a couple years ago, where I needed to be able to prove that the operand to sqrt was non-negative in order to avoid sqrt producing complex numbers. Associating range with type turns out to generalize nicely to other data types which may be thought of as having a "magnitude" -- for example a successful (vector-ref v 3) implies that v is at least 4 elements long. Guile can propagate this information down the flow graph, or propagate it in the other way: if we know the vector was constructed as being 10 elements long, then a successful (vector-ref v n) can only mean that n is between 0 and 9.

what for the typing of the things

Guile's compiler uses type analysis in a few ways at a few different stages. One basic use is in dead code elimination (DCE). An expression can be eliminated from a program if its value is never used and if it causes no side effects. Guile models side effects (and memory dependencies between expressions) with effects analysis. I quote:

We model four kinds of effects: type checks (T), allocations (A), reads (R), and writes (W). Each of these effects is allocated to a bit. An expression can have any or none of these effects.

In an expression like (vector-ref v n), type analysis may compute that in fact v is indeed a vector and n is an integer, and also that n is within the range of valid indexes of v. In that case we can remove the type check (T) bit from the expression's effects, opening up the expression for DCE.

Getting back to the topic of this article, Guile's "type folding" pass uses type inference in three ways.

The first use of type information is if we determine that, at a given use site, a variable has precisely one type and one value. In that case we can do constant folding over that expression, replacing its uses with its value. For example, let's say we have the expression (define len (vector-length v)). If we know that v is a vector of length length 5, we can replace any use of len with the constant, 5. As an implementation detail we actually keep the definition of len in place and let DCE remove it later. We can consider this to be abstract constant propagation: abstract in the sense that it folds over abstract values, represented just as type sets and ranges, and which materializes a concrete value only if it is able to do so. Since ranges propagate through operators as well, it can also be considered as abstract constant folding; the type inference operators act as constant folders.

Another use of type information is in branches. If Guile sees (if (< n (vector-length v)) 1 2) and n and v have the right types and disjoint ranges, then we can fold the test and choose 1 or 2 depending on how the test folds.

Finally type information can enable strength reduction. For example it's a common compiler trick to want to reduce (* n 16) to (ash n 4), but if n isn't an integer this isn't going to work. Likewise, (* n 0) can be 0, 0.0, 0.0+0.0i, something else, or an error, depending on the type of n and whether the * operator has been extended to apply over non-number types. Type folding uses type information to reduce the strength of operations like these, but only where it can prove that the transformation is valid.

So that's type folding! It's a pretty neat pass that does a few things as once. Code here, and code for the type inference itself here.

type-driven unboxing

Guile uses type information in one other way currently, and that is to determine when to unbox floating-point numbers. The current metric is that whenever an arithmetic operation will produce a floating-point number -- in Scheme parlance, an inexact real -- then that operation should be unboxed, if it has an unboxed counterpart. Unboxed operations on floating-point numbers are advantageous because they don't have to allocate space on the garbage-collected heap for their result. Since an unboxed operation like the fadd floating-point addition operator takes raw floating-point numbers as operands, it also will never cause a type check, unlike the polymorphic add instruction. Knowing that fadd has no effects lets the compiler do a better job at common subexpression elimination (CSE), dead code elimination, loop-invariant code motion, and so on.

To unbox an operation, its operands are unboxed, the operation itself is replaced with its unboxed counterpart, and the result is then boxed. This turns something like:

(+ a b)


(f64->scm (fl+ (scm->f64 a) (scm->f64 b)))

You wouldn't think this would be an optimization, except that the CSE pass can eliminate many of these conversion pairs using its scalar elimination via fabricated expressions pass.

A proper flow-sensitive type analysis is what enables sound, effective unboxing. After arithmetic operations have been unboxed, Guile then goes through and tries to unbox loop variables and other variables with more than one definition ("phi' variables, for the elect). It mostly succeeds at this. The results are great: summing a packed vector of 10 million 32-bit floating-point values goes down from 500ms to 130ms, on my machine, with no time spent in the garbage collector. Once we start doing native compilation we should be up to about 5e8 or 10e8 floats per second in this microbenchmark, which is totally respectable and about what gcc's -O0 performance gets.


This kind of type inference works great in tight loops, and since that's how many programs spend most of their time, that's great. Of course, this situation is also a product of programmers knowing that tight loops are how computers go the fastest, or at least of how compilers do the best on their code.

Where this approach to type inference breaks down is at function boundaries. There are no higher-order types and no higher-order reasoning, and indeed no function types at all! This is partially mitigated by earlier partial evaluation and contification passes, which open up space in which the type inferrer can work. Method JIT compilers share this weakness with Guile; tracing JIT runtimes like LuaJIT and PyPy largely do not.

up summing

So that's the thing! I was finally motivated to dust off this draft given the recent work on unboxing in a development branch of Guile. Happy hacking and we promise we'll actually make releases so that you can use it soon soon :)

Syndicated 2015-10-29 22:13:27 from wingolog

441 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!