nikodemus is currently certified at Journeyer level.

Name: Nikodemus Siivola
Member since: 2003-04-17 09:27:17
Last Login: 2008-06-07 13:49:51

FOAF RDF Share This

Homepage: http://random-state.net/

Notes:

I finally have access to my Advogato account again after several years... Let's see if I can make this thing get updates from my actual blog.

Recent blog entries by nikodemus

Syndication: RSS 2.0

Efficient Doesn't Equal Performant

This is a bit of a rant, I guess. Sorry about that.

A few years back I needed a binary heap, and I needed one that was fast and thread safe. So I wrote Pileup.

There are other heaps for Common Lisp, and some of them support operations Pileup doesn't implement out of the box, and all of them claim to be efficient.

...and I'm sure that algorithmically they are. However, constant factors matter more often than you might think. I'll single out CL-HEAP primarily because it has such an authoritative name. :)

A tiny benchmark:

(defvar *1M-numbers* (loop repeat 100000
                           collect (random most-positive-fixnum)))

(defvar *4K-numbers* (loop repeat 4000
                           collect (random most-positive-fixnum)))

(defun insert-and-pop (insert pop heap things)
  (declare (function insert pop))
  (dolist (thing things)
    (funcall insert heap thing))
  (loop for thing = (funcall pop heap)
        while thing))

(defun make-insert-and-pop (make insert pop things)
  (declare (function make))
  (insert-and-pop insert pop (funcall make) things))

(defun time-insert-and-pop (insert pop heap things)
  ;; Time 4 runs.
  (time
   (loop repeat 4 do (insert-and-pop insert pop heap things)))
  t)

(defun time-make-insert-and-pop (make insert pop things)
  ;; Time 1000 runs.
  (time
   (loop repeat 1000 do (make-insert-and-pop make insert pop things)))
  t)

(defun cl-heap-make ()
  (make-instance 'cl-heap:priority-queue))

(defun cl-heap-insert (heap thing)
  (cl-heap:enqueue heap thing thing))

(defun cl-heap-pop (heap)
  (cl-heap:dequeue heap))

(defun pileup-make ()
  (pileup:make-heap #'

Results: (median of three runs for each)

;;; CL-HEAP: insert and pop 1M numbers x 4 into a single heap
Evaluation took:
  6.077 seconds of real time
  6.054307 seconds of total run time (5.871448 user, 0.182859 system)
  [ Run times consist of 0.300 seconds GC time, and 5.755 seconds non-GC time. ]
  99.62% CPU
  15,758,370,862 processor cycles
  208,389,696 bytes consed

;;; PILEUP: insert and pop 1M numbers x 4 into a single heap
Evaluation took:
  0.409 seconds of real time
  0.410249 seconds of total run time (0.409089 user, 0.001160 system)
  100.24% CPU
  1,060,531,810 processor cycles
  3,053,296 bytes consed

;;; CL-HEAP: make 1K heaps, insert and pop 4K numbers
Evaluation took:
  38.221 seconds of real time
  38.051652 seconds of total run time (37.799509 user, 0.252143 system)
  [ Run times consist of 0.433 seconds GC time, and 37.619 seconds non-GC time. ]
  99.56% CPU
  99,125,251,225 processor cycles
  1,940,254,144 bytes consed

;;; PILEUP: make 1K heaps, insert and pop 4K numbers
Evaluation took:
  3.468 seconds of real time
  3.476932 seconds of total run time (3.453248 user, 0.023684 system)
  [ Run times consist of 0.021 seconds GC time, and 3.456 seconds non-GC time. ]
  100.26% CPU
  8,991,845,681 processor cycles
  98,563,520 bytes consed

(I was also going to compare parallel performance, but CL-HEAP doesn't appear to be thread-safe, so...)

This is not to disparage CL-HEAP: it supports things which Pileup doesn't, but it clearly isn't written with constant factors in mind, and this shows.

Constant factors matter.

(Admittedly, I tested this only on SBCL, and it might turn out that CL-HEAP does a lot better -- and Pileup a lot worse -- on some other implementation. This does not alter my main contention that you ignore constant factors at your own peril.)

Syndicated 2013-03-16 12:24:17 from Nikodemus Siivola

Userspace Threads for SBCL -- a short discussion

(This is in response to a wish from an IndieGoGo funder.)

First off a disclaimer: I'm not really into green threads / fibers distinction, so I'm just going to be rambling about userspace threads in general. I'm also making the assumption that having the option of userspace threads in addition to native threads would be a good thing, and not spending time on ramifications of that.

I'm also not working, or planning to work on this area in near future. Consider this a vague roadmap for those wanting to look into doing this.

Are Some Threads More Equal Than Others?

How are userspace threads distinct from native threads?

Does

(subtypep 'userspace-thread 'thread)

hold? Is there going to even be a distinct userspace thread type?

Because semantically userspace threads should mostly be indistinguishable from native threads (ie. dynamic binding works the same way, locks work the same way, etc), I think they should indeed be just like threads except for the "who is responsible for scheduling" bit.

So I'm thinking all lisp threads are really going to be userspace threads, and it's just that some of them have a dedicated OS thread from the start.

Let's say MAKE-THREAD grows an argument :RUN, which defaults to true. If it's NIL you get an inert suspended thread object that won't run until someone yields to it.

So from lisp land all threads are going to look identical -- but with the new distinction that some lisp threads may be in suspended state, not being currently run by any OS thread.

Lies, Statistics, and Schedules

How does scheduling work?

Do users of userspace threads need explicit scheduling, or does the system eg. automatically schedule them on blocking IO?

I think both have merits, but automatic scheduling on blocking IO is really explicit scheduling under the hood, so let's consider that only.

We already have #'THREAD-YIELD. Let's just add an optional argument to it. If that argument is a thread that isn't currently running, we yield to it -- otherwise we will just consider it a hint that our processor resources could be better spent elsewhere now. (Or maybe "yield execution context" and "yield processor time" should not be mixed up? Hm.)

It's possible that we may also want SUSPEND-THREAD, which dissociates a lisp thread from the underlying OS thread, and RUN-THREAD which starts a new OS thread in order to run a suspended lisp thread, but I'll ignore them for now.

One thing that this does mean, which the current system may have conflicting assumptions about, is that the OS thread associated with a single lisp thread may change over it's lifetime. This needs to be checked. (Or indeed that a thread has an OS thread associated with it!)

We're also going to need critical sections during which the scheduling status of the thread (ie, if it's currently running == associated with an OS thread) cannot be changed. Not sure if WITHOUT-INTERRUPTS should subsume this, or if it has to be distinct from it.

Enough About Design, Let's Do This!

So, what does a thread need? A stack and a context, pretty much.

I'll make the wild assumption that we're on a platform with fully functional swapcontext(3). I have sometimes heard whispered that those API's aren't all that great, so it's possible that we may need to implement them in asm on our own -- but I haven't ever used them personally, so I don't really claim to know.

If that is how we're going to be switching from one userspace thread to another, how do we make it play nice with the rest of SBCL?

Let's start by taking a look at MAKE-THREAD. At first blush at least it looks to me like the only thing that really needs to be different for userspace threads is the call to %CREATE-THREAD, which currently ends up doing the following:

  • Creates the C-side thread struct, which contains the stack(s) and thread-local bindings, and a bunch of other stuff. What it doesn't currently have is space for everything swapcontext(3) needs, so we'll need to add that.

  • Creates the OS thread, including all the signal handling setup, etc.

    Definitely prime reading ground for anyone looking to add userspace threads to SBCL: the stuff that needs to happen when we switch to a new thread is going to look a lot like create_os_thread and new_thread_trampoline.

    This gets factored into RUN-THREAD and THREAD-YIELD, pretty much (or at least the C-code those will end up calling). Not rocket science, but a lot of details...

(Unless you're just skimming this, go ahead and at least skim the relevant parts of the code.)

The other end of thread lifetime is another obvious place to look in -- but mostly it comes down to undoing whatever was done when the thread was created. This raises a hairy design question, though: do OS threads die when the current thread associated with them dies? I don't know. I suspect this points to problem in my overall design, but possibly it is a simple policy question.

The final place that needs attention is GC: it needs to be able to find all C-side thread structs in order to scavenge their stacks, and it needs to know how to scavenge the contexts of suspended threads as well -- not rocket science, again, but details.

Is this all? Probably not!

I'm pretty sure signal handling needs some very careful consideration -- but if WITHOUT-INTERRUPTS also means "without userspace thread state changes", then possibly current code is a decent match.

I think the easiest way to find out what is missing, however, is to start working towards an implementation.

The biggest issue with this sketch in my mind is the question of thread death mentioned above. The easiest way to solve it (not necessarily the best!) would be to say that each OS thread does indeed die when the currently executing lisp thread dies. The second easiest would be to have something like QUEUE-THREAD, which would mean that when the next lisp thread dies, the queued one should receive the OS thread instead of it going the way of the dodo.

...and now I'm out of time, and this still needs proofreading. Hopefully this inspires someone to do somethign awesome. :)

Happy Hacking!

Addendum: locking really needs thinking about. Suspending a thread that holds locks is not going to end well, or yielding while holding locks. Not sure if the locking API should be clever about this, or if this can all be punted to the users.

Syndicated 2012-10-06 13:22:47 from Nikodemus Siivola

Is That A Rest-List In Your Pocket?

Prelude

SBCL has for a while now been able to elide &REST list allocation when it is only used as an argument to APPLY, so

(defun foo (&rest args) (apply #'bar args))

is non-consing if BAR is. Note: I'm not saying eliding heap-allocation, I'm saying eliding list allocation completely: instead of moving the arguments into a stack or heap allocated list and then pulling them out later, the compiler generates code that directly passes them as arguments to FOO.

This doesn't make a difference to stack-allocation if all you look at is heap-consing, but it does save a noticeable amount of work at runtime, and it doesn't break tail-calls like stack allocation does.

That's how far it went, however: if you did anything else with the rest-list, the compiler gave up and allocated the full list -- on stack if you asked for that using DYNAMIC-EXTENT.

First Act

Earlier this week Nathan Froyd (an SBCL hacker, and him of Ironclad-fame) committed a change that fixed a rather embarrassing oversight: we were heap allocating the rest-lists in full calls to vararg entry points to arithmetic functions like + and *.

This is less catastrophic for most code than you might imagine, since SBCL works pretty hard to call more efficient entry points -- so those calls are virtually never seen in performance sensitive code.

Doesn't make it any less embarrassing, though. Years and years it's been like that, until Nathan noticed.

Nathan fixed the heap consing by adding DYNAMIC-EXTENT declarations to those functions involved, which not only reduced GC pressure a bit, but provided a small performance boost.

Second Act

Adding those DYNAMIC-EXTENT declarations had another side effect as well -- a couple of backtrace tests broke from to unexpected frames, due to tail-calls being foiled by the stack allocation: several tests used division by zero to trigger an error, so the arithmetic changes showed up there.

That would have been a fair tradeoff, and the backtrace tests could just have been adjusted to allow the extra frames, but we could do a bit better.

SBCL has an extra (internal, unsupported) lambda-list keyword: SB-INT:&MORE, which is a fair deal hairier to use than &REST, but allows dealing with variable arguments without any consing -- heap or stack. So those arithmetic functions got changed to use SB-INT:&MORE instead, which fixed the backtrace tests and gave another small performance boost.

Third Act

I was looking at the SB-INT:&MORE changes, and wondering if we should expose it to users as well, since it obviously is useful occasionally -- and what kind of interface cleanup that would entail.

Thinking about that I realized that I could just extend the compiler smarts for dealing with &REST instead. Under the hood, when SBCL optimizes an APPLY with a rest list as the final argument, it actually changes into using &MORE.

So, I extended that part of the compiler to deal with (for starters) a few list functions that would be sufficient for implementing the arithmetic functions using rest lists and compiler magic.

The conversion path looks roughly like this:

;;; Original source, using LENGTH as an example
(defun foo (&rest args)
  (length args))

;;; Compiler adds hidden &MORE arguments when it sees the &REST.
(lambda (&rest args &more #:context #:count)
  (length args))

;;; Source level transformation notices LENGTH is applied to a &REST
;;; argument and transform into %REST-LENGTH.
(lambda (&rest args &more #:context #:count)
  (sb-c::%rest-length args #:context #:count))

;;; During optimization another transformation sees the %REST-LENGTH,
;;; and verifies that the rest list is never modified, or used in
;;; any place that would require actually allocating it -- this being
;;; the case, it proceeds.
(lambda (&rest args &more #:context #:count)
  #:count)

;;; Since the rest list isn't used anymore, it is deleted.
(lambda (&more #:context #:count)
  #:count)

That's it, approximately. Currently this can be done for: ELT, NTH, CAR, FIRST, LENGTH, LIST-LENGTH, and VALUES-LIST -- and additionally using a rest-list as the test-form in an IF is equally efficient and doesn't force its allocation.

LENGTH, ELT, and NTH on rest-lists deserve a special mention: they're all O(1) when this optimization has been applied.

Unfortunately we don't yet have any compiler notes about this, so if you intend to take advantage of this optimization, you're best off verifying the results from assembly.

Coda

With that in place, I rewrote the vararg arithmetic functions using &REST. Amusingly they now look rather charmingly naive: the way someone who doesn't understand the cost of list traversal would write things:

(defun - (number &rest more-numbers)
  (if more-numbers
      (let ((result number))
        (dotimes (i (length more-numbers) result)
          (setf result (- result (nth i more-numbers)))))
      (- number)))

...but using bleeding edge SBCL, this compiles into rather nice code.

Finally, some pretty pictures. These are benchmark results for calling the vararg #'+ with 2, 4, or 8 arguments. F means fixnum, S a single float, and D a double float. The numbers are benchmark iterations per second, so bigger is better. Topmost chart is for the current version using SBCL's newly found rest-smarts, middle chart is for the version using DYNAMIC-EXTENT, and bottom one is for the version before all this madness started.

Benchmarks, linear scale.

Benchmarks, logarithmic scale.

If you look at the vararg+[ff], vararg+[ffff], and vararg+[ffffffff] benchmarks, you can see how the &REST list allacation and access costs almost dominate them: even with stack allocation going from 8 to 2 arguments barely doubles the speed; with the latest version each halving of the argument count doubles the speed for both the fixnums-only and the singles-floats-only benchmarks.

This was run on x86-64, so both single-floats and fixnums are immediate objects. Doubles, however, need heap allocation here -- so if you look at the double float numbers some of the allocation costs come from the numbers and intermediate results.

...but before you get too excited about these numbers, remember the reason why no-one noticed this for such a long time: in real-world performance sensitive code these entry points don't really matter that much.

Syndicated 2012-09-23 14:37:00 from Nikodemus Siivola

Neat TYPEP Trick

How do you test if an object is a cons that has the desired symbol in the car?

(typep x '(cons (eql :foo)))

Sure,

(and (consp x) (eq :foo (car x)))

is essentially just as short...

I still find cons types neat, even if they're a nightmare when it comes to type derivation, but that's a different matter. Some nightmares aren't all bad.

Syndicated 2012-05-15 17:19:49 from Nikodemus Siivola

Trolled

(set-macro-character #\{ (lambda (s c) (read-delimited-list #\} s t)) nil)
(set-macro-character #\} (get-macro-character #\)) nil)

Syndicated 2012-05-15 05:36:46 from Nikodemus Siivola

75 older entries...

 

nikodemus certified others as follows:

  • nikodemus certified nikodemus as Apprentice
  • nikodemus certified biz as Apprentice
  • nikodemus certified wmf as Journeyer
  • nikodemus certified bonzini as Master
  • nikodemus certified varjag as Journeyer
  • nikodemus certified dan as Master
  • nikodemus certified shillo as Journeyer
  • nikodemus certified mdanish as Journeyer
  • nikodemus certified Akira as Journeyer
  • nikodemus certified jolan as Journeyer

Others have certified nikodemus as follows:

  • nikodemus certified nikodemus as Apprentice
  • bonzini certified nikodemus as Apprentice
  • jolan certified nikodemus as Apprentice
  • lerdsuwa certified nikodemus as Apprentice
  • ak certified nikodemus as Journeyer
  • xach certified nikodemus as Journeyer
  • cyrus certified nikodemus as Journeyer
  • technik certified nikodemus as Journeyer
  • pthierry certified nikodemus as Journeyer
  • kai certified nikodemus as Master

[ Certification disabled because you're not logged in. ]

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!

X
Share this page