Recent blog entries for crhodes

17 Nov 2014 (updated 17 Nov 2014 at 13:14 UTC) »

hearing wagner data preparations

Last week’s activity – in between the paperwork, the teaching, the paperwork, the paperwork, the teaching and the paperwork – was mostly taken up in preparations for the Hearing Wagner event, part of the AHRC’s Being Human festival.

Being a part of the Being Human festival gave us the opportunity to work to collect data that we wouldn’t otherwise have had access to: because of the fortuitous timing of the Mariinsky Theatre’s production of the Ring at the Birmingham Hippodrome between 5th and 9th November, we were able to convince funders to allow us to offer free tickets to Birmingham Conservatoire students, in exchange for being wired up to equipment measuring their electrodermal activity, blood flow, and hand motion.

Why collect these data? Well, on of the themes of the Transforming Musicology project as a whole is to examine the perception of leitmotive, particularly Wagner’s use of them in the Ring, and the idea behind gathering these data is to have ecologically-valid (in as much as that is possible when there’s a device strapped to you) measurements of participants’ physical responses to the performance, where those physical responses are believed to correlate with emotional arousal. Using those measurements, we can then go looking for signals of responses to leitmotives, or to other musical or production cues: as well as the students attending the performance, some of the research team were present backstage, noting down the times of events in the staging of (subjective) particular significance – lighting changes, for example.

And then all of these data come back to base, and we have to go through the process of looking for signal. And before we can do anything else, we have to make sure that all of our data are aligned to a reference timeline. For each of the operas, we ended up with around 2GB of data files: up to 10 sets of data from the individual participants, sampled at 120Hz or so; times of page turns in the vocal score, noted by a musicologist member of the research team (a coarse approximation to the sound experienced by the participants); timestamped performance annotations, generated by a second musicologist and dramaturge. How to get all of this onto a common timeline?

Well, in the best of all possible worlds, all of the clocks in the system would have been synchronized by ntp, and that synchronization would have been stable and constant throughout the process. In this case, the Panglossians would have been disappointed: in fact none of the various devices was sufficiently stably synchronized with any of the others to be able to get away with no alignment.

Fortunately, the experimental design was carried out by people with a healthy amount of paranoia: the participants were twice asked to clap in unison: once in the backstage area, where there was only speed-of-sound latency to the listeners (effectively negligible), and once when seated in the auditorium, where there was additional latency from the audio feed from the auditorium to backstage. Those claps gave us enough information, on the rather strong assumption that they were actually simultaneous, to tie everything together: the first clap could be found on each individual measuring device by looking at the accelerometer data for the signature, which establishes a common timeline for the measurement data and the musicologists; the second clap gives a measure for the additional latency introduced by the audio feed. Since the participants’ claps weren’t actually simultaneous – despite the participants being music students, and the clap being conducted – we have a small error, but it’s likely to be no more than about one second.

And this week? This week we’ll actually be looking for interesting signal; there’s reason to believe that electrodermal activity (basically, the change in skin conductance due to sweat) is indicative of emotional arousal, and quite a sensitive measure of music-induced emotion. This is by its nature an exploratory study: at least to start with, we’re looking at particular points of interest (specified by musicologists, in advance) for any correlation with biosignal response – and we’ll be presenting initial results about anything we find at the Hearing Wagner event in Birmingham this weekend. The clock is ticking...

edit: see also a similar post on the project blog

Syndicated 2014-11-17 10:58:17 (Updated 2014-11-17 12:49:11) from notes

reproducible builds - a month ahead of schedule

I think this might be my last blog entry on the subject of building SBCL for a while.

One of the premises behind SBCL as a separate entity from CMUCL, its parent, was to make the result of its build be independent of the compiler used to build it. To a world where separate compilation is the norm, the very idea that building some software should persistently modify the state of the compiler probably seems bizarre, but the Lisp world evolved in that way and Lisp environments (at least those written in themselves) developed build recipes where the steps to construct a new Lisp system from an old one and the source code would depend critically on internal details of both the old and the new one: substantial amounts of introspection on the build host were used to bootstrap the target, so if the details revealed by introspection were no longer valid for the new system, there would need to be some patching in the middle of the build process. (How would you know whether that was necessary? Typically, because the build would fail with a more-or-less – usually more – cryptic error.)

Enter SBCL, whose strategy is essentially to use the source files first to build an SBCL!Compiler running in a host Common Lisp implementation, and then to use that SBCL!Compiler to compile the source files again to produce the target system. This requires some contortions in the source files: we must write enough of the system in portable Common Lisp so that an arbitrary host can execute SBCL!Compiler to compile SBCL-flavoured sources (including the standard headache-inducing (defun car (list) (car list)) and similar, which works because SBCL!Compiler knows how to compile calls to car).

How much is “enough” of the system? Well, one answer might be when the build output actually works, at least to the point of running and executing some Lisp code. We got there about twelve years ago, when OpenMCL (as it was then called) compiled SBCL. And yet... how do we know there aren't odd differences that depend on the host compiler lurking, which will not obviously affect normal operation but will cause hard-to-debug trouble later? (In fact there were plenty of those, popping up at inopportune moments).

I’ve been working intermittently on dealing with this, by attempting to make the Common Lisp code that SBCL!Compiler is written in sufficiently portable that executing it on different implementations generates bitwise-identical output. Because then, and only then, can we be confident that we are not depending in some unforseen way on a particular implementation-specific detail; if output files are different, it might be a harmless divergence, for example a difference in ordering of steps where neither depends on the other, or it might in fact indicate a leak from the host environment into the target. Before this latest attack at the problem, I last worked on it seriously in 2009, getting most of the way there but with some problems remaining, as measured by the number of output files (out of some 330 or so) whose contents differed depending on which host Common Lisp implementation SBCL!Compiler was running on.

Over the last month, then, I have been slowly solving these problems, one by one. This has involved refining what is probably my second most useless skill, working out what SBCL fasl files are doing by looking at their contents in a text editor, and from that intuiting the differences in the implementations that give rise to the differences in the output files. The final pieces of the puzzle fell into place earlier this week, and the triumphant commit announces that as of Wednesday all 335 target source files get compiled identically by SBCL!Compiler, whether that is running under Clozure Common Lisp (32- or 64-bit versions), CLISP, or a different version of SBCL itself.

Oh but wait. There is another component to the build: as well as SBCL!Compiler, we have SBCL!Loader, which is responsible for taking those 335 output files and constructing from them a Lisp image file which the platform executable can use to start a Lisp session. (SBCL!Loader is maybe better known as “genesis”; but it is to load what SBCL!Compiler is to compile-file). And it was slightly disheartening to find that despite having 335 identical output files, the resulting cold-sbcl.core file differed between builds on different host compilers, even after I had remembered to discount the build fingerprint constructed to be different for every build.

Fortunately, the actual problem that needed fixing was relatively small: a call to maphash, which (understandably) makes no guarantees about ordering, was used to affect the Lisp image data directly. I then spent a certain amount of time being thoroughly confused, having managed to construct for myself a Lisp image where the following forms executed with ... odd results:

  (loop for x being the external-symbols of "CL" count 1)
; => 1032
(length (delete-duplicates (loop for x being the external-symbols of "CL" collect x)))
; => 978

(shades of times gone by). Eventually I realised that

  (unless (member (package-name package) '("COMMON-LISP" "KEYWORD" :test #'string=))
  ...)

was not the same as

  (unless (member (package-name package) '("COMMON-LISP" "KEYWORD") :test #'string=)
  ...)

and all was well again, and as of this commit the cold-sbcl.core output file is identical no matter the build host.

It might be interesting to survey the various implementation-specific behaviours that we have run into during the process of making this build completely repeatable. The following is a probably non-exhaustive list – it has been twelve years, after all – but maybe is some food for thought, or (if you have a particularly demonic turn of mind) an ingredients list for a maximally-irritating CL implementation...

  • Perhaps most obviously, various constants are implementation-defined. The ones which caused the most trouble were undoubtably most-positive-fixnum and most-negative-fixnum – particularly since they could end up being used in ways where their presence wasn’t obvious. For example, (deftype fd () `(integer 0 ,most-positive-fixnum)) has, in the SBCL build process, a subtly different meaning from (deftype fd () (and fixnum unsigned-byte)) – in the second case, the fd type will have the intended meaning in the target system, using the target’s fixnum range, while in the first case we have no way of intercepting or translating the host’s value of most-positive-fixnum. Special mentions go to array-dimension-limit, which caused Bill Newman to be cross on the Internet, and to internal-time-units-per-second; I ended up tracking down one difference in output machine code from a leak of the host’s value of that constant into target code.
  • Similarly, random and sxhash quite justifiably differ between implementations. The practical upshot of that is that these functions can’t be used to implement a cache in SBCL!Compiler, because the access patterns and hence the patterns of cache hits and misses will be different depending on the host implementation.
  • As I’ve already mentioned, maphash does not iterate over hash-table contents in a specified order, and in fact that order need not be deterministic; similarly, with-package-iterator can generate symbols in arbitrary orders, and set operations (intersection, set-difference and friends) will return the set as a list whose elements are in an arbitrary order. Incautious use of these functions tended to give rise to harmless but sometimes hard-to-diagnose differences in output; the solution was typically to sort the iteration output before operating on any of it, to introduce determinism...
  • ... but it was possible to get that wrong in a harder-to-detect way, because sort isnot specified to be stable. In some implementations, it actually is a stable sort in some conditions, but for cases where it’s important to preserve an already-existing partial order, stable-sort is the tool for the job.
  • The language specification explicitly says that the initial contents of uninitialized arrays are undefined. In most implementations, at most times, executing (make-array 8 :element-type (unsigned-byte 8)) will give a zero-filled array, but there are circumstances in some implementations where the returned array will have arbitrary data.
  • Not only are some constants implementation-defined, but so also are the effects of normal operation on some variables. *gensym-counter* is affected by macroexpansion if the macro function calls gensym, and implementations are permitted to macroexpand macros an arbitrary number of times. That means that our use of gensym needs to be immune to whatever the host implementation’s macroexpansion and evaluation strategy is.
  • The object returned by byte to represent a bitfield with size and position is implementation-defined. Implementations (variously) return bitmasks, conses, structures, vectors; host return values of byte must not be used during the execution of SBCL!Compiler. More subtly, the various boole-related constants (boole-and and friends) also need special treatment; at one point, their host values were used when SBCL!Compiler compiled the boole function itself, and it so happens that CLISP and SBCL both represent the constants as integers between 0 and 15... but with a different mapping between operation and integer.
  • my last blog entry talked about constant coalescing, and about printing of (quote foo). In fact printing in general has been a pain, and there are still significant differences in interpretation or at least in implementation of pretty-printing: to the extent that at one point we had to minimize printing at all in order for the build to complete under some implementations.
  • there are a number of things which are implementation-defined but have caused a certain amount of difficulty. Floating point in general is a problem, not completely solved (SBCL will not build correctly if its host doesn’t have distinct single- and double-float types that are at least approximately IEEE754-compliant). Some implementations lack denormalized numbers; some do not expose signed zeros to the user; and some implementations compute (log 2d0 10d0) more accurately than others, including SBCL itself, do. The behaviour of the host implementation on legal but dubious code is also potentially tricky: SBCL’s build treats full warnings as worthy of stopping, but some hosts emit full warnings for constructs that are tricky to write in other ways: for example to write portable code to handle multiple kinds of string, one might write (typecase string (simple-base-string ...) ((simple-array character (*)) ...)) (string ...)) but some implementations emit full warnings if a clause in a typecase is completely shadowed by other clauses, and if base-char and character are identical in that implementation the typecase above will signal.

There were probably other, more minor differences between implementations, but the above list gives a flavour of the things that needed doing in order to get to this point, where we have some assurance that our code is behaving as intended. And all of this is a month ahead of my self-imposed deadline of SBCL’s 15th birthday: SBCL was announced to the world on December 14th, 1999. (I’m hoping to be able to put on an sbcl15 workshop in conjunction with the European Lisp Symposium around April 20th/21st/22nd – if that sounds interesting, please pencil the dates in the diary and let me know...)

Syndicated 2014-11-08 22:38:14 (Updated 2014-11-10 17:00:04) from notes

still working on reproducible builds

It’s been nearly fifteen years, and SBCL still can’t be reliably built by other Lisp compilers.

Of course, other peoples’ definition of “reliably” might differ. We did achieve successful building under unrelated Lisp compilers twelve years ago; there were a couple of nasty bugs along the way, found both before and after that triumphant announcement, but at least with a set of compilers whose interpretation of the standard was sufficiently similar to SBCL’s own, and with certain non-mandated but expected features (such as the type (array (unsigned-byte 8) (*)) being distinct from simple-vector, and single-float being distinct from double-float), SBCL achieved its aim of being buildable on a system without an SBCL binary installed (indeed, using CLISP or XCL as a build host, SBCL could in theory be bootstrapped starting with only gcc).

For true “reliability”, though, we should not be depending on any particular implementation-defined features other than ones we actually require – or if we are, then the presence or absence of them should not cause a visible difference in the resulting SBCL. The most common kind of leak from the host lisp to the SBCL binary was the host’s value of most-positive-fixnum influencing the target, causing problems from documentation errors all the way up to type errors in the assembler. Those leaks were mostly plugged a while ago, though they do recur every so often; there are other problems, and over the last week I spent some time tracking down three of them.

The first: if you’ve ever done (apropos "PRINT") or something similar at the SBCL prompt, you might wonder at the existence of functions named something like SB-VM::|CACHED-FUN--PINSRB[(EXT-2BYTE-XMM-REG/MEM ((PREFIX (QUOTE (102))) (OP1 (QUOTE (58))) (OP2 (QUOTE (32))) (IMM NIL TYPE (QUOTE IMM-BYTE))) (QUOTE (NAME TAB REG , REG/MEM ...)))]-EXT-2BYTE-XMM-REG/MEM-PRINTER|.

What is going on there? Well, these functions are a part of the disassembler machinery; they are responsible for taking a certain amount of the machine code stream and generating a printed representation of the corresponding assembly: in this case, for the PINSRB instruction. Ah, but (in most instruction sets) related instructions share a fair amount of structure, and decoding and printing a PINSRD instruction is basically the same as for PINSRB, with just one #x20 changed to a #x22 – in both cases we want the name of the instruction, then a tab, then the destination register, a comma, the source, another comma, and the offset in the destination register. So SBCL arranges to reuse the PINSRB instruction printer for PINSRD; it maintains a cache of printer functions, looked up by printer specification, and reuses them when appropriate. So far, so normal; the ugly name above is the generated name for such a function, constructed by interning a printed, string representation of some useful information.

Hm, but wait. See those (QUOTE (58)) fragments inside the name? They result from printing the list (quote (58)). Is there a consensus on how to print that list? Note that *print-pretty* is bound to nil for this printing; prior experience has shown that there are strong divergences between implementations, as well as long-standing individual bugs, in pretty-printer support. So, what happens if I do (write-to-string '(quote foo) :pretty nil)?

  • SBCL: "(QUOTE FOO)", unconditionally
  • CCL: "'FOO" by default; "(QUOTE FOO)" if ccl:*print-abbreviate-quote* is set to nil
  • CLISP: "'FOO", unconditionally (I read the .d code with comments in half-German to establish this)

So, if SBCL was compiled using CLISP, the name of the same function in the final image would be SB-VM::|CACHED-FUN--PINSRB[(EXT-2BYTE-XMM-REG/MEM ((PREFIX '(102)) (OP1 '(58)) (OP2 '(32)) (IMM NIL TYPE 'IMM-BYTE)) '(NAME TAB REG , REG/MEM ...))]-EXT-2BYTE-XMM-REG/MEM-PRINTER|. Which is shorter, and maybe marginally easier to read, but importantly for my purposes is not bitwise-identical.

Thus, here we have a difference between host Common Lisp compilers which leaks over into the final image, and it must be eliminated. Fortunately, this was fairly straightforward to eliminate; those names are never in fact used to find the function object, so generating a unique name for functions based on a counter makes the generated object file bitwise identical, no matter how the implementation prints two-element lists beginning with quote.

The second host leak is also related to quote, and to our old friend backquote – though not related in any way to the new implementation. Consider this apparently innocuous fragment, which is a simplified version of some code to implement the :type option to defstruct:

  (macrolet ((def (name type n)
             `(progn
                (declaim (inline ,name (setf ,name)))
                (defun ,name (thing)
                  (declare (type simple-vector thing))
                  (the ,type (elt thing ,n)))
                (defun (setf ,name) (value thing)
                  (declare (type simple-vector thing))
                  (declare (type ,type value))
                  (setf (elt thing ,n) value)))))
  (def foo fixnum 0)
  (def bar string 1))

What’s the problem here? Well, the functions are declaimed to be inline, so SBCL records their source code. Their source code is generated by a macroexpander, and so is made up of conses that are generated programmatically (as opposed to freshly consed by the reader). That source code is then stored as a literal object in an object file, which means in practice that instructions for reconstructing a similar object are dumped, to be executed when the object file is processed by load.

Backquote is a reader macro that expands into code that, when evaluated, generates list structure with appropriate evaluation and splicing of unquoted fragments. What does this mean in practice? Well, one reasonable implementation of reading `(type ,type value) might be:

  (cons 'type (cons type '(value)))

and indeed you might (no guarantees) see something like that if you do

  (macroexpand '`(type ,type value))

in the implementation of your choice. Similarly, reading `(setf (elt thing ,n) value) will eventually generate code like

  (cons 'setf (cons (cons 'elt (list 'thing n)) '(value)))

Now, what is “similar”? In this context, it has a technical definition: it relates two objects in possibly-unrelated Lisp images, such that they can be considered to be equivalent despite the fact that they can’t be compared:

similar adj. (of two objects) defined to be equivalent under the similarity relationship.

similarity n. a two-place conceptual equivalence predicate, which is independent of the Lisp image so that two objects in different Lisp images can be understood to be equivalent under this predicate. See Section 3.2.4 (Literal Objects in Compiled Files).

Following that link, we discover that similarity for conses is defined in the obvious way:

Two conses, S and C, are similar if the car of S is similar to the car of C, and the cdr of S is similar to the cdr of C.

and also that implementations have some obligations:

Objects containing circular references can be externalizable objects. The file compiler is required to preserve eqlness of substructures within a file.

and some freedom:

With the exception of symbols and packages, any two literal objects in code being processed by the file compiler may be coalesced if and only if they are similar [...]

Put this all together, and what do we have? That def macro above generates code with similar literal objects: there are two instances of '(value) in it. A host compiler may, or may not, choose to coalesce those two literal '(value)s into a single literal object; if it does, the inline expansion of foo (and bar) will have a circular reference, which must be preserved, showing up as a difference in the object files produced during the SBCL build. The fix? It’s ugly, but portable: since we can’t stop an aggressive compiler from coalescing constants which are similar but not identical, we must make sure that any similar substructure is in fact identical:

  (macrolet ((def (name type n)
             (let ((value '(value)))
               `(progn
                  (declaim (inline ,name (setf ,name)))
                  (defun ,name (thing)
                    (declare (type simple-vector thing))
                    (the ,type (elt thing ,n)))
                  (defun (setf ,name) (value thing)
                    (declare (type simple-vector thing))
                    (declare (type ,type . ,value))
                    (setf (elt thing ,n) . ,value)))))
  (def foo fixnum 0)
  (def bar string 1))

Having dealt with a problem with quote, and a problem with backquote, what might the Universe serve up for my third problem? Naturally, it would be a problem with a code walker. This code walker is somewhat naïve, assuming as it does that its body is made up of forms or tags; it is the assemble macro, which is used implicitly in the definition of VOPs (reusable assembly units); for example, like

  (assemble ()
  (move ptr object)
  (zeroize count)
  (inst cmp ptr nil-value)
  (inst jmp :e DONE)
 LOOP
  (loadw ptr ptr cons-cdr-slot list-pointer-lowtag)
  (inst add count (fixnumize 1))
  (inst cmp ptr nil-value)
  (inst jmp :e DONE)
  (%test-lowtag ptr LOOP nil list-pointer-lowtag)
  (error-call vop 'object-not-list-error ptr)
 DONE))

which generates code to compute the length of a list. The expander for assemble scans its body for any atoms, and generates binding forms for those atoms to labels:

  (let ((new-labels (append labels
                          (set-difference visible-labels inherited-labels))))
  ...
  `(let (,@(mapcar (lambda (name) `(,name (gen-label))) new-labels))
     ...))

The problem with this, from a reproducibility point of view, is that set-difference (and the other set-related functions: union, intersection, set-exclusive-or and their n-destructive variants) do not return the sets with a specified order – which is fine when the objects are truly treated as sets, but in this case the LOOP and DONE label objects ended up in different stack locations depending on the order of their binding. Consequently the machine code for the function emitting code for computing a list’s length – though not the machine code emitted by that function – would vary depending on the host’s implementation of set-difference. The fix here was to sort the result of the set operations, knowing that all the labels would be symbols and that they could be treated as string designators.

And after all this is? We’re still not quite there: there are three to four files (out of 330 or so) which are not bitwise-identical for differing host compilers. I hope to be able to rectify this situation in time for SBCL’s 15th birthday...

Syndicated 2014-10-14 06:51:19 from notes

interesting pretty-printer bug

One of SBCL’s Google Summer of Code students, Krzysztof Drewniak (no relation) just got to merge in his development efforts, giving SBCL a far more complete set of Unicode operations.

Given that this was the merge of three months’ out-of-tree work, it’s not entirely surprising that there were some hiccups, and indeed we spent some time diagnosing and fixing a 1000-fold slowdown in char-downcase. Touch wood, all seems mostly well, except that Jan Moringen reported that, when building without the :sb-unicode feature (and hence having a Lisp with 8-bit characters) one of the printer consistency tests was resulting in an error.

Tracking this down was fun; it in fact had nothing in particular to do with the commit that first showed the symptom, but had been lying latent for a while and had simply never shown up in automated testing. I’ve expressed my admiration for the Common Lisp standard before, and I’ll do it again: both as a user of the language and as an implementor, I think the Common Lisp standard is a well-executed document. But that doesn’t stop it from having problems, and this is a neat one:

When a line break is inserted by any type of conditional newline, any blanks that immediately precede the conditional newline are omitted from the output and indentation is introduced at the beginning of the next line.

(from pprint-newline)

For the graphic standard characters, the character itself is always used for printing in #\ notation---even if the character also has a name[5].

(from CLHS 22.1.3.2)

Space is defined to be graphic.

(from CLHS glossary entry for ‘graphic’)

What do these three requirements together imply? Imagine printing the list (#\a #\b #\c #\Space #\d #\e #\f) with a right-margin of 17:

  (write-to-string '(#\a #\b #\c #\Space #\d #\e #\f) :pretty t :right-margin 17)
; => "(#\\a #\\b #\\c #\\
; #\\d #\\e #\\f)"

The #\Space character is defined to be graphic; therefore, it must print as #\ rather than #\Space; if it happens to be printed just before a conditional newline (such as, for example, generated by using pprint-fill to print a list), the pretty-printer will helpfully remove the space character that has just been printed before inserting the newline. This means that a #\Space character, printed at or near the right margin, will be read back as a #\Newline character.

It’s interesting to see what other implementations do. CLISP 2.49 in its default mode always prints #\Space; in -ansi mode it prints #\ but preserves the space even before a conditional newline. CCL 1.10 similarly preserves the space; there’s an explicit check in output-line-and-setup-for-next for an “escaped” space (and a comment that acknowledges that this is a heuristic that can be wrong in the other direction). I’m not sure what the best fix for this is; it’s fairly clear that the requirements on the printer aren’t totally consistent. For SBCL, I have merged a one-line change that makes the printer print using character names even for graphic characters, if the *print-readably* printer control variable is true; it may not be ideal that print/read round-tripping was broken in the normal case, but in the case where it’s explicitly been asked for it is clearly wrong.

Syndicated 2014-10-06 20:48:14 from notes

settings for gnome-shell extension

A long, long time ago, I configured my window manager. What can I say? I was a student, with too much free time; obviously hoursdays spent learning some configuration file format and tweaking some aspect of behaviour would be repaid many times over the course of a working life. I suppose one thing it led to was my current career, so it’s probably not all a loss.

However, the direct productivity benefits almost certainly were a chimera; unfortunately, systems (hardware and software) changed too often for the productivity benefit (if any) to amortize the fixed set-up time, and so as a reaction I stopped configuring my window manager. In fact, I went the other way, becoming extremely conservative about upgrades of anything at all; I had made my peace with GNOME 2, accepting that there was maybe enough configurability and robustness in the 2.8 era or so for me not to have to change too much, and then not changing anything.

Along comes GNOME 3, and there are howls all over the Internet about the lack of friendly behaviour – “friendly” to people like me, who like lots of terminals and lots of editor buffers; there wasn’t much of an outcry from other people with more “normal” computing needs; who knows why? In any case, I stuck with GNOME 2 for a long time, eventually succumbing at the point of an inadvisable apt-get upgrade and not quite hitting the big red ABORT button in time.

So, GNOME 3. I found that I shared a certain amount of frustration with the vocal crowd: dynamic, vertically-arranged workspaces didn’t fit my model; I felt that clicking icons should generate new instances of applications rather than switch to existing instances, and so on. But, in the same timeframe, I adopted a more emacs-centric workflow, and the improvements of the emacs daemon meant that I was less dependent on particular behaviours of window manager and shell, so I gave it another try, and, with the right extensions, it stuck.

The right extensions? “What are those?” I hear you cry. Well, in common with illustrious Debian Project Leaders past, I found that a tiling extension made many of the old focus issues less pressing. My laptop is big enough, and I have enough (fixed) workspaces, that dividing up each screen between applications mostly works. I also have a bottom panel, customized to a height of 0 pixels, purely to give me the fixed number of workspaces; the overview shows them in a vertical arrangement, but the actual workspace arrangement is the 2x4 that I’m used to.

One issue with a tiled window arrangement, though, is an obvious cue to which window has focus. I have also removed all window decorations, so the titlebar or border don’t help with this; instead, a further extension to shade inactive windows helps to minimize visual distraction. And that’s where the technical part of this blog entry starts...

One of the things I do for my employer is deliver a module on Perception and Multimedia Computing. In the course of developing that module, I learnt a lot about how we see what we see, and also how digital displays work. And one of the things I learnt to be more conscious about was attention: in particular, how my attention can be drawn (for example, I can not concentrate on anything where there are animated displays, such as are often present in semi-public spaces, such as bars or London City airport.)

The shade inactive windows extension adds a brightness-reducing effect to windows without focus. So, that was definitely useful, but after a while I noticed that emacs windows with some text in error-face (bold, bright red) in them were still diverting my attention, even when they were unfocussed and therefore substantially dimmed.

So I worked on a patch to the extension to add a saturation-reducing effect in addition to reducing the brightness. And that was all very well – a classic example of taking code that almost does what you want it to do, and then maintenance-programming it into what you really want it to do – until I also found that the hard-wired time over which the effect took hold (300ms) was a bit too long for my taste, and I started investigating what it would take to make these things configurable.

Some time later, after exploring the world with my most finely-crafted google queries, I came to the conclusion that there was in fact no documentation for this at all. The tutorials that I found were clearly out-dated, and there were answers to questions on various forums whose applicability was unclear. This is an attempt to document the approach that worked for me; I make no claims that this is ‘good’ or even acceptable, but maybe there’s some chance that it will amortize the cost of the time I spent flailing about over other people wanting to customize their GNOME shell.

The first thing that something, anything with a preference needs, is a schema for that preference. In this instance, we’re starting with the shade-inactive-windows in the hepaajan.iki.fi namespace, so our schema will have a path that begins “/fi/iki/hepaajan/shade-inactive-windows”, and we're defining preferences, so let’s add “/preferences” to that.

  <?xml version="1.0" encoding="UTF-8"?>
<schemalist>
  <schema path="/fi/iki/hepaajan/shade-inactive-windows/preferences/"

a schema also needs an id, which should probably resemble the path

            id="fi.iki.hepaajan.shade-inactive-windows.preferences"

except that there are different conventions for hierarchy (. vs /).

            gettext-domain="gsettings-desktop-schemas">

and then here’s a cargo-culted gettext thing, which is probably relevant if the rest of the schema will ever be translated into any non-English language.

In this instance, I am interested in a preference that can be used to change the time over which the shading of inactie windows happens. It’s probably easiest to define this as an integer (the "i" here; other GVariant types are possible):

      <key type="i" name="shade-time">

which corresponds to the number of milliseconds

        <summary>Time in milliseconds over which shading occurs</summary>
      <description>
        The time over which the shading effect is applied, in milliseconds.
      </description>

which we will constrain to be between 0 and 1000, so that the actual time is between 0s and 1s, with a default of 0.3s:

        <range min="0" max="1000"/>
      <default>300</default>

and then there's some XML noise

      </key>
  </schema>
</schemalist>

and that completes our schema. For reasons that will become obvious later, we need to store that schema in a directory data/glib-2.0/schemas relative to our base extension directory; giving it a name that corresponds to the schema id (so fi.iki.hepaajan.shade-inactive-windows.preferences.gschema.xml in this case) is wise, but probably not essential. In order for the schema to be useful later, we also need to compile it: that’s as simple as executing glib-compile-schemas . within the schemas directory, which should produce a gschemas.compiled file in the same directory.

Then, we also need to adapt the extension in question to lookup a preference value when needed, rather than hard-coding the default value. I have no mental model of the namespacing, or other aspects of the environment, applied to GNOME shell extensions’ javascript code, so being simultaneously conservative and a javascript novice I added a top-level variable unlikely to collide with anything:

  var ShadeInactiveWindowsSettings = {};
function init() {

The extension previously didn’t need to do anything on init(); now, however, we need to initialize the settings object, including looking up our schema to discover what settings there are. But where is our schema? Well, if we’re running this extension in-place, or as part of a user installation, we want to look in data/glib-2.0/schemas/ relative to our own path; if we have performed a global installation, the schema will presumably be in a path that is already searched for by the default schema finding methods. So...

      var schemaDir = ExtensionUtils.getCurrentExtension().dir.get_child('data').get_child('glib-2.0').get_child('schemas');
    var schemaSource = Gio.SettingsSchemaSource.get_default();

    if(schemaDir.query_exists(null)) {
        schemaSource = Gio.SettingsSchemaSource.new_from_directory(schemaDir.get_path(), schemaSource, false);
    }

... we distinguish between those two cases by checking to see if we can find a data/glib-2.0/schemas/ directory relative to the extension’s own directory; if we can, we prepend that directory to the schema source search path. Then, we lookup our schema using the id we gave it, and initialize a new imports.gi.Gio.Settings object with that schema.

      var schemaObj = schemaSource.lookup('fi.iki.hepaajan.shade-inactive-windows.preferences', true);
    if(!schemaObj) {
        throw new Error('failure to look up schema');
    }
    ShadeInactiveWindowsSettings = new Gio.Settings({ settings_schema: schemaObj });
}

Then, whenever we use the shade time in the extension, we must make sure to look it up afresh:

  var shade_time = ShadeInactiveWindowsSettings.get_int('shade-time') / 1000;

in order that any changes made by the user take effect immediately. And that’s it. There’s an additional minor wrinkle, in that altering that configuration variable is not totally straightforward; dconf and gettings also need to be told where to look for their schema; that’s done using the XDG_DATA_DIRS configuration variable. For example, once the extension is installed locally, you should be able to run

  XDG_DATA_DIRS=$HOME/.local/gnome-shell/extensions/shade-inactive-windows@hepaajan.iki.fi/data:$XDG_DATA_DIRS dconf

and then navigate to the fi/iki/hepaajan/shade-inactive-windows/preferences schema and alter the shade-time preference entry.

Hooray! After doing all of that, we have wrestled things into being configurable: we can use the normal user preferences user interface to change the time over which the shading animation happens. I’m not going to confess how many times I had to restart my gnome session, insert logging code, look at log files that are only readable by root, and otherwise leave my environment; I will merely note that we are quite a long way away from the “scriptable user interface” – and that if you want to do something similar (not identical, but similar) in an all-emacs world, it might be as simple as evaluating these forms in your *scratch* buffer...

  (set-face-attribute 'default nil :background "#eeeeee")
(defvar my/current-buffer-background-overlay nil)

(defun my/lighten-current-buffer-background ()
  (unless (window-minibuffer-p (selected-window))
    (unless my/current-buffer-background-overlay
      (setq my/current-buffer-background-overlay (make-overlay 1 1))
      (overlay-put my/current-buffer-background-overlay
       'face '(:background "white")))
    (overlay-put my/current-buffer-background-overlay 'window
                 (selected-window))
    (move-overlay my/current-buffer-background-overlay
                  (point-min) (point-max))))
(defun my/unlighten-current-buffer-background ()
  (when my/current-buffer-background-overlay
    (delete-overlay my/current-buffer-background-overlay)))

(add-hook 'pre-command-hook #'my/unlighten-current-buffer-background) 
(add-hook 'post-command-hook #'my/lighten-current-buffer-background)

Syndicated 2014-10-06 19:55:13 from notes

code walking for pipe sequencing

Since it seems still topical to talk about Lisp and code-transformation macros, here’s another worked example – this time inspired by the enthusiasm for the R magrittr package.

The basic idea behind the magrittr package is, as Hadley said at EARL2014, to convert from a form of code where arguments to the same function are far apart to one where they’re essentially close together; the example he presented was converting

  arrange(
  summarise
    group_by(
      filter(babynames, name == "Hadley"),
      year),
    total = sum(n)
  desc(year))

to

  b0 <- babynames
b1 <- filter(b0, name == "Hadley")
b2 <- group_by(b1, year)
b3 <- summarise(b2, total = sum(n))
b4 <- arrange(b3, desc(year))

only without the danger of mistyping one of the variable names along the way and failing to perform the computation that was intended.

R, as I have said before, is a Lisp-1 with weird syntax and wacky evaluation semantics. One of the things that ordinary user code can do is inspect the syntactic form of its arguments, before evaluating them. This means that when looking at a fragment of code such as

  foo(bar(2,3), 4)

where a call-by-value language would first evaluate bar(2,3), then call foo with two arguments (the value resulting from the evaluation, and 4), R instead uses a form of call-by-need evaluation, and also provides operators for inspecting the promise directly. This means R users can do such horrible things as

  foo <- function(x) {
    tmp <- substitute(x)
    sgn <- 1
    while(class(tmp) == "(") {
        tmp <- tmp[[2]]
        sgn <- sgn * -1
    }
    sgn * eval.parent(tmp)
}
foo(3) # 3
foo((3)) # -3
foo(((3))) # 3
foo((((3)))) # -3 (isn’t this awesome?  I did say “wacky”)

In the case of magrittr, the package authors have taken advantage of this to invent some new syntax; the pipe operator %>% is charged with inserting its first argument (its left-hand side, in normal operation) as the first argument to the call of its second argument (right-hand side). Hadley’s example is

  babynames %>%
  filter(name == "Hadley") %>%
  group_by(year) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(year))

and this is effective because the data flow in this case really is a pipeline: there's a dataset, which needs filtering, then grouping, then summarization, then sorting, and each operation works on the result of the previous. This already needs to inspect the syntactic form of the argument; an additional feature is recognizing the presence of .s in the call, and placing the left-hand side value in that argument position instead of as the first argument if it is present.

In Common Lisp, there are some piping or chaining operators out there (e.g. one two three (search for ablock) four and probably many others), and they do well enough. However! They mostly suffer from similar problems that we’ve seen before: doing code transformations with not quite enough understanding of the semantics of the code that they’re transforming; again, that’s fine for normal use, but for didactic purposes let’s pretend that we really care about this.

The -> macro from http://stackoverflow.com/a/11080068 is basically the same as the magrittr %>% operator: it converts symbols in the pipeline to function calls, and places the result of the previous evaluation as the first argument of the current operator, except if a $ is present in the arguments, in which case it replaces that. (This version doesn’t support more than one $ in the argument list; it would be a little bit of a pain to support that, needing a temporary name, but it’s straightforward in principle).

Since the -> macro does its job, a code-walker implementation isn’t strictly necessary: pure syntactic manipulation is good enough, and if it’s used with just the code it expects, it will do it well. It is of course possible to express what it does using a code-walker; we’ll fix the multiple-$ ‘bug’ along the way, by explicitly introducing bindings rather than replacements of symbols:

  (defmacro -> (form &body body)
  (labels ((find-$ (form env)
             (sb-walker:walk-form form env
              (lambda (f c e)
                (cond
                  ((eql f '$) (return-from find-$ t))
                  ((eql f form) f)
                  (t (values f t)))))
             nil)
           (walker (form context env)
             (cond
               ((symbolp form) (list form))
               ((atom form) form)
               (t (if (find-$ form env)
                      (values `(setq $ ,form) t)
                      (values `(setq $ ,(list* (car form) '$ (cdr form))) t))))))
    `(let (($ ,form))
       ,@(mapcar (lambda (f) (sb-walker:walk-form f nil #'walker)) body))))

How to understand this implementation? Well, clearly, we need to understand what sb-walker:walk does. Broadly, it calls the walker function (its third argument) on successive evaluated subforms of the original form (and on variable names set by setq); the primary return value is used as the interim result of the walk, subject to further walking (macroexpansion and walking of its subforms) except if the second return value from the walker function is t.

Now, let’s start with the find-$ local function: its job is to walk a form, and returns t if it finds a $ variable to be evaluated at toplevel and nil otherwise. It does that by returning t if the form it’s given is $; otherwise, if the form it’s given is the original form, we need to walk its subforms, so return f; otherwise, return its form argument f with a secondary value of t to inhibit further walking. This operation is slightly at odds with the use of a code walker: we are explicitly not taking advantage of the fact that it understands the semantics of the code it’s walking. This might explain why the find-$ function itself looks a bit weird.

The walker local function is responsible for most of the code transformation. It binds $ to the value of the first form, then repeatedly sets $ to the value of successive forms, rewritten to interpolate a $ in the first argument position if there isn’t one in the form already (as reported by find-$). If any of the forms is a symbol, it gets listified and subsequently re-walked. Thus

  (macroexpand-1 '(-> "THREE" string-downcase (char 0)))
; => (LET (($ "THREE"))
;      (SETQ $ (STRING-DOWNCASE $))
;      (SETQ $ (CHAR $ 0))),
;    T

So far, so good. Now, what could we do with a code-walker that we can’t without? Well, the above implementation of -> supports chaining simple function calls, so one answer is “chaining things that aren’t just function calls”. Another refinement is to support eliding the insertion of $ when there are any uses of $ in the form, not just as a bare argument. Looking at the second one first, since it’s less controversial:

  (defmacro -> (form &body body)
  (labels ((find-$ (form env)
             (sb-walker:walk-form form env
              (lambda (f c e)
                (cond
                  ((and (eql f '$) (eql c :eval))
                   (return-from find-$ t))
                  (t f))))
             nil)
           (walker (form context env)
             (cond
               ((symbolp form) (list form))
               ((atom form) form)
               (t (if (find-$ form env)
                      (values `(setq $ ,form) t)
                      (values `(setq $ ,(list* (car form) '$ (cdr form))) t))))))
    `(let (($ ,form))
       ,@(mapcar (lambda (f) (sb-walker:walk-form f nil #'walker)) body))))

The only thing that’s changed here is the definition of find-$, and in fact it’s a little simpler: the task is now to walk the entire form and find uses of $ in an evaluated position, no matter how deep in the evaluation. Because this is a code-walker, this will correctly handle macros, backquotes, quoted symbols, and so on, and this allows code of the form

  (macroexpand-1 '(-> "THREE" string-downcase (char 0) char-code (complex (1+ $) (1- $))))
; => (LET (($ "THREE"))
;      (SETQ $ (STRING-DOWNCASE $))
;      (SETQ $ (CHAR-CODE $))
;      (SETQ $ (COMPLEX (1+ $) (1- $)))),
;    T

which, as far as I can tell, is not supported in magrittr: doing 3 %>% complex(.+1,.-1) is met with the error that “object '.' not found”. Supporting this might, of course, not be a good idea, but at least the code walker shows that it’s possible.

What if we wanted to augment -> to handle binding forms, or special forms in general? This is probably beyond the call of duty, but let’s just briefly imagine that we wanted to be able to support binding special variables around the individual calls in the chain; for example, we want

  (-> 3 (let ((*random-state* (make-random-state))) rnorm) mean)

to expand to

  (let (($ 3))
  (setq $ (let ((*random-state* (make-random-state))) (rnorm $)))
  (setq $ (mean $)))

and let us also say, to make it interesting, that uses of $ in the bindings clauses of the let should not count against inhibiting the insertion of $ in the first argument position of the first form in the body of the let, so

  (-> 3 (let ((y (1+ $))) (atan y)))

should expand to

  (let (($ 3)) (setq $ (let ((y (1+ $))) (atan $ y))))

So our code walker needs to walk the bindings of the let, merely collecting information into the walker’s lexical environment, then walk the body performing the same rewrite as before. CHALLENGE ACCEPTED:

  (defmacro -> (&body forms)
  (let ((rewrite t))
    (declare (special rewrite))
    (labels ((find-$ (form env)
               (sb-walker:walk-form form env
                (lambda (f c e)
                  (cond
                    ((and (eql f '$) (eql c :eval))
                     (return-from find-$ t))
                    (t f))))
               nil)
             (walker (form context env)
               (declare (ignore context))
               (typecase form
                 (symbol (if rewrite (list form) form))
                 (atom form)
                 ((cons (member with-rewriting without-rewriting))
                  (let ((rewrite (eql (car form) 'with-rewriting)))
                    (declare (special rewrite))
                    (values (sb-walker:walk-form (cadr form) env #'walker) t)))
                 ((cons (member let let*))
                  (unless rewrite
                    (return-from walker form))
                  (let* ((body (member 'declare (cddr form)
                                       :key (lambda (x) (when (consp x) (car x))) :test-not #'eql))
                         (declares (ldiff (cddr form) body))
                         (rewritten (sb-walker:walk-form
                                     `(without-rewriting
                                          (,(car form) ,(cadr form)
                                            ,@declares
                                            (with-rewriting
                                                ,@body)))
                                     env #'walker)))
                    (values rewritten t)))
                 (t
                  (unless rewrite
                    (return-from walker form))
                  (if (find-$ form env)
                      (values `(setq $ ,form) t)
                      (values `(setq $ ,(list* (car form) '$ (cdr form))) t))))))
      `(let (($ ,(car forms)))
         ,@(mapcar (lambda (f) (sb-walker:walk-form f nil #'walker)) (cdr forms))))))

Here, find-$ is unchanged from the previous version; all the new functionality is in walker. How does it work? The default branch of the walker function is also unchanged; what has changed is handling of let and let* forms. The main trick is to communicate information between successive calls to the walker function, and turn the rewriting on and off appropriately: we wrap parts of the form in new pseudo-special operators with-rewriting and without-rewriting, which is basically a tacky and restricted implementation of compiler-let – if we needed to, we could do a proper one with macrolet. Within the scope of a without-rewriting, walker doesn’t do anything special, but merely return the form it was given, except if the form it’s given is a with-rewriting form. This is a nice illustration, incidentally, of the idea that lexical scope in the code translates nicely to dynamic scope in the compiler; I can’t remember where I read that first (but it’s certainly not a new idea).

And now

  (macroexpand '(-> 3 (let ((*random-state* (make-random-state))) rnorm) mean))
; => (LET (($ 3))
;      (LET ((*RANDOM-STATE* (MAKE-RANDOM-STATE)))
;        (SETQ $ (RNORM $)))
;      (SETQ $ (MEAN $))),
;    T
(macroexpand '(-> 3 (let ((y (1+ $))) (atan y))))
; => (LET (($ 3))
;      (LET ((Y (1+ $)))
;        (SETQ $ (ATAN $ Y)))),
;    T

Just to be clear: this post isn’t advocating a smarter pipe operator; I don’t have a clear enough view, but I doubt that the benefits of the smartness outweigh the complexity. It is demonstrating what can be done, in a reasonably controlled way, using a code-walker: ascribing semantics to fragments of Common Lisp code, and combining those fragments in a particular way, and of course it’s another example of sb-walker:walk in use.

Finally, if something like this does in fact get used, people sometimes get tripped up by the package system: the special bits of syntax are symbols, and importing or package-qualifying -> without doing the corresponding thing to $ would lead to cryptic errors, wrong results and/or confusion. One possibility to handle that is to invent a bit more reader syntax:

  (set-macro-character #\¦
 (defun pipe-reader (stream char)
   (let ((*readtable* (copy-readtable)))
     (set-macro-character #\·
      (lambda (stream char)
        (declare (ignore stream char))
        '$) t)
   (cons '-> (read-delimited-list char stream t)))) nil)
¦"THREE" string-downcase (find-if #'alpha-char-p ·) char-code¦

If this is the exported syntax, it has the advantage that the interface can only be misused intentionally: the actual macro and its anaphoric symbol are both hidden from the programmer; and the syntax is reasonably easy to type – on my keyboard ¦ is AltGr+| and · is AltGr+. – and moderately mnemonic from shell pipes and function notation respectively. It also has all the usual disadvantages of reader-based interfaces, such as composability, somewhat mitigated if pipe-reader is part of the macro’s exported interface.

Syndicated 2014-09-25 14:01:45 from notes

earl conference

This week, I went to the Effective Applications of the R Language conference. I’d been alerted to its existence from my visit to a londonR meeting in June. Again, I went for at least two reasons: one as an R enthusiast, though admittedly one (as usual) more interested in tooling than in applications; and one as postgraduate coordinator in Goldsmiths Computing, where for one of our modules in the new Data Science programme (starting todayyesterday! Hooray for “Welcome Week”!) involves exposing students to live data science briefs from academia and industry, with the aim of fostering a relevant and interesting final project.

A third reason? Ben Goldacre as invited speaker. A fantastic choice of keynote, even if he did call us ‘R dorks’ a lot and confess that he was a Stata user. The material, as one might expect, was derived from his books and related experience, but the delivery was excellent, and the drive clear to see. There were some lovely quotes that it’s tempting to include out of context; in the light of my as-yet unposed ‘research question’ for the final module of the PG Certificate in Higher Education – that I am still engaged on – it is tempting to bait the “infestation of qualitative researchers in the educational research establishment”, and attempt a Randomised Controlled Trial, or failing that a statistical analysis of assessments to try to uncover suitable hypotheses for future testing.

Ben’s unintended warm-up act – they were the other way around in the programme, but clearly travelling across London is complicated – was Hadley Wickham, of ggplot2 fame. His talk, about RStudio, his current view on the data analysis workflow, and new packages to support it, was a nice counterpoint: mostly tools, not applications, but clearly focussed to help make sense of complicated (and initially untidy) datasets. I liked the shiny-based in-browser living documents in R-markdown, which is not a technology that I’ve investigated yet; at this rate I will have more options for reproducible research than reports written. He, and others at the conference, were advocating a pipe-based code sequencing structure – the R implementation of this is called magrittr (ha, ha) and has properties that are made possible to user code through R’s nature of a Lisp-1 with crazy evaluation semantics, on which more in another post.

The rest of the event was made up of shorter, usually more domain-specific talks: around 20 minutes for each speaker. I think it suffered a little bit from many of the participants not being able to speak freely – a natural consequence of a mostly-industrial event, but frustrating. I think it was also probably a mistake to schedule in the first one of the regular (parallel) sessions of the event a reflective slot for three presentations about comparing R with other languages (Python, Julia, and a more positive one about R’s niche): there hadn’t really been time for a positive tone to be established, and it just felt like a bit of a downer. (Judging by the room, most of the delegates – perhaps wisely – had opted for the other track, on “Business Applications of R”).

Highlights of the shorter talks, for me:

  • YPlan’s John Sandall talking about “agile” data analytics, leading to agile business practices relentlessly focussed on one KPI. At the time, I wondered whether the focus on the first time a user gives them money would act against building something of lasting value – analytics give the power to make decisions, but the underlying strategy still has to be thought about. On the other hand, I’m oh-too familiar with the notion that startups must survive first and building something “awesome” is a side-effect, and focussing on money in is pretty sensible.
  • Richard Pugh’s (from Mango Solutions) talk about modelling and simulating the behaviour of a sales team did suffer from the confidentiality problem (“I can’t talk about the project this comes from, or the data”) but was at least entertaining: the behaviours he talked about (optimistic opportunity value, interactions of CRM closing dates with quarter boundaries) were highly plausible, and the question of whether he was applying the method to his own sales team quite pointed. (no)
  • the team from Simpson Carpenter Ltd, as well as saying that “London has almost as many market research agencies as pubs” (which rings true) had what I think is a fair insight: R is perhaps less of a black-box than certain commercial tools; there’s a certain retrocomputing feel to starting R, being at the prompt, and thinking “now what?” That implies that to actually do something with R, you need to know a bit more about what you’re doing. (That didn’t stop a few egregiously bad graphs being used in other presentations, including my personal favourite of a graph of workflow expressed as business value against time, with the inevitable backwards-arrows).
  • some other R-related tools to look into:

And then there was of course the hallwaybreak room track; Tower Hotel catered admirably for us, with free-flowing coffee, nibbles and lunch. I had some good conversations with a number of people, and am optimistic that students with the right attitude could both benefit and gain hugely from data science internships. I’m sure I was among the most tool-oriented of the attendees (most of the delegates were actually using R), but I did get to have a conversation with Hadley about “Advanced R”, and we discussed object systems, and conditions and restarts. More free-form notes about the event on my wiki.

Meanwhile, in related news, parts of the swank backend implementation of SLIME changed, mostly moving symbols to new packages. I've updated swankr to take account of the changes, and (I believe, untested) preserved compatibility with older (pre 2014-09-13) SLIMEs.

Syndicated 2014-09-23 10:08:42 (Updated 2014-09-23 20:33:50) from notes

naive vs proper code-walking

I said in my discussion about backquote representations that some utilities had defects made manifest by SBCL 1.2.2’s new internal representation for backquote and related operators, and that those defects could have been avoided by using a code-walker. I’m going to look at let-over-lambda code here, to try to demonstrate what I meant by that, and show how a proper code-walker can quite straightforwardly be used for the code transformations that have been implemented using a naïve walker (typically walking over a tree of conses), removing whole classes of defects in the process.

The let-over-lambda code I’m discussing is from https://github.com/thephoeron/let-over-lambda, specifically this version. This isn’t intended to be a hatchet job on the utility – clearly, it is of use to its users – but to show up potential problems and offer solutions for how to fix them. I should also state up front that I haven’t read the Let over Lambda book, but it’s entirely possible that discussing and using a full code-walker would have been out of scope (as it explicitly was for On Lisp).

Firstly, let’s deal with how the maintainer of the let-over-lambda code is dealing with the change in backquote representations, since it’s still topical:

  ;; package definition here just in case someone decides to paste
;; things into a Lisp session, and for private namespacing
(defpackage "LOL" (:use "CL"))
(in-package "LOL")
;; actual excerpts from let-over-lambda code from
;; <https://github.com/thephoeron/let-over-lambda/blob/a202167629cb421cbc2139cfce1db22a84278f9f/let-over-lambda.lisp>
;; begins here:
#+sbcl
(if (string-lessp (lisp-implementation-version) "1.2.2")
    (pushnew :safe-sbcl *features*)
    (setq *features* (remove :safe-sbcl *features*)))
(defun flatten (x)
  (labels ((rec (x acc)
             (cond ((null x) acc)
                   #+(and sbcl (not safe-sbcl))
                   ((typep x 'sb-impl::comma) (rec (sb-impl::comma-expr x) acc))
                   ((atom x) (cons x acc))
                   (t (rec (car x) (rec (cdr x) acc))))))
    (rec x nil)))

The issues around the (*features*) handling here have been reported at github; for the purpose of this blog entry, I will just say that I wrote about them in Maintaining Portable Lisp Programs, a long time ago, and that a better version might look a bit like this:

  #+sbcl
(eval-when (:compile-toplevel :execute)
  (defun comma-implementation ()
    (typecase '`,x
      (symbol 'old)
      ((cons symbol (cons structure-object)) 'new)))
  (if (eql (comma-implementation) 'old)
      (pushnew 'cons-walkable-backquote *features*)
      (setq *features* (remove 'cons-walkable-backquote *features*))))
(defun flatten (x)
  (labels ((rec (x acc)
             (cond ((null x) acc)
                   #+lol::cons-walkable-backquote
                   ((typep x 'sb-impl::comma) (rec (sb-impl::comma-expr x) acc))
                   ((atom x) (cons x acc))
                   (t (rec (car x) (rec (cdr x) acc))))))
    (rec x nil)))

With these changes, the code is (relatively) robustly testing for the particular feature it needs to know about at the time that it needs to know, and recording it in a way that doesn’t risk confusion or contention with any other body of code. What is the let-over-lambda library using flatten for?

  (defun g!-symbol-p (thing)
  (and (symbolp thing)
       (eql (mismatch (symbol-name thing) "G!") 2)))
(defmacro defmacro/g! (name args &rest body)
  (let ((syms (remove-duplicates
               (remove-if-not #'g!-symbol-p (flatten body)))))
    `(defmacro ,name ,args
       (let ,(mapcar
              (lambda (s)
                `(,s (gensym ,(subseq (symbol-name s) 2))))
              syms)
         ,@body))))

The intent behind this macro-defining macro, defmacro/g!, appears to be automatic gensym generation: being able to write

  (defmacro/g! with-foo ((foo) &body body)
  `(let ((,g!foo (activate-foo ,foo)))
     (unwind-protect
         (progn ,@body)
       (deactivate-foo ,g!foo))))

without any explicit calls to gensym but retaining the protection that gensyms give against name capture:

  (macroexpand-1 '(with-foo (3) 4))
; => (let ((#1=#:FOO1 (activate-foo 3)))
;      (unwind-protect
;          (progn 4)
;        (deactivate-foo #1#)))

That's fine; it’s reasonable to want something like this. Are there any issues with this, apart from the one exposed by SBCL’s new backquote implementation? In its conventional use, probably not – essentially, all uses of g! symbols are unquoted (i.e. behind commas) – but there are a couple of more theoretical points. One issue is that flatten as it currently stands will look for all symbols beginning with g! in the macroexpander function source, whether or not they are actually variable evaluations:

  (defmacro/g! with-bar ((bar) &body body)
  `(block g!block
     (let ((,g!bar ,bar)) ,@body)))
; unused variable G!BLOCK
(macroexpand-1 '(with-bar (3) 4))
; => (block g!block (let ((#:BAR1 3)) 4))

In this example, that’s fair enough: it’s probably user error to have those g! symbols not be unquoted; this probably only becomes a real problem if there are macro-defining macros, with both the definer and the definition using g! symbols. It's not totally straightforward to demonstrate other problems with this simple approach to Lisp code transformation using just this macro; the transformation is sufficiently minimal, and the symptoms of problems relatively innocuous, that existing programming conventions are strong enough to prevent anything seriously untoward going wrong.

Before getting on to another example where the problems with this approach become more apparent, how could this transformation be done properly? By “properly” here I mean that the defmacro/g! should arrange to bind gensyms only for those g! symbols which are to be evaluated by the macroexpander, and not for those which are used for any other purpose. This is a task for a code-walker: a piece of code which exploits the fact that Lisp code is made up of Lisp data structures, all of which are introspectable, and the semantics of which in terms of effect on environment and execution are known. It is tedious, though possible, to write a mostly-portable code-walker (there needs to be some hook into the implementation’s representation of environments); I’m not going to do that here, but instead will use SBCL’s built-in code-walker.

The sb-walker:walk-form function takes three arguments: a form to walk, an initial environment to walk it in, and a walker function to perform whatever action is necessary on the walk. That walker function itself takes three arguments, a form, context and environment, and the walker arranges for it to be called on every macroexpanded or evaluated subform in the original form. The walker function should return a replacement form for the subform it is given (or the subform itself if it doesn’t want to take any action), and a secondary value of t if no further walking of that form should take place.

To do g! symbol detection and binding is fairly straightforward. If a symbol is in a context for evaluation, we collect it, and here we can take the first benefit from a proper code walk: we only collect g! symbols if the code-walker deems that they will be evaluated and there isn't an already-existing lexical binding for it:

  (defmacro defmacro/g!-walked (name args &body body)
  (let* (g!symbols)
    (flet ((g!-walker (subform context env)
             (declare (ignore context))
             (typecase subform
               (symbol
                (when (and (g!-symbol-p subform)
                           (not (sb-walker:var-lexical-p subform env)))
                  (pushnew subform g!symbols))
                subform)
               (t subform))))
      (sb-walker:walk-form `(progn ,@body) nil #'g!-walker)
      `(defmacro ,name ,args
         (let ,(mapcar (lambda (s) (list s `(gensym ,(subseq (symbol-name s) 2))))
                       g!symbols)
           ,@body)))))

The fact that we only collect symbols which will be evaluated deals with the problem exhibited by with-bar, above:

  (defmacro/g!-walked with-bar/walked ((bar) &body body)
  `(block g!block
     (let ((,g!bar ,bar)) ,@body)))
(macroexpand-1 '(with-bar/walked (3) 4))
; => (block g!block (let ((#:BAR1 3)) 4))

Only gathering symbols which don’t have lexical bindings (testing sb-walker:var-lexical-p) deals with another minor problem:

  (defmacro/g!-walked with-baz ((baz) &body body)
  (let ((g!sym 'sym))
    `(let ((,g!sym ,baz)) ,@body)))
(macroexpand-1 '(with-baz (3) 4))
; => (let ((sym 3)) 4)

(the cons-walker – flatten – would not be able to detect that there is already a binding for g!sym, and would introduce another one, again leading to an unused variable warning.)

OK, time to recap. So far, we’ve corrected the code that tests for particular backquote implementations, which was used in flatten, which itself was used to perform a code-walk; we’ve also seen some low-impact or theoretical problems with that simple code-walking technique, and have used a proper code-walker instead of flatten to deal with those problems. If the odd extra unused variable binding were the worst thing that could happen, there wouldn’t be much benefit from using a code-walker (other than the assurance that the walker is dealing with forms for execution); however, let us now turn our attention to the other macro in let-over-lambda’s code which does significant codewalking:

  (defun dollar-symbol-p (thing)
  (and (symbolp thing)
       (char= (char (symbol-name thing) 0) #\$)
       (ignore-errors (parse-integer (subseq (symbol-name thing) 1)))))
(defun prune-if-match-bodies-from-sub-lexical-scope (tree)
  (if (consp tree)
      (if (or (eq (car tree) 'if-match)
              (eq (car tree) 'when-match))
          (cddr tree)
          (cons (prune-if-match-bodies-from-sub-lexical-scope (car tree))
                (prune-if-match-bodies-from-sub-lexical-scope (cdr tree))))
      tree))
;; WARNING: Not %100 correct. Removes forms like (... if-match ...) from the
;; sub-lexical scope even though this isn't an invocation of the macro.
#+cl-ppcre
(defmacro! if-match ((test str) conseq &optional altern)
  (let ((dollars (remove-duplicates
                  (remove-if-not #'dollar-symbol-p
                                 (flatten (prune-if-match-bodies-from-sub-lexical-scope conseq))))))
    (let ((top (or (car (sort (mapcar #'dollar-symbol-p dollars) #'>)) 0)))
      `(let ((,g!str ,str))
         (multiple-value-bind (,g!s ,g!e ,g!ms ,g!me) (,test ,g!str)
           (declare (ignorable ,g!e ,g!me))
           (if ,g!s
               (if (< (length ,g!ms) ,top)
                   (error "ifmatch: too few matches")
                   ;; lightly edited here to remove irrelevant use of #`
                   (let ,(mapcar (lambda (a1) `(,(symb "$" a1)
                                                (subseq ,g!str (aref ,g!ms ,(1- a1))
                                                               (aref ,g!me ,(1- a1)))))
                                 (loop for i from 1 to top collect i))
                     ,conseq))
               ,altern))))))
(defmacro when-match ((test str) conseq &rest more-conseq)
  `(if-match (,test ,str)
     (progn ,conseq ,@more-conseq)))

What’s going on here? We have a prune-if-match-bodies-from-sub-lexical-scope function which, again, performs some kind of cons-based tree walk, removing some conses whose car is if-match or when-match. We have a trivial macro when-match which transforms into an if-match; the if-match macro is more involved. Any symbols named as a $ sign followed by an integer (in base 10) are treated specially; the intent is that they will be bound to capture groups of the cl-ppcre match. So it would be used in something like something like

  (defun key-value (line)
  (if-match ((lambda (s) (scan "^\\(.*\\): \\(.*\\)$" s)) line)
      (list $1 $2)
      (error "not actually a key-value line: ~S" line)))

and that would macroexpand to, roughly,

  (defun key-value (line)
  (multiple-value-bind (s e ms me)
      ((lambda (s) (scan "^\\(.*\\): \\(.*\\)$" s)) line)
    (if s
        (if (< (length ms) 2)
            (error "if-match: not enough matches)
            (let (($1 (subseq line (aref ms 0) (aref me 0)))
                  ($2 (subseq line (aref ms 1) (aref me 1))))
              (list $1 $2)))
        (error "not actually a key-value line: ~S" line))))

(there's additional reader macrology in let-over-lambda to make that lambda form unnecessary, but we can ignore that for our purposes).

Now, if-match has a similar problem that defmacro/g! had: since the tree walker doesn’t make a distinction between symbols present for evaluation and symbols for any other purpose, it is possible to confuse the walker. For example:

  (if-match (scanner string)
    (if (> (length $1) 6)
        '|$1000000|
        'less-than-$1000000))

This form, if macroexpanded, will attempt to bind one million variables to matched groups; even if the compiler doesn’t choke on that, evaluation will go wrong, as the matcher is unlikely to match one million groups (so the “not enough matches” error branch will be taken) – whereas of course the quoted one million dollar symbol is not intended for evaluation.

But the nesting problems are more obvious in this case than for defmacro/g!. Firstly, take the simple case:

  (if-match (scanner string)
    (list $1
          (if-match (scanner2 string)
              $2
              nil))
    nil)

Here, the $2 is in the scope of the inner if-match, and so mustn’t be included for the macroexpansion of the outer if-match. This case is handled in let-over-lambda’s implementation by the prune-if-match-bodies-from-sub-lexical-scope: the consequent of the inner if-match is pruned from the dollar-symbol accumulator. However, there are several issues with this; the first is that the test is pruned:

  (if-match (scanner string)
    (if-match (scanner2 $2)
        $1
        nil)
    nil)

In this example, the $2 is ‘invisible’ to the outer if-match, and so won’t get a binding. That’s straightforwardly fixable, along with the mishandling of when-let’s syntax (the entire body of when-let should be pruned, not just the first form), and what I think is an error in the pruning of if-match (it should recurse on the cdddr, not the cddr; github issue).

Not fixable at all while still using naïve code-walking are two other problems, one of which is noted in the comment present in the let-over-lambda code: the pruner doesn’t distinguish between if-match forms for evaluation and other conses whose car is if-match. Triggering this problem does involve some contortions – in order for it to matter, we need an if-match not for evaluation followed by a dollar symbol which is to be evaluated; but, for example:

  (defmacro list$/q (&rest args)
  `(list ,@(mapcar (lambda (x) (if (dollar-symbol-p x) x `',x)) args)))
(if-match (scanner string)
    (list$/q foo if-match $2)
    nil)

Here, although the $2 is in a position for evaluation (after macroexpansion), it will have no binding because it will have been pruned when naïvely walking the outer if-match macro. The if-match symbol argument to `list$/q ends up quoted, and should not be treated as a macro call.

Also, the pruner function must have special knowledge not just about the semantics of if-match, but also of any macro which can expand to if-match – see the attempt to handle when-match in the pruner. If a user were to have the temerity to define case-match

  (defmacro case-match (string &rest clauses)
  (if (null clauses)
      nil
      `(if-match (,(caar clauses) ,string)
           (progn ,@(cdar clauses))
           (case-match string ,@(cdr clauses)))))

any attempt to nest a case-match inside an outer if-match is liable to fail, as the pruner has no knowledge of how to handle the case-match form.

All of these problems are solvable by using a proper code-walker. The code-walker should collect up all dollar symbols to be evaluated in the consequent of an if-match form, so that bindings for them can be generated, except for those with already existing lexical bindings within the if-match (not those from outside, otherwise nesting won’t work). For testing purposes, we’ll also signal a diagnostic condition within the macroexpander to indicate which dollar symbols we’ve found.

  (define-condition if-match/walked-diagnostic (condition)
  ((symbols :initarg :symbols :reader if-match-symbols)))
(defmacro if-match/walked ((test string) consequent &optional alternative)
  (let* (dollar-symbols)
    (flet ((dollar-walker (subform context env)
             (declare (ignore context))
             (typecase subform
               (symbol
                (when (and (dollar-symbol-p subform)
                           (not (sb-walker:var-lexical-p subform env)))
                  (pushnew subform dollar-symbols))
                subform)
               (t subform))))
      (handler-bind ((if-match/walked-diagnostic #'continue))
        (sb-walker:walk-form consequent nil #'dollar-walker))
      (let* ((dollar-symbols (sort dollar-symbols #'> :key #'dollar-symbol-p))
             (top (dollar-symbol-p (car dollar-symbols))))
        (with-simple-restart (continue "Ignore diagnostic condition")
          (signal 'if-match/walked-diagnostic :symbols dollar-symbols))
        (sb-int:with-unique-names (start end match-start match-end)
          (sb-int:once-only ((string string))
            `(multiple-value-bind (,start ,end ,match-start ,match-end)
                 (,test ,string)
               (declare (ignore ,end) (ignorable ,match-end))
               (if ,start
                   (if (< (length ,match-start) ,top)
                       (error "~S: too few matches: needed ~D, got ~D." 'if-match
                              ,top (length ,match-start))
                       (let ,(mapcar (lambda (s)
                                       (let ((i (1- (dollar-symbol-p s))))
                                         `(,s (subseq ,string (aref ,match-start ,i) (aref ,match-end ,i)))))
                                     (reverse dollar-symbols))
                         ,consequent))
                   ,alternative))))))))

(I'm using sb-int:once-only and sb-int:with-unique-names to avoid having to include their definitions in this post, which is getting a bit lengthy). Testing this looks like

  (defmacro test-if-match (form expected-symbols)
  `(handler-case (macroexpand-1 ',form)
     (if-match/walked-diagnostic (c)
       (assert (equal (if-match-symbols c) ',expected-symbols)))
     (:no-error (&rest values) (declare (ignore values)) (error "no diagnostic"))))
(test-if-match (if-match/walked (test string) (list $1 $2) 'foo) ($2 $1))
(test-if-match (if-match/walked (test string) (if (> (length $1) 6) '$10 '$8) nil) ($1))
(test-if-match (if-match/walked (scanner string)
                   (list $1
                         (if-match/walked (scanner2 string)
                             $2
                             nil))
                   nil)
               ($1))
(test-if-match (if-match/walked (scanner string) (list$/q foo if-match/walked $3) nil) ($3))
(defmacro case-match/walked (string &rest clauses)
  (if (null clauses)
      nil
      `(if-match/walked (,(caar clauses) ,string)
           (progn ,@(cdar clauses))
           (case-match/walked string ,@(cdr clauses)))))
(test-if-match (if-match/walked (scanner string)
                   (case-match/walked $1
                     (foo $2)
                     (bar $3)))
               ($1))

To summarize: I’ve shown here how to make use of a full code-walker to make a couple of code transforming macros more robust. Full code-walkers can do more than just what I've shown here: the sb-walker:walk-form interface can also inhibit macroexpansion, transform function calls into calls to other functions, while respecting the semantics of the Lisp operators in the code that is being walked and allowing some introspection of the lexical environment. Here, we have called sb-walker:walk-form for side effects from the walker function we’ve provided; it is also possible to use its value (that’s how sb-cltl2:macroexpand-all is implemented, for example). I hope that this can help users affected by the change in internal representation of backquote, as well as others who want to write advanced code-transforming macros. If the thought of using an SBCL-internal code-walker makes you a bit queasy (as well it might), you could instead start by looking at one or two other more explicitly-portable code-walkers out there, for example John Fremlin’s macroexpand-dammit, the walker in Alex Plotnick's CLWEB literate programming system (github link), or the code walker in iterate.

Syndicated 2014-09-08 20:13:38 from notes

31 Jul 2014 (updated 31 Jul 2014 at 18:14 UTC) »

london employment visualization part 2

Previously, I did all the hard work to obtain and transform some data related to London, including borough and MSOA shapes, population counts, and employment figures, and used them to generate some subjectively pretty pictures. I promised a followup on the gridSVG approach to generating visualizations with more potential for interactivity than a simple picture; this is the beginning of that.

Having done all the heavy lifting in the last post, including being able to generate ggplot objects (whose printing results in the pictures), it is relatively simple to wrap output to SVG instead of output to PNG around it all. In fact it is extremely simple to output to SVG; simply use an SVG output device

  svg("/tmp/london.svg", width=16, height=10)

rather than a PNG one

  png("/tmp/london.png", width=1536, height=960)

(which brings back for me memories of McCLIM, and my implementation of an SVG backend, about a decade ago). So what does that look like? Well, if you’ve entered those forms at the R repl, close the png device

  dev.off()

and then (the currently active device being the SVG one)

  print(ggplot.london(fulltime/(allages-younger-older)))
dev.off()

default (cairo) SVG device

That produces an SVG file, and if SVG in and of itself is the goal, that’s great. But I would expect that the main reason for producing SVG isn’t so much for the format itself (though it is nice that it is a vector image format rather than rasterized, so that zooming in principle doesn’t cause artifacts) but for the ability to add scripting to it: and since the output SVG doesn’t retain any information about the underlying data that was used to generate it, it is very difficult to do anything meaningful with it.

I write “very difficult” rather than “impossible”, because in fact the SVGAnnotation package aimed to do just that: specifically, read the SVG output produced by the R SVG output device, and (with a bit of user assistance and a liberal sprinkling of heuristics) attempt to identify the regions of the plot corresponding to particular slices of datasets. Then, using a standard XML library, the user could decorate the SVG with extra information, add links or scripts, and essentially do whatever they needed to do; this was all wrapped up in an svgPlot function. The problem with this approach is that it is fragile: for example, one heuristic used to identify a lattice plot area was that there should be no text in it, which fails for custom panel functions with labelled guidlines. It is possible to override the default heuristic, but it’s difficult to build a robust system this way (and in fact when I tried to run some two-year old analysis routines recently, the custom SVG annotation that I wrote broke into multiple pieces given new data).

gridSVG’s approach is a little bit different. Instead of writing SVG out and reading it back in, it relies on the grid graphics engine (so does not work with so-called base graphics, the default graphics system in R), and on manipulating the grid object which represents the current scene. The gridsvg pseudo-graphics-device does the behind-the-scenes rendering for us, with some cost related to yet more wacky interactions with R’s argument evaluation semantics which we will pay later.

  gridsvg("/tmp/gridsvg-london.svg", width=16, height=10)
print(ggplot.london(fulltime/(allages-younger-older)))
dev.off()

Because ggplot uses grid graphics, this just works, and generates a much more structured svg file, which should render identically to the previous one:

SVG from gridSVG device

If it renders identically, why bother? Well, because now we have something that writes out the current grid scene, we can alter that scene before writing out the document (at dev.off() time). For example, we might want to add tooltips to the MSOAs so that their name and the quantity value can be read off by a human. Wrapping it all up into a function, we get

  gridsvg.london <- function(expr, subsetexpr=TRUE, filename="/tmp/london.svg") {

We need to compute the subset in this function, even though we’re going to be using the full dataset in ggplot.london when we call it, in order to get the values and zone labels.

      london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))

Then we need to map (pun mostly intended) the values in the fortified data frame to the polygons drawn; without delving into the format, my intuition is that the fortified data frame contains vertex information, whereas the grid (and hence SVG) data is organized by polygons, and there may be more than one polygon for a region (for example if there are islands in the Thames). Here we simply generate an index from a group identifier to the first row in the dataframe in that group, and use it to pull out the appropriate value and label.

      is <- match(levels(london.data$group), london.data$group)
    vals <- eval(substitute(expr), london.data)[is]
    labels <- levels(london.data$zonelabel)[london.data$zonelabel[is]]

Then we pay the cost of the argument evaluation semantics. My first try at this line was gridsvg(filename, width=16, height=10), which I would have (perhaps naïvely) expected to work, but which in fact gave me an odd error suggesting that the environment filename was being evaluated in was the wrong one. Calling gridsvg like this forces evaluation of filename before the call, so there should be less that can go wrong.

      do.call(gridsvg, list(filename, width=16, height=10))

And, as before, we have to do substitutions rather than evaluations to get the argument expressions evaluated in the right place:

      print(do.call(ggplot.london, list(substitute(expr), substitute(subsetexpr))))

Now comes the payoff. At this point, we have a grid scene, which we can investigate using grid.ls(). Doing so suggests that the map data is in a grid object named like GRID.polygon followed by an integer, presumably in an attempt to make names unique. We can “garnish” that object with attributes that we want: some javascript callbacks, and the values and labels that we previously calculated.

      grid.garnish("GRID.polygon.*",
                 onmouseover=rep("showTooltip(evt)", length(is)),
                 onmouseout=rep("hideTooltip()", length(is)),
                 zonelabel=labels, value=vals,
                 group=FALSE, grep=TRUE)

We need also to provide implementations of those callbacks. It is possible to do that inline, but for simplicity here we simply link to an external resource.

      grid.script(filename="tooltip.js")

Then close the gridsvg device, and we’re done!

      dev.off()
}

Then gridsvg.london(fulltime/(allages-younger-older)) produces:

proportion employed full-time

which is some kind of improvement over a static image for data of this complexity.

And yet... the perfectionist in me is not quite satisfied. At issue is a minor graphical glitch, but it’s enough to make me not quite content; the border of each MSOA is stroked in a slightly lighter colour than the fill colour, but that stroke extends beyond the border of the MSOA region (the stroke’s centre is along the polygon edge). This means that the strokes from adjacent MSOAs overlie each other, so that the most recently drawn obliterates any drawn previously. This also causes some odd artifacts around the edges of London (and into the Thames, and pretty much obscures the river Lea).

This can be fixed by clipping; I think the trick to clip a path to itself counts as well-known. But clipping in SVG is slightly hard, and the gridSVG facilities for doing it work on a grob-by-grob basis, while the map is all one big polygon grid object. So to get the output I want, I am going to have to perform surgery on the SVG document itself after all; we are still in a better position than before, because we will start with a sensible hierarchical arrangement of graphical objects in the SVG XML structure, and gridSVG furthermore provides some introspective capabilities to give XML ids or XPath query strings for particular grobs.

grid.export exports the current grid scene to SVG, returning a list with the SVG XML itself along with this mapping information. We have in the SVG output an arbitrary number of polygon objects; our task is to arrange such that each of those polygons has a clip mask which is itself. In order to do that, we need for each polygon a clipPath entry with a unique id in a defs section somewhere, where each clipPath contains a use pointing to the original polygon’s ID; then each polygon needs to have a clip-path style property pointing to the corresponding clipPath object. Clear?

  addClipPaths <- function(gridsvg, id) {

given the return value of grid.export and the identifier of the map grob, we want to get the set of XML nodes corresponding to the polygons within that grob.

      ns <- getNodeSet(gridsvg$svg, sprintf("%s/*", gridsvg$mappings$grobs[[id]]$xpath))

Then for each of those nodes, we want to set a clip path.

      for (i in 1:length(ns)) {
        addAttributes(ns[[i]], style=sprintf("clip-path: url(#clipPath%s)", i))
    }

For each of those nodes, we also need to define a clip path

      clippaths <- list()
    for (i in 1:length(ns)) {
        clippaths[[i]] <- newXMLNode("clipPath", attrs=c(id=sprintf("clipPath%s", i)))
        use <- newXMLNode("use", attrs = c("xlink:href"=sprintf("#%s", xmlAttrs(ns[[i]])[["id"]])))
        addChildren(clippaths[[i]], kids=list(use))
    }

And hook it into the existing XML

      defs <- newXMLNode("defs")
    addChildren(defs, kids=clippaths)
    top <- getNodeSet(gridsvg$svg, "//*[@id='gridSVG']")[[1]]
    addChildren(top, kids=list(defs))
}

Then our driver function needs some slight modifications:

  gridsvg.london2 <- function(expr, subsetexpr=TRUE, filename="/tmp/london.svg") {
    london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))
    is <- match(levels(london.data$group), london.data$group)
    vals <- eval(substitute(expr), london.data)[is]
    labels <- levels(london.data$zonelabel)[london.data$zonelabel[is]]

Until here, everything is the same, but we can’t use the gridsvg pseudo-graphics device any more, so we need to do graphics device handling ourselves:

      pdf(width=16, height=10)
    print(do.call(ggplot.london, list(substitute(expr), substitute(subsetexpr))))
    grid.garnish("GRID.polygon.*",
                 onmouseover=rep("showTooltip(evt)", length(is)),
                 onmouseout=rep("hideTooltip()", length(is)),
                 zonelabel=labels, value=vals,
                 group=FALSE, grep=TRUE)
    grid.script(filename="tooltip.js")

Now we export the scene to SVG,

      gridsvg <- grid.export()

find the grob containing all the map polygons,

      grobnames <- grid.ls(flatten=TRUE, print=FALSE)$name
    grobid <- grobnames[[grep("GRID.polygon", grobnames)[1]]]

add the clip paths,

      addClipPaths(gridsvg, grobid)
    saveXML(gridsvg$svg, file=filename)

and we’re done!

      dev.off()
}

Then gridsvg.london2(fulltime/(allages-younger-older)) produces:

proportion employed full-time (with polygon clipping)

and I leave whether the graphical output is worth the effort to the beholder’s judgment.

As before, these images contain National Statistics and Ordnance Survey data © Crown copyright and database right 2012.

Syndicated 2014-07-31 17:07:34 (Updated 2014-07-31 17:14:34) from notes

19 Jul 2014 (updated 19 Jul 2014 at 21:12 UTC) »

london employment visualization

I went to londonR a month or so ago, and the talk whose content stayed with me most was the one given by Simon Hailstone on map-making and data visualisation with ggplot2. The take-home message for me from that talk was that it was likely to be fairly straightforward to come up with interesting visualisations and data mashups, if you already know what you are doing.

Not unrelatedly, in the Goldsmiths Data Science MSc programme which we are running from October, it is quite likely that I will be teaching a module in Open Data, where the students will be expected to produce visualisations and mashups of publically-accessible open data sets.

So, it is then important for me to know what I am doing, in order that I can help the students to learn to know what they are doing. My first, gentle steps are to look at life in London; motivated at least partly by the expectation that I encounter when dropping off or picking up children from school that I need more things to do with my time (invitations to come and be trained in IT at a childrens' centre at 11:00 are particularly misdirected, I feel) I was interested to try to find out how local employment varies. This blog serves at least partly as a stream-of-exploration of not just the data but also ggplot and gridsvg; as a (mostly satisfied) user of lattice and SVGAnnotation in a previous life, I'm interested to see how the R graphing tools have evolved.

Onwards! Using R 3.1 or later: first, we load in the libraries we will need.

  library(ggplot2)
library(extremevalues)
library(maps)
library(rgeos)
library(maptools)
library(munsell)
library(gridSVG)

Then, let's define a couple of placeholder lists to accumulate related data structures.

  london <- list()
tmp <- list()

Firstly, let's get hold of some data. I want to look at employment; that means getting hold of counts of employed people, broken down by area, and also counts of people living in that area in the first place. I found some Excel spreadsheets published by London datastore and the Office for National Statistics, with data from the 2011 census, and extracted from them some useful data in CSV format. (See the bottom of this post for more detail about the data sources).

  london$labour <- read.csv("2011-labour.csv", header=TRUE, skip=1)
london$population <- read.csv("mid-2011-lsoa-quinary-estimates.csv", header=TRUE, skip=3)

The labour information here does include working population counts by area – specifically by Middle Layer Super Output Area (MSOA), a unit of area holding roughly eight thousand people. Without going into too much detail, these units of area are formed by aggregating Lower Layer Super Output Areas (LSOAs), which themselves are formed by aggregating census Output Areas. In order to get the level of granularity we want, we have to do some of that aggregation ourselves.

The labour data is published in one sheet by borough (administrative area), ward (local electoral area) and LSOA. The zone IDs for these areas begin E09, E05 and E01; we need to manipulate the data frame we have to include just the LSOA entries

  london$labour.lsoa <- droplevels(subset(london$labour, grepl("^E01", ZONEID)))

and then aggregate LSOAs into MSOAs “by hand”, by using the fact that LSOAs are named such that the name of their corresponding MSOA is all but the last letter of the LSOA name (e.g. “Newham 001A” and “Newham 001B” LSOAs both have “Newham 001” as their parent MSOA). In doing the aggregation, all the numeric columns are count data, so the correct aggregation is to sum corresponding entries of numerical data (and discard non-numeric ones)

  tmp$nonNumericNames <- c("DISTLABEL", "X", "X.1", "ZONEID", "ZONELABEL")
tmp$msoaNames <- sub(".$", "", london$labour.lsoa$ZONELABEL)
tmp$data <- london$labour.lsoa[,!names(london$labour.lsoa) %in% tmp$nonNumericNames]

london$labour.msoa <- aggregate(tmp$data, list(tmp$msoaNames), sum)
tmp <- list()

The population data is also at LSOA granularity; it is in fact for all LSOAs in England and Wales, rather than just London, and includes some aggregate data at a higher level. We can restrict our attention to London LSOAs by reusing the zone IDs from the labour data, and then performing the same aggregation of LSOAs into MSOAs.

  london$population.lsoa <- droplevels(subset(london$population, Area.Codes %in% levels(london$labour.lsoa$ZONEID)))
tmp$nonNumericNames <- c("Area.Codes", "Area.Names", "X")
tmp$msoaNames <- sub(".$", "", london$population.lsoa$X)
tmp$data <- london$population.lsoa[,!(names(london$population.lsoa) %in% tmp$nonNumericNames)]
london$population.msoa <- aggregate(tmp$data, list(tmp$msoaNames), sum)
tmp <- list()

This is enough data to be getting on with; it only remains to join the data frames up. The MSOA name column is common between the two tables; we can therefore reorder one of them so that it lines up with the other before simply binding columns together:

  london$employment.msoa <- cbind(london$labour.msoa, london$population.msoa[match(london$labour.msoa$Group.1, london$population.msoa$Group.1)])

(In practice in this case, the order of the two datasets is the same, so the careful reordering is in fact the identity transformation)

  tmp$notequals <- london$population.msoa$Group.1 != london$labour.msoa$Group.1
stopifnot(!sum(tmp$notequals))
tmp <- list()

Now, finally, we are in a position to look at some of the data. The population data contains a count of all people in the MSOA, along with counts broken down into 5-year age bands (so 0-5 years old, 5-10, and so on, up to 90+). We can take a look at the proportion of each MSOA's population in each age band:

  ggplot(reshape(london$population.msoa, varying=list(3:21),
               direction="long", idvar="Group.1")) +
    geom_line(aes(x=5*time-2.5, y=X0.4/All.Ages, group=Group.1),
              alpha=0.03, show_guide=FALSE) +
    scale_x_continuous(name="age / years") +
    scale_y_continuous(name="fraction") +
    theme_bw()

age distributions in London MSOAs

This isn't a perfect visualisation; it does show the broad overall shape of the London population, along with some interesting deviations (some MSOAs having a substantial peak in population in the 20-25 year band, while rather more peak in the 25-30 band; a concentration around 7% of 0-5 year olds, with some significant outliers; and a tail of older people with some interesting high outliers in the 45-50 and 60-65 year bands, as well as one outlier in the 75-80 and 80-85).

That was just population; what about the employment statistics? Here's a visualisation of the fraction of people in full-time employment, treated as a distribution

  ggplot(london$employment.msoa, aes(x=Full.time/All.Ages)) +
    geom_density() + theme_bw()

fraction in full-time employment in London MSOAs

showing an overal concentration of full-time employment level around 35%. But this of course includes in the population estimate people of non-working ages; it may be more relevant to view the distribution of full-time employment as a fraction of those of “working age”, which here we will define as those between 15 and 65 (not quite right but close enough):

  london$employment.msoa$younger = rowSums(london$employment.msoa[,62:64])
london$employment.msoa$older = rowSums(london$employment.msoa[,75:80])
ggplot(london$employment.msoa, aes(x=Full.time/(All.Ages-younger-older))) +
    geom_density() + theme_bw()

fraction of those of working age in full-time employment in London MSOAs

That's quite a different picture: we've lost the bump of outliers of high employment around 60%, and instead there's substantial structure at lower employment rates (plateaus in the density function around 50% and 35% of the working age population). What's going on? The proportions of people not of working age must not be evenly distributed between MSOAs; what does it look like?

  ggplot(london$employment.msoa, aes(x=older/All.Ages, y=younger/All.Ages)) +
    geom_point(alpha=0.1) + geom_density2d() + coord_equal() + theme_bw()

distribution of younger vs older people in London MSOAs

So here we see that a high proportion of older people in an MSOA (around 20%) is a strong predictor of the proportion of younger people (around 17%); similarly, if the younger population is 25% or so of the total population, there are likely to be less than 10% of people of older non-working ages. But at intermediate values of either older or younger proportions, 10% of older or 15% of younger, the observed variation of the other proportion is very high: what this says is that the proportion of people of working age in an MSOA varies quite strongly, and for a given proportion of people of working age, some MSOAs have more older people and some more younger people. Not a world-shattering discovery, but important to note.

So, going back to employment: what are the interesting cases? One way of trying to characterize them is to fit the central portion of the observations to some parameterized distribution, and then examining cases where the observation is of a sufficiently low probability. The extremevalues R package does some of this for us; it offers the option to fit a variety of distributions to a set of observations with associated quantiles. Unfortunately, the natural distribution to use for proportions, the Beta distribution, is not one of the options (fitting involves messing around with derivatives of the inverse of the incomplete Beta function, which is a bit of a pain), but we can cheat: instead of doing the reasonable thing and fitting a beta distribution, we can transform the proportion using the logit transform and fit a Normal distribution instead, because everything is secretly a Normal distribution. Right? Right? Anyway.

  tmp$outliers <- with(london$employment.msoa, getOutliers(log(Full.time/All.Ages/(1-Full.time/All.Ages))))
tmp$high <- london$employment.msoa$Group.1[tmp$outliers$iRight]
tmp$low <-london$employment.msoa$Group.1[tmp$outliers$iLeft]
print(tmp[c("high", "low")])
tmp <- list()

This shows up 9 MSOAs as having a surprisingly high full employment rate, and one (poor old Hackney 001) as having a surprisingly low rate (where “surprising” is defined relative to our not-well-motivated-at-all Normal distribution of logit-transformed full employment rates, and so shouldn't be taken too seriously). Where are these things? Clearly the next step is to put these MSOAs on a map. Firstly, let's parse some shape files (again, see the bottom of this post for details of where to get them from).

  london$borough <- readShapePoly("London_Borough_Excluding_MHW")
london$msoa <- readShapePoly("MSOA_2011_London_gen_MHW")

In order to plot these shape files using ggplot, we need to convert the shape file into a “fortified” form:

  london$borough.fortified <- fortify(london$borough)
london$msoa.fortified <- fortify(london$msoa)

and we will also need to make sure that we have the MSOA (and borough) information as well as the shapes:

  london$msoa.fortified$borough <- london$msoa@data[as.character(london$msoa.fortified$id), "LAD11NM"]
london$msoa.fortified$zonelabel <- london$msoa@data[as.character(london$msoa.fortified$id), "MSOA11NM"]

For convenience, let's add to the “fortified” MSOA data frame the columns that we've been investigating so far:

  tmp$is <- match(london$msoa.fortified$zonelabel, london$employment.msoa$Group.1)
london$msoa.fortified$fulltime <- london$employment.msoa$Full.time[tmp$is]
london$msoa.fortified$allages <- london$employment.msoa$All.Ages[tmp$is]
london$msoa.fortified$older <- london$employment.msoa$older[tmp$is]
london$msoa.fortified$younger <- london$employment.msoa$younger[tmp$is]

And now, a side note about R and its evaluation semantics. R has... odd semantics. The basic evaluation model is that function arguments are evaluated lazily, when the callee needs them, but in the caller's environment – except if the argument is a defaulted one (when no argument was actually passed), in which ase it's in the callee's environment active at the time of evaluation, except that these rules can be subverted using various mechanisms, such as simply looking at the call object that is currently being evaluated. There's significant complexity lurking, and libraries like ggplot (and lattice, and pretty much anything you could mention) make use of that complexity to give a convenient surface interface to their functions. Which is great, until the time comes for someone to write a function that calls these libraries, such as the one I'm about to present which draws pictures of London. I have to know about the evaluation semantics of the functions that I'm calling – which I suppose would be reasonable if there weren't quite so many possible options – and necessary workarounds that I need to install so that my function has the right, convenient evaluation semantics while still being able to call the library functions that I need to. I can cope with this because I have substantial experience with programming languages with customizeable evaluation semantics; I wonder how to present this to Data Science students.

With all that said, I present a fairly flexible function for drawing bits of London. This presentation of the finished object of course obscures all my exploration of things that didn't work, some for obvious reasons and some which remain a mystery to me.

  ggplot.london <- function(expr, subsetexpr=TRUE) {

We will call this function with an (unevaluated) expression to be displayed on a choropleth, to be evaluated in the data frame london$msoa.fortified, and an optional subsetting expression to restrict the plotting. We compute the range of the expression, independent of any data subsetting, so that we have a consistent scale between calls

      expr.limits <- range(eval(substitute(expr), london$msoa.fortified))

and compute a suitable default aesthetic structure for ggplot

      expr.aes <- do.call(aes, list(fill=substitute(expr), colour=substitute(expr)))

where substitute is used to propagate the convenience of an expression interface in our function to the same expression interface in the ggplot functions, and do.call is necessary because aes does not evaluate its arguments, while we do need to perform the substitution before the call.

      london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))

This is another somewhat complicated set of calls to make sure that the subset expression is evaluated when we need it, in the environment of the data frame. london.data at this point contains the subset of the MSOAs that we will be drawing.

      (ggplot(london$borough.fortified) + coord_equal() +

Then, we set up the plot, specify that we want a 1:1 aspect ratio,

       (theme_minimal() %+replace%
      theme(panel.background = element_rect(fill="black"),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())) +

a black background with no grid lines,

       geom_map(map=london.data, data=london.data, modifyList(aes(map_id=id, x=long, y=lat), expr.aes), label="map") +

and plot the map.

       scale_fill_gradient(limits=expr.limits, low=mnsl2hex(darker(rgb2mnsl(coords(hex2RGB("#13432B", TRUE))))), high=mnsl2hex(darker(rgb2mnsl(coords(hex2RGB("#56F7B1", TRUE)))))) +
     scale_colour_gradient(limits=expr.limits, low="#13432B", high="#56F7B1", guide=FALSE) +

We specify a greenish colour scale, with the fill colour being slightly darker than the outline colour for prettiness, and

       geom_map(map=london$borough.fortified, aes(map_id=id, x=long, y=lat), colour="white", fill=NA, size=0.1)

also draw a superposition of borough boundaries to help guide the eye.

       )
}

So, now, we can draw a map of full time employment:

  ggplot.london(fulltime/(allages-younger-older))

proportion employed full-time

or the same, but just the outlier MSOAs:

  ggplot.london(fulltime/(allages-younger-older), zonelabel %in% london$outliers)

proportion employed full-time (outlier MSOAs)

or the same again, plotting only those MSOAs which are within 0.1 of the extremal values:

  ggplot.london(fulltime/(allages-younger-older), pmin(fulltime/(allages-younger-older) - min(fulltime/(allages-younger-older)), max(fulltime/(allages-younger-older)) - fulltime/(allages-younger-older)) < 0.1)

proportion employed full-time (near-extremal MSOAs)

Are there take-home messages from this? On the data analysis itself, probably not; it's not really a surprise to discover that there's substantial underemployment in the North and East of London, and relatively full employment in areas not too far from the City of London; it is not unreasonable for nursery staff to ask whether parents dropping children off might want training; it might be that they have the balance of probabilities on their side.

From the tools side, it seems that they can be made to work; I've only really been kicking the tyres rather than pushing them hard, and the number of times I had to rewrite a call to ggplot, aes or one of the helper functions as a response to an error message, presumably because of some evaluation not happening at quite the right time, is high – but the flexibility and interactivity they offer for exploration feels to me as if the benefits outweigh the costs.

The R commands in this post are here; the shape files are extracted from Statistical GIS Boundary Files for London and redistributed here under the terms of the Open Government Licence; they contain National Statistics and Ordnance Survey data © Crown copyright and database right 2012. The CSV data file for population counts is derived from an ONS publication, distributed here under the terms of the Open Government Licence, and contains National Statistics data © Crown copyright and database right 2013; the CSV data file for labour statistics is derived from a spreadsheet published by the London datastore, distributed here under the terms of the Open Government Licence, and contains National Statistics data © Crown copyright and database right 2013.

Still here after all that licence information? I know I promised some gridSVG evaluation too, but it's getting late and I've been drafting this post for long enough already. I will come back to it, though, because it was the more fun bit, and it makes some nifty interactive visualizations fairly straightforward. Watch this space.

Syndicated 2014-07-19 20:12:00 (Updated 2014-07-19 20:18:42) from notes

192 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!