crhodes is currently certified at Master level.

Name: Christophe Rhodes
Member since: 2001-05-03 06:41:31
Last Login: 2014-04-25 07:18:49

FOAF RDF Share This

Homepage: http://www.doc.gold.ac.uk/~mas01cr/

Notes:

Notes here and here.

Oh, that probably isn't what "Notes" means. Oh well.

Physicist, Musician, Common Lisp programmer. Move along, there's nothing to see.

Projects

Recent blog entries by crhodes

Syndication: RSS 2.0
31 Jul 2014 (updated 31 Jul 2014 at 18:14 UTC) »

london employment visualization part 2

Previously, I did all the hard work to obtain and transform some data related to London, including borough and MSOA shapes, population counts, and employment figures, and used them to generate some subjectively pretty pictures. I promised a followup on the gridSVG approach to generating visualizations with more potential for interactivity than a simple picture; this is the beginning of that.

Having done all the heavy lifting in the last post, including being able to generate ggplot objects (whose printing results in the pictures), it is relatively simple to wrap output to SVG instead of output to PNG around it all. In fact it is extremely simple to output to SVG; simply use an SVG output device

  svg("/tmp/london.svg", width=16, height=10)

rather than a PNG one

  png("/tmp/london.png", width=1536, height=960)

(which brings back for me memories of McCLIM, and my implementation of an SVG backend, about a decade ago). So what does that look like? Well, if you’ve entered those forms at the R repl, close the png device

  dev.off()

and then (the currently active device being the SVG one)

  print(ggplot.london(fulltime/(allages-younger-older)))
dev.off()

default (cairo) SVG device

That produces an SVG file, and if SVG in and of itself is the goal, that’s great. But I would expect that the main reason for producing SVG isn’t so much for the format itself (though it is nice that it is a vector image format rather than rasterized, so that zooming in principle doesn’t cause artifacts) but for the ability to add scripting to it: and since the output SVG doesn’t retain any information about the underlying data that was used to generate it, it is very difficult to do anything meaningful with it.

I write “very difficult” rather than “impossible”, because in fact the SVGAnnotation package aimed to do just that: specifically, read the SVG output produced by the R SVG output device, and (with a bit of user assistance and a liberal sprinkling of heuristics) attempt to identify the regions of the plot corresponding to particular slices of datasets. Then, using a standard XML library, the user could decorate the SVG with extra information, add links or scripts, and essentially do whatever they needed to do; this was all wrapped up in an svgPlot function. The problem with this approach is that it is fragile: for example, one heuristic used to identify a lattice plot area was that there should be no text in it, which fails for custom panel functions with labelled guidlines. It is possible to override the default heuristic, but it’s difficult to build a robust system this way (and in fact when I tried to run some two-year old analysis routines recently, the custom SVG annotation that I wrote broke into multiple pieces given new data).

gridSVG’s approach is a little bit different. Instead of writing SVG out and reading it back in, it relies on the grid graphics engine (so does not work with so-called base graphics, the default graphics system in R), and on manipulating the grid object which represents the current scene. The gridsvg pseudo-graphics-device does the behind-the-scenes rendering for us, with some cost related to yet more wacky interactions with R’s argument evaluation semantics which we will pay later.

  gridsvg("/tmp/gridsvg-london.svg", width=16, height=10)
print(ggplot.london(fulltime/(allages-younger-older)))
dev.off()

Because ggplot uses grid graphics, this just works, and generates a much more structured svg file, which should render identically to the previous one:

SVG from gridSVG device

If it renders identically, why bother? Well, because now we have something that writes out the current grid scene, we can alter that scene before writing out the document (at dev.off() time). For example, we might want to add tooltips to the MSOAs so that their name and the quantity value can be read off by a human. Wrapping it all up into a function, we get

  gridsvg.london <- function(expr, subsetexpr=TRUE, filename="/tmp/london.svg") {

We need to compute the subset in this function, even though we’re going to be using the full dataset in ggplot.london when we call it, in order to get the values and zone labels.

      london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))

Then we need to map (pun mostly intended) the values in the fortified data frame to the polygons drawn; without delving into the format, my intuition is that the fortified data frame contains vertex information, whereas the grid (and hence SVG) data is organized by polygons, and there may be more than one polygon for a region (for example if there are islands in the Thames). Here we simply generate an index from a group identifier to the first row in the dataframe in that group, and use it to pull out the appropriate value and label.

      is <- match(levels(london.data$group), london.data$group)
    vals <- eval(substitute(expr), london.data)[is]
    labels <- levels(london.data$zonelabel)[london.data$zonelabel[is]]

Then we pay the cost of the argument evaluation semantics. My first try at this line was gridsvg(filename, width=16, height=10), which I would have (perhaps naïvely) expected to work, but which in fact gave me an odd error suggesting that the environment filename was being evaluated in was the wrong one. Calling gridsvg like this forces evaluation of filename before the call, so there should be less that can go wrong.

      do.call(gridsvg, list(filename, width=16, height=10))

And, as before, we have to do substitutions rather than evaluations to get the argument expressions evaluated in the right place:

      print(do.call(ggplot.london, list(substitute(expr), substitute(subsetexpr))))

Now comes the payoff. At this point, we have a grid scene, which we can investigate using grid.ls(). Doing so suggests that the map data is in a grid object named like GRID.polygon followed by an integer, presumably in an attempt to make names unique. We can “garnish” that object with attributes that we want: some javascript callbacks, and the values and labels that we previously calculated.

      grid.garnish("GRID.polygon.*",
                 onmouseover=rep("showTooltip(evt)", length(is)),
                 onmouseout=rep("hideTooltip()", length(is)),
                 zonelabel=labels, value=vals,
                 group=FALSE, grep=TRUE)

We need also to provide implementations of those callbacks. It is possible to do that inline, but for simplicity here we simply link to an external resource.

      grid.script(filename="tooltip.js")

Then close the gridsvg device, and we’re done!

      dev.off()
}

Then gridsvg.london(fulltime/(allages-younger-older)) produces:

proportion employed full-time

which is some kind of improvement over a static image for data of this complexity.

And yet... the perfectionist in me is not quite satisfied. At issue is a minor graphical glitch, but it’s enough to make me not quite content; the border of each MSOA is stroked in a slightly lighter colour than the fill colour, but that stroke extends beyond the border of the MSOA region (the stroke’s centre is along the polygon edge). This means that the strokes from adjacent MSOAs overlie each other, so that the most recently drawn obliterates any drawn previously. This also causes some odd artifacts around the edges of London (and into the Thames, and pretty much obscures the river Lea).

This can be fixed by clipping; I think the trick to clip a path to itself counts as well-known. But clipping in SVG is slightly hard, and the gridSVG facilities for doing it work on a grob-by-grob basis, while the map is all one big polygon grid object. So to get the output I want, I am going to have to perform surgery on the SVG document itself after all; we are still in a better position than before, because we will start with a sensible hierarchical arrangement of graphical objects in the SVG XML structure, and gridSVG furthermore provides some introspective capabilities to give XML ids or XPath query strings for particular grobs.

grid.export exports the current grid scene to SVG, returning a list with the SVG XML itself along with this mapping information. We have in the SVG output an arbitrary number of polygon objects; our task is to arrange such that each of those polygons has a clip mask which is itself. In order to do that, we need for each polygon a clipPath entry with a unique id in a defs section somewhere, where each clipPath contains a use pointing to the original polygon’s ID; then each polygon needs to have a clip-path style property pointing to the corresponding clipPath object. Clear?

  addClipPaths <- function(gridsvg, id) {

given the return value of grid.export and the identifier of the map grob, we want to get the set of XML nodes corresponding to the polygons within that grob.

      ns <- getNodeSet(gridsvg$svg, sprintf("%s/*", gridsvg$mappings$grobs[[id]]$xpath))

Then for each of those nodes, we want to set a clip path.

      for (i in 1:length(ns)) {
        addAttributes(ns[[i]], style=sprintf("clip-path: url(#clipPath%s)", i))
    }

For each of those nodes, we also need to define a clip path

      clippaths <- list()
    for (i in 1:length(ns)) {
        clippaths[[i]] <- newXMLNode("clipPath", attrs=c(id=sprintf("clipPath%s", i)))
        use <- newXMLNode("use", attrs = c("xlink:href"=sprintf("#%s", xmlAttrs(ns[[i]])[["id"]])))
        addChildren(clippaths[[i]], kids=list(use))
    }

And hook it into the existing XML

      defs <- newXMLNode("defs")
    addChildren(defs, kids=clippaths)
    top <- getNodeSet(gridsvg$svg, "//*[@id='gridSVG']")[[1]]
    addChildren(top, kids=list(defs))
}

Then our driver function needs some slight modifications:

  gridsvg.london2 <- function(expr, subsetexpr=TRUE, filename="/tmp/london.svg") {
    london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))
    is <- match(levels(london.data$group), london.data$group)
    vals <- eval(substitute(expr), london.data)[is]
    labels <- levels(london.data$zonelabel)[london.data$zonelabel[is]]

Until here, everything is the same, but we can’t use the gridsvg pseudo-graphics device any more, so we need to do graphics device handling ourselves:

      pdf(width=16, height=10)
    print(do.call(ggplot.london, list(substitute(expr), substitute(subsetexpr))))
    grid.garnish("GRID.polygon.*",
                 onmouseover=rep("showTooltip(evt)", length(is)),
                 onmouseout=rep("hideTooltip()", length(is)),
                 zonelabel=labels, value=vals,
                 group=FALSE, grep=TRUE)
    grid.script(filename="tooltip.js")

Now we export the scene to SVG,

      gridsvg <- grid.export()

find the grob containing all the map polygons,

      grobnames <- grid.ls(flatten=TRUE, print=FALSE)$name
    grobid <- grobnames[[grep("GRID.polygon", grobnames)[1]]]

add the clip paths,

      addClipPaths(gridsvg, grobid)
    saveXML(gridsvg$svg, file=filename)

and we’re done!

      dev.off()
}

Then gridsvg.london2(fulltime/(allages-younger-older)) produces:

proportion employed full-time (with polygon clipping)

and I leave whether the graphical output is worth the effort to the beholder’s judgment.

As before, these images contain National Statistics and Ordnance Survey data © Crown copyright and database right 2012.

Syndicated 2014-07-31 17:07:34 (Updated 2014-07-31 17:14:34) from notes

19 Jul 2014 (updated 19 Jul 2014 at 21:12 UTC) »

london employment visualization

I went to londonR a month or so ago, and the talk whose content stayed with me most was the one given by Simon Hailstone on map-making and data visualisation with ggplot2. The take-home message for me from that talk was that it was likely to be fairly straightforward to come up with interesting visualisations and data mashups, if you already know what you are doing.

Not unrelatedly, in the Goldsmiths Data Science MSc programme which we are running from October, it is quite likely that I will be teaching a module in Open Data, where the students will be expected to produce visualisations and mashups of publically-accessible open data sets.

So, it is then important for me to know what I am doing, in order that I can help the students to learn to know what they are doing. My first, gentle steps are to look at life in London; motivated at least partly by the expectation that I encounter when dropping off or picking up children from school that I need more things to do with my time (invitations to come and be trained in IT at a childrens' centre at 11:00 are particularly misdirected, I feel) I was interested to try to find out how local employment varies. This blog serves at least partly as a stream-of-exploration of not just the data but also ggplot and gridsvg; as a (mostly satisfied) user of lattice and SVGAnnotation in a previous life, I'm interested to see how the R graphing tools have evolved.

Onwards! Using R 3.1 or later: first, we load in the libraries we will need.

  library(ggplot2)
library(extremevalues)
library(maps)
library(rgeos)
library(maptools)
library(munsell)
library(gridSVG)

Then, let's define a couple of placeholder lists to accumulate related data structures.

  london <- list()
tmp <- list()

Firstly, let's get hold of some data. I want to look at employment; that means getting hold of counts of employed people, broken down by area, and also counts of people living in that area in the first place. I found some Excel spreadsheets published by London datastore and the Office for National Statistics, with data from the 2011 census, and extracted from them some useful data in CSV format. (See the bottom of this post for more detail about the data sources).

  london$labour <- read.csv("2011-labour.csv", header=TRUE, skip=1)
london$population <- read.csv("mid-2011-lsoa-quinary-estimates.csv", header=TRUE, skip=3)

The labour information here does include working population counts by area – specifically by Middle Layer Super Output Area (MSOA), a unit of area holding roughly eight thousand people. Without going into too much detail, these units of area are formed by aggregating Lower Layer Super Output Areas (LSOAs), which themselves are formed by aggregating census Output Areas. In order to get the level of granularity we want, we have to do some of that aggregation ourselves.

The labour data is published in one sheet by borough (administrative area), ward (local electoral area) and LSOA. The zone IDs for these areas begin E09, E05 and E01; we need to manipulate the data frame we have to include just the LSOA entries

  london$labour.lsoa <- droplevels(subset(london$labour, grepl("^E01", ZONEID)))

and then aggregate LSOAs into MSOAs “by hand”, by using the fact that LSOAs are named such that the name of their corresponding MSOA is all but the last letter of the LSOA name (e.g. “Newham 001A” and “Newham 001B” LSOAs both have “Newham 001” as their parent MSOA). In doing the aggregation, all the numeric columns are count data, so the correct aggregation is to sum corresponding entries of numerical data (and discard non-numeric ones)

  tmp$nonNumericNames <- c("DISTLABEL", "X", "X.1", "ZONEID", "ZONELABEL")
tmp$msoaNames <- sub(".$", "", london$labour.lsoa$ZONELABEL)
tmp$data <- london$labour.lsoa[,!names(london$labour.lsoa) %in% tmp$nonNumericNames]

london$labour.msoa <- aggregate(tmp$data, list(tmp$msoaNames), sum)
tmp <- list()

The population data is also at LSOA granularity; it is in fact for all LSOAs in England and Wales, rather than just London, and includes some aggregate data at a higher level. We can restrict our attention to London LSOAs by reusing the zone IDs from the labour data, and then performing the same aggregation of LSOAs into MSOAs.

  london$population.lsoa <- droplevels(subset(london$population, Area.Codes %in% levels(london$labour.lsoa$ZONEID)))
tmp$nonNumericNames <- c("Area.Codes", "Area.Names", "X")
tmp$msoaNames <- sub(".$", "", london$population.lsoa$X)
tmp$data <- london$population.lsoa[,!(names(london$population.lsoa) %in% tmp$nonNumericNames)]
london$population.msoa <- aggregate(tmp$data, list(tmp$msoaNames), sum)
tmp <- list()

This is enough data to be getting on with; it only remains to join the data frames up. The MSOA name column is common between the two tables; we can therefore reorder one of them so that it lines up with the other before simply binding columns together:

  london$employment.msoa <- cbind(london$labour.msoa, london$population.msoa[match(london$labour.msoa$Group.1, london$population.msoa$Group.1)])

(In practice in this case, the order of the two datasets is the same, so the careful reordering is in fact the identity transformation)

  tmp$notequals <- london$population.msoa$Group.1 != london$labour.msoa$Group.1
stopifnot(!sum(tmp$notequals))
tmp <- list()

Now, finally, we are in a position to look at some of the data. The population data contains a count of all people in the MSOA, along with counts broken down into 5-year age bands (so 0-5 years old, 5-10, and so on, up to 90+). We can take a look at the proportion of each MSOA's population in each age band:

  ggplot(reshape(london$population.msoa, varying=list(3:21),
               direction="long", idvar="Group.1")) +
    geom_line(aes(x=5*time-2.5, y=X0.4/All.Ages, group=Group.1),
              alpha=0.03, show_guide=FALSE) +
    scale_x_continuous(name="age / years") +
    scale_y_continuous(name="fraction") +
    theme_bw()

age distributions in London MSOAs

This isn't a perfect visualisation; it does show the broad overall shape of the London population, along with some interesting deviations (some MSOAs having a substantial peak in population in the 20-25 year band, while rather more peak in the 25-30 band; a concentration around 7% of 0-5 year olds, with some significant outliers; and a tail of older people with some interesting high outliers in the 45-50 and 60-65 year bands, as well as one outlier in the 75-80 and 80-85).

That was just population; what about the employment statistics? Here's a visualisation of the fraction of people in full-time employment, treated as a distribution

  ggplot(london$employment.msoa, aes(x=Full.time/All.Ages)) +
    geom_density() + theme_bw()

fraction in full-time employment in London MSOAs

showing an overal concentration of full-time employment level around 35%. But this of course includes in the population estimate people of non-working ages; it may be more relevant to view the distribution of full-time employment as a fraction of those of “working age”, which here we will define as those between 15 and 65 (not quite right but close enough):

  london$employment.msoa$younger = rowSums(london$employment.msoa[,62:64])
london$employment.msoa$older = rowSums(london$employment.msoa[,75:80])
ggplot(london$employment.msoa, aes(x=Full.time/(All.Ages-younger-older))) +
    geom_density() + theme_bw()

fraction of those of working age in full-time employment in London MSOAs

That's quite a different picture: we've lost the bump of outliers of high employment around 60%, and instead there's substantial structure at lower employment rates (plateaus in the density function around 50% and 35% of the working age population). What's going on? The proportions of people not of working age must not be evenly distributed between MSOAs; what does it look like?

  ggplot(london$employment.msoa, aes(x=older/All.Ages, y=younger/All.Ages)) +
    geom_point(alpha=0.1) + geom_density2d() + coord_equal() + theme_bw()

distribution of younger vs older people in London MSOAs

So here we see that a high proportion of older people in an MSOA (around 20%) is a strong predictor of the proportion of younger people (around 17%); similarly, if the younger population is 25% or so of the total population, there are likely to be less than 10% of people of older non-working ages. But at intermediate values of either older or younger proportions, 10% of older or 15% of younger, the observed variation of the other proportion is very high: what this says is that the proportion of people of working age in an MSOA varies quite strongly, and for a given proportion of people of working age, some MSOAs have more older people and some more younger people. Not a world-shattering discovery, but important to note.

So, going back to employment: what are the interesting cases? One way of trying to characterize them is to fit the central portion of the observations to some parameterized distribution, and then examining cases where the observation is of a sufficiently low probability. The extremevalues R package does some of this for us; it offers the option to fit a variety of distributions to a set of observations with associated quantiles. Unfortunately, the natural distribution to use for proportions, the Beta distribution, is not one of the options (fitting involves messing around with derivatives of the inverse of the incomplete Beta function, which is a bit of a pain), but we can cheat: instead of doing the reasonable thing and fitting a beta distribution, we can transform the proportion using the logit transform and fit a Normal distribution instead, because everything is secretly a Normal distribution. Right? Right? Anyway.

  tmp$outliers <- with(london$employment.msoa, getOutliers(log(Full.time/All.Ages/(1-Full.time/All.Ages))))
tmp$high <- london$employment.msoa$Group.1[tmp$outliers$iRight]
tmp$low <-london$employment.msoa$Group.1[tmp$outliers$iLeft]
print(tmp[c("high", "low")])
tmp <- list()

This shows up 9 MSOAs as having a surprisingly high full employment rate, and one (poor old Hackney 001) as having a surprisingly low rate (where “surprising” is defined relative to our not-well-motivated-at-all Normal distribution of logit-transformed full employment rates, and so shouldn't be taken too seriously). Where are these things? Clearly the next step is to put these MSOAs on a map. Firstly, let's parse some shape files (again, see the bottom of this post for details of where to get them from).

  london$borough <- readShapePoly("London_Borough_Excluding_MHW")
london$msoa <- readShapePoly("MSOA_2011_London_gen_MHW")

In order to plot these shape files using ggplot, we need to convert the shape file into a “fortified” form:

  london$borough.fortified <- fortify(london$borough)
london$msoa.fortified <- fortify(london$msoa)

and we will also need to make sure that we have the MSOA (and borough) information as well as the shapes:

  london$msoa.fortified$borough <- london$msoa@data[as.character(london$msoa.fortified$id), "LAD11NM"]
london$msoa.fortified$zonelabel <- london$msoa@data[as.character(london$msoa.fortified$id), "MSOA11NM"]

For convenience, let's add to the “fortified” MSOA data frame the columns that we've been investigating so far:

  tmp$is <- match(london$msoa.fortified$zonelabel, london$employment.msoa$Group.1)
london$msoa.fortified$fulltime <- london$employment.msoa$Full.time[tmp$is]
london$msoa.fortified$allages <- london$employment.msoa$All.Ages[tmp$is]
london$msoa.fortified$older <- london$employment.msoa$older[tmp$is]
london$msoa.fortified$younger <- london$employment.msoa$younger[tmp$is]

And now, a side note about R and its evaluation semantics. R has... odd semantics. The basic evaluation model is that function arguments are evaluated lazily, when the callee needs them, but in the caller's environment – except if the argument is a defaulted one (when no argument was actually passed), in which ase it's in the callee's environment active at the time of evaluation, except that these rules can be subverted using various mechanisms, such as simply looking at the call object that is currently being evaluated. There's significant complexity lurking, and libraries like ggplot (and lattice, and pretty much anything you could mention) make use of that complexity to give a convenient surface interface to their functions. Which is great, until the time comes for someone to write a function that calls these libraries, such as the one I'm about to present which draws pictures of London. I have to know about the evaluation semantics of the functions that I'm calling – which I suppose would be reasonable if there weren't quite so many possible options – and necessary workarounds that I need to install so that my function has the right, convenient evaluation semantics while still being able to call the library functions that I need to. I can cope with this because I have substantial experience with programming languages with customizeable evaluation semantics; I wonder how to present this to Data Science students.

With all that said, I present a fairly flexible function for drawing bits of London. This presentation of the finished object of course obscures all my exploration of things that didn't work, some for obvious reasons and some which remain a mystery to me.

  ggplot.london <- function(expr, subsetexpr=TRUE) {

We will call this function with an (unevaluated) expression to be displayed on a choropleth, to be evaluated in the data frame london$msoa.fortified, and an optional subsetting expression to restrict the plotting. We compute the range of the expression, independent of any data subsetting, so that we have a consistent scale between calls

      expr.limits <- range(eval(substitute(expr), london$msoa.fortified))

and compute a suitable default aesthetic structure for ggplot

      expr.aes <- do.call(aes, list(fill=substitute(expr), colour=substitute(expr)))

where substitute is used to propagate the convenience of an expression interface in our function to the same expression interface in the ggplot functions, and do.call is necessary because aes does not evaluate its arguments, while we do need to perform the substitution before the call.

      london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))

This is another somewhat complicated set of calls to make sure that the subset expression is evaluated when we need it, in the environment of the data frame. london.data at this point contains the subset of the MSOAs that we will be drawing.

      (ggplot(london$borough.fortified) + coord_equal() +

Then, we set up the plot, specify that we want a 1:1 aspect ratio,

       (theme_minimal() %+replace%
      theme(panel.background = element_rect(fill="black"),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())) +

a black background with no grid lines,

       geom_map(map=london.data, data=london.data, modifyList(aes(map_id=id, x=long, y=lat), expr.aes), label="map") +

and plot the map.

       scale_fill_gradient(limits=expr.limits, low=mnsl2hex(darker(rgb2mnsl(coords(hex2RGB("#13432B", TRUE))))), high=mnsl2hex(darker(rgb2mnsl(coords(hex2RGB("#56F7B1", TRUE)))))) +
     scale_colour_gradient(limits=expr.limits, low="#13432B", high="#56F7B1", guide=FALSE) +

We specify a greenish colour scale, with the fill colour being slightly darker than the outline colour for prettiness, and

       geom_map(map=london$borough.fortified, aes(map_id=id, x=long, y=lat), colour="white", fill=NA, size=0.1)

also draw a superposition of borough boundaries to help guide the eye.

       )
}

So, now, we can draw a map of full time employment:

  ggplot.london(fulltime/(allages-younger-older))

proportion employed full-time

or the same, but just the outlier MSOAs:

  ggplot.london(fulltime/(allages-younger-older), zonelabel %in% london$outliers)

proportion employed full-time (outlier MSOAs)

or the same again, plotting only those MSOAs which are within 0.1 of the extremal values:

  ggplot.london(fulltime/(allages-younger-older), pmin(fulltime/(allages-younger-older) - min(fulltime/(allages-younger-older)), max(fulltime/(allages-younger-older)) - fulltime/(allages-younger-older)) < 0.1)

proportion employed full-time (near-extremal MSOAs)

Are there take-home messages from this? On the data analysis itself, probably not; it's not really a surprise to discover that there's substantial underemployment in the North and East of London, and relatively full employment in areas not too far from the City of London; it is not unreasonable for nursery staff to ask whether parents dropping children off might want training; it might be that they have the balance of probabilities on their side.

From the tools side, it seems that they can be made to work; I've only really been kicking the tyres rather than pushing them hard, and the number of times I had to rewrite a call to ggplot, aes or one of the helper functions as a response to an error message, presumably because of some evaluation not happening at quite the right time, is high – but the flexibility and interactivity they offer for exploration feels to me as if the benefits outweigh the costs.

The R commands in this post are here; the shape files are extracted from Statistical GIS Boundary Files for London and redistributed here under the terms of the Open Government Licence; they contain National Statistics and Ordnance Survey data © Crown copyright and database right 2012. The CSV data file for population counts is derived from an ONS publication, distributed here under the terms of the Open Government Licence, and contains National Statistics data © Crown copyright and database right 2013; the CSV data file for labour statistics is derived from a spreadsheet published by the London datastore, distributed here under the terms of the Open Government Licence, and contains National Statistics data © Crown copyright and database right 2013.

Still here after all that licence information? I know I promised some gridSVG evaluation too, but it's getting late and I've been drafting this post for long enough already. I will come back to it, though, because it was the more fun bit, and it makes some nifty interactive visualizations fairly straightforward. Watch this space.

Syndicated 2014-07-19 20:12:00 (Updated 2014-07-19 20:18:42) from notes

off to tours

I’m on a train. And, more than usually, I don’t know what I’m doing.

Because of my involvement with Transforming Musicology, I’ve been invited to participate in a seminar on Renaissance musicat the Centre d'Études Supérieurs de la Renaissance at Tours. This is a real privilege, and the guest list is a bit intimidating – I went through a French Baroque phase 20 years ago, and it’s therefore nice to be able to bring the book on Marc-Antoine Charpentier to get it autographed by the author.

But if the guest list is intimidating, that’s nothing compared to the format. It has been made very clear that this is not a conference – we don’t have to present research we've already done, but rather our plans for future research that might be of interest to the assembled community. And it will be really interesting for me to hear that, of course, but I will of course be called on to justify my own presence there.

I think I might have two stories to tell. The first is about tools for directly improving the musicologist’s life, with the exemplar I would use being the Gombert/Josquin case study that, depressingly, remains aspirational rather than actual. Here, although more of the pieces are in place than previously, and I do have a complete workflow for examining individual fragments, the thing that is holding up the investigation is that aggregation of search results for those individual fragments into a useful summary suitable for digestion is more challenging and involves more time for concentrated thought than I currently have available.

The second story involves Bernardino de Ribera, and would aim to tell the musicological story from as near as possible to the start – Bruno Turner’s discovery of manuscripts at Toledo and Valencia – to finish, a professional-quality recording. The hope here is that by making the intermediate “data” available (electronic versions of manuscripts, critical editions, recording sessions) as well as links to the commercial performing editions and recording release, we enable a wider community of researchers and enthusiasts to engage with the material, enriching the commons. I should make clear that this story, too, is aspirational; I do not yet have clearance for releasing /any/ of this, and there are many hurdles to overcome before that clearance is given – but here this is intended as a case study of how musicologists and associated parties can publish their work and make them accessible to allow more people to engage with their work. This might address the fears of some that archive-based musicology is a dying art, by making the fruits of the labour much more visible.

The common theme underlying these two stories is, for me, the validation of new knowledge. I am not one who has completely bought in to the “impact” agenda currently driving a fair amount of the evaluation of academic utility in the UK, but I do perhaps take a harder line about the creation of knowledge than some of my colleagues at Goldsmiths, where I strongly feel that new knowledge itself should be communicable to others and not just used in the context of artistic practice. (I should say that my own line has softened quite substantially in the last decade; it is entirely possible that given a few more I will be as arty as the next Goldsmiths academic). The validation of the new knowledge does not have to be fully quantitative or “scientific” – I have certainly mellowed enough to allow that – but there has to be some kind of proof of the pudding, and I strongly think that that proof should be able to be entered into the permanent record and made as accessible as possible, and that this is orthogonal to making a value judgment on individuals or fields of research.

[ update: I’ve arrived. The weather is horrible, but the natives are friendly ]

Syndicated 2014-05-27 08:33:38 from notes

15 May 2014 (updated 17 May 2014 at 20:13 UTC) »

languages for teaching programming

There’s seems to be something about rainy weekends in May that stimulates academics in Computing departments to have e-mail discussions about programming languages and teaching. The key ingredients probably include houseboundness, the lull between the end of formal teaching and the start of exams, the beginning of contemplation of next year’s workload, and enthusiasm; of course, the academic administrative timescale is such that any changes that we contemplate now, in May 2014, can only be put in place for the intake in September 2015... if you’ve ever wondered why University programmes in fashionable subjects seem to lag about two years behind the fashion (e.g. the high growth of Masters programmes in “Financial Engineering” or similar around 2007 – I suspect that demand for paying surprisingly high tuition fees for a degree in Synthetic Derivative Construction weakened shortly after those programmes came on stream, but I don’t actually have the data to be certain).

Meanwhile, I was at the European Lisp Symposium in Paris last week, where there was a presentation of a very apposite nature: the Computing Department at Middlesex university has implemented an integrated first-year of undergraduate teaching, covering a broad range of the computing curriculum (possibly not as broad as at Goldsmiths, though) through an Arduino and Raspberry Pi robot with a Racket-based programmer interface. Students’ progress is evaluated not through formal tests, courseworks or exams, but through around 100 binary judgments in the natural context of “student observable behaviours” at three levels (‘threshold’, which students must exhibit to progress to the second year; ‘typical’, and ‘excellent’).

This approach has a number of advantages, I think, over a more traditional division of the year into four thirty-credit modules (e.g. Maths, Java, Systems, and Profession): for one, it pretty much guarantees a coherent approach to the year, where in the divided modules case it is surprisingly easy for one module to be updated in syllabus or course content without adjustments to the others, leaving for example some programming material insufficiently supported by the maths (and some maths taught without motivation). The assessment method is in principle transparent to the students, who know what they have to do to progress (and to get better marks); I'm not convinced that this is always a good thing, but for an introductory and core course I think the benefits substantially outweigh the disadvantages. The use of Racket as the teaching language has an equalising effect – it’s unlikely that students will have prior experience with it, so everyone starts off at the same point at least with respect to the language – and the use of a robot provides visceral feedback and a sense of achievement when it is made to do something in a way that text and even pixels on a screen might not. (This feedback and tangible sense of achievement is perhaps why the third-year option of Physical Computing at Goldsmiths is so popular: often oversubscribed by a huge margin).

With these thoughts bubbling around in my head, then, when the annual discussion kicked off at the weekend I decided to try to articulate my thoughts in a less ephemeral way than in the middle of a hydra-like discussion: so I wrote a wiki page, and circulated that. One of the points of having a personal wiki is that the content on it can evolve, but before I eradicate the evidence of what was there before, and since it got at least one response (beyond “why don’t you allow comments on your wiki?”) it's worth trying to continue the dialogue.

Firstly, Chris Cannam pulls me up on not including an Understand goal, or one like it: teaching students to understand and act on their understanding of computing artifacts, hardware and software. I could make the argument that this lives at the intersection of my Think and Experiment goals, but I think that would be retrospective justification and that there is a distinct aim there. I’m not sure why I left it out; possibly, I am slightly hamstrung in this discussion about pedagogy by a total absence of formal computing education; one course in fundamentals of computing as a 17-year-old, and one short course on Fortran and numerical methods as an undergraduate, and that’s it. It's in some ways ironic that I left out Understand, given that in my use of computers as a hobby it’s largely what I do: Lisp software maintenance is often a cross between debugger-oriented programming and software archaeology. But maybe that irony is not as strong as it might seem; I learnt software and computing as a craft, apprenticed to a master; I learnt the shibboleths (“OAOO! OAOO! OAOO!”) and read the training manuals, but I learnt by doing, and I’m sure I’m far from alone, even in academic Computing let alone among programmers or computing professionals more generally.

Maybe the Middlesex approach gets closer, in the University setting, to the apprenticeship (or pre-apprenticeship) period; certainly, there are clear echoes in their approach of the switch by MIT from the SICP- and Scheme-based 6.001 to the Robotics- and Python-based introduction to programming – and Sussman’s argument from the time of the switch points to a qualitative difference in the role of programmers, which happens to dovetail with current research on active learning. In my conversation with the lecturers involved after their presentation, they said that the students take a more usual set of languages in their second-years (Java, C++); obviously, since this is the first year of their approach, they don’t yet know how the transition will go.

And then there was the slightly cynical Job vs Career distinction that I drew, being the difference between a graduate-level job six months after graduation as distinct from a fulfilling career. One can of course lead to the other, but it’s by no means guaranteed, and I would guess that if asked most people would idealistically say we as teachers in Universities should be attempting to optimize for the latter. Unfortunately, we are measured in our performance in the former; one of the “Key Information Sets” collected by higher-education agencies and presented to prospective undergraduates is the set of student ‘destinations’. Aiming to optimize this statistic is somewhat akin to schools optimizing their GCSE results with respect to the proportion of pupils gaining at least 5 passes: ideally, the measurement should be a reflection of the pedagogical practice, but the foreknowledge of the measurement has the potential to distort the allocation of effort and resources. In the case of the school statistic, there’s evidence that extra effort is concentrated on borderline pupils, at the expense of both the less able and potential high-fliers; the distortion isn’t so stark in the case of programming languages, because the students have significant agency between teaching and measurement, but there is certainly pressure to ensure that we teach the most common language used in assessment centres.

In Chris’ case, C++ might have had a positive effect on both; the programming language snob in me, though, wants to believe that there are hordes of dissatisfied programmers out there, having got a job using their competence in some “industry-standard” language, desperately wanting to know about a better way of doing things. I might be overprojecting the relationship of a professional programmer with their tools, of course: for many a career in programming and development is just that, rather than a love-hate relationship with their compiler and linker. (Also, there’s the danger of showing the students that there is a better way, but that the market doesn’t let them use it...)

Is it possible to conclude anything from all of this? Well, as I said in my initial thoughts, this is an underspecified problem; I think that a sensible decision can only be taken in practice once the priorities for teaching programming at all have been established, in combination with the resources available for delivering teaching. I’m also not enough of a polyglot in practice to offer a mature judgment on many fashionable languages; I know enough of plenty of languages to modify and maintain code, but am comfortable writing from scratch in far fewer.

So with all those caveats, an ideal programming curriculum (reflecting my personal preferences and priorities) might include in its first two years: something in the Lisp family, for Think and Experiment; ARM Assembler, for Think and Understand; C++, for Job and Career (and all three together for Society). Study should probably be handled in individual third-year electives, and I would probably also include a course hybrid between programming, database and system administration to cover a Web programming stack for more Job purposes. Flame on (but not here, because system administration is hard).

Syndicated 2014-05-15 19:50:23 (Updated 2014-05-17 19:45:54) from notes

just like old times

Sometimes, it feels like old times, rolling back the years, and so on. One of the things I did while avoiding work on my PhD (on observational tests of exotic cosmologies inspired by then-fashionable high-energy physics theories, since you ask) was to assist in resurrecting old and new CMUCL backends for SBCL, and to do some additional backend hacking. I learnt a lot! (To be fair, I learnt a lot about General Relativity too). And I even wrote about some of the funny bits when porting, such as getting to a working REPL with a completely broken integer multiply on SPARC, or highly broken bignums when working on a proper 64-bit backend for alpha.

On the #sbcl IRC channel over the last few days, the ARM porting crew (not me! I've just been kibitzing) has made substantial progress, to the point that it's now possible to boot a mostly-working REPL. Some of the highlights of the non-workingness:

  (truncate (expt 2 32) 10)
21074691228794313799883120286105
6

(fib 50)
6277101738309684039858724727853387073209631205623600462197

(expt 2.0 3)
2.0

(sb-ext:run-program "uname" '() :search t :output *standard-output*)
Linux
Memory fault at 3c

Meanwhile, given that most ARM cores don't have an integer division instruction, it's a nice bit of synchronicity that one of this year's Summer of Code projects for SBCL is to improve division by constant integers; I'm very much looking forward to getting involved with that, and hopefully also bringing some of the next generation of SBCL hackers into the community. Maybe my writing a while back about a self-sustaining system wasn't totally specious.

Syndicated 2014-04-25 16:43:10 from notes

189 older entries...

 

crhodes certified others as follows:

  • crhodes certified crhodes as Journeyer
  • crhodes certified dan as Master
  • crhodes certified wnewman as Master
  • crhodes certified fufie as Journeyer
  • crhodes certified moray as Apprentice
  • crhodes certified mjg59 as Master
  • crhodes certified adw as Apprentice
  • crhodes certified bmastenbrook as Journeyer
  • crhodes certified magnusjonsson as Journeyer
  • crhodes certified danstowell as Journeyer

Others have certified crhodes as follows:

  • crhodes certified crhodes as Journeyer
  • rjain certified crhodes as Journeyer
  • pvaneynd certified crhodes as Journeyer
  • fufie certified crhodes as Journeyer
  • mwh certified crhodes as Journeyer
  • cmm certified crhodes as Master
  • dan certified crhodes as Master
  • jao certified crhodes as Master
  • varjag certified crhodes as Journeyer
  • fxn certified crhodes as Journeyer
  • jtjm certified crhodes as Journeyer
  • mk certified crhodes as Journeyer
  • mjg59 certified crhodes as Journeyer
  • adw certified crhodes as Journeyer
  • tbmoore certified crhodes as Master
  • mdanish certified crhodes as Master
  • negative certified crhodes as Journeyer
  • hanna certified crhodes as Journeyer
  • Fare certified crhodes as Master
  • moray certified crhodes as Journeyer
  • richdawe certified crhodes as Journeyer
  • pjcabrera certified crhodes as Journeyer
  • water certified crhodes as Master
  • ingvar certified crhodes as Journeyer
  • ebf certified crhodes as Journeyer
  • bmastenbrook certified crhodes as Master
  • xach certified crhodes as Master
  • kai certified crhodes as Master
  • davep certified crhodes as Master
  • bhyde certified crhodes as Master
  • cyrus certified crhodes as Master
  • technik certified crhodes as Master
  • danstowell certified crhodes as Journeyer
  • cannam certified crhodes as Journeyer

[ Certification disabled because you're not logged in. ]

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!

X
Share this page