Older blog entries for danstowell (starting at number 48)

Chutney from 2004

We found this jar of chutney in the back of the cupboard - dated 24/8/04, made and kindly given to us by Lucy and Gen:

So, eight years on, how is it? Nice - a sour and tamarindy taste, and it complemented the Cheshire cheese we had for lunch very well. (I do hope tamarind was one of the ingredients!) I wonder how much it has changed since 2004...

Syndicated 2013-02-10 10:13:37 (Updated 2013-02-10 10:45:40) from Dan Stowell

Update on GM-PHD filter (with Python code)

Note: I drafted this a while back but didn't get round to putting it on the blog. Now that I've published code and a paper about the GM-PHD filter, I thought these practical insights might be useful:

I've been tweaking the GM-PHD filter which I blogged about recently. (Gaussian mixture PHD is a GM implementation of the Probability Hypothesis Density filter, for tracking multiple objects in a set of noisy observations.)

I think there are some subtleties to it which are not immediately obvious from the research articles.

Also, I've published my open source GM-PHD Python code so if anyone finds it useful (or has patches to contribute) I'd be happy. There's also a short research paper about using the GM-PHD filter for multi-pitch tracking.

In that original blog post I said the results were noisier than I was hoping. I think there are a couple of reasons for this:

  • The filter benefits from a high-entropy representation and a good model of the target's movement. I started off with a simple 1D collection of particles with fixed velocity, and in my modelling I didn't tell the GM-PHD about the velocity - I just said there was position with some process noise and observation noise. Well, if I update this so the model knows about velocity too, and I specify the correct linear model (i.e. position is updated by adding the velocity term on to it), the results improve a little (there's a sketch of such a constant-velocity model just after this list). I was hoping that I could be a bit more generic than that. It may also be that my 1D example is too low-complexity, and a 2D example would give it more to focus on. Whatever happened to "keep it simple"?!

  • The filter really benefits from knowing where targets are likely to come from. In the original paper, the simulation examples are of objects coming from a fixed small number of "air bases" and so they can be tracked as soon as they "take off". If I'm looking to model audio, then I don't know what frequency things will start from; there's no strong model for that. So, I can give it a general "things can come from anywhere" prior, but that leads to the burn-in problem that I mentioned in my first blog post - targets will not accumulate much evidence for themselves until many frames have elapsed. (It also adds algorithmic complexity, see below.)

  • Cold-start problem: the model doesn't include anything about pre-existing targets that might already be in the space, before the first frame (i.e. when the thing is "turned on"). It's possible to account for this slightly hackily by using a boosted "birth" distribution when processing the first frame, but this can't answer the question of how many objects to expect in the first frame - so you'd have to add a user parameter. It would be nice to come up with a neat closed-form way to decide what the steady-state expectation should be. (You can't just burn it in by running the thing with no observations for a while before you start - "no observations" is expressed as "empty set", which the model takes to mean definitely nothing there rather than ignorance. Ignorance would be expressed as an equal distribution over all possible observation sets, which is not something you can just drop in to the existing machinery.)
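
For concreteness, here's a minimal sketch (not the published gmphd code) of the kind of constant-velocity linear model the first bullet refers to - the state is [position, velocity], and the noise values are made up:

    import numpy as np

    dt = 1.0                      # time between frames (assumed)
    F = np.array([[1.0, dt],      # position += velocity * dt
                  [0.0, 1.0]])    # velocity carries over unchanged
    H = np.array([[1.0, 0.0]])    # we observe position only
    Q = 0.01 * np.eye(2)          # process noise covariance (made-up value)

    # GM-PHD prediction step for one Gaussian component (weight, mean, covariance):
    def predict_component(w, m, P, p_survival=0.99):
        return p_survival * w, F @ m, F @ P @ F.T + Q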

One mild flaw I spotted is in the pruning algorithm. Pruning is needed because without it the number of Gaussian components would grow exponentially, so to keep things manageable you want to cap it at some maximum at each step. However, the pruning algorithm given in the paper is a bit arbitrary, and in particular it fails to maintain the total sum of weights. It chops off low-weight components and doesn't assign their lost weight to any of the survivors. This matters because the sum of weights in a GM-PHD filter is essentially the estimated number of tracked objects. If you have a strong clean signal then it'll get over this flaw, but if not, you'll be leaking density out of your model at every step. So in my own code I renormalise the total mass after simplification - a simple change, hopefully a good one.
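
In case it's useful, here's a minimal sketch of that pruning-with-renormalisation idea (a simplification of what my code does - the usual merging of nearby components is omitted):

    def prune(components, truncthresh=1e-6, maxcomponents=100):
        """components: a list of (weight, mean, cov) tuples."""
        totalmass = sum(w for (w, m, P) in components)         # the expected number of targets
        kept = [c for c in components if c[0] > truncthresh]
        kept = sorted(kept, key=lambda c: c[0], reverse=True)[:maxcomponents]
        keptmass = sum(w for (w, m, P) in kept)
        scale = totalmass / keptmass if keptmass > 0 else 1.0  # give the lost weight back to the survivors
        return [(w * scale, m, P) for (w, m, P) in kept]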

And a note about runtime: the size of the birth GMM strongly affects the running speed of the model. If you read through the description of how it works, this might not be obvious, because the "pruning" is supposed to keep the number of components within a fixed limit, so you might think the filter scales fine. However, if the birth GMM has many components, then they all must be cross-fertilised with each observation point at every step, and then pruned afterwards, so even if they don't persist they are still in action for the CPU-heavy part of the process. (The complexity has a kind of dependence on number-of-observations * number-of-birth-Gaussians.) If like me you have a model where you don't know where tracks will be born from, then you need many components to represent a flat distribution. (In my tests, using a single very wide Gaussian led to unpleasant bias towards the Gaussian's centre, no matter how wide I spread it.)
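
To illustrate that last point, here's roughly what I mean by a "flat" birth GMM: many narrow components tiled evenly across the (1D) state space rather than one very wide one. A sketch only - the numbers are arbitrary:

    import numpy as np

    def flat_birth_gmm(lo, hi, ncomponents=20, totalbirthweight=0.05):
        """Tile the 1D position range [lo, hi] with evenly-spaced narrow Gaussians."""
        centres = np.linspace(lo, hi, ncomponents)
        width = (hi - lo) / ncomponents
        w = totalbirthweight / ncomponents           # spread the birth mass evenly
        return [(w, np.array([c]), np.array([[width ** 2]])) for c in centres]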

Syndicated 2013-02-05 05:22:51 (Updated 2013-02-05 05:24:22) from Dan Stowell

Google Map Maker

Just found this lovely website about Google Map Maker.

Syndicated 2013-02-02 09:10:01 from Dan Stowell

OpenStreetMap: animated dataviz of edits per year

Another iteration of my visualisation of OpenStreetMap edits - here's an animation showing, for each year 2005-2012, the density of edits according to their geographic location:

The upper plot is the raw edit density. The lower one (which I think is more illuminating) is the edit density per unit population, as described in a previous post (with source code).

So what can you see? Well, both of them show the humble London-centred beginnings in 2005, followed by solid growth until the whole world is filled out. I think the lower plot more clearly shows when the "filling out" happens. 2007 is the year OpenStreetMap "goes global" but 2009 is the year it levels out. Before 2009, the edits-per-population are very variable, but from 2009 onwards the picture is much whiter and there's not much annual change in the colouring. This means the distribution of edits much more closely fits the population distribution, though (as noted last time) central Africa and around China are relatively underrepresented.

Syndicated 2013-01-17 06:09:36 (Updated 2013-01-17 09:14:33) from Dan Stowell

The best recipes of 2012

If you follow my @nomnomdan feed you'll have seen me trying out various recipes throughout the year. As we go into 2013 it occurred to me to wonder, what are my best recipes of 2012, things I definitely want to make again?

So here they are, the top hits of 2012. The recipes are mostly not invented by me but each one of them is gorgeous, and they're not very difficult - go on, try at least one of them:

Fish:

  • Poached sea bass, Thai style - this is a beautiful way to treat sea bass; and so eeeeeeasy. I prefer to use noodles rather than rice, and it means you don't need any advance preparation.

  • Mackerel (or herring), de-spined and coated in English mustard then in oats, fried 3 mins each side. Served with kale & spuds. The combination of the oily fish with the hot mustard is surprisingly lovely, and kale is a great complement to it. (Kale has always been a bit of an odd-one-out in the past, but here it goes really well.)

Meat:

  • Chairman Mao's red braised pork - apparently a famous Chinese recipe, and it's lovely. I would never have thought to use the sugar to make it caramely sticky, but it works great. I didn't use pork belly but something a little bit less unhealthy.

  • Honey chicken with hoi-sin plantain - sounds like a crazy combination, but again, delicious, and this is very very easy with only a handful of ingredients.

Salad:

  • Lime tabbouleh - always wanted to be able to make this at home, now I know how.

Afters:

  • Rhubarb in the hole - spent the summer experimenting with rhubarb, and this invention was one of the best things I tried. The ginger jam goes really well with the tangy rhubarb.

  • Rhubarb snacking cake - in this one, the magic is the way the fruity lemon bottom layer goes with the tangy rhubarb top layer.

So there you go. Try one of them. And let me know @nomnomdan how you get on...

Syndicated 2013-01-12 07:27:02 from Dan Stowell

OpenStreetMap: where should the next recruitment drive be?

I watched the fancy OpenStreetMap Year of Edits 2012 video, which shows a data-driven animation of all the map edits happening around the world from thousands of contributors. It certainly makes the project look busy!

BUT it's not the kind of data-viz that particularly wants you to understand the data. If you watch the video, can you tell which was the busiest part of the world? Which bit was least busy -- where should OSM's next recruitment drive be?

So here's what I wondered: can we visualise the density of map edits for a place, relative to the population of the place? You see, if we assume that the population density of one part of the world should be roughly proportional to the number of things-that-should-be-mapped in that part of the world, then a low value of this ratio (edit rate divided by population density) indicates a place that needs more mapping.

So how to do it? I downloaded the OSM changesets from http://planet.osm.org/ and piled up all the bounding boxes from the 2012 changesets, converting that into a grid giving the edit density. Then I was lucky enough to find this gridded world population density data download from Columbia University.
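
As a rough indication of the gridding step, here's a sketch (not the actual script; it assumes the changeset bounding boxes have already been parsed into (min_lon, min_lat, max_lon, max_lat) tuples):

    import numpy as np

    def edit_density(bboxes):
        """Accumulate changeset bounding boxes onto a one-degree world grid."""
        grid = np.zeros((180, 360))                 # rows are latitude, columns are longitude
        for min_lon, min_lat, max_lon, max_lat in bboxes:
            x0, x1 = int(min_lon + 180), int(max_lon + 180)
            y0, y1 = int(min_lat + 90), int(max_lat + 90)
            grid[y0:y1 + 1, x0:x1 + 1] += 1         # count the changeset in every cell it touches
        return grid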

Then I wrote a Python script to divide one by the other and plot the result. Here it is:

Blue areas have a relatively high number of edits per head of population, red have relatively low. White is average.

(BTW, here's the plot of the edit density, before taking the ratio.)
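
The divide-and-plot step itself is short - roughly this (a sketch, assuming the edit grid and the population grid have already been resampled onto the same cells):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_ratio(edit_grid, pop_grid):
        mask = (pop_grid > 0) & (edit_grid > 0)
        logratio = np.full(edit_grid.shape, np.nan)
        logratio[mask] = np.log10(edit_grid[mask] / pop_grid[mask])  # log scale spreads the colours usefully
        mid = np.nanmedian(logratio)                 # centre the colour map on the "average" cell
        plt.imshow(logratio, origin='lower', cmap='bwr_r',           # blue = high ratio, red = low
                   vmin=mid - 2, vmax=mid + 2)
        plt.colorbar(label='log10(edits per head), relative to median')
        plt.show()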

The whole analysis is only a rough estimate, since it relies on some assumptions (e.g. that every "changeset" was an equally important edit; also the map-features-per-population assumption I already mentioned). But the general story is: we need more mappers in South-East Asia (especially China) and Africa, please!

The plot clearly shows a general pattern connected with relative wealth / access to tech, so maybe initiatives like operation cowboy are the way to do it - get places mapped on behalf of others.

Syndicated 2013-01-08 17:17:55 (Updated 2013-01-08 17:31:04) from Dan Stowell

Comment on 'High heels as supernormal stimuli: How wearing high heels affects judgements of female attractiveness'

There's a research paper just out which has gained itself some press: "High heels as supernormal stimuli: How wearing high heels affects judgements of female attractiveness". It's described in the popular press as "proving" that high heels make women attractive, and that's fair enough, but it's obviously not very surprising news given that high heels are widely known in current Western society to have that association. The research paper is slightly more specific than that: it finds that whatever "information" is transmitted to the viewer by high heels is transmitted even when we can see nothing but a handful of moving dots, hiding everything about the viewee except their gait.

That's interesting. But unfortunately, the authors go on to take one further step, which strikes me as a step too far - namely, they infer that this reflects some evolutionary explanation for the popularity of high heels. The word "supernormal" in the title refers to the idea that high heels might cause women to walk in a way which exaggerates female aspects of gait, i.e. makes them walk even more unlike males than otherwise. There is indeed evidence for this in their paper. But the authors explicitly test for whether the "female" aspects of gait correlate with attractiveness judgments, and they find insignificant or barely significant correlations.

(Technical note: two of the correlations attain p<0.05, but they didn't control for multiple comparisons, so the true significance is probably lower. And the correlations I'm talking about now are in their Table 2, which is looking at differences within the high-heel category and within the flat-shoe category. The main effect demonstrated by the authors is indeed significant: viewers rated the high-heel videos as more attractive.)
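
To make the multiple-comparisons point concrete (with made-up numbers, not figures from the paper): a Bonferroni correction over k tests shrinks the per-test threshold to alpha/k, so a result just under 0.05 stops counting quite quickly.

    alpha, k = 0.05, 8         # k = number of correlations tested (assumed for illustration)
    threshold = alpha / k      # Bonferroni-corrected per-test threshold: 0.00625
    print(0.04 < threshold)    # False - a result at p = 0.04 no longer clears the bar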

So what does this suggest? To me it seems they've demonstrated that
(a) high heels affect gait (as you can tell on most Friday nights in town), and
(b) people recognise the change in gait as being associated with attractiveness and femininity.
But this second finding can just as easily be explained by cultural learning as by something evolutionary, despite the fact that the paper was published in "Evolution and Human Behavior".

In fact, (b) could conceivably be caused by a conjunction of:
(b1) people recognise the change as being caused by high heels (whether consciously or not); and
(b2) people recognise that high heels are associated with attractiveness and femininity.
(This b1-and-b2 scenario is also a potential explanation for their second set of findings, in which the gaits of high-heeled walkers are less often mistaken for men.)

All of which means that I don't think these experiments manage to discern any difference between effects caused by evolved factors and effects caused by cultural learning. Given that, the obvious way to test that difference would be to show the dot videos to viewers who grew up in a non-Western society which doesn't have a tradition of high heels. (Not a convenient test to do - but I'd definitely be interested in the results!)

Here's one quote from their results, about a minor aspect, whether male or female onlookers have different opinions:

"note that there was no shoetype-gender interaction, showing that both males and females judged high heels to be more attractive than flat shoes. [...] furthermore, there were high correlations between male and female attractiveness ratings of the walkers in both the flat and heels condition demonstrating that males and females agreed which were the attractive and unattractive walkers."

So, in this study, the male and female onlookers showed the same pattern of response to the presence of high heels. Does this perhaps hint that the difference might be learned, rather than from some presumed phwoar-factor inbuilt in men?

This study is an example of what I see as a frustrating tendency for people in biological disciplines to do interesting quantitative studies, but then to plunge into the discussion section and make unwarranted generalisations about the evolutionary reasons for something's existence. As well as invoking evolution, in this case they also discuss women's motivation for how they dress:

"Therefore we suggest that one, conscious or unconscious, motivation for women to wear high heels is to increase their attractiveness."

Firstly, this study explicitly does not explore women's motivations, in any sense. It only studies judgments made by outside observers. Secondly, as the authors have already acknowledged,

"High heels have become a part of the uniform of female attire in a number of different contexts and as such are part of a much more complex set of display rules."

I don't dispute that attractiveness might be a more important motivation for some than other motivations (fashion, identity, confidence, social norms, availability, symbolism), but let's not imply that this hunch is an empirical finding, please. The association of high heels with attractiveness is already a common trope, so the idea that women might be motivated to buy into that trope is perfectly plausible, but this study throws no light on it.

Still, as I said, the main finding is interesting: the differences in gait induced by high heels, and the rating of such gaits as attractive, are demonstrated to be easily perceivable even in a display reduced to a handful of green dots.

Syndicated 2013-01-05 11:45:32 (Updated 2013-01-05 11:55:23) from Dan Stowell

Amazon Kindle Fire HD limitations

Over Christmas I helped someone set up their brand new Kindle Fire HD. I hadn't realised quite how coercive Amazon have been: they're using Android as the basis for the system (for which there is a whole world of handy free stuff), but they've thrown various obstacles in your way if you want to do anything that doesn't involve buying stuff from Amazon.

Now, many of these obstacles can be circumvented if you are willing to do moderately techy things such as side-loading apps, but for the non-techy user those options simply won't appear to exist, and I'm sure Amazon uses this to railroad many users into just buying more stuff. It's rude to be so obstructive to their customers who have paid good money for the device.

The main symptoms of this attitude which I encountered:

  • You need to set up an Amazon one-click credit-card connection even before you can download FREE apps. It's not enough to have an Amazon account connected; you also need the one-click credit card thing.

  • One of the most vital resources for ebook readers is Project Gutenberg, the free library of out-of-copyright books - but Amazon don't want you to go there. There's no easy way to read Project Gutenberg stuff on Kindle Fire. (Instructions here.) They will happily sell you their version of a book that you could easily get for zero money, of course.

  • You can't get Google Maps. This is just one result of the more general lockdown whereby Amazon doesn't want you to access the big wide Google Play world of apps, but it's a glaring absence since the Fire has no maps app installed. We installed Skobbler's ForeverMap 2 app, a nice alternative which can calculate routes for walking and for driving. In my opinion, the app has too many text boxes ("Shall I type the postcode in here?" "No, that's the box to type a city name") and the search could do with being streamlined. Other than that it seems pretty good.

So, unlike most tablet devices out there, if you have a Kindle Fire it's not straightforward to get free apps, free ebooks, or Google tools. This is disappointing, since the original black-and-white Kindle was such a nicely thought-through object, an innovative product, but now the Kindle Fire is just an Android tablet with things taken away. That seems to be why the Project Gutenberg webmaster recommends "don't get a Kindle Fire, get a Nexus 7".

There are good things about the device, though. It has a nice bright screen, good for viewing photos (though the photo viewer app has a couple of odd limitations: it doesn't rotate to landscape when you rotate the device, which seems a very odd and kinda obvious omission since almost everything else rotates; and it doesn't make it obvious whether you've reached the end of a set, so you end up swiping a few times before you're sure you've finished). There's a good responsive pinch zoom on photos, maps etc. And the home screen has a lovely and useful skeuomorph: the main feature is a "pile" of recently-used things, a scrollable pile of books and apps. A great way to visually skim what you recently did and how to jump back to it - biased towards books and Amazon things, but still, a nice touch. Shame about the overall coercive attitude.

Syndicated 2012-12-30 10:20:12 from Dan Stowell

Pub statistics for UK and Eire

I just extracted all the pubs in UK & Eire from OpenStreetMap. (Tech tip: XAPI URL builder makes it easy.)

There are 32,822 pubs listed. (As someone pointed out, that's 38.4% of all the pubs in OSM. So the UK is doing well - but come on, rest of the world, get mapping yr pubs ;)

A handful of quick statistics from the data I extracted (a rough sketch of the counting code follows the list):

  • The real_ale tag indicates 1080 real ale pubs (the tag is blank for 31678 of them, "no" for 64 of them). That's 3%, probably much less than the true number.
  • The toilets tag indicates 1211 have toilets available - again about 3%, whereas I bet most of them do really!
  • The food tag shows food available at 1119 of them (31686 blank, 17 "no"). Again about 3%, gotta be more than this.
  • The wifi tag shows wifi available at 274 of them (32450 blank, 98 "no"). I've no idea how common wifi is in pubs these days.
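
For anyone curious how such counts come out of the data, here's a rough sketch (not the exact script I used) that tallies one tag from the extracted XML, assuming the XAPI result was saved as pubs.osm:

    import xml.etree.ElementTree as ET
    from collections import Counter

    root = ET.parse('pubs.osm').getroot()
    counts = Counter()
    for element in root:                    # nodes, ways etc. returned by the query
        tags = {t.get('k'): t.get('v') for t in element.findall('tag')}
        if tags.get('amenity') == 'pub':
            counts[tags.get('real_ale', '(blank)')] += 1

    print(counts)   # e.g. how many pubs are tagged real_ale=yes, real_ale=no, or not tagged at all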

Syndicated 2012-12-20 07:26:06 (Updated 2012-12-20 07:36:04) from Dan Stowell

How to remove big old files from git history

I've been storing a lot of my files in a private git repository, for a long time now. Back when I started my PhD, I threw all kinds of things into it, including PDFs of handy slides, TIFF images I generated from data, journal-article PDFs... ugh. Mainly a lot of big bloaty files that I didn't really need to be long-term-archived (because I already had archived the nice small files that generated them - scripts, data tables, tex files).

So now, many years on, I know FOR SURE that I don't need any trace of those darn PDFs in my archive, and I want to delete them from the git history. Not just delete them from the current version - that's easy ("git rm") - but delete them from history, so that my git repository can be nice and compact and easy to take around with me.

NOTE: Deleting things from the history is a very tricky operation! ALL of your commit IDs get changed, and if you're sharing the repos with anyone you're quite likely to muck them up. Don't do it casually!

But how can you search your git history for big files, inspect them, and then choose whether to dump them or not? There's a stackoverflow question about this exact issue, and I used a script from one of the answers, but to be honest it didn't get me very far. It was able to give me the names of many big files, but when I constructed a simple "git-filter-branch" command based on those filenames, it chugged through, rewriting history, and then failed to give me any helpful size difference. It's quite possible that it failed because of things like files moving location over time, and therefore not getting 100% deleted from the history.

Luckily, Roberto is a better git magician than I am, and he happened to be thinking about a similar issue. Through his git-and-shell skills I got my repository down to 60% of its previous size, and cleared out all those annoying PDFs. Roberto's tips came in some github gists and tweets - so I'm going to copy-and-link them here for posterity...

  1. Make a backup of your repository somewhere.

  2. Create a ramdisk on which to do the rewriting - this makes it go MUCH faster, since rewriting can be a slow process. (For me it reduced two days to two hours.)

    mkdir repo-in-ram
    sudo mount -t tmpfs -o size=2048M tmpfs repo-in-ram
    cp -r myrepo.git repo-in-ram/
    cd repo-in-ram/myrepo.git/
    
  3. This command gets a list of all blobs ever contained in your repo, along with their associated filename (quite slow), so that we can check the filenames later:

    git verify-pack -v .git/objects/pack/*.idx | grep tree | cut -c1-40 | xargs -n1 -iX sh -c "git ls-tree X | cut -c8- | grep ^blob | cut -c6-" | sort | uniq > blobfilenames.txt
    
  4. This command gets the top 500 biggest blobs in your repo, ordered by size they occupy when compressed in the packfile:

    git verify-pack -v .git/objects/pack/*.idx | grep blob | cut -c1-40,48- | cut -d' ' -f1,3 | sort -n -r --key 2 | head -500 > top-500-biggest-blobs.txt
    
  5. Go through that "top-500-biggest-blobs.txt" file and inspect the filenames. Are there any you want to keep? If so DELETE the line - this file is going to be used as a list of things that will get deleted. What I actually did was use LibreOffice Calc to cross-tabulate the filenames against the blob IDs.

  6. Create this file somewhere, with a name "replace-with-sha.sh", and make it executable:

    #!/usr/bin/env sh
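    # Usage: replace-with-sha.sh <blob-list-file> <commit-id>
    # As I understand the gist: for each blob id listed in $1, find the path(s) it
    # occupies in commit $2, leave a small '<path>.REMOVED.sha' placeholder holding
    # that blob id, and delete the original file.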
    TREEDATA=$(git ls-tree -r $2 | grep ^.......blob | cut -c13-)
    while IFS= read -r line ; do
        echo "$TREEDATA" | grep ^$line | cut -c42- | xargs -n1 -iX sh -c "echo $line > 'X.REMOVED.sha' && rm 'X'" &
    done < $1
    wait
    
  7. Now we're ready for the big filter. This will invoke git-filter-branch, using the above script to trim down every single commit in your repos:

    git filter-branch --tree-filter '/home/dan/stuff/replace-with-sha.sh /home/dan/stuff/top-500-biggest-blobs.txt $GIT_COMMIT' -- --all
    
  8. (Two hours later...) Did it work? Check that nothing is crazy screwed up.

  9. Git probably has some of the old data hanging around from before the rewrite. Not sure if all three of these lines are needed but certainly the last one:

    rm -rf .git/refs/original/
    git reflog expire --expire=now --all
    git gc --prune=now
    

After I'd run that last "gc" line, that was the point at which I could tell I'd successfully got the disk space right down.

If everything's OK at this point, you can copy the repos from the ramdisk back to replace the repos in its official location.

Now, when you next pull or push that repository, please be careful. You might need to rebase your latest work on top of the rewritten history.

Syndicated 2012-12-05 05:25:35 (Updated 2012-12-05 05:39:09) from Dan Stowell

