18 Jan 2010 asmodai   » (Journeyer)

Clustering and relevant algorithms

Disclaimer: I’m mainland European, we tend to use the , to separate digits from the whole numbers.

Clustering is quite a common approach to aggregate coordinates that are relatively close together. The problem lies in the choice of algorithm to use. This choice is highly dependent on the space in which the coordinates are laid out. Quite often you can just use basic Euclidean distance which, for a 2-dimensional space, simply takes the square root of the sum of the squared subtraction of the respective coordinates of each point. So if you have a point p with coordinates (33, 52) and a point q with coordinates (82, 19), the distance between p and q would be:

>>> import math
>>> math.sqrt(pow(33 - 82, 2) + pow(52 - 19, 2))
59.076221950967721

And based on that distance you can start to cluster points together that are all roughly the same distance from a certain point, say 59,1. The fun part of this is that this distance is the radius of a circle. So if you would plot every possible coordinate at that distance you will see a circle emerge.

In looking at clustering algorithms I also encountered something called Manhattan distance, but this algorithm only makes sense if you are working in a grid with roughly equidistant lengths to the other coordinates in this space. Normally the shortest distance from A to B would be a straight line, as the Euclidean distance shows. However, if the movement from coordinate to coordinate is restricted to straight lines, say the grid layout of a lot of North American cities, then Euclidean distance cannot apply. This is the same problem a taxi faces when trying to find the shortest distance to drive from A to B and as such the algorithm is also known as the taxicab distance or geometry. It takes the sum of the absolute value of the subtraction of the respective coordinates of each point. So if you take point p and q again, the distance would in this case be:

>>> abs(33 - 82) + abs(52 - 19)
82

Now, if you would plot all possible coordinates with that distance you will see a circle emerge again. However, keep in mind that a circle is nothing more than a set of points with a fixed distance (the radius). In this case our geometry uses a differently defined distance. If you would plot this out with a finer and finer grid the circle shape that emerges is a square rotated 45° so that it rests on its point.

Syndicated 2010-01-18 11:35:38 from In Nomine - The Lotus Land

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!