nbm is currently certified at Journeyer level.

Name: Neil Blakey-Milner
Member since: 2000-04-05 07:42:45
Last Login: 2008-08-26 12:56:44


Homepage: http://techgeneral.org/

Notes:

Just another South African open source developer

Projects

Recent blog entries by nbm

Syndication: RSS 2.0

In San Francisco in October

Visa-willing, I'll be in San Francisco for about three weeks from early October.  The SynthaSite Cape Town office is heading over to the San Francisco office for a mix of team training, team building, end-of-year partying, and planning sessions.

My last trip to San Francisco in May/June included Google I/O and a Pylons/TG2/WSGI sprint, and I really enjoyed being in the company of geeks.  This time around, it doesn't seem like there are any good conferences to squeeze in or stay around for and so far my only plans are to attend the Bay Area Python Interest Group with Jonathan.

Are there any interesting tech events happening in October in or around San Francisco I should try to attend?

Syndicated 2008-09-22 11:58:24 from Neil Blakey-Milner

Further adventures in Sitemaps

Sitemap by Brian Talbot, CC BY NC

While the two Sitemap formats are straightforward, deciding on the data to put into the templates is not always altogether obvious.

There are three main types of metadata about sitemaps and URLs:

  • Last modification time
  • Change frequency
  • Priority

Last modified time

squared circles - Clocks by Leo Reynolds, CC BY NC SA

Last modified time of sitemaps

Setting the last modified time on a sitemap allows consumers of the sitemap index to not download the referenced sitemap again if they've already got an up-to-date sitemap.  Getting this wrong (say, by always giving the same last modified time) may mean consumers of your sitemap index will try the referenced sitemaps less often than they should.

The last modified time for a sitemap for a web log will probably be the most recent last modified time of the posts.  Depending on whether the comments constitute valuable content, the last modified time of comments on the posts may be useful too.
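
As a rough sketch, the sitemap-level last modified time might be computed like this in Python; the posts list and the modified, comments, and created attributes are hypothetical stand-ins for whatever your blog engine exposes, not Gibe's actual API:

from datetime import datetime

def sitemap_lastmod(posts, include_comments=True):
    # Most recent modification time across the posts listed in a sitemap.
    # Optionally also consider the newest comment on each post.
    times = []
    for post in posts:
        times.append(post.modified)
        if include_comments and post.comments:
            times.append(max(c.created for c in post.comments))
    return max(times) if times else datetime.utcnow()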

Last modified time of URLs

As with sitemaps in sitemap indices, the last modification time for URLs listed in a sitemap is pretty easy — the last time that particular URL's content changed.  For a CMS page or web log post, it would usually be the time of the last edit.  For a post, the time of the last comment may also be relevant.

Complications with last modified

Things get a bit murky if you change your web site's style though — the HTML output has changed, but the most relevant content hasn't.  If your style change majorly affects the navigation potential or relevance of content, it may be worthwhile updating the last modification time.

Things are also complicated on pages that aggregate content from elsewhere.  For example, page two of the archives for March 2008 on a web log.  The "correct" answer to that is probably the last updated time of any posts originally posted in March 2008.  But if you change from having full-content to summary content per post, or remove any content per post, or add tags to your content, or otherwise change navigation or content relevance, then you might want to update the last modified time for all archives pages to when you made the style change.
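
One way to express that trade-off in code (again, the attribute names are hypothetical, and the style-change handling is only a sketch):

def archive_page_lastmod(archive_posts, last_style_change=None):
    # Newest change among the posts the archive page lists.
    newest = max(post.modified for post in archive_posts)
    # If a later style/template change altered the rendered page enough to
    # matter, report that time instead.
    if last_style_change is not None and last_style_change > newest:
        return last_style_change
    return newest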

Change frequency

Toronto subway frequency by Elijah van der Giessen, CC BY NC

Change frequency is (currently) unique to URLs in a sitemap.  It's an opportunity to tell consumers of your sitemap how often you think the content at that URL changes.  Valid values are:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

It isn't yet obvious how seriously search engines (for example) take these values.  I imagine that if you say that all your URLs change hourly, you probably won't see any change in their behaviour.  However, it can help reduce the amount of spider traffic that older pages get, and, if consumers trust you, may get some of your pages checked for changes more often.

Determining change frequency of URLs

The change frequency of a front page will probably be hourly.  Similarly, an archives page for the current day, month, year, or all time would be hourly.  The change frequency for an archives page for previous days, months, or years could potentially be considered "never" or "yearly", but you can always set it to "monthly" if you're worried about such long periods of time.  (The sitemap consumer will watch the last modified time of the entry in your sitemap anyway, and will probably try to visit that content more often than that just in case.)

The change frequency for a post on a web log or a news article depends on a few things.  For example, if you use "related posts" or "related stories", you may not want to use values such as "never" or "yearly" even for posts from years back.  If you allow comments, you may similarly want to avoid those values.

The most important indicator of likely change frequency in standard cases is probably how long it has been since a particular page has changed.  In GibeSitemap, I use a relatively naive algorithm:

  • If the content has changed in the last three days, the change frequency is hourly.
  • If changed in the last 15 days, daily.
  • If changed in the last 45 days, weekly.
  • Otherwise, monthly.
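
In Python, those cutoffs boil down to something like this (a sketch of the scheme above, not the actual GibeSitemap code):

from datetime import datetime, timedelta

def change_frequency(last_changed, now=None):
    # Naive change frequency based on how long ago the content last changed.
    now = now or datetime.utcnow()
    age = now - last_changed
    if age <= timedelta(days=3):
        return "hourly"
    if age <= timedelta(days=15):
        return "daily"
    if age <= timedelta(days=45):
        return "weekly"
    return "monthly"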

Priority

Changed priorities ahead by Peter Reed, CC BY NC SA

The priority of a page signals how valuable and relevant the content on that URL is likely to be to the consumer, relative to other pages on your web site.  Priority can run from 0.0 (low) to 1.0 (high).  Your front page is likely to have a very high priority (say, 1.0).  A web log "About" page is probably one of the highest priority pages (say, 0.9).

Determining priority of URLs

For a CMS with a hierarchical path structure, you can use a simple algorithm to determine priority — the fewer folders between the site root and the page, the more important it likely is.  For the Gibe Pages plugin, pages at the top level are given 0.9, losing 0.1 for each folder down to a minimum of 0.6.  So:

  • /about : 0.9
  • /about/team : 0.8
  • /about/team/neil : 0.7
  • /about/team/neil/interests : 0.6
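
That rule is easy to express in code; here is a sketch (the Gibe Pages plugin's actual implementation may differ):

def page_priority(path):
    # Priority by folder depth: 0.9 at the top level, minus 0.1 per extra
    # folder, with a floor of 0.6.  page_priority("/about/team/neil") == 0.7.
    depth = len([segment for segment in path.strip("/").split("/") if segment])
    return round(max(0.9 - 0.1 * (depth - 1), 0.6), 1)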

Web log or news archive pages should have a fairly low priority, since the content on them is better reached through the individual posts.  A value of 0.1 is appropriate.

For web log posts or news articles, priority depends on a number of factors.  For example, you may want to give existing popular posts or articles a high priority, so that people are more likely to find them when searching.  You may want to give posts with a particular tag or articles in a particular section a higher or lower priority.

For the basic case, though, you can probably just use the publishing date or last modification time to help determine the priority.  More recent posts and news are probably more relevant (on your site) than older ones.  You might want to use a simple algorithm like the one I used on Gibe:

  • If the publish date is within the last 15 days, priority of 0.9
  • last month, 0.8
  • last three months, 0.7
  • last half-year, 0.6
  • last year, 0.5
  • last two years, 0.4
  • older, 0.3
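
Here is a sketch of those bands in Python (the day counts for "last month", "last three months", and so on are approximations, not necessarily the exact boundaries Gibe uses):

from datetime import datetime, timedelta

def post_priority(published, now=None):
    # Priority by age of the post: newer posts get higher priority.
    now = now or datetime.utcnow()
    age = now - published
    bands = [
        (timedelta(days=15), 0.9),
        (timedelta(days=31), 0.8),   # roughly "last month"
        (timedelta(days=92), 0.7),   # roughly "last three months"
        (timedelta(days=183), 0.6),  # roughly "last half-year"
        (timedelta(days=365), 0.5),
        (timedelta(days=730), 0.4),
    ]
    for cutoff, priority in bands:
        if age <= cutoff:
            return priority
    return 0.3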

Syndicated 2008-09-15 08:47:02 from Neil Blakey-Milner

Early adventures with Sitemaps

Perhaps entirely randomly, I decided that TechGeneral would need Sitemaps before I put it live.

A Sitemap (sometimes called a Google Sitemap, although you won't see Google calling it that, and it is a standard that Yahoo!, Ask, and Live all support) is an XML file (or a set of XML files) that describes the various resources on your web site, allowing search engines and other programs to discover them more easily.

There are a few advantages to putting together a Sitemap.  Generally, search engines give up after they travel a few links into a web site, to avoid infinite automatically generated links (not necessarily because of malicious intent, but because of weird programming).  With a Sitemap, each listed resource can potentially be treated as a first visit.  Also, if a site has navigation that search engines can't traverse to get to certain pages, Sitemaps can help search engines find those resources.

They also optionally assign a priority to each resource as a way to influence the importance assigned to the resource relative to other resources on your web site.  Similarly, an optional update frequency per resource can influence how often a search engine or other program should check back for new versions of that resource.  Last modified dates also optionally help to determine whether to try to revisit a resource earlier or later than would normally happen.

Example Sitemap File

<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
 
    <url>
        <loc>http://techgeneral.org/diary</loc>
        <lastmod>2008-08-16T22:52:41+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.9</priority>
    </url>
 
    <url>
        <loc>http://techgeneral.org/speaking</loc>
        <lastmod>2008-08-16T22:52:13+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.9</priority>
    </url>
 
    <url>
        <loc>http://techgeneral.org/contact</loc>
        <lastmod>2008-08-10T16:59:32+00:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.9</priority>
    </url>
 
    <url>
        <loc>http://techgeneral.org/about</loc>
        <lastmod>2008-08-10T12:42:06+00:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.9</priority>
    </url>
</urlset>

There are two types of Sitemaps: individual Sitemap files and Sitemap Index files.  Why would you want a Sitemap Index?  One reason, less relevant to many, is that an individual Sitemap file can only contain 50 000 URLs (which, admittedly, the average blog isn't going to hit) and must be less than 10MB uncompressed.  Another reason is that you might be using multiple systems that each generate Sitemap files (or that you've hacked to do so) but you don't want to merge them yourself.

Example Sitemap Index

<sitemapindex
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
 
    <sitemap>
        <loc>http://techgeneral.org/sitemap_posts.xml</loc>
    </sitemap>
 
    <sitemap>
        <loc>http://techgeneral.org/sitemap_archives.xml</loc>
    </sitemap>
 
    <sitemap>
        <loc>http://techgeneral.org/sitemap_pages.xml</loc>
    </sitemap>
</sitemapindex>
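
If you do end up generating the index programmatically rather than merging by hand, the Python standard library is enough.  A minimal sketch follows (not how TechGeneral actually builds its files); the sitemap filenames are just the examples from above:

from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(sitemap_urls):
    # One <sitemap><loc>...</loc></sitemap> entry per referenced sitemap file.
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        loc = ET.SubElement(ET.SubElement(root, "sitemap"), "loc")
        loc.text = url
    return ET.tostring(root, encoding="utf-8")

index_xml = build_sitemap_index([
    "http://techgeneral.org/sitemap_posts.xml",
    "http://techgeneral.org/sitemap_archives.xml",
    "http://techgeneral.org/sitemap_pages.xml",
])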

One useful side-effect of using a Sitemap with Google's webmaster tools is that you can see errors that occur on resources listed in the Sitemap.  So, if a request for a resource starts returning 404 or 500 errors, you can separate that more specific set of errors from those caused by broken links on your site or on other sites.

However, Google's webmaster tools doesn't seem to like having a whole bunch of separate Sitemap files with a central Sitemap Index.  I mean, it seems to work, but it complains (warnings, not errors) that many of the Sitemaps (all on this site, most on my personal web site) have only entries with the same priority.  I'm setting the priority of all the archives low (they have "noindex, follow" set anyway, so won't show up in search results), the frontpage high, and the posts get priorities based on age.

I get the feeling that the priorities only apply within the same file, and not across the whole site.  This makes some sense, since you can delegate a sitemap for a particular folder on your web site, and you wouldn't want an overeager person assigning "1.0" to all content within that folder, overriding your beautifully crafted values for the base site.  However, in this case, they're all at the same level, and I really want the archives lower than the posts, and the frontpage higher than most of the posts.

Oh well, I'll push on and see whether it's just a matter of warnings that aren't affecting things (my favourite kind) or an indication of things being as I suspect.

Syndicated 2008-08-26 16:50:44 from Neil Blakey-Milner

Wordpress.com scalability at WordCamp SA 2008

At WordCamp South Africa 2008, held in Cape Town yesterday, we were given a brief overview of how Wordpress.com is set up to scale.

Matt Mullenweg set the scene with some idea of just how huge Wordpress.com is.  I may mess up a few numbers mentioned, but there've been something like 6.5 billion page views on Wordpress.com since the beginning of the year, there are 3.8 million Wordpress.com hosted blogs (only Blogger is bigger), and there are 1.4 billion words in posts created on Wordpress.com.

Warwick Poole then gave us some more in-depth numbers, although, judging from the audience's reaction, pointing out that Wordpress.com is bigger than AdultFriendFinder was a pretty good and well-understood indication on its own.  In May 2008, Wordpress.com served 693 million page views, rising to 812 million page views in July.  Over 1TB of media was uploaded in May, 1.3TB in July.  In May, 417TB of traffic left the Wordpress.com data centres.  These numbers are available in the "July wrap-up" post on the Wordpress.com web log.

Apparently, across the approximately 710 servers, 10 000 web requests and 10 000 database requests are handled per second (I wasn't intelligent enough to write down whether this was the average).  110 requests per second are made to Amazon's S3 storage service, while 3TB of media is cached on their own media caches.  They output 1.5TB/s (I wrote TB, so it probably is TB and not Tb; I'm guessing this is peak).  They experience approximately 5 server failures a week.

How is it put together?  They use round-robin DNS to determine the data centre (from testing, it seems they round-robin six IPs - two for each of three data centres).  From there, requests hit a load balancer using some combination of nginx, wackamole, and spread.  They use Varnish for serving at least media, and currently use Litespeed web servers.  They also use MySQL and memcached.

They use (and developed) the batcache Wordpress plugin to serve content from memcached - according to the documentation, batcache only potentially serves stale content to first-time visitors; visitors who have interacted with the web log receive up-to-date content.

When new media is uploaded, its existence and initial location is stored in a table.  As necessary, the other data centres will create their own local copies of that media, and update that table.  The backup media stores in the data centres are write-only - apparently nothing is ever deleted from them.

That's about all I wrote down, but there's quite a bit of information about how Wordpress.com is set up and the sort of load/traffic it handles on the Wordpress.com blog and on the blogs of various employees (such as this post on nginx replacing Pound, this one on Pound, and another on Varnish), which will probably inform some technology choices we might make at SynthaSite.

Syndicated 2008-08-24 17:13:34 from Neil Blakey-Milner

Subversion (SVN) shortcuts to revert previous commits

Good version control system usage prevents many disasters, but that doesn't necessarily mean you won't make your own mistakes.  Today, I mistakenly included a file in a commit that I didn't want to commit yet.  I learned two new tricks while spending a few minutes puzzling out the best way to get back to where I was before with that file.

First, make a mistake:

$ svn commit -m "..."
Sending dev.cfg
Sending gibe/plugin.py
Transmitting file data ..
Committed revision 114.

svn merge is the tool to use for this:

merge: Apply the differences between two sources to a working copy path.
usage: 1. merge sourceURL1[@N] sourceURL2[@M] [WCPATH]
       2. merge sourceWCPATH1@N sourceWCPATH2@M [WCPATH]
       3. merge [-c M | -r N:M] SOURCE[@REV] [WCPATH]

Trick #1: use svn merge's 3rd usage pattern with -c set to the negative of the revision you've committed, and (here comes the trick) use . (the current directory) as the source of the merge:

$ svn merge -c -114 .
U gibe/plugin.py
U dev.cfg

With that, your working copy is now where the repository was before your commit.  Commit that to the repository, and the repository is back where it was before your commit.

Now your working copy is where it was before you made any changes - but you probably want those changes back.  Easy enough:

$ svn merge -c 114 .
U gibe/plugin.py
U dev.cfg

Now your working copy is back where it was before you did the mistaken commit.

Trick #2: Of course, if your mistake is like mine and you only messed up one file and everything else is as it should be, you can just do this on one file, by using svn merge's 2nd usage pattern:

$ svn merge dev.cfg@114 dev.cfg@113
U dev.cfg

Commit that, and your repository is back to normal.  Then run:

$ svn merge dev.cfg@113 dev.cfg@114
U dev.cfg

Now the file is back where it was before your botch.

Syndicated 2008-08-22 15:01:03 from Neil Blakey-Milner

117 older entries...

 

nbm certified others as follows:

  • nbm certified grog as Master
  • nbm certified eivind as Master
  • nbm certified dcs as Journeyer
  • nbm certified nik as Master
  • nbm certified billf as Journeyer
  • nbm certified phk as Master
  • nbm certified green as Journeyer
  • nbm certified jedgar as Journeyer
  • nbm certified msmith as Master
  • nbm certified kkenn as Journeyer
  • nbm certified gsutter as Journeyer
  • nbm certified quiet1 as Apprentice
  • nbm certified dwhite as Journeyer
  • nbm certified peter as Master
  • nbm certified bp as Journeyer
  • nbm certified mjt as Apprentice
  • nbm certified ljb as Apprentice
  • nbm certified jkh as Master
  • nbm certified washort as Journeyer
  • nbm certified itamar as Journeyer
  • nbm certified glyph as Master

Others have certified nbm as follows:

  • cmc certified nbm as Journeyer
  • anders certified nbm as Journeyer
  • benno certified nbm as Journeyer
  • gsutter certified nbm as Journeyer
  • eivind certified nbm as Journeyer
  • asmodai certified nbm as Journeyer
  • jedgar certified nbm as Journeyer
  • winter certified nbm as Journeyer
  • phk certified nbm as Journeyer
  • rwatson certified nbm as Journeyer
  • will certified nbm as Journeyer
  • ljb certified nbm as Journeyer
  • peter certified nbm as Journeyer
  • jhb certified nbm as Journeyer
  • mjt certified nbm as Journeyer
  • billf certified nbm as Journeyer
  • bp certified nbm as Journeyer
  • green certified nbm as Journeyer
  • locust certified nbm as Journeyer
  • dcs certified nbm as Journeyer
  • bmilekic certified nbm as Journeyer
  • doxxx certified nbm as Journeyer
  • ian certified nbm as Journeyer
  • voltron certified nbm as Journeyer
  • AilleCat certified nbm as Journeyer
  • kappa certified nbm as Journeyer
  • Denny certified nbm as Journeyer
  • mwest certified nbm as Journeyer


