26 May 2008 rcaden   » (Journeyer)

Following Web Page Redirects with Java

CNET moved a bunch of its blogs to a different domain this weekend, including Beyond Binary, Coop's Corner, Geek Gestalt, One More Thing, Outside the Lines and The Social. I mention this because the change hosed Meme13, which treated all six as if they were newly discovered sites.

One of my ground rules for developing Meme13 is that I won't hand-edit the site to make it smarter. I need the application to recognize when existing sites in its database have moved.

Meme13 monitors sites using a Java application I wrote that downloads web pages with the Apache HttpClient 3.0 class library. Web servers indicate that a page has moved by sending an HTTP redirect response of either "301 Moved Permanently," which indicates a permanent move, or "302 Found," which is intended for temporary changes. I wrote a Java method that can find the current location of a web page, even if it has been redirected one or more times:

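import java.io.IOException;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.HeadMethod;

public String checkFeedUrl(String feedUrl) {
    String response = feedUrl;
    HttpClient client = new HttpClient();
    HttpMethod method = new HeadMethod(feedUrl);
    // handle redirects manually so each hop can be inspected
    method.setFollowRedirects(false);
    try {
        // request feed headers only
        int statusCode = client.executeMethod(method);
        if ((statusCode == 301) || (statusCode == 302)) {
            // feed has moved
            Header location = method.getResponseHeader("Location");
            // a redirect without a usable Location header is treated as no move
            if ((location != null) && !location.getValue().equals("")) {
                // recursively check URL until it's not redirected any more
                response = checkFeedUrl(location.getValue());
            }
        }
    } catch (IOException ioe) {
        // on a network error, fall back to the URL that was passed in
        response = feedUrl;
    } finally {
        // hand the connection back to the connection manager
        method.releaseConnection();
    }
    return response;
}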

The HeadMethod class requests a web page's headers instead of requesting the entire page, consuming far less bandwidth as it checks for redirects. My Java method looks for both kinds of redirects, because web publishers have a bad habit of using "302 Found" when they've moved a page permanently.
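The recursive method above has no guard against two pages that redirect to each other, and it assumes the Location header always holds an absolute URL. Here is a minimal sketch of how both cases might be handled. It is not part of Meme13: the class name RedirectChain, the MAX_HOPS limit of 10 and the lookup function are all assumptions made for illustration. The lookup function stands in for the HEAD request and is expected to return the Location value for a URL that redirects, or null for one that does not.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not taken from Meme13.
class RedirectChain {
    // Assumed upper bound on the number of redirects to follow.
    static final int MAX_HOPS = 10;

    /**
     * Follows a chain of redirects and returns the final URL.
     * Returns startUrl unchanged if the chain loops or exceeds MAX_HOPS.
     * URI.create throws IllegalArgumentException on a malformed URL.
     */
    static String resolve(String startUrl, Function<String, String> lookup) {
        Set<String> seen = new HashSet<>();
        String current = startUrl;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            if (!seen.add(current)) {
                // this URL was already visited, so the chain is a loop
                return startUrl;
            }
            String location = lookup.apply(current);
            if (location == null || location.isEmpty()) {
                // no redirect here, so this is the current location
                return current;
            }
            // resolve a relative Location value against the URL that sent it
            current = URI.create(current).resolve(location).toString();
        }
        // too many redirects, so give up and keep the original URL
        return startUrl;
    }
}
```

In a real application the lookup function could wrap the same HEAD request used in checkFeedUrl. Keeping it as a parameter means the loop and relative-URL handling can be exercised with a plain Map, without touching the network.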

Syndicated 2008-05-26 03:46:06 from Workbench
