Older blog entries for robogato (starting at number 33)

2 Jun 2011 (updated 3 Jun 2011 at 19:32 UTC) »

Robogato Returns

We had a bad hardware crash recently and, as I was restoring Advogato to new hardware, I realized that it's been too long since I've devoted any significant time to improving the code around here. I took advantage of the downtime caused by the crash to make some final tweaks to the long-awaited libxml2 based HTML parser and made it live. It fixes a lot of the rendering problems already and will fix more once I make a few more tweaks.

I'm also working on improving security in general and making account creation by spammers harder in particular. I had a nice email exchange with dkg about the subject a while back. He took a look at the code and provided a laundry list of things that needed fixing or improving. I'm working on those now. The first change just went live this week - mod_virgule now requires the POST method for submitted forms. This minor change already stopped a couple of our automated account spammers who were creating accounts with GETs. Only the dumbest spammers were doing that, I'd think. Using POST isn't much harder. More changes to come.
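For anyone curious what a POST-only rule looks like, here is a minimal sketch in plain C. It is not the actual mod_virgule code: the function name form_method_allowed is made up for illustration, and a real Apache module would take the method from the request record rather than from a bare string.

```c
#include <string.h>

/* Hypothetical helper (not from mod_virgule): returns 1 if a form
 * submission should be processed, 0 if it should be rejected.
 * Only the POST method is accepted. */
int form_method_allowed(const char *method)
{
    if (method == NULL)
        return 0;
    /* HTTP method names are case-sensitive, so compare exactly. */
    return strcmp(method, "POST") == 0;
}
```

A handler would call this before touching any submitted fields and return an error page when it yields 0.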

If you're wondering what caused the increase in spam accounts we've been seeing for the last year, here's a possible contributor: Incansoft, apparently a purveyor of web-based spam tools, added an Advogato attack to a spamming tool they sell called Web20Bot (sorry, not going to link to it but you can google it). Web20Bot will create phony account profiles containing your backlink spam on 20 websites including Advogato.org, squidoo.com, wordpress.com, blogger.com, tumblr.com, and livejournal.com. They claim Web20Bot handles email verification and captchas, so working out a defense may be interesting. I doubt any of their spam lasts more than 48 hours around here anyway but it would be nice to make life harder for them. (incidentally, if someone were to come up with a copy of this thing so we could analyze it, that might be cool - maybe we could help other sites being attacked by it too).

Update: Thanks for pointing out those issues, Redi. I've fixed the diary edit problem; it should not have been checking for a POST. The <person>, <project>, and <wiki> tags were special cases in the old HTML handler. If one is broken, all three probably are. I'll get on that now. It will take me a little while to track down the problem. <proj> was deprecated in favor of <project> way back in the Raph days but the code checking for <proj> wasn't dropped until this most recent update. I didn't realize anyone still used it. I can add it back in.

Update 2: Ok, found the problem. The old tag handlers output directly to the apache buffer while the new handlers modify the XML tree, which is rendered to the buffer later. I need to modify or replace the handlers for those three tags. I'll try to get to it today if time allows.
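To illustrate the difference, here is a toy sketch in plain C of the tree-style approach: the handler only rewrites a node, and a separate render step produces the output later. The struct, the function names, and the /person/<name>/ link target are all assumptions for the example; the real code works on a libxml2 tree and the Apache output buffer.

```c
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a parsed document node. */
struct node {
    char tag[16];   /* element name, e.g. "person" or "a" */
    char href[64];  /* link target, empty if none */
    char text[32];  /* text content */
};

/* Tree-style handler: turns <person>name</person> into a link node.
 * Nothing is written to the output at this point. */
void rewrite_person(struct node *n)
{
    if (strcmp(n->tag, "person") != 0)
        return;
    strcpy(n->tag, "a");
    snprintf(n->href, sizeof n->href, "/person/%s/", n->text);
}

/* Rendering happens later, from whatever the node looks like by then.
 * Returns the number of characters snprintf wanted to write. */
int render_node(const struct node *n, char *out, size_t outsz)
{
    if (n->href[0] != '\0')
        return snprintf(out, outsz, "<%s href=\"%s\">%s</%s>",
                        n->tag, n->href, n->text, n->tag);
    return snprintf(out, outsz, "<%s>%s</%s>", n->tag, n->text, n->tag);
}
```

An old-style handler would have printed the link straight to the output inside rewrite_person, which is why it could not simply be dropped into the new parser.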

Update 3: I think the special tag issue is fixed now, let's try this code for a day or so and see if any problems show up.

<person> test: redi

<proj> test: mod_virgule

<project> test: mod_virgule

<wiki> test: WikiPedia:Advogato.org

Watch for Spammers

If you're wondering about the source of the recent increase in phony users signing up for Advogato accounts, I think I've found it. A number of Russian SEO/spammer blogs are discussing a list of websites that seem to be highly trusted by Google based on the ratio of pages in the main Google index to the supplemental Google index. Advogato is #16 on the list. (I'd provide some links but giving them links from Advogato is the last thing we should do. If you're curious you should be able to find them using a site like Technorati to find blogs that have linked to Advogato in the last few weeks.)

A side effect has been a big bandwidth hit. I thought at first we'd been slashdotted. But the main result is a rash of SEO spammers signing up for Advogato accounts and trying to find some way to get backlinks to their link farms and spam sites. Average survival time for their profiles has been less than 48 hours so probably nothing to worry about but everyone should take a look at the "recent people joining" list and flag anyone who looks like spam. Hopefully it will die down in a week or two.

24 Feb 2008 (updated 21 Jan 2009 at 19:12 UTC) »

Test post for the libxml2 HTML parser

In theory, the libxml2 HTML parser should make best guesses on how to fix screwed up, illegal HTML and all tags should get closed at the end of this diary entry, preventing problems in diary entries that follow or elsewhere on the page.

bold tag with no close

italics tag with no close

strike tag with no close
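The effect being tested can be approximated with a small stack of open tags. The sketch below is plain C and only a toy: it is not how libxml2 works internally, it handles only simple tags without attributes, and the name close_open_tags is invented for the example.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_OPEN 32
#define MAX_NAME 16

/* Copies `html` to `out` and appends closing tags for any simple
 * elements (like <b> or <i>) still open at the end. Tags with
 * attributes and mismatched closing tags are ignored.
 * Returns 0 on success, -1 if `out` is too small or nesting is
 * deeper than MAX_OPEN. */
int close_open_tags(const char *html, char *out, size_t outsz)
{
    char stack[MAX_OPEN][MAX_NAME];
    int depth = 0;
    size_t len = strlen(html);

    if (len + 1 > outsz)
        return -1;
    memcpy(out, html, len + 1);

    for (const char *p = html; *p; p++) {
        if (*p != '<')
            continue;
        int closing = (p[1] == '/');
        const char *s = p + 1 + closing;
        char name[MAX_NAME];
        size_t n = 0;
        while (isalnum((unsigned char)s[n]) && n < MAX_NAME - 1) {
            name[n] = s[n];
            n++;
        }
        name[n] = '\0';
        if (n == 0 || s[n] != '>')
            continue;                 /* not a simple tag; skip it */
        if (!closing) {
            if (depth == MAX_OPEN)
                return -1;
            strcpy(stack[depth++], name);
        } else if (depth > 0 && strcmp(stack[depth - 1], name) == 0) {
            depth--;                  /* matching close pops the stack */
        }
    }

    /* Close whatever is still open, innermost first. */
    while (depth > 0) {
        const char *open_name = stack[--depth];
        size_t need = strlen(open_name) + 3;   /* "</" + name + ">" */
        if (len + need + 1 > outsz)
            return -1;
        sprintf(out + len, "</%s>", open_name);
        len += need;
    }
    return 0;
}
```

Fed an entry such as "<b>bold <i>italic", it appends "</i></b>", so the unclosed markup cannot leak into the next entry on the page.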

Update Jan 2009: after a long downtime, I'm finally working on the HTML parser again. Should have it live this month!

Advogato Status Report

My New Year's resolution is to start doing monthly status reports again! Here's the first one.

Even though I haven't posted a status update in a while, minor code updates have continued. To find out what's changed in the live mod_virgule code running Advogato, see the changelog. It's always there and nearly always up to date.

The biggest change has been in the XML file store locking code. The previous system relied on a site-wide read/write lock that locked out access to the entire database when writes were happening. This was getting to be a problem because of the trust recalculations and diary syndication that happen at the top of the hour. Write locks were often clogging things up for 10 to 15 minutes per hour.

But it's all good now. All the locking code has been totally ripped out and replaced with file-level locking. There should almost never be any detectable site delays caused by locking now. Besides fixing the hourly slowdowns, this also gives us a little more breathing room to continue growing.
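As a rough picture of what file-level locking means, here is a sketch in plain C that takes an exclusive lock on just the one file being written. It assumes a POSIX system and uses flock(); whether mod_virgule uses flock(), fcntl(), or an APR locking call is not stated here, and the name write_locked is made up for the example.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Replaces the contents of `path` with `len` bytes from `data` while
 * holding an exclusive lock on that one file only. Readers and
 * writers of other files are never blocked.
 * Returns 0 on success, -1 on error. */
int write_locked(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {   /* blocks until the lock is free */
        close(fd);
        return -1;
    }
    int rc = 0;
    if (ftruncate(fd, 0) != 0 || write(fd, data, len) != (ssize_t)len)
        rc = -1;
    flock(fd, LOCK_UN);
    close(fd);
    return rc;
}
```

With a site-wide lock, one long hourly write blocked every request; with a per-file lock like this, only requests for the same file wait.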

Another recent change is a patch from fzort that improves the HTML parsing code to eliminate undesirable tag attributes. The long-term plan is still to switch to libxml2's HTML parser and junk the one in mod_virgule but, until then, this should make things a little more secure.

A few other fixes and improvements:

The GUID of syndicated blog posts is now preserved when they go out on the Advogato diary RSS feed.

Mod_virgule now has built in support for Google Analytics. Drop your GA ID code into the config.xml and the appropriate GA markup appears on every page throughout the site.

Joe Presbrey of MIT contributed a patch for an external FOAF URI on the user profile. This allows you to link your Advogato FOAF to any other existing FOAF profile you may have, helping to consolidate your online identity.

The computed trust level for each user is now exported via FOAF, referencing a local RDF schema that describes the trust levels. This mechanism was suggested by Sean B. Palmer and Dan Connolly on the W3C #swig IRC channel.

31 Aug 2007 (updated 31 Aug 2007 at 23:33 UTC) »

Advogato Status Report

A new rev of mod_virgule code is live on Advogato. See the changelog for the details. Here are a few highlights.

A discussion between ncm, raph, and chrisd speculated on why there seemed to be a decline in Google rankings for individual blog content on Advogato lately. It was suggested that a change in the Google ranking algorithm may be placing less value on pages with dynamic URLs like http://www.advogato.org/person/ncm/diary.html?start=191. Advogato has long had static URLs for individual articles, so I've added similar support for each individual blog post. If you click the permalink marker beside one of your blog posts, you'll see it now goes to a static URL with just that one post on the page instead of to a dynamic URL that includes a range of posts. For example: http://www.advogato.org/person/ncm/diary/190.html. The old, dynamic system is still in place so search engines and existing links will get to the right place, of course. There's another advantage to having the static URLs to individual blog entries. These will be used for comment pages eventually. Yes, blog comments are really coming. I promise. Some day.
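The static form is easy to generate from a username and an entry number. Here is a minimal sketch in plain C that follows the pattern of the example URL above; the function name diary_permalink is hypothetical and this is not the mod_virgule code itself.

```c
#include <stdio.h>

/* Builds the static permalink path for one diary entry, e.g.
 * /person/ncm/diary/190.html. The caller is assumed to have
 * validated `user` already.
 * Returns 0 on success, -1 if the buffer is too small. */
int diary_permalink(char *buf, size_t bufsz, const char *user, int entry)
{
    int n = snprintf(buf, bufsz, "/person/%s/diary/%d.html", user, entry);
    return (n < 0 || (size_t)n >= bufsz) ? -1 : 0;
}
```

Because the path has no query string, each post gets one stable address that a crawler can treat like any other static page.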

There's also a fix for a minor foaf:mbox_sha1sum bug that was noticed by Andreas Harth.

You may have noticed that our Italian cittaditorino spammers were back with a vengeance the last couple of weeks. The community spam flagging system seems to be controlling them. Most of the bogus accounts are being deleted within a few days of creation. At ncm's suggestion, I've added rel="nofollow" attributes to all links to untrusted users in the recentlog, recent people joining list, and Advogato People index. There were already nofollows on all links created by untrusted users but this new addition should prevent search engines from even indexing their profile and blog pages. With all these spam control measures in place, keep in mind it's a little harder than it used to be for real users to create an Advogato account and get certified. Well-known users aren't having much trouble and the new trust injected by adding mako as a seed has helped tremendously. But there are users here and there who haven't collected enough certs to become trusted, like pabs3.
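In outline, the link rendering now depends on the trust status of the user being linked to. A minimal sketch in plain C, with a made-up function name and an assumed /person/<name>/ profile path; it is not the actual mod_virgule rendering code.

```c
#include <stdio.h>

/* Renders a link to a user profile. Links to untrusted users carry
 * rel="nofollow" so search engines do not follow them. The username
 * is assumed to be validated and safe to embed in HTML.
 * Returns 0 on success, -1 if the buffer is too small. */
int render_profile_link(char *buf, size_t bufsz, const char *user,
                        int trusted)
{
    int n = snprintf(buf, bufsz, "<a href=\"/person/%s/\"%s>%s</a>",
                     user, trusted ? "" : " rel=\"nofollow\"", user);
    return (n < 0 || (size_t)n >= bufsz) ? -1 : 0;
}
```

Once a user collects enough certs to become trusted, the same call simply stops emitting the attribute.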

That's all the news for now but more new features are on the way.

The URL rendering bug that redi spotted has been fixed, I think. Looks like it was an artifact of the Apache APR 1.3 to 2.0 upgrade that had gone unnoticed for quite a while. If anyone spots any other URL issues in the project section, let me know.

Advogato Status Report

A new rev of mod_virgule code is live on Advogato. See the changelog for the details.

Aside from the usual minor bugfixes and tweaks, there are two new features you may have noticed already.

New certification indicators: A visual indication is now added to trust certifications that are less than 30 days old. This should make it easier to spot new certs on the user profiles. You can check this out on your own user profile if you've certified anyone, or been certified by anyone, in the last 30 days.
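The 30-day test itself is a simple time comparison. Here is a sketch in plain C, assuming the cert's issue time is available as a time_t; the names cert_is_new and NEW_CERT_SECONDS are invented for the example.

```c
#include <time.h>

/* Thirty days, expressed in seconds. */
#define NEW_CERT_SECONDS (30L * 24 * 60 * 60)

/* Returns 1 if a cert issued at `issued` is less than 30 days old
 * at time `now`, 0 otherwise. Certs dated in the future are not
 * treated as new. */
int cert_is_new(time_t issued, time_t now)
{
    double age = difftime(now, issued);
    return age >= 0 && age < NEW_CERT_SECONDS;
}
```

The profile page would call this once per cert and add the visual marker when it returns 1.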

Article lists: Ever wonder how many Advogato articles you've posted? Or wanted to read other articles by a particular poster? Each user profile now includes a reverse chronological list of the 10 most recent articles posted by that user. For users who are more prolific, there is a link to a separate page that includes a complete listing of all articles posted by that user.

In addition to providing a new way to explore Advogato's articles, this should provide another direct route for search engine robots to find the static links to the articles.

11 Jul 2007 (updated 11 Jul 2007 at 20:40 UTC) »

Advogato Status Report

New mod_virgule code is live on Advogato. See the changelog for the details.

More minor bug fixes. The aggregator should do a better job now of rejecting dupes from feeds that retroactively alter the post date on blog entries. The no_cache and no_local_copy flags in the Apache request records are now set for logouts to prevent browsers from caching old logout results and to prevent the server from sending a 304. This was preventing some Galeon users (and possibly other browsers) from logging out.
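One plausible way to reject such dupes is to compare entries by their feed GUID and ignore the post date entirely. The sketch below, in plain C, shows that idea only; it is an assumption about the approach, not a description of what the aggregator actually does, and the struct and function names are hypothetical.

```c
#include <string.h>

/* Minimal stand-in for a syndicated entry. */
struct feed_entry {
    const char *guid;   /* identifier supplied by the feed */
    long post_date;     /* may be changed retroactively by the feed */
};

/* Returns 1 if `candidate` matches an already stored entry. Only the
 * GUID is compared, so an altered post_date does not defeat the check. */
int is_duplicate(const struct feed_entry *stored, size_t count,
                 const struct feed_entry *candidate)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(stored[i].guid, candidate->guid) == 0)
            return 1;
    return 0;
}
```

A feed without stable GUIDs would need some other key, such as a hash of the entry body.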

I replaced the social bookmarking test links on the article pages with a fully functional social bookmarking tool, linked from the standard "share this" icon. The share link is now available on project and profile pages as well as on articles. If someone has a favorite social bookmarking service that's not listed yet, let me know and I'll add it.

Time has been a scarce resource for me lately, so progress through the ToDo list has been slower. More updates as time allows and, as always, patches are welcome.

Social Networking

Google sponsored a CMU project last year to study and reinvent online social networking. The result was Socialstream, a design concept based on the idea of a Unified Social Network (USN). A lot of what they came up with sounds similar to what the semantic web folks are working on with OpenID and formats such as FOAF and DOAP. Basically, they're suggesting that social network sites standardize on a data sharing format that would allow them to easily interact with each other and become part of a larger network of sites.

The project also did some interesting research, ranging from social networking theory and taxonomy to identifying common complaints about social networking sites and desirable features. They also researched who uses social networks and broke down the results into archetypical user types. The researchers also created a video demo of the Socialstream concept site. Some of the ideas they mention are already in Advogato or are on the ToDo list. I think there are plenty of other ideas here we can incorporate into Advogato as well.

Trust/Authority Metrics

Someone pointed out a link to an article by Michael Jensen in the Chronicle Review: The New Metrics of Scholarly Authority. It talks a lot about Web 2.0 authority models. It mentions the Google PageRank system but, oddly, leaves out any mention of the mod_virgule trust metrics implemented on Advogato. Still, it's an interesting read.

Advogato Status Report

A new rev of mod_virgule code is live on Advogato. See the changelog for the details.

Mostly minor stuff. Setting a project staff relation to none now consistently removes the relation from your user profile. Thanks to Gary Benson for noticing the bug. I upgraded the server from CentOS 4.4 to 4.5. This was just a maintenance update and shouldn't cause any changes. We're having another wave of account spam lately but the new flagging system has largely controlled it. One of the spammers discovered a way of circumventing the code which strips anchor tags posted in the notes field of untrusted accounts. I've fixed the bug that allowed this.

GPL v3 Release Party in Dallas?

The GPLv3 is supposed to be released on 29 June. I saw joolean mention a GPLv3 release party in Brooklyn and figured, why not here in Dallas too? If there are any other Advogatoans in the DFW area who'd like to get together to celebrate the release of the new and improved GPL, let me know.

Trust Metric Growing Pains

The good news is that Advogato is growing again. The bad news is that this is bringing to light some issues with the trust metrics. First, there is a growing number of new users who have multiple certs but are still rated as Observer. Second, there was the related incident with user OpenSpecies. Many people thought his blog posts looked spammy and flagged him as spam. Other users trusted him at Apprentice or Journeyer level but even with six or seven certs he never acquired enough gato-juice to reach Apprentice level. Because he stayed at Observer level, his account was always at risk of being classified as spam. This happened once, resulting in the decision to increase the spam score required to delete an account. I reinstated his account from a backup. A few months later it had been flagged as spam enough times to get deleted again. I restored it again, but OpenSpecies opted to move elsewhere and requested the account be permanently deleted.

The lack of gato-juice available for certifying people can be traced back to an issue with the trust metric seed users. Of the four original seed users, only raph is actively visiting Advogato and certifying users. Federico has visited in the last year but no longer certifies any users. Miguel hasn't visited in many years and only certified a handful of users. Alan has certified many users but no longer seems to be an active user himself (hopefully I'm wrong about that). This means there are really only two seeds and almost all the trust flowing to new users through certification is at best several generations removed from them.

To improve the situation, I'm going to add a few new seed users. This will need to be done gradually so that we can make sure it fixes the problem without resulting in cert inflation. My criteria for selecting new seed users will be:

1) Must be currently rated as a master by at least one of the original seed users
2) Must be rated as master by other non-seed users
3) Must be an active Advogato user who visits the site regularly and has posted at least one article
4) Must be reasonably well known within the community and have occasion to meet and interact with many other Free Software developers in person.

I talked with Raph about possible ways of handling this: elections, nominations, automated selection by the trust metric itself, or just picking someone. Eventually, I think it would be interesting to have the trust metric select new seeds automatically as needed but that will take more time for testing and experimenting than I've got right now. So, to save time, I've initially opted for picking someone who meets the qualifications. Our first new seed is: mako. By a handy coincidence, he's traveling to several European conferences over the next few weeks, giving him a chance to meet more people who may need certifying.

This is one of several things that I think should start pumping some new life into the trust metrics. Another issue I'm looking at is what to do with inactive users who have become stagnant sources in the trust metric network flow. These include users who will not return for one reason or another such as ettore, sisob or lilo. Trust passing through these nodes is essentially unchangeable, which is a problem because trust in the real world is dynamic. Sometimes we trust a person today that we didn't yesterday. Sometimes we no longer trust someone that we trusted in the past. If enough certs become stagnant and cannot be removed, this tends to make the trust metrics inaccurate. One way of dealing with this is to identify users who are inactive and expire their outbound certs automatically after enough time has elapsed. The tricky part is deciding how long a user has to go without visiting the site before being considered inactive. DV, for example, is an active user yet has gone for as much as a year between logins. Federico, one of our seed users, hasn't logged in for seven months. Right now, I'm thinking that exceeding one year without a login is a pretty good indication of inactivity.
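The one-year rule would boil down to a check like the following. This is only a sketch in plain C of the threshold being considered, with invented names (user_is_inactive, INACTIVE_SECONDS) and a 365-day year assumed; nothing like it is live yet.

```c
#include <time.h>

/* One year without a login, approximated as 365 days in seconds. */
#define INACTIVE_SECONDS (365L * 24 * 60 * 60)

/* Returns 1 if more than a year has passed between `last_login`
 * and `now`, 0 otherwise. */
int user_is_inactive(time_t last_login, time_t now)
{
    return difftime(now, last_login) > INACTIVE_SECONDS;
}
```

Under this rule a user who last logged in seven months ago still counts as active, while one who has been away for thirteen months would have outbound certs expired.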

Advogato buzz

Advogato showed up on a list of social network site statistics at the X2iN blog: Social Network Marketing, the Sky is the Limit.

Advogato's founder Raph Levien will be giving a talk titled Advogato: Lessons Learned at 6:30 PM on Monday, June 25 as part of Google's Open Source Developers @ Google series. The talk will be at Google's Mountain View campus. Guests are welcome and should sign in at Building 43.

Advogato Status Report

A new rev of mod_virgule code is live today on Advogato. See the changelog for the details.

The mod_virgule config.xml file now supports a list of authorized "editors". Article posting privileges can be limited to these editors. Don't worry, this feature isn't intended for Advogato, where all certified members will continue to be able to post articles. It will be used on robots.net. In the past robots.net was configured such that only the users who were trust metric seeds could post stories. As robots.net has grown, the need has arisen to make a clear distinction between the list of trust metric seed users and the article editors. I think this feature will be useful on other sites that use mod_virgule as well.

I've tweaked the HTML layout of the diary entries, replacing the older style markup with divs. At the request of trs80, the div wrappers on each diary entry now include the username as a second class. While not needed for CSS, this additional class designator can be used by screen-scrapers to easily identify the author of each entry in the recentlog. Screen-scraping aggregators can use this as part of a dupe-control mechanism. This same username as class convention is used on many Planet sites, so it should make Advogato's recentlog more easily parsable by existing Planet scrapers. The fun part was the slight difference between legal mod_virgule usernames and legal CSS1/2 class names. This prompted the creation of a new utility function, virgule_force_legal_css_name(). Supplied with an arbitrary string of text, this function will return a properly escaped CSS1 class name.
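To give a feel for the problem, here is a simplified sketch in plain C. It is not the actual virgule_force_legal_css_name() code: instead of escaping, it replaces anything other than ASCII letters, digits, and hyphens with a hyphen and prefixes an 'x' when the name does not start with a letter. The function name css_class_from_username is made up for the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Writes a conservative class name for `name` into `out`: only ASCII
 * letters, digits, and '-' are kept, other characters become '-',
 * and a leading 'x' is added if the first character is not a letter.
 * Returns 0 on success, -1 if `out` is too small. */
int css_class_from_username(const char *name, char *out, size_t outsz)
{
    size_t o = 0;

    if (outsz == 0)
        return -1;
    if (!isalpha((unsigned char)name[0])) {
        if (o + 1 >= outsz)
            return -1;
        out[o++] = 'x';
    }
    for (size_t i = 0; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char)name[i];
        if (o + 1 >= outsz)
            return -1;
        out[o++] = (isalnum(c) || c == '-') ? (char)c : '-';
    }
    out[o] = '\0';
    return 0;
}
```

Substitution like this is lossy, since two different usernames can map to the same class name, which is one reason a real implementation would escape rather than replace.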

More good Advogato buzz

Andrey Golub of Milan IN recently discovered Advogato and gave a nice mention in his blog. He also added Advogato to Milan-IN's listing of Online Social Networking Platforms. Perhaps this will bring a few other new Advogato members from the Italian free software community our way.

Dan York also gave us a great mention in his blog. He's an Advogatoan from way back who left Advogato for LiveJournal during the extended Advogato server outage back in 2004. He was writing to commemorate his 7th year of blogging and rediscovered Advogato in the process. His entry summarizes the recent changes on Advogato and suggests dyork may be making an appearance in our recentlog again soon.

During a recent discussion on the Extreme Programming mailing list about the possibility of a certification mechanism for XP programmers, Martijn Meijering suggested that a community trust metric system similar to Advogato's might be a desirable alternative to certification based on traditional knowledge-based testing.
