State of the Gato Address for 2007

Posted 1 Nov 2007 at 05:39 UTC (updated 1 Nov 2007 at 05:42 UTC) by robogato

One year ago, in October of 2006, Advogato was transferred into new hands for its care and maintenance. As we come to the end of October 2007, I thought it would be appropriate to take a look at where we are one year later. Many bugs have been fixed. A few new ones have cropped up. Many of the requested features have been added but the ToDo list is still dauntingly long. Account creation by spammers is down. Account creation by real users is up. Overall, I think we've made a good start at making Advogato relevant again but there's still much to do. I'll try to lay out a general roadmap of the work to be done. And, of course, this is an ideal time to chime in with more bug reports, feature requests, and general comments you may have about what sucks and what rocks on Advogato.

What Got Done This Year

We successfully brought the blog and profile spamming under control with the addition of the trust-based spam flagging system. Literally thousands of fake accounts were eliminated during the initial weeks after the changeover. Spammers still set up occasional accounts but they don't last very long. In October of 2006, Advogato had a total of 13,868 accounts but over two thousand were spammers. As of today we have 13,310 accounts, almost all of them real accounts. We briefly dropped to 11,000 users and have had steady growth since then.

Advogato was moved to the same codebase as robots.net, so the two largest mod_virgule sites are now running on the same code. This helped tremendously in removing hardcoded HTML and references to Advogato, making many features configurable, and moving mod_virgule closer to being easy to use for new sites. A lot of minor bugs were fixed making mod_virgule much more stable.

FOAF support was added, making Advogato a part of the semantic web. This feature allows Advogato to export both the social network graph and the user trust ratings in a standard form that can be used easily by other websites and web applications.

A general purpose blog aggregator was added to bring back the blogs of Advogato users who have moved their primary blog to another site. The aggregator supports most variants of RSS, ATOM, and RDF Site Summary formats. This has largely been successful and the recentlog now includes syndicated blogs from 140 Advogato users who had left the site.

Advogato's own RSS feeds have been updated from RSS 0.91 to RSS 2.0, removing any reliance on the now defunct Netscape RSS 0.91 DTD document.

Internationalization support is a little better now. UTF-8 support is much more robust, allowing posting in nearly any language (English is still the preferred language of most of our readers but it's nice to be able to post in others as needed). The timestamps used throughout the site now reflect UTC instead of US Pacific time.

Mod_virgule now has a trust metric repair procedure that was able to recover close to one thousand certifications that were lost when user profiles were damaged in a couple of ancient disk crashes. We also replaced one of Advogato's original trust metric seeds who had become inactive, federico, with mako, a Master-certified user who is also on the Free Software Foundation's board of directors. These changes have generated some minor improvements in the trust rankings.

Advogato now has a FAQ with a growing number of questions and answers. The site also got a minor facelift to bring the HTML up to date. There have also been lots of minor bug fixes and feature additions such as the new social bookmarking widget.

What Didn't Get Done This Year

Despite all the improvements there are a lot of features that didn't get implemented yet. Don't worry, these are still on the ToDo list and they are getting closer. In most cases, the delayed features depend on some major code refactoring that's not completed yet. Patches are always welcome of course!

Probably the most desired feature that we don't have yet is blog commenting. OpenID support is also frequently requested. Several users including TimBL have requested DOAP support for the projects. How can I say no to a feature request from Tim? :) Recentlog improvements, including the ability to page back through previous dates and get an RSS or ATOM feed of the recentlog, are two more frequent requests. Advogato still badly needs some form of control over what gets posted in the main story queue on our front page. And there are plenty of less common features on the ToDo list.

Some of the new features that were implemented have resulted in other work that needs to be done. For example, the blog aggregator has stretched mod_virgule's HTML sanitizing code to the limits. The range of nasty, invalid, and not well-formed markup being fed into Advogato is truly amazing. Since use of the aggregator is dependent on being a trusted user, at least we can feel somewhat comfortable knowing that it's most likely not malicious markup, but it still needs to be fixed ASAP.

Another problem we're beginning to see is a slowdown at the top of the hour when the trust metrics and other site maintenance procedures run. To an extent it's the old "victim of our own success" problem. We have 2,000 more users to calculate trust metrics for, 140 new blogs being handled in the recentlog and those numbers are continuing to grow. The load is pushing the limits of the current locking system on the file-based XML datastore.

Where We're Going

A lot of what I did last year was not really planned out very far in advance. It was largely a result of trying to respond to the biggest problems first. Now that things are beginning to feel a little more stable, I think I can afford the luxury of laying out some sort of roadmap. Because I'm doing most of this work myself in my spare time, I don't think I can reliably attach dates to the roadmap. But at least you'll have an idea of where things are headed and what I'll be working on next.

The first item on my list is to scrap mod_virgule's internal HTML parser/sanitizer and use the more robust one included in libxml2. This will take care of many of the HTML issues uncovered by blog aggregation including well-formedness, purging potential javascript injection and XSS exploits, stripping syndicated advertising sneaking in with blog posts, and other issues.
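To give a rough idea of what allow-list sanitizing looks like, here is a minimal sketch in Python. It uses the standard library's html.parser rather than libxml2 (which is what the plan actually calls for), and the allow-lists, the class name Sanitizer, and the function name sanitize are all assumptions made up for this example, not mod_virgule's real rules.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-lists for illustration; the real site policy may differ.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "blockquote",
                "pre", "code", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")
DROP_CONTENT = {"script", "style"}  # drop these tags and everything inside


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # sanitized output fragments
        self.open_tags = []  # stack of allowed tags that are still open
        self.skip_depth = 0  # greater than zero while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are removed, their text is kept
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # drops onclick= and other unlisted attributes
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unlisted URL schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return  # stray close tag with no matching open tag
        # Close any inner tags left open so the output stays well-formed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    def finish(self):
        # Close whatever the author forgot to close.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(markup):
    parser = Sanitizer()
    parser.feed(markup)
    parser.close()
    return parser.finish()
```

The design choice worth noting is that everything is rejected unless it is explicitly listed, so syndicated markup that the lists never anticipated is dropped rather than passed through.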

The biggest roadblock that's holding up a lot of the features on the ToDo list is the hardcoded database schema, if schema is the right word for something that just exists as data structures and hardcoded HTML forms spread throughout a dozen different files. Trying to do something as simple as adding or deleting a field in the user profile can be a fairly time consuming and painful process. Solve that problem and it becomes much easier to add comments to blog entries, or add GPG fields and geolocation fields to the user profiles.

I'm refactoring a lot of code to strip out all the hardcoded schemas and replace them with a single piece of code that loads XML schema descriptions for profiles, projects, stories, and whatever else we need. This should greatly reduce the size and complexity of the mod_virgule code while making Advogato much easier to customize. I expect this to be live before the end of 2007. The first use of the new and improved system will be a complete rebuild of the Advogato project database that will include full DOAP support. DOAP does for projects what FOAF does for user profiles, allowing us to expose the information to RDF and XML aware web apps in a standard way.
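To illustrate the general idea of driving forms and validation from a schema description instead of hardcoded structures, here is a small Python sketch. The XML layout (a schema element containing field elements with name, label, type, and maxlen attributes) and the function names are invented for this example; the actual mod_virgule schema format may look quite different.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical schema description; element and attribute names are invented.
PROFILE_SCHEMA = """
<schema name="profile">
  <field name="givenname" label="Given name" type="text" maxlen="40"/>
  <field name="homepage" label="Homepage" type="url" maxlen="200"/>
  <field name="gpgkey" label="GPG key ID" type="text" maxlen="16"/>
</schema>
"""


def load_schema(xml_text):
    """Turn a schema description into a list of field dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        {
            "name": f.get("name"),
            "label": f.get("label"),
            "type": f.get("type", "text"),
            "maxlen": int(f.get("maxlen", "255")),
        }
        for f in root.findall("field")
    ]


def render_form(fields, values):
    """Generate one HTML input per schema field, with no per-field code."""
    rows = []
    for f in fields:
        rows.append(
            '<label>{label} <input name="{name}" maxlength="{maxlen}" '
            'value="{value}"></label>'.format(
                label=escape(f["label"]),
                name=escape(f["name"], quote=True),
                maxlen=f["maxlen"],
                value=escape(values.get(f["name"], ""), quote=True),
            )
        )
    return "\n".join(rows)


def validate(fields, values):
    """Return a dict mapping field name to an error message."""
    errors = {}
    for f in fields:
        value = values.get(f["name"], "")
        if len(value) > f["maxlen"]:
            errors[f["name"]] = "too long"
        elif f["type"] == "url" and value and not value.startswith(("http://", "https://")):
            errors[f["name"]] = "not an http(s) URL"
    return errors
```

With this shape, adding a field such as a geolocation entry would mean adding one field line to the schema description rather than editing forms and data structures in several files.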

Next up, I plan to overhaul the user profiles, moving them from the current hardcoded schema to the new XML schema module. This will allow us to more easily add new fields for GPG keys, GPG signatures, alternate RDF IDs, Geolocation info, IM IDs, Flickr photo streams, and all sorts of other cool stuff.

After that, I'm looking at adding OpenID support. For new users signing up, I'd like them to be able to give their OpenID and a FOAF URI, allowing us to import all the rest of their account info without the need to manually enter it. We'll also allow OpenID logins, though an Advogato account will still be needed to be considered at higher than Observer level. Depending on how much work it is and how badly we want it, I may also try to set up Advogato as an OpenID provider.

Finally, I'd like to do something about the continued sad state of stories on the front page. We're getting some good stories posted but we continue to see substandard stories too and even stories posted by accident. My biggest concern here has been in how to fix the problem without taking away the freedom that every Advogato user now has to post a story. My current theory about how to do this is that we need a story preview queue that is visible only to certified users. Any certified user can post a story and it will go into the preview queue. All other certified users can see it and give it a simple +/- rating weighted proportionally to their certification level. Any story that exceeds a configurable threshold will be deemed suitable for publication and promoted to the front page, where it will be visible to the world and syndicated via the RSS feed. Stories that don't achieve the threshold rating within some configurable amount of time will expire and be deleted.
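The preview queue described above could be sketched roughly as follows in Python. The numeric weights per certification level, the default threshold, and the default expiry time are placeholders chosen for the example; nothing here reflects actual mod_virgule code.

```python
from dataclasses import dataclass, field

# Hypothetical weights; the proposal only says votes are weighted
# proportionally to certification level.
LEVEL_WEIGHT = {"Apprentice": 1, "Journeyer": 2, "Master": 3}


@dataclass
class QueuedStory:
    title: str
    author: str
    submitted: float                            # seconds since the epoch
    votes: dict = field(default_factory=dict)   # username -> signed weight

    def vote(self, user, level, up):
        """Record a +/- rating; a later vote by the same user replaces the earlier one."""
        if user == self.author:
            return  # only *other* certified users rate a story
        weight = LEVEL_WEIGHT.get(level, 0)  # Observers carry no weight
        if weight:
            self.votes[user] = weight if up else -weight

    def score(self):
        return sum(self.votes.values())


def triage(stories, now, threshold=6, max_age=3 * 86400):
    """Split the queue into promoted, still-waiting, and expired stories."""
    promoted, waiting, expired = [], [], []
    for story in stories:
        if story.score() > threshold:          # "exceeds a configurable threshold"
            promoted.append(story)
        elif now - story.submitted > max_age:  # ran out of time in the queue
            expired.append(story)
        else:
            waiting.append(story)
    return promoted, waiting, expired
```

For example, a story rated up by one Master and one Journeyer and down by one Apprentice would score 3 + 2 - 1 = 4 under these placeholder weights.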

To summarize, here's what's in store for 2008:

  • Replace internal HTML parser/sanitizer with libxml2 HTML parser
  • Replace hardcoded schemas with general XML DB schema module
  • Rewrite project module (including DOAP support)
  • Overhaul user profiles (GEO, GPG, RDF IDs, etc)
  • OpenID Support (client and server)
  • Overhaul front page news module (add a story approval queue)

There will likely be other minor features and bug fixes along the way, including some locking optimizations on the XML datastore and a few more tweaks to the certification system. But the above list represents my main goals for the next year. Working on my own in my spare time, this may be an optimistic roadmap. On the other hand, if we collect a few other C programmers along the way and build up a little momentum, maybe we'll get it done sooner and move on to the other items on the ToDo list.


Recentlog RSS feed?, posted 1 Nov 2007 at 13:11 UTC by wlach » (Master)

Thanks so much for your work on Advogato. The site's been getting more and more useful to me over the past year: the diversity of people posting on the recentlog is now truly impressive; I learn something new of interest just about every day.

These proposed improvements all sound great. I'm especially looking forward to OpenID support (client-side) and a story approval queue. One thing you didn't mention is an RSS feed for the recentlog. Any plans for this? It would be very convenient for those of us that read blogs through an aggregator.

Also, just out of curiosity, why not switch Advogato to using a real database for storing its information? This would seem more natural to me than a file-based XML datastore.

Many thanks, posted 1 Nov 2007 at 19:35 UTC by jemarch » (Master)

and keep up the hard work! :)

Thanks, posted 1 Nov 2007 at 19:52 UTC by ncm » (Master)

Thanks, Steven. I really like what has been done with a.o so far. I don't know if it will ever snag that elusive "relevance", but it's useful and improving, so who cares?

I agree that the articles represent the greatest current embarrassment, and your plan to fix them seems sound. (Probably deleting Mario's accidental article would be doing him a favor.)

Support for replies to blog postings seems much less important, and I'd be just as happy not to see it at all. What might be more useful is a formal way to reference other blog entries in our own, such that a link to the later blog entry shows up in the earlier one. I envision just a sequence of user-names, at the end, that link to the later postings.

Spammers, posted 1 Nov 2007 at 20:02 UTC by ncm » (Master)

Oh, and about spammers... they seem to be surviving the gauntlet, at a rate of about two per day, still, e.g. "shadi" at the moment. It appears a threshold count of 10 would cut that way down. I don't know what the spammers have in mind; maybe they hope to lie low until they run off the "new members" list, and then fill their profiles with links.

On front page article filtering, posted 1 Nov 2007 at 20:33 UTC by cdfrey » (Journeyer)

I like your idea of having a queue for articles that are only visible to certified users, and then having a voting system to determine which get posted to the front page.

I don't like the idea of deleting articles that don't make the cut.

Pondering this some more, I think the real solution is to remove article posting entirely, and do everything through the diary queue. Every post becomes a diary entry, and any diary entry that has aspirations of being an article, the author can flag as such. This would add the post to the article queue, and the voting process continues.

This has the advantage that no user is ever silenced, and that everyone can read everything that is ever posted. It also has the advantage that diary entries which the author didn't think were good enough for an article, but in fact are, could be voted onto the main page anyway. This would be a good way to promote our peers who put a lot of effort into their diary entries.

On blog comments, posted 1 Nov 2007 at 20:36 UTC by cdfrey » (Journeyer)

Continuing with the theme that "everything is a diary entry", I think there is value in the idea that every comment to a diary entry could be a diary entry itself.

The system should store back links to original posts that are being replied to, as well as provide conventional blog views that have all the comments on one page. These blog views could start anywhere on the conversation chain, with the option of jumping to the top.

Agree, posted 1 Nov 2007 at 22:26 UTC by ncm » (Master)

I agree with Chris Frey's suggestions.

Re: various stuff, posted 2 Nov 2007 at 00:30 UTC by robogato » (Master)

wlach said:
RSS feed for the recentlog. Any plans for this?
Yes, this is definitely on the ToDo list.

wlach also said:
why not switch Advogato to using a real database for storing its information?
Using a real database is the best long term solution and I think that's exactly what we'll do. I'm putting it off for now because I'd like to wait until the Apache APR DBD layer is included in production Linux distros. Using the Apache DBD stuff should be a nicer alternative to either writing the DB interface code myself or requiring yet another library at build time. Meanwhile, I think with a few locking optimizations, the current XML file system should hold up under our current rate of growth for at least another year, so there's no hurry.

ncm said:
What might be more useful is a formal way to reference other blog entries in our own, such that a link to the later blog entry shows up in the earlier one.

cdfrey said:
I think there is value in the idea that every comment to a diary entry could be a diary entry itself.

It's an interesting possibility. One concern I have is making sure that whatever we do easily interoperates with other blogs and blogging sites. For example, if I read your blog post and click a "comment" button, whatever happens should work in a useful way whether your blog is a local Advogato blog or a syndicated blog from somewhere else. The idea that comments are just regular blog postings referenced in a special way might be just the way to go there. Further thought and maybe some experiments may be in order.

ncm said:
Oh, and about spammers... they seem to be surviving the gauntlet, at a rate of about two per day

Granted, those are probably potential spammers lying low. Perhaps it would be more accurate if I said no spammers that actively spam their profile or blog last very long. My main concern is avoiding the spam itself. I'm usually hesitant to flag an account as spam unless I see something that's unquestionably spam like links to viagra or seo sites.

cdfrey said: I think the real solution is to remove article posting entirely, and do everything through the diary queue. Every post becomes a diary entry, and any diary entry that has aspirations of being an article, the author can flag as such...

I'm less thrilled with this idea because I tend to think of blog posts and articles as fundamentally different from each other. But without a strong supply of articles to back me up, I suppose I won't argue too hard either way for now. :-)

Articles, posted 2 Nov 2007 at 05:54 UTC by ncm » (Master)

One reason I like Chris's suggestion about articles is that some diary postings really are articles, and I'd like to be able to promote them.

Another is that the ability to vote on promoting entries to articles will engage people. People will write hoping to be voted to article status, and therefore write better. People reading will have something to do besides scroll down. Those writings have a chance to achieve a little more permanence than a mere blog posting.

About Apache DBD... my experiences with Apache projects have been very bad. Unless you know a lot about this one, I would recommend staying far away. Just code directly to PostgreSQL; it's fast, works well, and will be around forever.

Instead of going directly to a database, though, you might consider connecting to a distributed source-control system, and let it manage files. Then, you can let anybody edit any entry (of their own) anywhere, but also let anybody see the entry's entire history, if they care. You might let people pull their whole change history into a local repository. Oh, and a pony.

Articles and diary entries, posted 2 Nov 2007 at 09:01 UTC by ingvar » (Master)

ncm, would it make sense to have both "flag diary entry as article" and "submit article", both ending up on a 'prospective articles' queue, where they get to live for N days, for voting? Obviously, a diary entry that is removed from the prospective-article queue is still in the author's diary, but something submitted straight off would then be a candidate for garbage collection.

flags, posted 3 Nov 2007 at 00:18 UTC by ncm » (Master)

If people could flag their own entries, that would be the same as "submit".

trust metric locking, posted 3 Nov 2007 at 20:25 UTC by lkcl » (Master)

steven,

one of the first things that i did with virgule was to remove the "global" lock on trust metric and i think what i did was have a per-directory or perhaps even a per-file lock. it wasn't difficult.

the simple alternative is to have one process do the write, and then only at the end perform a lock, rename, unlock.

the issue is that aaany access is locked out by a global lock. quite simple to fix, really.
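a rough python sketch of the "write first, rename at the end" idea, just to show the shape of it. the function name atomic_write is made up for this example, and this is not virgule code. it relies on rename being atomic on POSIX when the temporary file and the target are on the same filesystem, so readers need no lock at all; a short per-file lock around the rename would only be needed to serialise competing writers.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write bytes to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        # The slow work is done; the swap itself is a single rename.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

the point is that the expensive part (computing and writing the new data) happens with nothing locked, and only the final swap needs to be exclusive.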

articles, posted 3 Nov 2007 at 20:47 UTC by lkcl » (Master)

as the author of nearly 5% of advogato's content and about 2% of advogato's _useful_ content, i can see benefits for a "store and solicit input" option but i do not believe in, in fact very strongly disagree with, the concept of "democracy".

people with ideas are often considered to be stupid for coming up with the ideas. mostly because the people doing the criticising cannot conceive ever of how the idea can be brought to fruition.

a lovely quote springs to mind:

"the reasonable man adapts himself to the world. the unreasonable man adapts the world to himself. therefore, all progress depends on the unreasonable man."

additionally, article publication at short notice would be out the window. forcing authors to seek out other advogato users and solicit - perhaps even bribe - their "votes" seems counterproductive.

it's a risk to take.

_plus_, what is to stop an author from creating a chain of valid users, Certified by them, and logging in as those users in order to add "votes"? even if you change Certification such that it requires 3 other "Masters" for example to Certify you before you yourself can receive a "Master" Certification, that means instead that you are encouraging users to set up "cartels" to perform "voting".

no - i really don't like the idea of "+/-" voting at all. Free Software users should be sensible enough to consider whether their content, which is not particularly high traffic (slashdot: one article every 20 mins? advogato: one article every... 3 days? 5 days? i think i saw 3 weeks go by, once, without anyone posting)

look at it - there's been only 953 articles in 6 years, guys - that's an average of 1 every 3 days.

it's _really_ not worth the effort and it really smacks even more of "elitism" than when advogato was itself first established.

if you don't like people's content, well.... tough! so what! that's your problem. go read slashdot, theregister, kuro5hin or userfriendly instead, enjoy your life, and don't waste your breath or your time telling people how you didn't like what they had to say just because they said it - it's not helpful.

be constructive in your criticism (god knows i don't do that enough myself, i know).

so there are, i believe, more constructive ways to "filter" content, for example with the wonderful use of tagging. a neat way to combine in trust metrics would be for example to use the diary-rating system on each per-article "tag" or even each article itself.

i dunno - i'm just throwing out ideas here.

and - to reiterate - yes i'm fully aware that ... i forget how many it is, last time i ran that python program which counts the number of articles per author - i think it was well over 60 articles i'd written, and mostly forgotten about.

enough to make a veeery boring book :) ha ha

"in reply to", posted 3 Nov 2007 at 20:53 UTC by lkcl » (Master)

yes! a "post a reply on your own diary" entry would be very helpful. it's an incredibly simple idea - would take probably about... 150 lines of c code, i imagine, to add that.

then, on each diary entry, show at the top "this entry was in reply to 'x'" and at the bottom "these people responded with 'y'" or... whatever.

it just needs an extra xml tag in the diary entries, just like the certifications:

in joe's 200th entry:

<diaryreply user="fred" entry="152" />

in fred's 152nd entry: <diaryresponse user="joe" entry="200" />

simple, really.
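a quick python sketch of the bookkeeping, using the two tags above. the surrounding <entry> and <body> elements and the function name link_reply are invented for the example; the real diary entry format may differ.

```python
import xml.etree.ElementTree as ET


def link_reply(reply_entry, reply_user, reply_num, orig_entry, orig_user, orig_num):
    """Record a reply in both entries using the two proposed tags."""
    # In the replying entry: point back at the entry being answered.
    ET.SubElement(reply_entry, "diaryreply", user=orig_user, entry=str(orig_num))
    # In the original entry: point forward at the response.
    ET.SubElement(orig_entry, "diaryresponse", user=reply_user, entry=str(reply_num))


def responses(entry):
    """List (user, entry number) pairs for everyone who replied to this entry."""
    return [(r.get("user"), int(r.get("entry"))) for r in entry.findall("diaryresponse")]


# Hypothetical entry wrappers, standing in for joe's 200th and fred's 152nd entries.
joe_200 = ET.fromstring("<entry><body>I disagree with fred.</body></entry>")
fred_152 = ET.fromstring("<entry><body>Original post.</body></entry>")
link_reply(joe_200, "joe", 200, fred_152, "fred", 152)
```

the "in reply to" line at the top of an entry would come from its diaryreply tag, and the "these people responded" list at the bottom from its diaryresponse tags.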

"in reply to", posted 4 Nov 2007 at 06:16 UTC by ncm » (Master)

... and there could be an icon next to the diary entry whose link text was precisely the tag needed to reference it.

On article democracy, posted 4 Nov 2007 at 18:41 UTC by cdfrey » (Journeyer)

I confess that it was only after I'd posted about article filtering that I started pondering the ways to abuse it. And it does get rather messy.

I think Luke is right, that the headache of filtering is worse than our current headache of a few misdirected articles on the front page.

The solution to poor speech is more speech, not censorship.

I still like the idea of promoting diary entries to the front page. Perhaps with a link to the original diary entry. It would help to have titles for diary entries too.

Re: On article democracy, posted 5 Nov 2007 at 03:53 UTC by bi » (Master)

Dang, lkcl, I thought this "democracy" thing had something to do with elections... :| 

If we can put faith in the sensibility of individual free software users and writers, there's simply no reason why we can't put faith in the sensibility of free software users and writers as a whole. And I must say, any mentions of the lone "unreasonable man" who is responsible for "all progress" smacks far more of elitism than any proposal involving popular votes... this whole appeal to the idea of Übermenschen is based on the extremely undemocratic notion that only one single man (mentifex, or lkcl, or John Galt) is privy to profound truths, and the rest are unwashed masses who should just shut up and listen.

But I digress... and I second (third?) cdfrey's proposal.

Re: articles, posted 5 Nov 2007 at 18:57 UTC by zanee » (Journeyer)

I tend to agree with what was stated by lkcl. The idea of having a queue where people vote tends to create cliques of users who will surely shoot down an idea or topic they do not agree with. It's also prone to hijacking by one group which will eventually leave the front page in the hands of a few. Sadly, the current behavior is prone to error and it may not always make me or others the most happy but it ensures that those who are certed have the option of posting an article, regardless of how completely insane and/or inane it may be. However, this doesn't mean that egregious errors should not be removed from the front page.

Hijacking by a group vs. hijacking by a person, posted 5 Nov 2007 at 19:34 UTC by bi » (Master)

Before going on yet another generic rant about the virtues of free speech and the evils of elitism (except when it comes to ideas originating from Me™ the lone "unreasonable man", in which case I™ Know All and All The Rest Of You Are Just Ignorant Sheep, and no that's totally not elitist), just consider that cdfrey's idea involves promoting diary entries to articles. Even if an article nomination is "shot down", it'll still be available as a diary entry.

Let each stand on its own reason, posted 5 Nov 2007 at 20:22 UTC by nymia » (Master)

I'll cast my vote and choose a lenient system, anyone can stand in the center and be (mis)understood by anyone.

It may look like an open mess, but then it is better to have a site managed by individuals than a sub-culture of allies.

Trackbacks?, posted 6 Nov 2007 at 01:26 UTC by ncm » (Master)

I don't know how syndication works. Given a "diaryreply" tag, might it be possible to have the replying diary entry linked as a response (a "trackback", perhaps?) on the blog entry from which the tag's referent was syndicated?

thanks, posted 7 Nov 2007 at 03:19 UTC by brondsem » (Journeyer)

Thanks for all your work on Advogato, and your roadmap looks great. Glad to see it moving forward :)

Re: more various stuff, posted 8 Nov 2007 at 00:44 UTC by robogato » (Master)

Lots of good ideas here. I'm beginning to like the story submission from diary post idea the more I think about it.

I also like the idea of a diary comment being someone else's diary post but I'm not sure if there's an existing mechanism for doing something like that. It needs to work not just with local diaries but syndicated diaries as well. I may run it by the SWIG folks and see if they know of any fancy XML/RDF linkages that do that.

lkcl: yep, you're exactly right on the locking issues. It's not a hard problem to fix, just time consuming like all the other ToDos.

Baby steps, posted 8 Nov 2007 at 04:13 UTC by ncm » (Master)

I don't think you need to implement replies going to syndicated diaries immediately.

(BTW, can't we just call these things "logs", instead of "blogs"? We are all aware that they're on the web, and don't need to be reminded... do we?)

Re: Baby steps, posted 8 Nov 2007 at 04:49 UTC by bi » (Master)

"Logs" without qualification just makes me think of server diagnostic messages though.

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!
