State of the Gato Address for 2007
Posted 1 Nov 2007 at 05:39 UTC (updated 1 Nov 2007 at 05:42 UTC) by robogato
One year ago, in October of 2006, Advogato was transferred into new
hands for its care and maintenance. As we come to the end of October
2007, I thought it would be appropriate to take a look at where we are
one year later. Many bugs have been fixed. A few new ones have cropped
up. Many of the requested features have been added but the ToDo
list is still dauntingly long. Account creation by spammers is down.
Account creation by real users is up. Overall, I think we've made a good
start at making Advogato relevant again but there's still much to do.
I'll try to lay out a general roadmap of the work to be done. And,
of course, this is an ideal time to chime in with more bug reports,
feature requests, and general comments you may have about what sucks and
what rocks on
Advogato.
What Got Done This Year
We successfully brought the blog and profile spamming under
control
with the addition of the trust-based spam flagging system. Literally
thousands of fake accounts were eliminated during the initial weeks
after the changeover. Spammers still set up occasional accounts but
they don't last very long. In October of 2006, Advogato had a total of
13,868 accounts but over two thousand were spammers. As of today we have
13,310 accounts, almost all of them real accounts. We briefly dropped to
11,000 users and have had steady growth since then.
Advogato was moved to the same codebase as robots.net, so the two largest mod_virgule sites
are now running on the same code. This helped tremendously in removing
hardcoded HTML and references to Advogato, making many features
configurable, and moving mod_virgule closer to being easy to use for new
sites. A lot of minor bugs were fixed, making mod_virgule much more
stable.
FOAF support was added, making Advogato a part of the
semantic web.
This feature allows Advogato to export both the social network graph and
the user trust ratings in a standard form that can be used easily by
other websites and web applications.
A general purpose blog aggregator was added to bring back the
blogs
of Advogato users who have moved their primary blog to another site. The
aggregator supports most variants of RSS, ATOM, and RDF Site Summary
formats.
This has largely been successful and the recentlog now includes
syndicated blogs from 140 Advogato users who had left the site.
Advogato's own RSS feeds have been updated from RSS 0.91 to
RSS 2.0,
removing any reliance on the now defunct Netscape RSS 0.91 DTD document.
Internationalization support is a little better now. UTF-8
support is
much more robust, allowing posting in nearly any language (English is
still the preferred language of most of our readers but it's nice to be
able to post in others as needed). The timestamps used throughout the site now
reflect UTC instead of US Pacific time.
Mod_virgule now has a trust metric repair procedure that was
able to
recover close to one thousand certifications that were lost when user
profiles were damaged in a couple of ancient disk crashes. We also
replaced one of Advogato's original trust metric seeds who had become
inactive, federico, with mako, a Master-certified
user who is also on
the Free Software Foundation's board of directors. These changes have
generated some minor improvements in the trust rankings.
Advogato now has a FAQ
with a growing number of questions and answers. The site also got a
minor facelift to bring the HTML up to date. There have also been lots
of minor bug fixes and feature additions such as the new social
bookmarking widget.
What Didn't Get Done This Year
Despite all the improvements there are a lot of features that
didn't get
implemented yet. Don't worry, these are still on the ToDo list and they
are getting closer. In most cases, the delayed features depend on some
major code refactoring that's not completed yet. Patches are always
welcome of course!
Probably the most desired feature that we don't have yet is blog
commenting. OpenID support is also frequently requested. Several users
including TimBL have
requested DOAP support for the projects. How can I say no to a feature
request from Tim? :) Recentlog improvements including the ability to
page back through previous dates and get an RSS or ATOM feed of the
recentlog are two more frequent requests. Advogato still badly needs
some form of control over what gets posted in the main story queue on
our front page. And there are plenty of less common features on the ToDo
list.
Some of the new features that were implemented have resulted
in other
work that needs to be done. For example, the blog aggregator has
stretched mod_virgule's HTML sanitizing code to the limits. The range of
nasty, invalid, and not well-formed markup being fed into Advogato is
truly amazing. Since use of the aggregator is dependent on being a
trusted user, at least we can feel somewhat comfortable knowing that
it's most likely not malicious markup, but it still needs to be fixed
ASAP.
Another problem we're beginning to see is a slowdown at the
top of
the hour when the trust metrics and other site maintenance procedures
run. To an extent it's the old "victim of our own success" problem. We
have 2,000 more users to calculate trust metrics for, 140 new blogs
being handled in the recentlog and those numbers are continuing to grow.
The load is pushing the limits of the current locking system on the
file-based XML datastore.
Where We're Going
A lot of what I did last year was not really planned out very
far in
advance. It was largely a result of trying to respond to the biggest
problems first. Now that things are beginning to feel a little more
stable, I think I can afford the luxury of laying out some sort of
roadmap. Because I'm doing most of this work myself in my spare time, I
don't think I can reliably attach dates to the roadmap. But at least
you'll have an idea of where things are headed and what I'll be working
on next.
The first item on my list is to scrap mod_virgule's internal HTML
parser/sanitizer and use the more robust one included in libxml2. This
will take care of many of the HTML issues uncovered by blog aggregation
including well-formedness, purging potential javascript injection and
XSS exploits, stripping syndicated advertising sneaking in with blog
posts, and other issues.
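As a rough illustration of the kind of whitelist sanitizing described above, here is a minimal Python sketch. It is not mod_virgule's code and it does not use libxml2; it swaps in Python's standard html.parser module to show the idea. The allowed-tag list, the href scheme check, and the function name sanitize are all assumptions made for the example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; the real set of allowed tags is a site policy decision.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "blockquote", "pre", "code"}
ALLOWED_ATTRS = {"a": {"href"}}
DROP_CONTENT = {"script", "style"}  # drop these elements together with their content


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of whitelisted tags that are still open
        self.skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()) or value is None:
                continue  # drops onclick and every other unlisted attribute
            # Reject javascript: and any other non-web URL scheme.
            if name == "href" and not value.lower().startswith(("http://", "https://")):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return  # stray close tag with no matching open tag
        # Close anything left open inside this tag so the output stays well-formed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    def close(self):
        super().close()  # flushes any buffered text first
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def sanitize(markup):
    parser = Sanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Feeding it something like `<p>Hi <b>there<script>alert(1)</script></p>` drops the script, closes the dangling `<b>`, and leaves only whitelisted markup. A production sanitizer would need a much more careful policy than this.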
The biggest roadblock that's holding up a lot of the features
on the
ToDo list is the hardcoded database schema, if schema is the right word
for something that just exists as data structures and hardcoded HTML
forms spread throughout a dozen different files. Trying to do something
as simple as adding or deleting a field in the user profile can be a
fairly time consuming and painful process. Solve that problem and it
becomes much easier to add comments to blog entries, or add GPG fields
and geolocation fields to the user profiles.
I'm refactoring a lot of code to strip out all the hardcoded
schemas
and replace them with a single piece of code that loads XML schema
descriptions for profiles, projects, stories, and whatever else we need.
This should greatly reduce the size and complexity of the mod_virgule
code while making Advogato much easier to customize. I expect this to be
live before the end of 2007. The first use of the new and improved
system will be a complete rebuild of the Advogato project database that
will include full DOAP support. DOAP does for projects what FOAF does
for user profiles, allowing us to expose the information to RDF and XML
aware web apps in a standard way.
Next up, I plan to overhaul the user profiles, moving them
from the
current hardcoded schema to the new XML schema module. This will allow
us to more easily add new fields for GPG keys, GPG signatures, alternate
RDF IDs, Geolocation info, IM IDs, Flickr photo streams, and all sorts
of other cool stuff.
After that, I'm looking at adding OpenID support. For new users
signing up, I'd
like them to be able to give their OpenID and a FOAF URI, allowing us to
import all the rest of their account info without the need to manually
enter it. We'll also allow OpenID logins, though an Advogato account
will still be needed to be considered at higher than Observer level.
Depending on how much work it is and how badly we want it, I may also
try to set up Advogato as an OpenID provider.
Finally, I'd like to do something about the continued sad state of
stories on the front page. We're getting some good stories posted but we
continue to see substandard stories too and even stories posted by
accident. My biggest concern here has been in how to fix the
problem without taking away the freedom that every Advogato user now has
to post a story. My current theory about how to do this is that we need
a story preview queue that is visible only to certified users. Any
certified user can post a story and it will go into the preview queue.
All other certified users can see it and give it a simple +/- rating
weighted proportionally to their certification level. Any
story that exceeds a configurable threshold will be deemed suitable for
publication and promoted to the front page, where it will be visible to
the world and syndicated via the RSS feed. Stories that don't achieve
the threshold rating within some configurable amount of time will expire
and be deleted.
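To make the preview queue idea a little more concrete, here is a minimal Python sketch of the rating logic. Nothing here exists in mod_virgule. The weights per certification level, the default threshold, the three-day expiry, and the names Story and triage are all assumptions chosen for illustration; the proposal above only says that ratings are weighted by certification level and that the threshold and time limit are configurable.

```python
from dataclasses import dataclass, field

# Hypothetical weights; the proposal only says votes are weighted
# proportionally to certification level.
LEVEL_WEIGHT = {"Apprentice": 1, "Journeyer": 2, "Master": 3}


@dataclass
class Story:
    title: str
    posted_at: float                           # seconds since the epoch
    votes: dict = field(default_factory=dict)  # user name -> signed weight

    def vote(self, user, level, up):
        """Record a +/- rating; a second vote by the same user replaces the first."""
        weight = LEVEL_WEIGHT.get(level, 0)    # Observers carry no weight
        if weight:
            self.votes[user] = weight if up else -weight

    def score(self):
        return sum(self.votes.values())


def triage(story, now, threshold=6, max_age=3 * 86400):
    """Return 'publish', 'expire', or 'pending' for a queued story."""
    if story.score() > threshold:              # must exceed the threshold
        return "publish"
    if now - story.posted_at > max_age:
        return "expire"
    return "pending"
```

With these assumed weights, a +1 from a Master and a Journeyer plus a -1 from an Apprentice gives a score of 4, which would publish the story at a threshold of 3 and leave it pending at a threshold of 4.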
To summarize, here's what's in store for 2008:
- Replace internal HTML parser/sanitizer with libxml2 HTML parser
- Replace hardcoded schemas with general XML DB schema module
- Rewrite project module (including DOAP support)
- Overhaul user profiles (GEO, GPG, RDF IDs, etc)
- OpenID Support (client and server)
- Overhaul front page news module (add a story approval queue)
There will likely be other minor features and bug fixes along the
way, including some locking optimizations on the XML datastore and a few
more tweaks to the certification system. But the above list represents
my main goals for the next year. Working on my own in my spare time,
this may be an optimistic roadmap. On the other hand, if we collect
a few other C programmers along the way and build up a little momentum,
maybe we'll get it done sooner and move on to the other items on the
ToDo list.
Thanks so much for your work on Advogato. The site's been getting more and more useful to me over the past year: the diversity of people posting on the recentlog is now truly impressive; I learn something new of interest just about every day.
These proposed improvements all sound great. I'm especially looking forward to OpenID support (client-side) and a story approval queue. One thing you didn't mention is an RSS feed for the recentlog. Any plans for this? It would be very convenient for those of us that read blogs through an aggregator.
Also, just out of curiosity, why not switch Advogato to using a real database for storing its information? This would seem more natural to me than a file-based XML datastore.
Many thanks, posted 1 Nov 2007 at 19:35 UTC by jemarch »
(Master)
and keep up the hard work! :)
Thanks, posted 1 Nov 2007 at 19:52 UTC by ncm »
(Master)
Thanks, Steven. I really like what has been done with a.o so far. I
don't know if it will ever snag that elusive "relevance", but it's
useful and improving, so who cares?
I agree that the articles represent the greatest current embarrassment,
and your plan to fix them seems sound. (Probably deleting Mario's
accidental article would be doing him a favor.)
Support for replies to blog postings seems much less important, and I'd
be just as happy not to see it at all. What might be more useful is a
formal way to reference other blog entries in our own, such that a link
to the later blog entry shows up in the earlier one. I envision just a
sequence of user-names, at the end, that link to the later postings.
Spammers, posted 1 Nov 2007 at 20:02 UTC by ncm »
(Master)
Oh, and about spammers... they seem to be surviving the gauntlet, at a
rate of about two per day, still, e.g. "shadi" at the moment. It
appears a threshold count of 10 would cut that way down. I don't know
what the spammers have in mind; maybe they hope to lie low until they
run off the
"new members" list, and then fill their profiles with links.
I like your idea of having a queue for articles that are only visible to
certified users, and then having a voting system to determine which get
posted to the front page.
I don't like the idea of deleting articles that don't make the cut.
Pondering this some more, I think the real solution is to remove article
posting entirely, and do everything through the diary queue. Every post
becomes a diary entry, and any diary entry that has aspirations of being
an article, the author can flag as such. This would add the post to the
article queue, and the voting process continues.
This has the advantage that no user is ever silenced, and that everyone
can read everything that is ever posted. It also has the advantage that
diary entries which the author didn't think were good enough for an
article, but in fact are, could be voted onto the main page anyway.
This would be a good way to promote our peers who put a lot of effort
into their diary entries.
On blog comments, posted 1 Nov 2007 at 20:36 UTC by cdfrey »
(Journeyer)
Continuing with the theme that "everything is a diary entry", I think
there is value in the idea that every comment to a diary entry could be
a diary entry itself.
The system should store back links to original posts that are being
replied to, as well as provide conventional blog views that have all the
comments on one page. These blog views could start anywhere on the
conversation chain, with the option of jumping to the top.
Agree, posted 1 Nov 2007 at 22:26 UTC by ncm »
(Master)
I agree with Chris Frey's suggestions.
wlach said:
RSS feed for the recentlog. Any plans for this?
Yes, this is definitely on the ToDo list.
wlach also said:
why not switch Advogato to using a real database for storing its
information?
Using a real database is the best long term solution and I think that's
exactly what we'll do. I'm putting it off for now because I'd like to
wait until the Apache APR DBD layer is included in production Linux
distros. Using the Apache DBD stuff should be a nicer alternative to
either writing the DB interface code myself or requiring yet another
library at build
time. Meanwhile, I think with a few locking optimizations, the current
XML file system should hold up under our current rate of growth for at
least another year, so there's no hurry.
ncm said:
What might be more useful is a formal way to reference other blog
entries in our own, such that a link to the later blog entry shows up in
the earlier one.
cdfrey said:
I think there is value in the idea that every comment to a diary entry
could be a diary entry itself.
It's an interesting possibility. One concern I have is making sure that
whatever we do easily interoperates with other blogs and blogging sites.
For example, if I read your blog post and click a "comment" button,
whatever happens should work in a useful way whether your blog is a
local Advogato blog or a syndicated blog from somewhere else. The idea
that comments are just regular blog postings referenced in a special way
might be just the way to go there. Further thought and maybe some
experiments may be in order.
ncm said:
Oh, and about spammers... they seem to be surviving the gauntlet, at a
rate of about two per day
Granted, those are probably potential spammers lying low. Perhaps it
would be more accurate if I said no spammers that actively spam their
profile or blog last very long. My main concern is avoiding the spam
itself. I'm usually hesitant to flag an account as spam unless I see
something that's unquestionably spam like links to viagra or seo sites.
cdfrey said:
I think the real solution is to remove article posting entirely, and do
everything through the diary queue. Every post becomes a diary entry,
and any diary entry that has aspirations of being an article, the author
can flag as such...
I'm less thrilled with this idea because I tend to think of blog posts
and articles as fundamentally different from each other. But without a
strong supply of articles to back me up, I suppose I won't argue too hard
either way for now. :-)
Articles, posted 2 Nov 2007 at 05:54 UTC by ncm »
(Master)
One reason I like Chris's suggestion about articles is that some diary
postings really are articles, and I'd like to be able to promote them.
Another is that the ability to vote on promoting entries to articles
will engage people. People will write hoping to be voted to article
status, and therefore write better. People reading will have something
to do besides scroll down. Those writings have a chance to achieve a
little more permanence than a mere blog posting.
About Apache DBD... my experiences with Apache projects have been very
bad. Unless you know a lot about this one, I would recommend staying
far away. Just code directly to PostgreSQL; it's fast, works well, and
will be around forever.
Instead of going directly to a database, though, you might consider
connecting to a distributed source-control system, and let it manage
files. Then, you can let anybody edit any entry (of their own)
anywhere, but also let anybody see the entry's entire history, if they
care. You might let people pull their whole change history into a local
repository. Oh, and a pony.
ncm, would it make sense to have both "flag diary entry as article" and "submit article", both ending up on a 'prospective articles' queue, where they get to live for N days, for voting? Obviously, a diary entry that is removed from the prospective-article queue is still in the author's diary, but something submitted straight off would then be a candidate for garbage collection.
flags, posted 3 Nov 2007 at 00:18 UTC by ncm »
(Master)
If people could flag their own entries, that would be the same as "submit".
steven,
one of the first things that i did with virgule was to remove the "global" lock on trust metric and i think what i did was have a per-directory or perhaps even a per-file lock. it wasn't difficult.
the simple alternative is to have one process do the write, and then only at the end perform a lock, rename, unlock.
the issue is that aaany access is locked out by a global lock. quite simple to fix, really.
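The "one process writes, then renames at the end" approach can be sketched roughly as follows. This is a Python illustration, not mod_virgule's C code, and the function name atomic_write is made up for the example. It relies on the fact that renaming a file within one directory is atomic on POSIX systems, so readers see either the complete old file or the complete new one. The brief lock around the rename mentioned above is left out here; it would only be needed to serialize several writers updating the same file.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data (bytes) without exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # make sure the bytes are on disk first
        os.replace(tmp_path, path)   # the only step readers can observe
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

All of the slow work happens on the temporary file, so other processes reading the datastore are never blocked while a profile is being rewritten.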
articles, posted 3 Nov 2007 at 20:47 UTC by lkcl »
(Master)
as the author of nearly 5% of advogato's content and about 2% of advogato's _useful_ content, i can see benefits for a "store and solicit input" option but i do not believe in, in fact very strongly disagree with, the concept of "democracy".
people with ideas are often considered to be stupid for coming up with the ideas. mostly because the people doing the criticising cannot conceive ever of how the idea can be brought to fruition.
a lovely quote springs to mind:
"the reasonable man adapts himself to the world. the unreasonable man adapts the world to himself. therefore, all progress depends on the unreasonable man."
additionally, article publication at short notice would be out the window. forcing authors to seek out other advogato users and solicit - perhaps even bribe - their "votes" seems counterproductive.
it's a risk to take.
_plus_, what is to stop an author from creating a chain of valid users, Certified by them, and logging in as those users in order to add "votes"? even if you change Certification such that it requires 3 other "Masters" for example to Certify you before you yourself can receive a "Master" Certification, that means instead that you are encouraging users to set up "cartels" to perform "voting".
no - i really don't like the idea of "+/-" voting at all. Free Software users should be sensible enough to consider whether their content, which is not particularly high traffic (slashdot: one article every 20 mins? advogato: one article every... 3 days? 5 days? i think i saw 3 weeks go by, once, without anyone posting)
look at it - there's been only 953 articles in 6 years, guys - that's
an average of 1 every 3 days.
it's _really_ not worth the effort and it really smacks even more of "elitism" than when advogato was itself first established.
if you don't like people's content, well.... tough! so what! that's your problem. go read slashdot, theregister, kuro5hin or userfriendly instead, enjoy your life, and don't waste your breath or your time telling people how you didn't like what they had to say just because they said it - it's not helpful.
be constructive in your criticism (god knows i don't do that enough myself, i know).
so there are, i believe, more constructive ways to "filter" content, for example with the wonderful use of tagging. a neat way to combine in trust metrics would be for example to use the diary-rating system on each per-article "tag" or even each article itself.
i dunno - i'm just throwing out ideas here.
and - to reiterate - yes i'm fully aware that ... i forget how many it is, last time i ran that python program which counts the number of articles per author - i think it was well over 60 articles i'd written, and mostly forgotten about.
enough to make a veeery boring book :) ha ha
"in reply to", posted 3 Nov 2007 at 20:53 UTC by lkcl »
(Master)
yes! a "post a reply on your own diary" entry would be very helpful.
it's an incredibly simple idea - would take probably about... 150 lines of c code, i imagine, to add that.
then, on each diary entry, show at the top "this entry was in reply to 'x'" and at the bottom "these people responded with 'y'" or... whatever.
it just needs an extra xml tag in the diary entries, just like the certifications:
in joe's 200th entry:
<diaryreply user="fred" entry="152" />
in fred's 152nd entry:
<diaryresponse user="joe" entry="200" />
simple, really.
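For what it's worth, the paired tags are easy to generate. Below is a minimal Python sketch using the standard xml.etree.ElementTree module. The diaryreply and diaryresponse tags and their user and entry attributes come from the example above; the wrapping entry element and the function name link_reply are assumptions for illustration, since how diary entries are actually stored isn't shown in this thread.

```python
import xml.etree.ElementTree as ET


def link_reply(reply_entry, reply_user, reply_num, orig_entry, orig_user, orig_num):
    """Cross-link a reply and the entry it answers by adding one tag to each."""
    # In the replying entry: point back at the entry being answered.
    ET.SubElement(reply_entry, "diaryreply", user=orig_user, entry=str(orig_num))
    # In the original entry: record who responded and where.
    ET.SubElement(orig_entry, "diaryresponse", user=reply_user, entry=str(reply_num))


# Usage: joe's 200th entry replies to fred's 152nd entry.
joe_200 = ET.Element("entry")    # hypothetical container element
fred_152 = ET.Element("entry")
link_reply(joe_200, "joe", 200, fred_152, "fred", 152)
print(ET.tostring(joe_200, encoding="unicode"))
print(ET.tostring(fred_152, encoding="unicode"))
```

Rendering "this entry was in reply to" and "these people responded with" then comes down to reading those two tags back out of each entry.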
"in reply to", posted 4 Nov 2007 at 06:16 UTC by ncm »
(Master)
... and there could be an icon next to the diary entry whose link text was precisely the tag needed to reference it.
I confess that it was only after I'd posted about article filtering that
I started pondering the ways to abuse it. And it does get rather messy.
I think Luke is right, that the headache of filtering is worse than our
current headache of a few misdirected articles on the front page.
The solution to poor speech is more speech, not censorship.
I still like the idea of promoting diary entries to the front page.
Perhaps with a link to the original diary entry. It would help to have
titles for diary entries too.
Dang, lkcl, I thought this "democracy" thing had something to do with
elections... :|
If we can
put faith in the sensibility of individual free software users and writers,
there's simply no reason why we can't put faith in the sensibility of free
software users and writers as a whole.
And I must say, any mentions of the lone "unreasonable
man" who is responsible for "all progress" smacks far more of elitism than
any proposal involving popular votes... this whole appeal to the idea of
Übermenschen is based on the extremely undemocratic notion
that only one single man (mentifex, or lkcl, or John Galt) is privy to
profound truths, and the rest are unwashed masses who should just shut up
and listen.
But I digress... and I second (third?) cdfrey's proposal.
Re: articles, posted 5 Nov 2007 at 18:57 UTC by zanee »
(Journeyer)
I tend to agree with what was stated by lkcl. The idea of having a queue
where people vote tends to create cliques of users who will surely shoot
down an idea or topic they do not agree with. It's also prone to
hijacking by one group which will eventually leave the front page in the
hands of a few. Sadly, the current behavior is prone to error and it may
not always make me or others the most happy but it ensures that those
who are certed have the option of posting an article. Regardless of how
completely insane and/or inane it may be. However, this doesn't mean
that egregious error should not be removed from the front page.
Before going on yet another generic rant on the virtues of free speech and
the evils of elitism (except when it comes to ideas originating from
Me™ the lone "unreasonable man", in which case I™ Know All and
All The Rest Of You Are Just
Ignorant Sheep, and no that's totally not elitist), just consider that
cdfrey's idea involves promoting diary entries to articles. Even if
an article nomination is "shot down", it'll still be available as a diary
entry.
I'll cast my vote and choose a lenient system, anyone can stand in the center and be (mis)understood by anyone.
It may look like an open mess, but then it is better to have a site managed by individuals than a sub-culture of allies.
Trackbacks?, posted 6 Nov 2007 at 01:26 UTC by ncm »
(Master)
I don't know how syndication works. Given a "diaryreply" tag, might it
be possible to have the replying diary entry linked as a response (a
"trackback", perhaps?) on the blog entry from which the tag's referent
was syndicated?
thanks, posted 7 Nov 2007 at 03:19 UTC by brondsem »
(Journeyer)
Thanks for all your work on Advogato, and your roadmap looks great. Glad to see it moving forward :)
Lots of good ideas here. I'm beginning to like the story submission from
diary post idea the more I think about it.
I also like the idea of a diary comment being someone else's diary post
but I'm not sure if there's an existing mechanism for doing something
like that. It needs to work not just with local diaries but syndicated
diaries as well. I may run it by the SWIG folks and see if they know of
any fancy XML/RDF linkages that do that.
lkcl: yep, you're exactly right on the locking issues. It's not a hard
problem to fix, just time consuming like all the other ToDos.
Baby steps, posted 8 Nov 2007 at 04:13 UTC by ncm »
(Master)
I don't think you need to implement replies going to syndicated diaries
immediately.
(BTW, can't we just call these things "logs", instead of "blogs"? We
are all aware that they're on the web, and don't need to be reminded...
do we?)
Re: Baby steps, posted 8 Nov 2007 at 04:49 UTC by bi »
(Master)
"Logs" without qualification just makes me think of server diagnostic messages though.