Older blog entries for robogato (starting at number 6)

17 Nov 2006 (updated 17 Nov 2006 at 06:44 UTC) »
Advogato Status Report

Okay, I think we have a fix for badvogato's Chinese character problem. I've posted four test cases below. Remember that even with mod_virgule working 100%, some browsers may not have a font that can render every possible UTF-8 character correctly. If the font is missing a character, the browser will normally display a little box with the character code in it.

This one was a brain teaser. Turns out the problem has been there (in my codebase) for well over a year and was never noticed because most bloggers at robots.net post in English. I added accept-charset="UTF-8" to all the forms generated by mod_virgule some time back as part of an attempt to make it more UTF-8 friendly. As it turns out, one of the older mod_virgule functions, virgule_nice_htext(), is not UTF-8 safe. It assumes the input is ASCII or, at least, something where one byte = one character. UTF-8 characters that were multiple bytes were getting mangled, leading to undesirable results.
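To make the "one byte = one character" assumption concrete, here is a minimal sketch in Python (mod_virgule itself is C, and this is not the code of virgule_nice_htext()). The function names are made up for illustration. It shows how a byte-oriented cut can split a multi-byte UTF-8 character, and how backing up over continuation bytes avoids that.

```python
def naive_truncate(data: bytes, limit: int) -> bytes:
    """Byte-oriented cut: fine for ASCII, may split a multi-byte UTF-8 character."""
    return data[:limit]


def utf8_safe_truncate(data: bytes, limit: int) -> bytes:
    """Cut at or before `limit` bytes without splitting a UTF-8 character.

    Assumes `data` is valid UTF-8.
    """
    if limit >= len(data):
        return data
    end = max(limit, 0)
    # data[end] is the first byte that would be dropped. UTF-8 continuation
    # bytes look like 0b10xxxxxx; if we are sitting on one, we are in the
    # middle of a character, so back up to that character's lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Five Han characters at three bytes each.
sample = "情不知所起".encode("utf-8")
```

With this sample, cutting at 4 bytes the naive way leaves one whole character plus a stray lead byte, which no longer decodes. The safe version backs up to the 3-byte boundary instead.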

Initially I thought a fix would be as simple as passing the form data through the libxml2 function UTF8ToHtml() which should convert UTF-8 to ASCII + encoded entities. Many hours later, I figured out this just doesn't work. Due to what I believe is a bug in UTF8ToHtml(), it fails on valid UTF-8 strings that contain characters for which there is not a named HTML entity value. That means it fails on almost all UTF-8 strings that contain anything other than common European variants of ASCII characters. A Latin character with an acute or a circumflex is converted correctly but, for example, a Chinese ideograph would cause the conversion process to terminate with an error.

In the end, I patched UTF8ToHtml() to use numerical entities in this case and now all seems to be well. I'll run this by DV and see if incorporating the patch upstream is warranted.
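The idea of falling back to numerical entities can be sketched in a few lines of Python. This is not the libxml2 patch, only an illustration of the behavior described: ASCII passes through, characters with a named HTML entity use the name, and everything else becomes a decimal character reference. The function name utf8_to_html_sketch is hypothetical.

```python
from html.entities import codepoint2name


def utf8_to_html_sketch(data: bytes) -> bytes:
    """Convert valid UTF-8 to ASCII plus HTML entities (illustrative only)."""
    text = data.decode("utf-8")
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII is passed through unchanged.
            out.append(ch)
        elif cp in codepoint2name:
            # A named entity exists, e.g. U+00E9 -> &eacute;
            out.append("&" + codepoint2name[cp] + ";")
        else:
            # No named entity: fall back to a numeric character reference.
            out.append("&#" + str(cp) + ";")
    return "".join(out).encode("ascii")
```

If named entities are not needed at all, Python's built-in `text.encode("ascii", "xmlcharrefreplace")` produces numeric references for every non-ASCII character.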

UTF-8 Tests

1. Problematic Han ideographs as mentioned in the Chinese XML FAQ:

兡也包因沘氓侷柵苗孫孫財 崧淫設弼琶跑愍窟榜蒸奭稽 霄瓢館縲擻鼕孃魔釁佉沎岠 狋垚柛胅娭涘罞偟惈牻荺傒 焱菏酡廅滘絺赩塴榗箂踃嬁 澕蓴醊獧螗餟燱螬駸礑鎞瀧 鄿瀯騬醹躕鱕

2. Cut-and-paste sample from hjclub.com website:

今天在海归网上浏览,发现一个贴子:《[保陈良宇的出笼新解释]胡 锦涛被套牢 陈良宇是赢家不是输家?》 (海纳百川 www.hjclub.com)

粗读了一下,觉得这篇文章大有深意,跟党中央不太一致是肯定的。 我看了一下别的网站,文学城、万维都登了。但海归网是商业网站, 不能成为政治斗争的牺牲品。海归网的版主因为国庆长假,未必会上 网看着。所以我就顺手删去了这个贴子。我删贴其实没有什么用处, 因为这个贴子在海外已经广泛流传。 (海纳百川 www.hjclub.com)

3. Sample from badvogato's blog

情不知所起,一往而深.

生者可以死,死可以生,

生而不可与死,死而不可复生者,

皆非情之至也.

梦中之情,何必非真,天下岂少梦中人耶?

4. Cut-and-paste from Wikipedia language menu:

# العربية # Bahasa Indonesia # Български # Català # Česky # Dansk # Deutsch # Eesti # Español # Esperanto # Français # עברית # Hrvatski # Italiano # Nederlands # 日本語 # 한국어 # Lietuvių # Magyar # Norsk (bokmål) # Polski # Português # Română # Русский # Slovenščina # Slovenčina # Српски / Srpski # Suomi # Svenska # తెలుగు # Türkçe # Українська # 中文

Advogato Status Report

A new rev of mod_virgule code went live today. See the changelog for the details. Other than a few more bug fixes, the big change is the addition of a blog aggregator. This will allow Advogato users who keep their blog somewhere else to syndicate it here so it shows up in the recentlog. There are already seven users whose posts have returned to the recentlog. Hopefully more past Advogato users will follow.

Initially the aggregator supports Atom v1.0, RSS v0.91, v0.92, v2.0, and RDF Site Summary (sometimes known as RSS v1.0, a fork of "real" RSS). My recommendation is to use Atom v1.0 if you've got it, with RSS v2.0 as a safe alternative. I expect there are still some bugs to work out, so bear with me for a week or so as we sort things out. There are a few known caveats:

  • Due to limitations in the existing recentlog code, bursts of multiple syndicated entries from the same user that arrive within a narrow time window will only result in one recentlog entry. This is only likely to be noticed the first time the feed is grabbed when maybe 5 or 10 entries get sucked in at once.
  • The blog post title, link, and original posting date are stored locally but the current diary code doesn't display them yet. The additional info should start showing up after the next code release. Soon...
  • Some variants of older RSS (v0.xx) feeds may produce unexpected results. There seem to be an endless number of variations of the RSS formats and I may not have accounted for them all yet.
  • RDF Site Summary format is more complex than Atom or RSS. It's a "modular standard" with dozens of different modules. Trying to parse the output of every conceivable combination of modules is non-trivial. Fortunately, this format isn't very common. Right now, I'm parsing a couple of combinations that use RDF Site Summary v1.0, plus the date tag from the Dublin Core module and the content encoding fields of the most recent draft version of the Content module. That's working for the one RDF Site Summary feed I know of on Advogato. If you can't use Atom or RSS and your RDF Site Summary feed doesn't work, send me a link and I'll try to support it.
  • It will be safer either to use Advogato for blogging or to syndicate your blog here from another site. Mixing the two options, while possible, may produce unexpected results with regard to the ordering of the posts if you post multiple times per day.
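
The feed families listed above can usually be told apart by the root element of the XML document. The following Python sketch shows one way to do that. How mod_virgule's aggregator actually distinguishes formats is not described here, so treat the function name and return values as assumptions.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def detect_feed_format(xml_bytes: bytes) -> str:
    """Guess the feed family from the root element (illustrative only)."""
    root = ET.fromstring(xml_bytes)
    if root.tag == "{" + ATOM_NS + "}feed":
        return "atom-1.0"
    if root.tag == "{" + RDF_NS + "}RDF":
        # RDF Site Summary (RSS 1.0) uses an rdf:RDF root element.
        return "rss-1.0"
    if root.tag == "rss":
        # RSS 0.91, 0.92 and 2.0 share the <rss> root and differ by version.
        return "rss-" + root.get("version", "unknown")
    return "unknown"
```

Detecting the family is the easy part. The RSS v0.xx variants and the RDF Site Summary modules still need per-format handling once the root element is known.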
9 Nov 2006 (updated 9 Nov 2006 at 16:26 UTC) »
Advogato Status Report

A new rev of mod_virgule code went live this morning. This is a bugfix-only release to correct a couple of bugs introduced in the last version.

The missing projects are now visible again thanks to the addition of a missing pair of brackets around an if statement.

A bug in the account deletion code was causing only the first reference to a user to be deleted from the recentlog. The switch to multiple recentlog posts revealed the problem. This is now fixed also.

There was some doubt expressed about whether a recently deleted account was actually spam. I've restored two accounts deleted last night, phpgurru and xerox (the most recent blog post of each account was lost as both were deleted after the Wednesday backup and just prior to the Thursday backup). I'm not sure what these accounts are. Maybe just non-native English speakers? Xerox appears to have been a member since 2002 but most of Xerox's blog posts are either in Chinese or some sort of autogenerated content. Maybe another Chinese-speaking Advogato user could check out Xerox's blog and give us a clue as to what it's all about?

I've increased the spam score needed for account deletion from 10 to 15. Now that most of the easy to ID spammers are gone, it probably makes sense to require a larger consensus of users before doing something as drastic as deleting an account.

6 Nov 2006 (updated 7 Nov 2006 at 06:15 UTC) »

Advogato Status Report

A new rev of mod_virgule code went live on Advogato today. We have a couple of new features in addition to the usual minor changes and bugfixes. I'll summarize them but see the changelog for details.

The trust metric cache is now loaded into an Apache thread private memory pool so it can persist across hits. It no longer has to be loaded and parsed on every hit. Instead it's loaded only when the cache is updated (usually once per hour). I'm not sure if this will produce any noticeable performance increase but mod_virgule doesn't seem to thrash the hard disk quite so much now.
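
The "load once, reload only when updated" pattern looks roughly like this in Python. It is only a sketch of the general idea: the real code is C and uses Apache memory pools, and how mod_virgule detects that the cache has been updated is not stated, so the modification time and size check here is an assumption. The class name is hypothetical.

```python
import os


class ReloadOnChangeCache:
    """Keep a parsed file in memory and re-parse only when the file changes."""

    def __init__(self, path, parse):
        self.path = path
        self.parse = parse    # callable: raw bytes -> parsed value
        self._stamp = None    # (mtime_ns, size) seen at the last load
        self._value = None
        self.loads = 0        # how many times the file was actually read

    def get(self):
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:
            # First use, or the file changed on disk: load and parse again.
            with open(self.path, "rb") as f:
                self._value = self.parse(f.read())
            self._stamp = stamp
            self.loads += 1
        return self._value
```

Each hit still costs one stat() call, but the expensive read and parse happen only when the file has changed.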

Most of the major issues with gcc 4.x seem to be fixed. There are still loads of warnings due to char vs xmlChar mismatches. Explicit casts are needed to fix these and I'll continue adding them as I get time.

Hits on non-existent projects now return a 404 instead of a 200 so that search engines will stop pounding all sorts of bogus project URLs that have accumulated over the years. We get over 25k hits per month on one non-existent project, so this is eating a small but measurable amount of our bandwidth.

I've got about 75% of the coding done for blog syndication. It's now possible to specify a feed URL in a user profile. The feeds are aggregated hourly but are not appended to the Advogato blogs yet. I got bogged down (blogged down?) in the minor differences between RSS, Atom, and mod_virgule diary formats. They all use different formats for dates, of course. I'm hoping to have a few more hours available this week to finish it up. I'll probably test it for a few days on robots.net and then take it live here once I'm fairly sure it's stable.
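
As an illustration of the date problem, here is a small Python sketch that normalizes the two feed date styles to UTC. RSS uses RFC 822 style dates and Atom uses RFC 3339 timestamps. The function names are hypothetical, and the mod_virgule diary date format is left out because it isn't described here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_rss_date(value: str) -> datetime:
    """RSS pubDate, RFC 822 style, e.g. 'Tue, 07 Nov 2006 06:15:00 GMT'."""
    dt = parsedate_to_datetime(value)
    if dt.tzinfo is None:
        # No usable zone information: assume UTC (an assumption of this sketch).
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def parse_atom_date(value: str) -> datetime:
    """Atom updated/published, RFC 3339, e.g. '2006-11-07T06:15:00Z'."""
    v = value.strip()
    if v[-1] in "zZ":
        # Older Python versions do not accept a trailing 'Z' in fromisoformat().
        v = v[:-1] + "+00:00"
    dt = datetime.fromisoformat(v)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
```

Once both styles are reduced to the same UTC value, entries from different feed types can be compared and ordered directly.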

Advogato.org is now set up over at Technorati. Well, maybe. It seems to take several attempts to get anything set up at Technorati. Their view of our RSS feed is still hosed but it should clear up after the next couple of articles are posted. Technorati syndication might generate a little more human traffic to our stories here. Now we just need to work on posting some interesting, original content like in the good ol' days. Maybe advogato the cat could be induced to bring back the Advogato's Number editorials (hint hint).

rillian asked about switching the recentlog to "as posted" mode from the current "unique" mode where only one post per user is allowed. He made the point that the Planet aggregators do this and it seems to be the preferred method by most readers. With the spam problem under control, I didn't see any reason not to make the switch. So, as of 3 November, any blog posts made should show up in recentlog until they scroll off naturally according to the date.
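
The difference between the two modes can be sketched in Python as follows. The Post type and the recentlog() function are illustrative names, not mod_virgule's data structures.

```python
from typing import NamedTuple


class Post(NamedTuple):
    user: str
    posted: str  # ISO 8601 UTC timestamp; these sort correctly as text
    text: str


def recentlog(posts, mode="as_posted"):
    """Return posts newest first, either all of them or one per user."""
    newest_first = sorted(posts, key=lambda p: p.posted, reverse=True)
    if mode == "as_posted":
        return newest_first
    if mode == "unique":
        seen = set()
        result = []
        for p in newest_first:
            # Keep only the most recent post from each user.
            if p.user not in seen:
                seen.add(p.user)
                result.append(p)
        return result
    raise ValueError("unknown mode: " + mode)
```

In "unique" mode, a user who posts twice in one day shows up once. In "as posted" mode, both entries appear in date order.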

26 Oct 2006 (updated 26 Oct 2006 at 01:09 UTC) »

Advogato Status Report

A new rev of mod_virgule code went live on Advogato today. There were only minor changes and bugfixes. See the changelog for details.

Most of the code changes this week were aimed at getting a clean compile on gcc 4, Apache 2.2, and the newest version of the Apache APR libs. Here and there mod_virgule still relies on some of the deprecated Apache 1.3 compatibility code that's being dropped from the newest Apache libs. There's still more work to do here but it's getting close.

The rate of Advogato spammer account deletion has slowed to a trickle. Most of the easy to spot spammers are gone. I did run across a few today, however: Pramod, nulledphpscritps, Phat, JohnH, bekka, Zorro and a few more if you follow the inbound certs on those.

The removal of all those accounts has had two side effects. One is that thousands of certifications issued by the spammer accounts have also been removed. That seems to have contributed to shorter run times for trust metric computations. The other side effect is that search engine robots are hitting a lot of account pages that no longer exist. Mod_virgule didn't handle this quite the way it should. It displayed an error page saying the account wasn't there but it returned a result code of 200. So the search engines continue indexing the bad URL and continue hitting it every couple of days looking for updates. I've tweaked this so mod_virgule now returns a 404 when displaying the person not found error page. This should cause the search engine robots to eventually stop trying to hit all those dead accounts.
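
The 200 versus 404 distinction can be shown with a tiny WSGI application in Python. This is not how mod_virgule is written (it is an Apache module in C). The account set and function name are sample data for the sketch.

```python
KNOWN_ACCOUNTS = {"robogato", "rillian"}  # sample data for this sketch


def person_page(environ, start_response):
    """Serve a profile page, or an error page with a real 404 status."""
    name = environ.get("PATH_INFO", "").strip("/").split("/")[-1]
    if name in KNOWN_ACCOUNTS:
        status, body = "200 OK", "Profile for " + name + "\n"
    else:
        # The visitor still sees a friendly error page, but the status code
        # tells crawlers the URL is dead so they eventually stop requesting it.
        status, body = "404 Not Found", "Person not found\n"
    payload = body.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

The page body is the same either way. Only the status line changes, and that is the part search engine robots act on.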

18 Oct 2006 (updated 4 Aug 2009 at 04:24 UTC) »

View This Profile Here

10 Oct 2006 (updated 10 Oct 2006 at 01:28 UTC) »

Advogato Status Report

It's been a little over a week since Advogato transitioned to new hardware and a new codebase. Overall the transition went pretty well considering the differences in the code. Minor bugfixes are ongoing. Please excuse the mess during the transition period!

In case you haven't noticed, the much requested password reminder feature is working and has already been used several times. Maybe that means we'll be seeing some long lost friends posting again?

Bandwidth and DoS issues seem to be coming under control. One strange problem now solved was a Microsoft proxy server that was hitting a single Advogato diary 5 times per second. Most likely just a misconfigured RSS program of some kind. Once the source of the problem was identified, an email to their NOC took care of it. The new, paged person index also seems to have reduced bandwidth usage somewhat.

Spam, spam, spam. It's still with us but there's light at the end of the tunnel.

To reduce the attraction of Advogato to spammers, blog entries posted by untrusted users now have nofollow attributes included in all links. Further, links posted by untrusted users in their account profile pages are suppressed altogether. A note has been added to the new accounts page so potential spammers know their links are going to be worthless in search engines.
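
For readers unfamiliar with nofollow, the effect on a link looks like the output of this small Python sketch. It is a regex-based illustration that assumes already-sanitized markup with no ">" inside attribute values. It is not mod_virgule's parser-based code, and the function name is hypothetical.

```python
import re

_A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
_REL_ATTR = re.compile(r"""\s+rel\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def add_nofollow(html: str) -> str:
    """Force rel="nofollow" on every <a> tag in a fragment (illustrative only)."""
    def fix(match):
        # Drop any rel attribute the poster supplied, then add our own.
        attrs = _REL_ATTR.sub("", match.group(1))
        return '<a rel="nofollow"' + attrs + ">"
    return _A_TAG.sub(fix, html)
```

Search engines that honor the attribute give no ranking credit to such links, which removes the incentive for posting them.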

Two active groups of spammers have been identified so far. One was an SEO firm in New Delhi, India. They were using their own IP addresses to connect, so adding a few addresses to iptables has taken care of them (at least for the moment). The second group connects from random IPs in China and Korea. It will take a group effort to discourage them - how, you may ask?

By using Advogato's new spam rating system. It's based on a suggestion by lkcl and also on the system used over at craigslist. If you see a post in the recent log that looks like spam, click to that user's profile page. If you're certified as Apprentice or higher, you'll see two new things at the top of the page: a spam score and a "flag this account as spam" link. Click the link and you'll add to the account's spam score. An Apprentice adds one point to the spam score. A Journeyer adds two points. A Master adds three points. When an account's spam score reaches a preset threshold (currently set at 10 points), the account is automatically deleted. This system only applies to untrusted, Observer accounts, of course. If a user is certified by a trusted user, it's assumed they aren't a spammer. There are several spammers currently listed in the recentlog, so let's give it a try.
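
The scoring rules above can be written down as a short Python sketch. The point values and the threshold come from the description. The Account type, the function name, and the rule that each user's flag counts only once per account are assumptions made for the example.

```python
from dataclasses import dataclass, field

FLAG_POINTS = {"Apprentice": 1, "Journeyer": 2, "Master": 3}
DELETE_THRESHOLD = 10  # threshold at the time of this post


@dataclass
class Account:
    name: str
    level: str = "Observer"
    spam_score: int = 0
    flagged_by: set = field(default_factory=set)
    deleted: bool = False


def flag_as_spam(target, flagger, threshold=DELETE_THRESHOLD):
    """Apply one spam flag; return True if the flag counted."""
    points = FLAG_POINTS.get(flagger.level, 0)
    if target.deleted or target.level != "Observer":
        return False  # only untrusted (Observer) accounts can be flagged
    if points == 0:
        return False  # the flagger must be Apprentice or higher
    if flagger.name in target.flagged_by:
        return False  # assumption: one flag per user per account
    target.flagged_by.add(flagger.name)
    target.spam_score += points
    if target.spam_score >= threshold:
        target.deleted = True  # stands in for automatic account deletion
    return True
```

With these values, three Masters plus one Apprentice (3 + 3 + 3 + 1 = 10) are enough to remove an account, while ten separate Apprentices would be needed on their own.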
