Older blog entries for Bram (starting at number 72)

Election Methods

apenwarr: Getting better election methods actually adopted is indeed the problem. The United States has had unranked ballots for so long that Duverger's Law has kicked in and we now have two completely entrenched political parties which have a vested interest in keeping things as they are.

Perhaps a clue can be gleaned from the defeat of proportional representation. Proportional representation as a method of electing congresspeople has been banned, which seems rather extreme given the usual bent towards states' rights. The reason this happened was actually a good one. Elected officials like having a considerable amount of personal power, and simple proportional representation would lead to them being completely subservient to the parties they represent, easily replaced by party bureaucrats.

One of the nice things about the way elections work in the US is that ballots always list individuals, not parties. In the above case, the interests of those individuals helped improve the election methods used (or at least kept them from getting worse). Perhaps incumbents could be convinced to back ranked ballots.

Do ranked ballots favor incumbents? I'm honestly not sure. There are two forces at work. One is that ranked ballots greatly reduce the influence of political parties, so an incumbent doesn't need to worry so much about party backing going away. The other is that the will of the voters is more accurately reflected, possibly removing incumbents whose support relied on electoral distortions.

I suspect that incumbents could simply shift their positions to reflect the opinions of voters more than political parties, and wind up in a better position than they were before. However, real studies of the likelihood of an incumbent getting reelected when the voting system becomes ranked versus when it remains unranked are required.

Election Methods

robla links to debian election analysis. The full results show that Condorcet yields a well-defined full ordering of all the candidates, making what should happen in this election completely non-controversial.

The fundamental brokenness of unranked ballots gets astoundingly little media attention. Even after it caused an apparently artifactual result in France's presidential election, which was later shown to be clearly artifactual by the runoff results, most media coverage portrayed it as a political struggle.

Schedule

I'm going to be at the emerging technologies conference on the 24th and 25th, and at the emerging man party on the night of the 24th. After the 26th I'm going to be in Amsterdam for a while, then New York City starting on the 15th of May. Anyone who would like to meet up with me, mail me.

Certs against Spam

I previously suggested an anti-censorship multiplier of around 1/5. This is way too high: a multiplier of 1/100 or even 1/1000 would be able to make spam not worth doing, since spammers rely on being able to reach millions of users. Not eliminating peers which were eliminated by fewer than 100 or 1000 anti-certs could also help reduce false positives.

The extreme looseness of these values indicates that stopping spam with certs, once the necessary infrastructure is in place and the right approach is taken, is actually very easy.

Tickets against Spam

Penny Black has a ticketing system. There are a few obvious improvements to this system though, which I'll explain now.

Rather than sending back simple tickets, the server should send back pairs of one time passwords and their identifiers. When sending mail, rather than including the one time password, the mailer includes the identifier and a MAC of the hash of the message based on the one time password. When the recipient gets the mail, they query the token server with the id, hash, and MAC, and the token server responds with whether it's valid or not.

With this approach, there is no need to worry about a spammer intercepting a whole bunch of other people's mail and using their tickets. There are also no POP-style problems with dropped connections losing mail, since the ticket server can remember which hash was used with which ticket and respond idempotently that it's still valid later on.

The proper encoding of tickets is in a header line, and the hash should be of the Subject, From, and To lines, and the message body. This way it's possible for more than one kind of ticket to be put in the headers, if more become widespread in the future.
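
The flow described above can be sketched in Python. All the names here (TicketServer, stamp, the token formats) are mine, purely for illustration; Penny Black's actual interfaces aren't specified in this post:

```python
import hashlib
import hmac
import secrets

class TicketServer:
    """Hypothetical token server: issues (id, one-time password) pairs
    and later verifies (id, message hash, MAC) queries from recipients."""

    def __init__(self):
        self.tickets = {}    # ticket id -> one-time password
        self.redeemed = {}   # ticket id -> message hash it was used with

    def issue(self):
        ticket_id = secrets.token_hex(8)
        otp = secrets.token_bytes(16)
        self.tickets[ticket_id] = otp
        return ticket_id, otp

    def verify(self, ticket_id, msg_hash, mac):
        # Idempotent: re-verifying the same hash stays valid,
        # so a dropped connection can't lose mail.
        if ticket_id in self.redeemed:
            return self.redeemed[ticket_id] == msg_hash
        otp = self.tickets.get(ticket_id)
        if otp is None:
            return False
        expected = hmac.new(otp, msg_hash, hashlib.sha256).digest()
        if hmac.compare_digest(expected, mac):
            self.redeemed[ticket_id] = msg_hash
            del self.tickets[ticket_id]
            return True
        return False

def stamp(ticket_id, otp, message: bytes):
    """Sender side: include the id and a MAC of the message hash,
    never the one-time password itself."""
    msg_hash = hashlib.sha256(message).digest()
    mac = hmac.new(otp, msg_hash, hashlib.sha256).digest()
    return ticket_id, msg_hash, mac
```

Since the password never travels with the mail, intercepting a message gains a spammer nothing, and the server's memory of redeemed hashes gives the idempotent retry behavior.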

I think this technique could go a long way to stopping spam, and make Microsoft quite a bit of money in the process. They're one of the few companies positioned to be able to make such a system ubiquitous. I think $100 for 10,000 tickets is the wrong price point though. I'd be willing to pay $10 for 1000 tickets, if that really made sure my mail would get through any spam filter.

RSS bandwidth saving

garym asks about saving bandwidth used by RSS. This can be done much more simply by adding an optional GET parameter for RSS feeds which specifies the latest entry the downloader already has. This parameter is strictly advisory, but could save a lot of bandwidth.

Another useful parameter would be an advisory first entry number to return. That would make RSS feeds useful as archives.

I'm not going to suggest names for these parameters for fear of adding yet another to the multiplicity of standard names for them which I don't know about yet.

The one subtlety has to do with when an existing entry is changed. To make those get overwritten, the client should send an If-Modified-Since header like it does now, and the server sends older entries past the most recent one the client has if any of them have been modified.
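
Here's a rough server-side sketch of both advisory parameters. The names `have` and `first` are placeholders of my own, not proposed standards, and entry numbers are assumed to increase over time:

```python
def select_entries(entries, have=None, first=None):
    """entries: list of (entry_number, body) pairs, newest first.
    have: advisory - the newest entry number the client already holds.
    first: advisory - an archive starting point; return entries from
    this number downward.
    Both are optional; with neither, the full feed is returned."""
    if first is not None:
        # Archive mode: start from the requested entry number.
        entries = [e for e in entries if e[0] <= first]
    if have is not None:
        # Skip everything the client already has.
        entries = [e for e in entries if e[0] > have]
    return entries
```

Because the filters are advisory, a server that ignores the parameters still produces correct (just larger) output, which keeps the scheme backward compatible.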

Trust Metrics

A trust metric evaluation function I gave before has the problem that it's very hard to compute. It can be approximated well by setting the weight of each node to the probability that there's a path from that node to an attester, then doing a simple random walk.

It turns out the latter problem is essentially the same as what's called the all-terminal network reliability problem, to which there's an interesting neural network approach. I haven't read over this paper carefully enough to say if the technique is practical for a site the size of advogato, but from a quick skimming it appears so.

Neural networks also add buzzword compliance.

Bram's Law

I realized this the other day, and dub it Bram's Law:

The easier a piece of software is to write, the worse it's implemented in practice.

Why? Easy software projects can be done by almost any random person, so they are. It's possible to try to nudge your way into being the standard for an easy thing based on technical merit, but that's rather like trying to become a Hollywood star based on talent and hard work. You're much better off trading it all in for a good dose of luck.

This is why HTTP is a mess while transaction engines are rock solid. Almost any programmer can do a mediocre but workable job of extending HTTP (and boy, have they), but most people can't write a transaction engine which even functions. The result is that very few transaction engines are written, almost all of them by very good programmers, and the few which aren't up to par tend to be really bad and hardly get used. HTTP, on the other hand, has all kinds of random people hacking on it, as a result of which Python has a 'fully HTTP 1.1 compliant' HTTP library which raises assertion failures during normal operation.

Remember this next time you're cursing some ubiquitous but awful third party library and thinking of writing a replacement. With enough coal, even a large diamond is unlikely to be the first thing picked up. Save your efforts for more difficult problems where you can make a difference. The simple problems will continue to be dealt with incompetently. It sucks, but we'll waste a lot less time if we learn to accept this fact.

Certs against Spam

The last technique I gave for back propagation has a serious flaw, but I think the criteria and general approach are good. Here's a much better algorithm:

  1. Compute a weight for each node. This is done by answering the question 'If I walk randomly along certifications, with probability x of stopping at each hop, what is the probability that I will stop at this node?' The stopping probability is parameterizable.

  2. Compute a weight of spammerness for each node. To do this, for each node subdivide its weight among everyone it claims sent spam, and add up the claims of spammerness for each node. Then multiply by a fraction, say around 1/5. This fraction is parameterizable, and reflects a tradeoff between judiciousness in stopping spam versus difficulty of fraudulent censorship.

  3. Use spam values to neutralize weights. If a node has weight x and spam value x + y, then the node's weight becomes 0 and the remaining value y is used to neutralize the nodes which certified that one, possibly completely neutralizing them and requiring further back propagation.

Mail is accepted from any node reachable from you along certification edges over nodes which have weight greater than zero.

I've glossed over some details, but that covers all the important points. I'm very enamored of this approach. It's extremely intuitively compelling, and all the steps in it can be done, or at least very well approximated, in essentially linear time. It looks like we have an algorithm which really can expand to a global scale now.
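
Step 1, the random walk weight computation, can be approximated by simple mass propagation. This is a rough sketch under assumptions of my own (a single trusted seed node, and dead ends treated as forced stops); the function and parameter names are illustrative only:

```python
def walk_weights(certs, seed, stop=0.15, rounds=50):
    """certs: dict mapping each node to the list of nodes it certifies.
    Returns each node's weight: the probability that a random walk
    from `seed`, stopping with probability `stop` at each hop,
    ends at that node."""
    mass = {seed: 1.0}   # probability of currently standing at each node
    weight = {}          # accumulated probability of stopping at each node
    for _ in range(rounds):
        if not mass:
            break
        next_mass = {}
        for node, p in mass.items():
            outs = certs.get(node, [])
            if not outs:
                # Dead end: the walk has nowhere to go, so it stops here.
                weight[node] = weight.get(node, 0.0) + p
                continue
            weight[node] = weight.get(node, 0.0) + p * stop
            share = p * (1 - stop) / len(outs)
            for n in outs:
                next_mass[n] = next_mass.get(n, 0.0) + share
        mass = next_mass
    return weight
```

Each round moves all remaining probability mass one hop outward, so the work per round is proportional to the number of certification edges carrying mass, which is what makes the essentially-linear-time claim plausible.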

Twisty Puzzles

Some posts by me to the twisty forum.

HTML

Web pages which make their hyperlinks non-underlined drive me absolutely batshit. I wish there was a browser option to always underline even when the page says not to.

Today, the W3C announced a new initiative to emphasize ease of implementation in specification design. 'Encouraging adoption is the most important part of making protocols useful for the future internet' said Tim Berners-Lee.

In other good news, the IETF announced a new policy in RFCs against using MUST to describe behavior which is widely violated in practice, especially when that violation won't change for the foreseeable future.

29 Mar 2003 (updated 29 Mar 2003 at 05:51 UTC)
CodeCon

The CodeCon 2003 archives are now on codecon.org. We had lots of great presentations this year; those of you who missed it can now hear them.

BitTorrent

Pushed out a new release yesterday, and a bugfix release today. Get it on the download page.

tjansen: requiring hosters of big content to purchase a big net connection and then recoup the costs later would force them to take on a huge risk, and slow down new deployments immensely. Besides, we don't have micropayments and won't for the foreseeable future, so the point is moot.

Bayesian Filtering

brondsem: Counting a repeated instance multiple times causes long messages to be weighted disproportionately. Each message either is or is not spam; a long spam message does not count as two pieces of spam.

Also, if a non-spam message is about mortgages it's likely to contain the word 'mortgage' many times. Counting that commonly used spam word against it repeatedly is much more likely to tag it as spam, when in fact the repeated instances are just as likely to occur in non-spam as spam.

The idea behind only counting a few words is to not make all messages look the same. If you count all the words in all messages then most messages wind up with pretty much the same average value.

You are absolutely right that real backtracing is needed to know for sure, though. Anyone who's setting up a spam filtering system, please save all your spam and set it up so you can easily do backtracing and find out the number of false positives and false negatives different techniques would have hit.

How to be a Pundit

A pundit, of course, is someone who others look to when they want opinions. Here's how to become one in four easy steps.

  1. Post constantly. Preferably daily. Preferably put every inane little thought to come out of your head on a weblog, so the whole world can see you post constantly. The first step to having a following is to reliably offer an opinion.

  2. Always say the same things. It doesn't particularly matter if the things you say are right, or even if they make sense. What's important is that they sound good, and that they're always the same. Followers don't want leaders saying confusing or ambiguous things.

  3. Include what a great guru/authority/visionary/leader you are as part of your message. Don't worry if you aren't actually one; the important thing is that people think you are.

  4. Arrange pre-packaged stories for journalists to print. Journalists rarely have the resources to research a story properly, so if you give them a story they can just reword a bit and print, they'll do that. Don't forget to include how great you are as an important point in the story.

RedHat 9

Since the free download of RedHat 9 won't be available for a week after it becomes available to subscribers, this is a great opportunity to demonstrate BitTorrent. Unfortunately, I'm not a subscriber. Anyone who is a subscriber, has a decently fast pipe, and would like to help download the ISOs and make them available via BitTorrent within a few hours of their release, please mail me.

Travelling

I'm going to be in Amsterdam from April 27th through May 3rd, and New York City from May 18th till about June 1st. I'd like to meet some of the local open source hackers while I'm there. Anybody who would like to meet up, mail me.

Bayesian filtering

I'd like to remind everyone that I already posted about Bayesian filtering, and gave some ways of fixing some of Graham's code's deficiencies. Highlights include:

  • Whether a word occurs in a message should be yes or no; multiple occurrences should be ignored. Graham's code does something very funky with multiple occurrences which can cause lots of problems.

  • Everything should be changed to log scale so scores are simply added together. This makes everything much clearer. In particular, it makes it obvious that you might as well assume a default probability of spam of 1/2, since it just trades off with the cutoff threshold. That many people, including Graham, didn't realize this shows how thoroughly keeping things as probabilities obfuscates the process.

  • Adding up the 16 most extreme scores in either the positive or negative direction is very crude. For long messages it winds up throwing out all subtle details of exact values and just doing a vote count of the top 16 as to whether they're positive or negative. A much better approach is to add up the top 8 and bottom 8. Bruce Schneier and Joel Spolsky have recently complained about very long, easy to flag as non-spam messages of theirs getting rejected, and this is probably the cause and simplest fix.

    Graham didn't specify what the behavior should be when you have 16 words which say 100% spam and 16 words which say 100% not spam, although it doesn't really matter, because the top and bottom approach is clearly better.
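
As a rough sketch of the scoring described above, assuming per-word spam probabilities have already been estimated from training data elsewhere (the function and variable names are mine):

```python
import math

def score(words, spam_prob):
    """words: the words in a message (counted once each, presence only).
    spam_prob: dict mapping word -> estimated P(spam | word).
    Returns a log-odds score; positive leans spam, negative leans ham."""
    scores = []
    for w in set(words):                   # presence is yes or no
        p = spam_prob.get(w)
        if p is None:
            continue                       # unknown words carry no evidence
        # Log scale: individual word scores simply add together.
        scores.append(math.log(p) - math.log(1 - p))
    scores.sort()
    if len(scores) <= 16:
        # Short message: just add everything, no selection needed.
        return sum(scores)
    # Long message: top 8 most spam-leaning plus bottom 8 most
    # ham-leaning, so strong ham evidence is never thrown away.
    return sum(scores[-8:]) + sum(scores[:8])
```

Compared to taking the 16 most extreme scores in either direction, this keeps both kinds of evidence for long messages, which is the fix suggested above for long legitimate mail getting flagged.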

I hope I don't sound overly critical of Graham here. I think his approach is a great one. But it is a first version, and can be dramatically improved.

I wish someone had a backtracing setup for spam filtering algorithms, so we could get some real data about false positive and false negative values, instead of just having to guess.

Partial Busy-work Functions

ciphergoth: There are a few other criteria:

  • The whole puzzle should require 2k time to solve

  • If you have two nodes which aren't at opposite ends of the arc, then you still have to spend k time to solve the whole puzzle

Quick verification isn't a criterion I'm looking for, although that would be an interesting add-on. I'm thinking about this because it's related to circuit complexity, not because I'm looking for an especially practical crypto primitive.

The two cases with two nodes are easy. If the nodes are connected by an arc, then the puzzle is to compute the root of a binary hash tree, and each of the partial work functions is one of the two things pointed to by the root. For the disconnected case, the puzzle is to compute the output of a hash chain, and the two values are both the value of the halfway point of the chain. For three nodes, I haven't figured out a solution to the case where all three nodes are connected or the case where only a single pair of nodes is connected.
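
Both two-node constructions can be sketched concretely, with SHA-256 standing in for the hash function (the function names are mine, and the tree version assumes a power-of-two number of leaves for simplicity):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def hash_chain(seed: bytes, length: int):
    """Disconnected case: the puzzle is the end of a hash chain of
    length 2k. Each partial work function is the halfway value, which
    still leaves k hashes of work to finish the puzzle."""
    v = seed
    halfway = None
    for i in range(length):
        if i == length // 2:
            halfway = v
        v = h(v)
    return v, halfway

def tree_root(leaves):
    """Connected case: the puzzle is the root of a binary hash tree.
    Each partial work function is one of the root's two children,
    i.e. the root of one half of the tree."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

In the chain, knowing the halfway point halves the work; in the tree, knowing one child of the root still leaves the entire other subtree to compute, which is what gives the two constructions their different sharing structure.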

Twisty Puzzles

I figured out some big improvements to my pentultimate design. Possibly even more interesting than having a working pentultimate are the internal pieces, which move at half the rate of the external pieces. My basic ideas can be used in a much more robust design for an extension of the 2x2x2 Rubik's Cube in which the outside is transparent and internally there are a bunch of extra pieces which move at half the rate of the outer ones. This would be a very interesting and difficult new puzzle, and my design for it is extremely robust.

Now I just have to learn how to make precision plastic parts. Good thing I don't have lots of other projects to work on.

The Twisty Forum is great.

Certs against Spam

I've given some more thought to stopping spam, and have come up with the following approach, which is still a bit sketchy but seems to be on the right track.

When receiving a piece of mail, determine if you're willing to accept mail from the sender by the following process:

  1. Determine each peer's degree of spammerness. This can be done using a variant of the random walk method, which is used to calculate advogato's diary ratings.

  2. Back propagate spammerness. Oh, you wanted to know how to do that? For each spammer, calculate the max flow from you to that spammer. In the flow diagram, try to even out the flow as much as possible, so if there's a possibility of offloading flow through one peer to other peers, each of which is currently taking on less flow, do so. Calculate this for every spammer, then calculate each peer's degree of spammerness by adding up the flow through them for each spammer.

  3. Throw out all peers whose degree of spammerness is above some threshold. If there's a path from you along certification edges over remaining peers to the sender of the mail being evaluated, accept it. Otherwise reject it.

The real innovation here is the second step, which does back propagation. It ensures that sending spam from fake certified identities is just as damaging as sending spam from the directly compromised identities.
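
The acceptance test in step 3 reduces to a breadth-first search over peers below the threshold. A minimal sketch, assuming spammerness values have already been computed by the earlier steps (names are illustrative):

```python
from collections import deque

def accepts(certs, spammerness, me, sender, threshold=0.5):
    """certs: dict mapping each peer to the list of peers it certifies.
    spammerness: dict mapping peer -> degree of spammerness (0 if absent).
    Accept mail from `sender` if there's a certification path from `me`
    through peers whose spammerness is below `threshold`."""
    ok = lambda n: spammerness.get(n, 0.0) < threshold
    if not ok(sender):
        return False
    seen = {me}
    queue = deque([me])
    while queue:
        node = queue.popleft()
        if node == sender:
            return True
        for n in certs.get(node, []):
            # Thrown-out peers can't serve as stepping stones.
            if n not in seen and ok(n):
                seen.add(n)
                queue.append(n)
    return False
```

The search visits each certification edge at most once, so this step at least is linear; the quadratic cost mentioned below comes from the per-spammer max flow computations, not from this check.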

The database for this could be easily populated using a Friendster-style interface. Unfortunately, to be maximally useful this database has to be global, meaning it's huge, and the steps involved in evaluating it look to be at least quadratic. Thankfully Moore's law will make this much more manageable in due time, although the database will still have to be widely replicated in order to distribute computational load.

In case you wondered what one could ever do with vastly greater than videophone bandwidth, it turns out you can put it to good use propagating global databases.

An interesting graph problem

Given a graph, and a specification of a 'red' node and a 'blue' node, we wish to know if it's possible to remove k nodes, not including blue and red, so that there's no longer a path from blue to red over the remaining nodes. Is this problem NP-complete?

Architecture

An architecture is a set of libraries which are so poorly encapsulated that if you want to use one of them you have to use all of them.

Building a Pentultimate

I have long been fascinated by the pentultimate. Could one actually be built? I have a general design which I'm fairly confident is workable.

Imagine two blocks which you wish to slide along each other in a straight line. If you cut a slot along their boundary, you can put a disc in the middle which will rotate as they move. You can add teeth to the disc and notches to the edges of the slot, thus forcing the disc to rotate and move exactly half as much as each block does relative to the other one. A more esoteric trick is to add circular notches to the faces of the disc and grooves to the sides of the slot for those notches to move in. This will have the same effect in terms of how the disc moves, but has the additional property that the two blocks can't be pulled apart while the disc is inside.

If we bend things a bit we can make the slot circular, and the blocks then rotate. My trick for making a pentultimate is to cut grooves of this form along the hidden sides of each visible piece and have a disc sit under each boundary between a triangular and pentagonal piece. The discs keep the puzzle from being taken apart, but allow rotation along the cuts. Every time a rotation is done around a cut, it moves a fifth of a complete rotation, making each of the ten discs beneath move half that distance, resulting in each of them moving to the next position along, and the puzzle once again being in a rotatable position.

The details of course must be hashed out, but I'm confident a real physical pentultimate which is robust enough to play with could be made using this technique.

An interesting possibility this internal mechanism raises is that the outside of the puzzle could be made transparent, and the puzzle would then be to position and orient the internal discs.

There are many other fascinating Rubik-type puzzles, but I think the pentultimate would be by far the most compelling toy.
