Older blog entries for raph (starting at number 263)

Metamath

I've been chewing over Metamath in my head some more. It's very appealing in lots of ways, especially its simplicity, but a few things need to be done differently, I think, to make it work as a distributed, web-based proof repository.

The biggest problem is namespaces. On the web, everybody has to be able to define their own local namespace, while also being able to refer to other things in the world. Metamath, by contrast, uses short, cryptic, and impermanent names for its theorems. On that last point, set.mm includes this note: "While this file is complete and correct, it may undergo revisions from time to time (including theorem name changes, which means any new theorems you add may not always remain compatible)." That won't do.

Actually, I'm pretty convinced by now that the best name for a theorem is the theorem itself. If the size of the name is a concern, a hash can be substituted. However, an advantage of using the whole theorem is that, if a correct proof cannot be found on the web, someone can probably prove it by hand.

The namespace of newly defined keywords is also problematic. In fact, Metamath does not automatically check whether definitions are ambiguous. Having a definition collision could lead to contradiction, which would be bad. For definitions that simply expand as macros, I suppose you could use the expansion (or a hash thereof) as the permanent name. However, not all definitions in Metamath have this form.

The second big problem, as I've mentioned earlier, is whether to keep theorems abstract, or pin them down to a single model. Metamath does the latter, which helps keeps theorems simple, but it also makes them nonportable to other models. I think it's possible to do abstract theorems in Metamath (you quantify over all possible models that satisfy the axioms), but probably notationally very cumbersome. Even so, it's similar to the way a lot of mathematics is done: you see a lot of theorems over, say, fields, without them pinning down which field is meant.

I'm still not sure whether formulae should be represented as strings of symbols (subject to implicit parsing through wff rules), or as trees. My gut feeling is the latter, because it sidesteps the issue of whether the parsing grammar is ambiguous. The former is a slightly simpler representation, though, and also closer to human-readable mathematics.

There are a few stylistic choices which are unusual, but probably reasonable. For one, the only primitive inference rules are the Modus Ponens and generalization. Equality and substitution are handled through theorems just like everything else. For another, instead of classifying variables as "free" or "bound", Metamath just places distinctness constraints on variables. A lot of the primitive operations such as substitution are cleverly defined to place no more distinctness constraints than necessary.

By the way, does anyone have a formal spec for lambda? Here's what I have, but I don't know if it's right:

( lambda x A ) = { y | y = E. z <. z , [ x / z ] A >. }
y,z,A distinct

Project death

AaronSw notes the demise of Pepper, a shareware text editor. Apparently, the author is tired of it.

This highlights one of the major differences between proprietary and free software. People lose the motivation to work on projects all the time. But free software projects are like extremely hardy seeds; they can dry up and last a long time, and when a fertile environment comes along, sprout and flourish. When the author has particular rights over the code, a big risk is that it can really die.

The history of ncurses might be illuminating on this subject. Throughout its early history, it's license wasn't quite free, and in fact the original author basically had veto rights. It would be really interesting for someone to do a critical history of this project. There was a lot of strife surrounding it. If you read "Homesteading the Noosphere," it might help to know that ncurses provided much of the background for the discussion.

Bytecode interpreters

One of the interesting threads (pphaneuf McCusker JDarcy) that's been bouncing around lately is the design of bytecodes for interpretation. It's a different problem than instruction set design for CPU's. There, RISC is a win because a stripped down instruction set can support higher clock speeds. But for a bytecode interpreter, you're going to have a certain amount of overhead to go through an iteration of the interpreter main loop, so you might as well make it do more.

But a lot of this argumentation leaves me unsatisfied. If you've got a bytecode interpreter in a system, it's a dead giveaway that you're optimizing for something other than raw performance, for example overall system complexity or portability. Both are worth optimizing for, of course.

The stack machine vs. memory transfer machine argument is interesting. I think it comes down to where you're willing to put your complexity. If the compiler from source language to bytecode must be simple, then stack machines are a win, in large part because they support arbitrary depths of expression nesting without any special case code. With memory transfer, you either have to deal with variable sized fields for representing the memory cell (and thus variable sized instructions), or you have to implement spilling when depth goes above (say) 256.

On the other hand, if you're trying to ship virtual machines in tiny embedded devices, then it makes a lot of sense to keep the VM as simple as possible, even at a cost of more sophistication in the compiler.

I note a number of interesting memory-transfer based virtual machines, including Parrot and MMIX. Virtual machine design is a delicate art. I'm happy to see it undergoing something of a renaissance.

Alan

School starts on Monday, and we're starting to gear up. For those of you joining late, we have a hybrid plan. He'll do half a day in public school (1st grade), and half a day at home. For the math-and-science part of the curriculum, one of the things I plan is Mindstorms. We ordered the starter set today. I'm pretty excited; it looks cool.

Ostensibly, Heather will cover language arts and I'll do the Spock stuff, but ironically he started on an illustrated story this evening with me. It's called "Cretaceous Park 1". Ok, so it's a bit derivative, but we're having fun. The words flow easily, and then he puts a great deal of effort and concentration into drawing and coloring the accompanying pictures. I'll probably scan it and put it on the net when he's done.

Max

Max's language development continues to be amazing to behold. He's able to communicate very clearly now. When Alan had a tantrum a few days ago, Max picked up the Nerf rocket launcher and said to us, "I'm going to shoot him." We burst out laughing, of course. Other samples include "I want to press the button on Alan's toy" and "I like your shirt. It has fire on it."

I've been using a lot of the time that's been set aside for writing this diary for other things. For example, I just spent an hour or so talking on the phone with David McCusker. Also, over the last two evenings I read a great book. It's all good, but as usual a challenge to keep everything in good balance. I'm not sure whether writing a diary every day is the right frequency or not.

Raising Blaze

"Raising Blaze", by Debra Ginsberg, is a compelling tale of a child with a high impedance mismatch with the (public) school system, and how his family copes with it. If you have a child that doesn't fit in to any of the neatly defined categories that schools are designed to handle, you should read this book.

Our Alan is a challenging kid, for sure. We've been incredibly fortunate to have good teachers, guidance, and support, and his academic performance has been excellent (by contrast, the Blaze of the book, is labelled a "problem" very early on and struggles with academics).

The book is beautifully written, with gripping narrative at times, and fully human, real characters. Heather is reading it now, and I am sure we will discuss it deeply.

Gnome-Meeting

A couple of people recommended gnome-meeting. I did try it before, but with no success. It was probably a low-level audio driver issue. I'm particularly prone to these kinds of problems because I compile my own kernels. I'll give it another shot.

I also bought a reasonably priced Plantronics headset with microphone, instead of the clunky old Apple microphone I had been using. The ergonomics should be a lot better, anyway.

Video games

Both Alan and Max are getting into computer games on the iMac in a pretty big way. Everybody (myself included) likes Pop Pop, a new game from Ambrosia in the Breakout family. Alan is able to get through the demo levels by himself, so I've bought him a registration key. Alan is also really enjoying Snood. He's just recently been able to win games at the first difficulty level.

Build tools

I've actually been hacking a bit on my prototype for a build system, but if tromey's is a good design, I'd rather help on that. Perhaps a good way to do that, though, is to make the Rebar prototype good enough to build interesting projects, and offer it as grist for the mill.

Tromey writes about an important question in build tool design: do you make it depend on an existing, but not yet ubiquitous tool or language (such as Python), or do you try to make it bootstrap from more primitive parts? Auto* does an impressive job working on older Unix systems, but is deeply dependent on the Unix shell. I really need a build system that works on other platforms too.

A dependence on Python is pretty reasonable and is one way to solve this, but I have a different idea for Rebar. This question also brings up another: what language underlies the build scripts? If you know that your tool will have a (say) Python dependency, then it's very tempting to just export the languages "eval" capability. In fact, all build tools have some underlying language.

In the case of Rebar, my choice is to roll a new language, tiny and simple of course. What makes it interesting is that it's purely functional, so it can be evaluated in parallel (make -jn), and so that results from one run can be cached for others (like ccache and friends). However, in all cases, the tool must give the same results as if executing the script serially from beginning to end. I feel that this design is potentially a lot more robust than those in which the parallelism and caching is specified explicitly, and that build scripts can be simpler.

As for bootstrapping, I will expect most developers to have the full tool. However, the plan is to have a "light" version written in portable C and released under a very liberal free license (MIT), so that it can be included in source distributions. This version can be much simpler in terms of caching and dependency analysis, so I think it's possible to do in a small number of lines of code. It will also come with scripts to automatically build it on a number of popular systems. If these scripts fail for any reason, then getting the full tool is always an option.

I like this idea, and feel it would be quite a bit harder to pull off if a large, rich language such as Python were embedded. Even so, I'm doing the prototype in Python and am happy with the choice.

More need for trust

Another article on the posting of false metadata to p2p networks in an attempt to poison them, presumably with the intent of selling more CD's and DVD's. Interestingly, at the end it suggests that an impending release of Morpheus will have some kind of authentication built in. That will be interesting.

VoIP

I spend quite a bit of time with other Ghostscript people on the phone. It's a good way to keep in contact, but I'm not happy with shoveling lots of money at the phone company. Calls to Russia, in particular, were radically overcharged (basic WALD is around a dollar a minute). I've tried phone cards, but they're not great either. Call setup takes a long time and isn't 100% reliable. Voice quality isn't all that great either. Finally, even though they're pretty cheap (about 4 cents a minute to Russia), the billing seems to be unreliable. In fact, I now strongly suspect that the card (Original Gold in my case) is just lying about how much time is available.

Of course, phone cards just route voice over IP. So why am I routing voice through a patched up network of analog hardware and IP gateways, when I have high quality audio equipment and good bandwidth here? Sadly, the answer is convenience. So far, my experiments with VoIP have been a pain in the royal ass. Today, I played a bit with SpeakFreely.

Even so, I did succeed in setting up a conversation with Ray Johnston. Voice quality varied, and I had to use two separate computers to get full duplex (anyone know how to sample a microphone input on an Audigy with alsa?). The SpeakFreely client is crude; it doesn't even seem to report packet loss. It also seems to be hardwired for 8kHz sample rates. Since I've got plenty of bandwidth and CPU, what I really want is 22.05 or 44.1, and Vorbis compression to 32kbps or so.

We spend enough time on the phone, though, that it's probably worth perservering. If we come up with a magic recipe, I'll certainly post it.

Paul Graham and Spam

xach figured it out: the person analyzing spam is, indeed, Paul Graham. In fact, he has a new piece up.

I'm writing something more detailed about trust now, filling out some email conversation that we've started. However, I'm struck by one feature of his new scheme, so I'll just blog my response here:

Very good. I do believe you're on to something here. The idea of everybody building their own corpus _is_ powerful. Your analysis makes sense to me, and I do believe your tool will do better than most.

I take issue with one thing, though: your assertion that probabilities are superior to scores. Most people not trained in statistics will find adding up of scores more transparent than computing Bayesian probabilities. Further, you've got an awful lot of voodoo constants in there: 2x, 0.01, 0.99, 0.20, 0.9. Lastly, the Bayesian probability computation assumes all the probabilities are independent (if I recall my stats correctly), which is definitely not valid in this application.

In fact, I do believe that your probabilities and SpamAssassin-like scores are equivalent. Use the transform: score = log p - log (1-p), or p = exp(score) / (1 + exp(score)). Take the 15 scores with greatest absolute values (see, already a more intuitive formulation), and simply add them. Your voodoo scores are now -4.6, 4.6, -1.4, and 2.2, respectively.

Perhaps the best way to look at this is that Bayesian probability can give theoretical justification to a "score" system. That might be interesting to some.

Font problems

It's hard to find good, free TrueType fonts. See this note on debian-devel for more evidence.

Even the nonfree Microsoft fonts that everybody uses aren't available for free download any more (thanks to Peter Novodnorsky for bringing this to my attention).

Now's my turn to say, "nyah nyah, told you so." I've long argued for Type1 and against TrueType fonts, for political, technical, and aesthetic reasons. Now that we're moving to high-resolution printing and unhinted antialiasing, the hinting embedded in the font is becoming increasingly irrelevant. Even if you are going to do hinting, a good renderer is capable of some very nice quality with Type1's. Formerly, all the free renderers sucked, but FreeType's is now showing quite a bit of promise.

The message linked above doesn't mention it, but there's a good set of Type1 fonts with a GPL license. Recently, Valek Filippov has extended these fonts to include Cyrillic glyphs, thereby fulfilling the spirit of the GPL. We'll include these in the full Ghostscript release, by the way, as soon as we test them to make sure they don't cause regressions in Latin documents.

By the way, URW did not donate these fonts under the GPL out of their own hearts. Artifex paid good money for them, and donated them out of a mix of self-interest and altruism.

Bug bounty

We've set up a bug bounty program for Ghostscript. It'll be very interesting to see what kind of results we get. If positive, we might do something like this regularly.

GoBe

gobeProductive will be released under the GPL soon. It's very cool software. Why adopt in when we already have KWord, AbiWord, and OpenOffice? For one, it has the experience of the GoBe team behind it. These are the same guys who did ClarisWorks. I haven't seen the code, but it's likely to be a lot more elegant than its competitors.

I know the GoBe folks, especially Bruce Q. Hammond, because I did some Libart work for them. The graphics tool alone will be a significant contribution to free software, especially for less technical users.

The software is based on a "parts" architecture. This should make it relatively easy to get started. A good project would be to do Python bindings for the parts API, so people can do simple parts in Python.

The Gobe people have done some really great work. It's a damn shame the business side didn't work out. Instead of trying to hoard intellectual property into uselessness, they're giving their source to the public. The free software community should adopt them warmly.

jfleck: I agree with what you said, which is of course why I used the word "seemingly". I also don't know if Hatfill is the man or not, but I found the patterns of press coverage very odd.

Your comments on anonymous sources are worth noting too. There's almost no benefit to going on the record; you'll still get quoted, and you avoid being held responsible for your words. I'm a big fan of anonymity in some cases, but in others it's simply ridiculous. I was struck, for example, by this passage in the AP wire story on the Chinese Harry Potter sequel:

Rowling's agent, the Christopher Little Literary Agency in London, said it was aware of the fake Chinese Harry. A spokeswoman who asked not to be identified refused to comment by telephone, but sent The Associated Press an e-mail saying, "We are taking this issue extremely seriously."

Uh, sure.

Some followup

graydon wrote me a nice email on my proof thread, pointing out his diary entry, which a lot of people probably missed because the next one was posted so soon after.

In it, he points out the logic advocated by Eric Hehner. In this logic, programs themselves are predicate expressions. Graydon claims that this results in some simplification, presumably because you don't have to separate two different kinds of expressions: programs and predicates about them. Hopefully, when Graydon and I meet, he can explain this to me. I suppose it's possible, but to me these do seem like the kinds of things that should be separated.

Aaron Swartz linked my memorial to Dijkstra. It really made my heart glow when he called it "beautiful". Thanks, Aaron.

I got an exciting email a couple of days ago. Somebody smart (you probably know the name) is working on the spam problem, and asked me about trust metrics. I'll post more as soon as it's clear the person wants this info public.

I wrote recently about the alarmingly weak US media coverage of the Anthrax case. I'm relieved to see a seemingly fair, accurate article in Newsweek.

254 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!