Older blog entries for raph (starting at number 240)

Math

My regular readers are no doubt wondering what it was that Bram and I talked about yesterday evening that was so much fun it left my diary so inarticulate. I'll try to summarize.

Rigorously formal mathematics is interesting, but it has a well-earned reputation for being too painful and tedious to be really practical for anything. So far, it's been much more the domain of philosophers than of engineers.

Bram and I feel that the Web has the power to change that. For the most part, tools for mechanically checking proofs have basically been solo tools. For one person to get through a significant slice of mathematics is an ordeal. But if the work is split up among many, it's a lot more reasonable.

Collaboration requires standard file formats. Bram and I are thinking along the lines of a low-level, simple formal proof format. The main thing is to apply inference rules. Then, there's a higher level metadata format for saying things like "this is a proof of Fermat's Little Theorem using the Peano axioms for natural numbers".
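
Just to make "apply inference rules" concrete, here's a toy sketch in Python of what the core of such a low-level checker could look like. Everything here (the Step record, the "hyp"/"mp" rule names) is invented for illustration; it's not a proposed format.

  # A proof is a list of steps; each step is either a hypothesis or an
  # application of modus ponens to two earlier steps.
  from dataclasses import dataclass

  @dataclass
  class Step:
      formula: str        # the statement established at this step
      rule: str           # "hyp" or "mp"
      refs: tuple = ()    # indices of earlier steps this one uses

  def check(proof, hypotheses):
      """Return True if every step follows from earlier ones."""
      for step in proof:
          if step.rule == "hyp":
              if step.formula not in hypotheses:
                  return False
          elif step.rule == "mp":
              a, b = (proof[j].formula for j in step.refs)
              # modus ponens: from A and "A -> B", conclude B
              if b != a + " -> " + step.formula:
                  return False
          else:
              return False
      return True

  proof = [Step("P", "hyp"),
           Step("P -> Q", "hyp"),
           Step("Q", "mp", (0, 1))]
  assert check(proof, {"P", "P -> Q"})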

The design of such languages parallels the design of computer languages. Names and namespaces are particularly important. Proofs should be written as stand-alone modules, but also so that you can build up a structure by reusing the results in other proofs. Since these file formats are intended to live on the Web, links are critically important too. In particular, you want to be able to import other namespaces. Thus, in the example above, "the Peano axioms for natural numbers" would be a link.

A link can be a URL, but it can also be an immutable hash of the content. That way, if you have the file locally, you don't need to go to the net to resolve it.
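
A quick sketch of the hash-as-link idea (the "proof:sha256:" prefix is just something I made up for illustration):

  import hashlib

  def content_link(data: bytes) -> str:
      # the link is simply a digest of the file's bytes
      return "proof:sha256:" + hashlib.sha256(data).hexdigest()

  def resolve(link: str, local_store: dict) -> bytes:
      data = local_store.get(link)
      if data is not None and content_link(data) == link:
          return data      # verified locally, no network round trip needed
      raise KeyError("not cached; fall back to fetching a URL")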

Typically, these formal systems have set theory as a primitive (Metamath uses ZFC). Other objects are encoded in the primitives. For example, the usual encoding of natural numbers is: 0 = {}, i + 1 = i ∪ {i}. But there are other ways of encoding numbers, some of which might make more sense. For example, if you're doing calculations, a sequence of bits might be better. Of course, all the encodings of integers are equivalent, but if two proofs use two different encodings, it might be hard to knit them together.
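
Here's that set-theoretic encoding written out with Python frozensets, just to make it concrete (a toy, obviously, not something you'd compute with):

  def zero():
      return frozenset()

  def succ(i):
      return i | frozenset([i])   # i + 1 = i ∪ {i}

  def to_int(n):
      return len(n)               # under this encoding, |n| equals n

  two = succ(succ(zero()))
  assert to_int(two) == 2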

So I think it's important to make it easy to convert between these different conventions. There are a couple of techniques I can think of. First, you might be able to make inferences based on models: "if P is a proof of theorem T using encoding E0, and E1 is a model of E0, then (some simple transformation of T) is also true". Another way would be to parameterize proofs, so they take, say, the integers as arguments. Any encoding of the integers which satisfied the Peano axioms would do; to use a proof with a different encoding, you just prove the Peano axioms in your new encoding.
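
The parameterized approach is easy to sketch: write the "proof" against an abstract interface rather than a particular encoding, and plug in any encoding that satisfies the axioms. The function names below are invented for illustration.

  def two_plus_two_is_four(zero, succ, add, eq):
      two = succ(succ(zero))
      four = succ(succ(two))
      return eq(add(two, two), four)

  # plain machine integers as one possible "encoding" of the naturals
  assert two_plus_two_is_four(0, lambda n: n + 1,
                              lambda a, b: a + b,
                              lambda a, b: a == b)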

The higher-level format is useful for humans wishing to browse the space of proofs, and also for clients to automatically verify proofs. At the very least, you want to let them crawl enough of the proof web that the expansion of the proof is complete. I'd also expect the usual Web patterns to show up: search engines, directories, pages full of annotated links and commentary.

The exact choice of lower-level language is probably not that important. You can always use tools to translate more convenient proof formats into the lower-level one, or even to automate part of the process. By using a common lower-level format, people using different tools can still collaborate. One important test of this principle is whether the Metamath proofs can be so translated.

Formal logic for proofs is not new, of course. The idea of a proof repository isn't new, either. But I am unaware of anyone who has designed languages for formal proofs with the explicit goal of fostering collaboration through the Web.

This Introduction to Mathematical Logic may be rewarding to those intrigued by these ideas.

14 Jul 2002 (updated 14 Jul 2002 at 07:53 UTC)

I've been having a lot of fun discussing with Bram the ideas mentioned yesterday. Thinking about formal math in the context of the Web is a lot of fun.

Fitz Wiki

The Fitz Wiki actually has some interesting content now, although of course it's still very new.

Metamath

Bram brought up an intriguing idea: an online repository for theorems and proofs, in which the proofs are in a formal language and can be checked automatically.

This is in many ways an easier problem than automated theorem proving, an area that's soaked up a lot of research attention without all that much to show for it. Here, we optimize for proofs written by people.

Rigorously formal proofs have a reputation for being tedious and verbose. I'm not sure they have to be, but even so I think it makes sense to store a low-level proof format in which all steps are spelled out. People developing proofs might work in a higher-level language, and have it compiled down. An appealing consequence is that people can experiment with higher-level proof systems.

There are still a lot of challenges, though. For one, mathematicians are not able to agree on a set of fundamental axioms that are "right". In fact, the debates over various axioms run almost as hot as Perl vs Python.

For two, there are many different ways of encoding the same idea into mathematical logic. Plain-old unsigned integers, for example, readily admit Peano encodings, Church encodings, and so on. None of these logic-based encodings is at all efficient if you're dealing with big numbers, so you probably want another encoding as a sequence of binary digits. But if you prove "2+2=4" in one encoding, how can you use the result in another?
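
To make the question concrete, here are Church numerals next to plain machine integers in Python; moving the fact "2+2=4" from one encoding to the other takes an explicit conversion (this is only a toy illustration):

  zero = lambda f: lambda x: x
  succ = lambda n: lambda f: lambda x: f(n(f)(x))
  add  = lambda m, n: lambda f: lambda x: m(f)(n(f)(x))

  def church_to_int(n):
      # bridge from the Church encoding to ordinary machine integers
      return n(lambda k: k + 1)(0)

  two = succ(succ(zero))
  assert church_to_int(add(two, two)) == 4   # the same fact, re-encoded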

I think this question is very analogous to runtime incompatibilities between various programming languages, or even within a single programming language. In C, you can have many different implementations of a simple list.

One of the best-developed formal proof repositories is Metamath. It deals with the runtime incompatibility problem by just choosing one encoding. Overall, proofs don't look that scary, but I still think there are language design issues that get in the way of larger-scale use. The syntax is twisted a bit to look like math. For example, '+' looks like an infix operator. To me, it makes much more sense to write +(x, y) in the low-level language, and possibly compile infix notation down to the lower level.
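
In the low-level language, a term could be nothing more than a prefix application, with infix as surface syntax that a front end compiles away. A tiny sketch (the representation here is made up):

  def render_prefix(term):
      if isinstance(term, tuple):
          op, *args = term
          return op + "(" + ", ".join(render_prefix(a) for a in args) + ")"
      return str(term)

  expr = ("+", "x", ("*", "y", "z"))    # infix surface syntax: x + y * z
  print(render_prefix(expr))            # prints +(x, *(y, z))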

It's pretty easy to prototype something like this. You want search and directory capabilities to get a handle on the potentially huge number of theorems, but that's all pretty well understood in the Web world. My guess is that we'll start to see more experimentation in this area, and possibly a useful proof repository within a few years.

Fitz

The mailing list and Wiki for Fitz are now instantiated, thanks to rillian. If you're interested, you might want to check the archives; they contain some relevant discussion.

The next step is to populate the Wiki. I'll try to write a little something every day, just like I do here.

Anthrax

There's something fishy going on with the Anthrax investigation. All the evidence points to the culprit being someone hooked into the US defense establishment. At the same time, the FBI's investigation has been laughably incompetent.

Then, most remarkably, the New York Times published an editorial (reg required) talking about an individual who many believe is the culprit, but who has barely been investigated. The editorial itself doesn't name the name, but a few minutes of searching on Google reveals it to be almost certainly one Steven J. Hatfill.

If this is true, it's a pretty big story. For one, it drives home the need for transparency and accountability in our government. The defense establishment has been largely exempt, because we have generally trusted that they need their secrecy to protect us from the bad guys. But if they are the bad guys, everything changes.

Coverage of this story has differed sharply between the US and overseas media. Aside from this one editorial in the NYTimes, I've been able to find almost no coverage in the US mainstream press. If this is "objective journalism", fuck it, I'd much rather have Pravda. Fortunately, we have a vital independent press on the Web. I've been tracking this story through rotten.com.

I don't know that Hatfill did it, but it seems like by far the best lead the FBI has. For them not to be pursuing him vigorously, and for the US mainstream media not to be reporting the story, verges on the criminal.

Well Tempered Screening

I got my first color output today. Wanna see a sample? (note: 1.4M)

Also see these fun bug images: 1 2

Fitz

Between Antony and tor, the discussion is really heating up. I should set up a real mailing list, so it gets properly archived and searchable and so on.

Google

AaronSw pointed out that searching on "advogato" in google yields a "Google is hiring" adword. Considering the otherwise absolute radio silence from them, it's a very cool bit of feedback.

Ghostscript

Woohoo! I got my first Well Tempered Screening output this morning. If you like, you can take a look at before and after. If you want to print them, try: pngtopnm golfer.png | pnmtops -equalpixels -dpi 300 | lpr

WTS solves three problems. First, it replaces the (possibly objectionable) patterning artifacts with finely dispersed noise. Second, it renders screen angles extremely precisely, where Ghostscript's existing halftones, even in "AccurateScreens" mode, are much less accurate than Adobe's. Finally, Ghostscript's existing screening uses a "tile cache"; as the halftone cell grows in size, it can overflow that cache, and when it does, performance degrades horribly. WTS, by contrast, computes thresholds on the fly.
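
For those unfamiliar with screening, here's what "thresholding on the fly" means in general (this is a generic ordered-dither illustration with a made-up 4x4 matrix, not the WTS algorithm): each output pixel is a single comparison against a threshold array, so there's no pre-rendered tile to cache at all.

  THRESH = [[ 15, 135,  45, 165],
            [195,  75, 225, 105],
            [ 60, 180,  30, 150],
            [240, 120, 210,  90]]

  def screen(gray, width, height):
      """gray[y][x] in 0..255 -> one bit per pixel, computed point by point."""
      return [[1 if gray[y][x] > THRESH[y % 4][x % 4] else 0
               for x in range(width)]
              for y in range(height)]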

There's more work to do, but I'm quite pleased.

Trustworthy names

Name services are an important battleground for trust. The question, "who deserves the right to name X?" is a deep one, and the answer is far more social than technical.

The way DNS is being run now sucks sucks sucks. One symptom is that there are no less than four services offering non-global DNS root service.

Much of the original motivation for my thesis work was a desire to build a better name system. It's a hard problem, though. More research is needed, or so I'm fond of saying.

What if you used the eigenvector metadata engine? Assertions would be of the form "name X belongs to public key 0x12345". When more than one such assertion exists for any given name, the one with the higher confidence value wins.
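
A toy version of the resolution step, assuming the engine has already attached confidence values to assertions (the names and numbers below are made up):

  assertions = [("example",  "0x12345", 0.92),
                ("example",  "0xabcde", 0.31),   # competing claim, lower confidence
                ("advogato", "0x77777", 0.88)]

  def resolve(name):
      candidates = [(conf, key) for n, key, conf in assertions if n == name]
      return max(candidates)[1] if candidates else None

  assert resolve("example") == "0x12345"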

Obviously, this has the risk of "balkanizing" the namespace. Certain "controversial" names would resolve differently from community to community. While this isn't a serious problem for RealNames-like keyword searching, it is a problem if you're trying to publish stable URLs.

But here's the big question: even though the system is capable of balkanization, would it be a problem in reality? It's not clear to me. People can make local decisions to reduce the ambiguity in resolving names. For one, "registrants" can choose names less likely to be challenged. For two, when choosing the trust links for your community, you can emphasize people who take the game seriously and try to bind names to the rightful owner.

If the controversy can't be resolved, then the name is simply not useful as a global resource. Is this so bad? There are certainly a ton of names owned by squatters now that can't be used, so it's not clear that the proposal would be any worse.

I'm not convinced that this is the best naming scheme. I personally favor first-come, first-served naming policies, and my PhD research goes into some detail about how to implement those. But I do think it's worth thinking about alternatives. What we have now isn't very good, and it isn't going to get any better by itself.

MuPdf and Fitz

I took a look at tor's MuPdf. It's very unfinished, but it's still very impressive. The unhinted, antialiased text is very pretty, and performance seems to be excellent. In many ways, it's a model for what I want Fitz to be. In many other ways, Ghostscript is the model, because it's already solved the problem of rendering the full PDF imaging model.

Antony Courtney has posted an architectural goals document for Fitz, with edits from me. There's a lot more I want to write, but I don't have big gobs of time right now. Also, it's obviously getting near time to set up the public infrastructure for the project: CVS and mailing lists as bare minimum, but I also want a Wiki.

Two degrees of separation

Gene Kan died last week, probably a suicide. Sad.

Ghostscript

Developer release 7.21 is out. We're going to merge in the DeviceN code next, so this is probably an "island of stability".

Deep thinking vs not

Bram has been thinking deeply about how to efficiently move around large quantities of data in a robust p2p way. The result is BitTorrent, which is now in a good beta release.

By contrast, here we see "oopster", which is a specialization of Jxta to move around the OpenOffice binaries (which needs this kind of thing because of bloat, but that's another story). They're just beginning to think about the problems that Bram has already solved.

In the thinking world, one of the great applications of BitTorrent will probably be ISOs of free software distributions.

tromey: Cool.

sej: What you are proposing (adding transparency to PostScript) is adding nonstandard extensions to PostScript. It's possible, but I'm not sure what the real advantage is over PDF.

Jeff Darcy has written an interesting piece on protocols that contain reimplementations of TCP. Looking through the archives, Jeff has a lot of good stuff on protocols. Designing good protocols is an esoteric art, and should probably get more attention. It's nice to see writing in the open, even though Jeff works for a proprietary company.

Roughly, Jeff is to protocols as David McCusker is to trees. I think we should do a Vulcan mind-meld of the two, because I think an efficient protocol for remote access to trees is interesting. It's a central component of the Athshe vaporware project, but I freely admit that there are many aspects I don't know how to do right. Maybe in the absence of mind-meld technology, we can just bounce ideas off each other.

Dirt and noise followup

I wrote a few days ago about "dirt and noise". One of the things I was thinking about is the demonstration in Wolfram's book that the simple "rule 110" automaton can run a universal Turing machine. The straightforward implication is that, even for such ridiculously simple automata, if you know the initial state, you cannot predict whether a certain pattern (corresponding to the halting of the Turing machine) will ever occur. Simple universal machines are not new, but Matthew Cook's proof that rule 110 qualifies certainly pushes the simplicity envelope.

But here's a question. What if your cellular automaton didn't proceed purely mathematically, but had occasional bit errors? Does this make the halting problem go away? If so, does predicting the presence or absence of patterns become tractable?
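
It's easy to play with this. Here's rule 110 in a few lines of Python, with an optional bit-error rate so you can compare the exact automaton (noise=0) against a noisy one:

  import random

  def step(cells, noise=0.0):
      n = len(cells)
      out = []
      for i in range(n):
          l, c, r = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
          bit = (110 >> ((l << 2) | (c << 1) | r)) & 1   # rule 110's truth table
          if noise and random.random() < noise:
              bit ^= 1                                   # occasional bit error
          out.append(bit)
      return out

  cells = [0] * 79 + [1]          # start from a single live cell
  for _ in range(40):
      cells = step(cells, noise=0.0)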

Incidentally, the best review I've seen so far is this one, posted on the MAA website.

The book may be more interesting as art or literature than as science. Certainly, the argument for deep equivalence in all fields of human intellectual endeavor is at the heart of Magister Ludi (Das Glasperlenspiel), by Hermann Hesse.

Outrageous rudeness

tromey wrote more about his feelings upon receiving emails saying "automake sucks".

Think about the analogue in the real world. Let's say you spent years refining an artwork, and many copies were distributed. Then somebody comes up to you with a copy in hand, says, "you know this artwork? it sucks," and rips it in half. I am a peaceful person, but I would be quite understanding if you just hit him.

The principle of it is the same on the net, but of course there are a lot more assholes within easy reach of you here. Scale matters. In a smaller community, it would be a lot harder to get away with behavior like that, at least without being called on it.

I get some "thanks" emails, a number of impolite emails asking for favors, and a sprinkling of hostile ones. I really appreciate the former, and know I should send out more myself. The hostile ones I happily ignore. As for the ones in the middle, I somehow never seem to have time to respond. Ah well.

Last time I got a "you suck" message (it was here, three weeks ago), it was very easy to ignore, because the person who wrote it had already firmly established his credentials as someone whose opinion I cared nought about.

Thanks, Tom, for your work on automake. I use it, and it really does make life easier. The fact that it's actively maintained is quite a luxury. You're one of a handful of people actually doing a usable automated build tool, and I appreciate that.

Build tools

That said, now I'll talk about how automake sucks :) No, actually I'll respond to tromey's thoughtful comments on what a systematically designed build tool should look like.

I agree with the four bulleted points (integrate configuration with build, single invocation, tools as objects, caching), although maybe I don't quite understand the implications of treating tools as objects. Does this simply mean that when you upgrade the C compiler, you rebuild everything?

Derived object caching is particularly interesting, not just because it makes builds faster, but because (if done right) it should make switching between variant builds much more pleasant. I want to be able to build with a -pg option and then have a binary I can profile. Even better, --with-checker (and other tools requiring compile-time support).
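
A sketch of what I mean, with the cache keyed on a hash of the source text plus the exact flags, so a -pg variant gets its own objects and switching back doesn't force a rebuild (cc, the cache directory, and the function name are all placeholders, not a real tool's interface):

  import hashlib, os, shutil, subprocess

  def compile_cached(src, flags, cache_dir="objcache"):
      key = hashlib.sha256(open(src, "rb").read() +
                           " ".join(flags).encode()).hexdigest()
      cached = os.path.join(cache_dir, key + ".o")
      obj = os.path.splitext(src)[0] + ".o"
      if os.path.exists(cached):
          shutil.copy(cached, obj)        # cache hit: no recompile
          return obj
      subprocess.check_call(["cc", "-c", src, "-o", obj] + list(flags))
      os.makedirs(cache_dir, exist_ok=True)
      shutil.copy(obj, cached)
      return obj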

I've also been refining my thoughts since my rebar entry to the Software Carpentry competition. At the time, I proposed an XML syntax, but I don't believe in that any more. I see makefiles as an integral part of the program. They are just as worthy of human readability as the program code. But that's just a syntactic difference.

More fundamentally, I believe that the build tool should be a library combined with a simple lazy functional programming language. This is a somewhat quirky choice for a programming language, but I think well justified for this application. Laziness lets the implementation find and exploit parallelism without explicit direction from the programmer.

Other approaches, such as cons, provide a library over some available scripting language. That's also a reasonable approach, and keeps you from having to invent a new language. But I still favor the new language. For one, scripting languages tend to be pretty big, because they're trying to support general purpose computing. In this context, the language could be tiny. Second, I think the lazy approach has distinct technical advantages.

One such I'll try to illustrate here. A build script in the current rebar prototype looks something like this:

hello_o = compile/C ('hello.c');
foo_o = compile/C ('foo.c');
hello = link/C ([hello_o, foo_o]);

You could implement this sequentially, but because compile is functional (side effects are not permitted), the implementation is free to run the first two lines in parallel. In fact, the current prototype does so. But of course all serious build systems have some equivalent of make -j2.
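
Here's roughly what an evaluator gets to do with that freedom, sketched in Python with a thread pool standing in for the real implementation (compile_c and link_c are stand-ins, and cc is assumed to be on the path):

  from concurrent.futures import ThreadPoolExecutor
  import subprocess

  def compile_c(src):
      obj = src.replace(".c", ".o")
      subprocess.check_call(["cc", "-c", src, "-o", obj])
      return obj

  def link_c(objs, out):
      subprocess.check_call(["cc", "-o", out] + objs)
      return out

  with ThreadPoolExecutor() as pool:
      hello_o = pool.submit(compile_c, "hello.c")   # these two compiles
      foo_o = pool.submit(compile_c, "foo.c")       # run concurrently
      hello = link_c([hello_o.result(), foo_o.result()], "hello")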

More exciting, if an error happens (let's say foo.c doesn't exist), you can easily trace that to the exact line in the build file. I consider providing very high quality error messages to be vitally important. This is certainly an area where auto* falls down. With auto*, by the time you trip over the error, the build script has been macro-substituted and otherwise mangled beyond recognition.

In fact, I'd go so far as to say that a build system should be a tool for diagnosing build problems, with the happy side effect of actually building the target if there aren't any.

On deployment

A fundamental question: should building a project require a separately downloaded build tool, or should a distribution ship with enough of the build tool included to successfully run on target systems?

I lean toward the former. You can't ship a distribution without making some assumptions about available tools. autoconf assumes working sh and make. This is a reasonable assumption on Unix systems, but not on others.

Even so, I think it makes sense for code distributions to ship with a copy of the build tool (a good reason to keep it small and simple!) so that it will run on some subset of legacy systems. One of the advantages of doing it as a functional language is that you can have an implementation optimized for size and portability, rather than performance and features. In particular, the "tiny" flavor might eschew sophisticated dependency analysis.

Drawing the lines

I think it's very difficult to figure out where to draw the lines between a build tool and the rest of the system. In particular, the overlap between the build tool and the package system is considerable.

You can think of /usr/local/ as a very ad-hoc package database. The "make install" and "make uninstall" targets of makefiles do the basic package install and uninstall. /usr/local/lib/pkgconfig/ is one effort to collect metadata about installed packages. Distributions aren't very good at the basic function of finding libraries, and they also aren't very good at maintaining a separate development sandbox, especially one that can be manipulated without root privs. That's a good reason for /usr/local to continue.

The other way works too. Your build tool could build a .deb or .rpm, and you could rely on the package system to do it right. In fact, this approach is already common on Debian for kernel installs (make-kpkg). For one, it makes hardly any sense to install a kernel as non-root.

In the long term, to do things really well will probably require parallel evolution of the package system and the build tools. That's a bit daunting; it's likely to take a lot of time.

I care deeply enough about a good build system that I've put a lot of thought into it and have built a prototype. Tom, if I get the sense that your new approach is good, I will help you with it. I'll start just by dumping some of my thoughts (this can happen in my diary).

By the way, building Ghostscript is the acid test of a build tool. If a build tool can handle Ghostscript, it can handle anything.

mod_virgule

gary just merged the various mod_virgule trees, and the site is now running the new code. One visible consequence is the (n new) notation for article comments. Less visibly, we can now proceed with courage, knowing that adding new features isn't going to cause pain for a future merge.

There are some performance issues that crept back in. Please be patient.

jbig2dec

rillian has been hacking jbig2dec with help from me, and the progress is encouraging. It now decodes 042_10.jb2, with symbol dictionary and text region. My guess is that it will be decoding CVision streams very soon, as that's basically all the functionality they use. Before too long, we'll be inviting people to bring test images to throw at the code.

tromey: I'm sure automake gets more "sucks" emails than a lot of other packages. While I consider myself quite critical of the auto* tool hierarchy, I also very much appreciate the work you've put into automake, and believe that it's very high quality given the constraints it operates in.

The main challenge it faces, I think, is that there isn't a single well-designed tool, but a hodgepodge of different tools with overlapping functionality. Further, the boundaries between these tools are more or less set in stone for historical reasons.

Thus, I feel that it's possible to do a lot better than auto*. On the other hand, people aren't doing it, so from a practical point of view the question is moot.

I'm also very glad that you're thinking about how to move forward. Obviously, you have lots of practical experience that the "redo it from scratch" crowd lacks.

Another factor in your email volume is that automake has a lot of users. This speaks well to it being a useful tool, I think.

Bottom line: keep up the good work, and ignore the whiners.
