27 Jun 2009 Ankh   » (Master)

stan, Python already has regular expression support... if you want only the four operators ^ . * $ then the simplest and most efficient way might be to prefix all the other metacharacters with \ and use the existing regexp support. Most implementations of Perl-style regular expression matching these days can use Boyer-Moore-style delta tables to go massively faster in many common cases. If the code was for your own understanding, though, that's fine, and in any case Rob Pike rocks :-)
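Here is roughly what I mean, as a minimal sketch rather than a finished implementation. The names `compile_minimal` and `match` are made up for illustration, and I'm assuming that "the other metacharacters" are backslash, +, ?, braces, brackets, parentheses and |. It does not try to reproduce every corner of a hand-written matcher (a pattern starting with * is rejected by `re`, for instance).

```python
import re

# Characters that Python's re module treats as special but that a
# minimal matcher supporting only ^ . * $ would treat as literals.
# (Assumed set, chosen for this sketch.)
_EXTRA_META = set("\\+?{}[]()|")


def compile_minimal(pattern):
    """Compile a pattern in which only ^ . * $ keep their special meaning."""
    # Prefix every other metacharacter with a backslash so re sees a literal.
    escaped = "".join("\\" + ch if ch in _EXTRA_META else ch for ch in pattern)
    return re.compile(escaped)


def match(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return compile_minimal(pattern).search(text) is not None
```

With this, `match("a+b", "a+b")` succeeds because the + is taken literally, while `match("^ab*c$", "abbbc")` still uses the anchors and the star.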

I spent some time with Marc Lehmann's String::Similarity module, which seems to do reasonably well at finding similar strings that were OCR'd independently. I wish Google would get a clue and make higher resolution scans: the OCR error rate would drop hugely, they'd get more of the punctuation and footnotes, and they might even start capturing some of the diagrams! The problem, it seems, is that it's more lucrative to have millions of badly scanned books than to have hundreds of thousands of well-scanned ones.
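To give the flavour of comparing independently OCR'd strings, here is a small Python sketch. It uses `difflib.SequenceMatcher` from the standard library, which is not the algorithm String::Similarity uses; the `similarity` helper and the sample strings are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score between 0.0 (nothing in common) and 1.0 (identical)."""
    # ratio() is 2 * (matching characters) / (total characters in both strings).
    return SequenceMatcher(None, a, b).ratio()


# Two hypothetical OCR readings of the same line, plus an unrelated line.
clean = "The quick brown fox"
noisy = "Tbe qu1ck brown f0x"
other = "An unrelated sentence."

print(similarity(clean, noisy))
print(similarity(clean, other))
```

The two readings of the same line score much higher than the unrelated pair, so a threshold on the score can be used to pair up candidate lines; where to put that threshold depends on how noisy the scans are.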

