Advogato: Blog for Ankh

stan, Python already has regular expression support... if you only want ^, ., * and $, then the simplest and most efficient way might be to escape every other metacharacter with a \ and use the existing regexp support. Most implementations of Perl-style regular expression matching these days can use Boyer-Moore-style delta tables to go massively faster in many common cases. If the code was for your own understanding, though, that's fine, and in any case Rob Pike rocks :-)
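
Something like the following is what I have in mind; treat it as a rough sketch rather than a drop-in replacement. The names restricted_to_regex and restricted_search are made up for illustration, and I'm assuming the restricted syntax has no escape character of its own. One caveat: Python's re treats ^ and $ as anchors wherever they appear and rejects a pattern that starts with *, so edge cases may behave differently from a hand-written matcher.

```python
import re

# The only metacharacters the restricted syntax keeps (assumption: no
# escape character of its own, so every other character is a literal).
KEPT = set("^.*$")


def restricted_to_regex(pattern):
    """Backslash-escape every character except ^ . * $ (hypothetical helper)."""
    return "".join(ch if ch in KEPT else re.escape(ch) for ch in pattern)


def restricted_search(pattern, text):
    """Return True if the restricted pattern matches anywhere in text."""
    return re.search(restricted_to_regex(pattern), text) is not None


if __name__ == "__main__":
    # '+' is a literal here, not a quantifier.
    print(restricted_search("a+b", "xa+by"))
    print(restricted_search("a+b", "aab"))
    # Anchors, dot and star keep their usual meaning.
    print(restricted_search("^ab*c$", "abbbc"))
```

The translation is a single pass over the pattern, and all the real matching work is left to the existing regexp engine.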

I spent some time with Marc Lehmann's String::Similarity module, which seems to do reasonably well at finding similar strings that were OCR'd independently. I wish Google would get a clue and make higher resolution scans: the OCR error rate would drop hugely, they'd get more of the punctuation and footnotes, and they might even start capturing some of the diagrams! The problem, it seems, is that it's more lucrative to have millions of badly scanned books than to have hundreds of thousands of well-scanned ones.
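
For anyone who wants to try the same kind of matching without Perl, here is a minimal sketch in Python. It is not String::Similarity: it swaps in difflib.SequenceMatcher from the standard library, which also gives a score between 0 and 1 but computes it differently. The helper names, the 0.8 threshold and the sample strings are all invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings from 0.0 (nothing in common) to 1.0 (identical).

    Uses difflib's ratio, a stand-in for String::Similarity, not the same measure.
    """
    return SequenceMatcher(None, a, b).ratio()


def best_match(line, candidates, threshold=0.8):
    """Return (candidate, score) for the closest candidate, or None.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    scored = [(similarity(line, c), c) for c in candidates]
    if not scored:
        return None
    score, candidate = max(scored)
    return (candidate, score) if score >= threshold else None


if __name__ == "__main__":
    # Invented example: 'w' misread as 'vv' by one OCR pass.
    ocr_a = "The quick brovvn fox"
    ocr_b = ["The quick brown fox", "A slow green turtle"]
    print(best_match(ocr_a, ocr_b))
```

A sensible threshold depends heavily on scan quality, so it would need tuning against real OCR output.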

27 Jun 2009 Ankh » (Master)