I spent some time with Marc Lehmann's String::Similarity module, which seems to do reasonably well at finding similar strings that were OCR'd independently. I wish Google would get a clue and make higher-resolution scans: the OCR error rate would drop hugely, they'd get more of the punctuation and footnotes, and they might even start capturing some of the diagrams! The problem, it seems, is that it's more lucrative to have millions of badly scanned books than to have hundreds of thousands of well-scanned ones.
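
For anyone curious what the matching step looks like, here is a rough sketch of the idea in Python. To be clear, this is not String::Similarity itself: that is a Perl module, and I'm swapping in `difflib.SequenceMatcher` from the Python standard library, which also gives a score between 0 and 1 but is not the same algorithm. The `best_match` helper and the 0.8 threshold are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the strings are identical."""
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long inputs.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_match(line: str, candidates: list[str], threshold: float = 0.8):
    """Return (candidate, score) for the closest candidate,
    or None if nothing reaches the (assumed) threshold."""
    best = max(candidates, key=lambda c: similarity(line, c), default=None)
    if best is None:
        return None
    score = similarity(line, best)
    return (best, score) if score >= threshold else None


if __name__ == "__main__":
    # Two hypothetical OCR readings of the same line, plus an unrelated one.
    reading_a = "tbe qu1ck brown fox"
    others = ["the quick brown fox", "a completely different line"]
    print(best_match(reading_a, others))
```

The threshold would need tuning against real scans: set it too low and unrelated lines get paired up, too high and heavily damaged lines never find their partner.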