16 Oct 2003 chaoticset   » (Apprentice)

Still working on the palindrome problem. Still no solution in sight.

Right now, I've got a system where I'm producing data from word interactions but the process for it is too slow when applied to the dataset (1.7 megs) that it needs to be applied to. Right now it's too slow to handle a tenth of that within four hours.

I'm working on ways to speed it up, but I have a feeling they're few and far between. My first try is a binary encoding of whether or not a word contains various letters of the alphabet (because then I can apply an operation to both and find out whether the words contain any common letters. If two words don't contain common letters, they sure as hell don't need lengthy regex comparison.)

Could be that I'm duplicating effort here, that the regex engine does this already, etc., but I don't think so, and even if it does, it's doing it 6 times or so, and I'm just trying to do it once.

The other side of this problem is that the hash insertion is taking longer and longer for each successive word. With a hundred, it's a non-issue. With over a billion...I'm thinking I'll need a database. I've got MySQL on this machine, and if I really had to I could install something else, so I'll noodle that around, implement it, run it on my test data, and see what happens. If it looks appreciably faster I'll throw it at 1/10 the amount of the full dataset, see what that looks like.

And, if it blazes through that in an hour or less, then we'll look at running the whole dataset through it over the weekend or something.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!