11 Nov 2008 rufius   » (Journeyer)

Things I Learned Today…

I’ve been writing a bioinformatics program to test naive Bayesian classification of k-mers (oligonucleotides). I started with some Perl code I was given, wrote some in Python and then moved to Java. In that time I learned a lot about optimizing string manipulation in Python and Java.

Today I was working with a program to build k-mer distributions in a format that an SVM (Support Vector Machine) can read and process. This requires building huge strings and writing them to a file line by line. The files are usually larger than 50 MB, so they’re fairly sizable.

This was fine as long as I was using k-mers of length 6 or less (4^6 = 4096), so lines were no longer than 4096 entries. I noticed a fair slowdown when I built a data set with 7-mers but didn’t think much of it. When I tried 8-mers a little while ago, it was painfully slow. Turns out doing the following with really big strings is bad joojoo:

String line = "";
for (int i = 0; i < 20000000; i++) {
    for (int k = 0; k < i; k++) {
        // each + builds a brand-new String and copies everything so far
        line = line + i;
    }
    line = line + " | ";
}

Obviously I’m not doing exactly that, but you get the idea. Basically your string concatenation starts off really fast, but as the string gets bigger and bigger, it gets slower and slower. I don’t claim to know all the inner workings of the String class, but Java strings are immutable, so every time you concatenate onto the end of a string, the JVM has to allocate a new string and copy the entire old contents plus the added piece into it (think of a C realloc that always moves the block). The cost of each append grows with the length of the string so far, which makes the whole loop roughly quadratic.
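
To put rough numbers on that, here is a minimal sketch that just counts how many characters get copied when a line is built by repeated concatenation. It doesn't measure anything in the JVM. The class and method names, and the guess of 4 characters per entry, are made up for illustration.

```java
public class ConcatCost {
    // Characters copied when `pieces` chunks of `chunkLen` chars are appended
    // one at a time with immutable-string concatenation: each step copies the
    // whole result so far plus the new chunk into a fresh string.
    static long copiedByConcat(int pieces, int chunkLen) {
        long copied = 0;
        long length = 0;
        for (int i = 0; i < pieces; i++) {
            length += chunkLen; // length of the new string after this append
            copied += length;   // all of it is written into the new string
        }
        return copied;
    }

    public static void main(String[] args) {
        int chunkLen = 4; // assumed average entry width, purely illustrative
        for (int k = 6; k <= 8; k++) {
            int entries = 1 << (2 * k); // 4^k possible k-mers
            System.out.println(k + "-mers: " + entries + " entries, "
                    + copiedByConcat(entries, chunkLen) + " chars copied by concatenation, "
                    + ((long) entries * chunkLen) + " chars in the final line");
        }
    }
}
```

The copied count grows with the square of the number of entries, while the final line only grows linearly, which fits the jump in pain from 6-mers to 8-mers.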

To alleviate these situations, this is my solution:
StringBuilder str_bldr = new StringBuilder();
for (int i = 0; i < 20000000; i++) {
    for (int k = 0; k < i; k++) {
        str_bldr.append(i);
    }
    str_bldr.append(" | ");
}
String line = str_bldr.toString();

As you can see above, I’m using a class called StringBuilder. It acts a lot like a Vector/ArrayList of characters: it keeps a resizable internal buffer, so each append just writes into spare capacity (growing the buffer occasionally), and toString copies the buffer once into a regular String. It’s not synchronized; StringBuffer is the synchronized version.
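
For something closer to my actual use case, here is a rough sketch of building one output line from a vector of k-mer counts with StringBuilder. The "label index:value" layout is only an assumption about what the SVM tool expects, and KmerLine and buildLine are hypothetical names, not my real code.

```java
public class KmerLine {
    // Hypothetical helper: formats a class label and a count vector as
    // "label 1:c1 2:c2 ..." (the layout is assumed, adjust for your SVM tool).
    static String buildLine(int label, int[] counts) {
        // Pre-size the buffer so it rarely has to grow; 8 chars per entry is a guess.
        StringBuilder sb = new StringBuilder(counts.length * 8);
        sb.append(label);
        for (int i = 0; i < counts.length; i++) {
            sb.append(' ').append(i + 1).append(':').append(counts[i]);
        }
        return sb.toString(); // single copy of the buffer into a String
    }

    public static void main(String[] args) {
        int k = 2;
        int[] counts = new int[1 << (2 * k)]; // 4^k slots, one per possible k-mer
        counts[0] = 3;
        counts[5] = 1;
        System.out.println(buildLine(1, counts));
    }
}
```

Building each line in its own StringBuilder and writing it out right away also means you never hold more than one line in memory.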

To most this is probably amateur business, but I figure it’s useful for others to know in case they ever wondered. Even though I’m a fairly seasoned programmer, I’ve got new things to learn, and so does everyone else.

Syndicated 2008-11-11 21:39:22 from blog.zacbrown.org - just run away, now.
