5 Dec 2008 rufius   » (Journeyer)

Archaea Classification Continued

After having thoroughly examined the code for a couple of days and tried it with fragments sampled with replacement, I’ve convinced myself that the code is correct. After thinking about it, it occurred to me that the relative k-mer distribution profiles for larger k-mers (7, 8, 9) might be skewed by even a very small amount of sampling without replacement.

I went ahead and took the difference between the relative distributions for Pyrobaculum calidifontis for 4 different cases:

  • 8-mers - 100% genome vs 99.5% genome
  • 8-mers - 100% genome vs 67% genome
  • 4-mers - 100% genome vs 99.5% genome
  • 4-mers - 100% genome vs 67% genome
Since 4-mers showed little variation between training and full genomes, I felt that was a good base for “lack of difference” in the distributions. Here’s the data:

As can be seen, the variation in relative distributions for the 4-mers is very small, generally no larger than +/- 0.002, and that’s when training on 67% of the genome. Meanwhile, the 8-mers show significant variation: training on 67% of the genome gives a variation of up to nearly +/- 0.2, which entirely changes a profile. Even with 99.5% training, there is variation in the hundredths place, which is enough to skew the profile. This was tested on several organisms; Pyrobaculum calidifontis just happens to be my pick.
That, to me, explains why this technique might not be applicable the way it’s currently designed: the profiles for the organisms don’t match as well. Of course, the other side of this is that since every one of the genomes’ profiles would be skewed, wouldn’t that even it out? Without some serious statistical analysis (and time), I can’t say for sure.
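For anyone wanting to try the same comparison, here is a rough Python sketch of the idea, not the actual classifier code. It assumes a "relative distribution" is simply each k-mer's count divided by the total number of k-mers, so the magnitudes it prints need not match the numbers quoted above. The function names, the 1000-base fragment length, and the random synthetic "genome" are all made up for illustration.

```python
import random
from collections import Counter


def kmer_profile(fragments, k):
    """Relative frequency (count / total) of every k-mer found in the fragments."""
    counts = Counter()
    for frag in fragments:
        counts.update(frag[i:i + k] for i in range(len(frag) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def profile_difference(full, partial):
    """Per-k-mer difference (full - partial); a k-mer absent from a profile counts as 0."""
    return {m: full.get(m, 0.0) - partial.get(m, 0.0) for m in set(full) | set(partial)}


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a genome (assumption); a real run would load the sequence.
    genome = "".join(rng.choice("ACGT") for _ in range(200_000))
    # Cut into non-overlapping fragments and keep 67% of them, without replacement.
    fragments = [genome[i:i + 1000] for i in range(0, len(genome), 1000)]
    training = rng.sample(fragments, int(0.67 * len(fragments)))
    for k in (4, 8):
        diff = profile_difference(kmer_profile([genome], k), kmer_profile(training, k))
        print(k, max(abs(d) for d in diff.values()))
```

Dividing each difference by the full-genome frequency would give a relative change per k-mer instead, which may be closer to what matters for larger k, where each individual k-mer is seen far fewer times.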
Here also is a comparison of distributions:
From this, it can be seen that sampling with replacement (100 pieces) is pretty close to sampling 95% of the genome with replacement. Those are two separate pieces of software, which is what leads me to believe the software is written correctly.
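To make the two sampling modes concrete, here is a small Python sketch of how they could differ. It is an illustration under my own assumptions, not either of the two programs mentioned above; the function names and the fragment length are hypothetical.

```python
import random


def sample_with_replacement(genome, n_fragments, fragment_len, rng):
    """Draw fragments at independent random starts; fragments may overlap or repeat."""
    last_start = len(genome) - fragment_len
    starts = [rng.randint(0, last_start) for _ in range(n_fragments)]
    return [genome[s:s + fragment_len] for s in starts]


def sample_without_replacement(genome, fraction, fragment_len, rng):
    """Cut the genome into non-overlapping fragments and keep a random subset of them."""
    fragments = [genome[i:i + fragment_len] for i in range(0, len(genome), fragment_len)]
    n_keep = max(1, round(len(fragments) * fraction))
    return rng.sample(fragments, n_keep)


if __name__ == "__main__":
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(10_000))  # synthetic stand-in
    print(len(sample_with_replacement(genome, 100, 500, rng)))      # 100 pieces
    print(len(sample_without_replacement(genome, 0.67, 500, rng)))  # 67% of the fragments
```

With replacement, the same stretch of sequence can be counted more than once; without replacement, whatever is left out of the training set is simply missing from the profile.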

Syndicated 2008-12-05 15:21:55 from Zac Brown
