Advogato: Blog for baruch

Hacked on cvsps to reduce its memory usage. On a large 400MB CVS repository it needed over 500MB of memory to keep the details.

At first I started converting it from in-memory data structures to gdbm, but that got too tedious after a while.

I then found that the cached data on disk, which mirrors the in-memory data, is only 30MB, so I started looking around for the culprit.

It turned out there were some huge overallocations. 8K was allocated for each log message, while the longest log message in that repository is 1K, and there are over 15K log messages. 4K was allocated for each filename, while the longest filename is 200 bytes. Revision and branch information was kept in oversized hashes where a linked list would do just as well. A few other minor optimizations were needed too.
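To put rough numbers on the log message and filename buffers, here is a back-of-the-envelope sketch in Python (not the C that cvsps is written in). It assumes "over 15K log messages" means exactly 15,000 and that K means 1024 bytes; the helper name fixed_buffer_waste is made up for illustration. The number of files is not given, so only the per-file waste is computed.

```python
# Back-of-the-envelope estimate using the figures from the post.
# Assumptions: exactly 15,000 log messages, K = 1024 bytes.
# The results are lower bounds at best, not measurements.

KIB = 1024
MIB = 1024 * KIB


def fixed_buffer_waste(count, allocated, longest):
    """Return (bytes allocated, minimum bytes wasted) for fixed-size buffers.

    `longest` is the largest payload actually seen, so every buffer
    wastes at least `allocated - longest` bytes.
    """
    total = count * allocated
    wasted = count * (allocated - longest)
    return total, wasted


# Log messages: 8K allocated each, longest message is 1K.
log_total, log_wasted = fixed_buffer_waste(15_000, 8 * KIB, 1 * KIB)
print(f"log messages: {log_total / MIB:.1f} MiB allocated, "
      f"at least {log_wasted / MIB:.1f} MiB unused")

# Filenames: 4K allocated each, longest name is 200 bytes.
# The file count is unknown, so only the per-file figure is shown.
_, per_file_wasted = fixed_buffer_waste(1, 4 * KIB, 200)
print(f"filenames: at least {per_file_wasted} bytes unused per file")
```

Under these assumptions the log message buffers alone take about 117 MiB, of which at least 102 MiB is never used, so sizing each buffer to its actual contents would recover a sizeable share of the 500MB.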

All in all, the memory requirement dropped from 500MB to less than 60MB, which is still a lot but liveable, at least until the repository grows too much.

I added a small statistics collector/reporter to the code to help guide my way, and used the large repository as well as the gaim repository as the basis for my decisions. It was fun.

I did notice a need for a statistics collector library for this kind of thing. It should report max, average, median and similar data. I didn't do the median because I was lazy, but the difference between the max and the average is so large that a median would help here. Dumping the data and showing histograms would also be great for such a task.
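A minimal sketch of what such a collector might look like, written in Python for brevity rather than in C. The class name SizeStats, its methods and the sample lengths are all made up for illustration; this is not the collector that was added to cvsps.

```python
import statistics
from collections import Counter


class SizeStats:
    """Collects sample sizes (e.g. string lengths) and summarises them."""

    def __init__(self):
        self.samples = []

    def add(self, value):
        self.samples.append(value)

    def report(self):
        """Return count, max, average and median of the collected samples."""
        return {
            "count": len(self.samples),
            "max": max(self.samples),
            "average": statistics.mean(self.samples),
            "median": statistics.median(self.samples),
        }

    def histogram(self, bucket):
        """Count samples per bucket of width `bucket`, keyed by bucket start."""
        counts = Counter((v // bucket) * bucket for v in self.samples)
        return dict(sorted(counts.items()))


# Made-up lengths: mostly short messages plus one long outlier.
stats = SizeStats()
for length in [40, 55, 60, 72, 80, 95, 1000]:
    stats.add(length)

print(stats.report())
print(stats.histogram(100))
```

With the made-up samples, the single outlier pulls the average up to about 200 while the median stays at 72, which is the kind of gap between max, average and typical value that the median and a histogram make visible.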

Now I need to clear it at work and submit the patches to the author. I've got one of those all-your-code-are-belong-to-us type of contracts, but with a special clause for OpenSource projects. I still need to get permission for each new project to ensure it doesn't clash with my work-related tasks.

bytesplit: Considering your attack on OpenSource for having too many Editors, how about joining one of the PHP image catalog projects and helping there instead of starting from scratch?

24 Jun 2002 baruch » (Journeyer)