12 Sep 2004 salmoni   » (Master)

Latent Semantic Analysis

I've decided to submit a project for hosting somewhere. It will be a latent semantic analysis engine. I understand that there is PyLSI, which uses Python, but (maybe I'm being stupid or missing the obvious) I can't see much going on with it.

Anyway, the code will be written in Python (the prototype will be, at least, to begin with), and maybe I could see if folks are interested in a C version later. After all, LSA is an extremely intensive thing (try doing singular value decomposition on a 10,000 by 10,000 matrix of integers, and then reconstructing it!). It can take hours, but the Python version will work nicely for smaller corpora and for testing.

I need to find somewhere with good facilities to do it.

And applications? Well, it can be useful for marking essays, comparing essays to test for plagiarism, maybe providing insight into problem solving, and of course it (apparently) works wonderfully for filtering out spam.

The engine will have three component parts:

  1. A text parsing engine that reads in documents and assembles them into a large term-by-document matrix;
  2. An engine that takes this matrix, decomposes it with SVD, reduces the singular values on the diagonal matrix by a user-specified amount (it has to be user-specified as getting the correct number is hit-or-miss), and then reconstructs the matrix;
  3. A business engine that takes the new matrix, takes the required input, compares them, and returns an answer.
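For the first component, something like the following would do as a starting point. This is just a minimal sketch: the toy corpus and names here are my own placeholders, and a real parser would read files from disk and do proper tokenisation.

```python
from collections import Counter

# Hypothetical toy corpus; a real run would read documents from disk.
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log",
}

# Count term frequencies per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Stable term and document orderings define the matrix axes.
terms = sorted({t for c in counts.values() for t in c})
doc_names = sorted(docs)

# Term-by-document matrix: one row per term, one column per document.
matrix = [[counts[d][t] for d in doc_names] for t in terms]
```

A Counter returns 0 for missing terms, so documents that lack a term just get a zero cell without any special-casing.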

I've done the LSA engine (# 2). It is *so* complex, maybe 10 lines of code, including imports. The terseness of Python continues to amaze me.
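In case anyone is curious, the core of it looks roughly like this (a sketch with a made-up matrix and a made-up cut-off k, using NumPy rather than whatever library the final version ends up on): decompose, zero out the smaller singular values, and multiply back together. The cosine comparison at the end is the sort of thing the business engine (#3) would do.

```python
import numpy as np

# Hypothetical term-by-document count matrix (terms x documents).
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])

# Decompose: A = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (the user-specified amount).
k = 2
s_reduced = s.copy()
s_reduced[k:] = 0.0

# Reconstruct the rank-k approximation of the original matrix.
A_k = U @ np.diag(s_reduced) @ Vt

# Compare two documents in the reduced space: cosine similarity
# between their columns in the reconstructed matrix.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(A_k[:, 0], A_k[:, 1])
```

The whole trick of LSA is in that truncation: throwing away the small singular values smears related terms onto each other, so two documents can score as similar even when they share few exact words.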

What shall I call it? PyLSA is too close to PyLSI and besides it implies that it is written in Python (which it will be but only for testing / basic use), so maybe Open LSA, or rather OLSA. Is that too close to ALSA?

And what document corpus (or corpora) to use? I was thinking of using Wikipedia. I'll have to order a CD-ROM as it doesn't seem fair to use all their bandwidth on a pet project of mine. Other alternatives would be Project Gutenberg, news / newspaper websites, and maybe Advogato. Heh, perhaps I could arrange a corpus based on Slashdot comments. I wonder how that would fare... ;^)

All else

I'm having these painful stomach cramps - they woke me up early this morning (and on a day off too! Grr!), but eased off in the afternoon. Since then, however, they've come back with a vengeance, so I've hidden myself in my room doing this LSA stuff. Just hope I can get some sleep tonight as I am shattered after a long week at work.
