Older blog entries for salmoni (starting at number 409)

nutella - thanks!

Phew, I slept well on Wednesday, or at least until my neighbour's alarm clock decided to play its little "jaunty" tune over and over at 6.30 the next a.m. Loverly...

Had some interesting correspondence about LSA which I am reading about now.

Currently reading "Survival in the Killing Fields" by Dr. Haing Ngor. Quite a gut wrenching story, and not for the weak of stomach.

I've heard that I may be flying to New Zealand in the near future to visit family - I honestly cannot wait to go over there. I wonder if they have any research positions for HCI available there?

28 Sep 2004 (updated 29 Sep 2004 at 06:09 UTC) »

Got it. Minor corrections. I guess I'm Dr. Salmoni. F***.

btw - that's the first time I wrote my name with a Dr in front.


It begins in just under 5 hours. Curiously, I've actually been nervous all weekend about it (and still am), but last night, I felt more like a child on Christmas Eve than I have done for many years - excited.

Having said that, I seriously think that I will need to do some more work for it. It needs (I believe) another experiment or two to be complete, but opinions as to what a thesis should be vary wildly.

From the example of my friends, I have seen some rejected that contained plenty of work, with the justification that a thesis should be a wonderful and complete self-contained document. Other friends have been accepted and they said that a thesis should be more of a snapshot of the candidate's current work. It all depends upon the examiners really, and it's impossible to call it beforehand.

Wish me luck though, as it would be nice to get it. I have worked hard on the thing, and surprisingly, I have more enthusiasm for it now than I did at any time when testing. I must be made for research...


I promise that I will do some OS coding this week on the LSA engine. I PROMISE!!!

Publishing FOSS manuals

I'm saying that because I have done so little FOSS work lately that I'm losing touch, and that's not good. I've put the SalStat manual up for publishing. Actually it has been published (self-publishing) and I would be surprised if anyone ever buys it, but it's there if anyone needs it. I will also have to change the website to notify people of its availability.

Of course, the original pdf is there for zero cost if anyone wants to print it out themselves, but I thought it would be handy to make it available for those who don't want to print it out (or can't). I'm surprised that more FOSS projects haven't done this already (I can see myself buying the R documentation), as it might generate a nice way to make some money for some projects without much extra work.

I did the SalStat manual through Lulu (www.lulu.com) which is a naff name (except for the wonderful Scottish singer of course!), and was created by Bob Young (he of Red Hat fame). All you do is upload a pdf file (with fonts embedded), select a cover, and choose a price. There, it's published. I've sold one copy already - and that's winging its way to me as I speak ;^). That's more for checking than anything else, but I am curious as to what it will look like. I'm so vain...

24 Sep 2004 (updated 24 Sep 2004 at 12:02 UTC) »

The other night as I was setting up a website to hold all my academic reviews and articles, I just had a thought - wouldn't it be good to have an online system dedicated to academic achievement - a place where a user could edit essays and articles, hold their reviews and the suchlike.

The article editing function would have modules for different styles (eg, APA), and while the text would initially be in html, there would be output options for pdf, latex, ps, dvi, and rtf. The beauty of it being online is that people could edit documents pretty much anywhere they had access to a web browser. Holding article reviews would also be handy.

Other functions: calendar (meetings, conferences), contacts, and a chat facility for collaborative work.

But I am sure that somebody in the F/OSS community has had this idea already and implemented it (or at least tried).

I guess it could be built on top of existing CMSs like PostNuke, Mambo and the like. The whole thing would be like WordPress (with styles and *wicked* export features) plus essential groupware.

If anyone has any ideas, let me know. From the docs, I gather that WordPress can export to pdf, but I cannot see if a range of other exports would be possible.

Busy reading, shouldn't be posting to my journal...

Seven days to go. I had a nice email off a friend today who told me that a viva is more about two people just chilling out and chatting about my viva which made me feel a _lot_ better, but then maybe her work was better than mine! In some ways, I cannot wait to get in there and get it over with: The excitement of getting the doctorate would be immense, but a relative (or especially complete) failure would be demoralising. Still, I'm resilient if nothing else.

I chatted a while back about starting a usability site but somebody beat me to it! That's good because I would rather someone else do the work for me. However, I wanted a place where I could put my articles where nobody would care, so I've started a new site called Arafaelion. The name is Sindarin for Alan (or at least my interpretation!), and I plan on putting some articles and article reviews there from time to time. It's a bit of a vanity project as I am the only member but I am *not* expecting anyone else to be there at all. In fact, I'm not sure that I want anyone else there anyway unless I get a good academic contribution. I have this image of the site as the web equivalent of the dusty academic's office, though an academic who is completely ignored by the rest of the world, just content to wallow in his self-indulgent research.<grin> I can't get that kind of stuff in real life, so I'll have a virtual wastage...

tk - a review of the Powers et al study is here. I plan on doing more article reviews than anything else. The design is atrocious but I'm limited right now with the time I have available and frankly I'd rather write reviews than work on usability testing.

The open source LSA stuff is coming along, though I need to sort out how I am going to deal with plurals. I could use somebody else's solution, but I would rather tackle this myself just for the learning. It seems like an easy problem, but doubtless it will get complicated in no time (the archetypal 'Salmoni' solution). Despite the Powers et al study mentioned above, I still quite like the idea of using LSA for some purposes (which I cannot discuss here just yet).

tk - thanks for the articles! I'll get back to you - I've read a few articles about LSA's success in essay scoring and could do with a decent rebuttal.

tk - LSA != NLP; but it appears to simulate it well enough.

I had a thought: just imagine if every proposition in an essay was reversed (by adding or removing the word 'not'). The meaning would be totally different in most cases, but LSA would still regard it in exactly the same way.

How to get around this? Prepending 'not' to the word would differentiate it, but then I can see some problems arising out of the use of 2 different words for the same concept (though proponents would say that LSA should stand well against this).
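To make the 'not' problem concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the names NEGATORS, strip_negators and mark_negation are made up for this example, and it assumes that negation words end up in the list of excluded words, which is when the two versions of a sentence become indistinguishable to a bag-of-words method.

```python
from collections import Counter

# Hypothetical list of negation words; a real one would need more thought.
NEGATORS = {"not", "no", "never"}

def strip_negators(text):
    """Tokenise on whitespace and drop negators, as a stop-word filter might."""
    return [w for w in text.lower().split() if w not in NEGATORS]

def mark_negation(text):
    """Fold each negator into the word after it: 'not reliable' -> 'not_reliable'.

    A negator at the very end of the text has nothing to attach to and is dropped.
    """
    tokens = []
    negate = False
    for word in text.lower().split():
        if word in NEGATORS:
            negate = True
            continue
        tokens.append("not_" + word if negate else word)
        negate = False
    return tokens

plain = "the method is reliable"
negated = "the method is not reliable"

# With negators filtered out, both sentences give the same word counts.
print(Counter(strip_negators(plain)) == Counter(strip_negators(negated)))

# With the prefix trick, the negated sentence gets its own term.
print(mark_negation(negated))
```

The cost is the one mentioned above: 'reliable' and 'not_reliable' become two unrelated rows in the term by document matrix, so any link between them has to be recovered from the contexts they share.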

Anyway, the LSA engine is coming along. I spent some of Sunday writing the document parser (remove punctuation, make everything lower case, remove non-functional words etc). The only bit that takes more than 5 lines of code is the list of words to exclude, though I haven't dealt with plurals yet. Can be tricky in English...
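A parser along those lines might look something like the sketch below. It is not the actual code, just a guess at the shape of it: parse_document and STOP_WORDS are names invented for the example, and the stop list is a tiny stand-in for a real one. Plurals are left alone here too.

```python
import string

# Illustrative stop list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION = str.maketrans("", "", string.punctuation)

def parse_document(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    words = text.lower().translate(_PUNCTUATION).split()
    return [w for w in words if w not in STOP_WORDS]

print(parse_document("The cat sat on the mat, and it purred."))
```

Note that deleting punctuation outright turns "don't" into "dont" and glues hyphenated words together, which may or may not be what is wanted.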

Go to ChavScum.co.uk for a detailed examination of UK culture. Very funny.

My stomach pains continue, and though I've got an appointment with my doctor, it's not until next week. Curiously, I feel quite physically weak as well. The pains feel like someone has thumped me hard in the solar plexus. I think I'll chat to one of the doctors at work instead.

2 weeks (and one day) to the viva and counting.

Still thinking of a name for the LSA engine. Suggestions welcome.

Latent Semantic Analysis

I've decided to submit a project for hosting somewhere. It will be a latent semantic analysis engine. I understand that there is PyLSI, which uses Python, but (maybe I'm being stupid or missing the obvious) I can't see much going on with it.

Anyway, the code will be written in Python - well, the prototype code will to begin with, and maybe I could see if folks are interested in a C version. After all, LSA is an extremely intensive thing (try doing singular value decomposition on a matrix of integers sized 10K by 10K, and then reconstructing it!). It can take hours, but the Python version will work nicely for smaller corpuses and testing.

I need to find somewhere with good facilities to do it.

And applications? Well, it can be useful for marking essays, comparing essays to test for plagiarism, maybe providing insight into problem solving, and of course it (apparently) works wonderfully filtering out spam.

The engine will have three component parts:

  1. A text parsing engine that reads in documents and orders them into a large term by document matrix;
  2. An engine that takes this matrix, decomposes it with SVD, reduces the coefficients on the diagonal matrix by a user-specified amount (has to be user specified as it is hit-or-miss to get the correct number) and then reconstructs the matrix;
  3. A business engine that takes the new matrix, takes the required input, compares them, and returns an answer.
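As a rough sketch of parts 1 and 3, assuming the documents have already been split into lists of words, something like the following would do for small tests. The function names are invented for the example, and cosine similarity is an assumption (a common choice for comparing documents, but nothing above commits to it). The SVD step (part 2) is deliberately left out; in the full pipeline the comparison would run on the reconstructed matrix rather than the raw counts.

```python
import math

def build_term_document_matrix(documents):
    """Part 1: documents is a list of token lists.

    Returns (terms, matrix) where matrix[i][j] counts terms[i] in documents[j].
    """
    terms = sorted({token for doc in documents for token in doc})
    row_of = {term: i for i, term in enumerate(terms)}
    matrix = [[0] * len(documents) for _ in terms]
    for j, doc in enumerate(documents):
        for token in doc:
            matrix[row_of[token]][j] += 1
    return terms, matrix

def column(matrix, j):
    """Pull out document j as a vector of term counts."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Part 3: cosine similarity between two document vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [["cat", "sat"], ["cat", "purred"], ["dog", "barked"]]
terms, matrix = build_term_document_matrix(docs)
print(terms)
print(cosine(column(matrix, 0), column(matrix, 1)))  # share 'cat'
print(cosine(column(matrix, 0), column(matrix, 2)))  # share nothing
```

Nested lists are fine for toy corpora like this one, but anything approaching 10K by 10K would want a proper array type.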

I've done the LSA engine (# 2). It is *so* complex, maybe 10 lines of code, including imports. The terseness of Python continues to amaze me.
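For anyone wondering how it fits in so few lines, here is a sketch of the same idea. It is not the code described above: it assumes NumPy for the linear algebra, and reduce_rank is a name made up for this example. The parameter k is the user-specified number of singular values to keep.

```python
import numpy as np

def reduce_rank(matrix, k):
    """Decompose with SVD, zero all but the k largest singular values, rebuild."""
    u, s, vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    s[k:] = 0.0            # singular values come back sorted largest first
    return (u * s) @ vt    # u * s scales each column of u by its singular value

counts = [[1, 1, 0],
          [1, 1, 0],
          [0, 0, 1]]
print(reduce_rank(counts, 1))
```

Zeroing the small singular values and multiplying back out gives the best rank-k approximation of the original matrix in the least-squares sense, which is the "reconstruction" step of LSA.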

What shall I call it? PyLSA is too close to PyLSI and besides it implies that it is written in Python (which it will be but only for testing / basic use), so maybe Open LSA, or rather OLSA. Is that too close to ALSA?

And what document corpus(es) to use? I was thinking of using Wikipedia. I'll have to order a CD-ROM as it doesn't seem fair to use all their bandwidth on a pet project of mine. Other alternatives would be Project Gutenberg, news / newspaper web-sites, and maybe Advogato. Heh, perhaps I could arrange a corpus based on Slashdot comments. I wonder how that would fare... ;^)

All else

I'm having these painful stomach cramps - they woke me up early this morning (and on a day off too! Grr!), but eased off in the afternoon. Since then, however, they've come back with a vengeance, so I've hidden myself in my room doing this LSA stuff. Just hope I can get some sleep tonight as I am shattered after a long week at work.
