Log Entropy models
I had problems when I last upgraded to Gensim 0.7.8. The main issue was
that the package I imported wasn't necessarily the one being used: quite
often, it seemed as though the top-level import came from one install
whereas another import came from somewhere else. The net result was that
parts of my software were looking for an id2word method in a dictionary
where there was none before.
However, I still wanted to try 0.7.8 and I found a way. I downloaded and
untarred it, and renamed it 'gensim078'. Then I changed each 'from
gensim import *' statement to 'from gensim078 import *', which seems to be
doing the trick. I'm sure there are better ways to do it, but this is working
for me so I'm happy.
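One quick way to see which install an import actually resolves to is to ask the module where it was loaded from. This is only a minimal sketch: describe_module is a hypothetical helper of mine, not part of Gensim, and it uses the standard-library "json" package as a stand-in so it runs anywhere.

```python
import importlib


def describe_module(name):
    """Import `name` and report where it was loaded from and its version, if any."""
    module = importlib.import_module(name)
    return {
        "name": module.__name__,
        # __file__ shows which install on sys.path actually won the import.
        "path": getattr(module, "__file__", None),
        # Not every package defines __version__, so fall back to None.
        "version": getattr(module, "__version__", None),
    }


if __name__ == "__main__":
    # "json" is a stand-in; swap in "gensim" or "gensim078" to check those.
    info = describe_module("json")
    print(info["name"], info["path"], info["version"])
```

Running it once for the top-level package and once for a submodule should show whether both come from the same directory.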
The advantages are that a) it's faster, particularly for similarity
calculations, and b) I now have access to the Log Entropy model, which I'm
building for G1750.
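For reference, here is a rough sketch of what log-entropy weighting does, written in plain Python rather than with Gensim. It follows one common textbook formulation (local weight log(1 + tf), global weight 1 plus the term's normalised entropy across documents); Gensim's own implementation may differ in details such as the normalising denominator, and log_entropy_weights is a name I made up for the illustration.

```python
import math
from collections import Counter


def log_entropy_weights(docs):
    """Weight each term in each tokenised document by log-entropy.

    local weight  = log(1 + tf)
    global weight = 1 + sum_j p_j * log(p_j) / log(N)
    where p_j = tf in document j / total tf of the term, N = number of documents.
    """
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("need at least two documents (log(N) must be non-zero)")

    counts = [Counter(doc) for doc in docs]

    # Total frequency of each term across the whole corpus.
    totals = Counter()
    for c in counts:
        totals.update(c)

    # Global weight: 1 for a term confined to a single document,
    # 0 for a term spread evenly over every document.
    global_weight = {}
    for term, total in totals.items():
        entropy_sum = 0.0
        for c in counts:
            tf = c.get(term, 0)
            if tf:
                p = tf / total
                entropy_sum += p * math.log(p)
        global_weight[term] = 1.0 + entropy_sum / math.log(n_docs)

    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in c.items()}
        for c in counts
    ]
```

The effect is that a word spread evenly across the corpus is weighted down towards zero, while a word concentrated in a few documents keeps most of its local weight.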
Later tonight, I'll adjust the dictionary and begin pruning words that appear
across lots of documents to see if that improves the focus. The program does
seem a little 'fuzzy' as it is, but that is quite a human characteristic so I'm
not too worried. Either way, it will help me explore vector models and
understand them better myself.
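The pruning step can be sketched in plain Python as a document-frequency filter. This is an illustration under my own assumptions, not Gensim's dictionary API: prune_common_terms and the max_doc_fraction threshold are hypothetical names, and the right cut-off would need tuning against the real corpus.

```python
from collections import Counter


def prune_common_terms(docs, max_doc_fraction=0.5):
    """Drop terms that occur in more than `max_doc_fraction` of the documents."""
    n_docs = len(docs)

    # Document frequency: count each term once per document it appears in.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    too_common = {
        term for term, df in doc_freq.items() if df / n_docs > max_doc_fraction
    }
    # Keep token order within each document; only the common terms are removed.
    pruned = [[tok for tok in doc if tok not in too_common] for doc in docs]
    return pruned, too_common
```

Returning the set of removed terms alongside the pruned documents makes it easy to eyeball whether the threshold is throwing away anything useful.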
Although the results of the word-pair semantic association task were poor,
I'm not dismayed (too much!) because my whole construction is far from
perfect and there is lots of room for improvement. The task is also useful
because it gives me an indication of accuracy by a different means from the
20NG categorisation task. When I create a new corpus, I should ideally
subject it to a battery of tests, each designed to test a different thing. With
the results of these, I can work out whether the corpus is heading in the right
direction or not. It's good to have these tools even if the results are not
(initially) going how I wanted them to.
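One common way to score a word-pair association task is to compare the model's cosine similarity for each pair against human ratings using a rank correlation. The sketch below assumes that setup, which may not match exactly how my own test is scored; the function names are hypothetical and the ranking helper assumes there are no tied values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two tie-free sequences of equal length."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length, at least 2 items")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A correlation near 1 would mean the model orders the word pairs much as people do; a value near 0 would mean its ordering is essentially unrelated to the human one.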
I'm turning into a perfectionist. I really need to release something useful
before I refine... Release early, release often...