Older blog entries for salmoni (starting at number 580)

Busy, busy, busy.

But not much to show for it. My wife and I are expecting our first child next Friday. This will be a nervous week indeed.

For my statistics project, I wrote a Python module to import SPSS files and was wondering whether anyone would be interested in it if I released it as open source. It's one piece of code that would greatly benefit from community testing. So far, it seems to work without problems on the SPSS files I have, but SPSS have added extra things to the format over time. Cleverly (or rather obviously, but it's nice to know that they've done it), older versions of the software can still read the new formats; they just ignore the extra bits. My software does just that: it ignores all the extra bits, though I suspect that there may be some cases that it misses completely. For example, the architecture: I believe mine only reads one byte order (endianness).
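
To make the byte order point concrete, here is a minimal sketch of how a reader might detect it. It is not the module described above. It assumes the header layout as described in PSPP's documentation of the system file format: a 4-byte "$FL2" tag, a 60-byte product name, then a 32-bit layout code that should read as 2 or 3. The function name is made up for illustration.

```python
import struct


def detect_byte_order(header: bytes) -> str:
    """Return the struct prefix ('<' or '>') that makes the layout code sensible."""
    if header[:4] != b"$FL2":
        raise ValueError("not an SPSS system file (missing $FL2 tag)")
    # Assumed layout: 4-byte tag + 60-byte product name, then the layout code.
    raw = header[64:68]
    for prefix in ("<", ">"):
        (layout_code,) = struct.unpack(prefix + "i", raw)
        if layout_code in (2, 3):
            return prefix
    raise ValueError("unrecognised layout code")
```

Every later integer and float in the file would then be unpacked with the returned prefix, which is what lets one reader cope with both byte orders.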

But it could be useful for some people. There are already FOSS versions in R and PSPP (I think the R version came from the PSPP code) but they are in C and a Python version might be useful. I wonder if SciPy has it? Currently, it can import via COM, but that is Windows only so of limited application. A pure Python module would have no such restrictions. Scientists using Python might appreciate being able to cut one more string to SPSS so I think I will release it. If anyone here is interested, let me know and it could be the spur that motivates me to release it!

I've started a usability consultancy officially instead of doing work ad hoc. This should be fun as I have to learn marketing very quickly indeed. In the Philippines, I think there is one independent consultant who is serious about the work (ie, has advertising) and a few others who seem to do it as a sideline. In nearby Hong Kong, there are two that I can find: Apogee and Customer Input. Looking through Google AdWords shows that there is only a small market in HK compared to, say, the UK. However, we will be operating internationally so the location is of less importance. It does create some difficulties in terms of meeting the clients, but for general applications, I can easily get a good sample of users of varying abilities. It's also about time to put my remote testing experience to, erm, the test.

I'm also toying with the idea of joining the UPA, for whom I gave a talk some while ago, but I need to check whether I will get my money's worth. It could be good for being noticed by potential customers.

5 Jun 2008 (updated 5 Jun 2008 at 12:17 UTC) »

I have finished a new Python module which is designed to import SPSS data files as a Python object. It seems to work quite well with the data sets I have. Not all functions are enabled yet: some of the type 7 records are not handled. For some of these I have to reverse engineer the solution, and for others I need new data sets that use the subtype records.
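
For anyone curious how unknown type 7 records can be stepped over safely, here is a rough sketch. It is not the module's actual code. It assumes the record layout described in PSPP's documentation: after the record type come three 32-bit integers (subtype, element size, element count), followed by size × count bytes of data. The function name is hypothetical.

```python
import io
import struct


def read_extension_record(stream, byte_order="<"):
    """Read one type 7 record whose record-type field has already been consumed.

    Returns (subtype, payload). The payload is always read in full, so a
    caller that does not understand the subtype can simply discard it.
    """
    subtype, size, count = struct.unpack(byte_order + "3i", stream.read(12))
    payload = stream.read(size * count)
    return subtype, payload
```

Because the length is stored in the record itself, a reader can ignore subtypes it does not know and still land on the start of the next record.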

When it's a bit more solid, I will probably release it under the AGPL.

llkc - I guess I didn't explain well. My idea isn't an interactive debugger, though there are elements of that in it. The best thing is to produce the code.

In other news, I should (hopefully, all being well!) be doing some consultancy soon. I'm not sure of the size of the job, but it sounds like a good one (ie, interesting and a challenge).

I noticed that Google have opened App Engine up completely now. Signups are ongoing here and it doesn't seem as though there is any limit to the number of users this time. Prices are also available here.

Perhaps this has been done, but I have been thinking about my ideal IDE for Python.

I like editors and have tried many. I also like interactive interpreters and have tried many. But my issue is that I often have to have both running at the same time (and yes, I know there are editors with interpreters running in them at the same time, but that's not what I'm thinking of).

But what about a dynamic editor/interpreter?

It sounds fanciful and I'm just beginning to think of the architecture but here is how it would work with Python.

You type some code in. It works interactively, so only executes when a block is entered. Or it may not. Each code block has a flag next to it that when activated causes the code to be marked as executable. When executed, only that code is run.

Ok, still fairly basic.

But what if the user could also interactively run code separately from the stored blocks? So if I type in a large program, I can still type 'print "hi"' in the middle, click it, and it and only it will run.

But even better: what about if I can execute the code block by block or even line by line?

Again, this is not totally revolutionary. But what if you could change existing code and cause the program (assuming that it's still running and waiting for the user to enter the next code) to step back to a previous state? And then run up to the end with the new code?

And then when saving, the user can save a working version (with the interactive bits in place) and a "parade" version with all the interactive bits taken out.
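
A toy sketch of the idea, just to pin it down. Everything here is hypothetical (the class and method names are invented), and "stepping back" is faked by replaying every enabled block from a fresh namespace, which repeats any side effects and would be slow for long programs.

```python
class BlockSession:
    """Code blocks sharing one namespace; editing a block replays the session."""

    def __init__(self):
        self.blocks = []      # list of [source, enabled] pairs
        self.namespace = {}

    def add(self, source, enabled=True):
        """Store a block and, if its flag is set, execute it straight away."""
        self.blocks.append([source, enabled])
        if enabled:
            exec(source, self.namespace)
        return len(self.blocks) - 1

    def run_once(self, source):
        """Run a scratch snippet in the live namespace without storing it."""
        exec(source, self.namespace)

    def edit(self, index, source):
        """Replace a block, then rebuild state by replaying all enabled blocks."""
        self.blocks[index][0] = source
        self.namespace = {}
        for block_source, enabled in self.blocks:
            if enabled:
                exec(block_source, self.namespace)

    def export(self):
        """The 'parade' version: only the stored, enabled blocks."""
        return "\n".join(src for src, enabled in self.blocks if enabled)
```

Scratch snippets run through run_once never appear in export(), which is the split between the working version and the parade version.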

I'm not sure this has been done (though if anything can, it's probably Eclipse or Emacs). I have probably described this idea poorly, but I think it could be a good thing that unifies the best of editor/IDE operation with the best of interpreter operation.

I'll have to work on a prototype and test it myself to see if it works.

29 May 2008 (updated 29 May 2008 at 09:13 UTC) »
Python Consulting

This is an announcement that I will be doing Python consulting from now. My expertise covers Python, wxPython, NumPy and SQLAlchemy; and the primary area of my work is on numeric analysis / statistics, though of course you get a PhD in human-computer interaction thrown in if you want interfaces made.

If anyone has any Python work they would like help with, I can offer a discount on open source code. I can work internationally as long as requirements can be sent electronically. The best way to contact me is at salmoni - at - gmail.com

Apart from that, all is well here in the Philippines! The coding on the new project is going well and I'm considering farming off the database viewer/importer tool as a separate component for database management. I'm not exactly sure what functionality would be necessary for this, but suffice to say that the basics should be easy to implement (and the middling / advanced stuff a nightmare!).

Factorial ANOVA of large sets

I've also solved all the problems concerning factorial analysis of variance for extremely large datasets (ie, those too large to fit into memory). I will crack on with this code now to get it done and to make an industrial quality heavy-weight data analysis tool. This will be open sourced in time, after testing anyway. The real problems that I have are a) getting hold of an environment (ie, a machine with a massive database on it), and b) getting comparison results, though SAS should be able to deliver on this. I understand that SPSS will face problems if the data are too big for memory; but SAS can work around this just like my code can.
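
The general trick for out-of-memory ANOVA is worth sketching, with the caveat that this is one common approach and not necessarily what my code does. The assumption is that the data can be streamed as (cell, value) pairs, for example from a database cursor, so only a count, a sum and a sum of squares per cell need to stay in memory. The function name is hypothetical.

```python
from collections import defaultdict


def cell_sums_of_squares(rows):
    """Single pass over (cell, value) pairs; cell is a tuple of factor levels.

    Returns (ss_between_cells, ss_within_cells).
    """
    stats = defaultdict(lambda: [0, 0.0, 0.0])  # n, sum, sum of squares
    for cell, value in rows:
        entry = stats[cell]
        entry[0] += 1
        entry[1] += value
        entry[2] += value * value
    n_total = sum(n for n, _, _ in stats.values())
    grand_sum = sum(s for _, s, _ in stats.values())
    cell_term = sum(s * s / n for n, s, _ in stats.values())
    ss_between = cell_term - grand_sum ** 2 / n_total
    ss_within = sum(sq for _, _, sq in stats.values()) - cell_term
    return ss_between, ss_within
```

Memory use grows with the number of cells, not the number of rows. Raw sums of squares can lose precision on badly scaled data, so a production version would probably want a more stable update rule.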

Moore's Law makes this of decreasing utility; but it's nice to have software that you know can handle any task.


I've also enquired about submitting an article to a Python journal about how to use the code module to implement an interactive interpreter and embed it within a Python program. This comes from work on the statistics program where I wrote one for quick debugging and found it so good that I extended it a little to be used as a permanent tool.

One problem we found is that when declaring and using a variable, a user would have to write:

x = newvar()



It would make more sense (to novices) to write


It does this now. What I did was override the code.InteractiveInterpreter.showtraceback method to catch NameErrors (which are raised when x is sent to newvar because x doesn't exist). Then the code works out the command and sends it again to the newvar method but with the x in quotes. It's minor stuff but less annoying to users.
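
A minimal sketch of that override, written for current Python 3 and not copied from the application. The class name is invented, and it assumes the NameError message has the usual "name 'x' is not defined" form. The only thing it retries is a newvar(x) call with an undefined x.

```python
import code
import re
import sys


class ForgivingInterpreter(code.InteractiveInterpreter):
    """Retry newvar(x) as newvar('x') when x is not defined."""

    def __init__(self, locals=None):
        super().__init__(locals)
        self._source = ""
        self._retrying = False

    def runsource(self, source, filename="<input>", symbol="single"):
        # Remember what the user typed so showtraceback can rewrite it.
        if not self._retrying:
            self._source = source
        return super().runsource(source, filename, symbol)

    def showtraceback(self):
        # Called from inside the except block, so exc_info is still live.
        exc_type, exc, _ = sys.exc_info()
        if exc_type is NameError and not self._retrying:
            found = re.search(r"name '(\w+)' is not defined", str(exc))
            if found:
                name = found.group(1)
                call = re.compile(r"\bnewvar\(\s*%s\s*\)" % re.escape(name))
                fixed, changes = call.subn("newvar(%r)" % name, self._source)
                if changes:
                    self._retrying = True
                    try:
                        self.runsource(fixed)
                    finally:
                        self._retrying = False
                    return
        super().showtraceback()
```

Any other error, or a second failure during the retry, falls through to the normal traceback display.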

And if a database has awkward variable names that are not valid variable names in Python, they cannot be used: so I added a catcher to showtraceback that catches AttributeErrors and tests to see if a string has been issued with a program method:

"Variable 1 (2000)".variance()

This would never work normally within Python without overriding the string class (which is another possibility). However, the catcher above can catch this attribute error and redirect the 'variance()' bit to the proper variable definition.
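
The same hook can be sketched for the string case. Again this is an illustration under assumptions, not the application's code: DATASET and METHODS are hypothetical stand-ins for the real variable store and statistics functions, and only the exact form "name".method() with no arguments is recognised.

```python
import code
import re
import statistics
import sys

# Hypothetical stand-ins for the application's variables and its tests.
DATASET = {"Variable 1 (2000)": [2.0, 4.0, 6.0]}
METHODS = {"variance": statistics.variance, "mean": statistics.mean}

# Matches a quoted name followed by .method() and nothing else.
CALL = re.compile(r"""^\s*(["'])(?P<name>.+?)\1\.(?P<method>\w+)\(\)\s*$""")


class StringMethodInterpreter(code.InteractiveInterpreter):
    """Redirect "variable name".method() to the named dataset variable."""

    def __init__(self, locals=None):
        super().__init__(locals)
        self._source = ""
        self.results = []

    def runsource(self, source, filename="<input>", symbol="single"):
        self._source = source
        return super().runsource(source, filename, symbol)

    def showtraceback(self):
        exc_type = sys.exc_info()[0]
        match = CALL.match(self._source)
        if exc_type is AttributeError and match:
            name, method = match["name"], match["method"]
            if name in DATASET and method in METHODS:
                self.results.append(METHODS[method](DATASET[name]))
                return
        super().showtraceback()
```

Here the result is collected in a list to keep the sketch testable; a real tool would display it to the user instead.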

All this just means that the application is beginning to work around its users instead of demanding that they work around it.

I also added lots of alternative names for descriptive tests so:


all call the same function. This helps because when I've used a new statistics program, I have to find out the exact name for the functions. This way, I don't have to remember which one: I just pick a common one, and away I go! :-)
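
In Python this kind of aliasing costs almost nothing. A tiny sketch, with alias names that are purely illustrative and not the ones the program actually uses:

```python
import statistics


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return statistics.fmean(values)


# Hypothetical alternative names, all bound to the same function object.
average = avg = arithmetic_mean = mean
```

Since every name points at one function, a fix to mean is automatically a fix to all of its aliases.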

I spent the weekend wrestling with factorial ANOVA code which was nice and fun. All seems to work alright but there is still some finishing and of course testing to do before it's anything like releasable. Plus I need to work on how to work things like post-hoc tests and simple effects for when a significant effect is observed. Lots of fun!

I've been having lots of fun working through factorial statistics code. Actually, I'm not being sarcastic: I've spent so much time preparing the data ready for analysis (that's the part that takes the most work) that the statistics code itself is a nice easy stroll. And curiously, it's fun. The preparation stage doesn't provide so much in the way of motivation because it doesn't really do anything from an end user's perspective. But the stats code can run a factorial analysis of variance on arbitrary factors, and that is a rather nice thing indeed. It actually does something!

In other news, the naming of the business (branding etc) is coming to a head and hopefully we should have formed our company soon and bought all the URLs etc. We had a blitz last weekend and managed to get some ideas that I thought were rather good. I won't mention them here because of squatters, but when we're ready, I will be able to announce them.

And once I have announced them, I can make a public release of the software! Yay!

The above factorial code won't be in it though, as it's nowhere near tested (though I should just add it anyway for users to look at and shake their heads at). The problem is that I like to release things that actually work properly. That goes against the "release early, release often" mantra, so I should learn to lose control and just get code out there.

Thanks to everyone from here who completed the questionnaire I linked to in my last diary entry. The information has been tremendously useful! And as I promised, the code will be open source, probably under the AGPL (whichever one we choose - apparently there are two, both of which are very similar).

Statistics software questionnaire

If anyone uses statistics software of any sort (whether Excel, SPSS, R, SAS or anything), I would be grateful if you could help by completing a survey we have put up at SurveyMonkey. It shouldn't take longer than a few minutes to complete and there are only ten questions. Feel free to expand upon your answers if possible.

Thank you very much in advance to those who complete it.

btw, it's all for the open source software that we're producing. We're stuck for a name now.

The market research has been going well and in our favour. We used a survey and interviews (blind for the first half to get opinions about the field and open the second half to get opinions about our product). We certainly have a strong market here.

And the development is going well though I have been stuck a lot on importing data. However, the tool is extremely flexible and useful - and it's great for merging data from different sources into one unified dataset which is something I think advanced users will appreciate.

I have also been trying to work on the interactive results without too much luck and have instead asked the opinions of the very knowledgeable people on the wxPython mailing list. They seem to come up with extremely helpful answers, but why not ask here?

My situation is this: I have a wxHTML frame displaying HTML results. These need to be dynamic - users will be able to select options that will mean the HTML needs to be changed and then redisplayed. The best way I can think of dealing with this is just to get the HTML (stored in a temporary memory file system) and remove the old code and insert the new code in its place and then re-display it. Does this seem like too much of a bad hack?
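
To show what I mean by removing the old code and inserting the new, here is a sketch of the string handling only. The comment markers are a convention invented for this example, and the function name is hypothetical. The returned string would then be handed back to the HTML window (wx.html.HtmlWindow has a SetPage method that takes the page source).

```python
import re


def replace_section(html, section, new_body):
    """Swap the content between <!--section:NAME--> and <!--/section:NAME-->."""
    start = "<!--section:%s-->" % section
    end = "<!--/section:%s-->" % section
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    replacement = start + new_body + end
    # A function replacement stops backslashes in new_body being interpreted.
    new_html, count = pattern.subn(lambda _match: replacement, html, count=1)
    if count == 0:
        raise KeyError(section)
    return new_html
```

Keeping the markers in the output means the same section can be replaced again when the user changes another option.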

wxPython Sizers

I just wasted most of a day trying to sort out the data import GUI and problems with sizers. It was quite frustrating, but I managed to get most of the problems sorted out finally. It is now connecting to various databases and showing a sample of data which users can browse and select what they want to import from.

Oh, and it imports the variables too which is good. It is so nice when problems eventually finish. I have lots more work to do tomorrow (csv importing - I wrote my own csv module to deal with little problems like missing data in the middle of a row) but I am also going to my wife's family's village for a fiesta. It's been raining all day, so here's hoping the weather improves. Here's a picture of the village in sunnier times.
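
For comparison, the missing-data case can be sketched with the standard library's csv module in place of a hand-written parser. This is not my module. The function name is made up, and it assumes a missing value should come back as None and that short rows should be padded to the expected width.

```python
import csv
import io


def read_rows(text, n_columns):
    """Yield each row as a list, with empty or absent fields as None."""
    for row in csv.reader(io.StringIO(text)):
        # Trim long rows and pad short ones to the expected width.
        row = row[:n_columns] + [""] * (n_columns - len(row))
        yield [field.strip() or None for field in row]
```

Values stay as strings here; converting them to numbers would be a separate step once the variable types are known.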

Either way, the work is coming along really nicely now. The project is not yet 50% finished (my estimate), but it already imports data from databases, allows a range of operations on them, and can produce even complex descriptive analysis. It's looking good so far.
