Recent blog entries for salmoni

I just released a Python-based interpreter for the CESIL language.

https://github.com/salmoni/CESILPy

When I was first being taught computers (1983-1985), this was our first language. It had a massive total of 14 instructions and initially ran as a batch job. In my school, we had a Research Machines thing that ran the code but getting access was hard.
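
For anyone who never met CESIL, here is a rough sketch of how a single-accumulator interpreter for it might look. This is not the code in CESILPy. The `(label, instruction, operand)` tuple format and the `run_cesil` name are my own simplifications for illustration, and the instruction set is as I remember it. The rounding behaviour of DIVIDE in particular is an assumption.

```python
def run_cesil(program, data=()):
    """Run a list of (label, instruction, operand) tuples and return the output text."""
    # Map each label to the index of the line that carries it.
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    variables = {}
    inputs = iter(data)
    acc = 0          # CESIL has a single accumulator
    out = []
    pc = 0

    def value(operand):
        # An operand is either an integer literal or a variable name.
        return operand if isinstance(operand, int) else variables[operand]

    while pc < len(program):
        _, op, operand = program[pc]
        pc += 1
        if op == "HALT":
            break
        elif op == "IN":
            acc = next(inputs)
        elif op == "OUT":
            out.append(str(acc))
        elif op == "PRINT":
            out.append(operand)
        elif op == "LINE":
            out.append("\n")
        elif op == "LOAD":
            acc = value(operand)
        elif op == "STORE":
            variables[operand] = acc
        elif op == "ADD":
            acc += value(operand)
        elif op == "SUBTRACT":
            acc -= value(operand)
        elif op == "MULTIPLY":
            acc *= value(operand)
        elif op == "DIVIDE":
            acc //= value(operand)  # assumption: integer division, floored
        elif op == "JUMP":
            pc = labels[operand]
        elif op == "JINEG":
            if acc < 0:
                pc = labels[operand]
        elif op == "JIZERO":
            if acc == 0:
                pc = labels[operand]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return "".join(out)


# Count down from 3, one number per line.
countdown = [
    (None, "LOAD", 3),
    ("LOOP", "OUT", None),
    (None, "LINE", None),
    (None, "SUBTRACT", 1),
    (None, "JIZERO", "DONE"),
    (None, "JUMP", "LOOP"),
    ("DONE", "HALT", None),
]
output = run_cesil(countdown)
```

The whole machine state is one accumulator, a dictionary of variables and a program counter, which is probably why it made such a good first language.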

It was a good experience though. I was taught to plan properly: write the expected output for a set of inputs (unit testing on paper), draw a flowchart of the logic, and then write the actual code on paper before even sitting down at a computer.

More information on CESIL.

BTW, I got Salstat running on FreeBSD as well as Linux, OSX and Windows.

20 Jun 2014 (updated 20 Jun 2014 at 21:08 UTC) »

Salstat work is going reasonably well. With my current day job, I have a long commute with about 2 hours 40 minutes on a train. I use this time to work on the various aspects of Salstat.

The latest work was getting it running on Linux. I originally developed the Python version on a Linux machine, way back in the early 2000s. The versions I've released since 2013, however, require wxPython 2.9+ which isn't in any of the Ubuntu or Debian repos. The latest there is 2.8, which doesn't include the html2 component. Depending on the platform, html2 embeds WebKit (OSX and Linux) or Trident (Windows) into wxPython.

This means that wxPython can host an HTML view that can use modern HTML goodies such as CSS, jQuery, Bootstrap, Highcharts, etc. Salstat's output is displayed in this view.

So Salstat needs wxPython 2.9 or later. Happily, the instructions for compiling, building and installing wxPython 3.0.0.0 worked first time on Linux Mint (though I had to run 'sudo ldconfig' to prevent import errors).

And Salstat now runs on Linux again after 10 years of waiting. I'm well-chuffed because I felt guilty that it only ran on Windows and OSX.

Although wxPython 3+ is not in the repos yet (though it might be somewhere!), at least it is now possible to get the wxPython 3 goodness, which is a definite step up from 2.8.

I've also been working on a new website (much needed – the old one is very early 2000s) but getting all the content together is taking time. Visit http://test.salstat.com to see it in operation.

Salstat now does basic box plots. The chart defaults to minimum, 1st quartile, median, 3rd quartile, and maximum.
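
The five values behind a box plot can be computed with nothing more than the standard library. This is only a sketch, not Salstat's own routine: the function name is made up for illustration, and I've assumed the "inclusive" quartile method (there are several conventions and they give slightly different answers on small samples). It needs Python 3.8 or later.

```python
import statistics


def five_number_summary(values):
    """Return the minimum, quartiles and maximum used to draw a box plot."""
    data = sorted(values)
    # n=4 gives the three cut points between quarters: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "minimum": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "maximum": data[-1],
    }


summary = five_number_summary([5, 1, 4, 2, 3])
```
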
Salstat today loaded a CSV file with over 2.5 million rows. Excel tops out at a sniff over 1 million rows (1,048,576) so I'm winning against Microsoft there, at least for now.

Still working on Salstat from time to time. Latest work involves charting and importing from spreadsheets using xlrd (for Excel files) and ezodf (for Libre Office Calc files). Both libraries had similar interfaces so I cobbled together a lot of common code for both rather than having 2 separate routines.

I've also coded a CSV importer. Python's csv module only seems to allow a single delimiter but my users sometimes need to handle multiple ones (particularly with files assembled from several files from different sources). I wrote my own CSV parser that handles multiple delimiters, and delimiter characters within quotes too. The core routine is in here as a Gist (heavily commented too for when I have to trudge my lonely way back to the code to change it). It's not the fastest importer but it does the job accurately with some of the gnarly test data I threw at it.
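
To give a flavour of the idea (this is a simplified sketch, not the routine in the Gist), a single line can be split on any of several delimiters while leaving delimiters inside quotes alone. The function name and the default delimiter and quote sets are assumptions for illustration, and it doesn't attempt doubled-quote escapes.

```python
def split_record(line, delimiters=",;\t", quotes="\"'"):
    """Split one line on any delimiter, ignoring delimiters inside quotes."""
    fields = []
    current = []
    open_quote = None  # the quote character we are currently inside, if any
    for ch in line:
        if open_quote:
            if ch == open_quote:
                open_quote = None      # closing quote: leave quoted mode
            else:
                current.append(ch)     # everything inside quotes is literal
        elif ch in quotes:
            open_quote = ch            # opening quote: enter quoted mode
        elif ch in delimiters:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))    # the last field has no trailing delimiter
    return fields


fields = split_record('a,"b;c"\td;\'e,f\'')
```

Walking the line one character at a time is slow compared with the csv module, but it makes the quoting rules easy to follow and easy to change.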

Salstat code at GitHub

In latest developments, Salstat now displays results nicely, the clipboard functions work well, charts are coming along and bugs have been squashed.

Output display

The full-featured HTML display opens up a lot when it comes to presenting results. The output display now uses jQuery and Twitter Bootstrap, which means that tables actually look nice now.
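
As a rough illustration of the approach (not Salstat's actual output code), results can be turned into a table that Bootstrap styles through its `table` and `table-striped` classes. The function name is hypothetical, and the sketch assumes the Bootstrap stylesheet is already loaded in the page.

```python
from html import escape


def results_table(headers, rows):
    """Render headers and rows as an HTML table with Bootstrap classes."""
    # Escape every cell so that stray '<' or '&' in the data can't break the page.
    head = "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return (
        '<table class="table table-striped">'
        f"<thead><tr>{head}</tr></thead>"
        f"<tbody>{body}</tbody>"
        "</table>"
    )


html_output = results_table(["N", "Mean"], [[3, 2.5]])
```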

Clipboard

Clipboard functions work across the application (data entry and output) which means the above can be edited (if necessary) and copied into a spreadsheet.

Charts

Salstat also pulls in the Highcharts libraries, which will be used for charts. Currently, I'm working on a chart window which allows us to generate a chart and edit it to perfection before it gets put into the results. This should help take the guesswork out of charts. The charts will also be exportable directly to PNG, JPG, PDF and SVG formats. This is not yet working but I hope it will be fairly soon.

Bugs

A lot of bugs have been squashed too. Salstat used to freak (rather: refuse to do anything) when inputting anything other than a number into the data grid. Now, it's more relaxed and will try to deal with things downstream intelligently.

Other bugs such as putting data into the first, third and fourth columns have been squashed. Some other bugs with tests have also been squished.

Future plans

Proper data formatting (variable names, data formatting, specifying missing data and marking it visually with a different background colour)
Charts – Salstat has got to have these and they are coming!
Databases – input from and output to databases. Salstat will abstract the interface (using something like SQLAlchemy) in order to tackle a range of databases and dialects. Having said that, the requirements will be fairly simple (retrieve, write and commit) so fairly vanilla SQL will suffice. This, however, is tricky because I want a data browser whereby tables and some content can be browsed easily and data selected for import. This needs to work for remote and local databases as well as SQLite.
Bring in my custom statistics modules (properly unit tested!) from my forthcoming book, "Computational Statistics".
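
The database plan above boils down to write, commit, retrieve and browse. Here is a minimal sketch of that using the standard library's sqlite3 module rather than SQLAlchemy. The table, column and function names are invented for illustration, and none of this is in Salstat yet.

```python
import sqlite3


def list_tables(conn):
    """Return the names of the tables in an SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def preview(conn, table, limit=5):
    """Return (column names, first few rows) so a table can be browsed."""
    # Table names can't be bound as parameters, so check against the known list.
    if table not in list_tables(conn):
        raise ValueError(f"no such table: {table}")
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


# Write, commit and retrieve against a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (subject TEXT, score REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("a", 1.5), ("b", 2.5)])
conn.commit()

tables = list_tables(conn)
columns, rows = preview(conn, "scores")
```

A data browser would sit on top of calls like these: list the tables, preview a few rows of each, and let the user pick what to import. Remote databases would need the abstraction layer mentioned above, since the catalogue query here is specific to SQLite.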

So lots to do yet, but lots done already over the last fortnight or so. I hope to make a new release on 22 October 2013 – 10 years to the day the last proper release was made!

Long time no write.

Ten years after making the last release of Salstat, I've decided to continue with it. The project is on GitHub now (https://github.com/salmoni/Salstat).

Today's release utilises the excellent xlrd module which has allowed Salstat to read Excel files (xls and xlsx). Many people have asked for this. For now, the basic "happy days" workflow is fine but there is poor error handling.

The next one will have database access. This is a more complex workflow. I also need to harden the Excel and CSV import routines.

Mozilla are looking for a Quantitative user researcher which sounds cool. The emphasis on user research sounds right up my street, particularly the need for mastery of experimental design and statistical analysis. It kind of takes me back to my PhD and work on SalStat (still going strong).

The problem is my covering letter. Can anyone here tell me what style of covering letter is preferred? Long and detailed, explaining why I meet each of the requirements? The standard 3 paragraph ["intro", "I'm cool", "thanks"]? Or something in between?

In the meantime, I've released Roistr which does some basic semantic analysis / text analytics stuff. I put up some demos but it's hard to really show how useful this thing is. It's based on the open source Gensim toolkit along with numpy and scipy.

SciPy sounds like it's going places. Travis Oliphant recently announced an initiative to bring it to big data properly. I have an idea of what he means and it would be very cool.

Does anyone have any Google Plus invites that they could send (one) to me?

In other news, wife, daughter and I are off to the Philippines for 5 weeks and hoping to get some start-up work moving over there. UX is in demand at the moment so it's a good time to be around.

I've also been looking up versions of principal component analysis in Python and found these:



All the linguistic stuff I've been doing lately is making my head spin but it's coming together.

Lots happening: I've been building a semantic relevance engine - something that can accurately determine the semantic similarity of 2 text documents and it's working reasonably well. Working completely untrained, I'm getting accuracies of well above 0.8 and often above 0.9. Obviously 1.0 is the ideal but even human judgements rarely get above 0.9 with the corpora I've been using for this.
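
To be clear, this is not how the engine works, but the simplest baseline for "how similar are 2 documents" is the cosine of the angle between their word-count vectors, which gives a score between 0 and 1. A minimal sketch in plain Python, with function names made up for illustration:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and pull out runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two documents' bag-of-words count vectors."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    # Only words that appear in both documents contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(n * n for n in a.values())) * math.sqrt(
        sum(n * n for n in b.values())
    )
    return dot / norm if norm else 0.0


same = cosine_similarity("the cat sat", "The cat sat.")
different = cosine_similarity("the cat sat", "quarterly sales figures")
```

Raw word counts only measure shared vocabulary, not meaning, so two documents saying the same thing in different words score badly. That gap is what the semantic side has to close.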

The good thing is that I appear to be discovering new stuff almost every day about how documents are understood. There are some approaches I've used that I've not read about in the literature so there might be some useful stuff for the world here.

However, my aim is to make a web service around this. And it's all based on open source software (Python, numpy, SciPy, Gensim etc) which is perfect. There is proprietary knowledge used, however: the corpora, how they're prepared and the architecture of the engine; but that will all come out publicly soon enough.
