Older blog entries for titus (starting at number 269)

A new BLAST parser

I spent the weekend hacking out a BLAST parsing package with pyparsing.

BLAST is a really common bioinformatics tool used to search large-ish sequence databases, and the NCBI BLAST program is probably the single most heavily used program in bioinformatics by a long shot. Unfortunately, the NCBI folk have a habit of making tools with idiosyncratic output formats, and AFAIK the only way to obtain all of the information calculated by BLAST is to parse the (human-readable) text format.

This text format is not only human-readable (and not very machine-readable) but it changes fairly regularly, breaking parsers in packages like BioPython. Since I'm already using pyparsing in twill, and I appreciate its very nice syntax, I decided to try writing a maintainable BLAST parser with pyparsing. (The other primary goals were to build a nice Pythonic API and to simplify the use of introspection.)
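
To give a feel for pyparsing's syntax, here's a toy grammar in the same spirit -- my own illustrative sketch, not the actual grammar from the package -- that picks apart a single BLAST-style score line:

from pyparsing import Word, nums, Suppress

number = Word(nums + '.e-')         # crude: matches 116, 290, 2e-25, ...
score_line = Suppress('Score =') + number.setResultsName('bits') + \
             Suppress('bits (') + number.setResultsName('score') + \
             Suppress('), Expect =') + number.setResultsName('expect')

fields = score_line.parseString('Score = 116 bits (290), Expect = 2e-25')
print fields.bits, fields.score, fields.expect   # -> 116 290 2e-25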

It took me a long time (all weekend!) to build the real thing, but I've finally got a nice, simple API and what seems to be a largely functioning parser:

for record in parse_file('blast_output.txt'):     # one record per query
   print '-', record.query_name
   for hit in record.hits:                        # one hit per database sequence
      print '--', hit.subject_name, hit.subject_length
      for submatch in hit.matches:                # one match per HSP
         print submatch.expect, submatch.bits

         alignment = submatch.alignment
         print alignment.query_sequence
         print alignment.alignment
         print alignment.subject_sequence

It's not really ready for unsupervised use yet, but if anyone out there is jonesin' for a BLAST parser and wants to try this one out, please let me know via e-mail and I'll send it your way. I'd appreciate comments.

--titus

Syndicated 2007-04-30 14:58:51 from Titus Brown

Next: the moooovie

Just saw Next. Highly recommended, believe it or not -- it was a very intelligently done sci-fi movie.

Go! See it! You will enjoy it, if you're in the mood for a bit of silliness and some good ol' fashioned paranormal powers!

--titus

Syndicated 2007-04-29 20:03:03 from Titus Brown

PSF Summer Of Code planet up

I haven't seen anyone announce this, so I guess I should: there's now a Planet Python/Summer of Code site, http://soc.python.org/, hosted by yours truly.

Enjoy!

--titus

p.s. Regular blogging may resume shortly.

Syndicated 2007-04-27 19:03:05 from Titus Brown

Intermediate and Advanced Software Carpentry with Python

(Here's the blurb that I came up with for my Advanced SWC class. This particular class instance isn't open to the public, but I'm not averse to giving it again.

--titus)

What you will learn:

  • how to use and extend advanced builtin types in Python;

  • how to lay out code for ease of maintenance, reusability, and testability;

  • how to profile for performance bottlenecks and improve performance with extensions and threading;

  • how to start using the wide variety of external packages that are useful for scientists, including plotting and data analysis tools such as matplotlib, SciPy, IDLE, MPI, and Rpy;

  • how to make your data more accessible to yourself and others with databases and Web presentation tools.

Course benefits:

The Python programming language contains an immense number of features that are extraordinarily useful to scientific programmers and readily accessible to intermediate-level developers. This course will provide an introduction to many of these features, focusing on those that will make your Python programs more maintainable, testable, accurate, and fast. It will also introduce a number of third-party packages for development, plotting, and data analysis that are particularly useful to scientists.

Who should attend:

Scientists who use Python for data processing, data analysis, data presentation, data management, or working with external code and libraries. An introductory knowledge of Python is assumed, as are basic concepts in object-oriented programming.

Hands-on training:

Exercises throughout this course offer immediate, hands-on reinforcement of the ideas you are learning. Exercises include:

  • recipes for interacting with advanced Python builtin types;
  • refactoring example programs for better code reuse and testing;
  • writing unit tests, doc tests, and functional tests for existing code;
  • enhancing data processing performance with Psyco, Pyrex, and C extensions;
  • refactoring C extension code to support multithreading;
  • graphing data in matplotlib;
  • working with MPI in Python;
  • practical work with the IDLE IDE;
  • interacting with a large database via the Web;
  • building a simple graphical interface for data analysis;

Syndicated 2007-04-02 17:03:08 from Titus Brown

Strangling Your Code and Growing Your Test Harness: The 9 Phases of Building Automated Tests Into Legacy Code

I'm in the early throes of building tests into my Cartwheel project. Cartwheel was one of the two projects that inspired my Web testing project, twill, so naturally I'm happy to finally be putting twill to good use in my own projects. Of course, the transition from building tools for building tools to actually using those tools to build a tool is a bit painful: I can't be general any more; now I have to be specific.
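
For flavor, here's roughly the shape of a twill-based functional test as nose would collect it -- a sketch only, with a placeholder URL and page text rather than Cartwheel's actual pages:

from twill.commands import go, code, find

def test_front_page():
    go('http://localhost:8000/')   # placeholder URL
    code(200)                      # assert the HTTP response code is 200
    find('Cartwheel')              # assert the page body matches this regex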

The process I'm going through right now is appropriately referred to as Strangling Legacy Code, in this case by Growing A Test Harness (both great articles). Leveraging the power of nose directory hierarchies, I'm slowly growing my setup/teardown code to cover more and more functional testing scenarios, which in turn exposes more scenarios to test. While it's an endless-seeming process, I think I'm at the inflection point where I've now automated more than half of the testing tasks for the Cartwheel Web server. Note that I'm by no means testing even 50 percent of the functionality, but what remains is relatively specific and accessible to testing by very small increments in my testing code. Given my general time constraints, I'm going to switch my focus to testing newly written code and writing automated tests for reported bugs; down the road I'll probably use coverage analysis to figure out what large masses of untestedness lie hidden in my codebase.
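
To make the fixture-growing concrete, here's the shape of a package-level nose fixture -- a minimal sketch using a scratch directory as a stand-in for Cartwheel's actual database and server setup:

# tests/__init__.py
import shutil, tempfile

scratch_dir = None

def setup():
    # nose runs this once, before any test in the package
    global scratch_dir
    scratch_dir = tempfile.mkdtemp()

def teardown():
    # ... and this once, after the last test in the package finishes
    shutil.rmtree(scratch_dir)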

Looking back and prognosticating forward, I'd divvy the process up into N steps:

  1. Shock. (How the heck do I start testing this sprawling mass of code??)
  2. Hello, world. (Hey, look at that, I've got a basic import working in my automated tests!)
  3. Fixture code sucks. (Oh, gawd, I've got to automate setting that up, and that, and that...)
  4. Fixtures rock. (Wow, look at what I can test now!)
  5. Over the hump. (Where I am now.)
  6. What lurks beneath? (Using coverage analysis to find large areas of untested code.)
  7. Relaxing. (I've got XX% of my code covered with some kind of automated test! Hooray!)
  8. Reaction. (Hmm, guess I'm not actually testing for that specific bug...)
  9. Goto 7, increment XX.
  10. Asymptotically approach perfection.

Regardless of the steps ahead, it feels good to be at stage 5...

--titus

Syndicated 2007-03-30 19:03:03 from Titus Brown

An e-mail to the Xerces c-users mailing list

If anyone knows someone actually on the Xerces c-users mailing list or development team, could you please forward this on?

(I sent it directly to the list mentioned on http://xml.apache.org/xerces-c/feedback.html, but it hasn't shown up in the archives and I don't seem to have received a bounceback. I'm guessing that I can't post to the list without moderation because I'm not a member of the list, but I can't see any obvious way to sign up for the mailing list. At this point I'm worried it has simply vanished into the ether; hence this post.)

Hi folks,

I thought you might find my discussion of how to compile Xerces-C++ into
Mac OS X universal binaries useful:

      http://ivory.idyll.org/blog/mar-07/compiling-x-platform-on-macs.html

The two pertinent sections are #2 (proper runConfigure incantations) and
3(a) (linking the installed libxerces libraries into Mac OS X's
development hierarchy).  3(c) (distributing dylibs with your app) is
also extremely useful.

Someone should obviously verify all of this before putting it into the
docs; if someone else can verify it, I'd be happy to write it up in the
appropriate format and contribute it as a patch to Xerces-C.  Just ask
;).

cheers,
--titus

p.s. The reason I'd like someone *else* to verify it is that I've
fiddled with various paths on my laptop and no longer have a clean
environment within which to test instructions.  I'm pretty sure I got
all of the steps, but it'd be nice to have someone else check.

I continually have problems figuring out how to get e-mail through to projects. *grumble* Ahh well.

--titus

Syndicated 2007-03-30 18:03:12 from Titus Brown

Compiling Universal Binaries under Mac OS X -- My Experience

I've spent a few months (on and mostly off) trying to get my C++/FLTK program, FamilyRelationsII, to build on my MacBook for both old and new Macs.

I was helped immensely by Mando Rodriguez and Diane Trout, both of whom contributed various snippets. Getting it all to play nice together was still painful enough that I think it's time to contribute something back to the lazyweb/googleplex.

Problem 1: Compiling FLTK cross-platform

I didn't actually keep notes for this, but basically I generated an Xcode project for FLTK and then selected "build cross-platform".

  1. In your fltk-1.1.x directory, generate an Xcode project with cmake -G Xcode .
  2. open FLTK.xcodeproj
  3. Select "FLTK" in the "Groups and Files" pane (left)
  4. Double-click, select 'Build'. Double-click on Architectures. Pick 'em both.
  5. Assuming the build works, you should see something like 'bin/Debug/libfltk.a'. These are the libraries you want.

I haven't figured out how to have them installed correctly yet; presumably that's yet another click away. At this point I just copy them into /usr/local/lib ;).

Problem 2: Compiling Xerces C++ cross-platform

Substitute

./runConfigure -p macosx -n native -t native \
     -z -arch -z i386 -z -arch -z ppc \
     -l -arch -l i386 -l -arch -l ppc \
     -l -Wl,-syslibroot,/Developer/SDKs/MacOSX10.4u.sdk

for the default runConfigure command in the Xerces C documentation. Then make, make install.

(I guess I'll contact the Xerces C people about adding this to their docs...)

Problem 3(a): Linking /usr/local in properly

Because FLTK and Xerces-C++ are installed into /usr/local/lib by default, the -isysroot /Developer/SDKs/MacOSX10.4u.sdk stuff will not work unless you also do this:

% ln -fs /usr/local /Developer/SDKs/MacOSX10.4u.sdk/usr

Apparently you need to do this because -isysroot adds the /Developer/SDKs/MacOSX10.4u.sdk/ prefix to all library and header filename lookups; the symlink lets Mac OS X know that universal libraries etc. can be found under /usr/local as well.

Problem 3(b): Compiling your own code properly

The following CMake snippet (courtesy of Diane Trout) does the job well:

if(APPLE)
  set(APPLE_COMPILE
      "-isysroot /Developer/SDKS/MacOSX10.4u.sdk -arch i386 -arch ppc")

  set(APPLE_LINK
       "-Wl,-syslibroot,/Developer/SDKs/MacOSX10.4u.sdk -arch ppc -arch i386")

  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${APPLE_COMPILE}")
  set(LINK_FLAGS "${LINK_FLAGS} ${APPLE_LINK}")
endif(APPLE)

Then a standard 'cmake .' will put the right magic into your compilation flags.

Problem 3(c): Distributing dylibs with your Mac app

If you're planning to distribute your universal binary, and it depends on the Xerces-C++ libraries, read this very helpful Qt doc; look especially at the Shared Libraries section. This worked beautifully for me.

Briefly, you need to use install_name_tool on both the Xerces-C++ dylib files and your compiled applications, to change the location in which they will be looking for libxerces-c.27.dylib. See my build-dist script diff for exactly what I do.

Conclusion: Double-checking stuff

If at any point you run into trouble compiling with both the '-arch ppc' and '-arch i386' flags, run 'file' on required libraries and other binaries to make sure they're universal:

% file FRII/app/FRII
FRII/app/FRII: Mach-O universal binary with 2 architectures
FRII/app/FRII (for architecture i386):  Mach-O executable i386
FRII/app/FRII (for architecture ppc):   Mach-O executable ppc

Well, I hope this helps someone! It was painful to learn, and AFAIK the correct Xerces-C++ incantations are not available elsewhere on the Web ;).

cheers, --titus

Syndicated 2007-03-28 19:14:01 from Titus Brown

Replacing ``commands`` with ``subprocess``

After an innocent question was answered positively, I am putting together a patch to deprecate the commands module in favor of a slightly expanded subprocess module (for 2.6).

Briefly, the idea is to add three new functions to subprocess:

output = get_output(cmd, input=None, cwd=None, env=None)

(status, output) = get_status_output(cmd, input=None, cwd=None, env=None)

(status, output, errout) = get_status_output_errors(cmd, input=None, cwd=None, env=None)

with the goal of replacing commands.getstatusoutput and commands.getoutput. (commands.getstatus has already been removed from 2.6.)

This will provide a simple set of functions for some very common subprocess use-cases, as well as a cross-platform alternative to commands with better post-fork behavior and error trapping, all adhering to PEP 8 coding standards. A win-win-win, I hope ;).

In addition to writing the basic code & some tests, I would like to:

  • reorganize, correct, and expand the subprocess documentation: right now it's not as useful as it could be.
  • put some warnings/error reporting into subprocess for bad class parameters; e.g. Popen.communicate should check to be sure that both the Popen object's stdout and stderr are PIPEs.

Questions:

  • anything else I should think about doing to subprocess?
  • right now the functions take only the input, cwd, and env arguments to pass through to the Popen constructor. Any other favorite arguments out there?
  • should language be added to the popen2 module pointing people at subprocess, and should popen2 be deprecated?
  • GvR suggested that I reimplement commands in terms of these subprocess functions for 2.6, even though the commands module could be deprecated in 2.6 and probably removed in 2.7. I would rather simply amend the documentation to point people at subprocess.

Thoughts?

--titus

p.s. The implementation of the above functions is dead simple:

def get_status_output(cmd, input=None, cwd=None, env=None):
    pipe = Popen(cmd, shell=True, cwd=cwd, env=env, stdout=PIPE, stderr=STDOUT)

    (output, errout) = pipe.communicate(input=input)
    assert not errout

    status = pipe.returncode

    return (status, output)

def get_status_output_errors(cmd, input=None, cwd=None, env=None):
    pipe = Popen(cmd, shell=True, cwd=cwd, env=env, stdout=PIPE, stderr=PIPE)

    (output, errout) = pipe.communicate(input=input)

    status = pipe.returncode

    return (status, output, errout)

def get_output(cmd, input=None, cwd=None, env=None):
    return get_status_output(cmd, input, cwd, env)[1]
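
Usage is then a one-liner (a sketch, assuming the functions land in subprocess as proposed):

(status, output) = get_status_output('ls /tmp')
print status, repr(output)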

Syndicated 2007-03-21 19:32:20 from Titus Brown

``unittest`` bitching: premature; lazyweb request

From reading Collin Winter's blog, it seems he's designing a new unittest module first, and only then asking c.l.p and presumably python-dev about adding it to py3k. So it's not quite the fait accompli I thought it was, which reduces my complaints to mild grumbling.

And, dear lazyweb... is there a good way to find out when a particular line of code was introduced (or last touched) through subversion?

thanks, --titus

Syndicated 2007-03-21 18:03:12 from Titus Brown

Some interesting discussions on the testing-in-python list

For my own future reference, as well as to attract people to the fairly high-signal testing-in-python mailing list, here are some particularly interesting posts to the TIP list.

Raphael Marvie implements a simple textual specification -> test suite generator.

Kumar McMillan makes some nice suggestions for Michal's "spec" nose plugin.

Benji York outlines a "Testing in Zope 3" case study (originally asked for by Grig).

Sebastien Douche discusses the reason he likes to use Trac to manage projects.

Kumar McMillan talks about the nosetrim plugin for suppressing duplicate errors in your nose output.

And last but not least, Grig starts a loooong discussion on "Doctest or unittest?" that has some really excellent responses; see especially Jim Fulton's discussion of doctests with setup/teardown in footnotes, Benji York's post of the Platonic Ideal of doctests, and Benji York's case for good APIs (and zope.testbrowser and twill, too).

One less pleasant surprise was finding out that unittest is being rewritten for Py3K, and Collin is going with something that is both a significant rewrite and neither nose nor py.test. The mind boggles.

Why are we extending a module based on an ugly paradigm (the unittest module is great, but it's got a lot of unnecessary syntactic sugar compared to nose/py.test), creating Yet Another unittest system, breaking people's old unittest extensions, and skipping past two fairly popular testing frameworks? Apparently this route is easier than either letting things be or convincing the nose/py.test people to "donate". I don't have anything against Collin, but is he really going to develop something that's significantly better than what's out there already? I'm particularly unhappy that it's going to be dropped into Py3K; I'm still not sure why this is happening. (Can anyone point me to a discussion?)

My preference would be to leave unittest as-is if we can't appropriate nose or py.test.

Anyway, I argued my point on the list, and no one else seems to be worried about it (or, rather, they don't have a solution ;).

--titus

Syndicated 2007-03-20 16:03:04 from Titus Brown
