Older blog entries for johnw (starting at number 64)

Journey into Haskell, part 6

Another thing to be learned down the Haskell rabbit-hole: thinking in infinites. Today someone posed a puzzle which I tried to solve in a straightforward, recursive manner: building a list of primes. The requested algorithm was plain enough:

  Create a list of primes "as you go", considering a number prime if it can't be divided by any number already considered prime.

However, although my straightforward solution worked on finite ranges, it couldn't yield a single prime when called on an infinite range -- something I'm completely unused to from other languages, except for some experience with the SERIES library in Common Lisp.

An incomplete solution

Looking similar to something I might have written in Lisp, I came up with this answer:

  primes = reverse . foldl fn []
      where fn acc n | n `dividesBy` acc = acc
                     | otherwise         = (n:acc)
            dividesBy x (y:ys) | y == 1         = False
                               | x `mod` y == 0 = True
                               | otherwise      = dividesBy x ys
            dividesBy x [] = False

But when I suggested this on #haskell (irc://irc.freenode.net/haskell), someone pointed out that you can't reverse an infinite list. That's when a light-bulb turned on: I hadn't learned to think in infinites yet. Although my function worked fine for finite ranges, like [1..100], it crashed on [1..]. So back to the drawing board, later to come up with this infinite-friendly version:

  primes :: [Int] -> [Int]
  primes = fn []
      where fn _ [] = []
            fn acc (y:ys) | y `dividesBy` acc = fn acc ys
                          | otherwise         = y : fn (y:acc) ys
            dividesBy _ [] = False
            dividesBy x (y:ys) | y == 1         = False
                               | x `mod` y == 0 = True
                               | otherwise      = dividesBy x ys

Here the accumulator grows for each prime found, but results are returned in order, each value calculated only as it's needed. This time when I put primes [1..] into GHCi it printed out prime numbers immediately, but visibly slowed as the accumulator grew larger.
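For instance, asking GHCi for the first ten results comes back immediately (note that 1 sneaks in, since the empty accumulator contains nothing that divides it):

  ghci> take 10 (primes [1..])
  [1,2,3,5,7,11,13,17,19,23]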

Syndicated 2009-03-26 13:00:00 (Updated 2009-03-23 05:28:45) from Lost in Technopolis

Journey into Haskell, part 5

Haskell may be difficult to start out with, but once things start rolling, they roll fast. Yesterday (real-world time; these blog entries are staggered) I had started the first lines of HackPorts, but now things are getting close to done for the first version. It's not that I've written much code, but that it was simple to integrate with other people's code.

Borrowing all I can

The first thing I wanted to do was avoid dealing with any of Hackage's data formats, so I cribbed everything I could from the cabal-install package. I actually imported the full source into HackPorts, ripped out its List.hs file, renamed it to my Main.hs file, and then began changing it from a function that prints out a list of available packages into one that writes the data into properly formatted Portfile entries. The code does the following bits of work:

  1. Talks to cabal-install and Cabal to get a list of all known packages on Hackage.

  2. For every package, creates a directory named haskell/$package, and then writes information about that package into haskell/$package/Portfile.

  3. As it does this, it fetches the current version's tarball over HTTP, and uses OpenSSL (directly, through the FFI) to generate MD5, SHA1 and RIPEMD160 checksums of the tarball image.

And voilà, a directory populated with 1136 Portfile entries. What's missing now is the external dependency mapping. As a stub, I have them all depending on port:ghc, but I think there's sufficient information in the Cabal package info to figure out what the right dependencies should be, both among the Hackage packages themselves and against any external libraries (like OpenSSL).

What I learned

As for my Haskell education, I learned about using Haskell's very nice FFI mechanism, and got a lot more experience using the IO monad. An example of using the FFI to call out to OpenSSL:

  {-# OPTIONS -#include "openssl/md5.h" #-}

  foreign import ccall "openssl/md5.h MD5"
    c_md5 :: Ptr CChar -> CULong -> Ptr CChar -> IO (Ptr Word8)

I now have access to a c_md5 function, which goes directly over to the C library to do its work. Not too shabby! As for the IO monad, here is the main function for HackPorts:

  main :: IO ()
  main = do
    createDirectoryIfMissing True "haskell"
    pkgs
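Here's a rough sketch of how c_md5 might be wrapped into something friendlier (the md5sum name and the hex rendering are illustrative, not actual HackPorts code; an MD5 digest is 16 bytes):

  import Foreign
  import Foreign.C.String (withCStringLen)
  import Foreign.C.Types (CChar, CULong)  -- used by the c_md5 declaration above
  import Numeric (showHex)

  -- Marshal a String into a C buffer, let OpenSSL write the 16-byte
  -- digest into scratch space, then render the digest as hex.
  md5sum :: String -> IO String
  md5sum input =
      withCStringLen input $ \(buf, len) ->
        allocaBytes 16 $ \digest -> do
          _ <- c_md5 buf (fromIntegral len) digest
          bytes <- peekArray 16 (castPtr digest) :: IO [Word8]
          return (concatMap toHex bytes)
    where toHex w = case showHex w "" of
                      [c] -> ['0', c]
                      s   -> s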

Syndicated 2009-03-24 13:00:00 (Updated 2009-03-23 05:28:22) from Lost in Technopolis

Functional yet lazy

As I explore Haskell, I’m discovering that one of its trickiest aspects is not structuring things functionally, but lazy evaluation. It turns out lazy evaluation comes with both great benefits and significant difficulties. I’d like to point out a few of these, as they’re becoming clearer to me.

Benefits

One of the great benefits of lazy evaluation is that your code doesn’t need to account for the scale of an operation. Let’s take a simple example: checksumming a large file whose contents are being read, on-demand, over HTTP.

In C++, if I wanted to checksum a large file read over HTTP, I couldn’t buffer it in memory, because I don’t know how large it might get. Nor do I want my checksumming code to know anything about HTTP, or where the data comes from. The answer here is to use I/O streams. By passing a generic istream interface around, I can hide any knowledge of where the data came from. The checksumming algorithm just reads data from the stream as it needs to, and the HTTP layer downloads more bytes as required (typically caching to avoid constant network access).

However, there’s a downside to this: the checksumming code now knows something about I/O. In actuality, a checksumming algorithm only cares about the bytes being checksummed and little else. It shouldn’t have to know about I/O, or strings, or any of the details of where data comes from or how it’s structured. It should ideally receive a pointer to an arbitrarily large sequence of 8-bit bytes, and return a fixed-size checksum representing a fingerprint of those bytes.

Yet this naive approach can’t really be done in C++. If it were a file I was checksumming, I could memory map the file and pass around a byte pointer, and the OS would take care of lazily reading in the bytes for me as needed. But for a file being accessed over HTTP this would require first downloading the file and then checksumming it, when I specifically wanted to “checksum as I go”. Who knows, maybe I’ll discover a reason to stop summing beyond a certain point, and I’d like to stop downloading at that point as well.

Well, just as memory mapping gives me lazy access to the contents of a file, a language with lazy evaluation gives me lazy access to the results of any algorithm – including downloading data over HTTP. With Haskell, I can indeed write my checksum algorithm as if it receives a giant byte buffer, and the language takes care of downloading only as much data as I’ve accessed (plus caching). This simplifies my checksumming code, and reduces the amount of knowledge that has to be passed around: a generic sequence of 8-bit bytes rather than a “stream”. [1]

This simplification lets you design your algorithm as if in an ideal world. You want to process a bunch of numbers? Work on a list. What’s that, you say the numbers are coming in from a socket and you don’t know when they will end? Doesn’t matter; just work on a list. In C++ I’d have to switch from passing a vector to passing an istream iterator, but in Haskell I don’t care what algorithm is populating my list, only that it is a list, and that I know how to work with it.
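Here’s a small illustration of that ideal-world style, using a lazily read file in place of HTTP (the file name and the toy checksum are made up for the example):

  import qualified Data.ByteString.Lazy as BL
  import Data.Word (Word32)

  -- The pure part: a fold over bytes. It neither knows nor cares that
  -- the ByteString behind it is read from disk in chunks, on demand.
  checksum :: BL.ByteString -> Word32
  checksum = BL.foldl' (\acc b -> acc * 31 + fromIntegral b) 0

  main :: IO ()
  main = do
    contents <- BL.readFile "large-file.bin"  -- read lazily, chunk by chunk
    print (checksum contents)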

Detriments

For all its beauty, laziness has three costs I’ve run into so far. The first is that it lets you very easily write functioning algorithms with horrible performance characteristics. This happens because laziness causes a promise [2] to be constructed, which takes memory and time. Sometimes the cost of the underlying operation is far less than the cost of carrying around a promise to do that operation at a later point. This isn’t true of a slow operation like reading from a socket, but it’s certainly true of something trivial like summing two integers. It means one has to be aware of promises: when they’re constructed, and when it’s more beneficial to force evaluation than to defer a computation whose result may never be needed.

The second is that when a poorly performing algorithm dies, it dies when its value is used, not when the promises are made. This can make it look like the consumer is to blame, when really it’s the producer. Here is a trivial example:

  mysum = foldl (+) 0

  main = print (mysum [1..1000000])

Although foldl is tail-recursive, so we aren’t blowing the stack through recursive calls, the program still blows the stack, because foldl builds up a huge, nested structure of promises that only gets evaluated once print is called to render it as a string. That is, the return value from mysum itself is no problem; it’s just a lazy computation against a large list. But then print needs the result, so it asks mysum to fulfill its promise. This in turn causes mysum to churn through the large list of integers, building up the return value as it goes.

However, and here is where the surprise comes in: foldl doesn’t actually compute those values as it walks the input list. No, even these are done lazily, because it can’t know how many of those values will actually be needed. We may know from looking at the code that it will need them all, but it doesn’t know. So it constructs a nested structure of promises looking like this:

  ((((((((0+1)+2)+3)+4)+5)+6)+7)+...)

And so on, all the way to the last integer. Only when mysum is done constructing promises across the entire input list, and the promise structure is returned, does it actually get evaluated, by summing the integers together and finalizing each promise. If you pick an input list large enough, there goes available memory.

The trick here is that the stack fault won’t occur in foldl, or in mysum. It will occur in print, where the need to resolve the promise results in the call to mysum actually being made, which then calls foldl, which then starts building thunks until memory is gone. In this trivial example there’s very little code or time distance between the problem and its cause, but in real-world code there may be enormous gaps between them.
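The standard remedy, for what it’s worth, is the strict fold from Data.List, which forces each intermediate sum as it goes, so no chain of promises can build up:

  import Data.List (foldl')

  -- foldl' evaluates the accumulator at every step, so the running
  -- total stays a plain number instead of a tower of promises.
  mysum :: [Integer] -> Integer
  mysum = foldl' (+) 0

  main :: IO ()
  main = print (mysum [1..1000000])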

In consequence of this I learned that it’s hard to get GHC to produce stack traces for you when there’s a runtime error. Your code can be going on its merry way, when suddenly there’s a stack fault. But that’s all you see: a stack fault indicating something went wrong. Where did it go wrong? Based on the behavior of the program, I’m led to believe it happened near what the code was actually doing [3] – but in fact the problem may have started long, long before, except that laziness deferred the trigger to a later time.

So, even though laziness can delay costs and abstract away how data is produced, by the same token it also delays errors and abstracts away blame. In C++, if I pass in an I/O stream and there’s a crash reading from it, I know to look at my stream code. But in Haskell, if I get a stack fault simply by processing a list, how am I to know what’s wrong? It’s not going to be in the List code, and probably not in the code walking the list, but in the code which promised to produce the list, potentially a long time ago.

I still think the benefits can outweigh the difficulties – especially when it comes to parallelism, avoiding unnecessary computations, and allowing code to safely traverse infinite series – but it definitely requires a level of algorithmic consciousness on the part of the engineer which seems quite a bit higher than with imperative languages.


  1. And what if I do have to include state with this raw, lazy data, but I don’t want the algorithm to know anything about it? That’s where the Monad steps in. Say instead of checksumming a file, I’m parsing an expression. There are a lot of details that go along with parsing that have little to do with interpreting the next bit of text, such as token position, error context, backtracking information, etc. I want to be able to write a routine that parses a number very simply, without knowing about all those details. It’s the Monad that manages this extra information. You can read more here.

  2. Promises – thunks, in GHC parlance – are what get turned into real values when data is finally needed.

  3. If you use profiling libraries along with -prof -auto-all, you can get a much clearer picture of what was executing at the time of the fault.

Syndicated 2009-03-22 08:06:16 (Updated 2009-03-22 08:06:25) from Lost in Technopolis

21 Mar 2009 (updated 21 Mar 2009 at 20:06 UTC)

Journey into Haskell, part 4

I’ve been reading Real World Haskell now, after having finished the delightful Learn You a Haskell tutorial. I’m up to chapter 6, about to dive into typeclasses. In the meantime, I’ve picked a toy project that also has a taste of usefulness: a script to convert the Hackage database into MacPorts Portfiles, respecting inter-package and external library dependencies. I call it HackPorts, of course.

Requirements

This translation should require two things:

  1. The Cabal package, for reading information about all packages known to it. This avoids writing a custom parser, or using HTTP to crawl the online Hackage database.

  2. A mapping file of external dependency names to MacPorts port names. This is for dependencies on things like libbz2, where the script will need to be taught how MacPorts names that library. This is likely to be the most labor-intensive step, having nothing to do with Haskell.

Initial experiences

Haskell makes a concerted point about separating “pure” from “impure” code. Anything which talks to the outside world, such as reading and writing files, is impure. Anything which can be expressed in terms of standard data types – or compositions thereof – is pure.

Take for example a program to count lines in a file. The pure part of the code receives a giant string, splits it into lines at line boundaries, counts those lines, and returns an integer. The impure part takes a command-line argument, interprets it as a FilePath (an impure type, since it must concern itself with operating system-dependent naming conventions), and reads the contents of the file at that location. The program flows by passing the file contents as a string to the pure code, and receiving back an integer to be printed on the output device.
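A minimal sketch of that division (my illustration, not code from any real program):

  import System.Environment (getArgs)

  -- Pure: operates on a plain String, and knows nothing about files.
  countLines :: String -> Int
  countLines = length . lines

  -- Impure: argument handling and file I/O live here, and only here.
  main :: IO ()
  main = do
    (path:_) <- getArgs
    contents <- readFile path
    print (countLines contents)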

This division into pure and impure has an interesting side-effect (no pun intended): Most of a program’s code is written in isolation of its context of usage. Take Cabal, as a case in point here. Part of Cabal deals with downloading information from the Web, reading and writing package files, and executing external commands, like make. But another part of Cabal is concerned only with the structure of package files, and determining the total set of dependencies required for building a package. These latter details can be discussed in complete isolation from what is done with that information.

As a result – and I’m not sure whether the Cabal authors designed it this way or not – Cabal is naturally part “program”, and part API. I was able to start taking apart package files almost instantly, with extremely little code. Here’s a toy program to print out a package’s maintainer, if given the path to a .cabal file:

  import System.Environment (getArgs)

  import Distribution.Verbosity (verbose)
  import Distribution.PackageDescription
  import Distribution.PackageDescription.Parse (readPackageDescription)

  main = do
    args <- getArgs
    pkg  <- readPackageDescription verbose (head args)
    print . maintainer . packageDescription $ pkg

Now, I do suppose it’s just as easy to do a similar thing in Python’s distutils, for example:

  import sys

  from distutils.extension import *

  exts = read_setup_file(sys.argv[1])
  print exts[0].language       # print the ext 'language'

What excites me is that Haskell uniquely encourages the separation of algorithm and application – the isolation of context-dependent knowledge into as small a region of a program as possible.

Too many times I’ve tried to use a utility’s code as a “library”, only to find it so caught up in its idea of how it should be used that it never bothered to abstract its core principles into a set of “pure” functions, independent of that intent. This happens, for example, with the version control system Git. Although many have wanted a libgit.a for accessing Git’s data structures directly from other languages, none exists. One is forced either to shell out to the git command, or to write another implementation to interface with the “pure” side of what Git does.

Syndicated 2009-03-21 10:18:23 (Updated 2009-03-21 19:20:43) from Lost in Technopolis

Updated site to use Blueprint CSS again

Recently I changed how the content on this site was generated, from using the standalone OS X application RapidWeaver, to the server-side publishing platform Movable Type. During that transition I changed the site’s style to the minimalist default offered by MT, which uses its own CSS for column layout and typography.

Tonight I finally got around to switching the site back to blueprint-css, which I very much prefer. I used the superb application CSSEdit to help me massage Movable Type’s style into something compatible with Blueprint’s own typography and layout.

I hope the result is pleasing. If anyone sees strange artifacts or display issues, please let me know. I’m aware code examples were being truncated on the right side before, but this should be corrected now. More on Haskell to come soon!

Syndicated 2009-03-20 09:27:44 (Updated 2009-03-20 09:27:50) from Lost in Technopolis

Journey into Haskell, part 3

Today I needed a wrapper script to drop arguments from a command-line. I instinctively reached for bash, but then thought it would be a good exercise for my infant Haskell knowledge.

The task

The task at hand is to write a wrapper script for /usr/bin/ld that drops arguments beginning with -Wl,-rpath,. Since it must deal with arguments containing spaces, and I didn’t want to get into executing external programs with Haskell just yet, I wrappered the wrapper:

  #!/bin/bash
  $(dirname $0)/ld-wrapper "$@" | xargs -0 /usr/bin/ld

Here ld-wrapper is expected to return its arguments separated by NUL characters, so I can feed them to xargs, and from there to /usr/bin/ld. I’m sure there’s an easy, all-in-one way to do this with Haskell; I just haven’t reached that chapter yet.
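For the curious, here’s a sketch of what that all-in-one version might look like, using rawSystem from System.Process to invoke the linker directly (so neither bash nor xargs is needed):

  import Data.List (isPrefixOf)
  import System.Environment (getArgs)
  import System.Exit (exitWith)
  import System.Process (rawSystem)

  main :: IO ()
  main = do
    args <- getArgs
    -- Run the real linker with the filtered arguments, and propagate
    -- its exit code as our own.
    code <- rawSystem "/usr/bin/ld" (filter (not . isPrefixOf "-Wl,-rpath") args)
    exitWith code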

Haskell version

Anyway, here is the Haskell script:

  import Data.List
  import System.Environment

  main = do
    args <- getArgs
    putStr $ intercalate "\0"
           $ filter (not . isPrefixOf "-Wl,-rpath") args

Pretty basic: it filters the input arguments, keeping each one which does not begin with the sought-for string, and joins the list together using NUL as the separator.

Ruby version

As a quick sanity check, I wrote the same thing in Ruby, since it has facilities for being just as succinct:

  print ARGV.select { |y|
    !y.include?("-Wl,-rpath")
  }.join("\0") + "\0"

I wanted to do this with an “inverse grep” instead of select, but couldn’t find a way to grep for the opposite of a pattern.

What’s interesting is that the Ruby version is marginally faster than the compiled Haskell one. For filtering 40,000 arguments, here are the averaged run-times over 20 invocations:

  Language   Speed
  Haskell    0.00774523019791s
  Ruby       0.00551697015762s

My guess is that Haskell is creating 40,000 different strings in memory as it constructs the final result, while Ruby is pasting one together as it goes. But I don’t know for certain.

Syndicated 2009-03-19 05:59:06 (Updated 2009-03-19 08:04:05) from Lost in Technopolis

Journey into Haskell, part 2

Everybody talks about Monads when they mention Haskell, so I got a bit ahead of myself and wanted to see something of what they’re about. No, don’t worry, I’m not aspiring to yet another Monad tutorial. I feel I have a ways to go before I’m ready to craft my own light-saber.

I did read about 10 Monad articles on the Web, and found myself more confused when I came out than when I went in. Today’s exercise took about 5-6 hours of pure frustration, before a kind soul on IRC finally set me straight. It sure is difficult when getting past a single compiler error takes you hours.

That bedeviled cat

Most geeks know about Schrödinger’s cat, the fated beast who, when put into a box with a random source tied to a deadly gas trigger, remains in a state of quantum superposition in which he’s neither alive nor dead until someone opens the box to look.

Well, people kept saying that Monads are like “computational containers”, so I wanted to model the following:

  1. There is a Schroedinger Monad into which you can put a Cat.
  2. When you create the Monad, it is Unopened, and the Cat has no state.
  3. You also pass in a random generator from the outside world. This involves another Monad, the IO Monad, because randomness relates to the “world outside”.
  4. As long as you don’t use the monad object, the Cat is neither Dead nor Live.
  5. As soon as you peek into the box, or use it in any calculation, the Cat’s fate is decided by a roll of the dice.

When I run the program ten times in a row, here’s what I get:

  Opened (Live (Cat "Felix"))
  Opened Dead
  Opened Dead
  Opened Dead
  Opened Dead
  Opened (Live (Cat "Felix"))
  Opened Dead
  Opened Dead
  Opened (Live (Cat "Felix"))
  Opened (Live (Cat "Felix"))

Let’s look at the code, and where I had troubles writing it.

A flip of the coin

The first function flips a coin and returns True or False to represent Heads or Tails:

  import System.Random

  flipCoin :: StdGen -> Bool
  flipCoin gen = fst $ random gen

The sugar fst $ random gen is just shorthand for fst (random gen). There is no difference, I was just playing with syntax. You do need to pass in a valid random generator, of type StdGen, for the function to work.

Cats

  data Cat = Cat String deriving Show
  data Probable a = Dead | Live a deriving Show

These two types let me make Cats out of Strings, along with a Probable type which models a Live thing or a Dead thing. It treats all Dead things as equal. I can create a Live Cat with:

  felix = Live (Cat "Felix")

Following my “fun with syntax” up above, I could also have written:

  felix = Live $ Cat "Felix"

It doesn’t matter which. The $ operator is plain function application, but with very low precedence, so that parentheses aren’t needed around the argument. Without the parens (or the $), it would look like I was calling Live with two separate arguments: Cat and "Felix".

Flipping a Cat

  flipCat :: StdGen -> a -> Probable a
  flipCat gen cat = if flipCoin gen
                    then Live cat
                    else Dead

When I have a Cat, I can subject it to a coin toss in order to get back a Live Cat or a Dead one. I should probably have called this function randomGasTrigger, but hey.

The type of the function says that it expects a random generator (for flipCoin), some thing, and returns a Probable instance of that thing. The Probable means “can be Live or Dead”, according to how I defined the type above. The rest of the function is pretty clear, since it looks a lot like its imperative cousin would have.

Bringing in Schroedinger

  data Schroedinger a
    = Opened (Probable a)
    | Unopened StdGen a deriving Show

This type declaration is more complicated. It creates a Schroedinger type which has two data constructors: an Opened constructor, which takes a Probable object – that is, one whose Live or Dead state is known – and an Unopened constructor, which takes a random generator and an object without a particular state, such as a Cat.

Some values I could create with this type:

  felix   = Opened (Live (Cat "Felix")) -- lucky Felix
  poorGuy = Opened Dead                 -- DOA
  unknown = Unopened (mkStdGen 100) (Cat "Felix")

In the third case, the idea is that his fate will be determined by the random generator created with mkStdGen 100. However, I want a real random source, so I’m going to get one from the environment later.

Here comes the Monad

  instance Monad Schroedinger where
    Opened Dead >>= _ = Opened Dead
    Opened (Live a) >>= f = f a
    Unopened y x >>= f = Opened (flipCat y x) >>= f
    return x = Opened (Live x)

As complex as Monads sound on the Web, they are trivial to define. Maybe it’s a lot like binary code: nothing could be simpler than ones and zeroes, yet consider that all the complexity expressible by computers, down to video, audio, programming languages, and reading this article, is contained within the possibilities of those two digits. Yeah. Monads are a little like that.

This useless Monad just illustrates how to define one, so let’s cut it apart piece by piece. By the way, I didn’t author this thing, I just started it. Much of its definition was completed by folks on IRC, who had to wipe the drool from my face toward the end.

  instance Monad Schroedinger where

Says that my Schroedinger type now participates in the joy and fun of Monads! He can be discussed at parties with much auspiciousness.

      Opened Dead >>= _ = Opened Dead

The >>= operator is the “bind” function. It happens when you bind a function to a Monad, which is like applying a function to it. This line says that if you apply a function to an Opened box containing a Dead thing, what you’ll get back is an Opened box with a Dead thing.

      Opened (Live a) >>= f = f a

If, however, you bind a function to an Opened box with a Live thing, it will apply the function to what’s in the box – in this case, the Cat itself. The function f is assumed to return another instance of the Schroedinger type, most likely containing the same cat or some transformed version of it.

      Unopened y x >>= f = Opened (flipCat y x) >>= f

Here is the meat of this example, its reason for being, all contained within this one line: If you bind a function to an Unopened box, it gets bound in turn to an Opened box containing a Cat whose fate has been decided by the dice. That’s all. The reason I used a Monad to do this is to defer the cat’s fate until someone actually looks inside the container.

      return x = Opened (Live x)

Lastly, if someone returns a cat from a box, assume it’s an Opened box with a Live Cat. I don’t honestly understand why this is necessary, but it seems Opened Dead cats are handled by the binding above, as shown by the output from my program. I’ll have to figure this part out soon…

The main function

The last part of the example is the main routine:

  main = do
    gen <- getStdGen
    print (do
            box <- Unopened gen (Cat "Felix")
            -- The cat's fate is undecided
            return box)

This is fairly linear: it gets a random generator from the operating system, then creates an Unopened box and returns it, which gets printed. print does its work by calling show on the Schroedinger type, since it was derived from Show earlier.
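One detail worth spelling out (my annotation, not part of the original explanation): the inner do-block is just sugar for the bind operator defined above,

  Unopened gen (Cat "Felix") >>= \box -> return box

so it is the Unopened case of >>= that ends up calling flipCat, at whatever moment the value is finally demanded.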

Something I still don’t understand: at exactly which point does the flipping happen? When box is returned? When show gets called? Or when print actually needs the value from show in order to pass it out to the IO subsystem?

Closing thoughts

The full version of this code is on my server. There is also a simpler version without Monads. I worked on the Monad version just to tweak my brain. At least I can say I’m closer to understanding them than when I started.

Syndicated 2009-03-18 08:07:37 (Updated 2009-03-18 08:07:45) from Lost in Technopolis

Journey into Haskell, Part 1

Having just begun my descent down the rabbit hole, I thought I’d try journaling about what I discover along the way, so that those who are merely curious can play the part of language voyeur. I’ve always wanted to do that: to see how someone dives into Erlang or O’Caml or Forth – or Haskell. Here’s your chance.

This is day 5 of the Haskell experience, and I’m having quite a bit of fun so far. It’s definitely twisting my head into pretzel shapes. I’ve spent hours getting less done than I could achieve with Python in moments. The hope is that all this retraining will pay off further down the road.

Fibonacci

My first attempt was a Fibonacci function, which I failed at miserably. Turns out I was unable to conceive of “lazy recursion”. When I looked up the answer, it just seemed beautiful:

  fib = 1 : 1 : zipWith (+) fib (tail fib)

This function starts out the list with 1, followed by 1, then it starts adding two lists together – both provided by the same function before it’s even done! In imperative land this would blow the stack in a heartbeat, but in Haskell it makes sense. The recursive call to fib returns 1, 1, 2, 3, 5, … and the recursive call to tail fib returns 1, 2, 3, 5, …. Add them together, and you get the rest of the sequence.

There is also the traditional definition, which matches what you find in math textbooks:

  fib 0 = 0
  fib 1 = 1
  fib n = fib (n-1) + fib (n-2)

The list version, if evaluated at the interactive prompt, will generate numbers forever, so you have to ask for just a few, like the first 20:

  take 20 fib
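For instance, asking for the first ten:

  ghci> take 10 fib
  [1,1,2,3,5,8,13,21,34,55]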

So, things began with my face on the ground, which was humbling, but also refreshing that such a simple problem could floor me so easily.

Splitting strings

The next problem I tried to tackle was splitting a string into substrings at each colon. That is:

  "Hello:World"
  => ["Hello", "World"]

Again, fail. How shocking it was to spend over an hour on this and ultimately have to resort to Google. The answer was pretty straightforward:

  splitAtColons :: String -> [String]
  splitAtColons = sac' []
      where sac' acc []       = [acc]
            sac' acc (':':xs) = acc : sac' [] xs
            sac' acc (x:xs)   = sac' (acc ++ [x]) xs
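A quick check in GHCi (note that adjacent colons yield an empty field):

  ghci> splitAtColons "Hello:World"
  ["Hello","World"]
  ghci> splitAtColons "a::b"
  ["a","","b"]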

What I missed was using an accumulator to collect the current string. I kept thinking it was something I had to build up in the return value as I went along, rather than something passed down to each deeper level – and then returned once I’d finished adding to it. Here’s the breakdown:

  splitAtColons :: String -> [String]

Defines the type of the function as something which takes a String and returns a list of String.

  splitAtColons = sac' []

This is essentially what I missed. The definition of splitAtColons calls a sub-function, passing in an empty string (aka list) as the “accumulator”.

      where sac' acc [] = [acc]

If sac' sees an empty string ([]) – the end of the string currently being processed – return the accumulated string in its own list.

            sac' acc (':':xs) = acc : sac' [] xs
            sac' acc (x:xs)   = sac' (acc ++ [x]) xs

Otherwise, take apart the current string into its first character, x, and the remainder, xs. If that first character is a colon, return a list with the current accumulator as the head, and recurse to process the rest of the string (and so on). Otherwise, add the non-colon character to the current accumulator, and recurse to process the rest of the string.

First reactions

Moral of my first story: prepare to be humbled. Google and IRC were a lifeline, and the people on #haskell were both helpful and patient. More soon.

Syndicated 2009-03-16 19:12:14 (Updated 2009-03-16 20:18:05) from Lost in Technopolis

The JVM, and costs vs. benefits

In a recent entry on differences between Haskell and Lisp, one of the Lisp community’s long-time members, Daniel Weinreb, asked about my stated aversion to JVM-based languages for everyday computing (sometimes referred to as “scripting”). Specifically, he asked in relation to Clojure, and why I haven’t been immediately taken by that language – despite its having so many features I respect and admire.

I wanted to respond to Daniel’s question in a separate blog entry, since this topic has come up so often, it seems, and deserves thought. The JVM is a rich, mature platform, and you get so much for free by designing new languages on top of it. The point of debate is: what are the costs, and are they always worth the asking price?

Daniel’s question was:

In your own case, you mention “tiny” and “fast-running” executables. I am not sure why “tiny” matters these days: disk space is very cheap, and the byte code used by the JVM is compact. Common Lisp programs compiled with one of the major implementations, and programs written for the Java Virtual Machine, execute at very high speed.

The fact that you distinguish between server-side and client-side applications suggests to me that what you’re really talking about is start-up latency: you’re saying that a very small program written for the JVM nevertheless has a significant fixed overhead that causes perceived latency to the user. Is that what you have in mind?[…]

As a hypothetical question just to clarify your meaning: if there were a JVM implementation that started up instantly, so that the speed of execution of a small program would be the same as the speed of the same code appearing in the middle of a long-running server process, would that answer your objections?

Hi Daniel, thank you for your in-depth reply. As always, I enjoy reading what you’ve contributed to the Net’s compendium of thought on Lisp and related languages.

Your clarification was most accurate: When I said “scripting”, I was talking about a context of usage, not a particular language paradigm. I like that Haskell seems to be just as appropriate for tiny, throw-away scripts as it is for large, long-running programs.

When it comes to the latter, I really have no objections at all to the JVM or its startup time. I’m more than willing to wait 5 minutes for something to start, if it will run for months at high efficiency. I face this situation all the time at work, where we have a huge EJB application hosted on JBoss. It may complicate debugging sometimes, but the costs are worth the benefits. The sheer number of things that J2EE and JBoss manage on our behalf, compared to the small amount of code necessary to take advantage of them, is quite amazing.

What the JVM takes away, at least in 2009, is the choice of what those costs will be, and when I have to pay them. I think one of C’s biggest attractions for a long time has been that most of its costs are a conscious decision. If you favor startup time, or a small memory footprint, or fast execution, you can pretty much decide. This makes it as appropriate for embedded apps, as it is for running an HTTP server, as it is for building operating systems and compilers. With Java, despite all the things you get for “free”, it comes at the cost of other freedoms. And sometimes, Java’s priorities are not mine.

So while I can and do use the JVM for server-side computation, it’s a bit heavyweight for small and simple tasks. Common Lisp’s answer to this problem was an ingenious one. Instead of building programs that you run over and over, it offers an “environment” in which code is iteratively evaluated, so that you actually grow and nurture a burgeoning set of functionality within a long-running VM. I like this model when appropriate, and enjoy it, for example, in Emacs, which I can leave running for days on end while at the same time extending its functionality by writing new functions and customizing variables.

To answer your query then: yes, if JVM startup time could be eliminated, it would “free my hand”. I very much respect the maturity and stability of the JVM libraries Groovy and Clojure have access to. Also, what you said about the JIT, and alternative VMs, can be supplemented by mentioning all the other JVM facilities that exist, like code coverage, performance and memory analysis, and live introspection; along with the ability to pick JVMs to run on phones, or satisfy real-time computing requirements. It’s a rich platform, no doubt.

But why do we never see complaints about languages that link to the C++ standard library, or Boost, or any other of the large frameworks that exist? Because in those worlds, you don’t pay for what you don’t use. That’s been a design philosophy behind C++ for years, and to good effect. We might complain about the language, or its APIs, but you hardly notice if other projects use it, because largely, one can pretend it’s not even there. Not so with the JVM. Every time I start a Java application on my system, I feel it. Run several of them at once, and even my 3 GB laptop starts swapping. Only with the JVM are such things a source of common complaint.

I’m hoping that some day, projects like the LLVM will start to abstract these two sides of development. I want to be able to pick my language for its type safety, clarity, expressiveness, and joy of use; while at the same time I’d like to pick my VM for its security, footprint, handling of parallelism and messaging, and run-time appropriateness. This would let me choose Lisp, Haskell, Python or C++, depending on the skillset of engineers available to me; and the JVM, .NET platform, or LLVM, depending on how I meant the code to be used. Wouldn’t that be a powerful set of tools at one’s disposal?

Syndicated 2009-03-16 02:28:30 (Updated 2009-03-16 02:28:58) from Lost in Technopolis

Run times for Hello, World in 2009

Someone recently asked what my issue was regarding the JVM, since at the moment it prevents me from falling too much in love with Clojure – a language with the double benefits of functional programming, and Lisp syntax and macros.

Well, below is my reason. These may not seem like long times in the scheme of things, but psychologically it builds up on me when I have to run a particular script over and over and over again. I’ve already noticed the pain with Groovy.

  Language               Running time (s)
  C                      0.00415675640106
  C++                    0.0043337225914
  Haskell (compiled)     0.00494946241379
  Perl                   0.00773874521255
  Ruby (1.8.7)           0.00913717746735
  Ruby (1.9.1-p0)        0.0196997523308
  Python                 0.0269904136658
  ECL (Common Lisp)      0.126332080364
  Java (JDK6)            0.146584188938
  Haskell (interpreted)  0.20009740591
  Groovy (JDK6)          1.07791568041

If you’d like to generate some of these timings for your own system, I have created a Hello, world project on GitHub.

Syndicated 2009-03-15 16:50:14 (Updated 2009-03-15 19:57:23) from Lost in Technopolis
