A History of Haskell Commentary

There’s a draft paper named “A History of Haskell” on the Haskell Wiki right now, which will be submitted to the History of Programming Languages conference (HOPL) in 2007. I wasn’t even aware there was a conference dedicated to programming language history, and the idea is a great one: after reading the paper, I’ll be tracking down papers about other significant languages, such as LISP, Smalltalk and C++. (Hey, whatever your opinion of C++, it seems to be universally acknowledged that Stroustrup’s The Design and Evolution of C++ is a title worthy of being on every serious programmer’s bookshelf. Seeing why a language is the way it is today is a huge key to understanding how to wield it.) The paper’s a long read (~40 pages) but a very easy one (no Greek in the entire thing!), and it’s fascinating hearing the inside stories from some obviously incredibly clever people. I’m also glad the authors mixed some humour into the paper, something which abounds in the Haskell community: not enough papers dare to inject (and successfully deliver) humour into their materials.

So, here’s some of my commentary on quotes that I particularly liked from the paper. First, it looks like the authors of the paper agree with me about Haskell’s problem with unpredictable performance:

The extreme sensitivity of Haskell’s space use to evaluation order is a two-edged sword. Tiny changes—the addition or removal of a seq in one place—can dramatically change space requirements. On the one hand, it is very hard for programmers to anticipate their program’s space behaviour and place calls of seq correctly when the program is first written… As designers who believe in reasoning, we are a little ashamed that reasoning about space use in Haskell is so intractable.

Mind you, they do also point out something I didn’t mention in my writings at all:

On the other hand, given sufficiently good profiling information, space performance can be improved dramatically by very small changes in just the right place—without changing the overall structure of the program… Yet Haskell encourages programmers—even forces them—to forget space optimisation until after the code is written, profiled, and the major space leaks found, and at that point puts powerful tools at the programmer’s disposal to fix them. Maybe this is nothing to be ashamed of, after all.

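To make that concrete, here’s the classic example of the kind of one-line change both quotes are describing (my own sketch, not code from the paper): the two folds below differ only in whether the accumulator is forced at each step (foldl' does a seq for you), yet one runs in constant space and the other builds an enormous chain of unevaluated thunks.

    import Data.List (foldl')

    -- Lazy left fold: the accumulator becomes a growing chain of
    -- unevaluated (+) thunks, a textbook space leak.
    leakySum :: [Integer] -> Integer
    leakySum = foldl (+) 0

    -- foldl' forces the accumulator with seq at each step, so this
    -- runs in constant space.
    strictSum :: [Integer] -> Integer
    strictSum = foldl' (+) 0

    main :: IO ()
    main = print (strictSum [1 .. 10000000])
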
Note that Don Stewart’s incredible Data.ByteString library has pretty much closed the performance gap between Haskell and C for string processing and binary I/O, an area where Haskell was notoriously weak. (Haskell has also risen to the top of the heap in the Great Language Shootout, largely thanks to Don’s efforts.)

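As a rough illustration of the style (my own sketch, not Don’s code, and the file name is made up), counting the newlines in a large file with the strict ByteString API reads like ordinary Haskell but runs at close to C speed:

    import qualified Data.ByteString.Char8 as B

    main :: IO ()
    main = do
      contents <- B.readFile "big-log-file.txt"  -- hypothetical input file
      print (B.count '\n' contents)
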
There’s one paragraph on the great importance of profiling tools in finding the cause of performance problems, in one case leading to an absolutely amazing reduction in heap usage:

… there was no practical way of finding the causes of space leaks in large programs. Runciman and Wakeling developed a profiler that could display a graph of heap contents over time, classified by the function that allocated the data, the top-level constructor of the data, or even combinations of the two (for example, “show the allocating functions of all the cons cells in the heap over the entire program run”). The detailed information now available enabled lazy programmers to make dramatic improvements to space efficiency: as the first case study, Runciman and Wakeling reduced the peak space requirements of a clausification program for propositional logic by two orders of magnitude, from 1.3 megabytes to only 10K (Runciman and Wakeling, 1993)… A further extension introduced retainer profiling, which could explain why data was not garbage collected… With information at this level of detail, Runciman and Röjemo were able to improve the peak space requirements of their clausify program to less than 1K—three orders of magnitude better than the original version. They also achieved a factor of two improvement in the compiler itself, which had already been optimised using their earlier tools.

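For what it’s worth, the modern descendants of those tools ship with GHC itself. This is roughly how I’d get a heap graph out of a leaky program today (my sketch for a current GHC, with Leak.hs as a made-up file name, not anything from the paper):

    -- Build with profiling support:
    --     ghc -prof -fprof-auto -rtsopts Leak.hs
    -- Run with heap profiling by cost centre (-hc), by constructor
    -- (-hd), or by retainer set (-hr, the retainer profiling mentioned above):
    --     ./Leak +RTS -hc -RTS
    -- Then turn the resulting Leak.hp file into a PostScript graph:
    --     hp2ps -c Leak.hp
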
And, two paragraphs that lend more credence to the idea that Haskell really does lead to fewer lines of code:

Darcs was originally written in C++ but, as Roundy puts it, “after working on it for a while I had an essentially solid mass of bugs” (Stosberg, 2005). He came across Haskell and, after a few experiments in 2002, rewrote Darcs in Haskell. Four years later, the source code is still a relatively compact 28,000 lines of literate Haskell (thus including the source for the 100 page manual). Roundy reports that some developers now are learning Haskell specifically in order to contribute to darcs.

One of these programmers was Audrey Tang. She came across Darcs, spent a month learning Haskell, and jumped from there to Pierce’s book Types and Programming Languages (Pierce, 2002). The book suggests implementing a toy language as an exercise, so Audrey picked Perl 6. At the time there were no implementations of Perl 6, at least partly because it is a ferociously difficult language to implement. Audrey started her project on 1 February 2005. A year later there were 200 developers contributing to it; perhaps amazingly (considering this number) the compiler is only 18,000 lines of Haskell (including comments) (Tang, 2005). Pugs makes heavy use of parser combinators (to support a dynamically-changeable parser), and several more sophisticated Haskell idioms, including GADTs (Section 6.6) and delimited continuations (Dybvig et al., 2005).

For verification and comparison, here’s a very rough lines-of-code count generated using David A. Wheeler’s ‘sloccount’ for Darcs, Bazaar-NG, Mercurial, Arch and Subversion. Note that I excluded test cases, supporting libraries and other non-essentials, which can be massive (e.g. I omitted hackerlab for Arch entirely, which is another 50,000 lines, and the Apache Portable Runtime for Subversion, which is another 85,000 lines):

  • Darcs: 21,135 lines
  • Bazaar-NG: 47,084 lines
  • Mercurial: 20,537 lines
  • Arch: 48,639 lines (~100,000 with hackerlab)
  • Subversion: 98,736 lines (~185,000 with APR and APR-util)

So Darcs and Mercurial are pretty similar, Bazaar-NG is now nearly the same size as Arch (despite being written in Python rather than C!), and Subversion is, well, a crapload bigger than all the others.

Plus, I didn’t realise that Pugs was a mere 18,000 lines of code, which is quite amazing. I’m not sure whether I’m more impressed by that figure, or that Erlang’s incredible Mnesia distributed database is a mere ~30,000 lines of code…
