Static Generation with Haskell

So now that this blog is statically generated, I thought I might go into a bit more detail on how that is accomplished. Most of it is probably pretty boring if you know this stuff, so I'm going to try to put a Haskell spin on it.

If you're familiar with the idea of static generation, you might want to skip past the Files and Stages sections, straight to the section on File I/O.

Files

There are four kinds of files involved in generating this blog. Again, if you know your way around web development this is probably not new, but if you don't, here are some words I'll use:

  • Articles, which are text files describing each blog post;
  • Code, which is the actual program that takes blog posts from text files and creates HTML files for the entire blog out of them;
  • Templates, which are ways of telling the program where it should insert the content; and
  • Static files, which are things like style sheets, images, and other assets that can be included in templates or articles and are served as-is by the web server.

There is actually a fifth kind of file involved in the generation of this blog, but it's so silly it doesn't deserve its own bullet point. I discovered when I made the templates that some things (like the URL "http://two-wrongs.com/") popped up very often, and to make it easy to change those things all at once, I put them in a configuration file. The first thing the program does when it starts up is read that configuration file and store the parameters somewhere the rest of the program can access them. The configuration file is not touched after that so I'll disregard it here too.

The templates might sound like an odd one if you're not used to web development. The idea of having templates as a separate concept is that you don't have to write HTML inside your code. You can keep the representation of your blog (the HTML code that makes it up) completely separate from the code that generates the blog. So if you have many blogs that look different, you can use the same code to generate them but based off of different templates.

You could also generate many kinds of views for your blog – for example you can render the same content into pure text files and HTML pages and PDF documents. Just swap out the templates!

This is really just an extension of the idea that keeps CSS separate from HTML.

Stages

There are also four basic stages involved in generating this blog. When the command to regenerate the blog is run, these things happen, in this order (again, ignoring the config file...):

  1. The program hunts through the drafts and published directories, picking up any .txt files it finds. These are the blog posts that make up the content of the blog.

    By "picking up", I mean that it remembers where they are, what they are named and it reads their content. After this step, we don't touch the physical source files for the blog posts anymore.

  2. The blog posts are named and formatted according to a specific convention. A blog post about foxes might be named something like 2014-09-23-what-the-fox-say.txt. From this, the program figures out that the article was published September 23, 2014, and to get to it you have to visit http://example.com/what-the-fox-say. The non-date part of the filename is called the slug of the article, and is the thing you type in the URL to get to the article.

    The first line of the text in the blog post file is considered the title of the article, and the rest is the content. The program separates these parts out.

    At this point, the program also does some sanity checking of the blog posts as a whole. If a blog post is dated in the future, it is not yet really published and will be moved over to the "unpublished" list instead of "published" – even if it was found in the published directory (see the sketch after this list). The program also ensures there are no duplicate slugs, because if there are, it'll be impossible to view one of the posts…

  3. After the program has figured all of that out, it starts generating HTML (and XML in one case) for every page on the blog. This includes the About page, the index with all the articles, each individual article, and the Atom feed (that's the XML one).

  4. When it has generated all the HTML, it runs a second file-handling pass where it writes all the HTML out to a directory which can then be served by a web server. It also copies over all static content associated with the blog, ensuring access to images, stylesheets, JavaScripts and the like.
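
To make the step-2 date check concrete, here is a minimal sketch of how it could be done. The Post record and the function name are mine, for illustration only; the real program's types are richer.

import Data.List (partition)
import Data.Time (Day)

-- Hypothetical Post record, just enough for this sketch.
data Post = Post { postSlug :: String, postDate :: Day }

-- Posts dated after "today" count as unpublished, even if they
-- were found in the published directory.
splitByDate :: Day -> [Post] -> ([Post], [Post])
splitByDate today = partition ((<= today) . postDate)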

File I/O with turtle

I'll quickly mention the file handling bits, steps 1 and 4, together. I originally wrote the file handling code with a myriad of utilities like Data.Text.IO, System.Directory and others. It wasn't very nice code. It was good, but not nice. I finally "gave up" on trying to make it nice, and pulled in turtle instead.

Turtle is amazing.

Turtle is basically the regular shell scripting you're used to from bash, except in Haskell. If you're trying to make a tool to glue together different parts of a system, don't hesitate to use turtle. It really does feel like writing a shell script, except with access to more libraries, more sane logic and a safer environment.

Here's the function that finds and reads in all text files in a directory, recursively:

getFilesIn :: FilePath -> Shell (FilePath, Text)
getFilesIn directory = do
    filepath <- find (suffix ".txt") directory
    filetext <- strict (input filepath)
    return (filepath, filetext)

The function call to strict is there because turtle is sometimes a little too enthusiastic in streaming everything, so I force it to read in the whole file all at once instead of one line at a time.
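
Since getFilesIn returns a Shell stream that produces one (path, contents) pair per file, you eventually collect the results into a plain list. A minimal sketch using turtle's fold together with Control.Foldl (the loadArticles name is mine):

import Turtle
import qualified Control.Foldl as Fold

-- Run the stream to completion, collecting every pair into a list.
loadArticles :: FilePath -> IO [(FilePath, Text)]
loadArticles directory = fold (getFilesIn directory) Fold.list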

Another illustrative example checks if a directory exists and, if so, cleans it out:

echo "Writing out HTML to permanent storage..."
newExists <- testdir "site_new"
when newExists (rmtree "site_new")
mkdir "site_new"

If I didn't know about turtle, I'd be surprised to hear that's code written in a real programming language. I would have assumed some sort of shell scripting dialect.

Therein lies the power of turtle. I cannot stress enough how rare it is to find a tool that is convenient for hacking together a quick script yet still feels safe and not too dangerous.

Parsing with attoparsec

Parsing is one of those things I really disliked doing a few years ago. I like it when a problem is well-defined, limited in scope and can be explained with just a couple of examples or even a formal definition.

I'm comfortable with those kinds of problems. They're easy.

Parsing is not one of them. Parsing pretty much anything is laden with pitfalls. Doubly so if we assume the content to be parsed is generated by a human. It's not even that parsing correctly formatted data is difficult; that's the easy part. What is hard is rejecting invalid documents.

Let's just say my opinion of parsing changed when I learned about parser combinators. The basic idea of a parser combinator library is that you build tiny micro parsers for the smallest fragments of the data you want to parse, and then you combine those micro parsers into slightly larger ones. You can then combine the slightly larger ones into even larger ones.

It's a very natural approach to parsing when executed well. Parsers are easy to write that way, and more importantly easy to read and understand.

In Haskell, we have two popular parser combinator libraries: Parsec, which I assume came first, and attoparsec, which is loosely based on Parsec but makes slightly different tradeoffs.

  • Parsec is the "batteries included" option. It does all the things. It will give you detailed error messages, it can parse pretty much anything, and it has a bunch of convenient helper functions for common scenarios.

  • Attoparsec instead focuses on being fast. It will not parse as many kinds of data. It will not give you great error messages. But if you write your parser carefully, it can rival a hand-rolled C parser while still being much easier to read.

I actually prefer attoparsec for an odd reason: I like it precisely because it doesn't parse all kinds of data. It makes the type signatures easier to work with in my opinion. If I want to parse some Text, I just import Data.Attoparsec.Text and use the functions I need. I'm aware it won't parse ByteStrings unless I import a separate module for that, but it's rare that I want to parse many things in one module anyway.

The parser used in this blog to parse the filename of an article (which, as you remember from previous sections, is of the form 2014-09-23-what-the-fox-say.txt) is as simple as

-- (Name and signature added here for illustration; Slug is the
-- blog's own type for the URL fragment.)
filenameParser :: Parser (Day, Slug)
filenameParser = do
    day <- dateParser
    char '-'
    slug <- slugParser
    endOfInput <?> "invalid slug"
    return (day, slug)

First we parse the date and store the result in the day variable. Then we expect a dash, and then we parse the slug and put it in the slug variable. If, after that, we don't encounter the end of the input, then we know the slug contains invalid characters. (In a previous step we threw out the file extension, so the filename really should end after the slug at this point!)

If all of the above succeeded, we return a tuple of the date and slug.
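
The slugParser isn't shown here, but a minimal version could look like this – the exact character set it accepts is my guess:

import Data.Text (Text)
import Data.Attoparsec.Text

-- Accept one or more lowercase letters, digits or dashes.
slugParser :: Parser Text
slugParser = takeWhile1 (inClass "a-z0-9-") <?> "failed parsing slug"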

The cool thing about this is that dateParser and slugParser are in turn regular attoparsec parsers. The dateParser looks like

dateParser :: Parser Day
dateParser = do
    year <- decimal <?> "failed parsing year"
    char '-'
    month <- decimal <?> "failed parsing month"
    char '-'
    day <- decimal <?> "failed parsing day"
    maybe (fail "invalid date") return (fromGregorianValid year month day)

Most of this should be fairly self-explanatory. decimal is yet another attoparsec parser, one that reads in an integer. It's parsers all the way down when you're working with parser combinators.

The last line is a bit weird if you haven't done a lot of Haskell. It attempts to return a valid date value from the Y-m-d combination and errors out on the parsing if it can't.
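
For completeness: you run an attoparsec parser with parseOnly, which returns either an error message or the parsed value. A small usage sketch with the definitions above:

{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (parseOnly)

-- The file extension has already been stripped at this point.
parsed :: Either String (Day, Slug)
parsed = parseOnly filenameParser "2014-09-23-what-the-fox-say"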

Talk about self-documenting code! With parser combinators, your code – the parser you write – is basically a specification of the thing it's parsing.

Templating with Heist

Fun aside: last night I had so many problems getting Heist to do what I wanted it to do, I'm starting to wonder how many watchlists I ended up on by scouring the internet for help with heists!

Either way, I mentioned in an earlier article that I have very little experience with HTML templating in Haskell. I have used Yesod for a bit, and I looove the Hamlet templating language for HTML. Just syntactically, it's what HTML really should have been.

What's even cooler about Hamlet is that it is processed at compile time. So it runs the regular type checker from the Haskell compiler on your templates. You get compile-time warnings about missing variables, broken links, treating a number like a string and so on. That saves me at least an hour a day in diagnosing weird templating bugs. The drawback of it being compiled is that the templates have to be available when you compile the program. The way I want to do this site is to have a binary that reads in the templates anew every time I run it.

With Hamlet out, I decided to just pick the next option on the list: Heist. Heist is primarily used in the Snap web framework, which I've never used myself but heard good things about.

Choosing Heist may have been a mistake. It felt that way for the longest time, anyway. I can count the number of useful Heist tutorials I found on one hand. Well, on one finger, actually. Unfortunately, the only useful tutorial I found was two years old, so it referred to things that have changed a bit.

But now that I understand it, I'm starting to sorta-kinda like it. It's definitely better than the Django Template Language I'm used to from work, but it's not quite on par with Hamlet.

The idea with Heist is that you initialise it by specifying a few settings, such as where it should search for templates, and which "built-in" splices (explained in a second) should be available by default. The initialised Heist object can later be extended with further splices and then be used to render a template.

Splices are the only abstraction tool available in Heist templates. Splices are "HTML tags" you have invented, which may do anything you can do from Haskell code. For example, you could imagine having a splice named currentTime which returns the current server time. If that splice is available when a template is rendered, you can display the current server time in that template by writing <currentTime />.
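
Such a currentTime splice might look something like this – a sketch, since I only described it above (MonadIO is needed to read the clock):

import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.Text (pack)
import Data.Time (getCurrentTime)
import Heist.Interpreted (Splice, textSplice)

-- Render the current server time wherever <currentTime /> appears.
currentTime :: MonadIO n => Splice n
currentTime = do
    now <- liftIO getCurrentTime
    textSplice (pack (show now))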

So splices are basically HTML tags which run some Haskell code, and then potentially include other templates, or dynamically create more splices. As a more realistic example, the Heist template for the listing of articles on the index page on this blog looks like (after some cleanup)

<ol>
  <latestPosts>
    <li>
      <date><datestamp /></date>

      <a href="${entrySlug}"><entryTitle /></a>
    </li>
  </latestPosts>
</ol>

Here, latestPosts is a splice that loops over its own content and prints it as many times as it needs to. For each iteration, it dynamically creates the splices datestamp, entrySlug, and entryTitle to contain the relevant text for each post it loops through.

The latestPosts splice is defined in Haskell code as

latestPosts :: Monad n => Blog -> Splice n
latestPosts blog =
    flip mapSplices (published blog) $ \post ->
        runChildrenWithText $ do
            "entryTitle" ## title post
            "datestamp" ## pack (show (datestamp post))
            "entrySlug" ## fromSlug (slug post)

This means that for each post in published blog, it "runs the children" (which is really funny terminology for some reason) – in other words, renders its own content, with the text splices specified.

A quick word about splices and text splices. A splice is a monadic action of some sort. I don't particularly care for the details. This means that a splice can go to the database and fetch content, it can call your grandmother and ask for content – anything, it seems.

Often you just want to display some text you already know, and for that there is a convenience function textSplice which takes some text and creates a monadic splice value that just returns that text. This pattern appears in many places: in the code I just showed, I use runChildrenWithText and not the regular runChildrenWith, which lets you define more general splices to include when you "run the children".
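
For reference, the two differ only in what the splices map contains – these are their signatures as I understand them:

runChildrenWithText :: Monad n => Splices Text        -> Splice n
runChildrenWith     :: Monad n => Splices (Splice n)  -> Splice n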

I'll also mention the difference between Splice n and Splices (Splice n). A regular Splice n is just some Haskell code that does something. Maybe it calls your grandmother or maybe it just returns some pre-determined text. But! To be able to call it from a template, you need to bind a name to it. This is done by extending the Heist object. Once you have done

newHeistObject = bindSplice "callGM" callGrandMa heist

you can render a template with newHeistObject, and if you use the <callGM /> tag in that template, it will call your grandma. Note how the tag name is different from the splice name (which is just a regular Haskell value).

Since it's common to want to bind many splices, there's a convenience function bindSplices which binds multiple splices in one go. To use it, you need to give it a Splices (Splice n) value. That value contains multiple splices and their tag names. The easiest way to construct such a value is through the do notation provided. You saw me use it in the previous code sample. The double-hashes connect a tag name to a splice. In my case it was text splices, but it can be any splice really.
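
Putting the pieces together, binding several splices at once could look like this sketch (the siteName splice is made up; latestPosts is the one from earlier):

import Data.Map.Syntax ((##))
import Heist (HeistState)
import Heist.Interpreted (bindSplices, textSplice)

-- The string left of ## becomes the tag name usable in templates.
withBlogSplices :: Monad n => Blog -> HeistState n -> HeistState n
withBlogSplices blog = bindSplices $ do
    "siteName"    ## textSplice "two-wrongs"
    "latestPosts" ## latestPosts blog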

Conclusion

In the end, though, I'm happy with my technology choices. Turtle is really convenient. Attoparsec is good and fun to use. Heist does its job really well once you understand how it works.

This is why I can only laugh when people perpetuate the myth that Haskell is only for academia, or that it's not good for making real programs solving real problems. That might have been true 10 or 20 years ago, but it's far from true these days. In fact, when I'm working in other languages I often find myself thinking, "If only I had access to Haskell library so-and-so now. That'd make my life so much easier."

Luckily, the world-class libraries that started with Haskell are slowly seeping out into other languages. Parser combinators and Parsec clones of varying degrees of quality are available for other languages. Property testing libraries like QuickCheck have been ported as well.

But when I want something newer that's not been ported yet (like lenses), or I want to know it's well built, back to Haskell I go. Haskell has a knack for being the first to get good libraries.