Static Generation with Haskell
So now that this blog is statically generated, I thought I might go into a bit more detail on how that is accomplished. Most of it is probably pretty boring if you know this stuff, so I'm going to try to put a Haskell spin on it.
If you're familiar with the idea of static generation, you might want to skip past the Files and Stages sections, straight to the section on File I/O.
Files
There are four kinds of files involved in generating this blog. Again, if you know your way around web development this is probably not new, but if you don't, here are some words I'll use:
- Articles, which are text files describing each blog post;
- Code, which is the actual program that takes blog posts from text files and creates HTML files for the entire blog out of them;
- Templates, which are ways of telling the program where it should insert the content; and
- Static files, which are things like style sheets, images, and other things that can be included in templates or articles and are served as-is by the server.
There is actually a fifth kind of file involved in the generation of this blog, but it's so silly it doesn't deserve its own bullet point. I discovered when I made the templates that some things (like the URL http://two-wrongs.com/) popped up very often, and to make it easy to change those things all at once, I put them in a configuration file. The first thing the program does when it starts up is read that configuration file and store the parameters somewhere the rest of the program can access them. The configuration file is not touched after that, so I'll disregard it here too.
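To give you an idea, the configuration doesn't need to be anything fancier than a small record. Here is a hypothetical sketch; the field names are made up for illustration and aren't taken from this blog's actual source:

import Data.Text (Text)

-- A hypothetical configuration record; the field names are made up.
data Config = Config
    { baseUrl   :: Text  -- e.g. "http://two-wrongs.com/"
    , blogTitle :: Text
    } deriving Show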
The templates might sound like an odd one if you're not used to web development. The idea of having templates as a separate concept is that you don't have to write HTML inside your code. You can keep the representation of your blog (the HTML code that makes it up) completely separate from the code that generates the blog. So if you have many blogs that look different, you can use the same code to generate them but based off of different templates.
You could also generate many kinds of views for your blog – for example you can render the same content into pure text files and HTML pages and PDF documents. Just swap out the templates!
This is really just an extension of the idea that keeps CSS separate from HTML.
Stages
There are also four basic stages involved in generating this blog. When the command to regenerate the blog is run, these things happen, in this order (again, ignoring the config file...):
1. The program hunts through the `drafts` and `published` directories, picking up any `.txt` files it finds. These are the blog posts that are the content of the blog. By "picking up", I mean that it remembers where they are and what they are named, and it reads their content. After this step, we don't touch the physical source files for the blog posts anymore.
2. The blog posts are named and formatted according to a specific convention. A blog post about foxes might be named something like `2014-09-23-what-the-fox-say.txt`. From this, the program figures out that the article was published September 23, 2014, and that to get to it you have to visit `http://example.com/what-the-fox-say`. The non-date part of the filename is called the slug of the article, and it is the thing you type in the URL to get to the article. The first line of the text in the blog post file is considered the title of the article, and the rest is the content. The program separates these parts out. At this point, the program also does some sanity checking of the blog posts as a whole. If a blog post is dated in the future, it is not yet really published and will be moved over to the "unpublished" list instead of "published" – even if it was found in the `published` directory. The program also ensures there are no duplicate slugs, because if there were, it would be impossible to view one of the posts.
3. After the program has figured all of that out, it starts generating HTML (and XML in one case) for every page on the blog. This includes the About page, the index with all the articles, each individual article, and the Atom feed (that's the XML one).
4. When it has generated all the HTML, it runs a second file-handling pass where it writes all the HTML out to a directory which can then be served by a web server. It also copies over all static content associated with the blog, ensuring access to images, stylesheets, JavaScript and the like.
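To make the four stages concrete, here is a rough sketch of how a `main` function could wire them together. Every name below is a hypothetical stand-in with an `undefined` body, there only to outline the flow of data; none of it is this blog's actual code.

import Data.Text (Text)

-- Hypothetical stand-ins for the four stages; the bodies are left
-- undefined because this only outlines the flow of data.
data Post = Post
data Page = Page

readPostFiles :: [FilePath] -> IO [(FilePath, Text)]
readPostFiles = undefined   -- stage 1: find and read the .txt files

parsePost :: (FilePath, Text) -> Post
parsePost = undefined       -- stage 2: filename convention, title split, sanity checks

generatePages :: [Post] -> [Page]
generatePages = undefined   -- stage 3: render every page to HTML (and the Atom XML)

writePages :: FilePath -> [Page] -> IO ()
writePages = undefined      -- stage 4: write output where a web server can serve it

main :: IO ()
main = do
    posts <- readPostFiles ["drafts", "published"]
    writePages "site_new" (generatePages (map parsePost posts))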
File I/O with turtle
I'll quickly mention the file handling bits, steps 1 and 4, together. I originally wrote the file handling code with a myriad of utilities like `Data.Text.IO`, `System.Directory` and others. It wasn't very nice code. It was good, but not nice. I finally "gave up" on trying to make it nice, and pulled in `turtle` instead.
Turtle is amazing.
Turtle is basically the regular shell scripting you're used to from bash, except in Haskell. If you're trying to make a tool to glue together different parts of a system, don't hesitate to use turtle. It really does feel like writing a shell script, except with access to more libraries, saner logic, and a safer environment.
Here's the function that finds and reads in all text files in a directory, recursively:
getFilesIn :: FilePath -> Shell (FilePath, Text)
getFilesIn directory = do
    -- find streams every path under 'directory' matching the pattern
    filepath <- find (suffix ".txt") directory
    -- strict forces the whole file into memory as a single Text value
    filetext <- strict (input filepath)
    return (filepath, filetext)
The function call to `strict` is there because turtle is sometimes a little too enthusiastic about streaming everything, so I force it to read in the whole file all at once instead of one line at a time.
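Since `getFilesIn` returns a `Shell`, you still need to run it to get at the results. One way, sketched below on the assumption that you want all the posts in memory as a plain list, is to collect the stream with turtle's `fold`:

{-# LANGUAGE OverloadedStrings #-}
import qualified Control.Foldl as Fold
import Turtle

-- A sketch of one way to consume getFilesIn: run the Shell stream
-- and collect every (path, contents) pair into an in-memory list.
readAllPosts :: IO [(FilePath, Text)]
readAllPosts = fold (getFilesIn "published") Fold.list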
Another illustrative example checks whether a directory exists and, if so, cleans it out:
echo "Writing out HTML to permanent storage..."
newExists <- testdir "site_new"
when newExists (rmtree "site_new")
mkdir "site_new"
If I didn't know about turtle, I'd be surprised to hear that's code written in a real programming language. I would have assumed some sort of shell scripting dialect.
Therein lies the power of turtle. I cannot stress enough how rare it is to find a tool that is convenient for hacking together a quick script yet still feels safe rather than dangerous.
Parsing with attoparsec
Parsing is one of those things I really disliked doing a few years ago. I like it when a problem is well-defined, limited in scope and can be explained with just a couple of examples or even a formal definition.
I'm comfortable with those kinds of problems. They're easy.
Parsing is not one of them. Parsing pretty much anything is laden with pitfalls. Doubly so if we assume the content to be parsed is generated by a human. It's not even that parsing correctly formatted data is difficult; that's the easy part. What is hard is rejecting invalid documents.
Let's just say my opinion of parsing changed when I learned about parser combinators. The basic idea of a parser combinator library is that you build tiny micro parsers for the smallest fragments of the data you want to parse, and then you combine those micro parsers into slightly larger ones. You can then combine the slightly larger ones into even larger ones.
It's a very natural approach to parsing when executed well. Parsers are easy to write that way, and more importantly easy to read and understand.
In Haskell, we have two popular parser combinator libraries: Parsec, which came first, and attoparsec, which is loosely based on Parsec but makes slightly different tradeoffs.
Parsec is the "batteries included" option. It does all the things: it gives you detailed error messages, it can parse pretty much anything, and it has a bunch of convenient helper functions for common scenarios.
Attoparsec instead focuses on being fast. It will not parse as many kinds of input, and it will not give you great error messages. But if you write your parser carefully, it can rival a hand-rolled parser in C while still being much easier to read.
I actually prefer attoparsec for an odd reason: I like it precisely because it doesn't parse all kinds of data. It makes the type signatures easier to work with, in my opinion. If I want to parse some `Text`, I just import `Data.Attoparsec.Text` and use the functions I need. I'm aware it won't parse `ByteString`s unless I import a separate module for that, but it's rare that I want to parse many things in one module anyway.
The parser used in this blog to parse the filename of an article (which, as you remember from previous sections, is of the form `2014-09-23-what-the-fox-say.txt`) is as simple as
day <- dateParser
char '-'
slug <- slugParser
endOfInput <?> "invalid slug"
return (day, slug)
First we parse the date and store the result in the `day` variable. Then we expect a dash, and then we parse the slug and put it in the `slug` variable. If, after that, we don't encounter the end of the input, then we know the slug contains invalid characters. (In a previous step we threw out the file extension, so the filename really should end after the slug at this point!)
If all of the above succeeded, we return a tuple of the date and slug.
The cool thing about this is that `dateParser` and `slugParser` are in turn regular attoparsec parsers. The `dateParser` looks like
year <- decimal <?> "failed parsing year"
char '-'
month <- decimal <?> "failed parsing month"
char '-'
day <- decimal <?> "failed parsing day"
maybe (fail "invalid date") return (fromGregorianValid year month day)
Most of this should be fairly self-explanatory. `decimal` is yet another attoparsec parser, one that reads in an integer. It's parsers all the way down when you're working with parser combinators.
The last line is a bit weird if you haven't done a lot of Haskell. It attempts to return a valid date value from the `Y-m-d` combination and fails the parse if it can't.
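The `slugParser` never made it into this article, so to round out the picture, here is my guess at a minimal version of it, together with the wrapping the fragments above would need to run standalone. The function names and the exact character class are my assumptions, not necessarily what this blog actually uses:

import Data.Attoparsec.Text
import Data.Text (Text)
import Data.Time.Calendar (Day, fromGregorianValid)

-- Assumed: a slug is one or more lowercase letters, digits or dashes.
slugParser :: Parser Text
slugParser = takeWhile1 (inClass "a-z0-9-") <?> "slug"

dateParser :: Parser Day
dateParser = do
    year  <- decimal <?> "failed parsing year"
    _     <- char '-'
    month <- decimal <?> "failed parsing month"
    _     <- char '-'
    day   <- decimal <?> "failed parsing day"
    maybe (fail "invalid date") return (fromGregorianValid year month day)

-- The filename parser from earlier, given a made-up name.
filenameParser :: Parser (Day, Text)
filenameParser = do
    day  <- dateParser
    _    <- char '-'
    slug <- slugParser
    endOfInput <?> "invalid slug"
    return (day, slug)

-- Run it to completion with parseOnly:
--   parseOnly filenameParser "2014-09-23-what-the-fox-say"
--   == Right (2014-09-23,"what-the-fox-say")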
Talk about self-documenting code when it comes to parser combinators! Your code – the parser you write – is basically a specification of the thing it's parsing.
Templating with Heist
Fun aside: last night I had so many problems getting Heist to do what I wanted it to do, I'm starting to wonder how many watchlists I ended up on by scouring the internet for help with heists!
Either way, I mentioned in an earlier article that I have very little experience with HTML templating in Haskell. I have used Yesod for a bit, and I looove the Hamlet templating language for HTML. Just syntactically, it's what HTML really should have been.
What's even cooler about Hamlet is that it is processed at compile time, so the regular type checker from the Haskell compiler runs on your templates. You get compile-time warnings about missing variables, broken links, treating a number like a string, and so on. That saves me at least an hour a day in diagnosing weird templating bugs. The drawback of it being compiled is that the templates have to be available when you compile the program. The way I want to do this site is to have a binary that reads in the templates anew every time I run it.
With Hamlet out, I decided to just pick the next option on the list: Heist. Heist is primarily used in the Snap web framework, which I've never used myself but heard good things about.
Choosing Heist may have been a mistake. It felt that way for the longest time, anyway. I can count the number of useful Heist tutorials I found on one hand. Well, on one finger, actually. Unfortunately, the only useful tutorial I found was two years old, so it referred to things that have changed a bit.
But now that I understand it, I'm starting to sorta-kinda like it. It's definitely better than the Django Template Language I'm used to from work, but it's not quite on par with Hamlet.
The idea with Heist is that you initialise it by specifying a few settings, such as where it should search for templates and which "built-in" splices (explained in a second) should be available by default. The initialised Heist object can later be extended with further splices and then be used to render a template.
Splices are the only abstraction tool available in Heist templates. Splices are "HTML tags" you have invented, which may do anything you can do from Haskell code. For example, you could imagine having a splice named `currentTime` which returns the current server time. If that splice is available when a template is rendered, you can display the current server time in that template by writing `<currentTime />`.
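A splice like that could be implemented along these lines. This is my sketch using the interpreted splice API, not code from this blog; I'm assuming `Splice` and `textSplice` from `Heist.Interpreted`:

import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.Text (pack)
import Data.Time.Clock (getCurrentTime)
import Heist.Interpreted (Splice, textSplice)

-- Sketch: fetch the server time in IO, splice it in as plain text.
currentTime :: MonadIO n => Splice n
currentTime = do
    now <- liftIO getCurrentTime
    textSplice (pack (show now))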
So splices are basically HTML tags which run some Haskell code, and then potentially include other templates or dynamically create more splices. As a more realistic example, the Heist template for the listing of articles on the index page of this blog looks like (after some cleanup)
<ol>
<latestPosts>
<li>
<date><datestamp /></date>
<a href="${entrySlug}"><entryTitle /></a>
</li>
</latestPosts>
</ol>
Here, `latestPosts` is a splice that loops over its own content and prints it as many times as it needs to. For each iteration, it dynamically creates the splices `datestamp`, `entrySlug`, and `entryTitle` to contain the relevant text for each post it loops through.
The `latestPosts` splice is defined in Haskell code as
latestPosts :: Monad n => Blog -> Splice n
latestPosts blog =
    flip mapSplices (published blog) $ \post ->
        runChildrenWithText $ do
            "entryTitle" ## title post
            "datestamp"  ## pack (show (datestamp post))
            "entrySlug"  ## fromSlug (slug post)
Which means that for each post in `published blog`, it "runs the children" (which is really funny terminology for some reason), or in other words, renders the content of itself with the text splices specified.
A quick word about splices and text splices. A splice is a monadic action of some sort. I don't particularly care for the details. This means that a splice can go to the database and fetch content, it can call your grandmother and ask for content – anything, it seems.
Often you just want to display some text you already know, and for that there is a convenience function `textSplice` which takes some text and creates a splice value that just returns that text. This pattern appears in many places: in the code I just showed, I used `runChildrenWithText` rather than the plain `runChildrenWith`, which lets you define more general splices to include when you "run the children".
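In other words, as far as I understand it, the earlier splice could equivalently be written with the general function, wrapping each value in `textSplice` by hand:

runChildrenWith $ do
    -- same as the runChildrenWithText version, but each value is
    -- wrapped into a full splice explicitly
    "entryTitle" ## textSplice (title post)
    "datestamp"  ## textSplice (pack (show (datestamp post)))
    "entrySlug"  ## textSplice (fromSlug (slug post))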
I'll also mention the difference between `Splice n` and `Splices (Splice n)`. A regular `Splice n` is just some Haskell code that does something. Maybe it calls your grandmother or maybe it just returns some pre-determined text. But! To be able to call it from a template, you need to bind a name to it. This is done by extending the Heist object. Once you have done
newHeistObject = bindSplice "callGM" callGrandMa heist
you can render a template with `newHeistObject`, and if in that template you use the `<callGM />` tag, it will call your grandma. Note how the tag name is different from the splice name (which is just a regular Haskell value).
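For completeness, here is a sketch of what rendering with the extended Heist object might look like using the interpreted renderer. The template name, the output path, and the pre-existing `heist` and `callGrandMa` values are all assumptions:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import Blaze.ByteString.Builder (toByteString)
import Heist.Interpreted (bindSplice, renderTemplate)

-- Sketch: bind the splice, render the "index" template, write out
-- the result. Assumes an already-initialised HeistState called
-- 'heist' and a splice called 'callGrandMa'.
renderIndex :: IO ()
renderIndex = do
    let newHeistObject = bindSplice "callGM" callGrandMa heist
    rendered <- renderTemplate newHeistObject "index"
    case rendered of
        Nothing -> error "index template not found"
        Just (builder, _mimeType) ->
            BS.writeFile "site_new/index.html" (toByteString builder)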
Since it's common to want to bind many splices, there's a convenience function `bindSplices` which binds multiple splices in one go. To use it, you need to give it a `Splices (Splice n)` value. That value contains multiple splices and their tag names. The easiest way to construct such a value is through the `do` notation provided; you saw me use it in the previous code sample. The double hashes (`##`) connect a tag name to a splice. In my case they were text splices, but it can be any splice really.
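Continuing the grandma example, a sketch (both splices here are ones I made up earlier):

mySplices :: MonadIO n => Splices (Splice n)
mySplices = do
    "callGM"      ## callGrandMa
    "currentTime" ## currentTime

-- bind them all in one go instead of chaining bindSplice calls
newHeistObject = bindSplices mySplices heist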
Conclusion
In the end, though, I'm happy with my technology choices. Turtle is really convenient. Attoparsec is good and fun to use. Heist does its job really well once you understand how it works.
This is why I can only laugh when people perpetuate the myth that Haskell is only for academia, or that it's not good for making real programs solving real problems. That might have been true 10 or 20 years ago, but it's far from true these days. In fact, when I'm working in other languages I often find myself thinking, "If only I had access to Haskell library so-and-so now. That'd make my life so much easier."
Luckily, the world-class libraries that started with Haskell are slowly seeping out into other languages. Parser combinators and Parsec clones of varying quality are available for other languages. Property testing libraries like QuickCheck have been ported as well.
But when I want something newer that's not been ported yet (like lenses), or I want to know it's well built, back to Haskell I go. Haskell has a knack for being the first to get good libraries.