Entropic Thoughts

Parser Combinators: Parsing for Haskell Beginners

Parser Combinators: Parsing for Haskell Beginners

Parsing is something every programmer does, all the time. Often, you are lucky, and the data you receive is structured according to some standard like json, xml … you name it. When it is, you just download a library for converting that format into native data types, and call it a day.

Sometimes, you are not quite so lucky. Sometimes you get data in an unstructured, badly documented “miniformat”, such as the various ways in which people write phone numbers, license plates and social security numbers, the output of command line interfaces or systematically named files on the file system. Sometimes you’re actually dealing with a standard format, but you have no parser for it because it’s not very popular or well-known. Or maybe you’re reading input from the user and you want a relatively user-friendly format for it.

This is when you need to write a parsing routine of some sort, and there are a few ways of doing it. In Haskell, we prefer using parser combinators. I’ll take a couple of minutes to show you why. If you already know why it’s important to learn parser combinators, feel free to skip down to the heading ReadP.

Parsing – the Ugly

When I just started out with programming, I tended to roll my own parsing routine by splitting strings on keywords, and comparing to known values. For example, metar reports (an international semi-standard format for reporting conditions on airports, such as weather, cloud layers, humidity and such) can look like

BIRK 281500Z 09014KT CAVOK M03/M06 Q0980 R13/910195

Here, the third “word”, i.e. 09014KT contains information about the wind. If I want to extract the wind speed (14, for now ignoring the unit knots), I might do something like

windSpeed :: String -> Maybe Int
windSpeed windInfo =
    let
        -- remove the wind direction 
        speed = drop 3 windInfo
    in
        -- remove the "knots" unit and read number
        readMaybe (take (length speed - 2) speed)

Ignoring the fact that this is pretty hard to read after a few months1 Why does this number 2 pop up? What is the significance of the length of the speed value? this is also not very stable code. Some metar reports specify wind speed in m/s, and look what happens when we feed that to our function:

λ> windSpeed "09007MPS"
Nothing

Why does it return Nothing? Well, one of our magic constants, one of the ~2~s in the code, does not apply to wind speed numbers stated in m/s. Our code was specific to knots. Whoops. You can of course work around this by checking what the first letter of the wind speed string is but at this point it’s getting fairly complicated already.

This is clearly not a good approach.

Regexes

When I had evolved slightly in my programming ability, I learned about regexes. Oh those little wonderful concise soups of characters that do magic. Surely, you can use them to do a bit of parsing?

Sure you can. Here’s the new version of our windSpeed function.

windSpeed :: String -> Maybe Int
windSpeed report =
    case matchRegex "[0-9]{3}([0-9]{2,3})(KT|MPS)" report of
        Just (speed, unit) -> readMaybe speed
        Nothing -> Nothing

This is actually better than the previous code for a couple of reasons: first of all, it doesn’t choke on measurements in m/s. Secondly, it extracts the unit which can be (but is not, for didactic purposes) used to convert the speed value into a more useful, standardised measurement.

However, it does suffer some of the same problems as the previous code too. Most noticeably, it is hard to maintain. The regex itself is indeed a soup of characters and separating out which does what can be difficult.

Parser Combinators

This is where parser combinators come in. They are roughly as easy to write as regexes, but they are much more maintainable. They make it convenient to write a proper parser. The windSpeed function looks slightly different with those.

windSpeed :: String -> Maybe Int
windSpeed windInfo =
    parseMaybe windSpeedParser windInfo

windSpeedParser :: ReadP Int
windSpeedParser = do
    direction <- numbers 3
    speed <- numbers 2 <|> numbers 3
    unit <- string "KT" <|> string "MPS"
    return speed

For this small example, this is definitely “more code”. However, look at how easy it is to read the windSpeedParser. Even if you have no prior experience with parser combinators, you should be able to sort of read that. First comes the direction, which is three numbers. It is followed by the speed, which is either two or three numbers. Finally there’s the unit, which is either KT or MPS. From this, the speed is returned. The parser itself is basically a description of what it parses!

In this small example, you might not realise how cool this is, but when it comes to parsing larger things it is immensely helpful to be able to just dive in and read the parser to figure out what it does. Parser combinator based parsers scale extremely well to larger tasks.

ReadP

The Haskell standard library comes with a small but competent parser generator library: Text.ParserCombinators.ReadP. Whenever you need to write your own parser to consume some kind of data, this is what you should reach for first. Forget splitting strings. Forget regexes. ReadP is where it is at from now on.2 There are a bunch of parser combinator libraries for Haskell, including but not limited to Attoparsec, Parsec, Megaparsec, Turtle.Pattern and Earley. These are all good for different things. The reason I recommend ReadP is that it’s good, of course, but also that if you have ghc installed, you already have ReadP on your computer! Oh, but you’d rather want to learn Parsec or some other more “batteries included” parser combinator library? No problem, the things you learn in this tutorial are things you can use in other parser combinator libraries too. In fact, I’ve decided to try to teach only functions and operators which have the same name across multiple parser combinator libraries. So you can totally read this tutorial and then use the things you learned with Parser or Attoparsec or something along those lines.

To get acquianted with parser combinators, let’s start with the simplest parser I can think of: we’ll parse a single vowel.

import Text.ParserCombinators.ReadP

isVowel :: Char -> Bool
isVowel char =
    any (char ==) "aouei"

vowel :: ReadP Char
vowel =
    satisfy isVowel

We make the helper function isVowel which simply returns True for any character that is a vowel. It does this by checking if the argument character is equal to any character in the string "aouei".

isVowel is then used in the parser we name vowel, through the satisfy function from the ReadP library, one of our staples. This function is important, so lets look at its type signature.

satisfy :: (Char -> Bool) -> ReadP Char

It takes any function Char -> Bool and returns a parser that parses any character that passes the test function we give it. In this case, we give it isVowel, so it will return a parser that parses a single vowel. You could just as well imagine the satisfy isDigit parser that parses a single digit. Or a satisfy (== ' ') parser that parses only a single space character and will fail on anything else.

Oh, and in case it is not evident yet, a value of type ReadP Char is a parser that parses characters and returns a Char value. A parser of type ReadP Float also parses characters (all parsers do) but returns a Float type value. Any time you see the type ReadP something, you can internally read it as “parser of something”.

However, a parser is not itself a function that takes input. It needs to be “run” on some input by another function. In the case of ReadP, this is done by the confusingly named readP_to_S function, which takes a parser and an input and runs the parser on the input. We can test our vowel parser with that. This is its type signature, when it has been “demystified”:

readP_to_S :: ReadP a -> String -> [(a, String)]

The output of readP_to_S might look a little odd at first, but by looking at several examples of it you will get a sense of what it means. In essence, readP_to_S returns a list of successful parses, where “a parse” loosely means the two-tuple (parsedValue, unparsedRemainderOfString). If the parser fails (i.e. could not parse anything at the beginning of the input) it will return the empty list. In action:

λ> readP_to_S vowel "e"
[('e',"")]
λ> readP_to_S vowel "k"
[]
λ> readP_to_S vowel "another one bites the dust"
[('a',"nother one bites the dust")]
λ> readP_to_S vowel "did you see that"
[]

The first element of the tuple is the successful parse, the second element of the tuple is the unparsed remainder of the string.

If the string does not start with a vowel, the parser fails entirely. The parser will not automatically skip irrelevant characters, but leaves that decision up to the one who writes the parser. This greater control, while sometimes inconvenient, is normally useful.

However, reading just one vowel is not as interesting as reading several of them. Since readP_to_S returns the unparsed remainder of the input, we can imagine writing a function to chain together parsers.

atLeastOne :: ReadP Char -> String -> [(String, String)]
atLeastOne parser input =
    case readP_to_S parser input of

        -- Empty list means failed parse, so this parser
        -- should fail too
        [] -> []

        -- Successfully parsed at least one character, so
        -- try parsing a few more by recursively calling
        -- atLeastOne
        [(char, remainder)] ->
            case atLeastOne parser remainder of

                -- After a successful parse, it failed when
                -- trying to do it again. Return the single
                -- successful parse
                [] -> [(char:"", remainder)]

                -- The recursive call was successful. Append
                -- our results to the rest of them, and return
                -- whatever is left of the input
                [(str, finalRemainder)] ->
                    [(char:str, finalRemainder)]

While this works, as demonstrated below, it is a very bad idea.

λ> atLeastOne vowel "aouibcdef"
[("aoui","bcdef")]
λ> atLeastOne vowel "gjshifu"
[]

Why is atLeastOne not good? For one, it is brittle and not quite following the expectations we have of parser combinators, but moreover it is hugely inconvenient to write, and not very clear at all when trying to read it later.

This is where the combinator part of parser combinators come in. Our atLeastOne function dealt with parsed results, while the combinator functions we want to use work with parsers.

For instance, there is the many1 combinator function in Text.ParserCombinators.ReadP which does exactly what we want. The type signature of it looks like

many1 :: ReadP a -> ReadP [a]

In other words, it takes a parser that parses a single a (which in our case is Char) and returns a parser that parses several ~a~s. By “several”, I mean at least one, but potentially infinitely many.

With this, we can create

atLeastOneVowel :: ReadP [Char]
atLeastOneVowel =
    many1 vowel

and behold! This might not be what you expected.

λ> readP_to_S atLeastOneVowel "aouibcdef"
[("a","ouibcdef")
,("ao","uibcdef")
,("aou","ibcdef")
,("aoui","bcdef")]

Now we see why readP_to_S returns a list. “At least one vowel” can mean just one vowel. It can also mean two, or three, or four of them. So many1 accounts for these possibilities by simply giving back all possible parses, and lets you pick whichever one you wanted.

This may look problematic, but it turns out that often it does not matter, because most of the time there is only one possible parse anyway.

New here? I intend to write more about Haskell patterns in the coming few months (although this is not a promise). You should subscribe to receive weekly summaries of new articles by email. If you don't like it, you can unsubscribe any time.

metar

Going back to our initial example of the metar reports, we can now actually start parsing them, believe it or not! The first word of a metar report is the icao code name of the airport it was sent from. This is at least one upper case letter.

airport :: ReadP String
airport =
    many1 (satisfy (\char -> char >= 'A' && char <= 'Z'))
λ> readP_to_S airport "BIRK 281500Z 09014KT CAVOK M03/M06 Q0980 R13/910195"
[("B","IRK 281500Z 09014KT CAVOK M03/M06 Q0980 R13/910195")
,("BI","RK 281500Z 09014KT CAVOK M03/M06 Q0980 R13/910195")
,("BIR","K 281500Z 09014KT CAVOK M03/M06 Q0980 R13/910195")
,("BIRK"," 281500Z 09014KT CAVOK M03/M06 Q0980 R13/910195")
]

These upper-case letters are then followed by a space. So far, we’ve not constructed a parser that says, “parse this, and then parse this.” To be able to do that, we can exploit the fact that parsers are a kind of monad, and write them with do syntax.

airport :: ReadP String
airport = do
    code <- many1 (satisfy (\char -> char >= 'A' && char <= 'Z'))
    satisfy (== ' ')
    return code

We save the result of the many1 parser in the code variable, then we parse a single space character, and then return the ica code we parsed. Keep in mind that this is the Haskell return, which doesn’t actually “return” anything from functions; Haskell automatically returns the last value of a function so you don’t have to specify it. Instead, in the context of parsers, you can view return as

return :: a -> ReadP a

This may not tell you much, but what it means is that return is a function that takes a value and returns a parser that parses nothing, but still gives you back a valid parse with the value you specified to begin with. Something like return 4 is a parser that will succeed regardless of what input you give it, and will always “parse” the number 4 from the input, leaving the input untouched.

Inspection

The next part of the metar report is a time and date stamp. The format is

<day of month><hours><minutes>Z

This means we need to parse 2-digit numbers! First, how do we parse a single digit? You probably guessed it already:

digit :: ReadP Char
digit =
    satisfy (\char -> char >= '0' && char <= '9')

Now how do we parse two digits? We use the power of combinators! The ReadP module has a function count, which will run the parser you specify exactly n times in sequence. To parse a two digit number, you can create the parser count 2 digit.

Our entire timestamp parser is, at this point,

timestamp :: ReadP (Int, Int, Int)
timestamp = do
    day <- count 2 digit
    hour <- count 2 digit
    minute <- count 2 digit
    string "Z "
    return (read day, read hour, read minute)

Here, we used the string function which matches the exact string you specify. That’s a convenience where instead of writing

satisfy (== 'H')
satisfy (== 'e')
satisfy (== 'l')
satisfy (== 'l')
satisfy (== 'o')

you can just write string "Hello". Now, string exists in the ReadP module, but what if it didn’t? How would you create a function that takes a string and parses all characters in the string in sequence? That’s a good exercise to really understand parser combinators in Haskell. Try to figure it out now, and if you can’t, come back later and try again!3 Hints: the easy solution uses explicit recursion, the hard solution uses Data.Traversable, exploiting the fact that parsers are applicatives and strings are traversables.

But wait! I hear you say. You’re using read here. Isn’t that… like… bad? It can crash your program and stuff? Normally, yeah, you’re right. But in this case we know it’s being handed a string of digits, which we know it can convert to an Int, so no worries!

However, there are a few other things wrong with this function. I’ll focus on one of them. We said the first number is the day of month the report was created. Should the following really happen?

λ> readP_to_S timestamp "888990Z "
[((88,89,90),"")]

What day of month is the 88th? And when in the day is the time 89:90? Clearly, we need to check the values we parse and make sure they’re not bonkers. In preparation, we’ll make a small change to our timestamp parser:

timestamp :: ReadP (Int, Int, Int)
timestamp = do
    day <- fmap read (count 2 digit)
    hour <- fmap read (count 2 digit)
    minute <- fmap read (count 2 digit)
    string "Z "
    return (day, hour, minute)

All we’ve done here is move the call to read higher up. If you recall, a functor is something we can “map” a function over, and a parser is a functor! This means we can fmap read over a parser and it will run read over the parsed results. If you’re a type system kind of person, you might prefer the following one-line explanation instead:

fmap read :: Parser String -> Parser Int

Running read directly over the parsed results has the benefit that now the variables day, hour and minute are integers, so we can check if they’re too big or small! This becomes more apparent if we create a function for the repeated code:

timestamp :: ReadP (Int, Int, Int)
timestamp = do
    day <- numbers 2
    hour <- numbers 2
    minute <- numbers 2
    string "Z "
    return (day, hour, minute)

numbers :: Int -> ReadP Int
numbers digits =
    fmap read (count digits digit)

If you are not comfortable with functors yet, you may prefer the following, equivalent, definition of numbers:

numbers :: Int -> ReadP Int
numbers digits = do
    parse <- count digits digit
    return (read parse)

We can now check the validity of the numbers we parse:

timestamp :: ReadP (Int, Int, Int)
timestamp = do
    day <- numbers 2
    hour <- numbers 2
    minute <- numbers 2
    string "Z "
    if day < 1 || day > 31 || hour > 23 || minute > 59 then
        pfail
    else
        return (day, hour, minute)

This should be fairly self-explainatory. The day number shouldn’t be less than one or more than 31 – if it is, we return the pfail parser which is a parser that will always fail for any input. If all numbers are within specification, we return them. We can test that this is working in the interpreter:

λ> readP_to_S timestamp "888990Z "
[]
λ> readP_to_S timestamp "302359Z "
[((30,23,59),"")]

The invalid timestamp cannot be parsed, but the valid timestamp is parsed just fine!

At this point, you might feel like it’s silly that we return a utc timestamp as a tuple of three numbers and you’re right. We really should return this as a value of the UTCTime type. However, since the metar report only contains the day of month, and not even the year in which the report was issued, you get into a complication where you may need to have a dummy value for the year and month and then later on replace those with the correct values. This is out of scope of this tutorial, but if you wish to try it as an exercise, I encourage you to familiarise yourself with the Haskell time library and then dive into it!

This or That Parsers

The next part of a metar report is the wind information: wind speed and direction. Wind information comes in three parts: the first three digits are the wind direction, in degrees. The next two or three digits are the wind speed. This is followed by either “_kt_” or “_mps_” which signify that the wind speed was given in either knots or metres per second.

We have a lot of “or” here; something can be either this or that. Fortunately, ReadP makes this simple for us. The <|> operator lets us specify that we want the parser to try several different sub-parsers, and return the first successful one. To get access to <|> we need to import Control.Applicative.

(Side note for advanced beginners: the <|> operator is part of the Alternative typeclass and can be used with many things other than parsers, but for our purposes you can view it as specific to parsers.)

To begin with, we’ll parse the first three numbers which are the wind direction.

windInfo :: ReadP Int
windInfo = do
    direction <- numbers 3
    return direction

Next, we need to parse two or three digits which are the wind speed. Using the operator we just learned about, the parser will then look like

import Control.Applicative

windInfo :: ReadP (Int, Int)
windInfo = do
    direction <- numbers 3
    speed <- numbers 2 <|> numbers 3
    return (direction, speed)

Similarly, we need to parse the unit for the wind speed, which can be “_kt_” or “_mps_”. We also need to parse the final space character.

windInfo :: ReadP (Int, Int, String)
windInfo = do
    direction <- numbers 3
    speed <- numbers 2 <|> numbers 3
    unit <- string "KT" <|> string "MPS"
    string " "
    return (direction, speed, unit)

At this stage, we might as well convert the wind speed to a proper unit – let’s pick m/s for no particular reason other than it being the only real unit for speed.

windInfo :: ReadP (Int, Int)
windInfo = do
    direction <- numbers 3
    speed <- numbers 2 <|> numbers 3
    unit <- string "KT" <|> string "MPS"
    string " "
    return (direction, toMPS unit speed)

toMPS :: String -> Int -> Int
toMPS unit speed =
    case unit of
         "KT" -> div speed 2
         "MPS" -> speed

I want to emphasise that being able to access intermediary values like this is the reason we say parsers are a monad. The fact that we can check what the unit value is and use the speed value for different calculations depending on the unit value inside the parser that is parsing them is because parsers are a monad. Had parser not been a monad, we would not have been able to depend on some parsed value to do a calculation based on another one.

Maaaybeee…

Okay, I lied to you. Sorry. The wind information isn’t quite as simple as I first said. If there is a constant wind from the south of 27 knots, but with gusts of 31 knots, that is specified as 18027G31KT. In other words, the speed of the gusts are given by putting a G in there and then giving two or three numbers which are the gust speed, same unit as the wind speed.

Since this gust speed may not be in the metar report, we have to come up with a way of parsing something that may not exist in the input. To help with this, the ReadP module has the option function:

option :: a -> ReadP a -> ReadP a

The option function takes two arguments: one value and one parser. It will try to run the parser, but if the parser fails, it’ll return the value instead. First, let’s create a gust parser! (Recall that this doesn’t need “return” because Haskell do syntax automatically returns the last value in the function. We only need return when we want to create a parser that has a pre-determined value as its parsed result.)

gustParser :: ReadP Int
gustParser = do
    satisfy (== 'G')
    numbers 2 <|> numbers 3

We could just include this in our wind information parser, like so:

windInfo :: ReadP (Int, Int, Int)
windInfo = do
    direction <- numbers 3
    speed <- numbers 2 <|> numbers 3
    gusts <- gustParser 
    unit <- string "KT" <|> string "MPS"
    string " "
    return (direction, toMPS unit speed, toMPS unit gusts)

but the problem now is that this parser will fail for any wind information section that doesn’t have gust information. So we try using option.

windInfo :: ReadP (Int, Int, Int)
windInfo = do
    direction <- numbers 3
    speed <- numbers 2 <|> numbers 3
    gusts <- option 0 gustParser 
    unit <- string "KT" <|> string "MPS"
    string " "
    return (direction, toMPS unit speed, toMPS unit gusts)

This is great… with one exception. Take a look at this:

λ> readP_to_S windInfo "09014KT "
[((90,7,0),"")]

This is not the best kind of output to read, but it is saying the wind has a speed of 7 m/s with gusts of 0 m/s. We probably want to use Maybe here instead.

windInfo :: ReadP (Int, Int, Maybe Int)
windInfo = do
    direction <- numbers 3
    speed <- numbers 2 <|> numbers 3
    gusts <- option Nothing (fmap Just gustParser)
    unit <- string "KT" <|> string "MPS"
    string " "
    return (direction, toMPS unit speed, fmap (toMPS unit) gusts)

Here, again, we exploit the fact that parsers are functors, so we know that if gustParser :: ReadP Int then

fmap Just gustParser :: ReadP (Maybe Int)

We also map over the gusts variable, which contains a Maybe Int value, and convert it to m/s.

As a final touch to the wind information parser, I’d like to show that we don’t have to return silly tuples from our parser. If we define a data type like

data WindInfo = WindInfo
    { dir :: Int
    , speed :: Int
    , gusts :: Maybe Int
    }
    deriving Show

we can then let the parser return that instead:

windInfo :: ReadP WindInfo
windInfo = do
    direction <- numbers 3
    speed <- numbers 2 <|> numbers 3
    gusts <- option Nothing (fmap Just gustParser)
    unit <- string "KT" <|> string "MPS"
    string " "
    return (WindInfo
        direction
        (toMPS unit speed)
        (fmap (toMPS unit) gusts))

This will give output that is much nicer to read, and more importantly, much nicer to handle in your own code..

λ> readP_to_S windInfo "09014KT "
[(WindInfo {dir = 90, speed = 7, gusts = Nothing},"")]

Tying Together

If we combine the parsers we have written so far, we have a solid start for parsing metar reports. We combine the parsers we have written in just the same way we already have combined parsers before. You see now where the “combinator” in “parser combinators” come from? We do a lot of combinating with parsers. That’s part of why they are so nice to work with. A big parser is just as simple as a small parser. We also define a data type for the first few fields of the metar report.

data Report = Report
    { station :: String
    , time :: (Int, Int, Int)
    , wind :: WindInfo
    }
    deriving Show

metar :: ReadP Report
metar = do
    code <- airport
    time <- timestamp
    wind <- windInfo
    return (Report code time wind)

and this actually works!

λ> readP_to_S metar "BIRK 281500Z 09014G17KT CAVOK M03/M06 Q0980 R13/910195"
[
    ( Report
        { station = "BIRK"
        , time = (28,15,0)
        , wind = WindInfo
            { dir = 90
            , speed = 7
            , gusts = Just 8
            }
        }
    , "CAVOK M03/M06 Q0980 R13/910195"
    )
]

Well … it doesn’t parse the full metar report yet. There’s a lot of things in a report, like visibility, cloud layers, precipitation, atmospheric pressure, temperature and runway conditions; for example, the rest of the birk report says that the skies are clear (“_cavoc_”), the air temperature is -3 °C and the dew point is -6 °C (“M03/M06”), the atmospheric pressure is 0.98 kPa (“Q0980”) and runway 13 has a little frost on it but brakes should bite anyway (“R13/910195”).

I’m not going to implement the full parser for all of this in this tutorial. However, with what you know now, you know everything you need to know to parse a full metar report, should you want to. If you want access to more metar reports, you can get them at nws decoded metar reports. Just change the four characters in the file name for the icao code for the airport you’re interested in. Change “decoded” in the url into “stations” if you want just a file with the raw metar report, without the human decoding of it.

If you do write a full parser, feel free to make a library of it and submit it to Hackage, because as it stands, there are no metar parsing libraries for Haskell. Even if you don’t, I hope you feel confident enough to bust out ReadP the next time you have something you need to parse and nobody else has written a parser for it yet.


Oh, and by the way, the parseMaybe function I used in the initial demo? Does not exist in the library, but easy to implement as

parseMaybe :: ReadP a -> String -> Maybe a
parseMaybe parser input =
    case readP_to_S parser input of
        [] -> Nothing
        ((result, _):_) -> Just result

or any variation thereof.