Entropic Thoughts

Parser Combinators Beat Regexes

Parser Combinators Beat Regexes

Someone online was solving Advent of Code problems, and had a question about this years’ day 3. They had a working solution using regular expressions (regexes) on String values, but they wanted to use ByteString values instead for performance reasons. They were surprised, however, that there seems to be a lack of community cohesion around regex libraries in Haskell.

There’s a reason for that. We generally don’t use regexes in Haskell. We use parser combinators instead, because they are almost always better. In other languages, it would be considered overkill to write a full parser when a simple regex can do the same thing. In Haskell, writing a parser is no big deal. We just do it and move on with our lives.

The regex solution

The first part of the Advent of Code problem is well suited for a regex-based solution. Here’s what such a solution might look like in Haskell. It uses the pcre-heavy Haskell library which in turn calls out to the system-wide pcre C library for actually compiling and running the regex.

{-# LANGUAGE QuasiQuotes #-}

module Main where

import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as Char8
import           Data.Monoid           (Sum (..))
import qualified Text.Regex.PCRE.Heavy as Re
import           Text.Regex.PCRE.Heavy (re)

test_input :: ByteString
test_input = Char8.pack "xmul(2,4)%&mul[3,7]!@^do_not_mul(5,5)+mul(32,64]then(mul(11,8)mul(8,5))"

regex_matches :: ByteString -> Int
regex_matches input =
  let
    -- Get a list of regex matches with numbers extracted.
    -- The 're' quasi-quoter compiles the regex at compile
    -- time, avoiding the runtime cost of that.
    hits = Re.scan [re|mul\((\d+),(\d+)\)|] input
    -- Take one hit and compute its product, converting the
    -- stringified numbers into actual numbers, then wrapping
    -- it up as a term in a sum.
    compute (_, [a, b]) =
      Sum (read (Char8.unpack a) * read (Char8.unpack b))
  in
    -- Fold up all terms in the sum into a single number.
    getSum (foldMap compute hits)

main :: IO ()
main = do
  print (regex_matches test_input)

This does the job and returns the expected sum of 161. If we run this on just under a megabyte of input data, it takes 19 seconds on my machine, spending almost all of that time inside the pcre library.1 Though this is a little strange, because if I do the same thing in Perl, it takes 0.02 seconds. I wish I had time to investigate what goes wrong.

One thing I dislike about this is that there’s a very strong implicit contract between the regex and the compute function. The compute function assumes that there were exactly two capturing groups and that they are strings that can safely be converted to integers. This is true, but there’s nothing in the code making that guarantee. If those assumptions become violated a year from now, that would not be a compiler error but an exception that interrupts the production service at 3 am the day before an important customer meeting.

The parser solution

For the sort of problem we are facing here, the parser-based solution does look more complicated at first glance. Part of the reason is that the regex library comes with a method that gets all matches, regardless of their position in the input. Parser combinators are meant to be written to consume all of the input, so we have to manually write the parser code that iterates the input and finds the next match.

This solution uses the attoparsec library, which is made to work with ByteString values specifically, since the person who asked the question seemed to care about performance.

module Main where

import           Control.Applicative              ((<|>))
import qualified Data.Attoparsec.ByteString.Char8 as Parser
import           Data.ByteString                  (ByteString)
import qualified Data.ByteString.Char8            as Char8
import qualified Data.Either                      as Either

test_input :: ByteString
test_input = Char8.pack "xmul(2,4)%&mul[3,7]!@^do_not_mul(5,5)+mul(32,64]then(mul(11,8)mul(8,5))"

parser_matches :: ByteString -> Int
parser_matches input =
  let
    -- Parse a mul instruction. Starts by reading the opening
    -- sequence, then the first value, then the separator,
    -- then the second value, then the terminator. Finally,
    -- it returns the product of the pair of values.
    mul = do
      Parser.string (Char8.pack "mul(")
      first <- Parser.decimal
      Parser.char ','
      second <- Parser.decimal
      Parser.char ')'
      pure (first * second)

    -- Parse the next mul instruction. Start by trying to
    -- parse a mul instruction immediately. If that fails,
    -- instead (1) advance the parser one step, and
    -- (2) try again.
    next =
      mul <|> do
        Parser.anyChar
        next

    -- Scan the entire input for all mul instructions,
    -- returning the sum of their products.
    scan = do
      muls <- Parser.many1 next
      pure (sum muls)

  in
    -- Run the parser over the input,
    -- handling a parse error by throwing
    -- an exception.
    Either.fromRight (error "Failed to parse input.")
      (Parser.parseOnly scan input)

main :: IO ()
main = do
  print (parser_matches test_input)

This is written in what one might call direct, monadic style, doing just what was requested in the first part of the Advent of Code puzzle, using do notation for sequencing. We could rewrite it to be neater, but we will resist refactoring for now; we never know what twist in the requirements will be introduced in the second part of an Advent of Code puzzle. (The curious can check the refactoring out in the appendix.)

This did involve more up-front writing, but there are some benefits. Granted, we use Either.fromRight to throw away all the useful error reporting we could get from attoparsec, but if we didn’t do that we would get more useful error messages when things went wrong. We also didn’t have to manually convert strings to integers, or blindly hope that the right number of integers were parsed. The compiler checks all those assumptions for us.

But, perhaps more notably, when we run this on the same megabyte of input, it takes 0.07 seconds, rather than the 19 seconds required by the regex.

As we will see shortly, another big benefit of parsers is that they are more flexible to adapt to future requirements.

Making the parser stateful

The next part of the Advent of Code puzzle involves interpreting instructions called do() and don't() which turn on and off the contributions of mul instructions to the sum. As we parse, we now need to keep track of one bit of state. This is a nightmare for regexes to deal with, because they recognise regular languages, and regular languages are literally stateless languages.2 There’s more nuance, of course, but as a first approximation. Technically regular languages are those that can be recognised by a finite state automaton, and if there are a finite number of states (as there are in this case) then all of them can be encoded in an fsa but let’s not get pedantic here.

But with the parser-based solution, we can lift it into a state transformer, and we get a stateful parser.3 Note that in a serious application, we might have lexing and parsing as separate steps, but parser combinators give us the freedom to combine both steps for tiny parsers like this.

module Main where

import           Control.Applicative              (asum, (<|>))
import qualified Control.Monad.State.Class        as State
import qualified Control.Monad.State.Strict       as State
import qualified Data.Attoparsec.ByteString.Char8 as Parser
import           Data.Bool                        (bool)
import           Data.ByteString                  (ByteString)
import qualified Data.ByteString                  as ByteString
import qualified Data.ByteString.Char8            as Char8
import qualified Data.Either                      as Either

test_input :: ByteString
test_input = Char8.pack "xmul(2,4)&mul[3,7]!^don't()_mul(5,5)+mul(32,64](mul(11,8)undo()?mul(8,5))"

parser_matches :: ByteString -> Int
parser_matches input =
  let
    -- Upon successfully parsing a do instruction,
    -- enable contribution of mul instructions.
    enable = do
      State.lift (Parser.string (Char8.pack "do()"))
      State.put True

    -- Upon successfully parsing a don't instruction,
    -- disable contribution of mul instructions.
    disable = do
      State.lift (Parser.string (Char8.pack "don't()"))
      State.put False

    -- Parse a mul instruction just as before, except
    -- now lifted into a stateful operation.
    mul = State.lift $ do
      Parser.string (Char8.pack "mul(")
      first <- Parser.decimal
      Parser.char ','
      second <- Parser.decimal
      Parser.char ')'
      pure (first * second)

    -- If a do or don't instruction is parsed,
    -- continue searching for the next mul.
    --
    -- If a mul is encountered, check the value of
    -- the state. If muls are enabled, leave it
    -- unaltered (id), but if muls are disabled,
    -- force its value to be zero (const 0).
    --
    -- If no instruction is recognised, step
    -- forward one character and try again.
    next = asum
      [ enable *> next
      , disable *> next
      , bool (const 0) id <$> State.get <*> mul
      , State.lift Parser.anyChar *> next
      ]

    scan = sum <$> Parser.many1 next

  in
    -- Start off evaluation with mul contributions enabled.
    Either.fromRight (error "Failed to parse input.") $
      Parser.parseOnly (State.evalStateT scan True) input

main :: IO ()
main = do
  print (parser_matches test_input)

One has to be a little careful when making specifically attoparsec parsers stateful, because attoparsec is very happy to backtrack in the input on failure, but it will not undo the state changes that were caused by the initial parsing attempt. Thus, when making attoparsec parsers stateful, we have to write them such that they never backtrack past a state change.4 One could imagine an attoparsec primitive called something like cut (inspired by Prolog) which stops backtracking past a certain point. This could be used in combination with state changes to make sure the parser never backtracks past that change. Such a primitive would be fairly easy to write but it needs to be supplied by the library – it cannot be written by library users. On the positive side, avoiding long sequences of backtracking is also good for memory usage and performance, so we should do that anyway.

The alert reader might notice that we used the parser result to store the value of the sum, but stored the contribution bit in the state. We could have done it any other way. We could store both sum and bit in state, or both in the result, or the other way around with the sum in the state and the bit in the result. This was just the quickest way I could think of to get it up and running. Haskell makes it cheap and safe to refactor later.

The above runs in 0.12 seconds on about a megabyte of input. I won’t even try to write a regex-based solution for the second part, but I’m fairly confident it would be slower, less flexible, and more difficult to maintain.

Appendix A: Refactoring the direct, monadic parser

The mul instruction is really an instance of a more general “delimited pair of values” parser. We can extract this into its own function.

pair :: Parser open -> Parser sep -> Parser close -> Parser value -> Parser (value, value)
pair open sep close value =
  liftA2 (,) (open *> value) (sep *> value <* close)

This takes three parsers for the opening sequence, separator, and closing sequence, and then a parser for the actual values, and constructs a parser that returns a pair of values. We can use this to parse the mul instruction, but also any other instruction based on a pair of values.

Similarly, we were missing a function that returns all matches for a parser, regardless of where in the input they were located. The regex library had that function, but it does not come out of the box with attoparsec. We can make that.

scan_all :: Parser a -> Parser [a]
scan_all p =
  let next = p <|> Parser.anyChar *> next
  in Parser.many1 next

With these parts of the code broken out into their own functions, the final parser can be written in a more applicative style as

parser_matches :: ByteString -> Int
parser_matches input =
  let
    mul = uncurry (*) <$> pair
      (Parser.string (Char8.pack "mul("))
      (Parser.char ',')
      (Parser.char ')')
      Parser.decimal

    result = sum <$> scan_all mul
  in
    Either.fromRight (error "Failed to parse input.")
      (Parser.parseOnly result input)

This reads quite nicely: the result parser is the sum of all mul instructions, which are specified as the product of a pair of values with specific delimiters.