Parser Combinators Beat Regexes
Someone online was solving Advent of Code problems, and had a question about
this years’ day 3. They had a working solution using regular expressions
(regexes) on String
values, but they wanted to use ByteString
values instead
for performance reasons. They were surprised, however, that there seems to be a
lack of community cohesion around regex libraries in Haskell.
There’s a reason for that. We generally don’t use regexes in Haskell. We use parser combinators instead, because they are almost always better. In other languages, it would be considered overkill to write a full parser when a simple regex can do the same thing. In Haskell, writing a parser is no big deal. We just do it and move on with our lives.
The regex solution
The first part of the Advent of Code problem is well suited for a regex-based solution. Here’s what such a solution might look like in Haskell. It uses the pcre-heavy Haskell library which in turn calls out to the system-wide pcre C library for actually compiling and running the regex.
{-# LANGUAGE QuasiQuotes #-} module Main where import Data.ByteString (ByteString) import qualified Data.ByteString.Char8 as Char8 import Data.Monoid (Sum (..)) import qualified Text.Regex.PCRE.Heavy as Re import Text.Regex.PCRE.Heavy (re) test_input :: ByteString test_input = Char8.pack "xmul(2,4)%&mul[3,7]!@^do_not_mul(5,5)+mul(32,64]then(mul(11,8)mul(8,5))" regex_matches :: ByteString -> Int regex_matches input = let -- Get a list of regex matches with numbers extracted. -- The 're' quasi-quoter compiles the regex at compile -- time, avoiding the runtime cost of that. hits = Re.scan [re|mul\((\d+),(\d+)\)|] input -- Take one hit and compute its product, converting the -- stringified numbers into actual numbers, then wrapping -- it up as a term in a sum. compute (_, [a, b]) = Sum (read (Char8.unpack a) * read (Char8.unpack b)) in -- Fold up all terms in the sum into a single number. getSum (foldMap compute hits) main :: IO () main = do print (regex_matches test_input)
This does the job and returns the expected sum of 161. If we run this on just under a megabyte of input data, it takes 19 seconds on my machine, spending almost all of that time inside the pcre library.1 Though this is a little strange, because if I do the same thing in Perl, it takes 0.02 seconds. I wish I had time to investigate what goes wrong.
One thing I dislike about this is that there’s a very strong implicit contract
between the regex and the compute
function. The compute
function assumes
that there were exactly two capturing groups and that they are strings that can
safely be converted to integers. This is true, but there’s nothing in the code
making that guarantee. If those assumptions become violated a year from now,
that would not be a compiler error but an exception that interrupts the
production service at 3 am the day before an important customer meeting.
The parser solution
For the sort of problem we are facing here, the parser-based solution does look more complicated at first glance. Part of the reason is that the regex library comes with a method that gets all matches, regardless of their position in the input. Parser combinators are meant to be written to consume all of the input, so we have to manually write the parser code that iterates the input and finds the next match.
This solution uses the attoparsec library, which is made to work with
ByteString
values specifically, since the person who asked the question seemed
to care about performance.
module Main where import Control.Applicative ((<|>)) import qualified Data.Attoparsec.ByteString.Char8 as Parser import Data.ByteString (ByteString) import qualified Data.ByteString.Char8 as Char8 import qualified Data.Either as Either test_input :: ByteString test_input = Char8.pack "xmul(2,4)%&mul[3,7]!@^do_not_mul(5,5)+mul(32,64]then(mul(11,8)mul(8,5))" parser_matches :: ByteString -> Int parser_matches input = let -- Parse a mul instruction. Starts by reading the opening -- sequence, then the first value, then the separator, -- then the second value, then the terminator. Finally, -- it returns the product of the pair of values. mul = do Parser.string (Char8.pack "mul(") first <- Parser.decimal Parser.char ',' second <- Parser.decimal Parser.char ')' pure (first * second) -- Parse the next mul instruction. Start by trying to -- parse a mul instruction immediately. If that fails, -- instead (1) advance the parser one step, and -- (2) try again. next = mul <|> do Parser.anyChar next -- Scan the entire input for all mul instructions, -- returning the sum of their products. scan = do muls <- Parser.many1 next pure (sum muls) in -- Run the parser over the input, -- handling a parse error by throwing -- an exception. Either.fromRight (error "Failed to parse input.") (Parser.parseOnly scan input) main :: IO () main = do print (parser_matches test_input)
This is written in what one might call direct, monadic style, doing just what
was requested in the first part of the Advent of Code puzzle, using do
notation for sequencing. We could rewrite it to be neater, but we will resist
refactoring for now; we never know what twist in the requirements will be
introduced in the second part of an Advent of Code puzzle. (The curious can
check the refactoring out in the appendix.)
This did involve more up-front writing, but there are some benefits. Granted, we
use Either.fromRight
to throw away all the useful error reporting we could get
from attoparsec, but if we didn’t do that we would get more useful error
messages when things went wrong. We also didn’t have to manually convert strings
to integers, or blindly hope that the right number of integers were parsed. The
compiler checks all those assumptions for us.
But, perhaps more notably, when we run this on the same megabyte of input, it takes 0.07 seconds, rather than the 19 seconds required by the regex.
As we will see shortly, another big benefit of parsers is that they are more flexible to adapt to future requirements.
Making the parser stateful
The next part of the Advent of Code puzzle involves interpreting instructions
called do()
and don't()
which turn on and off the contributions of mul
instructions to the sum. As we parse, we now need to keep track of one bit of
state. This is a nightmare for regexes to deal with, because they recognise
regular languages, and regular languages are literally stateless
languages.2 There’s more nuance, of course, but as a first approximation.
Technically regular languages are those that can be recognised by a finite state
automaton, and if there are a finite number of states (as there are in this
case) then all of them can be encoded in an fsa but let’s not get pedantic
here.
But with the parser-based solution, we can lift it into a state transformer, and we get a stateful parser.3 Note that in a serious application, we might have lexing and parsing as separate steps, but parser combinators give us the freedom to combine both steps for tiny parsers like this.
module Main where import Control.Applicative (asum, (<|>)) import qualified Control.Monad.State.Class as State import qualified Control.Monad.State.Strict as State import qualified Data.Attoparsec.ByteString.Char8 as Parser import Data.Bool (bool) import Data.ByteString (ByteString) import qualified Data.ByteString as ByteString import qualified Data.ByteString.Char8 as Char8 import qualified Data.Either as Either test_input :: ByteString test_input = Char8.pack "xmul(2,4)&mul[3,7]!^don't()_mul(5,5)+mul(32,64](mul(11,8)undo()?mul(8,5))" parser_matches :: ByteString -> Int parser_matches input = let -- Upon successfully parsing a do instruction, -- enable contribution of mul instructions. enable = do State.lift (Parser.string (Char8.pack "do()")) State.put True -- Upon successfully parsing a don't instruction, -- disable contribution of mul instructions. disable = do State.lift (Parser.string (Char8.pack "don't()")) State.put False -- Parse a mul instruction just as before, except -- now lifted into a stateful operation. mul = State.lift $ do Parser.string (Char8.pack "mul(") first <- Parser.decimal Parser.char ',' second <- Parser.decimal Parser.char ')' pure (first * second) -- If a do or don't instruction is parsed, -- continue searching for the next mul. -- -- If a mul is encountered, check the value of -- the state. If muls are enabled, leave it -- unaltered (id), but if muls are disabled, -- force its value to be zero (const 0). -- -- If no instruction is recognised, step -- forward one character and try again. next = asum [ enable *> next , disable *> next , bool (const 0) id <$> State.get <*> mul , State.lift Parser.anyChar *> next ] scan = sum <$> Parser.many1 next in -- Start off evaluation with mul contributions enabled. Either.fromRight (error "Failed to parse input.") $ Parser.parseOnly (State.evalStateT scan True) input main :: IO () main = do print (parser_matches test_input)
One has to be a little careful when making specifically attoparsec parsers
stateful, because attoparsec is very happy to backtrack in the input on failure,
but it will not undo the state changes that were caused by the initial parsing
attempt. Thus, when making attoparsec parsers stateful, we have to write them
such that they never backtrack past a state change.4 One could imagine an
attoparsec primitive called something like cut
(inspired by Prolog) which
stops backtracking past a certain point. This could be used in combination with
state changes to make sure the parser never backtracks past that change. Such a
primitive would be fairly easy to write but it needs to be supplied by the
library – it cannot be written by library users. On the positive side, avoiding
long sequences of backtracking is also good for memory usage and performance, so
we should do that anyway.
The alert reader might notice that we used the parser result to store the value of the sum, but stored the contribution bit in the state. We could have done it any other way. We could store both sum and bit in state, or both in the result, or the other way around with the sum in the state and the bit in the result. This was just the quickest way I could think of to get it up and running. Haskell makes it cheap and safe to refactor later.
The above runs in 0.12 seconds on about a megabyte of input. I won’t even try to write a regex-based solution for the second part, but I’m fairly confident it would be slower, less flexible, and more difficult to maintain.
Appendix A: Refactoring the direct, monadic parser
The mul
instruction is really an instance of a more general “delimited pair of
values” parser. We can extract this into its own function.
pair :: Parser open -> Parser sep -> Parser close -> Parser value -> Parser (value, value) pair open sep close value = liftA2 (,) (open *> value) (sep *> value <* close)
This takes three parsers for the opening sequence, separator, and closing
sequence, and then a parser for the actual values, and constructs a parser that
returns a pair of values. We can use this to parse the mul
instruction, but
also any other instruction based on a pair of values.
Similarly, we were missing a function that returns all matches for a parser, regardless of where in the input they were located. The regex library had that function, but it does not come out of the box with attoparsec. We can make that.
scan_all :: Parser a -> Parser [a] scan_all p = let next = p <|> Parser.anyChar *> next in Parser.many1 next
With these parts of the code broken out into their own functions, the final parser can be written in a more applicative style as
parser_matches :: ByteString -> Int parser_matches input = let mul = uncurry (*) <$> pair (Parser.string (Char8.pack "mul(")) (Parser.char ',') (Parser.char ')') Parser.decimal result = sum <$> scan_all mul in Either.fromRight (error "Failed to parse input.") (Parser.parseOnly result input)
This reads quite nicely: the result parser is the sum of all mul
instructions,
which are specified as the product of a pair of values with specific delimiters.