Search Index in 150 Lines of Haskell

I stumbled over Bart de Goede’s article on building a full-text search engine in 150 lines of Python, and was reminded of my quest to show how useful Haskell is for solving real-world problems. Python is an eminently practical language, so nobody is surprised this can be done in Python. But Haskell? The Python code spends a lot of time updating mutable dictionaries. Surely we cannot easily port this code over to Haskell.

Let’s find out.

Data from Wikipedia

Bart uses a small-ish Wikipedia archive to have real-world data to index. That version of the archive is no longer available at the time of writing, but a newer version is. That is an 855 mb gzipped xml file, containing just under 7 million Wikipedia articles.1 This is apparently used by Yahoo search, somehow. The Wikimedia dump service describes the job as “Recombine extracted page abstracts for Yahoo.” Each document in the xml file is not the full Wikipedia article, but

  1. Title,
  2. url,
  3. the first line or so of its MediaWiki source (“abstract”), and
  4. additional links.

Borrowing the same example document as Bart de Goede, we can see the xml data – once un-gzipped – is shaped like

<feed>
...
<doc>
<title>Wikipedia: London Beer Flood</title>
<url>https://en.wikipedia.org/wiki/London_Beer_Flood</url>
<abstract>The London Beer Flood was an accident at Meux & Co's Horse Shoe Brewery, London, on 17 October 1814. It took place when one of the wooden vats of fermenting porter burst.</abstract>
...
</doc>
...
</feed>

As in the Python article, we will begin by creating a type representing a Document in Haskell. We’re also adding the fullText convenience function to join together title and abstract with a space.

module Document where

import Data.Text qualified as Text

data Document = Document
  { title    :: !Text.Text
  , url      :: !Text.Text
  , abstract :: !Text.Text
  , id       :: !Int
  }
  deriving Show

fullText :: Document -> Text.Text
fullText doc =
  Text.intercalate " " [ title doc, abstract doc ]

Of note here is that we use the Text type from the popular text package rather than the standard-library String type. This is practically always what we want. Haskell standard-library String values come with a lot of performance overhead.2 This is because they are represented as lazy linked lists. The lazy linked list is great for iterators where elements are generated and then discarded again. It's not so good for long-term storage.
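
Asking ghci confirms that representation (output abbreviated; the exact wording varies by GHC version):

ghci> :info String
type String = [Char]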

The exclamation marks in front of the types in the record declaration are strictness annotations, indicating to the compiler that we don’t want this type to be lazy in its fields. This is usually the right choice for records that hold small data that is cheap to produce and will, for sure, eventually be needed.

Deriving Show gives us a convenient way to dump entire Document values to string. It’s not meant for production use but it corresponds closely to the dataclass __repr__ used in the Python article.
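
For example, printing a made-up document in ghci:

ghci> print (Document "Hello, world!" "https://example.com" "This is a message introduced in the classic K&R book." 89)
Document {title = "Hello, world!", url = "https://example.com", abstract = "This is a message introduced in the classic K&R book.", id = 89}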

Reading document data

Since the data is going to be over a few hundred megabytes decompressed, we want to use a streaming parser – just like in the Python code. There are some alternatives for constant-memory streaming processing in Haskell, and the library I’m most familiar with is Conduit. That also has support for decompression and streaming xml parsing.

module Main where

import Conduit
import Data.Conduit.Zlib
import Data.String (fromString)
import Text.XML.Stream.Parse

import Document (Document (..))

documents :: ConduitT () Document (ResourceT IO) ()
documents =
  let
    -- Helper function that throws exceptions when a tag
    -- does not exist, otherwise returns whatever the inner
    -- parser of that tag returns.
    assertTag name inner =
      force name . tagIgnoreAttrs (fromString name) $
        inner

    -- Parse a feed tag that holds many doc tags. Create
    -- a document for each doc tag, asserting the first
    -- three tags being title, url, and abstract, ignoring
    -- any further tags.
    --
    -- Note that this does not supply a value for the id
    -- of the document! The Document constructor remains
    -- unapplied to its last argument, which will be
    -- populated with an increasing sequence of integers.
    parseFeed =
      assertTag "feed" . manyYield $
        tagNoAttr "doc" $ Document
          <$> assertTag "title" content
          <*> assertTag "url" content
          <*> assertTag "abstract" content
          <* ignoreAnyTreeContent

    -- Read the file, un-gzip it, then parse the XML into
    -- Document values as above.
    feed =
      sourceFile "enwiki-20250123-abstract.xml.gz"
      .| multiple ungzip
      .| parseBytes def
      .| parseFeed

  in
    -- Zip together the document constructor with id
    -- numbers, creating a stream of Document values.
    getZipSource $ ZipSource feed <*> ZipSource (yieldMany [1..])

This is a tiny bit longer than the Python code, mainly because the xml-conduit library doesn’t have all the convenience functions of lxml in Python.
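
As a quick sanity check, we can pull the first few documents out of the stream in ghci (output not shown here; it would be whatever the first three abstracts in the dump happen to be):

ghci> firstDocs <- runConduitRes (documents .| takeC 3 .| sinkList)
ghci> mapM_ print firstDocs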

Analysis

If the user searches for “containment”, we will also give them matches that have the word “contain”. This is called stemming and is performed by search engines to avoid having results depend on the specific grammar choices in the search query. Just as in the Python code, we will do this along with some other normalisations.

module Analysis (analyse) where

import Data.Char qualified as Char
import Data.Set qualified as Set
import Data.Text qualified as Text
import NLP.Snowball qualified as Stemmer

stopWords :: Set.Set Text.Text
stopWords = Set.fromList . map Text.toUpper $ concatMap Text.words
  [ "the be to of and a in that have I it for not on with he"
  , "as you do at this but his by from wikipedia"
  ]

analyse :: Text.Text -> [Text.Text]
analyse =
  map (Stemmer.stem Stemmer.English)
  . filter (not . flip Set.member stopWords)
  . filter (/= "")
  . map (Text.filter (Char.isAlphaNum))
  . Text.words
  . Text.toUpper

This module exposes a function analyse that takes a bit of text, converts it to upper case,3 splits it into words, removes punctuation, filters away the words that were only punctuation, removes stop words, and then stems what remains. The Python code converts to lower case instead, but my time spent with C# has ingrained in me that upper case is more suitable for case normalisation.
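
For example, a query-sized bit of text comes out as a list of normalised tokens (these particular words happen to pass through the stemmer unchanged):

ghci> analyse "the London Beer Flood of 1814"
["LONDON","BEER","FLOOD","1814"]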

I like how the Haskell code makes it very clear that we’re running a value through a pipeline of transformations. Compare to the corresponding Python code:

def analyze(text):
    tokens = text.split()
    tokens = [token.upper() for token in tokens]
    tokens = [PUNCTUATION.sub('', token) for token in tokens]
    tokens = [token for token in tokens if token not in STOPWORDS]
    tokens = STEMMER.stemWords(tokens)
    return [token for token in tokens if token]

I mean yes, that works and does sort of the same thing, but there is a lot of noise in there!4 Also, why is the Python code using list comprehensions and not generators here? It looks like this analysis step creates at least four intermediary list objects that will have to be allocated and then immediately garbage collected. A generator comprehension would yield one item at a time, as it's needed. This is the sort of situation where, as I said earlier, lazy linked lists are an okay choice.

Indexing the documents

This is where the Python code loops through all parsed documents, and passes each to an Index object. It gets a bit convoluted to explain in words what happens next, but the code is clear enough:

def index_document(self, document):
    if document.ID not in self.documents:
        self.documents[document.ID] = document

    for token in analyze(document.fulltext):
        if token not in self.index:
            self.index[token] = set()
        self.index[token].add(document.ID)

We could use that approach in Haskell too, by storing the data in mutable hash tables from the hashtables package. But there's a different way that's easier to think about, and easier to test. We'll see how it works.

First, we create a newtype wrapper to hold a term–document mapping.

module Index where

import Data.HashMap.Strict (HashMap)
import Data.IntMap.Strict (IntMap)
import Data.IntSet (IntSet)
import Data.Text (Text)

newtype TermDoc = TermDoc { unTermDoc :: HashMap Text IntSet }

Since we are talking about aggregating documents into an index, the word semigroup might pop into our heads.

Semigroup is a Haskell interface that exposes just one operation, called <>. This operation smushes together two things into one. For example,

ghci> [4, 2] <> [7, 8, 3]
[4,2,7,8,3]

ghci> Set.fromList ["hello", "world"] <> Set.fromList ["goodbye", "world"]
Set.fromList ["goodbye","hello","world"]

ghci> Sum 5 <> Sum 8
Sum { getSum = 13 }

It might seem odd at this point, but something else we ought to be able to smush together is term–document mappings! This would take the terms from both mappings and create a new mapping where each term is associated with both sets of documents from the prior mappings.

We also implement the Monoid interface, which builds on top of Semigroup and exposes just one more operation, mempty, which lets us conjure up an empty value out of nothing.
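
In ghci, mempty for lists is the empty list, and for Sum it is zero, so smushing with it changes nothing:

ghci> mempty :: [Int]
[]

ghci> mempty <> Sum 5
Sum {getSum = 5}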

module Index where

import qualified Data.HashMap.Strict as HashMap
import qualified Data.IntMap.Strict as IntMap
import qualified Data.IntSet as IntSet

instance Semigroup TermDoc where
  TermDoc mA <> TermDoc mB =
    TermDoc (HashMap.unionWith IntSet.union mA mB)

instance Monoid TermDoc where
  mempty = TermDoc mempty

It might not be clear how these are used, so let’s see some examples.

ghci> print td1
TermDoc
  { unTermDoc = fromList
      [ ( "world", fromList [1,2] )
      , ( "hello", fromList [1] )
      ]
  }

ghci> print td2
TermDoc
  { unTermDoc = fromList
      [ ( "goodbye", fromList [3] )
      , ( "world", fromList [3] )
      ]
  }

ghci> td3 = td1 <> td2
ghci> print td3
TermDoc
  { unTermDoc = fromList
      [ ( "goodbye", fromList [3])
      , ( "world", fromList [1,2,3] )
      , ( "hello", fromList [1] )
      ]
  }

Given an existing term–document mapping, we also want a function that can create a new term–document mapping with a term–document pair added to it.

addTerm :: Int -> Text -> TermDoc -> TermDoc
addTerm doc term td =
  TermDoc
  . HashMap.insertWith IntSet.union term (IntSet.singleton doc)
  $ unTermDoc td

It might seem expensive to create an entire new term–document mapping every time we add a term–document pair to it, but we’ll start out with this naïve approach and adjust later if we have to.
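
For example, adding the same term for two different documents accumulates both ids. Unwrapping with unTermDoc prints the underlying HashMap (assuming OverloadedStrings is on, as in the other ghci examples):

ghci> unTermDoc (addTerm 2 "WORLD" (addTerm 1 "WORLD" mempty))
fromList [("WORLD",fromList [1,2])]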

As in the Python code, the term–document mapping is the main part of the search index, but it only lets us look up document ids. We want to show the full documents to the user, so we want a document map that links each document id to the corresponding full document value.

import Document (Document)

data Index = Index
  { termdoc :: !TermDoc
  , docs    :: !(IntMap Document)
  }

The exclamation marks here also serve the purpose of forcing the entire Index to exist once the constructor is evaluated. It doesn't actually have anything to do with performance in this case; it just makes sure the index is fully computed before the first query arrives. Without these strictness annotations, some of the index computations would be performed during the first query.

We’ll implement Semigroup and Monoid for this too.

instance Semigroup Index where
  Index tdA dsA <> Index tdB dsB =
    Index (tdA <> tdB) (dsA <> dsB)

instance Monoid Index where
  mempty = Index mempty mempty

These instances mean that if we have two indices for two possibly-overlapping sets of documents, we can smush them together with <> and get a single index covering the entire set of documents. Why on Earth is this useful? Well, for our last trick, we’ll convert a single document to an index.

import Document qualified as Document

fromDocument :: Document -> Index
fromDocument doc = Index
  { docs =
      IntMap.singleton (Document.id doc) doc
  , termdoc =
      foldl' (flip (addTerm (Document.id doc)))
        mempty
        (Analysis.analyse (Document.fullText doc))
  }

This looks complicated, but it performs a very simple function: create an index covering only a single document. This is clearest with an example:

ghci> Index.fromDocument $
        Document "Hello, world!" "https://example.com"
          "This is a message introduced in the classic K&R book." 89

Index
  { termdoc = TermDoc
      { unTermDoc = fromList
          [ ("MESSAGE", fromList [89])
          , ("CLASSIC", fromList [89])
          , ("WORLD", fromList [89])
          , ("KR", fromList [89])
          , ("INTRODUCED", fromList [89])
          , ("IS", fromList [89])
          , ("HELLO", fromList [89])
          , ("BOOK", fromList [89])
          ]
      },
    docs = fromList
      [
        ( 89
        , Document
            { title = "Hello, world!"
            , url = "https://example.com"
            , abstract = "This is a message introduced in the classic K&R book."
            , id = 89
            }
        )
      ]
  }

We can create these single-document indices for every document in our corpus. Once we have those, we can smush them all together with <> and out of it we get a single index covering all documents! This is indeed what we will do.5 It would be neat if we could subdivide the input data into equal-sized chunks and then process them in parallel, map–reduce style. Our Index monoid supports this, but the input data does not. However, the Conduit foldMapC is strict enough that it never actually produces a long list of indices; it just creates one at a time and smushes it up with the large aggregate index it is building.

module Main where

buildIndex :: IO Index
buildIndex = runConduitRes $
  documents .| foldMapC Index.fromDocument

This is a common pattern in Haskell: if we want to aggregate something out of smaller pieces, we can often find a way to convert those pieces into a small version of the aggregate object, and then implement Semigroup to smush small things together. The big thing falls out of that naturally. The benefit is that the smushing operation is very easy to test and reason about, because if it is correct for small cases, it is also correct for very large cases thanks to the smushing laws.
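
As a sketch of what such a test could look like, here is a hypothetical QuickCheck module for the TermDoc smushing operation. The helper and property names are made up for illustration; it builds small mappings from lists of id–term pairs and checks that the grouping of smushes does not matter:

import Data.Foldable (foldl')
import Data.Text qualified as Text
import Test.QuickCheck (quickCheck)

import Index (TermDoc, addTerm, unTermDoc)

-- Build a small TermDoc from a list of (document id, term) pairs.
fromPairs :: [(Int, String)] -> TermDoc
fromPairs = foldl' (\td (doc, term) -> addTerm doc (Text.pack term) td) mempty

-- Associativity of smushing: the grouping must not matter. Together with
-- the identity laws for mempty, this is what lets us fold millions of tiny
-- indices into one big index without worrying about how they are grouped.
prop_smushAssociative
  :: [(Int, String)] -> [(Int, String)] -> [(Int, String)] -> Bool
prop_smushAssociative as bs cs =
  let (a, b, c) = (fromPairs as, fromPairs bs, fromPairs cs)
  in unTermDoc ((a <> b) <> c) == unTermDoc (a <> (b <> c))

main :: IO ()
main = quickCheck prop_smushAssociative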

One might wonder if it isn’t awfully inefficient to build the search index the way we do here. We are building a few million one-document indices only to smush them together. For each term we are adding to the term–document map, we are constructing a brand new term–document map.

It turns out it’s not so bad. Building the seven-million document index on my laptop takes just over 8 minutes, and a non-trivial chunk of that time is spent decompressing and parsing xml. However, it does spend nearly half its time garbage collecting, so it could be more efficient, but this simple code is obviously correct and the program still fits comfortably on my laptop. It doesn’t need to be more efficient at this stage.

Searching

Searching is a matter of doing the term lookups in the term–document index, and then converting the document ids to actual documents. We can create a function for this in our Index module.

-- Get a set containing all document ids where
-- the document contains the term.
lookupTerm :: Index -> Text -> IntSet
lookupTerm idx term =
  fromMaybe (IntSet.empty)
  . HashMap.lookup term
  . unTermDoc
  $ termdoc idx

search :: Text -> Index -> [Document]
search query idx =
  -- Convert query to id sets and combine them.
  -- Then look all ids up in the document map.
  catMaybes
  . map (flip IntMap.lookup (docs idx))
  . IntSet.toList
  $ case map (lookupTerm idx) (Analysis.analyse query) of
      []     -> IntSet.empty
      (x:xs) -> foldl' IntSet.intersection x xs

We would call this from our main function along with timing information like-a-so:

module Main where

main :: IO ()
main = do
  index <- buildIndex

  startSearch <- getCurrentTime
  -- Run search query and print all results.
  traverse_ print $
    Index.search "London Beer Flood" index
  searchTime <- flip diffUTCTime startSearch <$> getCurrentTime
  putStrLn $ "==== Query took " <> show searchTime <> " seconds"

This prints out what we would expect.

Document {title = "Wikipedia: Horse Shoe Brewery", url = "https://en.wikipedia.org/wiki/Horse_Shoe_Brewery", abstract = "The Horse Shoe Brewery was an English brewery in the City of Westminster that was established in 1764 and became a major producer of porter, from 1809 as Henry Meux &amp; Co. It was the site of the London Beer Flood in 1814, which killed eight people after a porter vat burst.", id = 1481330}
Document {title = "Wikipedia: London Beer Flood", url = "https://en.wikipedia.org/wiki/London_Beer_Flood", abstract = "The London Beer Flood was an accident at Meux &amp; Co's Horse Shoe Brewery, London, on 17 October 1814. It took place when one of the  wooden vats of fermenting porter burst.", id = 1803069}
==== Query took 0.000415501s seconds

The Python code was fast, coming in at 30 milliseconds per query once the caches were warm. Haskell gets the job done roughly 70× faster, at 0.4 milliseconds per query.

At this point, we’re about 120 lines of code in and are missing two features from the full code in the Python article:

  • A toggle to choose set union instead of set intersection when combining the results from multiple terms in the search query, and
  • A tf–idf based ranking of the search results.

At first I was going to stop here, but these are small additions, so let’s roll on.

Ranking results with tf–idf

An old-school baseline for relevance determination is scoring results by term frequency multiplied by inverse document frequency.6 If you want more details on how this works, it should be very easy to look up. There’s been much written about it. The first step in the Python article is amending documents with their term frequency data, and we’ll do the same.

data Document = Document
  { id       :: !Int
  , title    :: !Text.Text
  , url      :: !Text.Text
  , abstract :: !Text.Text
  , termFreq :: HashMap Text Int
  }
  deriving Show

We have deliberately left the term frequency field lazy in the Document record. Maybe there are some documents that are practically never included in the result sets. By leaving the term frequency lazy, it will never need to be computed for those documents. Usage patterns will dictate what’s appropriate in any specific case. If the term frequency map is very large, it might even be preferable to compute it fresh each time rather than store it as a field.

We want to automatically compute the term frequency whenever we construct a document, so we'll create a constructor function for documents. This goes through an intermediary HashMap where the values are Sum 1 values, meaning the smushing operator acts as a counter.

mkDocument :: Text -> Text -> Text -> Int -> Document
mkDocument t u a i =
  let
    tf =
      fmap getSum
      . HashMap.fromListWith (<>)
      . map (, Sum 1)
      $ Analysis.analyse (t <> " " <> a)
  in
    Document i t u a tf
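
The counting trick on its own, in ghci (a HashMap prints its keys in no particular order):

ghci> counts = HashMap.fromListWith (<>) (map (\t -> (t, Sum 1)) ["A", "B", "A"])
ghci> fmap getSum counts
fromList [("A",2),("B",1)]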

When searching, we will do lookups in this map, as well as compute the inverse document frequency for each term in the search phrase. The idf we compute here is nearly the same as in the Python article, except we are adding one to the denominator to avoid dividing by zero.7 I cannot figure out how the Python code avoids this, so it might be a latent bug that the author failed to discover because they didn’t test this version of the code with 'OR' searches.

data SearchType = And | Or

search :: SearchType -> Text -> Index -> [Document]
search searchType query idx =
  let
    combining = case searchType of
      And -> IntSet.intersection
      Or  -> IntSet.union

    terms = Analysis.analyse query

    results =
      catMaybes
      . map (flip IntMap.lookup (docs idx))
      . IntSet.toList
      $ case map (lookupTerm idx) terms of
          []     -> IntSet.empty
          (x:xs) -> foldl' combining x xs

    idf term =
      let
        n = fromIntegral (IntMap.size (docs idx))
        df_t = fromIntegral (IntSet.size (lookupTerm idx term))
      in
        log (n / (1 + df_t))

    tf doc term =
      fromIntegral . fromMaybe 0
      . HashMap.lookup term
      $ Document.termFreq doc

    score doc =
      foldMap (\t -> Sum (idf t * tf doc t)) terms

  in
    sortOn (Down . score) results

Again we use the Sum type to smush together the scores for each search result. At the end, we wrap the score in the Down type to get a highest-score-first ordering of the search results. This completes a search for “London Beer Flood” in 49 milliseconds – roughly the time it took Python without ranking.
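
The Down trick in isolation:

ghci> sortOn id [3, 1, 2]
[1,2,3]

ghci> sortOn Down [3, 1, 2]
[3,2,1]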

And that’s it! We have replicated 150-ish lines of Python in 150-ish lines of Haskell. It wasn’t that hard. It might look foreign, but to an experienced programmer, it is easier to test, maintain, and ensure the correctness of.

Appendix A: Optimising parsing

The xml parser used in the code above is definitely correct, but unnecessarily powerful for what we need of it. The data supplied in the Wikimedia dump looks like xml, but is in fact much more regular than that. We can produce a practically equivalent document iterator using more primitive functions operating directly on the byte stream.

documentsPrimitive :: ConduitT () Document (ResourceT IO) ()
documentsPrimitive =
  let
    -- We are cheating very hard here and relying on the strong guarantees
    -- we can make on the input format down to byte sequences. This checks if
    -- the line starts with e.g. "<title>" and then strips characters that
    -- hopefully correspond to start and end tags.
    getTagContent name str =
      if not (ByteString.isPrefixOf name str) then
        Nothing
      else
        Just . Text.decodeUtf8
        . ByteString.drop (ByteString.length name)
        . ByteString.take (ByteString.length str - ByteString.length name - 1)
        $ str

    -- This builds a partial document by making every field optional, and then
    -- yielding a full Document once all fields are filled.
    buildDoc str (!ai, !at, !au, !aa) =
      case (ai, at, au, aa) of
        (i, Just t, Just u, Just a) ->
          ( (i+1, Nothing, Nothing, Nothing)
          , Just (mkDocument t u a i)
          )
        (i, mt, mu, ma) ->
          ( ( i
            , mt <|> getTagContent "<title>" str
            , mu <|> getTagContent "<url>" str
            , ma <|> getTagContent "<abstract>" str
            )
          , Nothing
          )

    yieldMaybeDocs =
      Conduit.mapAccum buildDoc (0, Nothing, Nothing, Nothing) $> ()

  in do
    sourceFile "enwiki-20250123-abstract.xml.gz"
    .| multiple ungzip
    .| Conduit.lines
    .| yieldMaybeDocs
    .| Conduit.catMaybes

This is useful for testing purposes when one wants fast feedback loops.
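
One way to use it for that is to check that both parsers agree on the first handful of documents. Here is a sketch (the compareParsers helper is made up for illustration) using takeC and sinkList from the Conduit module; whether the fields compare exactly equal may depend on details such as entity decoding, but it gives a fast feedback loop:

compareParsers :: IO ()
compareParsers = do
  -- Pull the first ten documents out of each parser and compare titles.
  fromXml <- runConduitRes $ documents .| takeC 10 .| sinkList
  fromRaw <- runConduitRes $ documentsPrimitive .| takeC 10 .| sinkList
  print (map title fromXml == map title fromRaw)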