Search Index in 150 Lines of Haskell
I stumbled over Bart de Goede’s article on building a full-text search engine in 150 lines of Python, and was reminded of my quest to show how useful Haskell is for solving real-world problems. Python is an eminently practical language, so nobody is surprised this can be done in Python. But Haskell? The Python code spends a lot of time updating mutable dictionaries. Surely we cannot easily port this code over to Haskell.
Let’s find out.
Data from Wikipedia
Bart uses a small-ish Wikipedia archive to have real-world data to index. That version of the archive is no longer available at the time of writing, but a newer version is. That is an 855 mb gzipped xml file, containing just under 7 million Wikipedia articles. This is apparently used by Yahoo search, somehow. The Wikimedia dump service describes the job as “Recombine extracted page abstracts for Yahoo.” Each document in the xml file is not the full Wikipedia article, but
- the title,
- the url,
- the first line or so of its MediaWiki source (“abstract”), and
- additional links.
Borrowing the same example document as Bart de Goede, we can see the xml data – once un-gzipped – is shaped like
<feed>
  ...
  <doc>
    <title>Wikipedia: London Beer Flood</title>
    <url>https://en.wikipedia.org/wiki/London_Beer_Flood</url>
    <abstract>The London Beer Flood was an accident at Meux & Co's Horse Shoe Brewery, London, on 17 October 1814. It took place when one of the wooden vats of fermenting porter burst.</abstract>
    ...
  </doc>
  ...
</feed>
As in the Python article, we will begin by creating a type representing a
Document
in Haskell. We’re also adding the fullText
convenience function to
join together title and abstract with a space.
module Document where

import Data.Text qualified as Text

data Document = Document
  { title :: !Text.Text
  , url :: !Text.Text
  , abstract :: !Text.Text
  , id :: !Int
  }
  deriving Show

fullText :: Document -> Text.Text
fullText doc = Text.intercalate " " [ title doc, abstract doc ]
Of note here is that we use the Text
type from the popular text
package
rather than the standard-library String
type. This is practically always what
we want. Haskell standard-library String
values are very performance
intensive. This is because they are represented as lazy linked lists. The
lazy linked list is great for iterators where elements are generated and then
discarded again. It’s not so good for long-term storage.
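For reference, the standard library really does define String as a lazy list of characters; the whole definition is a type synonym:

type String = [Char]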
The exclamation marks in front of the types in the record declaration are strictness annotations, indicating to the compiler that we don’t want this type to be lazy in its fields. This is usually the right choice for records that hold small data that is cheap to produce and will, for sure, eventually be needed.
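To see what the bangs buy us, here is an illustrative ghci session with two throwaway types (LazyBox and StrictBox are made up for this demonstration, and the exception output is abbreviated): forcing a strict constructor forces its field, while a lazy constructor happily holds an unevaluated thunk.

ghci> data LazyBox = LazyBox Int
ghci> data StrictBox = StrictBox !Int
ghci> LazyBox undefined `seq` ()    -- the field is never forced
()
ghci> StrictBox undefined `seq` ()  -- constructing the box forces the field
*** Exception: Prelude.undefined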
Deriving Show
gives us a convenient way to dump entire Document
values to
string. It’s not meant for production use but it corresponds closely to the
dataclass __repr__
used in the Python article.
Reading document data
Since the data is going to be over a few hundred megabytes decompressed, we want to use a streaming parser – just like in the Python code. There are some alternatives for constant-memory streaming processing in Haskell, and the library I’m most familiar with is Conduit. That also has support for decompression and streaming xml parsing.
module Main where

import Conduit
import Data.Conduit.Zlib
import Data.String (fromString)
import Text.XML.Stream.Parse

import Document (Document (..))

documents :: ConduitT () Document (ResourceT IO) ()
documents =
  let
    -- Helper function that throws exceptions when a tag
    -- does not exist, otherwise returns whatever the inner
    -- parser of that tag returns.
    assertTag name inner =
      force name . tagIgnoreAttrs (fromString name) $ inner

    -- Parse a feed tag that holds many doc tags. Create
    -- a document for each doc tag, asserting the first
    -- three tags being title, url, and abstract, ignoring
    -- any further tags.
    --
    -- Note that this does not supply a value for the id
    -- of the document! The Document constructor remains
    -- unapplied to its last argument, which will be
    -- populated with an increasing sequence of integers.
    parseFeed =
      assertTag "feed" . manyYield $
        tagNoAttr "doc" $
          Document
            <$> assertTag "title" content
            <*> assertTag "url" content
            <*> assertTag "abstract" content
            <* ignoreAnyTreeContent

    -- Read the file, un-gzip it, then parse the XML into
    -- Document values as above.
    feed =
      sourceFile "enwiki-20250123-abstract.xml.gz"
        .| multiple ungzip
        .| parseBytes def
        .| parseFeed
  in
    -- Zip together the document constructor with id
    -- numbers, creating a stream of Document values.
    getZipSource $
      ZipSource feed <*> ZipSource (yieldMany [1..])
This is a tiny bit longer than the Python code, mainly because the xml-conduit library doesn’t have all the convenience functions of lxml in Python.
Analysis
If the user searches for “containment”, we will also give them matches that have the word “contain”. This is called stemming and is performed by search engines to avoid having results depend on the specific grammar choices in the search query. Just as in the Python code, we will do this along with some other normalisations.
module Analysis (analyse) where

import Data.Char qualified as Char
import Data.Set qualified as Set
import Data.Text qualified as Text
import NLP.Snowball qualified as Stemmer

stopWords :: Set.Set Text.Text
stopWords =
  Set.fromList . map Text.toUpper $
    concatMap
      Text.words
      [ "the be to of and a in that have I it for not on with he"
      , "as you do at this but his by from wikipedia"
      ]

analyse :: Text.Text -> [Text.Text]
analyse =
  map (Stemmer.stem Stemmer.English)
    . filter (not . flip Set.member stopWords)
    . filter (/= "")
    . map (Text.filter Char.isAlphaNum)
    . Text.words
    . Text.toUpper
This module exposes a function analyse
that takes a bit of text, converts it
to upper case (the Python code converts it to lower case, but my time spent
with C# has ingrained in me that upper case is more suitable for case
normalisation), splits it into words, removes punctuation, filters away the
words that were only punctuation, removes stop words, and then stems what
remains.
I like how the Haskell code makes it very clear that we’re running a value through a pipeline of transformations. Compare to the corresponding Python code:
def analyze(text):
    tokens = text.split()
    tokens = [token.upper() for token in tokens]
    tokens = [PUNCTUATION.sub('', token) for token in tokens]
    tokens = [token for token in tokens if token not in STOPWORDS]
    tokens = STEMMER.stemWords(tokens)
    return [token for token in tokens if token]
I mean yes, that works and does sort of the same thing, but there is a lot of noise in there! Also, why is the Python code using list comprehensions and not generators here? It looks like this analysis step creates at least four intermediary list objects that will have to be allocated and then immediately garbage collected. A generator comprehension would yield an item at a time as it’s needed. This is the sort of thing I meant when I said lazy linked lists are an okay choice.
Indexing the documents
This is where the Python code loops through all parsed documents, and passes
each to an Index
object. It gets a bit convoluted to explain in words what
happens next, but the code is clear enough:
def index_document(self, document):
    if document.ID not in self.documents:
        self.documents[document.ID] = document

    for token in analyze(document.fulltext):
        if token not in self.index:
            self.index[token] = set()
        self.index[token].add(document.ID)
We could use that approach also in Haskell, by storing the data in mutable hash
tables from the hashtables
package. But there’s a different way that’s easier
to think about, and easier to test. We’ll see how it works.
First, we create a newtype wrapper to hold a term–document mapping.
module Index where

import Data.HashMap.Strict (HashMap)
import Data.IntMap.Strict (IntMap)
import Data.IntSet (IntSet)
import Data.Text (Text)

newtype TermDoc = TermDoc
  { unTermDoc :: HashMap Text IntSet
  }
Since we are talking about aggregating documents into an index, the word semigroup might pop into our heads.
Semigroup
is a Haskell interface that exposes just one operation, called <>
.
This operation smushes together two things into one. For example,
ghci> [4, 2] <> [7, 8, 3]
[4,2,7,8,3]
ghci> Set.fromList ["hello", "world"] <> Set.fromList ["goodbye", "world"]
Set.fromList ["goodbye","hello","world"]
ghci> Sum 5 <> Sum 8
Sum { getSum = 13 }
It might seem odd at this point, but something else we ought to be able to smush together are term–document mappings! This would take the terms from both mappings, and create a new mapping where each term is associated with both sets of documents from the prior mappings.
We also implement the Monoid
interface which builds on top of Semigroup
and
exposes just one operation mempty
which lets us conjure up an empty value out
of nothing.
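For a feel of what mempty gives us for familiar types, a couple of ghci probes (output formatted like the listing above):

ghci> mempty :: [Int]
[]
ghci> mempty :: Sum Int
Sum { getSum = 0 }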
module Index where

import qualified Data.HashMap.Strict as HashMap
import qualified Data.IntMap.Strict as IntMap
import qualified Data.IntSet as IntSet

instance Semigroup TermDoc where
  TermDoc mA <> TermDoc mB =
    TermDoc (HashMap.unionWith IntSet.union mA mB)

instance Monoid TermDoc where
  mempty = TermDoc mempty
It might not be clear how these are used, so let’s see some examples.
ghci> print td1
TermDoc
  { unTermDoc = fromList
      [ ( "world", fromList [1,2] )
      , ( "hello", fromList [1] )
      ]
  }
ghci> print td2
TermDoc
  { unTermDoc = fromList
      [ ( "goodbye", fromList [3] )
      , ( "world", fromList [3] )
      ]
  }
ghci> td3 = td1 <> td2
ghci> print td3
TermDoc
  { unTermDoc = fromList
      [ ( "goodbye", fromList [3] )
      , ( "world", fromList [1,2,3] )
      , ( "hello", fromList [1] )
      ]
  }
Given an existing term–document mapping, we also want a function that can create a new term–document mapping with a term–document pair added to it.
addTerm :: Int -> Text -> TermDoc -> TermDoc
addTerm doc term td =
  TermDoc
    . HashMap.insertWith IntSet.union term (IntSet.singleton doc)
    $ unTermDoc td
It might seem expensive to create an entire new term–document mapping every time we add a term–document pair to it, but we’ll start out with this naïve approach and adjust later if we have to.
As in the Python code, the term–document mapping is the main part of the search index, but it only lets us look up document ids. We want to show the full documents to the user, so we want a document map that links each document id to the corresponding full document value.
import Document (Document)

data Index = Index
  { termdoc :: !TermDoc
  , docs :: !(IntMap Document)
  }
The exclamation marks here also serve the purpose of forcing the entire Index
to exist once the constructor is evaluated. It doesn’t actually have anything to
do with performance in this case, it just makes sure the index is fully computed
before the first query arrives. Without these strictness annotations, some of
the index computations are made during the first query.
We’ll implement Semigroup
and Monoid
for this too.
instance Semigroup Index where
  Index tdA dsA <> Index tdB dsB =
    Index (tdA <> tdB) (dsA <> dsB)

instance Monoid Index where
  mempty = Index mempty mempty
These instances mean that if we have two indices for two possibly-overlapping
sets of documents, we can smush them together with <>
and get a single index
covering the entire set of documents. Why on Earth is this useful? Well, for our
last trick, we’ll convert a single document to an index.
import Analysis qualified as Analysis
import Document qualified as Document

fromDocument :: Document -> Index
fromDocument doc = Index
  { docs = IntMap.singleton (Document.id doc) doc
  , termdoc =
      foldl'
        (flip (addTerm (Document.id doc)))
        mempty
        (Analysis.analyse (Document.fullText doc))
  }
This looks complicated, but it performs a very simple function: create an index covering only a single document. It’s clearest with an example:
ghci> Index.fromDocument $ Document "Hello, world!" "https://example.com" "This is a message introduced in the classic K&R book." 89
Index
  { termdoc = TermDoc
      { unTermDoc = fromList
          [ ("MESSAGE", fromList [89])
          , ("CLASSIC", fromList [89])
          , ("WORLD", fromList [89])
          , ("KR", fromList [89])
          , ("INTRODUCED", fromList [89])
          , ("IS", fromList [89])
          , ("HELLO", fromList [89])
          , ("BOOK", fromList [89])
          ]
      }
  , docs = fromList
      [ ( 89
        , Document
            { title = "Hello, world!"
            , url = "https://example.com"
            , abstract = "This is a message introduced in the classic K&R book."
            , id = 89
            }
        )
      ]
  }
We can create these single-document indices for every document in our corpus.
Once we have those, we can smush them all together with <>
and out of it we
get a single index covering all documents! This is indeed what we will do. It would be neat if we could subdivide the input data into equal-sized chunks
and then process them in parallel, map–reduce style. Our Index
monoid supports
this, but the input data does not. However, the Conduit foldMapC
is strict
enough that it never actually produces a long list of indices; it just creates
one at a time and smushes it up with the large aggregate index it is building.
module Main where

buildIndex :: IO Index
buildIndex =
  runConduitRes $
    documents .| foldMapC Index.fromDocument
This is a common pattern in Haskell: if we want to aggregate something out of
smaller pieces, we can often find a way to convert those pieces into a small
version of the aggregate object, and then implement Semigroup
to smush small
things together. The big thing falls out of that naturally. The benefit is that
it is very easy to test and reason about the smushing operation, because if it
is correct for small cases, it is also correct for very large cases thanks to the
smushing laws.
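To make that testability claim concrete, here is a minimal QuickCheck sketch of the smushing laws for TermDoc. Everything in it (the module name, the generator, and the example terms) is made up for illustration and assumes QuickCheck is available as a test dependency; it is not part of the 150 lines.

module IndexSpec where

import Data.Text qualified as Text
import Test.QuickCheck

import Index (TermDoc (..), addTerm)

-- Random (document id, term) pairs to build small TermDocs from.
genPairs :: Gen [(Int, Text.Text)]
genPairs =
  listOf $
    (,)
      <$> choose (1, 100)
      <*> elements (map Text.pack ["LONDON", "BEER", "FLOOD"])

mkTermDoc :: [(Int, Text.Text)] -> TermDoc
mkTermDoc = foldr (uncurry addTerm) mempty

-- <> is associative ...
prop_associative :: Property
prop_associative =
  forAll genPairs $ \as ->
    forAll genPairs $ \bs ->
      forAll genPairs $ \cs ->
        let (a, b, c) = (mkTermDoc as, mkTermDoc bs, mkTermDoc cs)
        in unTermDoc ((a <> b) <> c) == unTermDoc (a <> (b <> c))

-- ... and mempty is its identity. We compare the underlying maps
-- because TermDoc itself has no Eq instance.
prop_identity :: Property
prop_identity =
  forAll genPairs $ \as ->
    let a = mkTermDoc as
    in unTermDoc (mempty <> a) == unTermDoc a
         && unTermDoc (a <> mempty) == unTermDoc a

Running quickCheck prop_associative in ghci then exercises the instance on a hundred randomly generated term–document mappings.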
One might wonder if it isn’t awfully inefficient to build the search index the way we do here. We are building a few million one-document indices only to smush them together. For each term we are adding to the term–document map, we are constructing a brand new term–document map.
It turns out it’s not so bad. Building the seven-million document index on my laptop takes just over 8 minutes, and a non-trivial chunk of that time is spent decompressing and parsing xml. However, it does spend nearly half its time garbage collecting, so it could be more efficient, but this simple code is obviously correct and the program still fits comfortably on my laptop. It doesn’t need to be more efficient at this stage.
Searching
Searching is a matter of doing the term lookups in the term–document index, and
then converting the document ids to actual documents. We can create a function
for this in our Index
module.
-- Get a set containing all document ids where
-- the document contains the term.
lookupTerm :: Index -> Text -> IntSet
lookupTerm idx term =
  fromMaybe IntSet.empty
    . HashMap.lookup term
    . unTermDoc
    $ termdoc idx

search :: Text -> Index -> [Document]
search query idx =
  -- Convert query to id sets and combine them.
  -- Then look all ids up in the document map.
  catMaybes
    . map (flip IntMap.lookup (docs idx))
    . IntSet.toList
    $ case map (lookupTerm idx) (Analysis.analyse query) of
        [] -> IntSet.empty
        (x:xs) -> foldl' IntSet.intersection x xs
We would call this from our main
function along with timing information
like-a-so:
module Main where

main :: IO ()
main = do
  index <- buildIndex

  startSearch <- getCurrentTime
  -- Run search query and print all results.
  traverse_ print $
    Index.search Index.And "London Beer Flood" index
  searchTime <- flip diffUTCTime startSearch <$> getCurrentTime

  putStrLn $ "==== Query took " <> show searchTime <> " seconds"
This prints out what we would expect.
Document {title = "Wikipedia: Horse Shoe Brewery", url = "https://en.wikipedia.org/wiki/Horse_Shoe_Brewery", abstract = "The Horse Shoe Brewery was an English brewery in the City of Westminster that was established in 1764 and became a major producer of porter, from 1809 as Henry Meux & Co. It was the site of the London Beer Flood in 1814, which killed eight people after a porter vat burst.", id = 1481330} Document {title = "Wikipedia: London Beer Flood", url = "https://en.wikipedia.org/wiki/London_Beer_Flood", abstract = "The London Beer Flood was an accident at Meux & Co's Horse Shoe Brewery, London, on 17 October 1814. It took place when one of the wooden vats of fermenting porter burst.", id = 1803069} ==== Query took 0.000415501s seconds
The Python code was fast, coming in at 30 milliseconds per query once the caches were warm. Haskell gets the job done roughly 70× faster, at 0.4 milliseconds per query.
At this point, we’re about 120 lines of code in and are missing two features from the full code in the Python article:
- A toggle to choose set union instead of set intersection when combining the results from multiple terms in the search query, and
- A tf–idf based ranking of the search results.
At first I was going to stop here, but these are small additions, so let’s roll on.
Choosing between all or some words in search
First off, the Python article suggests sending in a string to the search
function indicating whether we want to get back all documents which contain any
of the search terms, or only documents which contain all search terms. We can
add support for this to our function too, except it’s Haskell so we don’t use a
string for it.
data SearchType = And | Or

search :: SearchType -> Text -> Index -> [Document]
search searchType query idx =
  let
    combining = case searchType of
      And -> IntSet.intersection
      Or -> IntSet.union
  in
    --------8<------
    $ case map (lookupTerm idx) (Analysis.analyse query) of
        [] -> IntSet.empty
        (x:xs) -> foldl' combining x xs
This really only changes how we combine the result sets and it is rather uninteresting.
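As a hedged aside, the relationship between the two modes is itself a testable property. The helper below is illustrative only (it is not part of the article's code): an And search should never return a document that the corresponding Or search misses.

-- Every document id returned by an And search should also show up in
-- the Or search for the same query. We compare ids because Document
-- has no Eq instance.
prop_andSubsetOfOr :: Index -> Text -> Bool
prop_andSubsetOfOr idx query =
  let resultIds searchType = map Document.id (search searchType query idx)
  in all (`elem` resultIds Or) (resultIds And)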
Ranking results with tf–idf
An old-school baseline for relevance determination is scoring results by term frequency multiplied by inverse document frequency. If you want more details on how this works, it should be very easy to look up. There’s been much written about it. The first step in the Python article is amending documents with their term frequency data, and we’ll do the same.
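For reference, the score we will end up computing for a document d against a query q is the one implemented further down, adjusted denominator included:

\[ \mathrm{score}(d, q) = \sum_{t \in q} \mathrm{tf}(d, t) \cdot \log\frac{N}{1 + \mathrm{df}(t)} \]

where tf(d, t) is the number of times term t occurs in d, N is the number of documents in the index, and df(t) is the number of documents containing t.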
data Document = Document
  { id :: !Int
  , title :: !Text.Text
  , url :: !Text.Text
  , abstract :: !Text.Text
  , termFreq :: HashMap Text Int
  }
  deriving Show
We have deliberately left the term frequency field lazy in the Document
record. Maybe there are some documents that are practically never included in
the result sets. By leaving the term frequency lazy, it will never need to be
computed for those documents. Usage patterns will dictate what’s appropriate in
any specific case. If the term frequency map is very large, it might even be
preferable to compute it fresh each time rather than store it as a field.
We want to automatically compute the term frequency whenever we construct a
document, so we’ll create a constructor function for documents. This goes
through an intermediary HashMap
where the values are Sum 1
values,
meaning the smushing operator acts as a counter.
mkDocument :: Text -> Text -> Text -> Int -> Document
mkDocument t u a i =
  let
    tf =
      fmap getSum
        . HashMap.fromListWith (<>)
        . map (, Sum 1)
        $ Analysis.analyse (t <> " " <> a)
  in
    Document i t u a tf
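To see why the Sum values act as a counter, here is an illustrative ghci session (the words are arbitrary, and the printing order of a HashMap is not guaranteed):

ghci> import Data.Monoid (Sum (..))
ghci> import qualified Data.HashMap.Strict as HashMap
ghci> fmap getSum (HashMap.fromListWith (<>) (map (\t -> (t, Sum 1)) ["BEER", "FLOOD", "BEER"]))
fromList [("FLOOD",1),("BEER",2)]

Duplicate keys get their Sum 1 values smushed together by fromListWith, and stripping the wrapper with getSum leaves plain counts.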
When searching, we will do lookups in this map, as well as compute the inverse
document frequency for each term in the search phrase. The idf we compute here
is nearly the same as in the Python article, except we are adding one to the
denominator to avoid dividing by zero. I cannot figure out how the Python
code avoids this, so it might be a latent bug that the author failed to discover
because they didn’t test this version of the code with 'OR'
searches.
search :: SearchType -> Text -> Index -> [Document]
search searchType query idx =
  let
    combining = case searchType of
      And -> IntSet.intersection
      Or -> IntSet.union

    terms = Analysis.analyse query

    results =
      catMaybes
        . map (flip IntMap.lookup (docs idx))
        . IntSet.toList
        $ case map (lookupTerm idx) terms of
            [] -> IntSet.empty
            (x:xs) -> foldl' combining x xs

    idf term =
      let
        n = fromIntegral (IntMap.size (docs idx))
        df_t = fromIntegral (IntSet.size (lookupTerm idx term))
      in
        log (n / (1 + df_t))

    tf doc term =
      fromIntegral . fromMaybe 0 . HashMap.lookup term $
        Document.termFreq doc

    score doc =
      foldMap (\t -> Sum (idf t * tf doc t)) terms
  in
    sortOn (Down . score) results
Again we use the Sum
type to smush together the scores for each search result.
At the end, we wrap the score in the Down
type to get a highest-score-first
ordering of the search results. This completes a search for “London Beer Flood”
in 49 milliseconds – roughly the time it took Python without ranking.
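If Down is unfamiliar: it is a newtype from Data.Ord whose Ord instance is the reverse of the wrapped type’s, so sorting on it flips the order. A quick ghci illustration:

ghci> import Data.List (sortOn)
ghci> import Data.Ord (Down (..))
ghci> sortOn id [3, 1, 2]
[1,2,3]
ghci> sortOn Down [3, 1, 2]
[3,2,1]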
And that’s it! We have replicated 150-ish lines of Python in 150-ish lines of Haskell. It wasn’t that hard. It might look foreign, but to an experienced programmer, it is easier to test, maintain, and ensure the correctness of.
Appendix A: Optimising parsing
The xml parser used in the code above is definitely correct, but unnecessarily powerful for what we need of it. The data supplied in the Wikimedia dump looks like xml, but is in fact much more regular than that. We can produce a practically equivalent document iterator using more primitive functions operating directly on the byte stream.
documentsPrimitive :: ConduitT () Document (ResourceT IO) ()
documentsPrimitive =
  let
    -- We are cheating very hard here and relying on the strong guarantees
    -- we can make on the input format down to byte sequences. This checks if
    -- the line starts with e.g. "<title>" and then strips characters that
    -- hopefully correspond to start and end tags.
    getTagContent name str =
      if not (ByteString.isPrefixOf name str)
        then Nothing
        else
          Just
            . Text.decodeUtf8
            . ByteString.drop (ByteString.length name)
            . ByteString.take (ByteString.length str - ByteString.length name - 1)
            $ str

    -- This builds a partial document by making every field optional, and then
    -- yielding a full Document once all fields are filled.
    buildDoc str (!ai, !at, !au, !aa) =
      case (ai, at, au, aa) of
        (i, Just t, Just u, Just a) ->
          ( (i+1, Nothing, Nothing, Nothing)
          , Just (mkDocument t u a i)
          )
        (i, mt, mu, ma) ->
          ( ( i
            , mt <|> getTagContent "<title>" str
            , mu <|> getTagContent "<url>" str
            , ma <|> getTagContent "<abstract>" str
            )
          , Nothing
          )

    yieldMaybeDocs =
      Conduit.mapAccum buildDoc (0, Nothing, Nothing, Nothing) $> ()
  in do
    sourceFile "enwiki-20250123-abstract.xml.gz"
      .| multiple ungzip
      .| Conduit.lines
      .| yieldMaybeDocs
      .| Conduit.catMaybes
This is useful for testing purposes when one wants fast feedback loops.
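Since the primitive parser yields the same stream of Document values, plugging it into the index build is a one-line change. A sketch, where buildIndexPrimitive is my name rather than anything from the article:

buildIndexPrimitive :: IO Index
buildIndexPrimitive =
  runConduitRes $
    documentsPrimitive .| foldMapC Index.fromDocument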