Web Scraping with Lenses

kqr, published 2015-07-01

Sometimes I'm curious about something on the web. Maybe it's a table with numbers and I'd like an arithmetic average of them. Or, in this case, someone says that "Project Euler isn't as maths-y as people say." Immediately I want to look at the titles of a random sample of a few Project Euler challenges to see how mathsy they really are. I could do all this manually, but I could also automate it because I'm a programmer.

Preparation

Since challenges on Project Euler are indexed by numbers 1 through 512, I know I need a bunch of random numbers to pick out random challenges. System.Random to the rescue!

import Control.Monad
import System.Random

main = do
    numbers <- take 10 . randomRs (1,512) <$> getStdGen
    forM_ numbers $ \i -> do
        print (i :: Int)

This should be pretty self-explanatory. numbers is a list of 10 random numbers drawn uniformly from 1 to 512 (inclusive), based on the global standard generator. I loop through them and print them all. As it turns out, the Int type signature is necessary, because otherwise GHC doesn't know if I want Ints, Integers, Doubles or anything else vaguely number-y.
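One thing the snippet above doesn't guard against is drawing the same challenge twice, since randomRs happily repeats values. Here is a minimal sketch of one way around that, assuming it's fine to filter the lazy stream through nub. The helper name sampleDistinct and the fixed seed 2015 are my own inventions for illustration, not part of the original program.

```haskell
import Data.List (nub)
import System.Random (StdGen, mkStdGen, randomRs)

-- | Pick @count@ distinct numbers between 1 and @upper@ (inclusive).
-- Assumes count <= upper: with a larger count the stream can never
-- produce enough distinct values and 'take' would not terminate.
sampleDistinct :: Int -> Int -> StdGen -> [Int]
sampleDistinct count upper gen = take count (nub (randomRs (1, upper) gen))

main :: IO ()
main =
    -- A fixed seed (hypothetical value) makes the run reproducible;
    -- swap in getStdGen or newStdGen for a fresh sample on each run.
    print (sampleDistinct 10 512 (mkStdGen 2015))
```

Because both randomRs and nub are lazy, only as many random numbers are generated as are needed to find ten distinct ones.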

Download the Page

Now that we have the random numbers, let's download the challenges corresponding to those numbers! This is easy as pie with wreq. Besides the imports and a small problem helper that builds the URL, the only thing we change is the loop body.

import Control.Concurrent
import Control.Lens
import Control.Monad
import Network.Wreq
import System.Random

problem n = "https://projecteuler.net/problem=" ++ show (n :: Int)

main = do
    numbers <- take 10 . randomRs (1,512) <$> getStdGen
    forM_ numbers $ \i -> do
        response <- get (problem i)
        print (response ^. responseBody)
        threadDelay 2000000

First we use the wreq function get to make a GET request for a problem. (The Int annotation in problem is there for the same reason as before.) We store the result in the response variable, then print its responseBody field. Finally we sleep for two seconds after each request (threadDelay counts in microseconds) to be nice to the server.
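To make the request step easier to poke at on its own, here is a small sketch that wraps it in a helper with explicit type signatures. The names problemUrl and fetchProblem are hypothetical, made up for this example. It also reads the status code through the responseStatus and statusCode lenses from wreq, which the original program doesn't look at.

```haskell
import Control.Lens ((^.))
import qualified Data.ByteString.Lazy as BL
import Network.Wreq (get, responseBody, responseStatus, statusCode)

-- | Build the URL for challenge number @n@ (same scheme as 'problem' above).
problemUrl :: Int -> String
problemUrl n = "https://projecteuler.net/problem=" ++ show n

-- | Download one challenge page, returning the HTTP status code
-- together with the raw, still undecoded, response body.
fetchProblem :: Int -> IO (Int, BL.ByteString)
fetchProblem n = do
    response <- get (problemUrl n)
    return (response ^. responseStatus . statusCode, response ^. responseBody)

main :: IO ()
main = do
    (code, body) <- fetchProblem 1
    putStrLn ("status " ++ show code ++ ", " ++ show (BL.length body) ++ " bytes")
```

Note that the body comes back as a lazy ByteString, which is why a decoding step is needed before any HTML parsing can happen.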

Get the Titles

Just dumping the HTML of the page, as we have done now, isn't particularly productive. We would like to extract the title of the challenge and print out only that to make it easier to read the data. This requires a small modification of the loop body again, plus some imports – notably bringing taggy-lens, which does most of the heavy lifting, into scope.

{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent
import Control.Lens
import Control.Monad
import qualified Data.Text.IO as T
import Data.Text.Lazy.Encoding
import Network.Wreq
import System.Random
import Text.Taggy.Lens

problem n = "https://projecteuler.net/problem=" ++ show (n :: Int)

main = do
    numbers <- take 10 . randomRs (1,512) <$> getStdGen
    forM_ numbers $ \i -> do
        response <- get (problem i)
        T.putStrLn (response ^. responseBody . to decodeUtf8 . title)
        threadDelay 2000000

title = html . allNamed (only "h2") . contents

I know the title of the challenge is in the only <h2> tag on the page, so I create a lens title which drills down into the HTML, then into all <h2> tags, and then into their contents. Strictly speaking title is a traversal rather than a lens, since it can focus on any number of elements. That is why the combinator ^. has to merge everything it finds into a single text value (by concatenation), which I then print.
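The concatenation behaviour is easiest to see on a page with more than one <h2>. The sketch below runs the same optics over a tiny hand-written document, so it needs no network access. samplePage is invented markup, not what Project Euler actually serves, and the helper names headings and joined are hypothetical too. It contrasts ^. with ^.., which collects every target into a list instead of merging them.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Lens (only, (^.), (^..))
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Text.Lazy as TL
import Text.Taggy.Lens (allNamed, contents, html)

-- Hypothetical stand-in for a downloaded page.
samplePage :: TL.Text
samplePage = "<html><body><h2>First</h2><p>text</p><h2>Second</h2></body></html>"

-- | Every <h2> text as its own list element.
headings :: TL.Text -> [T.Text]
headings page = page ^.. html . allNamed (only "h2") . contents

-- | All <h2> texts merged into one value via the Monoid instance of Text.
joined :: TL.Text -> T.Text
joined page = page ^. html . allNamed (only "h2") . contents

main :: IO ()
main = do
    mapM_ T.putStrLn (headings samplePage)
    T.putStrLn (joined samplePage)
```

On a page with exactly one <h2>, as in the scraper, the two give the same text. On anything else ^.. is probably the safer choice, because separate titles don't get glued together.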

Wrapping Up

And that's it, really. What's great about this is how easily the lenses that do the extraction combine. It's like writing jQuery, except in a real language! The combination of wreq and taggy-lens works great in the interactive interpreter too. In fact, that's how I came up with the access string:

responseBody . to decodeUtf8 . html . allNamed (only "h2") . contents

I just started with the first bit and then added step after step until I had focused on the data I wanted.
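That step-by-step process can be sketched roughly as below, with each step as a named function rather than an interpreter line. This runs against invented markup (samplePage is hypothetical) and uses has and lengthOf from lens to check each prefix of the access string before extending it. The function names are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Lens (has, lengthOf, only, (^..))
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Text.Taggy.Lens (allNamed, contents, html)

-- Hypothetical stand-in for a decoded response body.
samplePage :: TL.Text
samplePage = "<html><body><h2>Only heading</h2><p>text</p></body></html>"

-- Step 1: does the text parse into a DOM node at all?
parses :: TL.Text -> Bool
parses = has html

-- Step 2: how many <h2> elements does the next piece focus on?
headingCount :: TL.Text -> Int
headingCount = lengthOf (html . allNamed (only "h2"))

-- Step 3: finally, pull out the text inside those elements.
headingTexts :: TL.Text -> [T.Text]
headingTexts page = page ^.. html . allNamed (only "h2") . contents

main :: IO ()
main = do
    print (parses samplePage)
    print (headingCount samplePage)
    print (headingTexts samplePage)
```

Each function only adds one more piece to the previous composition, which mirrors how the access string grows one segment at a time.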

So... what's the result?

Scary Sphere
Digit factorials
Compromise or persist
Number letter counts
The Ackermann function
Lowest-cost Search
Combined Volume of Cuboids
Arithmetic expressions
Remainder of polynomial division
Composites with prime repunit property

Pretty mathsy, I'd say.