Web Scraping with Lenses
Sometimes I'm curious about something on the web. Maybe it's a table with numbers and I'd like an arithmetic average of them. Or, in this case, someone says that "Project Euler isn't as maths-y as people say." Immediately I want to look at the titles of a random sample of a few Project Euler challenges to see how mathsy they really are. I could do all this manually, but I could also automate it because I'm a programmer.
Preparation
Since challenges on Project Euler are indexed by numbers 1 through 512, I know I need a bunch of random numbers to pick out random challenges. System.Random to the rescue!
import Control.Monad
import System.Random

main = do
  numbers <- take 10 . randomRs (1,512) <$> getStdGen
  forM_ numbers $ \i -> do
    print (i :: Int)
This should be pretty self-explanatory. The numbers list holds 10 random numbers between 1 and 512, drawn from the global standard generator. I loop through them and print each one. As it turns out, the Int type annotation is necessary, because otherwise GHC doesn't know if I want Ints, Integers, Doubles or anything else vaguely number-y.
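One thing the loop above does not guard against is drawing the same number twice. Here is a minimal sketch of a variation, not what the rest of this post uses: the helper name sampleProblems and the seed 2015 are made up for illustration. It fixes the seed with mkStdGen so the sample is reproducible, and drops repeats with nub. Giving the helper a [Int] result type also settles the ambiguity mentioned above.

```haskell
import Data.List (nub)
import System.Random (StdGen, mkStdGen, randomRs)

-- Hypothetical helper: n distinct problem numbers between 1 and 512.
-- nub is lazy, so it works on the infinite stream from randomRs,
-- but asking for more than 512 numbers would never finish.
sampleProblems :: Int -> StdGen -> [Int]
sampleProblems n = take n . nub . randomRs (1, 512)

main :: IO ()
main =
  -- A fixed seed (arbitrary choice) gives the same sample on every run.
  print (sampleProblems 10 (mkStdGen 2015))
```

Swapping mkStdGen 2015 for the generator returned by getStdGen gives a fresh sample per run, as in the original loop.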
Download the Page
Now that we have the random numbers, let's download the challenges corresponding to those numbers! This is easy as pie with wreq. The only thing we change (besides imports) is the loop body.
import Control.Concurrent
import Control.Lens
import Control.Monad
import Network.Wreq
import System.Random

problem n = "https://projecteuler.net/problem=" ++ show (n :: Int)

main = do
  numbers <- take 10 . randomRs (1,512) <$> getStdGen
  forM_ numbers $ \i -> do
    response <- get (problem i)
    print (response ^. responseBody)
    threadDelay 2000000
First we use the wreq function get to make a GET request for a problem. (The Int annotation, which now lives in problem, is there for the same reason as before.) We store the response in the response variable. Then we print the responseBody field of the response. Finally we sleep for two seconds after each request (threadDelay counts in microseconds) to be nice to the server.
Get the Titles
Just dumping the HTML of the page, as we have done now, isn't particularly productive. We would like to extract the title of the challenge and print out only that to make it easier to read the data. This requires a small modification of the loop body again, plus some imports – notably bringing taggy-lens, which does most of the heavy lifting, into scope.
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent
import Control.Lens
import Control.Monad
import Data.Text.IO as T
import Data.Text.Lazy.Encoding
import Network.Wreq
import System.Random
import Text.Taggy.Lens

problem n = "https://projecteuler.net/problem=" ++ show (n :: Int)

main = do
  numbers <- take 10 . randomRs (1,512) <$> getStdGen
  forM_ numbers $ \i -> do
    response <- get (problem i)
    T.putStrLn (response ^. responseBody . to decodeUtf8 . title)
    threadDelay 2000000

title = html . allNamed (only "h2") . contents
I know the title of the challenge is in the only <h2> tag on the page, so I create a lens title which drills down into the HTML, then into all <h2> tags, and then into their contents. The lens combinator ^. will turn them all into a single text value (by concatenation), which I then print.
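To see the concatenation behaviour in isolation, here is a small sketch that needs only the lens package. The titles list is a made-up stand-in for the text pieces a traversal might focus on. It shows ^. gluing every target together, next to ^.. and ^? which keep them apart.

```haskell
import Control.Lens

-- Hypothetical stand-in for several text fragments found by a traversal.
titles :: [String]
titles = ["Digit ", "factorials"]

main :: IO ()
main = do
  putStrLn (titles ^. traverse)  -- all targets appended into one String
  print (titles ^.. traverse)    -- every target, kept as a list
  print (titles ^? traverse)     -- only the first target, if there is one
```

On a page with more than one <h2>, ^. would run the headings together in the same way, so ^.. is probably the safer choice when exploring an unfamiliar page.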
Wrapping Up
And that's it, really. What's so great about this is how easily the lenses that do the extraction work combine. It's like writing jQuery except in a real language! The combination of wreq and taggy-lens works great in the interactive interpreter too! In fact, that's how I came up with the access string
responseBody . to decodeUtf8 . html . allNamed (only "h2") . contents
I just started with the first bit and then added step after step until I had focused on the data I wanted.
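That step-by-step approach can be tried without touching the network. The following is a rough sketch under some assumptions: the markup value is invented for illustration and is not real Project Euler HTML, and the names parses, headingCount and headings are hypothetical. It reuses the same html, allNamed and contents pieces as above, adding one step at a time.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Lens
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Text.Taggy.Lens

-- Invented stand-in for a downloaded page (an assumption, not real markup).
markup :: TL.Text
markup = "<html><body><h2>Digit factorials</h2><p>Some text</p></body></html>"

-- Step 1: does the text parse as HTML at all?
parses :: Bool
parses = has html markup

-- Step 2: how many <h2> elements does the page contain?
headingCount :: Int
headingCount = lengthOf (html . allNamed (only "h2")) markup

-- Step 3: the text inside those elements, one entry per piece of text.
headings :: [T.Text]
headings = markup ^.. html . allNamed (only "h2") . contents

main :: IO ()
main = do
  print parses
  print headingCount
  print headings
```

In the interpreter, each of these expressions can be typed on its own, checking the result before adding the next piece of the chain.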
So... what's the result?
Scary Sphere
Digit factorials
Compromise or persist
Number letter counts
The Ackermann function
Lowest-cost Search
Combined Volume of Cuboids
Arithmetic expressions
Remainder of polynomial division
Composites with prime repunit property
Pretty mathsy, I'd say.