Entropic Thoughts

Chi-Squared From Fundamentals

During the Second World War, the Germans invented the cruise missile and sent free samples over the Channel to London. The missiles appeared to hit some places in London more than others (train tracks?), and the British government was worried the Germans were somehow able to target the missiles effectively.

To find out, an actuary did what Fisher and other biostatisticians had done many years prior when studying e.g. bacterial cultures:

  1. Take a map of the affected area.
  2. Draw over it a grid of evenly sized squares.
  3. Count the number of hits in each square.
  4. Check the distribution of counts.
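The four steps above can be sketched in Python. This is only an illustration under loud assumptions: the hit coordinates are randomly generated stand-ins (not real data), the map is a 10 by 10 square, and the names `count_hits_per_square` and `count_distribution` are made up for this sketch.

```python
import random
from collections import Counter

def count_hits_per_square(points, width, height, cell):
    """Steps 1-3: lay a grid of cell-sized squares over the map and count hits in each."""
    cols = int(width // cell)
    rows = int(height // cell)
    hits = Counter()
    for x, y in points:
        # Clamp so a point exactly on the far edge lands in the last square.
        col = min(int(x // cell), cols - 1)
        row = min(int(y // cell), rows - 1)
        hits[(row, col)] += 1
    # Squares without hits matter too, so report every square (Counter gives 0).
    return [hits[(r, c)] for r in range(rows) for c in range(cols)]

def count_distribution(counts):
    """Step 4: how many squares received 0 hits, 1 hit, 2 hits, and so on."""
    return dict(sorted(Counter(counts).items()))

# Hypothetical stand-in data: 300 hits scattered uniformly at random.
rng = random.Random(1)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(300)]

per_square = count_hits_per_square(points, width=10, height=10, cell=1)
distribution = count_distribution(per_square)
print(distribution)
```

With untargeted (uniform) hits like these, the printed tally is the thing we would compare against a Poisson distribution.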

If there is no meaningful targeting going on, the counts in each square would follow the Poisson distribution. (That holds under appropriate assumptions; as always, I’m playing fast and loose with this stuff.)

We’ll perform the same exercise together to learn how it all works! As we will see, this can be useful to e.g. diagnose visitor patterns on a website.

Web page visit distribution

I cannot launch cruise missiles. (Since my current $day_job is Haskell programming, I won’t even do it by accident.) We need some other data to play with. What we can do is take a random sample of pages on our website, and count how many visits they’ve had in the past 24 hours. What we aim to find out is whether some pages are (in missile terminology) effectively targeted, i.e. significantly more popular, or if it’s mainly random chance that dictates popularity. We will be filtering out the most popular pages because we already know they are being targeted effectively – that’s what it means for them to be popular.

This means we will be looking at the background rate of visits to our website. Among the less popular pages, are some of them actually more popular than others, or are they roughly of the same popularity and then some just happen by chance to get more visits, due to random clustering?

This is a histogram of visit count for 53 randomly selected low-activity articles on this website.

chisq-basics-01.svg

We can eyeball it and it appears sort of Poisson-y, right? But eyeballing is the one thing Fisher warns us we must not rely on. We have to see what the numbers say.

We remind ourselves that the histogram is just an artificially grouped graphic. We are still talking about actual visit count numbers, one for each article. The first eleven are

2 3 7 3 4 6 4 10 9 3 5

but there are 42 more of them.

The arithmetic mean (“average”) of all these values is \(\overline{x}\) = 5.2 visits/page. We can either estimate the variance manually, or ask our spreadsheet software to do it for us. Either way, we’ll get something around \(s^2\) = 3.35.
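For readers who would rather not open a spreadsheet, here is a small Python sketch of both routes: the variance computed by hand and the same thing from the standard library. It assumes the variance meant here is the usual sample variance (divisor \(n - 1\)). Only the eleven values printed above are used, so the results will not match the figures for the full set of 53.

```python
import statistics

# Only the eleven values shown in the text; the other 42 are not printed there.
visits = [2, 3, 7, 3, 4, 6, 4, 10, 9, 3, 5]

n = len(visits)
mean = statistics.mean(visits)

# "Manually": sum of squared deviations from the mean, divided by n - 1.
manual_variance = sum((x - mean) ** 2 for x in visits) / (n - 1)

# "Ask the software": statistics.variance uses the same n - 1 divisor.
library_variance = statistics.variance(visits)

print(f"n = {n}, mean = {mean:.2f}")
print(f"variance by hand = {manual_variance:.2f}, by library = {library_variance:.2f}")
```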

From maybe Poisson to chi-squared

Those statistics should make us a little suspicious! For a pure Poisson distribution, the index of dispersion, defined as the variance divided by the mean, should be unity, because in a true Poisson distribution the variance is equal to the mean. In our data, we have

\[\frac{s^2}{\overline{x}} = \frac{3.35}{5.2} \approx 0.64 < 1\]

This suggests that the page visit data is underdispersed: that visit counts are too narrowly spread, too uniform, compared to what they might be if they were entirely random.

Here’s where statistical process control teaches us to be careful: if we follow every little suggestion by the data, we’ll spend all our time chasing mirages and not get anything useful done – we’ll hunt every random variation down only to discover, after much sweat, that it was just a random variation.

So before we waste time investigating this, we can check how statistically significant this underdispersion is. If we multiply the index of dispersion by the number of items in the sample, we get a statistic

\[Q = n\frac{s^2}{\overline{x}} = 53 \frac{3.35}{5.2} \approx 34.1\]

and if the distribution of page visits is Poisson, this statistic is distributed according to the \(\chi^2(52)\) distribution. We can use this to help us understand how likely it is that this underdispersion is just a random variation, or if there’s a meaningful cause behind it.
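The arithmetic is small enough to check in a few lines of Python. The function names below are made up for this sketch, and it multiplies by \(n\) exactly as the formula above does. One caveat, stated as an assumption rather than a correction: many textbooks write this statistic as \(\sum (x_i - \overline{x})^2 / \overline{x}\), which equals \((n-1)s^2/\overline{x}\) when \(s^2\) is the sample variance. For 53 values the two versions differ by about 2 %.

```python
def index_of_dispersion(variance, mean):
    """Variance divided by mean; close to 1 for Poisson-distributed counts."""
    return variance / mean

def q_statistic(n, variance, mean):
    """Index of dispersion scaled by the number of values, as in the formula above."""
    return n * index_of_dispersion(variance, mean)

# Summary statistics quoted in the text for the 53 low-activity articles.
n, mean, variance = 53, 5.2, 3.35

index = index_of_dispersion(variance, mean)
q = q_statistic(n, variance, mean)
print(f"index of dispersion = {index:.2f}, Q = {q:.1f}")
```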

What it means for Q to be chi-squared

Okay, sorry. That escalated quickly. It is not at all obvious what it means for \(Q\) to be \(\chi^2\) distributed. We’ll back up and go a little slower.

Imagine we had a true Poisson distribution. If we had a really, really, freaking large sample from it (say, 99,000 values), its \(Q\) statistic would be almost exactly equal to \(n\), i.e. 99,000. The reason is that for a true Poisson distribution, the index of dispersion \(s^2/\overline{x}\) is 1, so when we multiply by \(n\), we get … well, \(n\).

As we take smaller samples from the Poisson distribution, the index of dispersion might vary from unity due to random chance. For example, these numbers are drawn from a pure Poisson distribution:

3 6 5 10 6 4 5 2 6 7

The mean is 5.4 and variance 4.9, meaning the index of dispersion is 0.9. The \(Q\) statistic is ten times that, i.e. 9.

Here’s another example of a Poisson draw:

4 2 3 4 10 8 6 4 4 4

The mean is 4.9 and the variance 5.9, meaning the index of dispersion is 1.2. The \(Q\) statistic is ten times that, i.e. 12.
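Both worked examples can be reproduced with a short Python sketch. It assumes the variance is the sample variance (divisor \(n - 1\)), which is what reproduces the 4.9 and 5.9 quoted above; the helper name `summarise` is made up.

```python
import statistics

def summarise(values):
    """Return mean, sample variance, index of dispersion and Q = n * index."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance, divisor n - 1
    index = variance / mean
    return mean, variance, index, n * index

first_draw = [3, 6, 5, 10, 6, 4, 5, 2, 6, 7]
second_draw = [4, 2, 3, 4, 10, 8, 6, 4, 4, 4]

for draw in (first_draw, second_draw):
    mean, variance, index, q = summarise(draw)
    print(f"mean={mean:.1f} variance={variance:.1f} index={index:.1f} Q={q:.1f}")
```

Because the text rounds the index of dispersion before multiplying by ten, the unrounded \(Q\) printed here can differ slightly in the first decimal.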

As we draw more and more samples of 10 from a pure Poisson population, we will find most of the \(Q\) statistics ending up around 10, with some random variation this way and that way. Here are some more \(Q\) statistics from samples of 10 draws from other Poisson distributions:

  • 11.4
  • 10.4
  • 8.1
  • 6.9
  • 7.0

We can ask the computer to continue this without our intervention, and compute \(Q\) for 5000 draws of 10 each, and we’d end up with the \(Q\) statistics here:

chisq-basics-02.svg

This happens to align very well with a known distribution, namely the \(\chi^2(9)\) distribution!

chisq-basics-03.svg
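A rough way to repeat this experiment at home is sketched below in Python, using only the standard library. Several things are assumptions of the sketch and not taken from the text: the Poisson rate of 5, the seed, the bin width of the text histogram, and the use of Knuth's multiplication method as the Poisson sampler. The \(Q\) computed here is \(n\) times the sample variance over the mean, as in the worked examples.

```python
import math
import random
import statistics

def poisson_draw(rng, lam):
    """One Poisson(lam) value via Knuth's method: multiply uniforms until below e^-lam."""
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1

def q_statistic(values):
    """n times the index of dispersion (sample variance over mean)."""
    return len(values) * statistics.variance(values) / statistics.mean(values)

def simulate_q(rng, lam=5.0, sample_size=10, repetitions=5000):
    """Collect Q for many small Poisson samples."""
    qs = []
    while len(qs) < repetitions:
        sample = [poisson_draw(rng, lam) for _ in range(sample_size)]
        if sum(sample) > 0:  # an all-zero sample has mean 0; vanishingly rare, skip it
            qs.append(q_statistic(sample))
    return qs

rng = random.Random(2024)
qs = simulate_q(rng)

# Crude text histogram: bins of width 4, one '#' per 25 samples.
bins = {}
for q in qs:
    low = 4 * int(q // 4)
    bins[low] = bins.get(low, 0) + 1
for low in sorted(bins):
    print(f"{low:3d} to {low + 4:<3d} {'#' * (bins[low] // 25)}")
print("mean of Q:", round(statistics.mean(qs), 2))
```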

Quiz time: If someone tells us they have drawn 10 values from a Poisson distribution, and that their \(Q\) statistic was 46.5, should we believe them? No. That just doesn’t happen. In the above plot, we have repeated such a draw 5000 times and not once did we get a \(Q\) statistic as high as 46.5.

If the \(Q\) statistic computed from 10 values is indeed 46.5, we know with virtual certainty that those values did not all come from the same Poisson distribution. If they had, the \(Q\) statistic would have fallen in one of the bars in the plot above.
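To put a number on "that just doesn’t happen", we can evaluate the \(\chi^2(9)\) distribution directly. The Python sketch below computes the cumulative distribution through the standard power series for the regularised lower incomplete gamma function; this is one common way to do it, picked here for being dependency-free. The function name `chi2_cdf` is made up, and the series is only meant for moderate arguments like the ones in this article.

```python
import math

def chi2_cdf(x, k):
    """P(X <= x) for X ~ chi-squared with k degrees of freedom.

    Uses P(a, z) = z^a e^-z / Gamma(a) * sum_{n>=0} z^n / (a (a+1) ... (a+n))
    with a = k / 2 and z = x / 2.
    """
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

# Chance of a Q at least as large as 46.5 if Q really follows chi-squared(9).
tail = 1.0 - chi2_cdf(46.5, 9)
print(f"P(Q >= 46.5) = {tail:.2e}")
print(f"expected occurrences in 5000 draws = {5000 * tail:.4f}")
```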

The underdispersed page visits

This is what we are doing when checking whether the page visits we have recorded are Poisson. We had 53 values, and their \(Q\) statistic was 34.1. Where does this end up on the \(\chi^2(52)\) distribution?

chisq-basics-04.svg

A charitable reading of this is that even when page visits are a pure Poisson draw, we can get a \(Q\) of 34.1 – it’s not entirely unthinkable. But a \(Q\) that low only happens about once in every 50 draws. It seems rather unlikely, given this \(Q\) statistic, that the page visits are pure Poisson draws.
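If we have no plot to read the tail off, a quick estimate is available from the Wilson-Hilferty approximation, which treats the cube root of a \(\chi^2(k)\) variable divided by \(k\) as roughly normal. The Python sketch below is an approximation, and the function name is made up, so expect only rough agreement with a figure eyeballed from the plot.

```python
from statistics import NormalDist

def chi2_cdf_approx(x, k):
    """Wilson-Hilferty approximation to P(X <= x) for X ~ chi-squared(k)."""
    spread = 2.0 / (9.0 * k)
    z = ((x / k) ** (1.0 / 3.0) - (1.0 - spread)) / spread ** 0.5
    return NormalDist().cdf(z)

# How often would a pure Poisson sample of 53 give a Q of 34.1 or lower?
lower_tail = chi2_cdf_approx(34.1, 52)
print(f"P(Q <= 34.1) is roughly {lower_tail:.3f}, i.e. about 1 in {1 / lower_tail:.0f}")
```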

So what has happened?

Hunting down the cause

The plot from earlier gives a hint as to what might cause this underdispersion:

chisq-basics-01.svg

If there’s anything wrong with this shape compared to a Poisson distribution, it’s that this seems too uniform. It’s as if there’s something systematically making sure each article gets a low number of visits all the time. And indeed, if we grep the logs for these low-visit articles, we get hits on things like openai-searchbot and AhrefsBot. These are crawlers I forgot to exclude from my log extractor script. (I was already filtering out the likes of Baidu, Google, Bing, and ByteDance.)

Stop and admire! We didn’t set out to find missing crawler exclusions. We just noticed something was maybe not right with the page visit patterns. We checked numerically to make sure our intuition was barking up the right tree, and then we investigated. We managed to make a permanent improvement to our tooling that we otherwise might not have made, all thanks to statistics.

Reanalysis after fixing the cause

After fixing the analytics script, the histogram of page visits looks like this instead.

chisq-basics-05.svg

Here, the mean is 2.8 rather than 5.2, and the variance is still around 3.3. The \(Q\) statistic is \(53\frac{3.3}{2.8} \approx68\), which is on the higher end of the distribution:

chisq-basics-06.svg

As a reminder, what we want to know here is whether

  • (null hypothesis) All unpopular pages are equally unpopular, and only randomness decides which one of them gets more visits (this would be well-modeled by the Poisson distribution); or
  • (alternative hypothesis) Some unpopular pages are more popular than others, and these get an outsize number of visits compared to their less popular peers (this would look like something more widely dispersed than the Poisson distribution).

The result above suggests overdispersion, i.e. that a popularity contest does exist among unpopular pages. However, it would not be crazy to get this \(Q\) statistic even if there is no real popularity difference between unpopular pages: it falls in the part of the \(\chi^2\) distribution where it would happen about once every 13 samples, which is not unthinkable to encounter by chance.
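The size of that upper tail can also be estimated from first principles, because a \(\chi^2(k)\) variable is by definition the sum of \(k\) squared independent standard normal draws. The Python sketch below simulates exactly that. The repetition count, seed and function names are arbitrary choices for the sketch, and being a simulation it gives an estimate that should be in the same ballpark as the figure above, not an exact match.

```python
import random

def chi2_draw(rng, k):
    """One chi-squared(k) value: the sum of k squared standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

def upper_tail_fraction(q, k, repetitions=20000, seed=7):
    """Fraction of simulated chi-squared(k) values that are at least q."""
    rng = random.Random(seed)
    exceed = sum(1 for _ in range(repetitions) if chi2_draw(rng, k) >= q)
    return exceed / repetitions

# Q of about 68 from 53 values, compared against chi-squared(52).
fraction = upper_tail_fraction(68, 52)
print(f"estimated P(Q >= 68) = {fraction:.3f}, about 1 in {1 / fraction:.0f}")
```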

In the end, it’s up to us to determine whether we see that as sufficient evidence against randomness. If we’re on the fence, what we can do is collect more data. If we mainly get results that weakly suggest overdispersion, we’ll have failed to find strong evidence against randomness (tortured sentence, but that’s how it works) and we’ll have to accept that there is no meaningful popularity difference between unpopular articles. If some results strongly suggest overdispersion, we may entertain the idea of popularity differences even among unpopular articles. (I have other reasons to believe unpopular articles exhibit differential popularity. Having glanced at the logs for so many years, I recognise that some articles are referenced in wikis, private onboarding material, etc. These get slightly more visits than others for that reason.)

Combining data from multiple studies

I haven’t run the study above for multiple days, but my wife has these professional-grade, very sticky fly trap papers up around the apartment. (We have had a fruit fly invasion or something the past couple of months.) They are gridded, meaning we can count the number of flies they have caught in each square. (I think this, or something like it, is actually the reason they are gridded. My wife got them from a person working with exotic plants at the university.)

The first paper I looked at, let’s call it A, had these numbers of flies per square:

0 1 1 3 0 0 1 1 1
1 2 1 3 2 0 0 2

Another, B, had these counts:

5 9 7 12 8 10 4
9 5 6 10 12 10 9

And a third, C (yes, I did actually count nearly 200 fruit flies on a single paper):

15 12 17 20 14 9 16
13 8 16 23 15 10  

These series are too short to draw any distributional conclusions from. We cannot, for each series alone, determine whether it is Poisson or not. However, they are taken from sources that are very similar, so we could imagine that if one is governed by a Poisson law, then all of them are. Under this assumption, we can combine them into one long series.

What we cannot do is just concatenate them, since they have very different means. We have to find another way to combine them. Let’s compute the \(Q\) statistic for each of them in turn:

Paper  \(n\)  \(\overline{x}\)  \(s^2\)  Index of disp.  \(Q\)
A      17     1.1               0.99     0.88            14.8
B      14     8.3               6.5      0.79            11.0
C      13     15                18       1.2             16.1

What can help us is that the sum of two variables drawn from \(\chi^2(k)\) and \(\chi^2(m)\) is distributed \(\chi^2(k+m)\).

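We sum up all the \(Q\) into 41.9. Each paper with \(n\) squares contributes \(n - 1\) degrees of freedom, just as the 53 page counts were tested against \(\chi^2(52)\), so we test the sum against \(\chi^2(16 + 13 + 12) = \chi^2(41)\). It lands at roughly the 57th percentile. That is far from either tail, so we have failed to find evidence against randomness, and we can assume that flies get stuck at random locations on the paper.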
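The per-paper statistics and their sum can be recomputed from the counts listed above with a short Python sketch. It assumes the same convention as the worked examples earlier (sample variance with divisor \(n - 1\), then \(Q\) as \(n\) times the index of dispersion); the helper name is made up. Values may differ slightly from the rounded figures in the table.

```python
import statistics

# Flies per square, as listed in the text.
papers = {
    "A": [0, 1, 1, 3, 0, 0, 1, 1, 1, 1, 2, 1, 3, 2, 0, 0, 2],
    "B": [5, 9, 7, 12, 8, 10, 4, 9, 5, 6, 10, 12, 10, 9],
    "C": [15, 12, 17, 20, 14, 9, 16, 13, 8, 16, 23, 15, 10],
}

def q_statistic(values):
    """n times the index of dispersion (sample variance over mean)."""
    return len(values) * statistics.variance(values) / statistics.mean(values)

total_q = 0.0
for name, counts in papers.items():
    q = q_statistic(counts)
    total_q += q
    print(f"{name}: n={len(counts)} mean={statistics.mean(counts):.1f} "
          f"variance={statistics.variance(counts):.2f} Q={q:.1f}")

# Q values from independent papers add up, which is what lets us pool them.
print(f"sum of Q = {total_q:.1f}")
```

Summing the statistics instead of concatenating the raw counts is what keeps the very different means of the three papers from masquerading as overdispersion.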

From Poisson to arbitrary distributions

You may have heard of the chi-squared test in the context of distributions other than the Poisson. This article contains the intuition necessary to understand how that works too. The key is to realise that if we bucket other distributions at the appropriate granularity, and start filling in the buckets from observations according to that distribution, then the number of observations that fall into each bucket is effectively drawn from a Poisson distribution!
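A small simulation makes that claim easier to believe. The Python sketch below uses a uniform distribution as a stand-in for "other distributions", with 500 observations, 100 equal buckets and 2000 repetitions. All of those choices are arbitrary assumptions for illustration. It watches a single bucket and checks whether its count has an index of dispersion close to 1, as a Poisson count would.

```python
import random
import statistics

def bucket_count(rng, observations, buckets, watched=0):
    """Drop uniform observations into equal-width buckets; return the count in one bucket."""
    return sum(
        1 for _ in range(observations)
        if int(rng.random() * buckets) == watched
    )

rng = random.Random(11)

# Repeat the whole experiment many times and record the watched bucket's count.
counts = [bucket_count(rng, observations=500, buckets=100) for _ in range(2000)]

mean = statistics.mean(counts)
variance = statistics.variance(counts)
index = variance / mean
print(f"mean = {mean:.2f}, variance = {variance:.2f}, index of dispersion = {index:.2f}")
```

Strictly speaking the count in one bucket is binomial, but with many buckets each one catches only a small share of the observations, and in that regime the binomial is very close to a Poisson.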

The details of that will be a future article.