Regatta Starting Stations – Chi-squared Continued
In the Henley Royal Regatta two teams at a time propel their boats up a river and compete to be first to go a distance. Teams get assigned to their starting stations – Berkshire or Buckinghamshire – at random. From there, it is a straight shot up the river, with the lane from each starting station being seemingly identical. (This wasn’t always so! Historically, the race has been extremely unfair, with bends in the course favouring one of the starting stations.)
I didn’t know any of this, but a reader reached out some time ago because they had noticed something odd about this, and they wanted to borrow me as a sounding board. Here’s the odd thing: the team that starts from the Berkshire station has won 53.5 % of the 7555 races in the historic data this reader looked at. This is highly unexpected. If teams are assigned at random, and the starting stations are practically equal, then the starting station of the winning team should be a coin flip.
If we flip 7555 fair coins, we would practically never see as many as 53.5 % come up heads.
Saying that the starting station of the winner is a coin flip is, in statistical terminology, known as the winner being a binomial draw with equal probability for each station. This does not mean exactly half of the wins will be from either station – but somewhere in that neighbourhood. Since we have a large number of samples, the size of that neighbourhood is given by the standard deviation of the distribution, which would be
\[\sqrt{np(1-p)} = \sqrt{7555 \times 0.5 \times 0.5} = 43.5\]
When determining whether something warrants further investigation, I normally use a fast-and-loose criterion of 1.645 standard deviations. With this threshold, we can compute that it wouldn’t be strange if as few as
\[\frac{7555}{2} - 1.645 \times 43.5 \approx 3706\]
races were won from one of the stations, or indeed as many as
\[\frac{7555}{2} + 1.645 \times 43.5 \approx 3849.\]
But in this case, well over 4000 were. Assuming it is true that teams are assigned starting stations at random, this can only mean one thing: the stations are not equal, and the Berkshire station is the better station to start from.
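For anyone who wants to check the arithmetic, here is a minimal sketch of the coin-flip calculation. The variable names are mine, and the observed count is reconstructed from the rounded 53.5 % share rather than taken from the raw data.

```python
import math

n = 7555   # races in the reader's data set
p = 0.5    # win probability per station under the coin-flip hypothesis
z = 1.645  # the fast-and-loose threshold, in standard deviations

sd = math.sqrt(n * p * (1 - p))
low = n * p - z * sd
high = n * p + z * sd

# Assumption: observed Berkshire wins approximated from the rounded share.
observed = 0.535 * n
z_score = (observed - n * p) / sd

print(f"standard deviation: {sd:.1f}")
print(f"unsurprising range: {low:.0f} to {high:.0f} wins")
print(f"observed is {z_score:.1f} standard deviations above the mean")
```

The last line is the useful one: it says how far outside the neighbourhood the Berkshire count sits, not just that it is outside.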
The effect is equally large for either sex
The races are divided into men’s and women’s classes. It seems like the Berkshire buff might be weaker for women than men. While a whopping 53.7 % of men’s wins came from the Berkshire station, only 52.7 % of women’s wins did. If this difference is statistically significant, that might give us a clue as to what causes the Berkshire buff: maybe it’s something that interacts with upper-body strength, which men typically have more of.
This is a case where we can pull out that poor man’s logistic regression. Let’s put the wins from the Berkshire station in a contingency table.
| | Men | Women |
|---|---|---|
| Wins | 3441 | 603 |
| Losses | 2969 | 542 |
If we think of our hypothesis as “competing in the men’s class makes the Berkshire buff stronger” we follow the steps of the referenced article, cross-multiply and approximate to find the log-odds difference
\[\log{\left(\frac{3441 \times 542}{2969 \times 603}\right)} \approx +4.2 \%\]
Then we take the square root of the sum of inverses, yielding a standard error of
\[\sqrt{\frac{1}{3441} + \frac{1}{2969} + \frac{1}{603} + \frac{1}{542}} \approx 6.4 \%\]
This means the difference between sexes (4.2 %) is less than one standard error (6.4 %) and thus nowhere near significance.
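The two steps can be sketched in a few lines of Python. I am assuming that the "approximate" step is the usual shortcut of replacing log(x) with x − 1 when x is close to one, so the code prints both the exact logarithm and the shortcut.

```python
import math

# Berkshire-station wins and losses by class, from the table above.
men_wins, men_losses = 3441, 2969
women_wins, women_losses = 603, 542

# Cross-multiply to get the odds ratio between the two classes.
odds_ratio = (men_wins * women_losses) / (men_losses * women_wins)
log_odds_diff = math.log(odds_ratio)  # exact log-odds difference
shortcut = odds_ratio - 1             # assumed approximation: log(x) ~ x - 1

# Square root of the sum of inverse counts.
std_err = math.sqrt(
    1 / men_wins + 1 / men_losses + 1 / women_wins + 1 / women_losses
)

print(f"log-odds difference: {log_odds_diff:.3f} (shortcut: {shortcut:.3f})")
print(f"standard error:      {std_err:.3f}")
print(f"difference is {log_odds_diff / std_err:.2f} standard errors from zero")
```

Whichever of the two versions of the difference we use, it stays well under one standard error, so the conclusion does not hinge on the approximation.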
In that previous article, we saw this as a cheaper alternative to logistic regression. Another way to look at it is as a cheaper chi-squared test. The reason this is a useful way to think of it is that the chi-squared test can be expanded to multiple classes, rather than just the binary one for men and women.
Transitioning to a chi-squared test
We have seen that men win more often from the Berkshire station than women. What we want to know is what the source of this variation is:
- does it come from a true difference between the sexes, or
- does it arise from sampling error?
The chi-squared test can help us rule out sampling error as a cause.
If the source of variation is purely sampling error, that would mean the real Berkshire win rate for both men and women is actually the population average of 53.5 %. Under that assumption, we can compare the hypothesised counts to the actual counts.
We compute the hypothesised counts by multiplying the total number of races for each sex by 53.5 % to get their wins, and then the losses fall out of that too. The hypothesis suggests that men, who competed in 3441+2969=6410 races, should have won 0.535×6410=3429 races, and lost the other 2981. We put the hypothesised counts into brackets in the table.
| | Men | Women |
|---|---|---|
| Wins | 3441 [3429] | 603 [613] |
| Losses | 2969 [2981] | 542 [532] |
Note that once we determined the first expected count, the rest were given to us for free. This is because we don’t want to change the totals across rows and columns. Once we have determined the hypothesised count of 3429 wins for the men, we could not have picked any other number of wins for women or losses for men without changing proportions we want to stay fixed. This fact is going to become important shortly!
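As a rough sketch, the hypothesised counts can be reproduced like this. The dictionary layout is just my choice for illustration, and I use the rounded 53.5 % rate, as the text does.

```python
# Observed Berkshire-station results by class, from the table above.
observed = {
    "Men": {"Wins": 3441, "Losses": 2969},
    "Women": {"Wins": 603, "Losses": 542},
}
pooled_rate = 0.535  # rounded overall Berkshire win rate

expected = {}
for group, counts in observed.items():
    races = counts["Wins"] + counts["Losses"]
    wins = round(pooled_rate * races)
    # Losses are not computed independently: they are whatever is left
    # once the wins are fixed, because the number of races must not change.
    expected[group] = {"Wins": wins, "Losses": races - wins}

for group, counts in expected.items():
    print(group, counts)
```

Note how the losses are never calculated on their own. That is the same bookkeeping constraint that limits how many of the numbers are free to vary.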
To figure out whether this difference between observed and hypothesised is significant, we compute the squared difference between observed and hypothesised, divided by the hypothesised. We get
| | Men | Women |
|---|---|---|
| Wins | \(\frac{(3441-3429)^2}{3429} = 0.042\) | \(\frac{(603-613)^2}{613} = 0.16\) |
| Losses | \(\frac{(2969-2981)^2}{2981} = 0.048\) | \(\frac{(542-532)^2}{532} = 0.19\) |
When we add these together, we get the \(\chi^2\) statistic from which the test derives its name. In this case, we have
\[\chi^2 = 0.042 + 0.048 + 0.16 + 0.19 = 0.44\]
Just as when we encountered the chi-squared test earlier, this number is – assuming variation is attributable to sampling error only – going to follow a chi-squared distribution. Conversely, if our statistic lies meaningfully outside of what normally happens when drawing from a chi-squared distribution, we are right to suspect that there is something beyond sampling error going on; in this case, that there is some actual difference between the sexes in terms of Berkshire win rate.
Because all hypothesised values fell out of the first we calculated (thanks to the sums being fixed), we should be looking at the chi-squared distribution with one degree of freedom. If we plot that or look up 0.44 in a table, we will see that it falls smack in the middle of the chi-squared distribution. This means we have no reason to suspect the variation in the Berkshire win rate between sexes has any cause other than sampling error.
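Instead of a table lookup, we can sketch the same check in Python. With one degree of freedom the chi-squared variable is a squared standard normal, so its upper-tail probability can be written with the complementary error function from the standard library. The variable names are mine.

```python
import math

# Order: men wins, women wins, men losses, women losses.
observed = [3441, 603, 2969, 542]
expected = [3429, 613, 2981, 532]  # bracketed counts from the table above

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# One degree of freedom only: P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-squared statistic: {chi2:.2f}")
print(f"p-value (1 degree of freedom): {p_value:.2f}")
```

A p-value close to one half is what "smack in the middle" looks like numerically: about half of all draws from that distribution would be at least this large.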
We already knew this from the poor man’s logistic regression above, but we went through the motions here anyway to introduce the \(\chi^2\) test with a 2×2 table before we extend it to a 3×2 table. (Different ways of performing the same hypothesis test will never yield different results. If one test does not show significance, then the other also will not. Two tests can show different results only if they are somehow different tests, i.e. one test makes assumptions the other does not.)
Stage as skill proxy
Since the chi-squared test works with any number of classes, we might want to look into whether the effect is stronger in later stages of the race. The data I received is split up into heats, semi-finals, and finals.
Maybe the teams in the finals are the most skilled teams, and maybe they are also better able to make use of the Berkshire buff, whatever it is? It might seem that way, because in the heats, 53 % of wins started at the Berkshire station, but in the semi-finals and finals, all of 56 % did.
Intuition is no substitute for statistics, so let’s get down to it. The expected wins in the following table are calculated based on the hypothesis that all classes actually have a 53.5 % win rate, and any difference is due to sampling error alone.
| | Heats | Semi-finals | Finals |
|---|---|---|---|
| Wins | 3195 [3228] | 565 [538] | 284 [275] |
| Losses | 2839 [2806] | 441 [468] | 231 [240] |
Computing the test statistic from these, we get
| | Heats | Semi-finals | Finals |
|---|---|---|---|
| Wins | 0.34 | 1.36 | 0.29 |
| Losses | 0.39 | 1.56 | 0.34 |
This means the total \(\chi^2\) is 4.28 with two degrees of freedom, which corresponds to a p-value of 0.12. This is not a significant difference.
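Here is a minimal sketch of the three-stage calculation, taking the bracketed counts from the table as given. With two degrees of freedom the upper tail of the chi-squared distribution happens to be exactly exp(−x/2), so no statistics library is needed. Summing unrounded cell values lands a hair below the total of the rounded cells.

```python
import math

stages = ["Heats", "Semi-finals", "Finals"]
observed = {"Wins": [3195, 565, 284], "Losses": [2839, 441, 231]}
expected = {"Wins": [3228, 538, 275], "Losses": [2806, 468, 240]}

chi2 = 0.0
for outcome in observed:
    for stage, o, e in zip(stages, observed[outcome], expected[outcome]):
        cell = (o - e) ** 2 / e
        chi2 += cell
        print(f"{outcome:6} {stage:12} {cell:.2f}")

# Degrees of freedom for a table: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(stages) - 1)

# Exact upper-tail probability, valid only for two degrees of freedom.
p_value = math.exp(-chi2 / 2)

print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {p_value:.2f}")
```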
But what if … we didn’t have data split up into semi-finals and finals? What if our data was just split into heats and advanced rounds?
| | Heats | Advanced rounds |
|---|---|---|
| Wins | 3195 [3230] | 849 [814] |
| Losses | 2839 [2804] | 672 [707] |
The chi-squared statistics for each cell would be
| | Heats | Advanced rounds |
|---|---|---|
| Wins | 0.38 | 1.5 |
| Losses | 0.44 | 1.7 |
In total, that is a \(\chi^2\) of 4.02 which is lower than before, but now with just one degree of freedom! That puts it beyond a p=0.05 significance threshold. Aha! So there might be an element of skill in using the Berkshire buff well, after all.
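To see why a smaller statistic can still cross the threshold, it helps to compare the 5 % critical values for one and two degrees of freedom. This sketch uses two closed forms that only hold for those particular cases: a squared normal quantile for one degree of freedom, and −2 ln(α) for two.

```python
import math
from statistics import NormalDist

alpha = 0.05

# One degree of freedom: chi-squared is a squared standard normal.
critical_1 = NormalDist().inv_cdf(1 - alpha / 2) ** 2
# Two degrees of freedom: the upper tail is exp(-x / 2), solved for x.
critical_2 = -2 * math.log(alpha)

cases = [
    ("heats / semi-finals / finals", 4.28, critical_2),
    ("heats / advanced rounds", 4.02, critical_1),
]
for label, chi2, critical in cases:
    verdict = "beyond" if chi2 > critical else "within"
    print(f"{label}: {chi2} is {verdict} the critical value {critical:.2f}")
```

The bar is considerably lower with one degree of freedom, which is why merging two columns changed the verdict even though the statistic itself went down.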
We shouldn’t do what we just did, because changing groupings around until we get a significant result is p-hacking. But what I really wanted to show is that more detailed groupings do not necessarily a stronger chi-squared test make. If we know in advance that the greatest contrast to be found in skill level is not in the three-level grouping, but in the difference between heats and advanced rounds, then that’s what we should aim to test, only.
We still don’t know what happens
We have found a meaningful difference between the Berkshire and Buckinghamshire starting stations. We have not seen evidence that it is mitigated or amplified by upper body strength. We do have some evidence that it might be easier for more skilled competitors to make use of that effect. What is it?
I found one article which briefly discusses the advantages of the Berkshire station, and it seems to put it down to wake flow in the water from other boats around the race disturbing the competitors in the Buckinghamshire station more. This would make sense given our discoveries too – margins ought to be tighter between more skilled competitors, so outside disturbances determine more of the outcome. But I don’t know.