Entropic Thoughts

Laws of Succession

Laws of Succession

Rajiv Prabhakar presents us with a hypothetical:

You and your friend are walking by a magic store and find a trick coin. You toss it 14 times and end up with 10 heads. Your friend thinks at least one of the next two tosses will end up tails, and is willing to offer you $10 in an even-money bet on it. Should you take him up?

This is a fancy way of asking,

Is the probability of two heads on the next two tosses greater than 50 %?

Rajiv claims to look at this a frequentist perspective, arriving at a 51 % probability, and then a Bayesian perspective, arriving at 48.5 %.

We can see the puzzle: one says “take the bet!” and the other says “ask for better odds!” How can statistics betray us like that in our time of greatest need?

… but it is based on flawed analysis.

Laplace’s law of succession

The central problem is that a true frequentist cannot answer the question we have asked. That question does not make sense from a frequentist perspective. This is why frequentists perform contortionist tricks like test hypotheses by asking,

Assuming heads and tails are equally likely, what would be the likelihood of observing 10 heads in the next 14 tosses?

and end up comparing that likelihood to somewhat arbitrary thresholds. This is not a strawman position – it plagued the father of frequentism, Ronald Fisher, for all his life.

Frequentists can, of course, answer a trivial questions like

What was the probability of heads in the past 14 tosses?

and it is tempting to extrapolate the answer to that straight into the future, but again, that would not be a valid operation according to a frequentist. Frequentists cannot assign probabilities to future events, they can only compare probabilities of past events.

Now, not all is lost. There’s a neat trick that blends faux frequentism with Bayesian reasoning. If our prior probability is uniformly distributed, we can update it with Laplace’s rule of succession, which tells us to imagine there have been two more trials, one of which was successful. Then the ratio corresponds to the Bayesian posterior. In this case, that means imagining there were 11 heads in 16 tosses. The new ratio is \(\frac{11}{16}\) and the posterior probability of two heads would then be \(\left(\frac{11}{16}\right)^2 \approx 0.47\). This is much easier and gives almost the same result as the actual Bayesian estimation, and the computer estimation.1 Why is this a slightly different number than the actual value of 48.6 %? Because we have converted the posterior distribution to a point estimation, and then squared the point estimation. Jensen’s inequality tells us we will arrive at a lower number that way than if we first square the posterior distribution and then make a point estimation.

This is a law of succession most forecasters are familiar with: add one success and one failure to get the Bayesian posterior of coin flips. It’s called Laplace’s law of succession.

Agresti–Coull confidence intervals

Here, have a pair of frequentist goggles again. Now we cannot estimate probabilities of future events. But we can perform a standard hypothesis test to figure out whether we should believe the coin is biased at all! For this we need a 95 % confidence interval around the bias.

The quick way to whip one of those up is to first pretend there were four more trials: two successes and two failures. Then we compute the size of two standard errors the normal way:

\[\hat{p} = \frac{10+2}{14+4} \approx 0.67\]

\[2\hat{se} = 2\sqrt{\frac{\hat{p}\left(1 - \hat{p}\right)}{14+4}} \approx 0.22\]

This gives us a confidence interval for the bias of 45 %–89 %. Since that includes 50 %, our null hypotheses, we cannot reject the null hypothesis and have to contend with the fact that the coin might not be biased at all. This is probably a reason not to take the bet. See, Rajiv, frequentists are not so dumb after all!

This procedure is called an Agresti–Coull confidence interval, and it’s an approximation that’s very close to an actual confidence interval. The drawback is that this simple version only lets us create two-sided 95 % confidence intervals. The paper by Agresti and Coull contains the details to construct other confidence intervals, but it gets a little more arithmetically complicated.

Poisson-based law of succession

Imagine now someone tosses the coin once a week, and in the past year, it has landed on heads four times. What’s the probability it lands on tails next week? We can use what we already know here. There have been 52 trials and 4 heads, so that’s a posterior of \(\frac{5}{54} \approx 0.093\) for heads. The inverse of this is the probability that it does not land on heads in any given week: 90.7 %.

What if we had misheard, and the coin was actually tossed every day? This makes the coin extremely biased. It has been tossed for 365 days and only landed heads four times!2 A calendar year actually has 365.2425 days, but here we use year to mean “the past 365 days”. By the same approach, we have the posterior, \(\frac{5}{367} \approx 0.014\). On the other hand, during the next week, it will have more opportunities to come up heads as well. To get the weekly probability of no heads, we have to take the inverse of the posterior and raise it to the power of 7. We get 90.8 %.

Oh but wait, I meant to say it is flipped every minute, not every day! Assuming there are 525600-something minutes per year, we get a weekly probability of heads of 90.9 %.

The general trend starts to show: when we pick finer subdivisions of time, the probability goes up. We could compute the probability if the coin was flipped every second, every millisecond, etc. But we won’t, because there’s a limit this tends to:

\[\left(1 + \frac{t}{T}\right)^{-(S + 1)}\]

Here, \(t\) is the future timespan we are interested in, \(T\) is the timespan in the past we have observed for, and \(S\) are the number of successes observed in that timespan. Unlike in the previous Laplace case, the units of the timespans do not matter here, because they cancel out. With our coin, taking weeks as the natural unit:

\[\left(1 + \frac{1}{52}\right)^{-(4 + 1)} \approx 0.909\]

This is the probability that a continuously flipped coin shows heads any time during next week, if it has done so four times in the past year.

This is nearly the same probability as when we computed Laplace’s law of succession using minutes as the time increment, and this will be the case more generally. If we pick a time increment that is small compared to the future period we’re interested in, Laplace’s law of succession will give an answer very close to this limit. The benefit of knowing how to compute the limit is that we can use more convenient units of time when doing so.

For more examples of this continuous limit, see the LessWrong article that introduces it.

Key points

We don’t really have to learn the continuous limit. If we’re fine with arithmetic using big numbers, we can use the plain Laplace rule even for continuous processes, as long as we pick a subdivision of time that is small enough.

What needs to be remembered is that with \(s\) successes in \(n\) trials, the posterior probability is \(\frac{s+1}{n+2}\).

But also, if we want a 95 % confidence interval, we compute

\[\hat{p} = \frac{s+2}{n+4}\]

and

\[2\hat{se} = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n+4}}\]

and then the interval is

\[\left(\hat{p}-2\hat{se}, \hat{p}+2\hat{se}\right)\]

It is easy to get these two methods confused, because they both add virtual successes and failures. Try not to!