Log-Survival to Death Rate
When I stumble over my flashcards from that Fisher book (Statistical Methods for Research Workers; Fisher; Oliver & Boyd; 1925), I keep being surprised by how insightful they all seem. I normally wait a year or two before re-reading a good book, but I’m thinking I might as well go through this one right away. The main takeaway I had previously was how interesting correlations are. Maybe I can learn something else this time.
One complaint I have about this book – and indeed all older math books – is that they present useful ideas in thick paragraphs of text. Here’s how today’s concept gets illustrated:
A useful form, similar to the above, is used to compare the death-rates, throughout life, of different populations. The logarithm of the number of survivors at any age is plotted against the age attained. Since the death-rate is the rate of decrease of the logarithm of the number of survivors, equal gradients on such curves represent equal death-rates. They therefore serve well to show the increase of death-rate with increasing age, and to compare populations with different death-rates. Such diagrams are less sensitive to small fluctuations than would be the corresponding frequency diagrams showing the distribution of the population according to age at death; they are therefore appropriate when such small fluctuations are due principally to errors of random sampling, which in the more sensitive type of diagram might obscure the larger features of the comparison.
I have realised that when I come across paragraphs like these, I need to sit down with some data and play with the ideas to truly understand them. Today, you’ll get the opportunity to join me.
Pull request merges as deaths
When Fisher talks about deaths, we can substitute any other event for the death. In our case, we’ll look at how long pull requests spend open before they are merged.
I have scraped the lifetimes of 80 recent pull requests, 40 from each of two relatively popular open source projects, which will serve as our two different populations. Here’s how long each project has had its pull requests open until they were merged.
We see that most pull requests are merged within a few hours, and then there are some difficult pull requests that stay open for a long time. This looks like one of those heavy-tailed distributions, which means the outliers are probably the most interesting data points to study.
However, we’re not looking to understand pull requests better. We’re looking for an excuse to apply a technique Fisher recommended, so we’re going to do something very, very naughty. We’ll lop off all the difficult parts of the data and focus in on the common case. Never do this unless the data is ultimately irrelevant, which it is in this case.
Here’s the data zoomed in to the range 20–100 minutes, which is where most of the action is going down.
This makes it clear that in the common case of project X, it takes maybe an hour until a pull request is merged. In project Y, it rather reliably takes 20–40 minutes. We can’t say much more than that, though. So let’s do what Fisher suggested, and … what was it?
The logarithm of the number of survivors at any age is plotted against the age attained.
Oh, yeah, that!
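For anyone who wants to play along, here is a minimal sketch of how the points for such a plot could be computed. It assumes the lifetimes sit in a plain list of minutes. The function name `log_survival` and the sample numbers are made up for illustration and are not the scraped data.

```python
import math

def log_survival(lifetimes):
    """Return (age, log of number still open) points for a list of lifetimes."""
    ordered = sorted(lifetimes)
    n = len(ordered)
    # At age zero every pull request is still open.
    points = [(0.0, math.log(n))]
    for i, age in enumerate(ordered):
        survivors = n - (i + 1)
        if survivors > 0:  # log(0) is undefined, so the last merge is left out
            points.append((age, math.log(survivors)))
    return points

# Made-up lifetimes in minutes, NOT the data used in this article.
example_lifetimes = [12, 25, 31, 38, 44, 59, 73, 90]
curve = log_survival(example_lifetimes)

for age, log_open in curve:
    print(f"{age:6.1f} min  log(open) = {log_open:.2f}")
```

Plotting the second column against the first gives the kind of curve Fisher describes. The natural logarithm is used here, which is what makes the slope readable as a rate.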
Since these lines look fairly straight, we might be very tempted to take their slopes. Eyeballing, we have maybe
- Project X: drop of 0.65 over 75 minutes, slope of 0.9 % per minute
- Project Y: drop of 0.65 over 17 minutes, slope of 3.8 % per minute
These percentages are, it turns out, the average chance of a pull request being merged in any given minute during the time surveyed. Continuing to assume the common case, every minute that passes comes with a 1 % chance that our pull request in project X gets merged, and a 4 % chance that our pull request in project Y gets merged.
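The eyeballed slopes are just a drop in log-survivors divided by the elapsed minutes. As a small sketch of that arithmetic (the helper name `merge_rate_per_minute` is my own, and the inputs are the eyeballed figures from the list above):

```python
def merge_rate_per_minute(log_drop, minutes):
    """Slope of the log-survival line: drop in log(open count) per minute."""
    return log_drop / minutes

rate_x = merge_rate_per_minute(0.65, 75)  # project X
rate_y = merge_rate_per_minute(0.65, 17)  # project Y

print(f"Project X: {rate_x:.1%} per minute")  # 0.9%
print(f"Project Y: {rate_y:.1%} per minute")  # 3.8%
```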
The reason this works is through two mathematical shortcuts. We’ll use project Y as the example. At 22 minutes of age, 37 of the 40 surveyed pull requests are still open. At 39 minutes of age, 18 of the 40 surveyed pull requests are open. This means the fraction of pull requests still open at the end of that span of 17.5 minutes is 18/37 = 0.49. If we want to find out the fraction remaining per minute from this, we have to solve the tricky equation
\[\left(\frac{18}{37}\right)^{\frac{1}{17.5}}\]
where the 17.5th root reveals the per-minute fraction remaining that would have us attain 49 % remaining at minute 17.5. This equation is trivial to solve on a computer. We just type in the numbers and off it goes, giving us the answer 0.96, which is a 4 % merge rate per minute. This corresponds very precisely to the slope we eyeballed in the plot.
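Typing in the numbers might look like the sketch below. The counts and the span are the ones quoted for project Y. The variable names are mine, and the last lines compare the result against the slope of the log-survival line computed from the same two points.

```python
import math

open_at_start = 37   # pull requests still open at the first age
open_at_end = 18     # pull requests still open at the second age
span_minutes = 17.5

# Fraction still open after the whole span, then the per-minute factor.
remaining = open_at_end / open_at_start
per_minute_survival = remaining ** (1 / span_minutes)
exact_rate = 1 - per_minute_survival

# Slope of the log-survival line between the same two points.
log_slope = (math.log(open_at_end) - math.log(open_at_start)) / span_minutes

print(f"per-minute survival: {per_minute_survival:.3f}")
print(f"exact merge rate:    {exact_rate:.2%}")
print(f"minus the log slope: {-log_slope:.2%}")
```

The two rates differ by less than a tenth of a percentage point, which is why reading the slope off the plot works so well here.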
The reason it corresponds so nicely is the second shortcut: we are using the approximation rule that the logarithm of 1+x is nearly the same as x when x is small.
Why go through these hoops, when we can trivially solve the complicated equation on our computer? We have to remind ourselves that Fisher didn’t have a computer. In his time, the easiest way to solve that gnarly equation was to go through the logarithm. Taking the logarithm of the entire thing simplifies to
\[\left(\log{18} - \log{37}\right) \times \frac{1}{17.5}\]
and this is the thing we did when we took the slope of the plot!
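To spell out how the two shortcuts fit together, here is the chain written out once. The symbols are my own shorthand: p is the per-minute fraction of pull requests that stay open, r = 1 − p is the per-minute merge chance, and the logarithm is the natural one.

\[\log p = \log\left(\frac{18}{37}\right)^{\frac{1}{17.5}} = \left(\log{18} - \log{37}\right) \times \frac{1}{17.5} \approx -0.041\]

\[\log p = \log\left(1 - r\right) \approx -r \quad\Rightarrow\quad r \approx 0.041\]

The first line is the slope of the plot. The second line is the small-x approximation, and it is what lets us read that slope directly as a merge chance of about 4 % per minute.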
Shapes of curves
This also helps us intuit what shapes in the log-survival plot mean:
- A straight line means a constant event rate. Age does not affect the chance of an event occurring.
- A line that bends downwards indicates age-dependent mortality, that the likelihood of an event increases with time.
- A line that bends upwards indicates mortality leveling off, that the likelihood of an event decreases with time.
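These three shapes can be checked numerically. The sketch below uses the Weibull distribution purely as a convenient stand-in, an assumption of mine rather than anything from Fisher or from the pull request data: its log-survival is −(t/scale)^shape, with a constant event rate at shape 1, an increasing rate above 1 and a decreasing rate below 1.

```python
def log_survival_weibull(t, shape, scale=1.0):
    """Log of the Weibull survival function: -(t / scale) ** shape."""
    return -((t / scale) ** shape)

def bend(shape):
    """Classify the log-survival curve by the sign of its second difference."""
    a, b, c = (log_survival_weibull(t, shape) for t in (1.0, 2.0, 3.0))
    second_difference = a - 2 * b + c
    if abs(second_difference) < 1e-9:
        return "straight"
    return "bends downwards" if second_difference < 0 else "bends upwards"

print("constant rate:  ", bend(1.0))   # straight
print("increasing rate:", bend(2.0))   # bends downwards
print("decreasing rate:", bend(0.5))   # bends upwards
```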
With this background, let’s look at these curves for the full data set.
Both projects have a very steep curve in the first few hours. After that, project Y levels off and appears to have a fairly constant merge rate of 4 % per hour. Project X, on the other hand, has a more complicated curve. The first maybe 30 hours seem to follow the general shape of project Y, but then the curve flattens out significantly, with a slope of maybe 0.6 % per hour.
In other words, project X exhibits merge rate leveling off, while project Y exhibits constant merge rate. If our pull request in project X takes a long time to get merged, we should be worried that it will take even longer! If our pull request in project Y takes a long time to get merged, that doesn’t tell us anything about how worried we should be.
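To put rough numbers on that worry: under a constant event rate, the average remaining wait is one divided by the rate, no matter how long we have already waited. The sketch below applies this to the two eyeballed hourly rates. Treating the late rate in project X as staying at 0.6 % per hour is an assumption for illustration, not something the data shows, and the function name is mine.

```python
def expected_remaining_hours(rate_per_hour):
    """Mean remaining wait under a constant event rate (memoryless case)."""
    return 1 / rate_per_hour

steady = expected_remaining_hours(0.04)   # about 4 % per hour
late_x = expected_remaining_hours(0.006)  # project X after it flattens out

print(f"at 4 % per hour:   about {steady:.0f} hours")
print(f"at 0.6 % per hour: about {late_x:.0f} hours")
```

So a pull request in project X that has survived into the flat part of the curve can expect a wait several times longer than one still in the steeper part.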