Precel: Like Excel but for Uncertain Values
Task effort estimation: Imagine you have two programming tasks, and you have estimated the first will take somewhere between 10 and 40 days to complete (at 90 % confidence), while the other will take between 5 and 25 days. How long will it take to do both tasks in sequence? For this and the questions below, assume that your estimations are calibrated and that the uncertain values are uncorrelated.
If you answered 15–65 days, you made the same mistake almost everyone makes. This mistake is even encoded as a mandatory step in some estimation methodologies – I’m looking at you, PERT. The actual answer is 20–55 days.
Driving time estimation: Next, imagine that you are going to move for work, but you don’t know exactly where yet. However, you know it will be 10–30 km from the workplace, and that the roads in between will let you drive at 50–90 km/h at rush hour. How long will it take you to get to work?
If you answered 7–36 minutes, you made the same mistake again. If you answered 12–20 minutes, you made another mistake: you divided the low distance by the low speed and the high distance by the high speed, but the fastest trip is the short distance at the high speed. Think about what the numbers represent! The actual answer is 9–28 minutes.
If you still don’t see what’s happened, let’s try another variant:
Grocery cost estimation: You estimate your future weekly grocery purchases will total $80–$150. How much should you estimate your yearly grocery purchases to total? It’s not $4k–$8k – it is a narrower interval: $5.5k–$6k.
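A quick simulation checks these claims. Here is a minimal sketch – the lognormal shape, the helper name `interval`, and the sample count are my assumptions; the puzzles above don’t specify a distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
Z90 = 1.6449  # a normal distribution keeps 90 % of its mass within ±1.6449 sd

def interval(lo, hi, size):
    """Samples from a lognormal whose 5th/95th percentiles are lo and hi."""
    mu = (np.log(lo) + np.log(hi)) / 2
    sigma = (np.log(hi) - np.log(lo)) / (2 * Z90)
    return rng.lognormal(mu, sigma, size)

n = 100_000

# Driving time: 10-30 km at 50-90 km/h, converted to minutes.
minutes = interval(10, 30, n) / interval(50, 90, n) * 60
print(np.percentile(minutes, [5, 95]))  # roughly [8, 29] minutes

# Groceries: a year is the sum of 52 independently drawn weeks.
yearly = interval(80, 150, (n, 52)).sum(axis=1)
print(np.percentile(yearly, [5, 95]))   # roughly [5500, 6050] dollars
```

The endpoints land close to the answers above – the small differences come from the assumed shape. The two-task answer gets the same treatment in the next section.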
In all cases, the naïve combination overestimates the width of the actual interval. Why?
Uncorrelated values cancel out a little
The reason that the true interval is smaller than we instinctively think is that uncertain values, when uncorrelated, cancel each other out a little.
For the two tasks in the first example, we can think of each task as ending up in one of three scenarios: very fast, very slow, or average. Then, very crudely speaking, these are the possible outcomes for the time it takes to complete both tasks:
| Task B \ Task A | 10 days | 20 days | 40 days |
|---|---|---|---|
| 5 days | 15 days | 25 days | 45 days |
| 15 days | 25 days | 35 days | 55 days |
| 25 days | 35 days | 45 days | 65 days |
To get a completion time of 15 days for the two tasks together, we would need to complete both tasks very quickly. Similarly, a completion time of 65 days can only happen if we complete both tasks very slowly. The most likely outcome is somewhere in the middle, which according to this table is 25–55 days. A more computationally intense analysis – considering more outcomes, weighting outcomes close to the average more heavily, and taking the central 90 % of the result – gives the answer 20–55 days.
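A Monte Carlo version of that analysis is short to write. A sketch, again assuming the 90 % intervals describe lognormal distributions (my assumption, not something stated above):

```python
import numpy as np

rng = np.random.default_rng(2)

def interval(lo, hi, size):
    """Samples from a lognormal whose 5th/95th percentiles are lo and hi."""
    mu = (np.log(lo) + np.log(hi)) / 2
    sigma = (np.log(hi) - np.log(lo)) / (2 * 1.6449)
    return rng.lognormal(mu, sigma, size)

n = 100_000
both = interval(10, 40, n) + interval(5, 25, n)  # draw-wise sum of the two tasks
print(np.percentile(both, [5, 95]))  # roughly [19, 56] days, close to 20-55
```

The exact endpoints wiggle a little with the assumed shape; the overall story – an interval narrower than 15–65 – stays the same.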
This is the law of large numbers at work: as we add more uncorrelated numbers, the standard deviation of the sum grows more slowly than the sum itself (with the square root of the number of terms, rather than linearly), so the spread of the sum relative to its size narrows. Eventually that relative spread converges to zero.
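In numbers: for uncorrelated values, variances add, so standard deviations combine as the square root of a sum of squares rather than by plain addition. A tiny demonstration – the two normal distributions here are made-up stand-ins, not the task estimates above:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(25, 9, 1_000_000)  # stand-in value with mean 25, sd 9
b = rng.normal(15, 6, 1_000_000)  # stand-in value with mean 15, sd 6

print(a.std() + b.std())  # 15.0 -- the spread naive interval-stacking implies
print((a + b).std())      # 10.8 -- the actual spread: sqrt(9**2 + 6**2)
```

The mean of the sum grows by plain addition (25 + 15 = 40) while the spread grows only in quadrature, which is exactly why the combined interval is narrower than the stacked one.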
Convergence helps us estimate more precisely
This effect is useful because it allows us to make more precise estimations by breaking down the uncertain value to be estimated. If I had to guess the distance between Stockholm and Gothenburg, I would give the interval of 300–800 km, which has a factor 2.7 between high and low.
However, I happen to know the time it takes to go between the cities by train is just over three hours. I don’t know the speed of the train, but I’m reasonably sure it’s in the range of 130–170 km/h. If I multiply (3.0–3.2) hours by (130–170) km/h, I get a new estimation for the distance: 400–530 km. This has just a factor 1.3 between high and low, meaning the precision of the estimation has been doubled!
Decomposing estimations into components that are easier to estimate is a very powerful tool, but it does require that one is able to re-combine multiple estimations into one later on.
We need computer assistance
Unfortunately, it is difficult to combine distributions with mental maths alone. The easiest way I know of is simulation: draw 5000 values from each distribution, combine them pairwise, and check where the resulting distribution ends up. This is very easy to write code for, but also very annoying to write fresh code for every time it’s needed.
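For instance, redoing the train-distance estimation from the previous section as such a throwaway script might look like this – a sketch, with the lognormal shape again being my assumption:

```python
import numpy as np

rng = np.random.default_rng()

def interval(lo, hi, n=5000):
    """n draws from a lognormal with 5th/95th percentiles at lo and hi."""
    mu = (np.log(lo) + np.log(hi)) / 2
    sigma = (np.log(hi) - np.log(lo)) / (2 * 1.6449)
    return rng.lognormal(mu, sigma, n)

distance = interval(3.0, 3.2) * interval(130, 170)  # hours times km/h
print(np.percentile(distance, [5, 95]))  # roughly [400, 530] km
```

Easy enough to write once – and just as tedious to rewrite for every new combination.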
So I started writing a tool for it, called Precel. For the distance estimation we looked at in the last example, run the command
```
$ ./precel '130 170 interval 3 3.2 interval multiply'
130 170 interval 3 3.2 interval multiply =
401.54 [461.41] 528.55
```
The syntax looks funky and is a little clunky, but that’s because I have invested minimal effort into it so far. It’s postfix, so `130 170 interval` means “a 90 % interval estimation of 130–170,” and `a b multiply` means elementwise a × b.
Check the readme for further instructions.
The tool is in very rough shape and I don’t plan on actively working on it. At the moment it satisfies my most immediate demands, and as I need more advanced features I will add them. (There are some things I suspect I will need in the future: reading distributions from files, writing out correlations and intermediate steps – useful for interactive work and diagnostics – and interoperation with itself.)
Excel for uncertain values
One of my dreams is to eventually write a viable competitor to Excel which, instead of being limited to single numbers in cells, can handle probability distributions. There are some plugins for Excel that do this (you can even emulate it by encoding generated random values as a comma-separated string, splitting and joining it, and using array operations to calculate with it – I did this once, and the result was the world’s slowest spreadsheet: it took minutes to propagate changes!), and some external tools, but they are either proprietary, or they suck, or both. I think the world deserves a good, open source implementation. As humanity moves forward, we will all need to be more probabilistically literate, and that’s hard when the tools that run the world are created in complete ignorance of probabilistic principles.
In the meantime, I have Precel. With the right features and performance optimisations, it may even become the engine that powers a probabilistic spreadsheet. Who knows?