Why Story Points Don’t Work

Barry Jones has written an excellent article on why story points are pointless. I believe his core arguments can be expressed more succinctly. In my reading of it, there are two of them:

  1. Effort estimation needs to be grounded in reality to be meaningful. Story points are – by design – separate from reality, and thus not meaningful.
  2. Within the accuracy level we can practically achieve, all tasks are the same size, so we can estimate effort by counting tasks rather than time.

I agree with both of these points, but will focus on the first one in this article, because it is a critical business realisation and it does not get enough airtime.

Story points are never correct

One thing we have never heard about effort estimations using story points is

Oh, Mila is great at story point estimation. Even in hindsight, her estimations are always correct.

Nobody ever says that, because what would it even mean for a story point estimation to be correct? Actually, what does it mean for anything to be correct? This is not as obvious as it may seem.

For a long time, correct meant in agreement with the teachings of the church. Sometimes, the church does not have an opinion: for example, if we wonder what a plant will look like after being deprived of sunlight for a week, there is little in the holy scriptures to guide us. In Greek times, we would have philosophers debating what a plant ought to look like after a week without sunlight.

But we know better! We can stuff a damn plant into our closet and forget about it for a week, and then take a good look. This is empirical correctness – the only correctness that really matters when trying to build products that make people’s lives better. We can say a hypothesis is correct when we have favourably compared it against reality.

Story points are never empirically correct, because there is no connection between them and reality. We can never take story points out into the real world and compare them to the actual outcome. (Note that story points can be internally consistent. Also internally consistent is the Lord of the Rings. That does not make it an accurate history of England.)

This is the fundamental flaw of story points (and t-shirt sizes). They cannot be compared to reality. There’s no way of knowing whether Mila is a good estimator. There’s not even a way of knowing whether we are getting better or worse at estimating.

Due to the lack of external reference, it’s also possible for Mila and Sven to disagree on their story point estimations, even though they both internally agree on the complexity of the task. Or worse – they may have very different ideas of the complexity of the task, yet translate that to the same story point estimations.

Story points are an estimation dead-end. What’s better?

First: what are we trying to accomplish?

Before doing anything, we should consider what we are trying to accomplish. There are two purposes of estimating the effort of software tasks:

  1. The output itself, the estimation, which can be used as a prioritisation aid; and
  2. The process, the estimating, where we discuss the imagined scope and various assumptions going into a task.

For the effort estimation task to be worthwhile, its cost has to be lower than its benefit. Aside from the process (which we will get back to), the benefit of an effort estimation is better prioritisation. When we know the cost and value of a task, we can perform a basic return-on-investment calculation to prioritise more wisely.

This is a very low bar to clear: we get a net profit from effort estimation when the cost of doing it is lower than the improvement to prioritisation we get out of it. The cost of one team doing it is at worst, what, $5000 per year? The gains in better prioritisation are worth way more than that. (Again, assuming we know the value of the task. This is where most organisations should spend more time estimating!)
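
To make the return-on-investment arithmetic concrete, here is a minimal sketch in Python. The task names, values, hour figures, and hourly rate are all invented for illustration; nothing here is prescribed by the article.

```python
# Hypothetical tasks: a guess at business value versus a guess at cost in
# programmer-hours. All numbers are invented; the point is only that the
# cost side must be in a real unit for the division to mean anything.
HOURLY_RATE = 100  # assumed fully loaded cost of one programmer-hour, in dollars

tasks = [
    {"name": "wormhole buffering", "value": 12_000, "hours": 15},
    {"name": "dragon breath calibration", "value": 3_000, "hours": 6},
    {"name": "sign-in loop", "value": 20_000, "hours": 80},
]

# Return on investment: value gained per dollar of estimated effort.
for task in tasks:
    task["roi"] = task["value"] / (task["hours"] * HOURLY_RATE)

# Highest return on investment first -- that is the prioritisation aid.
for task in sorted(tasks, key=lambda t: t["roi"], reverse=True):
    print(f"{task['name']}: ROI = {task['roi']:.1f}")
```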

Time estimations can be correct

So the effort estimation should express a guess at the cost of performing the task. The main cost in software development is usually programmer-hours, so that must also be our unit of estimation. Saying that a task is so-and-so many wedding dresses, fruit baskets, oil changes, t-shirt sizes, or story points does not help us with a return-on-investment calculation.

An estimation in hours can be made in ways that are verifiable, i.e. we can say after the fact whether it was correct or not. Here’s what such an estimation sounds like:

Buffering the wormhole streams will take less than 15 hours, at 90 % confidence.

If that task is completed with less than 15 developer-hours spent on it, the estimation was correct. If it took more than 15 hours, the estimation was wrong.

It’s okay to be wrong sometimes – nobody can know ahead of time how much effort something will take – but at 90 % confidence, only two out of every twenty of our estimations should be wrong. This might sound like a fantasy, but most people are capable of achieving that level of calibration if they try.
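
One way to check that calibration after the fact is to keep both the estimations and the actual hours spent, and count the misses. A minimal sketch, with invented numbers standing in for a real task history:

```python
# Each record: (upper bound estimated at 90 % confidence, actual hours spent).
# The numbers are made up; in practice they come from your own task history.
estimations = [
    (15, 11), (8, 9), (40, 28), (6, 5), (20, 31),
    (12, 7), (25, 24), (10, 6), (30, 18), (16, 14),
]

hits = sum(actual <= bound for bound, actual in estimations)

print(f"correct {hits} out of {len(estimations)} times ({hits / len(estimations):.0%})")
# Well calibrated at 90 % confidence means this hovers around 90 % --
# roughly two misses in every twenty estimations.
```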

Time estimations can contain uncertainty

Another reason people use story points is that they allegedly bake uncertainty into the scale. And this is important! Neil Armstrong was adamant about his God-given right to be wishy-washy about where he was going to land on the moon, and we have a God-given right to be wishy-washy about how expensive something is going to be. We can do that with hour-based estimation, by giving an interval.

Here’s what that sounds like:

Buffering the wormhole streams will take 3–25 hours, at 90 % confidence.

This is a task which has a relatively wide distribution. (Note that when we switch from a 90 % point estimation to a 90 % interval estimation, the upper bound moves. This is because the upper bound of a 90 % interval effectively represents a 95 % point estimation: the interval we are estimating is the one that lies between the 5 % and 95 % marks, and 90 % of the probability sits in that range.) Compare that to something we have probably all done before several times, so we are very certain of it:

Calibrating the dragon breath will take 5–7 hours, at 90 % confidence.

As a reminder, that confidence level means we should be wrong only about 10 % of the time, so a 5–7 hour estimation at that level of confidence is really only something we’ll give when we’ve done something many times before. Something like 3–25 hours or even wider is far more realistic.
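
The note above about the upper bound moving when we switch from a point to an interval estimation can be illustrated numerically. If we model our uncertainty about a task with some right-skewed distribution (a lognormal here, chosen purely as an assumption for the sketch, not anything the article prescribes), a 90 % interval runs from the 5th to the 95th percentile, so its upper end sits above a plain 90 % upper bound:

```python
from math import exp
from statistics import NormalDist

# Assumed subjective distribution of the effort: lognormal with a median of
# roughly 9 hours. The shape and parameters are illustrative only.
belief = NormalDist(mu=2.2, sigma=0.65)  # distribution of log(hours)

def hours_at(p):
    """Hours within which we believe the task finishes with probability p."""
    return exp(belief.inv_cdf(p))

print(f"90 % point estimation: under {hours_at(0.90):.0f} hours")
print(f"90 % interval estimation: {hours_at(0.05):.0f} to {hours_at(0.95):.0f} hours")
# The interval's upper end is the 95th percentile, so it is higher than the
# single upper bound of the point estimation.
```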

Avoiding the hard parts of reality

When someone in upper management hears that a task is estimated to take 3–25 hours, they will go

But that is not possible! We cannot work with that level of uncertainty. You have to do better than that.

In this context, “better” is a funny word, because they are seeing it as a negotiation, while we are just trying to convey the uncertainties of reality. Reality will always win any negotiation. This task will take 3–25 hours, regardless of what weird admission management coerces out of its developers that makes it seem less uncertain.

This is known as wishful thinking – the life force of managers and toddlers all around the world. When they get a sudden glimpse of reality, and it hits them how difficult it is to navigate, they shut their eyes and demand that reality changes. (Allen Ward wrote about this as one of the big wastes of knowledge work today. Wishful thinking has so many bad consequences for everyone involved.)

The process of estimation is just as important

And this gets us back to process. The top comment on the Hacker News discussion of the Jones article said

My personal experience with story points is that the number never really mattered, but the process of the team discussing how to rate the complexity of a task was very useful.

This is one of the major benefits of estimation: everyone sits down and is forced to confront the scope of the task, their assumptions, the interconnections with other parts, the skill level of the team, their workload, and so on.

This is amplified when estimations happen in reality-based units, such as hours. We can discuss to no end if a task is a Super Mario or a Luigi or even a Bowser, but people mean different things by a Luigi, so we’re not getting very far. However, everyone means an hour when they say an hour, so reality-based estimation becomes a key that unlocks important disagreements.

In the end, it’s up to each and every one of us: do we prefer to take the role of philosophers debating what kind of cheese the Moon ought to be made out of? Or would we prefer to be like Armstrong, making a wishy-washy – but firmly planted in reality – landing to verify?

Bonus section: it goes beyond effort estimations

This insistence on fuzzy, non-reality-based measurements goes beyond effort estimations. Here are some things I’ve heard in my career.

This prospective customer is almost a guaranteed contract.

I have heard this on two different occasions, and upon deeper probing, the same salesperson meant 90 % the first time and 60 % the second time. “Almost guaranteed” is meaningless when it spans that probability range.

I don’t think most of our backoffice users will be on mobile devices in the next year.

I hear “don’t think” a lot, and when I have tried to elicit probabilities, they have ranged from a 0 % to a 70 % chance. Those endpoints call for very different planning!

This integration is rather complex, and I’d prefer to get started on it as soon as possible.

What does “rather complex” mean? Is it 800 lines of code+tests, or 14000? Both are valid guesses and require different precautions.

This loop should increase retention significantly.

(“Loop” is apparently a vogue word meaning “something that reaches out to the user and annoys them and gets them curious enough to sign in to the system on the spot.”)

When data people have expressed a desire for a “significant increase in retention”, they have meant both two more months on average and twelve more months on average. One is very cool, but the other is a figuratively literal gold mine!


The overarching theme is the same: you can get away with saying anything as long as it’s not verifiable. In contrast, reality-based statements bring out useful disagreements. They surface hidden assumptions. And they give us the honour of being wrong. The magic question we should keep in our heads is

… and exactly how will we know if we were wrong?

If there are no concrete steps we can take where the end result is an objective judgment of whether a statement was correct or not, we have not been sufficiently reality-based. We cannot learn from non-reality-based hypotheses.