Entropic Thoughts

Lines of code are useful

The internet is full of people dismissing lines of code as a measurement. People say things like

Lines of code written has been firmly established over the decades as a largely meaningless metric.

and

Metrics and kpis are being based on stupid measurements like lines of code.

and

Lines of code is a dumb metric and anyone touting them for anything meaningful is disconnected from reality.

and

We’ve apparently collectively forgotten that lines of code is one of the worst metrics for measuring productivity.

and

I think obsession with lines of code is one of the most counterproductive practices.

and

Lines of code is renowned for being a very bad measure of anything

I find these statements strange, because they are not true.

Measuring complexity

Lines of code measure code complexity. That is well established. You don’t have to take my word for it.

  • Basili and Hutchens (1981) consider the Halstead volume, cyclomatic complexity, and a bunch of other measurements to be part of the same category of logical program size measurements. They also look at ways to measure control organisation, including basing it on flow graphs and nesting. They analyse 19 programs, and find line count correlates strongly with their own volume definition (+0.98) and cyclomatic complexity (+0.88). They find that lines of code predicts the number of changes required to get a working program better than more complicated complexity measures.
  • Revilla and van der Meulen (2007) analyse over 70,000 small C programs performing 59 different tasks, and determine that lines of code correlate very strongly with both Halstead volume (+0.82) and cyclomatic complexity (+0.78). They find that lines of code predicts anything just as well as more complicated complexity measures.
  • Herraiz and Hassan (2010) analyse over 200,000 C source code files from a real open source project (Arch Linux) and find that lines of code correlates strongly with cyclomatic complexity (+0.72) and various Halstead metrics (+0.91).
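The kind of correlation these studies report is straightforward to compute. Here is a minimal sketch: given per-file measurements of line count and cyclomatic complexity, calculate Pearson's r. The data below is invented purely for illustration; the studies above used thousands of real programs.

```python
# Pearson correlation between line count and cyclomatic complexity,
# the statistic reported in the studies above. Data is hypothetical.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-file measurements: (lines of code, cyclomatic complexity)
files = [(120, 14), (80, 9), (300, 35), (45, 4), (200, 22), (150, 18)]
loc = [f[0] for f in files]
cyclomatic = [f[1] for f in files]

print(pearson(loc, cyclomatic))  # strongly positive, close to +1
```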

A pattern emerges: every time we have actually gone to the effort of testing a code complexity metric, it has produced predictions no better (and sometimes worse) than a simple count of the lines of code. This does not mean better complexity metrics cannot exist; it just means we should be skeptical when someone suggests a new complexity metric, because it is highly likely to be a lines-of-code measurement in a trenchcoat, like the complexity metrics that came before it.

Basili and Hutchens phrase this particularly well:

Since the line count is very easy to calculate and many researchers have found that it does a credible job of measuring the complexity, it must be considered as the metric to beat in studies of this kind. We have failed to find a metric that is significantly better than line count.

Line count is not a meaningless measurement. It is the best way we know of to measure code complexity.

Complexity matters

Complexity, in turn, is what determines how expensive software is to build and maintain, and, to some degree, how useful it is. All else equal, more complex software (a) costs more, and (b) can perform more useful tasks.

It is important here to distinguish between what Fred Brooks called essential and accidental complexity.1 No Silver Bullet: Essence and Accident in Software Engineering; Brooks; Proceedings of the ifip Tenth World Computing Conference; 1986.

Essential complexity

We can think of essential complexity as complexity that exists due to the problem to be solved. Landing a rover on Mars is a complex problem, and the software that solves a problem cannot be any simpler than the problem it is solving. As Brooks puts it,

Much of the complexity the software engineer must master is arbitrary complexity, forced […] by the many […] systems to which his interfaces must conform. […] This cannot be simplified out by any redesign of the software alone.

Brooks also mentions a few other forms of essential complexity we have to deal with in software, that have more to do with the invisible and fractal fundamental nature of software. These are not relevant when comparing the complexity of two different programs, because they apply equally to all software.

Accidental complexity

On the other hand, accidental complexity does not come from the problem to be solved. Instead, it is complexity introduced when translating that problem into software. Brooks never defines accidental complexity, but discusses how advances in tooling have removed some of it. For example, he says,

Abstract data types and hierarchical types each remove one more accidental difficulty from the process, allowing the designer to express the essence of his design without […] large amounts of syntactic material that add no new information content. The result is to allow a higher-order expression of design.

Accidental complexity, then, is complexity we put into the code which didn’t have to be there. Some accidental complexity is forced into the code because the programming language we use does not have the abstractions we need, and some we put in because we’re not all rockstar 10× developer ninjas.

Complexity means cost

This distinction is critical because

  • essential complexity is what increases the value of software, whereas
  • both essential and accidental complexity increase the cost of software.

For better or worse, lines of code only measures total complexity; it cannot distinguish between essential and accidental complexity. Thus, lines of code corresponds closely to the cost of software. This means that if e.g. Blender has more lines of code than nginx (which it does), we expect Blender to have been more expensive to develop (it was) – but also that the ongoing cost of its maintenance is higher (it is). See appendix A for more data on this.

This relationship between lines of code and complexity holds both for the total size of a software project (the larger project is more expensive) and for changes to software size. A project that grows by so-and-so many lines of code per day also has its ongoing maintenance cost grow by so-and-so many minutes per day. For small changes, the effect is small. Over time, it adds up. Lines of code is how we measure this growth in maintenance cost.

If a team spends a quarter of its time on maintenance, and during the next year new development grows the project to be twice as many lines of code, then the pace of new development must decrease by a third, to account for the additional maintenance.
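The arithmetic in that example is worth spelling out. With a quarter of the team's time going to maintenance today, and maintenance cost scaling linearly with line count:

```python
# The arithmetic from the paragraph above: a team spends 1/4 of its time
# on maintenance, and the codebase doubles in size over the year.
# Assuming maintenance scales linearly with line count, maintenance now
# takes 2 * 1/4 = 1/2 of the team's time.
maintenance = 0.25          # fraction of time spent on maintenance today
growth = 2.0                # the codebase doubles

new_dev_before = 1 - maintenance            # 0.75
new_dev_after = 1 - maintenance * growth    # 0.50

decrease = (new_dev_before - new_dev_after) / new_dev_before
print(decrease)  # 1/3: the pace of new development drops by a third
```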

Measuring added value

At this point, we may be tempted to admit that Dijkstra was right all along when he wrote2 ewd 1036: On the cruelty of really teaching computing science; Dijkstra; 1988. Available online.

My point today is that, if we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent”: the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.

But it’s also not that simple. Lines written are lines spent, but some of them are also lines produced. Some of the additional complexity is accidental, and some of it is essential. The latter does indeed increase the value of the software.

If the ratio of essential complexity to accidental complexity is relatively constant in a project (which it seems reasonable to believe, at least in aggregate across medium timespans), then the line count is also strongly correlated with essential complexity, i.e. with the amount of value the software can provide.

This is true also for individual productivity. If the ratio of essential complexity to accidental complexity is relatively constant for someone’s contributions from one month to the next, then the number of lines of code they have added to a project is a proxy for how much additional value they have made the software provide.

For one-off analyses, or for individuals who want to know how their impact varies over time, lines of code is a great productivity metric. It’s just not useful for driving projects or managing people, thanks to Goodhart’s law: the moment lines produced become incentivised, people start producing accidental complexity and the ratio stops being stable.

Guidance

The above may make it sound like lines of code is a useless metric after all, but it’s important we separate the two ways we can use it as a metric:

  • As a cost metric, lines of code is mostly fine, even considering Goodhart’s law. This is because lines of code is an almost perfect proxy for total – essential plus accidental – complexity.
  • As a productivity metric, lines of code is fine only if we can ensure the relationship between essential and accidental complexity stays stable during the full period of measurement. It tends not to, because it’s easy to increase lines of code by increasing only accidental complexity.

There, slightly more nuance than the quotes that opened this article.

Appendix A: The cost of open source maintenance

We used Blender and nginx as examples in the article. Blender has 2.8 million lines of code, and nginx has 250 thousand. It seems reasonable to guess that Blender costs more to maintain than nginx, and this is also true, at least by one measure. As a proxy for maintenance cost, we’ll use the number of maintainers, defined as the number of authors who have made more than five commits in the past six months. Here’s the data for a smattering of projects popular on GitHub.
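One way to get this maintainer count is from git's own commit summary. The sketch below parses the output of `git shortlog -sn --since="6 months ago"` and counts authors above the threshold; the sample output and author names are made up for illustration.

```python
# Count "maintainers" as defined above: authors with more than five
# commits in the past six months. In a real repository you would feed
# this the output of:
#   git shortlog -sn --since="6 months ago"
# The sample below is fabricated for illustration.
sample_shortlog = """\
   142  Alice Example
    37  Bob Example
     6  Carol Example
     5  Dan Example
     2  Erin Example
"""

def count_maintainers(shortlog: str, threshold: int = 5) -> int:
    count = 0
    for line in shortlog.splitlines():
        parts = line.split(None, 1)  # commit count, then author name
        if parts and parts[0].isdigit() and int(parts[0]) > threshold:
            count += 1
    return count

print(count_maintainers(sample_shortlog))  # 3: Alice, Bob, and Carol
```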

Project       Line count   Maintainers
Rust           3,800,000           229
Blender        2,800,000            71
Kubernetes     2,300,000            89
VSCode         2,000,000            81
Node.js        1,300,000            39
PowerToys        760,000            15
React            550,000            10
NeoVim           410,000            25
Transmission     300,000             8
nginx            250,000             4
yt-dlp           240,000             6
Redis            230,000            12
Audacity         180,000             9
Excalidraw       160,000             4
tmux             100,000             3
Fish              90,000            10
htop              49,000             4
i3                29,000             3
scc               24,000             3

As a very rough guide, this data hints that each 25 thousand lines of open source requires one more maintainer.3 If we look closer, we’ll see the maintainer count seems to be a quadratic function of code size, meaning the cost of a line of code becomes greater when the total code size is greater. That makes sense – interactions between components scale quadratically with the number of components because in n things, there are roughly n² pairs of things. Let’s say these maintainers spend on average an hour a day on maintenance.4 Some of these projects are spare time efforts, others have corporate sponsors. The hour-a-day figure comes out of my arse. If you think it’s wrong, you can repeat these calculations with a number you agree with more. That means

  • Each 200,000 lines of code needs a full workday of maintenance per day. That lets us estimate a limit for how large a project a single person can maintain, while not having time for any new development.
  • If the fully-loaded hourly cost of a developer is $50 (this may be on the cheap end depending on where you live) then every 100,000 lines of code costs $200 per day in maintenance alone.
  • If a team of eight people wants to spend at most 1/5 of its time maintaining a project, then their project can only expand to 320,000 lines of code before they need to hire one more person.
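The bullet-point arithmetic above can be written out explicitly. The assumptions (one maintainer per 25,000 lines, one hour of maintenance per maintainer per day, $50 per hour fully loaded) all come from the text; as noted, they are rough.

```python
# Back-of-envelope maintenance arithmetic, using the article's
# assumptions: one maintainer per 25,000 lines, each spending one
# hour per day on maintenance, at $50/hour fully loaded.
LINES_PER_MAINTAINER = 25_000
HOURS_PER_MAINTAINER_PER_DAY = 1
HOURLY_COST = 50          # dollars
WORKDAY_HOURS = 8

def maintenance_hours_per_day(lines: int) -> float:
    return lines / LINES_PER_MAINTAINER * HOURS_PER_MAINTAINER_PER_DAY

# 200,000 lines -> a full workday of maintenance per day
print(maintenance_hours_per_day(200_000))                  # 8.0

# 100,000 lines -> $200/day in maintenance alone
print(maintenance_hours_per_day(100_000) * HOURLY_COST)    # 200.0

# A team of eight spending at most 1/5 of its time on maintenance can
# carry 8 people * 8 hours * 1/5 = 12.8 hours/day of maintenance,
# i.e. 12.8 * 25,000 = 320,000 lines.
budget_hours = 8 * WORKDAY_HOURS / 5
print(budget_hours * LINES_PER_MAINTAINER)                 # 320,000 lines
```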

Small sample. Probably not universally true. But still interesting to think about, and certainly close enough to my professional experience.

Appendix B: do we need to make functions smaller?

Basili and Hutchens made another interesting observation, although they admit their sample is too small to generalise from. They measured the number of changes necessary to get a program to successfully fulfill a specification, as a proxy for how difficult the program was to write. They also measured the complexity of individual components of the resulting program. They compared the number of changes to high percentiles of the complexity of individual components, and tested a linear fit against an exponential fit.
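The comparison they ran can be sketched in a few lines: fit both a straight line and an exponential curve to (component complexity, changes needed) pairs, then compare residuals. The data below is invented for illustration; Basili and Hutchens used measurements from real programs.

```python
# Sketch of the fit comparison: fit y = a + b*x and y = A*exp(B*x) to
# (component complexity, changes needed) pairs, then compare the sum of
# squared residuals. Data is hypothetical.
from math import exp, log

def linear_fit(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def sse(ys, predicted):
    return sum((y - p) ** 2 for y, p in zip(ys, predicted))

# Hypothetical: (high-percentile component complexity, changes needed)
complexity = [5, 10, 15, 20, 25, 30]
changes = [11, 19, 32, 41, 48, 62]

a, b = linear_fit(complexity, changes)
linear_sse = sse(changes, [a + b * x for x in complexity])

# Exponential fit y = A*exp(B*x), via a linear fit on log(y).
la, lb = linear_fit(complexity, [log(y) for y in changes])
exp_sse = sse(changes, [exp(la + lb * x) for x in complexity])

print(linear_sse < exp_sse)  # with this data, the linear fit wins
```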

Going by classic advice from the likes of Uncle Bob, who suggests functions must be short lest maintainability suffer5 Clean Code: A Handbook of Agile Software Craftsmanship; Martin; Pearson; 2008., we would expect the exponential fit to be best. That corresponds to the hypothesis that individual complex components cause an outsized maintenance demand.

But Basili and Hutchens found the linear fit to be better! In their data, one complex component doesn’t cause more maintenance demand than five components, each a fifth of the size. This agrees more with the advice of the likes of John Carmack, who suggests that inlining subprograms can result in greater clarity.

Again, their sample is too small to generalise from. I don’t know if anyone has replicated the result with more varied data. But it is thought-provoking.