Arithmetic Models: Better Than You Think
LessWrong user dynomight explains how arithmetic is an underrated world-modeling technology and uses dimensional algebra as the motivating case. I agree dimensional algebra is fantastic, but there’s an even better motivating example for arithmetic in world-modeling: linear models for prediction.
Simple linear models outperform experts
In 1954, Paul Meehl published what he later came to call his “disturbing little book”. This book1 Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence; Meehl; University of Minnesota Press; 1954. contains the most important and well-replicated research I know of; yet most people don’t know about it. The basic argument is that many real-world phenomena – even fickle ones – can be adequately modeled with addition and multiplication.
Tempered by the lack of evidence at the time, the book doesn’t go quite as far as it could have. Here are some statements that have later turned out to be true, given in order of increasing outrageousness.
- Simple linear models outperform experts.
- Simple linear models outperform experts when the experts get access to additional information that the model does not get.
- Simple linear models outperform experts when the experts also get to know and use the outcome of the linear model.
- Simple linear models trained only on experts’ judgments and not the actual outcome outperform the very experts they were trained on.
- Simple linear models with random weights outperform experts.
- Simple linear models with equal weights (i.e. a tallying heuristic) outperform experts.
- Simple linear models with equal weights when limited to three predictors still outperform experts.
Obviously, these are phrased to provoke, and take additional nuance to be fully understood, but the general theme remains the same: addition and multiplication take you surprisingly far.
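To make the tallying heuristic from the last bullets concrete, here is a minimal sketch in Python. The predictor names, sign choices, and numbers are all hypothetical, invented purely for illustration:

```python
# A tallying heuristic: take standardized predictors, flip the sign of any
# that oppose the outcome, and add them with equal weights. No fitted
# parameters at all.

def tally(case, directions):
    """Score a case by summing predictors, each multiplied by +1 or -1.

    `case` maps predictor names to standardized values (z-scores);
    `directions` maps the same names to +1 (supports outcome) or -1.
    """
    return sum(directions[name] * value for name, value in case.items())

# Hypothetical example: predicting college success from three predictors.
directions = {"hs_rank": +1, "aptitude": +1, "absences": -1}
candidate = {"hs_rank": 1.2, "aptitude": 0.4, "absences": -0.5}

score = tally(candidate, directions)
print(round(score, 2))  # prints 2.1  (1.2 + 0.4 + 0.5)
```

Candidates are then ranked by their tally; nothing about the model requires historic outcome data, only the choice of predictors and their directions.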
Predictive modeling is more important than explanatory modeling
There are two reasons to make models: predictive and explanatory.2 Meehl calls these discriminative and structural, but I think predictive and explanatory is the more modern terminology.
- Predictive modeling
- Figure out how things are correlated: when X has been observed, is Y generally more or less likely to happen?
- Explanatory modeling
- Figure out how things are causally linked. When X happens, does that trigger Y to also happen?
Most people reason about the world through explanatory modeling, and are uncomfortable with predictive modeling. I would argue that predictive modeling is the better approach.
Predictive modeling has the funny property that it is acausal and atemporal: we can predict the risk of someone being in an accident when they are driving drunk, but we can also predict the probability that someone was drunk when we have observed an accident. I think this is what makes people step away from predictive models. Once we are using consequences to predict antecedents, we are flaunting our ignorance of all the nice logic the Greeks came up with – even though Bayes and Laplace taught us long ago that this is fine.3 Even Fisher was suspicious of the “method of inverse probability”, and tried – but failed – to replace it with something better.
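Running the drunk-driving prediction backwards takes one application of Bayes’ rule. Every probability below is made up for illustration:

```python
# Bayes' rule lets prediction run "backwards": from a consequence (an
# accident) to an antecedent (drunk driving). All numbers are invented.

p_drunk = 0.02                  # hypothetical base rate of drunk driving
p_accident_given_drunk = 0.30   # accident risk while drunk
p_accident_given_sober = 0.01   # accident risk while sober

# Total probability of an accident, over both groups.
p_accident = (p_accident_given_drunk * p_drunk
              + p_accident_given_sober * (1 - p_drunk))

# Invert the conditional: P(drunk | accident).
p_drunk_given_accident = p_accident_given_drunk * p_drunk / p_accident

print(round(p_drunk_given_accident, 3))  # prints 0.38
```

Nothing in the arithmetic cares which event caused which, or which came first; it is all relative frequencies of co-occurrence.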
Predictive modeling doesn’t really care what was the cause, or what came first. It is all about figuring out which observations tend to come together.
In fact, I’d go further and argue that explanatory modeling is just a mistaken approach to predictive modeling. Why do we want to understand how things work? To make better decisions. But we don’t really need to understand how things work to make better decisions; we just need to know how things will react to what we do. Knowing how they work can be an aid in that, but when it is, predictive modeling will also pick up on the causal factors.
What makes predictive modeling a better idea is that it also allows us to find factors that are not causal, but still useful.4 A typical case would be consequences of a common antecedent: high intelligence results in good grades, but is also associated with a good grip on language. This means you can predict someone’s elementary school test scores based on the first draft of an essay they wrote in a completely different situation. I know because that was a game I used to play in elementary school.
In defense of explanatory modeling
Something Pearl emphasises in his book on causal inference5 Causality: Models, Reasoning, and Inference; Pearl; Cambridge University Press; 2009. is one of the primary strengths of causal reasoning: the stability of the relationships it uncovers.6 Another strength of causal modeling is social, namely that people are more ready to accept a causal model.
Using drunk driving as an example, it would be reasonable to believe that drunk driving does not cause accidents at all, it is just that people who drive under the influence tend to also behave recklessly in other ways, and this is what causes accidents. But we can test that: we can get randomly selected people drunk and then put them in a driving simulator, and note whether the drunk group has a higher accident rate. They do, because drunk driving is causally associated with accidents, meaning the relationship is stable even when we control for other factors like personality type.
Predictive models are not guaranteed to be stable like causal models. What we need to do with predictive modeling is find out under which conditions the models hold, and find alternative models for when those conditions are violated. The upshot is that predictive models are much easier to construct, meaning we can take advantage of modeling for a much lower cost.
When arithmetic beats expertise
Meehl lists several comparisons of expertise and arithmetic in his disturbing little book. I strongly encourage you to read the book for more motivation and evidence, but here are some representative examples:
- Linear regression on just two variables – (1) high school percentile rank and (2) score on a college aptitude test – predicted gpa just as well as doctorate-level, well-trained student counselors who had access to far more information (interview notes, additional tests, individual record form, etc.)
- Linear regression with somewhat arbitrary weights on a 30-item questionnaire predicted improvements in schizophrenics much better than a team of psychiatrists discussing the same cases. The psychiatrists got on average 44 % correct, whereas the linear regression achieved 81 %.
- Tallying up 15 factors known to correlate with criminal behaviour predicted criminal recidivism just as well as a prison medical officer.
- Test scores of nearly 40,000 naval enlistees correlated better with their training success than the evaluations of interviewers – who had the same test scores in front of them.
Note that even if the expert seems to perform on the same level as arithmetic in some of these examples, the expert often refused to predict on difficult cases. If we assume their refusal to predict is a middle-of-the-range prediction, the expert performance becomes worse than the linear model.
But this does not mean we can go without experts. To run a linear model, we need data, and experts are good at extracting data from complicated situations. Another example Meehl brings up in the book is how movement in family casework was roughly as well predicted by a five-variable linear regression as by experts. What’s curious about this is the set of variables used:
These five factors were (1) referral source, (2) problem area, (3) insight, (4) resistance, and (5) degree to which client was overwhelmed. It is evident that the rating on these five variables as exhibited in the records of the initial contact already involves some considerable degree of clinical judgment by the skilled case reader.
Experts are good at taking complicated qualitative data and distilling it down to a quantitative range.7 Anyone competing in forecasting contests knows this already: a simple way to improve one’s forecast is to break it down into components, estimate the components, and then combine them into a final prediction. One reason this helps is that errors in estimating each component cancel out; another is that coherence laws relate the component estimates to each other.
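The decomposition trick can be sketched as a Fermi-style estimate: each component is easier to judge than the whole, and the product combines them mechanically. Every number below is made up:

```python
# Fermi-style decomposition: estimate a hard quantity as a product of
# easier component estimates. Errors in the individual components tend
# to partially cancel in the combined result. All numbers are invented.

components = {
    "incidents_per_deploy": 0.02,   # hypothetical base rate
    "deploys_per_week": 25,
    "weeks_per_quarter": 13,
}

incidents_per_quarter = 1.0
for value in components.values():
    incidents_per_quarter *= value

print(incidents_per_quarter)  # prints 6.5
```

The final number is only as good as the component estimates, but each component is something a domain expert can judge far more reliably than the aggregate.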
A few years later8 When Shall We Use Our Heads Instead of the Formula?; Meehl; Journal of Counseling Psychology; 1957., Meehl described the state of the research up to that point:
As of today, there are 27 empirical studies in the literature which make some meaningful comparison between the predictive success of the clinician and the statistician. […] Of these 27 studies, 17 show a definite superiority for the statistical method; 10 show the methods to be of about equal efficiency; none of them show the clinician predicting better.
There has been a lot of research on this since the 1950s, but to close this section, we’ll briefly summarise a meta-study9 Clinical Versus Mechanical Prediction: A Meta-Analysis; Grove, Zald, Lebow, Snitz, Nelson; Psychological Assessment; 2000. performed in 2000. The authors surveyed 136 comparisons between expert judgment and arithmetical data combination, on subjects like
- advertising sales,
- coupon redemption,
- military success,
- heart disease,
- job performance,
- business failure,
- suicide attempt,
- game outcome,
- homosexuality, and
- marital satisfaction.
They conclude that arithmetically combining data is on average 10 % more accurate than expert judgment. The 136 comparisons break down into the following three cases.
| Result | Share of comparisons |
| --- | --- |
| Arithmetic better than experts | 40 % |
| Arithmetic and experts equal | 49 % |
| Experts better than arithmetic | 11 % |
Across the board, the experts did not perform better with more experience in the field, nor did they perform better with more data. In fact, there are few things as dangerous as an expert with access to open-ended data that can be interpreted wildly (like a clinical interview).
There are three reasons arithmetic wins even though it is often equal to expert judgment:
- Using arithmetic for prediction lets experts attend to more important things10 One thing Meehl points out experts can do which arithmetic cannot is generate new hypotheses. To predict arithmetically, we need to aim for a criterion, or a fixed set of potential outcomes. This set needs to be conjured out of thin air, and that is something experts do really well.,
- A lower-salaried person can evaluate arithmetic, and
- Arithmetic is faster than expert judgment.
I know what the reader thinks to themselves at this point:
Yeah, that is all well and good for the areas surveyed in the meta-study. But in this specific situation I have in mind, my judgment is much more nuanced and accurate than an arithmetic model.
It is worth emphasising that the meta-study covered a wide range of fields, and the authors did not find any systematic indicator of when expert performance is better than arithmetic. This means if we assume our judgment is better than arithmetic we are betting that our case randomly falls into that 11 % bucket. We are playing Russian roulette with five chambers loaded and one empty!
It would be sensible to do so if expert judgment were much cheaper than arithmetic, but it’s not. In his disturbing little book, Meehl puts this particularly well.
It is apparently hard for some experts to assimilate the kind of thinking represented here – chiefly, I gather, because they cannot help concentrating on the unfortunate case that would have been handled better had we followed the advice of super-expert X, in defiance of the arithmetic. But what such objectors do not see is that in order to save that case, they must lay down and abide by a decision-policy which will misclassify some other case by defying the statistics.
We cannot choose policies based on the effect they have on individual cases, because the choice of individual case implies conflicting decisions. When we choose policy, we must do it based on the aggregate effects of that policy, even when it has known, heart-breaking consequences in individual cases.
Improper linear models are still better than experts
To see the full extent to which arithmetic outperforms expertise, we will turn to Dawes11 The Robust Beauty of Improper Linear Models in Decision Making; Dawes; American Psychologist; 1979., who breaks up expert judgment into components and estimates the value contributed by each component.
We pretend an expert prediction is composed of three actions:
- The expert selects what to observe and guesses which direction holds for the relationship between the observation and the outcome. (E.g. are we more or less likely to be having a subtle incident when we have made a deployment in the past four hours?)
- The expert assigns weights to the observations, indicating their relative importance to each other and the outcome. (E.g. memory usage seems normal, but is that a more or less important predictor than the recent deployment?)
- Finally, the expert adds some noise to their prediction, by considering individual variables that aren’t systematically related to the outcome but sound meaningful to a human suffering from narrative biases. (E.g. they know that mostly good developers have worked on the latest release, so clearly the risk of incident is very low.)
The expert does not know these are the things they do – the expert just looks at a bunch of data and makes a prediction based on their gut feel. But we can use this model to dive into why expert predictions don’t work as well as arithmetic.
Parametric representation removes noise
Parametric representation is a way to average out the noise component from expert judgment. It works by asking experts to predict based on randomly generated data, and then training a linear model on the expert predictions. Since the training data does not contain outcomes, we are actually figuring out the shape of the mathematical function in the experts’ heads. We are using the experts to select predictors and determine their direction and relative weight, but then the linear regression averages away the noise of their predictions. What we find is that this linear model, trained on expert predictions only, outperforms the very experts it was trained on. The conclusion is simple: the noise experts add is not useful.
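The procedure above can be sketched in a few lines of NumPy. In a real study the ratings would come from a human expert; here a simulated “expert” (a hidden linear policy plus judgment noise) stands in, and all variable counts and noise levels are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Generate random profiles on some (hypothetical) predictor variables.
n_cases, n_predictors = 200, 4
X = rng.normal(size=(n_cases, n_predictors))

# 2. Simulate an expert: a hidden linear policy plus trial-to-trial noise.
#    (In a real study this line is replaced by a human rating each profile.)
true_policy = np.array([0.8, 0.5, -0.3, 0.1])
expert_ratings = X @ true_policy + rng.normal(scale=1.0, size=n_cases)

# 3. Regress the expert's ratings on the profiles. The fitted weights are
#    the "parametric representation" of the expert; the residual noise is
#    averaged away, since no outcome data is ever used.
weights, *_ = np.linalg.lstsq(X, expert_ratings, rcond=None)

# The model of the expert now predicts without the expert's noise.
model_predictions = X @ weights
print(np.round(weights, 2))
```

The fitted weights recover the expert’s implicit policy; the model then applies that policy with perfect consistency, which is exactly where it gains its edge over the expert.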
Experts do not sensibly assign weights
What about assigning weights? How much value does the expert add in that activity? We can find out by comparing the parametric representation with the same model, except with the weights assigned randomly. Both models perform at roughly the same level, which indicates that experts, on average, assign weights no better than chance.
Experts are good at selecting variables
These results are based on a small set of studies, and thus cannot be considered conclusive. What they do is waggle their eyebrows suggestively in the direction of the tree Dawes is barking up:
- Experts are really good at selecting which variables we should look at, and at telling us whether they support or oppose the outcome.
- Experts are not good at combining these variables into a final prediction, even discounting their noisiness.
Just count it. What? Something. Anything!
Given the strong performance of optimally-weighted linear regression over even expert-assigned weights, we might find ourselves in a predicament: finding optimal weights requires knowledge of the outcome, and we often can’t measure the outcome – either because we don’t know how to, or because we lack historic data.
Under those circumstances, we can still do better than random weights: unit weights. Linear models are surprisingly powerful even with unit weights, and this is a good thing because it’s easy to find situations where estimated weights are unstable:
- The previous example of parametric representation yields unstable weights (since they are effectively selected at random);
- In multidimensional problems (where we have many potential predictors) we need a ridiculous sample size to get stable weights.
In these situations, setting all weights to be equal yields higher predictive accuracy than both random weights and attempting to estimate optimal weights.12 Dawes has a footnote on the technical details here, for the interested.
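A small simulation sketches why unit weights can beat fitted weights when samples are small relative to the number of predictors. All sample sizes and noise levels here are made up; the qualitative pattern, not the exact numbers, is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 10 predictors that all point in the same direction,
# a tiny training sample, and a large test sample. Averaged over many
# repetitions, unit weights generalize better than weights fitted by
# least squares on the tiny sample.
n_predictors, n_train, n_test, n_reps = 10, 15, 2000, 200

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

r_fitted, r_unit = [], []
for _ in range(n_reps):
    beta = rng.uniform(0.2, 1.0, n_predictors)  # true (all-positive) weights
    X_tr = rng.normal(size=(n_train, n_predictors))
    y_tr = X_tr @ beta + rng.normal(scale=2.0, size=n_train)
    X_te = rng.normal(size=(n_test, n_predictors))
    y_te = X_te @ beta + rng.normal(scale=2.0, size=n_test)

    # Fitted weights overfit the 15 training cases badly...
    fitted, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    r_fitted.append(corr(X_te @ fitted, y_te))

    # ...while unit weights just add the predictors up.
    r_unit.append(corr(X_te.sum(axis=1), y_te))

print(round(float(np.mean(r_fitted)), 2), round(float(np.mean(r_unit)), 2))
```

The only knowledge the unit-weight model borrows from the data (or the expert) is which predictors to include and which way they point; everything else is addition.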
One funny consequence of the strength of unit-weighted models is that we can use them even when we are unable to measure the outcome we are trying to optimise for, such as when hiring in small organisations. We don’t know exactly what job success looks like, but we have a decent idea of which factors contribute to it, so we can measure our candidates on those variables and then combine those measurements with equal weights.
As Dawes snappily puts it:
The whole trick is to decide what variables to look at and then know how to add.
Although it does get into dangerous territory, this can even be used to overcome measurement problems. Maybe we are looking to hire people who are willing to challenge the consensus, but we don’t know how to measure this willingness. What we can do is find an easily-measured proxy, such as counting the frequency with which the applicant contradicts one of the interviewers. This can happen for any number of reasons which are unrelated to job performance (“Would you like tea or coffee?” “No, thank you. Water is good.”), but if we try to be clever we risk introducing noise. Instead, we can just run the dumb count and combine with equal weights. It may well perform better than expert judgment.
Rationale
The authors of the meta-study referenced above ask themselves why their results were obtained, and speculate that
Humans are susceptible to many errors in clinical judgment. These include ignoring base rates, assigning nonoptimal weights to cues, failure to take into account regression toward the mean, and failure to properly assess covariation. Heuristics such as representativeness (which leads to belief in the law of small numbers) or availability (leading to over-weighting vivid data) can similarly reduce clinicians’ accuracy. Also, clinicians often do not receive adequate feedback on the accuracy of their judgments, which gives them scant opportunity to change maladaptive judgment habits.
Meehl with some coauthors13 Clinical Versus Actuarial Judgment; Dawes, Faust, Meehl; Science; 1989. dive a little deeper into reasons arithmetic prediction is so strong:
- Arithmetic models predict the same result given the same data, every time. Experts disagree even with themselves when re-evaluating cases. As Kahneman has shown, this noise is a great destroyer of predictive accuracy14 Noise: A Flaw in Human Judgment; Kahneman, Sibony, Sunstein; Little, Brown and Company; 2021. This may well be the primary reason expert judgment does not measure up to arithmetic.
- An arithmetic model can have weights derived from historic frequencies, which almost always serve as a good base for predictions of the future.
- Humans, as opposed to calibrated models, have increased confidence in their predictions when given more data, even if the additional data has no predictive power.
- When an arithmetic model is calibrated, it is specifically by including feedback from the real-world effects of its predictions. Experts do not, as a rule, seek out any feedback on their calibration.
- Related to the above, experts can form superstitions around the predictive power of particular observations based on vivid cases in their past, even though these observations may not apply generally beyond that vivid case.
- Humans tend to confabulate memories that corroborate reality. When an expert is asked to recall what they predicted in a case, they will honestly remember their prediction as aligning more with actual outcomes than it did.
- Experts tend to overvalue observations based on norms rather than actual predictive power. The example used by Meehl and co-authors is how clinicians that work with criminals might note that they often have a subtle eeg abnormality, and thus conclude that subtle eeg abnormalities predict criminal behaviour. This is an invalid conclusion unless they also know that subtle eeg abnormalities are rare in the larger population15 In case you don’t see why: imagine detaining everyone with a subtle eeg abnormality if it is relatively common in the population. Most of these people will not be criminals, even if every single criminal has an eeg abnormality! – but they don’t, because they only subject criminals to the test!
Arithmetic models also suck, but don’t let that stop you
One of the common objections against arithmetic models I have encountered is that they still suck. They do. You might come away from all the words above thinking that linear models perform well, but the truth is that generally they don’t. All the above says is that linear models perform better than experts – but experts also suck.
The difference is that when we have developed an arithmetic model, we can usually give a number indicating how well it performs, and this is the first time people are faced with how difficult prediction is. The expert may have predicted the same thing worse for a decade, but nobody ever evaluated their track record so closely. They’re an expert! Obviously they know what they are doing, don’t they? So when people are faced with poor predictive power for the first time it is usually in the context of an arithmetic model, and they reject it because “We must be able to do better than that!” Here, I agree with Dawes, and tend to answer, “Really? What makes you think so?”
Many real-world outcomes arise as complicated interactions between a multitude of variables, and they are genuinely very hard to predict. There’s no reason to think we can do better, and when we try (e.g. by hiring an expert), we run a very large risk of just introducing noise which makes the predictions – on aggregate – worse, regardless of the effect on individual cases.