Entropic Thoughts

Intention-to-Treat Experiments

There is an experiment by LessWrong user dkl9 where they try to figure out how their mood is affected by music. Aside from the difficulty of running with no replicates1 Which is what always happens when one experiments on oneself only., they had a fairly good setup going. You can read their article for more details2 Some of those details improve on the design I’ve laid out here., but for the sake of this article, we can pretend the experiment worked like this: each day is randomly assigned to either music or no music, and mood is recorded on a numeric scale several times throughout the day.

Since they weren’t able to comply fully with the random assignments, they also recorded on which days they actually listened to music, and on which days they did not.

At the end, they report these average moods:

Days       Mean mood
Music      0.29
No music   0.22

Making some very rough assumptions about variation, this difference is maybe 1–2 standard errors away from zero, which could be considered weak evidence that music improves mood.

Except!

There is one big problem with this approach to the analysis. Although the experiment started off in a good direction by picking intended music days at random, it then suffered non-compliance, which means the actual music days are no longer randomly selected. Rather, they are influenced by the environment – which might also influence mood in the same direction. This would strengthen the apparent relationship with no change in the effect of music itself.

The solution is to adopt an intention-to-treat approach to analysis.

Illustrating with synthetic data

I don’t have access to the data dkl9 used, but we can create synthetic data to simulate the experiment. For the sake of this article we’ll keep it as simple as possible; we make some reasonable assumptions and model mood as

\[g_i = M + rg_{i-1} + km_d + bs_d + e_i\]

This is a bit dense, but it says that our mood at any given time (\(g_i\)) is affected by four things:

  • A baseline mood (\(M\)) which is constant and indicative of our life situation more generally.
  • Our previous mood (\(g_{i-1}\)), because if we were unusually happy at lunch, some of that mood is likely to linger in the afternoon. The rate of decay is given by the coefficient \(r\).
  • Whether we listen to music that day or not (\(m_d\)), a term with strength \(k\). In case it is not yet clear, the purpose of the experiment is figuring out, from data, if \(k\) is positive, negative, or zero.
  • Whether we are in a good situation or not that day (\(s_d\)), a term with strength \(b\). We cannot infer this term from data because it is indistinguishable from the error term, but the reason we still include it in the model will be apparent soon.3 Actually, since the situation is based on days and there are six measurements per day, we might be able to infer this parameter from data also. But we will not.
  • An error term (\(e_i\)) which represents other factors that vary from time to time.
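As a concrete illustration, here is a minimal sketch of this model in Python. All names and defaults are mine, chosen to match the constants calibrated later in this article; the real simulation may differ.

```python
import random

# Sketch of the mood model: g_i = M + r*g_{i-1} + k*m_d + b*s_d + e_i
# (defaults match the constants calibrated later in the article)
def simulate_moods(music, situation, M=0.15, r=0.4, k=0.0, b=0.1,
                   sigma=0.1, per_day=6, seed=0):
    """music and situation are per-day 0/1 lists; returns one mood
    value per measurement, six measurements per day."""
    rng = random.Random(seed)
    moods, g = [], M  # start at the baseline mood
    for d in range(len(music)):
        for _ in range(per_day):
            g = M + r * g + k * music[d] + b * situation[d] + rng.gauss(0, sigma)
            moods.append(g)
    return moods
```

A quick sanity check on the recursion: with the noise switched off and no music or situation effect, mood settles at the fixed point \(M/(1-r) = 0.25\).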

Here’s an example of what an experiment might look like under this model. The wiggly line is mood, and the bars indicate whether or not we listen to music each day. (The upper bars indicate listening to music, the lower bars indicate no music.)

intention-to-treat-01.svg

The reason we included the situation \(s_d\) as a separate term is that we want to add a correlation between whether we are listening to music and the situation we are in. This seems sensible – it could be things like

  • We love shopping for jeans, and clothes stores tend to play music.
  • We had expected a great time at home listening to music, but ended up having to go out roofing in the rain, where we cannot bring a speaker because of the rain.

The model then simulates 25 % non-compliance, i.e. on roughly a quarter of the days we do not follow the random assignment of music. This level of non-compliance matches the reported correlation of 0.5 between random music assignment and actual music listening.
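One way such correlated non-compliance could be generated is sketched below. The mechanism is my assumption, not dkl9’s actual behaviour: on non-compliant days we simply follow the situation instead of the plan.

```python
import random

# Hypothetical non-compliance mechanism (an assumption for illustration):
# on ~25 % of days we ignore the plan and listen to music exactly when
# the situation is good, tying actual music days to the situation.
def assign_days(n_days, p_noncompliance=0.25, seed=1):
    rng = random.Random(seed)
    planned = [rng.randint(0, 1) for _ in range(n_days)]
    situation = [rng.randint(0, 1) for _ in range(n_days)]
    actual = [situation[d] if rng.random() < p_noncompliance else planned[d]
              for d in range(n_days)]
    return planned, actual, situation
```

Note that the actual music days now agree with the situation more often than chance would allow, which is exactly the confounding we set out to create.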

When we continue to calibrate the model to produce results similar to those reported in the experiment, we get the following constants and coefficients:

\[M=0.15,\;\;r=0.4,\;\;k=0.0,\;\;b=0.1\]

The model then results in the following moods:

Days       Mean mood
Music      0.29
No music   0.20

We could spend time tweaking the model until it matches perfectly4 I know because we have something like 7 degrees of freedom for tweaking, and we only need to reproduce 5 numbers with them., but this is close enough for continued discussion.

The very alert reader will already have noticed what happened: we set \(k=0\), meaning music has no effect on mood at all in our model! Yet it produced results similar to those reported. This is confounding in action: the confounder is responsible for all of the observed effect in this model.

This is also robust in the face of variation. The model allows us to run the experiment many times, and even when we have configured music to have no effect, we get an apparent effect 99 % of the time.

intention-to-treat-02.svg

With the naïve analysis we have used so far, the correlation between mood and music is 0.26, with a standard error of 0.10. This indeed appears to be some evidence that music boosts mood.
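This naive analysis can be sketched like this. The \(1/\sqrt{n}\) standard error is a rough large-sample approximation of mine, not necessarily the computation used in the original experiment.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation, no libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Naive analysis: correlate each mood measurement with whether music
# actually played that day. se is the rough approximation 1/sqrt(n).
def naive_analysis(moods, actual_music, per_day=6):
    flags = [m for m in actual_music for _ in range(per_day)]
    return pearson(flags, moods), 1 / math.sqrt(len(moods))
```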

But it’s wrong! We know it is wrong, because we set \(k=0.0\) in the model!

Switching to intention-to-treat analysis

There are two reasons for randomisation. The one we care about here is that it distributes confounders equally across both music days and non-music days.5 The other purpose of randomisation is to make it possible to compute the probability of a result from the null hypothesis. Due to non-compliance, music listening days ended up not being randomly selected, but potentially confounded by other factors that may also affect mood.

Non-compliance is common, and there is a simple solution: instead of doing the analysis in terms of music listening days, do it in terms of planned music days. I.e. although the original randomisation didn’t quite work out, still use it for analysis. This should be fine, because if music has an effect on mood, then at least a little of that effect will be visible through the random assignments, even though they didn’t all work out. This is called intention-to-treat analysis.6 This is from the medical field, because we randomise who we intend to treat, but then some subjects may elect to move to a different arm of the experiment and we can’t ethically force them to accept treatment.
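The switch to intention-to-treat is mechanically tiny: group the mood measurements by the planned assignment instead of the actual one. A sketch, with hypothetical names of my own:

```python
# Intention-to-treat: group moods by the *planned* assignment, ignoring
# whether we actually complied on any given day.
def itt_means(moods, planned_music, per_day=6):
    flags = [p for p in planned_music for _ in range(per_day)]
    music = [g for g, f in zip(moods, flags) if f]
    silence = [g for g, f in zip(moods, flags) if not f]
    return sum(music) / len(music), sum(silence) / len(silence)
```

The only difference from the naive grouping is which list of 0/1 flags we expand: planned assignments rather than actual listening days.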

In this plot, the lighter bands indicate when we planned to listen to music, and the darker bands when we actually did so.

intention-to-treat-03.svg

With very keen eyes, we can already see the great effect of confounding on mood. As a hint, look for where the bars indicate non-compliance, and you’ll see how often that corresponds to big shifts in mood.

When looking at mood through the lens of when we planned to listen to music, there is no longer any meaningful difference.

Days             Mean mood
Music planned    0.24
Silence planned  0.23
Correlation      0.03
Standard error   0.03

Thus, when we do the analysis in terms of intention-to-treat, we see clearly that music has no discernible effect on mood. This is to be expected, because we set \(k=0.0\) after all, so there shouldn’t be any effect.

The cost is lower statistical power

To explore the drawback of intention-to-treat analysis, we can adjust the model such that music has a fairly significant effect on mood. We will make music 4× as powerful as situation.

\[M=0.14,\;\;k=0.04,\;\;b=0.01\]

This new model gives us roughly the same results as reported before when looking purely in terms of when music is playing:

Days       Mean mood
Music      0.29
No music   0.21

On the other hand, if we look at it through an intention-to-treat lens, we see there is now an effect (as we would expect), although too small to be trusted based on the data alone.

Days             Mean mood
Music planned    0.26
Silence planned  0.23
Correlation      0.09
Standard error   0.11

Remember that we constructed this version of the model with a genuine effect of music, but because we look at it through an intention-to-treat analysis, the effect becomes harder to see. To bring it out, we would need to run the experiment not for 31 days, but for half a year!
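The half-a-year figure can be sanity-checked with a standard back-of-envelope power computation. This approximation is mine, not a calculation from the original article: to detect a correlation \(r\) with roughly 80 % power at the 5 % level we need about \(((z_\alpha + z_\beta)/r)^2\) observations.

```python
import math

# Rough power calculation: observations needed to detect correlation r
# with ~80 % power at ~5 % significance, converted to days at six
# measurements per day. z_alpha ≈ 1.96, z_beta ≈ 0.84.
def days_needed(r, per_day=6, z_alpha=1.96, z_beta=0.84):
    n = ((z_alpha + z_beta) / r) ** 2
    return math.ceil(n / per_day)
```

Plugging in the naive correlation 0.26 gives about 20 days – comfortably within the month that was run – while the intention-to-treat correlation 0.09 gives about 162 days, i.e. roughly half a year.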

Such is the cost of including confounders in one’s data: they make experiments much more expensive by virtue of clouding the real relationships. Ignoring them does not make things better, it only risks producing mirages.

Brief summary of findings

To summarise, these are the situations we can find ourselves in:

Analysis type       Significant effect             Non-significant effect
Naïve               Accurate result or confounder  Accurate result
Intention-to-treat  Accurate result                Accurate result or confounder

In other words, by switching from a naïve analysis to an intention-to-treat analysis, we make confounders result in false negatives rather than false positives. This is usually preferred when sciencing.