# Sample Unit Engineering

One of the coolest techniques I learned from Deming (and I have learned many
cool techniques from Deming) comes not from his more popular books, but from
*Some Theory of Sampling*^{1} *Some Theory of Sampling; Deming; John Wiley &
Sons; 1950*. Deming never gives the technique a name, because to him it’s just
an obvious part of being one of the best industrial statisticians in history.
I want to highlight this one technique specifically. To be able to talk about
it, I’m going to call it *sample unit engineering*.

In this article, I aim to give you an intuition of what sample units are, and how they affect the precision of a sample. This lets us do sample unit engineering. We care a lot about sampling because it lets us count things at a fifth of the cost of a full count – and this saves hours of effort both at work and personally.

# Table of Contents

- Taking random samples
- Interpreting random samples
- Sample precision as mean distance between samples
- Improving sample precision
- Sample unit different from ultimate unit
- Sample unit engineering: bunching up
- A good sample unit captures variation
- Cluster sampling may not be effective
- Sample unit engineering: splitting apart

# Taking random samples

In a simple random sample, we draw sample units and measure them. For the sake
of example, let’s say we have stood in a boarding tunnel to a cruiseferry and
asked randomly^{2} Here the word *randomly* is used in its technical sense, not
as a synonym for haphazardly. selected passengers for their name and how much
money they intend to spend on entertainment.

We get the following measurements.

    Emma   ────────────────────── $220
    Cheng  ─────── $70
    Liam   ───────────── $130
    Sanjay ──────── $80
    Amara  ── $20
    Ethan  ─ $10
    Annika ─────────── $110
    Matteo ──── $40
    Taavi  $0
    Olivia ───────────────── $180

In the sample, we see that

- Taavi, making up 10 % of the sample, plans to not spend anything on entertainment;
- Ethan, Amara, Matteo, Cheng, and Sanjay make up half the sample and plan to spend in the $10–$100 range;
- Annika, Liam, Olivia, and Emma make up almost half^{3} Strictly speaking 40 %, but since we are going to extrapolate into the full population we’ll try to stick to fuzzy words to avoid implying a precision that is not there. of the sample and will spend more than $100; and
- Only Emma – 10 % of the sample – will spend over $200.

For a sample to be called *random* and get all the benefits of a random sample,
it must have one property: **each measurement in the full population must have
had an equal probability of ending up in the sample**.

We will call this *the randomness property*, and satisfying it means we cannot
stand in the boarding tunnel for five minutes at a particular point in time, but
rather we have to stand there throughout the entire boarding process, and have
some way of selecting passengers completely randomly. We might, for example,
throw dice and sample only the passengers for which we get snake eyes.
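As a sketch, the dice procedure might look like the following (the passenger list and function name are made up for illustration):

```python
import random

def snake_eyes_sample(passengers, rng):
    """Keep a passenger only when two dice both come up 1 (p = 1/36),
    so every passenger has the same inclusion probability."""
    return [p for p in passengers
            if rng.randint(1, 6) == 1 and rng.randint(1, 6) == 1]

rng = random.Random(42)
sample = snake_eyes_sample(range(36_000), rng)
# With 36,000 passengers we expect roughly 1,000 in the sample.
```

The point is that inclusion is decided by the dice alone, never by anything about the passenger, which is what preserves the randomness property.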

The randomness property is critical, yet we cannot verify it after the fact by
looking at the sample alone. We have to trust that whoever did
the sampling followed a procedure that ensured each passenger had an equal
chance of being asked for their spending plans. This is *the key* to remember
about simple random sampling. As long as the randomness property is satisfied,
we are free to do anything we want to get our samples. (Foreshadowing.)

# Interpreting random samples

What is neat is that as long as the above is a random sample, then *its
proportions are roughly true even outside the sample*, e.g. for the entire
passenger base. This means, if we take on our fuzzy glasses to smooth out random
variation, we can consider about 10 % of all passengers as not paying for
entertainment.^{4} When interpreting samples, we are going to play fast and
loose to get to the intuition quicker. People well-versed in mathematics hate
when I do this. There are formal things we can say with probabilities attached
based on random samples, but we will not go into that now. Any book on sampling
will do that better than I could. Similarly, about half spend some amount, but
less than $100.

Since we know each passenger’s probability of ending up in the sample (1/36 when we did snake eyes selection), we can also make informed guesses at totals. We can say that there should be about 10×36 = 360 passengers on board. Of these, 90 % plan to pay for entertainment, i.e. we can set the staff schedule based on roughly 320 passengers expecting entertainment.

The total planned spending in the sample is $860. Since this was a random sample, it should represent roughly 1/36 of the total planned spending for all passengers. We can then guess that on this trip, the shipping line will receive somewhere around $31,000 in entertainment revenue.

Isn’t that pretty cool? By just getting a number from 10 randomly selected passengers, we can make an informed guess on how much the cruiseferry will make in entertainment revenue.
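In code, these back-of-envelope estimates are just the sample values scaled by the inverse of the inclusion probability:

```python
# Scale sample counts and totals by the inverse inclusion probability
# (1/36 for snake-eyes selection) to estimate population values.
spends = [220, 70, 130, 80, 20, 10, 110, 40, 0, 180]
inclusion_probability = 1 / 36

est_passengers = len(spends) / inclusion_probability   # about 360 aboard
est_revenue = sum(spends) / inclusion_probability      # about $31,000
```

This works because each sampled passenger "stands in" for 36 passengers in the population.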

It gets better once we understand where sample precision comes from.

# Sample precision as mean distance between samples

We have already seen one of the magical results of random sampling: it has
perfect accuracy. As long as the randomness property is fulfilled, and we are
doing unbiased calculations on the sample, our results will be unbiased
estimations of population values. The other magical result is that we can
determine the *precision* of a random sample by knowing nothing except the
measurements in the random sample itself.^{5} This section will contain a
little maths, but we will only use it once in the next section to derive an
intuitive idea of sample precision so the details can safely be skipped.

The sampled values we got earlier were

      220    70   130    80    20    10   110    40     0   180

The mean of these values is $86. We can write that as

\[\hat{\mu} = 86.\]

If we borrow from statistical process control to estimate the standard deviation of this sample, we write down the consecutive differences between the samples as

      220    70   130    80    20    10   110    40     0   180
         150    60    50    60    10   100    70    40   180

The mean difference is 80; dividing by the constant 1.128 gives a standard deviation estimate of 71. We can call this

\[s = 71.\]

The *standard error* of the estimated mean \(\hat{\mu}\) is an indication of how
precise our sample is. We get it by dividing \(s\) with the square root of the
number of samples^{6} Strictly speaking, this assumes the sample variation is
exactly equal to the population variation, which is not true due to sampling
variation in the variation itself (!). But as a first approximation, we’ll roll
with it. And for a large number of samples, it is very nearly true.,^{7} It also
assumes that people actually intended to spend dollars rounded to even 10s. If
not (and the rounding was done as a separate step) then that also introduces
slightly more variation that we are not accounting for here., i.e.

\[\textrm{se}_{\hat{\mu}} = \frac{s}{\sqrt{n}} = \frac{71}{\sqrt{10}} \approx 22.\]

We can compare this to \(\hat{\mu}\), and we see that the standard error of the estimation ($22) is about a quarter of the size of the estimation itself ($86). This is usually considered insufficient precision, but it depends on economics.
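The whole derivation condenses into a short function (a sketch; `standard_error` is a name of my choosing, and 1.128 is the usual d₂ constant for moving ranges of two consecutive values):

```python
import math

def standard_error(values):
    """Estimate the standard error of the mean from the mean
    absolute difference between consecutive samples."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    s = (sum(diffs) / len(diffs)) / 1.128  # moving-range estimate of sd
    return s / math.sqrt(len(values))

spends = [220, 70, 130, 80, 20, 10, 110, 40, 0, 180]
# standard_error(spends) comes out around 22, matching the text.
```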

# Improving sample precision

What’s more important, though, is that we now have a way to see if our sample precision improves or gets worse. Note that \(\hat{\mu}\) is mainly determined by the population and not something we can affect. What we want to reduce is the standard error of the estimation, \(\textrm{se}_{\hat{\mu}}\). Since

\[\textrm{se}_{\hat{\mu}} = \frac{s}{\sqrt{n}}\]

We see we have two things we can improve to make precision better:

- We can increase \(n\), the sample size. This will have a decaying effect on the precision because of the square root: quadrupling the number of samples doubles the precision. But this is not as big a deal as it sounds. The real problem with increasing sample size is that it costs more time and/or money! Actually collecting samples is usually the expensive part of any counting endeavour (this is why we sample in the first place), so we want to get by with as few samples as possible.

- We can try to decrease \(s\), the standard deviation between samples. This will have a linear effect: if we manage to halve the between-sample variation, it will double the precision. It might sound like \(s\) is a property of the population just like \(\hat{\mu}\), but it turns out it’s not – it’s a property of the process by which we select samples!

A consequence of the above is that if we can find a way to decrease \(s\), we can get higher precision for essentially the same cost. To get there, we need a good intuition for what \(s\) means and what it looks like in our data.

Recall that we calculated \(s\) from the average consecutive difference between
samples, which means *how much the samples wiggle up and down* is a good
intuitive picture of \(s\). If we can get the samples to wiggle up and down less,
we get a smaller \(s\), and then the sample precision improves. However, we still
have to maintain the property that *each measurement should have an equal chance
of showing up in the sample*. We cannot, therefore, pre-screen samples and only
include those that don’t wiggle too much. That would no longer be a random
sample.^{8} Nor can we reorder samples to make them wiggle less. That just makes
the procedure for estimating their standard deviation wrong, it doesn’t actually
change their standard deviation.

In order to figure this puzzle out, we have to learn about the difference
between *sample units* and *ultimate units*.

# Sample unit different from ultimate unit

The *ultimate units* are the targets of study, the things we actually want to
count. In the cruiseferry problem, we have seen two ultimate units so far:

- The number of entertainment spending dollars aboard the ship, and
- The number of passengers expecting entertainment aboard.

What’s curious is we never really sampled dollars – we have only sampled
passengers, and then counted *their* dollars. This is fine and makes intuitive
sense – every entertainment dollar belongs to some passenger, and no dollar
belongs to more than one passenger. As long as those two conditions are true, we
can have a *sample unit* (passengers) that is different from the *ultimate unit*
(dollars) and still satisfy the randomness property.

We can expand the sample unit further. Instead of bothering people when they
board, we can walk along the hallways of the ship once it’s left the dock, and
knock on randomly selected passenger cabin doors^{9} We’re now assuming a
simplified condition where people stay in their living cabins throughout the
journey. It would get a lot more complicated if people walked around on the
ship. When someone opens, we ask every person in the cabin how much money they
plan to spend on entertainment.^{10} Note that for the randomness property to be
satisfied, i.e. for each person to have an equal chance of showing up in the
sample, we have to ask *everyone* we find in a sampled cabin. Even if it’s a
single-person cabin and there happens to be 20 people living in it, we have to
ask everyone.

Let’s say there are 150 cabins, and we select 10 of them to knock on. In the first, we find four passengers who collectively plan to spend $270. In the next, we find just one passenger who plans to spend $50. And so it goes for eight more cabins of varying numbers of inhabitants.

    ─────────────────────────── $270
    ───── $50
    ───────────────────────────────── $330
    ────────────────────── $220
    ──────────── $120
    ───────── $90
    ──────────── $120
    ──────────────────────────────────────── $400
    ────────────────── $180
    ─── $30

In this case, the mean is $181 per cabin and the standard error of the mean is $44, so we got roughly the same precision as before. This might seem odd, seeing that we definitely surveyed more passengers this way.

You can probably guess why this happened: in some cabins there are many people, and in others fewer. The cabins with a lot of people are likely to spend more on entertainment than those with fewer. This amplifies the wiggles that we tried to smooth out by talking to more people.

But we can run this effect backwards and take advantage of it!

# Sample unit engineering: bunching up

What we can do is invent artificial sample units. Maybe that big family cabin is its own sample unit as before, but we combine four tiny engine room-floor cabins into one sample unit. In other words, we write the cabin numbers of every cabin on a piece of paper, but we write four tiny engine room-floor cabin numbers on one piece of paper, while the family cabin gets its own paper. Then we put them in a hat and shuffle to get our sample.

At this point we’re not sampling people, nor are we sampling cabins, but instead sampling an artificial unit which consists of one or more cabins. We have designed these artificial units to contain roughly the same number of passengers in each.

    ──────────────────────── $240
    ───────── $90
    ───────────────────────────────────── $370
    ───────────────────────────── $290
    ─────────────── $150
    ────────────────────── $220
    ─────────────────────── $230
    ──────────────────────────────── $320
    ────────────────── $180
    ───────────────────────── $250

We see visually that these numbers wiggle up and down far less than before. We
can verify numerically that the mean is $234 and the standard error $25. Look at
that! The standard error is down to a tenth of the mean. We did the exact same
thing as before, except now we made sure to design sample units that were
roughly similar in number of passengers.^{11} We did interview more than 10
passengers this way, though, so it’s a little cheaty.
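The bunching itself can be sketched as a greedy grouping; the cabin names, sizes, and target below are all hypothetical:

```python
# Greedy sketch: combine cabins into artificial sample units holding
# roughly the same number of passengers each. Cabin data is made up.
cabins = [("family", 6), ("a1", 1), ("a2", 2), ("a3", 1), ("a4", 2),
          ("b1", 1), ("b2", 2), ("b3", 2), ("b4", 1), ("b5", 2)]

target = 5                     # passengers per artificial unit, roughly
units, current, count = [], [], 0
for name, size in sorted(cabins, key=lambda c: -c[1]):
    current.append(name)
    count += size
    if count >= target:        # unit is big enough; start a new one
        units.append(current)
        current, count = [], 0
if current:                    # leftover cabins form the last unit
    units.append(current)
```

Each cabin ends up in exactly one unit, so the randomness property survives: we write one paper per unit, not per cabin, and draw from the hat as before.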

Can we do better? Sure we can! We can look at historic trends in entertainment spending from people in various types of cabin. We can create sample units that consist of pairs of low-spending and high-rolling passengers, based on which cabins they are in. Then we pick 10 of these pairs at random, and ask how much they plan to spend.

We might end up with something like

    ───────────────────────── $250
    ────────────────────── $220
    ───────────────────────────── $290
    ─────────────────── $190
    ──────────────────────────────── $320
    ───────────────────── $210
    ──────────────────────── $240
    ──────────────────────────────────── $360
    ───────────────── $170
    ───────────────────────── $250

The mean pair spend is $250, but the standard error is an incredible $19. This means the standard error is now around 7 % of the mean.

We did survey 20 passengers for this, but what would have happened if we had surveyed just 10, i.e. five pairs? The standard error would be 11 % of the mean. In other words, without changing how many passengers we survey, we can double our precision by selecting them cleverly – but still randomly.
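To see the effect of pairing on precision, here is a small simulation with a made-up population of low spenders and high rollers (all distributions and sizes are hypothetical):

```python
import math
import random

rng = random.Random(0)

def relative_se(units):
    """Standard error of the unit mean, relative to the mean itself."""
    m = sum(units) / len(units)
    var = sum((u - m) ** 2 for u in units) / (len(units) - 1)
    return math.sqrt(var / len(units)) / m

# Hypothetical population: half low spenders, half high rollers.
low = [rng.gauss(50, 20) for _ in range(500)]
high = [rng.gauss(250, 20) for _ in range(500)]

# Plan A: 20 individual passengers as 20 sample units.
individuals = rng.sample(low + high, 20)

# Plan B: 10 engineered units, each pairing a low and a high spender.
pairs = [a + b for a, b in zip(rng.sample(low, 10), rng.sample(high, 10))]
```

Both plans interview 20 passengers, but plan B captures the low/high variation *inside* each unit, so its relative standard error comes out markedly smaller.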

# A good sample unit captures variation

We have seen a few ways to maintain the randomness property, while still being able to reduce wiggles. What is it we have really done, intuitively?

We can visualise our sample as a stacked bar chart. When we stood in the boarding tunnel and prodded individual passengers, we got one measurement for each sample unit:

*(Figure: stacked bar chart with one bar per sample unit and one measurement per bar; the bar heights wiggle up and down a lot.)*

When we do this, we get a certain level of wiggling up and down between the sample units, and as long as we have just one measurement per box, this wiggling is entirely determined by the population variation.

When we sampled cabins instead, we caught more people in each sample unit. The precision of the sample is still the amount of wiggling from group to group, but each group consists of more than one measurement. Recall that this is a stacked bar chart, meaning the top of the bar is the sum of all measurements in that sample unit.

*(Figure: stacked bar chart where each bar stacks the measurements of one cabin’s passengers; the bar totals still wiggle a lot.)*

This didn’t help much, though – it still wiggles. What did help significantly was capturing a pair of low-spending and high-rolling passengers in each sample unit. Visually, we can think of it this way:

*(Figure: stacked bar chart of the engineered pairs; each bar stacks one low and one high measurement, and the bar totals are roughly level.)*

By choosing sample units such that we are likely to have one large and one small measurement in each group, we get stacks of approximately equal height – they will wiggle up and down less, and this lets us sample with greater precision while preserving the sample size.

There’s another way to visualise this that perhaps makes it even more clear what goes on. Instead of stacked bars, imagine each measurement is an (unstacked) point, and we draw a box around all measurements that came from the same sample unit. In the boarding tunnel case, we get the fairly obvious one-box-over-each-measurement picture:

*(Figure: one narrow box drawn around each individual measurement.)*

When we knock on cabin doors, we get a variable number of measurements in each box, but the average value of each box still wiggles up and down a lot:

*(Figure: boxes of varying widths, one around each cabin’s measurements; the box averages still wiggle a lot.)*

However, when talking to the pairs of high-rollers and low-spenders, the average value of the boxes stay roughly similar, even if each box covers a large area:

*(Figure: wide boxes, each around one low and one high measurement; the box averages stay roughly level.)*

and this, finally, gets to the heart of the intuition. A good sampling plan is
one that designs sample units *such that they result in wide boxes*.

The way statisticians speak of this is that there is *between-sample variation*
and *within-sample variation*. The between-sample variation (i.e. the wiggling
of the midpoints of the boxes) is what is bad for precision. The within-sample
variation (the width of the boxes) has no effect on precision on its own. But!
If we can capture as much as possible of the population variation inside the
sample, the between-sample variation will go down.

That said, we still have to maintain the randomness property:

- Each ultimate unit should belong to exactly one sample unit,
- We must sum up the measurements from all ultimate units in the sample unit, and
- Each sample unit should have an equal chance of being selected.
^{12} Strictly speaking, the sample units should have a *known* probability of being selected – it does not have to be equal. If it’s not equal, we can weight measurements according to their probability and still get a correct result. But equiprobable sample units simplify the procedure.

As long as that is true, we can draw sample unit boundaries however we want, and we want to do it in a way that results in large variation inside the sample units, and small variation between them.
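The first requirement – every ultimate unit in exactly one sample unit – is a partition property, and it is easy to check mechanically for any proposed set of sample units (a small sketch; the function name is mine):

```python
def is_partition(population, sample_units):
    """True when every ultimate unit appears in exactly one sample unit."""
    assigned = [u for unit in sample_units for u in unit]
    return sorted(assigned) == sorted(population)

passengers = ["Emma", "Cheng", "Liam", "Taavi"]
```

Grouping Emma with Cheng and Liam with Taavi passes the check; leaving Cheng out, or putting Emma in two units, fails it.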

# Cluster sampling may not be effective

Now that we understand the basics of how variation between and within samples affects precision, we can take on a challenge.

*Cluster sampling* is a technique where one generates random samples, and then
also takes the opportunity to include in the sample nearby ultimate units when
one is there anyway. In our example, that might be selecting ten random cabins
to knock on the doors of, but then also taking the opportunity to knock on the
doors of three neighbouring cabins. This is intended as a cheap way to increase
the sample size, which it does: the sample size in the example is quadrupled at
low additional cost since one is down there knocking on doors anyway.

But! Cluster sampling usually doesn’t improve precision as much as one would think from a quadrupling of the sample size. Why is that?

Seriously, think about it.

The problem is that nearby cabins are often highly correlated. In other words, while we wanted to draw wide boxes to capture as much variation as possible inside the sample units, cluster sampling often leads to very narrow boxes. So even though we quadruple the sample size, we might get better precision – for the same cost – by merely doubling the sample size, but doing it properly: drawing a fresh set of samples that are uncorrelated with the ones we already had.

This rests on the realisation that there are two things that affect sample precision:

- Number of samples, and
- How much variation is captured within each sample.

If sample size is increased while capturing less variation within each sample, we won’t improve precision by as much as we had hoped.
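A small simulation illustrates the penalty; the corridor effect and all numbers below are made up for the sketch:

```python
import random

rng = random.Random(1)

def corridor_base():
    """Corridors differ a lot from one another."""
    return rng.gauss(150, 80)

def cabin_value(base):
    """Spending in one cabin: shared corridor level plus small cabin noise."""
    return base + rng.gauss(0, 15)

def mean_error(n_interviews, clustered, trials=1000):
    """Average absolute error of the estimated mean spending."""
    total = 0.0
    for _ in range(trials):
        values = []
        if clustered:
            # Visit n/4 corridors, interviewing 4 neighbouring cabins each.
            for _ in range(n_interviews // 4):
                base = corridor_base()
                values += [cabin_value(base) for _ in range(4)]
        else:
            # Every interview is a fresh, uncorrelated draw.
            values = [cabin_value(corridor_base())
                      for _ in range(n_interviews)]
        total += abs(sum(values) / len(values) - 150)
    return total / trials
```

In this setup, `mean_error(40, clustered=True)` comes out clearly larger than `mean_error(20, clustered=False)`: 40 clustered interviews lose to 20 fresh ones, because the clustered cabins mostly repeat their corridor’s shared level.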

# Sample unit engineering: splitting apart

One final trick: maybe there is a huge party cabin on the cruiseferry. If this
gets selected by the random sampling process, it will ruin the precision of the
sample because of its size and the likelihood of high spending by its
inhabitants. On the other hand, if it’s *not* selected in the random sample, it
will still be bad because even though we don’t know it, the variability it adds
to the population makes our sample less precise.

Large sample units – whether or not they are selected – are bad for precision.

What we can do is split large sample units up into smaller ones. We take the one large cabin and create two artificial sample units from it, splitting it down the middle and including the left and right halves as separate sampling units. If we happen to draw the left half of the party cabin from the hat, we instruct the field worker to speak only to the people on the left side of the cabin when the door opens.

This, in some sense, gives the huge cabin two chances to appear in the sample
(as opposed to regular cabins that just get one) but since we only speak to the
half of the cabin that is drawn, the measurements of the cabin get a
correspondingly lower weight in the data. This is similar to the Monte Carlo
technique of *importance sampling* except performed more primitively.
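As a final sketch, the hat with the split cabin might look like this (cabin labels are made up):

```python
import random

rng = random.Random(7)

# Each regular cabin is one paper in the hat; the party cabin is
# split into two artificial half-cabin sample units.
papers = [f"cabin-{i}" for i in range(1, 9)]
papers += ["party-cabin-left", "party-cabin-right"]

drawn = rng.sample(papers, 3)   # draw three sample units from the hat
```

The party cabin now has two papers in the hat, but each draw covers only half of its passengers, so its total weight in the data stays in proportion.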