Selection bias

Gabriel Mesevage

The plan

  1. The essay
  2. What is sample selection bias?
  3. Selection on observables
  4. Selection on unobservables
  5. How selection bias distorts relationships between variables

Your essay

3,000 words.

Due on 28 April 2026 before 14:00

Essay question

Discuss the role of theory and measurement in a historiographical dispute in economic history.

Examples you can use:

  1. The debate over heights, well-being, and the ‘industrialization’ puzzle.
  2. The debate over the ‘high wage economy’ thesis and why Britain industrialized first.
  3. The debate over the gold standard, empire, and sovereign bond yields in the 19th century.
  4. The debate over the measurement of income inequality in the 20th century United States.

Essay question

You can write your own but:

  1. It needs to be a scholarly dispute
  2. It needs to be quantitative
  3. I need to approve it

Essay practice

I will have you write an outline first, and will grade it for you.

We can pick the due date for that together.

From statistical significance to bias

  • Last week we covered means, standard errors, and statistical significance

  • Those tools assume our sample is a random draw from the population

  • But what happens when it isn’t?

  • If our sample is not random, our estimates may be biased – systematically wrong in a particular direction

  • No amount of data fixes a biased sample: a bigger biased sample is still biased
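A small simulation can make the last point concrete. The sketch below assumes a hypothetical population of heights (mean 170 cm, standard deviation 7 cm) and a made-up selection rule in which below-average heights are recorded only half of the time. None of these numbers come from the lecture; they are for illustration only.

```python
import random
import statistics

def biased_sample_mean(n, seed=0):
    """Return (mean, standard error) of n recorded heights under biased selection."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        # Hypothetical population: heights ~ Normal(170 cm, 7 cm).
        height = rng.gauss(170.0, 7.0)
        # Hypothetical selection rule: above-average heights are always recorded,
        # below-average heights are recorded only half of the time.
        if height >= 170.0 or rng.random() < 0.5:
            kept.append(height)
    mean = statistics.fmean(kept)
    standard_error = statistics.stdev(kept) / n ** 0.5
    return mean, standard_error

for n in (100, 10_000, 100_000):
    mean, se = biased_sample_mean(n)
    print(f"n={n:>7}: mean={mean:.2f} cm, standard error={se:.3f} cm")
```

Under these assumptions the standard error shrinks as n grows, but the sample mean settles nearly 2 cm above the true 170 cm: more data makes the estimate more precise, not more accurate.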

The bigger threat

  • In observational studies the bigger danger is not sampling error but that your sample is systematically unrepresentative of the population

  • Statistical significance tells us about precision (how much noise is in our estimate)

  • But bias is about accuracy (whether we are pointing at the right answer)

  • You can have a very precisely measured estimate of the wrong thing

What is sample selection bias?

Definition: Sample selection bias occurs when your sample is not a random draw from the population you want to study. You are more likely to capture some kinds of people than others.

  • The process that generates your data is not independent of the thing you are trying to measure

  • This means your sample statistics (means, correlations, etc.) may not reflect the true population values

A simple example: height and the Royal Marines

  • Suppose we want to estimate average height in the population using military records

  • The Royal Marines require a minimum height of 145 cm

  • Everyone below the cutoff is excluded from our data

  • Our sample mean will overestimate the true population mean

  • This is truncation bias: we only observe part of the distribution

Truncation bias illustrated

Minimum height requirement biases the sample mean upward
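The size of the upward bias can be worked out directly if we are willing to assume heights are normally distributed. For a normal population with mean \(\mu\) and standard deviation \(\sigma\), the mean of those above a cutoff \(c\) is \(\mu + \sigma\,\phi(\alpha)/(1 - \Phi(\alpha))\) with \(\alpha = (c - \mu)/\sigma\). The sketch below applies that formula; the population parameters and the second cutoff are hypothetical and not taken from the lecture.

```python
from statistics import NormalDist

def truncated_mean(mu, sigma, cutoff):
    """Mean of a Normal(mu, sigma) population restricted to values above cutoff."""
    std = NormalDist()  # standard normal, used for its pdf and cdf
    alpha = (cutoff - mu) / sigma
    share_kept = 1.0 - std.cdf(alpha)
    return mu + sigma * std.pdf(alpha) / share_kept

# Hypothetical population parameters (assumptions for illustration).
MU, SIGMA = 165.0, 10.0

# 145 cm is the cutoff from the example; 160 cm is a stricter, made-up cutoff.
for cutoff in (145.0, 160.0):
    observed = truncated_mean(MU, SIGMA, cutoff)
    print(f"cutoff {cutoff:.0f} cm: sample mean {observed:.2f} cm, "
          f"bias {observed - MU:+.2f} cm")
```

With these assumed parameters the 145 cm cutoff inflates the mean by roughly half a centimetre, while the stricter cutoff inflates it by about five centimetres. The bias is always upward, and it grows the further the cutoff cuts into the distribution.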

Worst-case bounds and missing data

Manski (2007): A survey contacts 137 unhoused people and follows up later to ask whether they found housing.

  • At follow-up, only 78 respond. Of these, 24 had exited homelessness.

  • 59 non-respondents: did they find housing? We don’t know.

  • Lower bound: Assume all 59 non-respondents did not exit \(\rightarrow\) \(24/137 = 17.5\%\)

  • Upper bound: Assume all 59 non-respondents did exit \(\rightarrow\) \((24 + 59)/137 = 60.6\%\)

  • Naive estimate (respondents only): \(24/78 = 30.8\%\)

  • The true answer lies somewhere in \([17.5\%, 60.6\%]\) – a wide range, but an honest one

Worst-case bounds
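The arithmetic behind the bounds is simple enough to wrap in a short helper. This is a minimal sketch using the figures from the example; the function name and argument names are my own.

```python
def worst_case_bounds(n_contacted, n_responded, n_positive):
    """Return (lower, naive, upper) shares for a yes/no outcome with non-response."""
    n_missing = n_contacted - n_responded
    lower = n_positive / n_contacted                # no non-respondent had the outcome
    upper = (n_positive + n_missing) / n_contacted  # every non-respondent had the outcome
    naive = n_positive / n_responded                # ignores non-respondents entirely
    return lower, naive, upper

lower, naive, upper = worst_case_bounds(n_contacted=137, n_responded=78, n_positive=24)
print(f"lower={lower:.1%}, naive={naive:.1%}, upper={upper:.1%}")
# prints: lower=17.5%, naive=30.8%, upper=60.6%
```

The width of the interval equals the non-response rate (59/137), so the only way to tighten the bounds without extra assumptions is to lose fewer respondents.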

Selection on observables

Definition: Selection bias that depends on variables you can see in your data.

  • If you know what is causing the selection, you can potentially correct for it

  • The key requirement: the variable driving selection must be measured in your dataset

A polling example

  • Suppose you’re polling voting intentions

  • Your sample over-represents university-educated voters (60% of sample vs. 30% of population)

  • University-educated voters favour Party A at 70%; non-university voters favour Party A at 40%

  • Naive estimate (raw sample): \(0.6 \times 70 + 0.4 \times 40 = 58\%\) for Party A

  • Reweighted estimate (using population shares): \(0.3 \times 70 + 0.7 \times 40 = 49\%\) for Party A

  • Because we observed education level, we could fix the bias
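The reweighting step can be written out as a short calculation. The sketch below uses the figures from the polling example; the variable and function names are my own, and the weights are the usual ratio of population share to sample share for each group.

```python
# Figures from the polling example.
support_for_a = {"university": 0.70, "non_university": 0.40}
sample_share = {"university": 0.60, "non_university": 0.40}
population_share = {"university": 0.30, "non_university": 0.70}

def weighted_support(shares, support, weights=None):
    """Average support across groups; optional weights rescale each group's share."""
    if weights is None:
        weights = {group: 1.0 for group in shares}
    return sum(shares[g] * weights[g] * support[g] for g in shares)

# Each respondent's weight = population share of their group / sample share.
weights = {g: population_share[g] / sample_share[g] for g in sample_share}

naive = weighted_support(sample_share, support_for_a)
reweighted = weighted_support(sample_share, support_for_a, weights)

for group, w in weights.items():
    print(f"{group}: weight {w:.2f}")
print(f"naive: {naive:.0%}, reweighted: {reweighted:.0%}")
```

University-educated respondents are down-weighted (each counts for half a person) and non-university respondents are up-weighted, which moves the estimate from 58% to 49%. This only works because education is recorded for every respondent and its population share is known.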

Selection on unobservables

Definition: Selection bias that depends on variables you cannot see in your data.

  • You cannot reweight or control for something you haven’t measured

  • This is the hard problem: the bias is invisible in your dataset

  • You need assumptions or external information to address it

Correcting for selection on unobservables

  • The Heckman correction attempts to fix selection on unobservables

  • Key assumption: the errors in the outcome equation and the selection equation are jointly normally distributed

  • High-level overview: you assume a specific formula describes the relationship between the unobservables driving selection and the outcome.
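For reference, the textbook version of the model behind this correction can be written compactly. The notation below (\(x_i\), \(z_i\), \(\beta\), \(\gamma\), \(\rho\), \(\sigma_\varepsilon\)) is a common convention and an assumption on my part, not something taken from the lecture. The outcome \(y_i\) is observed only when the selection indicator \(s_i\) equals one:

\[
y_i = x_i'\beta + \varepsilon_i, \qquad s_i = \mathbf{1}[\, z_i'\gamma + u_i > 0 \,]
\]

If \((\varepsilon_i, u_i)\) are jointly normal with \(\operatorname{Var}(u_i) = 1\), \(\operatorname{Var}(\varepsilon_i) = \sigma_\varepsilon^2\), and correlation \(\rho\), then in the selected sample

\[
E[\, y_i \mid x_i, z_i, s_i = 1 \,] = x_i'\beta + \rho\,\sigma_\varepsilon\,\lambda(z_i'\gamma), \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\]

The extra term \(\rho\,\sigma_\varepsilon\,\lambda(\cdot)\) is the "specific formula" for the selection bias. It vanishes when \(\rho = 0\), that is, when the unobservables driving selection are unrelated to the outcome. The form of \(\lambda\) comes entirely from the normality assumption, which is why the correction is only as credible as that assumption.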

Selection bias and relationships between variables

  • So far we’ve focused on how selection bias shifts averages

  • But selection bias can also distort relationships between variables

  • This is sometimes called Berkson’s paradox or collider bias

  • The key insight: conditioning on a variable that is caused by two other variables can create a spurious correlation between them

Berkson’s paradox: the NBA example

  • Among NBA players, taller players tend to have worse free-throw percentages

  • But in the general population, height has no relationship to free-throw accuracy

  • The NBA selects players who are either very tall or very accurate shooters (or both)

  • Among the selected group, if you’re not tall, you must be an amazing shooter (otherwise you wouldn’t be in the NBA)

  • This creates a spurious negative correlation in the selected sample that doesn’t exist in the population

Berkson’s paradox illustrated

Selection creates a spurious negative correlation (Berkson’s paradox)
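A simulation in the same spirit can be sketched in a few lines. It assumes a hypothetical population in which standardised height and shooting skill are independent by construction, and a made-up selection rule (drafted if either trait is more than 1.5 standard deviations above average). The threshold and population size are assumptions for illustration only.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)

# Hypothetical population: height and skill are independent standard normals.
population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50_000)]

# Hypothetical selection rule: selected if very tall OR a very accurate shooter.
THRESHOLD = 1.5
selected = [(h, s) for h, s in population if h > THRESHOLD or s > THRESHOLD]

population_corr = pearson(*zip(*population))
selected_corr = pearson(*zip(*selected))

print(f"population: n={len(population)}, correlation={population_corr:+.2f}")
print(f"selected:   n={len(selected)}, correlation={selected_corr:+.2f}")
```

In the full population the correlation is close to zero, as it must be by construction. Among the selected it is strongly negative, even though nothing about height causes worse shooting: the pattern is produced entirely by who gets into the sample.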

Key takeaways

  • Sample selection bias is often a bigger threat than sampling error in historical research

  • Selection on observables can be corrected if you measure the relevant variables

  • Selection on unobservables requires strong assumptions to correct (e.g., Heckman’s joint normality)

  • Selection can distort relationships between variables (Berkson’s paradox), not just averages

  • Always ask: who is in the sample and why?

Bibliography

Manski, Charles F. 2007. Identification for Prediction and Decision. Harvard University Press.