Linear Regression

Gabriel Mesevage

Today’s plan

  1. An overview of linear regression
  2. Class work on the high-wage economy debate

A motivating example

  • Imagine we are conducting an anthropometric study using a dataset of 1,517 heights recorded in 1850

  • 1,215 individuals are recorded as male and 302 as female

  • For each record we also have the person’s date of birth

  • We want to understand the relationship between height, gender, and date of birth

  • We can think of regression as a way of comparing averages (Gelman, Hill, and Vehtari 2020)

A snippet of the data

Table 1: A snippet of the height data
Individual   Height (cm)   Gender   DOB
1            177.44        M        1826-10-28
2            167.63        M        1829-05-14
3            176.31        M        1829-02-20
1517         162.86        F        1824-01-18

Comparing averages by gender

We can calculate averages separately by gender:

\[ \text{Avg}_M = \sum_{i=1}^{1215} \frac{h_i}{1215} \qquad \text{Avg}_F = \sum_{i=1}^{302} \frac{h_i}{302} \]
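As a minimal sketch of the two group averages, the code below uses only the four rows shown in Table 1 rather than the full dataset. The variable and function names are illustrative assumptions, not part of the original analysis.

```python
from statistics import mean

# The four rows visible in Table 1: (height in cm, recorded gender).
snippet = [
    (177.44, "M"),
    (167.63, "M"),
    (176.31, "M"),
    (162.86, "F"),
]

def average_height(rows, gender):
    """Average height over the rows recorded with the given gender."""
    return mean(h for h, g in rows if g == gender)

avg_m = average_height(snippet, "M")
avg_f = average_height(snippet, "F")
print(f"Avg_M = {avg_m:.2f} cm, Avg_F = {avg_f:.2f} cm")
```

With the full data the same function would average over 1,215 and 302 heights respectively.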

Precision differs by group size

  • Assume the standard deviation of men’s and women’s heights is the same value \(\sigma\)

  • Standard error of the male average: \(\sigma_M = \frac{\sigma}{\sqrt{1215}} \approx \frac{\sigma}{35}\)

  • Standard error of the female average: \(\sigma_F = \frac{\sigma}{\sqrt{302}} \approx \frac{\sigma}{17}\)

  • Our estimate of the average male height is about twice as precise as our estimate of the average female height

  • This occurs simply because we observe fewer women in the data
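The factor of two can be checked directly. In this sketch the common standard deviation is set to a placeholder value of 1, which is an assumption for illustration only; it cancels in the ratio.

```python
import math

n_male, n_female = 1215, 302

def standard_error(sigma, n):
    """Standard error of a sample mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

sigma = 1.0  # placeholder: any common value cancels in the ratio below
se_male = standard_error(sigma, n_male)
se_female = standard_error(sigma, n_female)

# How many times larger the female standard error is than the male one.
ratio = se_female / se_male
print(f"sqrt(1215) = {math.sqrt(n_male):.1f}, sqrt(302) = {math.sqrt(n_female):.1f}")
print(f"SE ratio (female / male) = {ratio:.2f}")
```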

Heights by gender

Figure 1: Heights by gender. Red points and error bars show the mean \(\pm\) 2 standard errors.

Averaging by date of birth

What if we want to calculate the average by date of birth?

Figure 2: Individual heights plotted against date of birth in 1822.

The problem with fine-grained averages

  • If we reduce DOB to the year of birth we can calculate an average, but it is less precise

  • At the year-and-month level it becomes almost impossible

  • At the actual day of birth most days have 1 or no observations

  • We need an approach that uses all of the observed data and generalizes to any date

Thinking about prediction

  • Let’s shift perspective: consider a date we have no observations for

  • What is a good strategy for guessing the average height at this date?

Figure 3: Average height by week. Point size indicates the number of observations.

Interpolation

  • Say we observe average heights for February 1822 and April 1822

  • A reasonable guess for March 1822:

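\[ \hat{h}_{1822\text{-Mar}} = h_{1822\text{-Feb}} + \frac{h_{1822\text{-Apr}} - h_{1822\text{-Feb}}}{t_{1822\text{-Apr}} - t_{1822\text{-Feb}}} \, \left( t_{1822\text{-Mar}} - t_{1822\text{-Feb}} \right) \]

  • The fraction is the rise over run (change in height divided by change in time); we multiply it by the time elapsed since February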
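A minimal sketch of linear interpolation. The monthly averages are made up and time is measured in months; both are assumptions for illustration. The slope (rise over run) is multiplied by the time elapsed since the first observed month.

```python
def interpolate(t, t0, h0, t1, h1):
    """Linearly interpolate the height at time t between (t0, h0) and (t1, h1)."""
    slope = (h1 - h0) / (t1 - t0)  # rise over run
    return h0 + slope * (t - t0)

# Hypothetical average heights in cm; months of 1822 numbered 2 = Feb, 4 = Apr.
h_feb, h_apr = 168.0, 170.0
h_mar = interpolate(3, 2, h_feb, 4, h_apr)
print(h_mar)  # 169.0
```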

Limitations of interpolation

  • The individual monthly averages are based on few observations, so they may not be very precise

  • We observe many data points — how do we include all of them?

  • What if we are missing two observations in a row?

  • We are working with time averages but really we see the day people are born

  • We need an approach that:

    1. Uses all of the observed data
    2. Generalizes to any date with missing values

Our goal: conditional averages

  • The best case: so many observations per date that we could calculate the average for each day

  • This would be the conditional average: the average conditional on the day a person was born

  • We don’t have enough data for this, but at least we know our goal is a conditional average

  • Solution: calculate an average that depends on the date and a very small number of unknown parameters

Linear regression

Linear regression computes a linear approximation to the conditional average.

The word linear means:

  1. The relationship is the same no matter what time period we look at: moving from 1822-02-10 to 1822-02-20 has the same effect as moving from 1823-02-10 to 1823-02-20

  2. The relationship between the outcome and the predictor is governed by a single parameter

The regression equation

\[ h_i = \alpha + \beta \, d_i + \varepsilon_i \]

  • \(h_i\): the height of individual \(i\) (one of 1,517 observations)

  • \(d_i\): the date of birth, expressed in decimal years

  • \(\alpha\): the intercept — the predicted average height when \(d_i = 0\)

  • \(\beta\): the slope — the predicted change in average height for a one-year increase in DOB

  • \(\varepsilon_i\): the error term — the deviation of a person’s height from the average height of someone born on their birthday
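The slides do not say how dates were converted to decimal years. One common convention, assumed here, is the year plus the fraction of that year already elapsed; `decimal_year` is a hypothetical helper name.

```python
from datetime import date

def decimal_year(d):
    """Year plus the fraction of that year elapsed at the start of day d."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days

dob = date(1826, 10, 28)  # individual 1 in Table 1
print(round(decimal_year(dob), 3))  # 1826.822
```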

Ordinary Least Squares

We estimate \(\alpha\) and \(\beta\) by minimizing the sum of squared errors:

\[ \min_{\alpha,\,\beta} \sum_{i=1}^{1517} \varepsilon_i^2 \]

where

\[ \varepsilon_i^2 = (h_i - \alpha - \beta \, d_i)^2 \]

  • We pick the values of \(\alpha\) and \(\beta\) that make the squared deviations as small as possible

  • There are closed-form solutions to this minimization problem (you could solve it by hand), but your computer can do it trivially.
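A sketch of the closed-form solution for a single predictor: the slope is the sum of cross-deviations divided by the sum of squared deviations of the predictor, and the intercept makes the line pass through the point of means. The data are made up and lie exactly on a line, so the fit can be checked by eye.

```python
def ols_fit(d, h):
    """Return (alpha_hat, beta_hat) minimizing the sum of squared deviations."""
    n = len(d)
    d_bar = sum(d) / n
    h_bar = sum(h) / n
    s_dh = sum((di - d_bar) * (hi - h_bar) for di, hi in zip(d, h))
    s_dd = sum((di - d_bar) ** 2 for di in d)
    beta_hat = s_dh / s_dd
    alpha_hat = h_bar - beta_hat * d_bar
    return alpha_hat, beta_hat

# Made-up data lying exactly on h = 150 + 2 * d.
d = [0.0, 1.0, 2.0, 3.0]
h = [150.0, 152.0, 154.0, 156.0]
alpha_hat, beta_hat = ols_fit(d, h)
print(alpha_hat, beta_hat)  # 150.0 2.0
```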

Our first regression

Figure 4: Height versus date of birth with OLS regression line and 95% confidence band.

Predicting off the support

  • The slope \(\hat\beta\) is 0.119 cm per year of birth — slightly positive

  • What if we predict height for someone born on 2026-01-01?

\[ \hat{h}_{2026\text{-}01\text{-}01} = \hat\alpha + \hat\beta \, d_{2026\text{-}01\text{-}01} \]

  • Our prediction: 191.9 cm — much too tall for average height!

  • The linear approximation is fitted to the dates we observe, but the fitted model will mechanically produce a prediction for any date

  • Predictions become less reliable the farther we move from observed dates

  • This is called predicting off the support of the distribution
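The extrapolation can be reproduced approximately from the rounded estimates reported in Table 2. Because those coefficients are rounded to three decimals, the result may differ in the last digit from the figure quoted above; treating 2026-01-01 as decimal year 2026.0 is also an assumption.

```python
# Rounded estimates taken from Table 2.
alpha_hat = -49.120
beta_hat = 0.119

def predict_height(dob_decimal_year):
    """Predicted average height (cm) for a given decimal year of birth."""
    return alpha_hat + beta_hat * dob_decimal_year

inside_support = predict_height(1826.0)   # within the observed birth dates
off_support = predict_height(2026.0)      # 200 years beyond them

print(f"Born 1826: {inside_support:.1f} cm")
print(f"Born 2026: {off_support:.1f} cm")
print(f"Implied gain over 200 years: {off_support - inside_support:.1f} cm")
```

The line adds 0.119 cm for every year, so 200 years of extrapolation add almost 24 cm regardless of whether the trend actually continued.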

Errors vs residuals

An important distinction:

  • Errors \(\varepsilon_i = h_i - \alpha - \beta \, d_i\): the difference between a person’s height and the true conditional average. We never observe these because we never know the true \(\alpha\) and \(\beta\).

  • Residuals \(\hat\varepsilon_i = h_i - \hat\alpha - \hat\beta \, d_i\): the difference between observed heights and the estimated regression line. We do observe these.

  • The residuals are our best available stand-in for the unknown errors

  • Our measures of uncertainty are themselves estimates, built from residuals rather than true errors
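The distinction only becomes visible in a simulation, where we choose the true parameters ourselves. Everything below (the parameter values, sample size, and noise level) is made up for illustration.

```python
import random

random.seed(0)

# "True" parameters: known here only because we generate the data ourselves.
alpha, beta = 170.0, 0.1
d = [random.uniform(0, 10) for _ in range(200)]
errors = [random.gauss(0, 6) for _ in d]
h = [alpha + beta * di + e for di, e in zip(d, errors)]

# OLS estimates from the simulated sample.
n = len(d)
d_bar, h_bar = sum(d) / n, sum(h) / n
beta_hat = (sum((di - d_bar) * (hi - h_bar) for di, hi in zip(d, h))
            / sum((di - d_bar) ** 2 for di in d))
alpha_hat = h_bar - beta_hat * d_bar
residuals = [hi - alpha_hat - beta_hat * di for di, hi in zip(d, h)]

sum_sq_errors = sum(e ** 2 for e in errors)
sum_sq_residuals = sum(r ** 2 for r in residuals)

print(f"Mean error:    {sum(errors) / n:.4f}")     # generally not zero
print(f"Mean residual: {sum(residuals) / n:.4f}")  # zero up to rounding
print(f"Sum of squared errors:    {sum_sq_errors:.1f}")
print(f"Sum of squared residuals: {sum_sq_residuals:.1f}")
```

Because OLS picks the line that minimizes the sum of squares in the sample, the squared residuals can never add up to more than the squared errors.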

Statistical uncertainty

  • The regression coefficients come from a sample and are therefore uncertain

  • Just as with a mean, we need a standard error

  • Recall: for the sample mean \(\bar{h}\), the standard error is

\[ \text{SE}(\bar{h}) = \frac{\hat\sigma}{\sqrt{n}} \]

  • More noise \(\rightarrow\) less certain; more data \(\rightarrow\) more certain

Standard error of a slope

The standard error for a regression slope follows the same logic:

\[ \text{SE}(\hat\beta) = \sqrt{\frac{\hat\sigma^2}{\sum_{i=1}^{n}(d_i - \bar{d})^2}} \]

where \(\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat\varepsilon_i^2\)

  • Numerator: how noisy the data are around the regression line

  • Denominator: how spread out the predictor is along the x-axis
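A sketch of the formula on a tiny made-up dataset, small enough to verify by hand. The function name `slope_standard_error` is illustrative.

```python
import math

def slope_standard_error(d, h):
    """Return (beta_hat, SE(beta_hat)) for a regression of h on d with an intercept."""
    n = len(d)
    d_bar = sum(d) / n
    h_bar = sum(h) / n
    s_dd = sum((di - d_bar) ** 2 for di in d)
    beta_hat = sum((di - d_bar) * (hi - h_bar) for di, hi in zip(d, h)) / s_dd
    alpha_hat = h_bar - beta_hat * d_bar
    # Residual variance with n - 2 in the denominator (two estimated parameters).
    ssr = sum((hi - alpha_hat - beta_hat * di) ** 2 for di, hi in zip(d, h))
    sigma2_hat = ssr / (n - 2)
    return beta_hat, math.sqrt(sigma2_hat / s_dd)

# Made-up data: four observations that do not lie exactly on a line.
d = [0.0, 1.0, 2.0, 3.0]
h = [150.0, 153.0, 153.0, 156.0]
beta_hat, se = slope_standard_error(d, h)
print(f"beta_hat = {beta_hat:.3f}, SE = {se:.3f}")
```

Here the squared residuals sum to 1.8, so the estimated variance is 0.9, and the sum of squared deviations of the predictor is 5.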

Why the spread of the predictor matters

  • A regression slope measures a rate of change (cm per year of DOB)

  • To pin down a rate of change, what matters is not just how many people we observe but how spread out they are along the x-axis

  • Two datasets, both with 1,000 observations:

    • First: everyone born within a single month
    • Second: births spread over a decade
  • The second dataset is far more informative about the slope

  • \(\sum(d_i - \bar{d})^2\) captures this: larger when dates are more spread out, making \(\text{SE}(\hat\beta)\) smaller
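To make the comparison concrete, the sketch below places 1,000 evenly spaced birth dates within one month and within one decade. It assumes the same noise level in both datasets (6 cm, a made-up value) and treats it as known rather than estimated.

```python
import math

def slope_se(dates, sigma):
    """SE of the slope for a given noise level sigma and set of predictor values."""
    d_bar = sum(dates) / len(dates)
    s_dd = sum((d - d_bar) ** 2 for d in dates)
    return sigma / math.sqrt(s_dd)

n = 1000
sigma = 6.0  # assumed noise level in cm, identical in both datasets

# Evenly spaced dates in decimal years, starting in 1822.
one_month = [1822 + (i / (n - 1)) * (1 / 12) for i in range(n)]
one_decade = [1822 + (i / (n - 1)) * 10 for i in range(n)]

se_month = slope_se(one_month, sigma)
se_decade = slope_se(one_decade, sigma)
ratio = se_month / se_decade

print(f"SE with one month of births:  {se_month:.2f} cm per year")
print(f"SE with one decade of births: {se_decade:.4f} cm per year")
print(f"Ratio: {ratio:.0f}")
```

A decade is 120 times longer than a month, so with the same number of observations the slope is pinned down 120 times more tightly.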

Confidence intervals

\[ \hat\beta \pm 2 \times \text{SE}(\hat\beta) \]

  • If we repeated our study many times, approximately 95% of these intervals would contain the true value of \(\beta\)

  • Any single interval either contains the truth or it doesn’t, but the procedure is right 95% of the time

  • When zero falls inside the interval: data are consistent with no relationship

  • When zero falls outside the interval: data suggest the true slope is different from zero
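The repeated-study interpretation can be illustrated by simulation. All values below (true parameters, noise level, sample size, number of studies) are made up; the point is only that the share of intervals covering the true slope comes out near 95%.

```python
import math
import random

def slope_and_se(d, h):
    """OLS slope and its standard error for a regression of h on d."""
    n = len(d)
    d_bar = sum(d) / n
    h_bar = sum(h) / n
    s_dd = sum((di - d_bar) ** 2 for di in d)
    beta_hat = sum((di - d_bar) * (hi - h_bar) for di, hi in zip(d, h)) / s_dd
    alpha_hat = h_bar - beta_hat * d_bar
    ssr = sum((hi - alpha_hat - beta_hat * di) ** 2 for di, hi in zip(d, h))
    return beta_hat, math.sqrt(ssr / (n - 2) / s_dd)

random.seed(1)
true_alpha, true_beta, sigma = 170.0, 0.12, 6.0  # assumed "truth"
n_studies, n_obs = 1000, 100

covered = 0
for _ in range(n_studies):
    # Each pass is one hypothetical repetition of the study.
    d = [random.uniform(0, 10) for _ in range(n_obs)]
    h = [true_alpha + true_beta * di + random.gauss(0, sigma) for di in d]
    beta_hat, se = slope_and_se(d, h)
    if beta_hat - 2 * se <= true_beta <= beta_hat + 2 * se:
        covered += 1

coverage = covered / n_studies
print(f"Share of intervals containing the true slope: {coverage:.3f}")
```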

The t-statistic

\[ t = \frac{\hat\beta}{\text{SE}(\hat\beta)} \]

  • How many standard errors our estimate is away from zero

  • Rule of thumb: \(|t| > 2\) means “statistically significant” at conventional levels

  • Equivalent to saying zero lies outside the 95% confidence interval

  • A coefficient of 0.5 with SE of 0.1 (\(t = 5\)) is much more convincing than a coefficient of 0.5 with SE of 0.4 (\(t = 1.25\))
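The two cases in the last bullet, worked through in a short sketch. It also shows the link to the confidence interval: the interval excludes zero exactly when the absolute t-statistic exceeds 2.

```python
def t_statistic(beta_hat, se):
    """Number of standard errors between the estimate and zero."""
    return beta_hat / se

def confidence_interval(beta_hat, se):
    """Approximate 95% interval: estimate plus or minus two standard errors."""
    return beta_hat - 2 * se, beta_hat + 2 * se

results = {}
for se in (0.1, 0.4):
    t = t_statistic(0.5, se)
    low, high = confidence_interval(0.5, se)
    excludes_zero = not (low <= 0 <= high)
    results[se] = (t, low, high, excludes_zero)
    print(f"SE = {se}: t = {t:.2f}, interval = ({low:.1f}, {high:.1f}), "
          f"excludes zero: {excludes_zero}")
```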

Reading a regression table

Table 2: Simple regression of height on date of birth.
                        Height (cm)
Intercept               −49.120
                        (118.754)
Date of birth (year)    0.119+
                        (0.065)
Num.Obs.                1517
R2                      0.002
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Elements of a regression table

  • Coefficient estimates: each row is a variable; the number is the point estimate

  • Standard errors in parentheses: below each coefficient, indicating precision

  • Stars: conventionally + means \(p < 0.1\), * means \(p < 0.05\), ** means \(p < 0.01\), and *** means \(p < 0.001\)

  • Intercept: predicted height when DOB = 0 (not meaningful here, but necessary)

  • \(R^2\): the share of the variation in height explained by DOB alone (close to 0 means very little)

  • Num.Obs. (\(N\)): the number of observations
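A sketch of how the stars follow from a coefficient and its standard error. It assumes a standard normal approximation for the p-value, which is reasonable with 1,517 observations but is not necessarily what the software behind the table used, and it works from the rounded values in Table 2.

```python
import math

def two_sided_p(t):
    """Two-sided p-value for a t-statistic under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))

def stars(p):
    """Significance marker following the table's legend."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "+"
    return ""

# Rounded estimates and standard errors from Table 2.
rows = {"Intercept": (-49.120, 118.754), "Date of birth (year)": (0.119, 0.065)}
markers = {}
for name, (coef, se) in rows.items():
    p = two_sided_p(coef / se)
    markers[name] = stars(p)
    print(f"{name}: t = {coef / se:.2f}, p = {p:.3f}, marker = '{markers[name]}'")
```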

Adding another regressor

We can include both DOB and gender in a multiple regression:

\[ \text{height}_i = \alpha + \beta_1 \cdot \text{dob}_i + \beta_2 \cdot \mathbf{1}[\text{female}_i] + \varepsilon_i \]

  • \(\mathbf{1}[\text{female}_i]\) is an indicator variable (equals 1 if female, 0 if male)

  • For a male: predicted height is \(\alpha + \beta_1 d\)

  • For a female: predicted height is \(\alpha + \beta_1 d + \beta_2\)

  • \(\beta_2\) measures the average height difference for females relative to males
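To see the indicator at work, the sketch below plugs in the rounded column (2) estimates reported in Table 3 later in these slides. The birth year 1825 is an arbitrary choice for illustration.

```python
# Rounded estimates from column (2) of Table 3.
alpha_hat = -141.882
beta1_hat = 0.171     # date of birth (decimal year)
beta2_hat = -10.739   # female indicator

def predict_height(dob_year, female):
    """Predicted height (cm); the indicator is 1 if female and 0 if male."""
    indicator = 1 if female else 0
    return alpha_hat + beta1_hat * dob_year + beta2_hat * indicator

male_1825 = predict_height(1825.0, female=False)
female_1825 = predict_height(1825.0, female=True)

print(f"Male born 1825:   {male_1825:.1f} cm")
print(f"Female born 1825: {female_1825:.1f} cm")
print(f"Difference:       {female_1825 - male_1825:.3f} cm")
```

The two predictions differ by exactly the female coefficient, whatever birth date is chosen.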

“Holding constant”

Multiple regression estimates each coefficient holding the other variables constant:

  • \(\beta_1\): the effect of DOB on height holding gender constant — comparing people of the same gender born at different dates

  • \(\beta_2\): the average height difference for females relative to males holding DOB constant — comparing men and women born at the same time

Visualizing “holding constant”

Figure 5: How multiple regression isolates the DOB effect after removing gender. Panel A: raw data coloured by gender, with group means marked. Panel B: after subtracting each group’s mean height, the two clouds overlap vertically; group mean DOB is marked. Panel C: after also subtracting each group’s mean DOB, the data are centred at the origin. Panel D: the regression slope through the doubly-residualized data equals \(\hat\beta_1\) from the multiple regression.

What the panels show

  • Panel A: Raw data coloured by gender, with group mean heights marked

  • Panel B: After subtracting each group’s mean height, the two clouds overlap vertically

  • Panel C: After also subtracting each group’s mean DOB, both variables are “cleaned” of gender

  • Panel D: The regression slope through this doubly-residualized data equals \(\hat\beta_1\) from the multiple regression (Frisch-Waugh-Lovell theorem)

  • Multiple regression is, at its core, about comparing like with like
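The equivalence in Panel D can be checked numerically. This sketch uses simulated data (all parameter values are made up, and DOB is measured in years since 1820) and a small hand-written solver for the normal equations in place of a statistics library.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * yi for u, yi in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)

def demean_by_group(values, groups):
    """Subtract each group's mean from its members (Panels B and C)."""
    means = {g: sum(v for v, gg in zip(values, groups) if gg == g) / groups.count(g)
             for g in set(groups)}
    return [v - means[g] for v, g in zip(values, groups)]

random.seed(2)
n = 300
female = [1 if random.random() < 0.3 else 0 for _ in range(n)]
# Women are born slightly later on average, so gender and DOB are correlated.
dob = [random.uniform(0, 8) + 2 * f for f in female]
height = [170 + 0.2 * d - 11 * f + random.gauss(0, 6) for d, f in zip(dob, female)]

# Multiple regression of height on an intercept, DOB, and the female indicator.
_, beta1_multiple, _ = ols([[1.0] * n, dob, female], height)

# Simple regression slope through the doubly-demeaned data (Panel D).
h_res = demean_by_group(height, female)
d_res = demean_by_group(dob, female)
beta1_fwl = sum(d * h for d, h in zip(d_res, h_res)) / sum(d * d for d in d_res)

print(f"Multiple regression slope: {beta1_multiple:.6f}")
print(f"Residualized slope:        {beta1_fwl:.6f}")
```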

Multiple regression table

Table 3: Regression results: simple and multiple regression.
                        Height (cm) (1)    Height (cm) (2)
Intercept               −49.120            −141.882
                        (118.754)          (100.340)
Date of birth (year)    0.119+             0.171**
                        (0.065)            (0.055)
Female                                     −10.739***
                                           (0.434)
Num.Obs.                1517               1517
R2                      0.002              0.289
R2 Adj.                 0.002              0.288
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Comparing the two columns

  • DOB coefficient increases from (1) to (2): the simple regression was attenuated by mixing men and women. The standard error falls because including gender shrinks the residuals.

  • Female coefficient is large and negative: the well-known average height difference between men and women. Statistically significant at the 0.1% level (three stars, \(p < 0.001\)).

  • \(R^2\) jumps substantially: DOB alone explains little of the variation in height, but gender accounts for the roughly 11 cm average gap between men and women.

  • Adjusted \(R^2\) penalizes for number of predictors. When it rises meaningfully, the added variable genuinely improves fit.
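The penalty can be made explicit with the usual adjusted \(R^2\) formula, applied here to the rounded column (2) values from Table 3. That the table's software uses exactly this formula is an assumption, though the result agrees with the reported figure.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (not counting the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Column (2) of Table 3: R2 = 0.289, 1517 observations, 2 predictors.
adj = adjusted_r2(0.289, 1517, 2)
print(round(adj, 3))  # 0.288
```

With 1,517 observations and only two predictors the penalty is tiny; it matters more when the number of predictors is large relative to the sample.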

Key takeaways

  • Regression is a way of comparing averages — it computes a linear approximation to the conditional mean

  • OLS picks the line that minimizes the sum of squared errors

  • Standard errors tell us about the precision of our estimates

  • Multiple regression estimates each coefficient holding other variables constant

  • Always check: is the relationship linear? Are you predicting off the support?

Bibliography

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781139161879.