Can We Trust the Estimates?
Standard errors, t-tests, and confidence intervals
If we collected new data, the line would shift. How much can we trust our estimate β̂₁?
Here's how we'll figure out if the slope is trustworthy or just noise.
How much would the slope change if we collected new data?
Is the slope far enough from zero to be meaningful?
What's the probability this result happened by chance?
What range of slopes is plausible?
Collect a fresh dataset, refit the line, and repeat 500 times. The resulting slopes form a distribution — the sampling distribution.
Distribution of Estimated Slopes (500 samples)
The spread of that distribution is the standard error.
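We can sketch this experiment in code. The data-generating process below (a true slope of 375, noise, and 15 houses per sample) is made up for illustration — it is not the chapter's actual dataset — but it shows how refitting the line on fresh data produces a distribution of slopes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: price = 40000 + 375 * size + noise.
# The true slope (375) and noise level are illustrative assumptions.
true_intercept, true_slope, noise_sd = 40_000, 375, 28_000
sizes = rng.uniform(800, 2600, size=15)  # 15 houses per sample

slopes = []
for _ in range(500):
    # Fresh noise each run = a fresh "dataset"; refit the line every time.
    prices = true_intercept + true_slope * sizes + rng.normal(0, noise_sd, sizes.size)
    slope, intercept = np.polyfit(sizes, prices, 1)  # least-squares fit
    slopes.append(slope)

slopes = np.array(slopes)
print(f"mean slope  : {slopes.mean():.1f}")   # centers near the true slope
print(f"spread (SE) : {slopes.std(ddof=1):.1f}")
```

The standard deviation printed at the end is exactly the spread the next slide names: the standard error.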
We just saw that slopes vary across samples. Ideally, we'd measure that spread by running hundreds of experiments — but in real life, we only have one dataset.
So statisticians developed a formula that estimates the spread from a single sample. That estimate is the standard error (SE).
Sampling Distribution with Standard Error
What's MSE? It stands for Mean Squared Error — the average size of the squared residuals. It's the SSE from Chapter 4, divided by n − 2 (degrees of freedom): MSE = SSE / (n − 2).
The SE formula has two ingredients: MSE (how noisy the data is) in the numerator, and ∑(xi − x̄)² (the spread of x-values) in the denominator: SE(β̂₁) = √(MSE / ∑(xi − x̄)²). More noise → bigger SE. More spread in x → smaller SE.
Computed from one sample using the formula above. This is what you'd use in practice — you only have one dataset.
The actual standard deviation of slopes across 500 samples. This is the "true" spread, but requires repeating the experiment many times.
Run the simulations above to see how the formula-based SE compares to the actual spread of slopes across many samples.
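Here is the formula applied to a single sample. The dataset is a small made-up example (15 houses with assumed sizes, prices, and noise), not the chapter's data:

```python
import numpy as np

rng = np.random.default_rng(1)

# One hypothetical sample of 15 houses; numbers are illustrative assumptions.
x = rng.uniform(800, 2600, size=15)               # house sizes (sqft)
y = 40_000 + 375 * x + rng.normal(0, 28_000, 15)  # prices ($) with noise

# Fit the line, then apply the SE formula to this one sample.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

n = x.size
mse = np.sum(residuals**2) / (n - 2)              # SSE / degrees of freedom
se_b1 = np.sqrt(mse / np.sum((x - x.mean())**2))  # formula-based SE

print(f"slope estimate : {b1:.1f}")
print(f"formula SE     : {se_b1:.2f}")
```

Computed this way from a single dataset, the formula-based SE should land close to the empirical standard deviation of slopes you would get by repeating the experiment hundreds of times.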
The t-statistic: how many standard errors is β̂₁ away from zero?
The p-value: if the true slope were 0, how likely is a t-stat this extreme?
The 95% confidence interval: we're 95% confident the true slope is in this range.
More data → smaller standard error → tighter confidence interval.
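All three quantities follow mechanically from the slope estimate and its SE. A sketch using SciPy's t-distribution, with the estimate, SE, and sample size (n = 15) taken as given inputs:

```python
from scipy import stats

# Slope estimate and SE from a regression with n = 15 houses (given inputs).
n = 15
b1, se_b1 = 375.1, 46.89
df = n - 2  # degrees of freedom

t_stat = b1 / se_b1                              # standard errors from zero
p_value = 2 * stats.t.sf(abs(t_stat), df)        # two-sided p-value
t_crit = stats.t.ppf(0.975, df)                  # 97.5th percentile of t(df)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # 95% confidence interval

print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
print(f"95% CI: [{ci[0]:.1f}, {ci[1]:.1f}]")
```

Note the critical value comes from a t-distribution with n − 2 degrees of freedom, not the familiar 1.96 from the normal distribution — with small samples, the t critical value is noticeably larger, which widens the interval.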
In practice, you'll see results presented as a regression table. Here's how to read one.
| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI |
|---|---|---|---|---|---|
| Intercept (β₀) | 44262.2 | 18037.07 | 2.45 | 0.0290 | [5302.1, 83222.2] |
| House Size (β₁) | 375.1 | 46.89 | 8.00 | < 0.001 | [273.8, 476.4] |
Let's walk through what each part of the table tells us.
The predicted price when house size is 0 sqft. In practice, this is rarely meaningful on its own — no house has 0 sqft. It anchors the line so that predictions in the observed range are accurate.
p = 0.0290 — statistically significant, but that just means the line doesn't pass through the origin.
For each additional 1 sqft of house size, the predicted price increases by $375. This is the main finding of the regression.
SE = 46.89 — tells us the slope estimate could plausibly shift by this much with different data.
t = 8.00 — the slope is 8.0 standard errors away from zero. That's far.
p < 0.001 — if house size had no relationship with price, the chance of seeing a slope this extreme is essentially zero.
95% CI = [273.8, 476.4] — we're 95% confident the true slope falls in this range. Since the interval doesn't include 0, the relationship is statistically significant.
R² = 0.831 — House size explains 83.1% of the variation in price. The remaining 16.9% is driven by other factors (location, condition, etc.).
RMSE = $27,836 — On average, our predictions are off by about $27,836. This gives you a sense of the model's practical accuracy in the same units as the outcome.
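Both fit measures come straight from the residuals. The tiny arrays of actual and predicted prices below are made up to keep the arithmetic visible — they are not the chapter's housing data:

```python
import numpy as np

# Hypothetical actual and predicted prices for five houses (illustrative).
y = np.array([310_000, 255_000, 420_000, 365_000, 290_000])
y_hat = np.array([300_000, 270_000, 410_000, 350_000, 305_000])

ss_res = np.sum((y - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)       # total variation in price
r_squared = 1 - ss_res / ss_tot            # share of variation explained
rmse = np.sqrt(np.mean((y - y_hat) ** 2))  # typical error, in dollars

print(f"R²   : {r_squared:.3f}")
print(f"RMSE : ${rmse:,.0f}")
```

R² is unitless (a share of variance), while RMSE carries the outcome's units — which is why RMSE is usually the more intuitive gauge of practical accuracy.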
Finally, let's be clear about what this analysis can and cannot tell us.
- There is a statistically significant association between house size and price.
- Larger houses tend to have higher prices. On average, each additional sqft is associated with ~$375 more.
- House size alone accounts for 83% of the price variation in this dataset.
- The result is unlikely to be due to random chance (p < 0.001).
- It cannot show that adding a sqft *causes* the price to go up by $375. Correlation is not causation: bigger houses may be in better neighborhoods, have more bedrooms, or be newer — these confounders could be driving the relationship.
- It does not capture the full picture. With R² = 0.83, there is still 17% unexplained variation; important predictors are missing.
- It does not guarantee the relationship is linear everywhere. Our model assumes a straight line, but the true relationship could curve at very small or very large sizes.
- It does not let us predict outside our data range. Extrapolating to, say, a 5,000 sqft mansion is risky — the pattern may not hold beyond the observed range.
To establish causation, we would need a controlled experiment or advanced techniques (like instrumental variables, difference-in-differences, or regression discontinuity) that account for confounders. That's what we'll explore in future courses.