Evaluating the Model Fit
How well does the line explain the data?
We found the best line. But how good is it? We need a way to measure what the line explains vs. what it misses.
House prices vary a lot. Some of that variation is because of house size (which our line captures), and some is due to other factors we don't model.
To measure this, we split the total variation into two parts: what the line explains and what it doesn't.
Step 1: How much do prices vary overall? Measure each point's distance from the average price.
Each purple line shows how far a house's price is from the average. TSS, the total sum of squares, is the sum of all those gaps, squared.
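Step 1 is one line of NumPy. A minimal sketch with made-up prices, not the chapter's dataset:

```python
import numpy as np

# Hypothetical house prices in dollars (illustrative only)
price = np.array([245_000, 312_000, 279_000, 308_000, 365_000], dtype=float)

# Each price's gap from the average price, squared and summed
tss = np.sum((price - price.mean()) ** 2)
print(tss)  # total sum of squares, in squared dollars
```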
Step 2: How much does our line improve over the mean? The explained variation.
Each green line shows the distance between the line's prediction (ŷ) and the mean (ȳ). Summing those gaps, squared, gives SSR, the explained (regression) sum of squares: what the line adds over just guessing the average.
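Step 2 can be sketched the same way; `np.polyfit` does the least-squares fit. The sizes and prices are made up for illustration:

```python
import numpy as np

# Hypothetical data: sizes in sq ft, prices in dollars (illustrative only)
size  = np.array([1400, 1600, 1700, 1875, 2350], dtype=float)
price = np.array([245_000, 312_000, 279_000, 308_000, 410_000], dtype=float)

# Least-squares line: price ≈ slope * size + intercept
slope, intercept = np.polyfit(size, price, 1)
y_hat = slope * size + intercept

# Explained variation: squared gaps between predictions and the mean price
ssr = np.sum((y_hat - price.mean()) ** 2)
```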
Step 3: What's left over? The gaps between predictions and actual values — our old friends, the residuals.
Each red line is a residual, the part our line couldn't explain. These are the same errors from Chapter 2; squared and summed, they give SSE, the sum of squared errors.
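Step 3 in code, again on toy numbers rather than the chapter's dataset:

```python
import numpy as np

# Hypothetical data: sizes in sq ft, prices in dollars (illustrative only)
size  = np.array([1400, 1600, 1700, 1875, 2350], dtype=float)
price = np.array([245_000, 312_000, 279_000, 308_000, 410_000], dtype=float)

slope, intercept = np.polyfit(size, price, 1)
y_hat = slope * size + intercept

# Residuals: actual minus predicted, i.e. what the line misses
residuals = price - y_hat
sse = np.sum(residuals ** 2)
```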
TSS = SSR + SSE. The total variation splits neatly into explained + unexplained.
Every point's total gap from the mean splits into two parts: the part the line explains + the part left over.
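The split is easy to verify numerically; it holds exactly for a least-squares line with an intercept. Toy numbers again:

```python
import numpy as np

# Hypothetical data (illustrative only)
size  = np.array([1400, 1600, 1700, 1875, 2350], dtype=float)
price = np.array([245_000, 312_000, 279_000, 308_000, 410_000], dtype=float)

slope, intercept = np.polyfit(size, price, 1)
y_hat = slope * size + intercept

tss = np.sum((price - price.mean()) ** 2)  # total variation
ssr = np.sum((y_hat - price.mean()) ** 2)  # explained
sse = np.sum((price - y_hat) ** 2)         # unexplained

print(np.isclose(tss, ssr + sse))  # True
```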
R² = SSR / TSS = 1 − SSE / TSS: the line explains this fraction of the total price variation.
House size explains 83.1% of the variation in price. The remaining 16.9% is unexplained.
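R² falls out of the sums of squares, and both forms of the formula agree. (Toy data, so the value here won't match the chapter's 83.1%.)

```python
import numpy as np

# Hypothetical data (illustrative only)
size  = np.array([1400, 1600, 1700, 1875, 2350], dtype=float)
price = np.array([245_000, 312_000, 279_000, 308_000, 410_000], dtype=float)

slope, intercept = np.polyfit(size, price, 1)
y_hat = slope * size + intercept

tss = np.sum((price - price.mean()) ** 2)
sse = np.sum((price - y_hat) ** 2)
ssr = tss - sse

r_squared = ssr / tss  # fraction of variation explained
# equivalently: r_squared = 1 - sse / tss
```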
R² tells us the proportion explained. But how far off are our predictions in dollars?
SSE = 11,622,569,746 … but that number is in squared dollars ($²). Hard to interpret! We need to convert it back to regular dollars.
RMSE = √(SSE / n) ≈ $27,836: roughly speaking, our predictions are typically off by about $27,836. Unlike R², RMSE is in the same units as the data, so you can directly feel how big the errors are.
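In code: divide SSE by the number of houses and take the square root, which converts squared dollars back to dollars. (This sketch divides by n; some texts divide by n − 2, the residual degrees of freedom, which gives a slightly larger number on small datasets. Toy data again, so the value won't match the chapter's $27,836.)

```python
import numpy as np

# Hypothetical data (illustrative only)
size  = np.array([1400, 1600, 1700, 1875, 2350], dtype=float)
price = np.array([245_000, 312_000, 279_000, 308_000, 410_000], dtype=float)

slope, intercept = np.polyfit(size, price, 1)
y_hat = slope * size + intercept

sse = np.sum((price - y_hat) ** 2)  # squared dollars
rmse = np.sqrt(sse / len(price))    # back to dollars
```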