Regression deals with predicting continuous responses (outputs). Think of a continuous variable as a floating-point number — the response can take any real value within a range.
We learn y ≈ f(x) from data. Because the data are noisy, the learned relationship is always approximate: expect error!

Rather than using a real-world dataset, the lecture generated all data artificially using random number generators in R. This is a deliberate teaching choice: it means we know the ground truth, which is impossible in any real application.
The true model is a quadratic: y* = β₀* + β₁*x + β₂*x², with β₀* = 0.33, β₁* = 1.15, β₂* = −2.25

This is a parabola. We set the exact coefficient values ourselves, something never possible in a real problem.
ε ~ random noise, N = 30 points
Random noise ε is added to the true signal to simulate what real measurements look like. The result is N=30 synthetic input-output pairs {(xₙ, yₙ)}.
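As a sketch, the data-generating step can be reproduced in base R. The true coefficients match the lecture; the seed, the x-range, and the noise level are assumptions for illustration:

```r
# Generate N = 30 synthetic (x, y) pairs from a known quadratic truth.
set.seed(42)                              # assumed seed, for reproducibility
N <- 30
x <- runif(N, min = -1, max = 1)          # assumed input range
y_star <- 0.33 + 1.15 * x - 2.25 * x^2    # true signal y* (a parabola)
eps <- rnorm(N, mean = 0, sd = 0.5)       # assumed noise level
y <- y_star + eps                         # observed noisy responses
my_train <- data.frame(x = x, y = y)      # the N = 30 input-output pairs
```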
Even though we generated the data, we then deliberately pretend we don't know the truth — just as we would in a real ML project.
- True signal y* — a clean parabola (red curve)
- Observed data — 30 points scattered around it
- Question: can our methods recover the parabola from noise alone?
Knowing the ground truth lets us:
- Compare estimated β's to their TRUE values
- See exactly when and why models fail
- Diagnose overfitting vs underfitting visually
- Validate evaluation methods like cross-validation
Models are fit with `lm()` in R.

```r
# Fit a simple linear (degree-1) model
mod1 <- lm(y ~ x, data = my_train)  # formula reads: "y is a function of x"
# β₀ = Intercept, β₁ = slope on x
summary(mod1)    # text summary
coefplot(mod1)   # visualize coefficients + CIs
```
The coefplot() function from the coefplot package plots coefficient estimates with confidence intervals. If zero is inside the CI for β₁, the slope is not statistically significant.
```r
# x² must be wrapped in I() so ^ is not misinterpreted
mod2 <- lm(y ~ x + I(x^2), data = my_train)
# Or use poly() for orthogonal polynomials
# Formula: y ~ x + <predictor2> = additive model
```
The lecture fits 9 models: degree 0 (intercept only) through degree 8.
| Degree | Formula | Coefficients | Notes |
|---|---|---|---|
| 0 | y ~ 1 | 1 | Constant / mean only |
| 1 | y ~ x | 2 | Straight line |
| 2 | y ~ x + I(x^2) | 3 | ← TRUE degree |
| 3 | y ~ x + I(x^2) + I(x^3) | 4 | Cubic |
| … | … | … | … |
| 8 | 8 predictors | 9 | Severely overfit |
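All nine fits can be produced in one loop. This is a sketch, not the lecture's code: it uses `poly(..., raw = TRUE)` for ordinary powers, and it regenerates a small synthetic training set (seed and noise level assumed) so the snippet is self-contained:

```r
# Self-contained synthetic training set (assumed setup)
set.seed(1)
my_train <- data.frame(x = runif(30, -1, 1))
my_train$y <- 0.33 + 1.15 * my_train$x - 2.25 * my_train$x^2 +
  rnorm(30, sd = 0.5)

# Fit all 9 models: degree 0 (intercept only) through degree 8
models <- lapply(0:8, function(d) {
  if (d == 0) {
    lm(y ~ 1, data = my_train)                       # constant / mean only
  } else {
    lm(y ~ poly(x, d, raw = TRUE), data = my_train)  # degree-d polynomial
  }
})

# A degree-d polynomial has d + 1 coefficients
sapply(models, function(m) length(coef(m)))  # 1 2 3 4 5 6 7 8 9
```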
```r
# Create a "fine" test grid of x values for visualization
library(tibble)  # tibble() comes from the tibble package
test_viz <- tibble(x = seq(-1.5, 1.5, length.out = 51))

# Predict with a single model
preds <- predict(mod2, test_viz)

# Predict WITH uncertainty intervals
conf_int <- predict(mod2, test_viz, interval = "confidence")
pred_int <- predict(mod2, test_viz, interval = "prediction")
```
⚠️ High-degree polynomials (7th, 8th) have enormous prediction intervals — the model is highly uncertain about its own mean trend.
In a real problem, we cannot compare coefficient estimates to "true" values because we don't know the truth. We need an objective measure of how well the model explains the data.
| Metric | Formula | Units | Best Value |
|---|---|---|---|
| MSE (Mean Squared Error) | Σ(yᵢ − ŷᵢ)² / N | Units² | 0 (lower = better) |
| RMSE (Root MSE) | √MSE | Same as y | 0 (lower = better) |
| MAE (Mean Absolute Error) | Σ\|yᵢ − ŷᵢ\| / N | Same as y | 0 (lower = better) |
| R² (R-squared) | corr(y, ŷ)² | Unitless, [0, 1] | 1 (higher = better) |
R² is the squared correlation between the model predictions (ŷ) and the observed responses (y).
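These metrics are easy to compute by hand in base R. The helper function names below are ours, not from a package:

```r
# The four metrics computed by hand
mse  <- function(y, y_hat) mean((y - y_hat)^2)    # units of y squared
rmse <- function(y, y_hat) sqrt(mse(y, y_hat))    # same units as y
mae  <- function(y, y_hat) mean(abs(y - y_hat))   # same units as y
r2   <- function(y, y_hat) cor(y, y_hat)^2        # unitless, in [0, 1]

# Tiny worked example
y     <- c(1, 2, 3, 4)
y_hat <- c(1.1, 1.9, 3.2, 3.8)
mse(y, y_hat)   # mean of (0.01, 0.01, 0.04, 0.04) = 0.025
mae(y, y_hat)   # mean of (0.1, 0.1, 0.2, 0.2) = 0.15
```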
```r
# Using the modelr package
library(modelr)
rsquare(mod2, my_train)  # training R² of the quadratic model
rsquare(mod8, my_train)  # training R² of the 8th-degree model
```
⚠️ On the training set, R² keeps increasing as degree increases — but this is misleading! The 8th degree model is overfit.
```r
# modelr::rmse(model_object, data_frame)
rmse(mod1, my_train)    # linear model training RMSE
rmse(mod3, my_train)    # cubic model training RMSE
# Evaluate on hold-out test set
rmse(mod3, test_split)  # generalisation error
```
- Training RMSE keeps decreasing, but test/CV RMSE increases
- Coefficient estimates are enormous (e.g. ±20 for degree-8)
- Confidence intervals on coefficients are extremely wide
- Prediction intervals span huge y-axis ranges
- Model predictions change drastically when the training set changes slightly (high variance across CV folds)
For polynomial regression, complexity is measured by the polynomial degree. Each additional degree adds one more coefficient (parameter) to estimate from the data.
More coefficients → more flexibility → potentially captures more signal but also more noise.
We want to strike a balance between underfit and overfit — a model that is "just right." This is the essence of the bias-variance trade-off and model selection.
Training set performance is always optimistic — the model has already "seen" that data. The 8th-degree polynomial gets the best training RMSE yet is clearly wrong. We need a way to estimate performance on new, unseen data.
Split the dataset once before doing anything else. Common rule of thumb: 80% training, 20% hold-out test.
```r
# Simple random split in R
idx <- sample(1:nrow(my_data), size = 0.8 * nrow(my_data))
train_split <- my_data[idx, ]
test_split  <- my_data[-idx, ]
```
Partition the training data into k folds. Each observation appears in a test set exactly once.
| Method | k | Test size per fold | Trade-off |
|---|---|---|---|
| 5-fold CV | 5 | 20% of data | Fewer folds, more points per test set, less variance in estimates |
| 10-fold CV | 10 | 10% of data | More folds, smaller test sets, more computation |
| LOO-CV | N | 1 point | Exact but very slow for large datasets |
| Repeated k-fold | k × r | Varies | Runs k-fold r times with different splits; more stable |
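A minimal k-fold CV fits in about a dozen lines of base R. This is a sketch, not the lecture's code: `cv_rmse` and the data setup are our own names and assumptions:

```r
# Self-contained synthetic data (assumed setup)
set.seed(2)
dat <- data.frame(x = runif(30, -1, 1))
dat$y <- 0.33 + 1.15 * dat$x - 2.25 * dat$x^2 + rnorm(30, sd = 0.5)

# k-fold CV estimate of RMSE for a polynomial of a given degree
cv_rmse <- function(dat, degree, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(dat)))  # each point in 1 fold
  fold_rmse <- sapply(1:k, function(i) {
    train <- dat[folds != i, ]                       # k-1 folds to fit on
    test  <- dat[folds == i, ]                       # 1 held-out fold
    fit <- if (degree == 0) {
      lm(y ~ 1, data = train)
    } else {
      lm(y ~ poly(x, degree, raw = TRUE), data = train)
    }
    sqrt(mean((test$y - predict(fit, newdata = test))^2))  # fold RMSE
  })
  mean(fold_rmse)   # average held-out RMSE over the k folds
}

cv_rmse(dat, degree = 2)   # CV estimate for the quadratic model
```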
Even after CV, two models may appear statistically equivalent within the margin of error. Use the 1-SE Rule: of all models whose CV error lies within one standard error of the lowest CV error, choose the simplest.
In the lecture, the quadratic and cubic models both perform within 1-SE of each other. By the 1-SE rule we prefer the quadratic — which happens to be the true model!
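As a sketch of the mechanics (the per-fold RMSEs below are hypothetical illustrative numbers, not the lecture's actual results):

```r
# 1-SE rule: among models within one SE of the best CV error, pick the
# simplest. Per-fold RMSEs here are made up for illustration.
cv_errors <- list(               # listed simplest first, so that
  deg2 = c(0.52, 0.48, 0.55, 0.50, 0.49),   # candidates[1] below is
  deg3 = c(0.51, 0.50, 0.54, 0.53, 0.47)    # the simplest candidate
)
means <- sapply(cv_errors, mean)
ses   <- sapply(cv_errors, function(e) sd(e) / sqrt(length(e)))

best      <- which.min(means)               # model with lowest CV error
threshold <- means[best] + ses[best]        # best mean + one SE
candidates <- names(means)[means <= threshold]
candidates[1]                               # simplest model within 1 SE
```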