Overfitting, resampling, k-fold cross-validation, and the 1-SE rule all carry over directly to classification. This lecture introduces the new problem type and its specific performance metrics. The generalization concerns are identical.
In regression, the response y is a continuous number (e.g. 3.14, −0.07, 22.5). In classification, the response is categorical: it belongs to one of a discrete set of classes (e.g. Hit/Out, Yes/No, 1/0).
This lecture focuses on the binary case — exactly two possible outcomes. Real-world examples include: hit/out in baseball, default/no-default on a loan, cancer/no cancer, stay/leave on a website.
The lecture walks through this question step by step, and the answer is more nuanced than a flat "no".
Just like in Week 2, the lecture uses artificially generated data so we can know the ground truth and verify our methods. The data were created by drawing each response from a Bernoulli distribution with event probability f(x, β):

y ~ Bernoulli(f(x, β)), where 0 ≤ f(x, β) ≤ 1 always
The model is fit by minimizing a loss function based on the likelihood of the observed response (y=1 or y=0) given the modeled probability. This is fundamentally different from the squared error used in regression.
To make the response numeric, we apply a simple encoding before fitting:
| Outcome | Encoding | Error formulation |
|---|---|---|
| EVENT occurs | y = 1 | 1 − f(x) = error |
| NON-EVENT occurs | y = 0 | 0 − f(x) = −f(x) = error |
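The likelihood-based loss mentioned above can be made concrete: each observation contributes the negative log-likelihood of its observed outcome under the modeled probability (often called log loss). A minimal sketch with made-up values, not the lecture's data:

```r
# Log loss (negative log-likelihood) for encoded y and modeled probability p:
# y = 1 contributes -log(p); y = 0 contributes -log(1 - p)
log_loss <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1, 1, 0)            # encoded outcomes (EVENT = 1, NON-EVENT = 0)
p <- c(0.9, 0.2, 0.6, 0.7, 0.1)  # modeled probabilities f(x)
log_loss(y, p)                   # small when p agrees with y, large when confidently wrong
```

Note how, unlike squared error, the loss grows without bound as a confident prediction (p near 0 or 1) lands on the wrong side.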
The true event probability curve in the toy problem (and in logistic regression generally) has an S-shape (sigmoid). You will learn the mathematical reason later in the semester. For now, understand its behavior:
- Left tail (small x): probability approaches 0; events are rare.
- Middle: probability is near 0.5; outcomes are maximally uncertain.
- Right tail (large x): probability approaches 1; events are common.
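The three regimes above can be seen by evaluating the logistic (sigmoid) function, 1 / (1 + e^(−z)), at a few illustrative points:

```r
# Logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(-5)  # near 0: events rare
sigmoid(0)   # exactly 0.5: maximal uncertainty
sigmoid(5)   # near 1: events common
```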
```r
# Classification uses glm() instead of lm()
# Must specify the family (sets the loss function)
mod_logistic <- glm(y ~ x, data = my_data, family = binomial())

# Same formula interface as lm(): response ~ predictor(s)
summary(mod_logistic)
```
Setting `family = binomial()` triggers the likelihood-based loss appropriate for binary 0/1 responses. The model outputs a probability; to get a predicted class, we compare it to a threshold:
if μ ≥ threshold → predict EVENT (ŷ = 1)
if μ < threshold → predict NON-EVENT (ŷ = 0)
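The thresholding step can be sketched end to end. The data below are made up for illustration, and the 0.5 threshold is an arbitrary choice, not a recommendation:

```r
# Toy data (made up for illustration)
set.seed(4)
my_data <- data.frame(x = rnorm(100))
my_data$y <- rbinom(100, 1, 1 / (1 + exp(-my_data$x)))

mod_logistic <- glm(y ~ x, data = my_data, family = binomial())

mu_hat <- predict(mod_logistic, type = "response")  # predicted probabilities
threshold <- 0.5                                    # arbitrary illustrative cut-off
y_hat <- ifelse(mu_hat >= threshold, 1, 0)          # EVENT iff mu >= threshold
table(y_hat)                                        # counts of predicted classes
```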
Changing the threshold shifts the trade-off between sensitivity and specificity.
A NON-EVENT is incorrectly predicted as an EVENT. Like a false accusation.
Costly when: spam filters block important emails, plagiarism detectors flag clean work, criminal risk tools falsely flag innocent people.
An EVENT is incorrectly predicted as a NON-EVENT. A missed detection.
Costly when: disease screening misses cancer, fault detection misses failures, fraud detection misses real fraud.
With binary classification there are 4 possible combinations of predicted and observed class. The confusion matrix counts how many observations fall into each cell.
Main diagonal (TP + TN) = correct predictions. Off-diagonal (FP + FN) = errors.
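In R, the confusion matrix is a cross-tabulation of predicted against observed classes; base `table()` is enough. A sketch with made-up class vectors:

```r
observed  <- c(1, 0, 1, 1, 0, 0, 1, 0)  # true classes (toy values)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)  # model's predicted classes (toy values)

# Rows = predicted, columns = observed; diagonal cells are the correct calls
conf_mat <- table(Predicted = predicted, Observed = observed)
conf_mat
```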
| Metric | Formula | Focuses on |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correct rate |
| Sensitivity (TPR) | TP / (TP + FN) | Observed EVENTs only |
| Specificity (TNR) | TN / (TN + FP) | Observed NON-EVENTs only |
| False Positive Rate | FP / (FP + TN) = 1 − Specificity | NON-EVENTs called events |
Accuracy collapses all four cells into a single number. It does NOT tell you:
- Whether failures are False Positives or False Negatives
- How the model performs within each class separately
- How performance changes as the threshold changes
Sensitivity: out of all actual events, how many did we catch? It looks at events only.
Specificity: out of all actual non-events, how many did we correctly call non-events?
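The formulas in the table reduce to a few lines of arithmetic once the four cell counts are known. The counts below are invented for illustration:

```r
# Cell counts from a confusion matrix (toy values)
TP <- 40; TN <- 45; FP <- 5; FN <- 10

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # overall correct rate
sensitivity <- TP / (TP + FN)                   # TPR: share of actual events caught
specificity <- TN / (TN + FP)                   # TNR: share of actual non-events caught
fpr         <- FP / (FP + TN)                   # = 1 - specificity
c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, fpr = fpr)
```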
Instead of evaluating the model at a single threshold, we sweep through all possible thresholds and plot the resulting (FPR, TPR) pairs. This is the Receiver Operating Characteristic (ROC) curve.
Low threshold: nearly everything classified as EVENT → TPR high, FPR high. Upper-right corner.
High threshold: nearly everything classified as NON-EVENT → TPR low, FPR low. Lower-left corner.
A perfect model's ROC curve goes straight up the y-axis to (0, 1), then right along the top: at some threshold, TPR = 1 and FPR = 0 simultaneously. Real models never achieve this.
A useless model's ROC curve follows the diagonal from (0, 0) to (1, 1). Lowering the threshold increases FPR and TPR equally — no discrimination ability whatsoever.
The ROC curve can be summarised by the area under it, the AUC. This gives a single threshold-independent measure of the model's discrimination ability.
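The sweep can be done by hand: compute (FPR, TPR) at each candidate threshold, then approximate the AUC with the trapezoidal rule. A sketch on simulated data (the data-generating choices below are arbitrary, picked only so events tend to get higher probabilities):

```r
set.seed(1)
y <- rbinom(200, 1, 0.5)                                  # toy observed classes
p <- ifelse(y == 1, rbeta(200, 3, 2), rbeta(200, 2, 3))   # toy probabilities, higher for events

thresholds <- seq(1, 0, by = -0.01)                       # from strict to lenient
tpr <- sapply(thresholds, function(t) mean(p[y == 1] >= t))  # sensitivity at each threshold
fpr <- sapply(thresholds, function(t) mean(p[y == 0] >= t))  # 1 - specificity at each threshold

# Trapezoidal area under the (fpr, tpr) curve
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

Packages such as pROC automate this, but the manual sweep makes the threshold-independence of AUC explicit.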
| AUC Range | Interpretation |
|---|---|
| 0.5 | No better than random guessing |
| 0.5 – 0.7 | Poor discrimination |
| 0.7 – 0.8 | Acceptable |
| 0.8 – 0.9 | Excellent |
| 0.9 – 1.0 | Outstanding |
An ideal ROC curve should:
- Rise steeply toward TPR=1 while FPR remains near 0 (upper-left bulge)
- Remain close to 1 on the TPR axis as FPR grows
- Show a large area between the curve and the diagonal
Accuracy and ROC/AUC compare predicted classes to observed classes — they are point-wise metrics. But the model actually predicts probabilities, which raises a different question: do the predicted probabilities agree with the observed frequencies of events? To check, we bin observations by predicted probability and compare each bin's empirical event frequency to its average prediction. Choosing the number of bins involves a trade-off:
- Many narrow bins: more resolution, but fewer observations per bin. Empirical frequencies become noisy estimates — unreliable with small samples.
- Few wide bins: more observations per bin → more stable estimates. But fewer bins means less detail about model behaviour across the probability range.
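The binned comparison can be sketched as follows. The data are simulated to be well calibrated by construction, and the choice of 10 equal-width bins is arbitrary:

```r
set.seed(2)
p <- runif(500)            # toy predicted probabilities
y <- rbinom(500, 1, p)     # toy outcomes drawn at exactly those probabilities

bins <- cut(p, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)  # 10 equal-width bins
data.frame(
  mean_predicted = tapply(p, bins, mean),   # average predicted probability per bin
  observed_freq  = tapply(y, bins, mean),   # empirical frequency of y = 1 per bin
  n              = as.vector(table(bins))   # observations per bin (estimate stability)
)
```

For a well-calibrated model, `mean_predicted` and `observed_freq` track each other up to sampling noise, which shrinks as the per-bin count `n` grows.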
| Dimension | Accuracy / ROC / AUC | Calibration |
|---|---|---|
| What it measures | Point-wise class discrimination | Agreement between predicted probs & observed frequencies |
| Requires threshold? | Yes (Accuracy); No (AUC) | No |
| Can be computed in real problems? | Yes | Yes |
| View of model behaviour | Single-prediction accuracy | Long-run aggregate behaviour |
In a real problem, you can never directly compare the model's predicted probability to the "true" probability — because the true probability is unknown. The only comparison available is to the observed 0/1 outcomes.
An important insight from this comparison: when the model predicts ~0.5, we should expect a near 50/50 mix of 0s and 1s — and we should not be surprised by misclassifications in that region. They are statistically appropriate, not model failures.
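This last point can be checked by simulation: when the true event probability is 0.5, any fixed class prediction will "miss" about half the time, and that is the statistically expected behaviour rather than a model failure:

```r
set.seed(3)
y <- rbinom(10000, 1, 0.5)  # outcomes when the true event probability is 0.5
y_hat <- 1                  # suppose the model predicts EVENT at this point
mean(y != y_hat)            # about 0.5: half the predictions miss, as expected
```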