INFSCI 2595 · Week 1

Introduction to Machine Learning

Interactive Study Guide — Lecture Notes Companion
What is Machine Learning?
// pattern recognition · applied math · course structure · what we'll learn
The One-Line Definition Core
Machine Learning = tools for recognizing patterns in data
…essentially, applied linear algebra & probability

Every ML algorithm — from a simple linear regression to a neural network — is built on the same mathematical bedrock. The models, languages, and tools change constantly; the foundations do not.

Even the most advanced "AI" is based on basic principles of mathematics and statistics. Understanding those principles is what separates practitioners who just run code from those who truly understand what they are doing.
What the Course Covers Big Picture
Algorithms & Theory
  • Linear models & regularization
  • Generalized linear models (logistic regression)
  • Bayesian linear models
  • Neural networks
  • Decision trees, random forests, boosting
  • Unsupervised: PCA & clustering
Concepts & Practice
  • Bias-variance trade-off
  • Cross-validation & resampling
  • Maximum likelihood (MLE)
  • Bayesian inference (MAP)
Course Structure — 4 Primary Modules Roadmap
  1. Applied ML (now)
  2. Distribution Fitting
  3. Supervised Learning Deep Dive
  4. Unsupervised Learning

The course deliberately starts with applied ML (regression & classification hands-on) before building up the statistical theory. This lets you see the big picture first, then understand why everything works.

Why not jump straight to neural networks? The course builds understanding from fundamental linear models upward. Neural networks then become natural extensions of basic principles, not mysterious black boxes.
Why R? Tools

The course uses R because the focus is on statistics, not software engineering. R makes the statistical fundamentals front-and-center.

R offers powerful packages for visualization (ggplot2), data manipulation (dplyr/tidyverse), modeling (modelr, caret, tidymodels), and more.
Homework uses R Markdown — code + written explanation in one document, submitted as rendered HTML.

⚡ If you understand the statistics and linear algebra, you have the foundation to work in any programming language once you know the syntax. The language is a tool; the math is the skill.

The Core ML Framework
// y ≈ f(x) · loss functions · parameters · training vs testing · supervised learning
The Fundamental Approximation Foundation
The supervised learning goal
y ≈ f(x) + error

We want to learn a function f(x) that maps inputs x to outputs y. Because we learn from noisy, finite data, the relationship is always approximate — there will always be error.
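To make y = f(x) + error concrete, here is a minimal illustrative sketch (not from the lecture; the course itself uses R, but plain Python keeps the example self-contained): we simulate noisy data from a known linear f, then recover an approximate f_hat by ordinary least squares. The numbers (slope 2, intercept 1, noise level 0.5) are made up for illustration.

```python
# Illustrative sketch: a noisy linear "truth" y = f(x) + error,
# and a learned approximation f_hat fit by least squares.
import random

random.seed(1)

def f(x):                       # the true (unknown) relationship
    return 2.0 * x + 1.0

# Finite, noisy observations: y = f(x) + error
xs = [i / 10 for i in range(50)]
ys = [f(x) + random.gauss(0, 0.5) for x in xs]

# Learn f_hat via the closed-form least-squares solution (one input)
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
slope_num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
slope_den = sum((x - x_bar) ** 2 for x in xs)
slope = slope_num / slope_den
intercept = y_bar - slope * x_bar

print(f"f_hat(x) = {slope:.2f} * x + {intercept:.2f}")  # close to 2x + 1, never exact
```

Because the data are noisy and finite, the learned coefficients land near, but never exactly on, the true values: the relationship is always approximate.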

Inputs (features)

Denoted x (scalar), x (vector), or X (matrix). What we observe and feed into the model.

Output (response / target)

Denoted y or t. What we want to predict. Can be continuous, categorical, count, etc.

The Three Pillars of Every ML Algorithm Structure
1
Loss / Objective Function
Every ML algorithm requires specifying what is to be learned: the goal. All ML seeks to minimize this loss by adjusting the model's parameters. Example: YouTube's recommender optimizes for watch time; a regression model minimizes prediction error.
2
Tunable Parameters (β / coefficients)
Every model has a set of parameters that control what patterns it can identify. Learning = finding the parameter values that minimize the loss function. More parameters → more complex model.
3
Training Data + Testing Data
Training data is used to learn the optimal parameters. Testing data evaluates whether the model generalizes to new samples. A model that only performs well on training data may have memorized noise — it hasn't truly learned.
Learning is supervised because the known output labels guide (supervise) the learning of the unknown model parameters. The parameters are learned by minimizing the model error.
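All three pillars fit in a few lines. The illustrative Python sketch below (toy numbers, one made-up parameter beta) shows a loss function, a tunable parameter learned by minimizing that loss on training data, and a held-out test set used to check generalization.

```python
# The three pillars for a toy one-parameter model y ≈ beta * x:
# (1) a loss function, (2) a tunable parameter beta, (3) train vs. test data.
import random

random.seed(2)
data = [(x, 3.0 * x + random.gauss(0, 1.0)) for x in [i / 50 for i in range(100)]]
random.shuffle(data)
train, test = data[:80], data[80:]          # pillar 3: train/test split

def loss(beta, pairs):                      # pillar 1: mean squared error
    return sum((y - beta * x) ** 2 for x, y in pairs) / len(pairs)

# pillar 2: learning = adjusting beta to minimize the training loss
beta, lr = 0.0, 0.1
for _ in range(500):
    grad = sum(-2 * x * (y - beta * x) for x, y in train) / len(train)
    beta -= lr * grad

print(f"learned beta = {beta:.2f}")
print(f"train loss = {loss(beta, train):.2f}, test loss = {loss(beta, test):.2f}")
```

The test loss, not the training loss, is the honest estimate of how well the model has learned the underlying pattern rather than the training noise.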
The Role of Probability Distributions Why It Matters

Model assumptions are related to probability distributions. Understanding the probability distributions behind popular model formulations allows us to:

Understand more about the data generating process
Understand what the model thinks can happen
Quantify and communicate uncertainty

Two frameworks for finding optimal parameters: Maximum Likelihood Estimation (MLE) — frequentist statistics, and Maximum a Posteriori (MAP) — Bayesian inference. Both will be covered in depth.
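A toy numeric preview of the MLE/MAP distinction (illustrative assumptions, not from the lecture: Gaussian data with known variance and a Gaussian prior on the mean): the MLE is just the sample mean, while the MAP estimate is a precision-weighted compromise between the data and the prior.

```python
# Toy MLE vs. MAP for the mean of Gaussian data.
# Assumptions (made up for illustration): known data variance sigma2,
# prior on the mean: Normal(mu0, tau2).
import random

random.seed(3)
sigma2 = 1.0                     # known data variance
mu0, tau2 = 0.0, 0.5             # prior mean and prior variance

data = [random.gauss(2.0, sigma2 ** 0.5) for _ in range(10)]
n = len(data)

mle = sum(data) / n              # MLE: the sample mean

# MAP with a Gaussian prior: precision-weighted average of prior and data
map_est = (mu0 / tau2 + sum(data) / sigma2) / (1 / tau2 + n / sigma2)

print(f"MLE = {mle:.2f}")
print(f"MAP = {map_est:.2f}  (shrunk toward the prior mean {mu0})")
```

With little data the prior pulls the MAP estimate noticeably toward mu0; as n grows, the data term dominates and MAP converges to MLE.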

Understanding Assumptions = Understanding the Model Insight

When we understand a model's assumptions, we can:

  • Understand what controls the model behavior
  • Identify the strengths and weaknesses of different methods
  • Recognize when a method is NOT appropriate
  • Interpret the model's learned parameters
Reality check: You can apply methods without understanding them, and sometimes a failure is obvious enough that you can diagnose and fix it. The real danger is when the model is silently wrong and you never notice.
ML Problem Types
// supervised vs unsupervised · regression · classification · other response types
Supervised vs. Unsupervised Learning Top-Level Split
Supervised Learning

We have labelled input-output pairs. The output supervises learning of the model parameters. Goal: learn a mapping from input to output.

e.g. Predicting house price from features, classifying emails as spam/not-spam.

Unsupervised Learning

No output labels. Goal: discover interesting patterns and structure in the input data without being told what to look for.

e.g. Clustering customers, PCA for dimensionality reduction.

This course focuses primarily on supervised learning, with coverage of unsupervised methods (PCA and clustering) in the final module. The key question in supervised learning: does a new observation correctly trigger the learned pattern?
Supervised Learning: Response Types Taxonomy
Regression

Continuous response — any real number.

House price · stock price · temperature next week · expected video views
Binary Classification

Exactly 2 possible classes.

Loan default / no default · pass / fail · disease present / absent
Multi-class Classification

3 or more possible classes.

Image classification · which team wins the Super Bowl · next song to recommend
Other Response Types

Also exist — covered later.

Count data (calls/hour, goals/game) · Survival/reliability (time to failure)
This course focuses primarily on binary classification, but will cover the multiclass generalization — the fundamental ideas transfer directly.
Unsupervised Learning ("Data Discovery") Definition

Observe variables without a distinction between inputs and responses. Explore the inputs without regard for any label (or when no label exists).

Find relationships between variables — which features co-vary?
Find relationships between observations — which samples cluster together?
Especially useful in high-dimensional settings where manual exploration is impossible

In this course: K-means and hierarchical clustering for grouping observations; Principal Component Analysis (PCA) for dimensionality reduction and discovering latent structure.
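As a preview of the clustering idea, here is a minimal, illustrative K-means sketch in plain Python (the course will use R; all data and starting values are made up): alternate between assigning points to their nearest centroid and moving each centroid to its cluster's mean.

```python
# Minimal K-means sketch: two well-separated 2-D blobs, k = 2.
import random

random.seed(4)
# two obvious blobs, centered near (0, 0) and (5, 5)
points = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(30)] + \
         [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(30)]

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

centroids = [(0.0, 1.0), (4.0, 4.0)]          # crude initial guesses
for _ in range(10):
    # assignment step: each point joins its nearest centroid
    clusters = [[], []]
    for p in points:
        k = min(range(2), key=lambda k: dist2(p, centroids[k]))
        clusters[k].append(p)
    # update step: each centroid moves to its cluster's mean
    centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                 for c in clusters]

print("centroids:", [(round(x, 1), round(y, 1)) for x, y in centroids])
```

No labels were used anywhere: the grouping emerges purely from the structure of the inputs, which is exactly what "data discovery" means.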

Other Types of Learning Broader Landscape

Supervised and unsupervised are the two main paradigms, but others exist:

  • Semi-supervised: uses a mix of labelled and unlabelled data
  • Self-supervised: generates its own supervision signal from the data structure (e.g. predicting masked words)
  • Active learning: the model queries for labels on the most informative samples
  • Online learning: the model updates continuously as new data arrives
  • Reinforcement learning: an agent learns via trial and error, receiving reward/penalty signals
The ML Workflow
// data access · EDA · cleaning · preprocessing · training · resampling · model selection
The Complete Supervised Learning Pipeline Big Picture

Real ML projects follow this pipeline. Note that EDA appears at multiple points — understanding the data is an ongoing activity, not a one-time step.

Data Access → EDA → Contextualize & Clean → EDA → Identify Models & Preprocess → Fit on Training Data → Resampling → Identify Best Model
⟲ This process is iterative — model results often send you back to re-examine or re-clean the data
Each Stage Explained Details
1
Data Access
Acquiring suitable data can be complicated — especially ethically when human subjects are involved. Raw data is typically spread across multiple sources and not organized conveniently. Sources must be merged using common "keys."
2
EDA — Exploratory Data Analysis
Create figures and describe the data to motivate further analysis. Identify relationships between inputs and outputs, spot outliers, understand distributions, and decide on preprocessing steps. EDA happens both before and after cleaning.
3
Contextualize & Clean
Remove duplicate rows, correct erroneous values, handle missing data. This is the most time-consuming part of real ML projects — data access, contextualization, and cleaning can take 60–80% of total project time. The most sophisticated model is worthless on invalid data.
4
Identify Candidate Models & Preprocess
Choose models appropriate for the data structure, response type, and domain. Apply preprocessing (standardization, normalization, feature selection, transformations) based on the assumptions of the selected models. Preprocessing choices depend on model choice.
5
Fit Models to Training Data
Train each candidate model on the training set. This is where the "cool stuff" lives, but it is only a small fraction of the total effort. Analogy: scoring 100% on a practice exam does not mean you will ace the real exam; training performance alone does not prove learning.
6
Resampling (Cross-Validation)
Evaluate models on data they haven't seen to estimate true generalization performance. Prevents overfitting to the training set. ML models must be tested on unseen data to determine whether they've truly learned a general pattern.
7
Compare Models & Identify Best
Compare candidates using held-out performance metrics. Select the best model considering both performance and complexity. The 1-SE rule (from Week 2) helps avoid selecting unnecessarily complex models.
Data Format: The Rectangular Table Structure

The ideal data format is a flat rectangular table (a data frame or tibble in R):

Observation | Input 1 | Input 2 | Response 1 (continuous) | Response 2 (binary)
1           | 5.2     | green   | 43.1                    | TRUE
2           | 6.1     | green   | 57.4                    | FALSE
3           | 2.0     | yellow  | 18.9                    | FALSE
Each row = one observation/sample. Each column = one input feature or response variable. In real projects, data usually starts scattered across multiple sources and must be merged before reaching this tidy format.
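Merging scattered sources on a common key can be sketched in a few lines. The example below is illustrative (hypothetical tables and column names; in R this is dplyr's inner_join, in Python it is pandas' merge): two raw sources are joined on an "id" key to produce one rectangular table.

```python
# Illustrative inner join on a common key ("id"), producing a tidy table.
measurements = [{"id": 1, "input_1": 5.2}, {"id": 2, "input_1": 6.1},
                {"id": 3, "input_1": 2.0}]
outcomes = [{"id": 1, "response": True}, {"id": 2, "response": False},
            {"id": 3, "response": False}]

# index one source by its key, then combine matching rows from the other
by_id = {row["id"]: row for row in outcomes}
table = [{**m, **by_id[m["id"]]} for m in measurements if m["id"] in by_id]

for row in table:
    print(row)
```

Rows whose key appears in only one source are dropped by an inner join; other join types (left, right, full) keep them with missing values, which then become a cleaning decision.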

Why Most ML Projects Fail Reality Check

Despite enormous hype, a very large fraction of "big data" projects fail (estimates around 80%), and only a small fraction of proof-of-concept projects ever make it into production.

Root cause: Data that is not suitable for analysis. Shortcuts taken during data preparation. Problems that simply aren't amenable to ML.
The fix: Understand and prepare the data before throwing complex models at the problem. Data quality determines the ceiling on model performance.
Why Math Matters
// the danger zone · understanding assumptions · "WHY" questions · what math enables
Avoiding the "Danger Zone" Motivation

Data science sits at the intersection of three skill areas. The "Danger Zone" occurs when you have hacking skills and domain knowledge but no math & statistics understanding — you can produce confident-looking but completely wrong results without noticing.

[Figure: data science Venn diagram. Three circles: Hacking Skills, Math & Statistics Knowledge, Substantive Expertise. Pairwise overlaps: Machine Learning (hacking + math), Traditional Research (math + expertise), Danger Zone (hacking + expertise). The center of all three is Data Science.]
The Danger Zone is the overlap of Hacking Skills + Substantive Expertise without Math & Statistics. You can get code to run, results to look plausible, and confidently report conclusions that are entirely wrong.
What Mathematical Understanding Enables The Goal

After building solid foundations, you will be able to confidently answer questions like:

🔍
WHY did it work?
Trace the result back to the model assumptions and data properties
WHY didn't it work?
Identify violated assumptions, data quality issues, or mismatched loss functions
⚙️
WHY that setup?
Justify modeling choices — not just copy a tutorial blindly
🔄
WHY different results?
Diagnose differences between your implementation and a reference
📊
WHY does it matter?
Communicate findings clearly to stakeholders in research, business, or policy
🚀
Adapt easily
Pick up new languages and frameworks quickly — the foundations transfer everywhere
The Practical Philosophy Mindset
Tools, languages, and specific techniques are constantly changing. With a solid mathematical foundation, you can adapt easily to anything new.

The most important part of machine learning is understanding the statistics behind the model. Math may seem intimidating at first, but visual and intuitive explanations — like those found in good online resources — make it far more approachable than it might seem.

As a data scientist, your value comes from applying models to real use cases, finding insights about key factors driving behavior, and communicating those findings effectively. The math is what makes that possible reliably — not just occasionally by luck.
ML in the Real World
// applications across fields · what ML can do · generative AI · hype vs reality
Fields Transformed by ML Impact
🏥
Healthcare
Disease detection, drug discovery, patient risk scoring, public health surveillance
💰
Finance
Credit risk, fraud detection, algorithmic trading, loan default prediction
🏭
Engineering
Predictive maintenance, quality control, fault detection, reliability analysis
📱
Tech & Products
Recommendation systems, search ranking, content moderation, ad targeting
🔬
Science & Research
Seismic signal classification, genomics, climate modeling, particle physics
📚
Humanities & Social Science
Text corpus analysis for literary & legal studies, social media analysis

ML is being used by companies across every industry: Microsoft Azure, GE, Amazon Web Services (AWS), Google (TensorFlow), Airbnb, Coca-Cola, Netflix, the NFL, and countless others.

What ML Actually Does Capabilities
  • Find patterns automatically: discover structure in data without being told what to look for
  • Model relationships: link observed measurements/traits to outcomes (purchases, health, capacity)
  • Predict outcomes: how likely is a new customer to buy a cordless drill, given similar customers' behavior?
  • Adapt to feedback: improve performance relative to environment and user feedback (reinforcement learning)
  • Generate outputs: create new content based on some class of input (chatbots, generative art, code)
Generative AI & LLMs in Context Perspective

Recent advances in large language models and generative AI have created enormous hype. Two important things to keep in mind:

LLMs hallucinate. There is a good chance an LLM will give you the wrong answer to technical ML questions — especially on the specific mathematical details you will be working with in this course.
All of it is math. The most impressive generative models are fundamentally extensions of the same principles — linear algebra, probability, optimization — that this course is built around.
The tension illustrated in the lecture: a model prompted to paint "Harvey Milk" produces a stylistically compelling but factually incorrect portrait. High capability, genuine risk. Understanding the foundations helps you use these tools critically.
Why ML Became Popular Now History

The mathematical ideas behind ML are decades old. What changed recently:

  • Ubiquitous digital data collection (surveillance, sensors, social media)
  • Data centers and supercomputers capable of processing it all
  • Proliferation of internet-connected devices (IoT)
  • Growth of ML/data science university programs
  • Open-source algorithms and frameworks (TensorFlow, scikit-learn, R, etc.)
Quick Glossary
// All key terms from the Week 1 lectures