
Moneyball 2.0: Winning in Sports with Data

INFSCI 1091: Special Topics

General Information

Data and analytics have been part of the sports industry since as early as the 1870s, when the first box score in baseball was recorded. However, only recently have advanced data mining and machine learning techniques, enabled by our ability to collect more fine-grained data, been used to support the operations of sports franchises. Draft selection, game-day decision making and player evaluation are just a few of the applications where sports analytics plays a crucial role today. Apart from the sports clubs, other stakeholders in the industry (e.g., the leagues' offices and the media) also invest in analytics; the leagues, for instance, increasingly rely on data to decide on potential rule changes. In this course, we will introduce data science concepts for sports analytics. Students will learn concepts related to data collection, data analysis and modeling, as well as data visualization.

For whom? Students need some basic statistical background and programming skills. They should also have an interest in sports, since all the analytical examples will be taken from the sports field.

Course Info

Course meetings: Tu/Thu, 11am-12:15pm, IS 501

Instructor: Konstantinos Pelechrinis

Syllabus: (pdf)

Project: (pdf)

Data Competition: (pdf) (Sample Submission File) (Scoring Function Script)

Week 1

Where are analytics used in sports? What is the current state-of-the-art? In Week 1 we will explore various sports APIs and web scraping frameworks. (ppt)
Sample code: (zip)

Week 2

This week we will introduce the notion of empirical probability and the concept of statistical tests. We will showcase these concepts using examples from in-game strategic decision making, such as two-point conversion and fourth-down decisions in the NFL and end-game basketball strategy. (ppt)
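
As a rough illustration of these two ideas, the sketch below estimates empirical probabilities from counts and runs a two-sided two-proportion z-test. All counts are made up for illustration (they are not real NFL data), and the function names are hypothetical helpers rather than code from the course material.

```python
import math


def empirical_probability(successes, attempts):
    """Relative frequency of success over the observed attempts."""
    return successes / attempts


def two_proportion_z_test(s1, n1, s2, n2):
    """Two-sided z-test of H0: both samples share one success probability."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (assumptions for illustration only).
two_pt_made, two_pt_tried = 48, 100
kick_made, kick_tried = 94, 100

# Expected points per attempt under each choice.
exp_points_two = 2 * empirical_probability(two_pt_made, two_pt_tried)
exp_points_kick = 1 * empirical_probability(kick_made, kick_tried)

# Is the two-point success rate different for runs (30/50) and passes (18/50)?
z, p_value = two_proportion_z_test(30, 50, 18, 50)
print(exp_points_two, exp_points_kick, round(z, 2), round(p_value, 3))
```

With small samples the normal approximation behind the z-test can be poor, so an exact or resampling-based test may be a better choice in practice.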

Week 3

This week we will introduce the concept of Monte Carlo simulations and resampling with replacement. (ppt)
Sample code for WWRT with Monte Carlo simulations in R: (R)
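
Separately from the R sample above, here is a minimal Python sketch of resampling with replacement (the bootstrap) used to get a percentile confidence interval for a team's mean point differential. The point differentials are invented for illustration, and `bootstrap_mean_ci` is a hypothetical helper name.

```python
import random


def bootstrap_mean_ci(data, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Resample n observations with replacement from the original sample.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-game point differentials for one team.
point_diffs = [7, -3, 10, 14, -6, 3, 21, -10, 4, 8]
low, high = bootstrap_mean_ci(point_diffs)
print(f"95% bootstrap CI for the mean differential: [{low:.1f}, {high:.1f}]")
```

The fixed seed only makes the run reproducible; the interval itself will vary slightly with the seed and the number of resamples.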

Week 4

This week we will introduce linear and logistic regression in the context of sports analytics. We will examine the Bradley-Terry model as well as in-game win probability models. We will also examine appropriate evaluation metrics for probability models (Brier score, probability validation curves). (ppt)
Sample Code and Data for the Bradley-Terry Model in Python: (Python)
  • R. Yurko, M. Horowitz and S. Ventura. "NFL Expected Points With NflscrapR" (Part 1) (Part 2)
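
To make the Bradley-Terry model and the Brier score concrete, the following is a minimal sketch and not the course's sample code. It fits team strengths with a simple iterative (minorization-maximization) update and assumes every team has at least one win and one loss; the win counts are invented, and all function names are hypothetical.

```python
def fit_bradley_terry(wins, n_iter=200):
    """wins[i][j] = number of times team i beat team j (diagonal is 0)."""
    n = len(wins)
    strength = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        total = sum(updated)
        strength = [s / total for s in updated]  # normalize to sum to 1
    return strength


def win_probability(strength, i, j):
    """Bradley-Terry probability that team i beats team j."""
    return strength[i] / (strength[i] + strength[j])


def brier_score(probabilities, outcomes):
    """Mean squared difference between forecasts and 0/1 outcomes."""
    pairs = list(zip(probabilities, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


# Hypothetical head-to-head win counts among three teams.
wins = [
    [0, 3, 4],
    [1, 0, 2],
    [1, 2, 0],
]
strength = fit_bradley_terry(wins)
print([round(s, 3) for s in strength], round(win_probability(strength, 0, 1), 3))
```

A lower Brier score is better; a forecaster who always says 0.5 scores 0.25, which is a useful baseline when evaluating a win probability model.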

Week 5

This week we will examine various methods for rating teams and players. In particular, we will examine the Elo rating method, which was first used to rate chess players and today is the core of FiveThirtyEight's predictions. We will also examine regression-based ratings as well as network ratings. We will further see advanced metrics for player evaluation (adjusted plus/minus, wins above replacement, etc.). (ppt)
Sample code for obtaining initial Elo ratings using pre-season total wins betting lines: (zip)
Sample code for obtaining team ratings (NBA current season): (zip)
Sample code for obtaining adjusted plus-minus: (zip)
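
For reference, the core Elo update can be sketched in a few lines. This is the textbook form of the method, not the code in the archives above; the K-factor of 20 and the starting ratings are illustrative assumptions, and sport-specific variants typically add terms such as home advantage or margin of victory.

```python
def expected_score(rating_a, rating_b):
    """Elo expected score (win probability) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=20.0):
    """Return new ratings after one game; score_a is 1 (win), 0.5 (tie) or 0 (loss)."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical ratings: an underdog (1450) beats a favorite (1600).
new_underdog, new_favorite = update_elo(1450.0, 1600.0, score_a=1)
print(round(new_underdog, 1), round(new_favorite, 1))
```

The update is larger when the result is more surprising, which is why an upset moves both ratings more than an expected win does.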

Week 6

During this week we will introduce the bias-variance tradeoff and the problem of overfitting. We will also introduce the notion of regularization for preventing overfitting. Finally, we will discuss how we can combine Monte Carlo simulations and team ratings for simulating sports tournaments. (ppt)
Sample code for obtaining regularized adjusted plus-minus (using the same data as above): (py)
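
The idea of combining Monte Carlo simulations with team ratings can be sketched as follows. We assume, for illustration only, a four-team single-elimination bracket, Elo-style ratings and the standard Elo win probability; the team names and ratings are made up.

```python
import random


def elo_win_prob(rating_a, rating_b):
    """Probability that A beats B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def simulate_bracket(ratings, rng):
    """Play one single-elimination bracket; adjacent teams meet each round.

    Assumes the number of teams is a power of two.
    """
    alive = list(ratings)
    while len(alive) > 1:
        next_round = []
        for k in range(0, len(alive), 2):
            a, b = alive[k], alive[k + 1]
            a_wins = rng.random() < elo_win_prob(ratings[a], ratings[b])
            next_round.append(a if a_wins else b)
        alive = next_round
    return alive[0]


def title_odds(ratings, n_sims=10000, seed=0):
    """Estimate each team's title probability by repeated simulation."""
    rng = random.Random(seed)
    counts = {team: 0 for team in ratings}
    for _ in range(n_sims):
        counts[simulate_bracket(ratings, rng)] += 1
    return {team: c / n_sims for team, c in counts.items()}


# Hypothetical ratings; dict order defines the bracket (A vs B, C vs D).
ratings = {"A": 1700, "B": 1500, "C": 1500, "D": 1500}
odds = title_odds(ratings)
print(odds)
```

More simulations reduce the Monte Carlo noise in the estimated probabilities at the cost of run time.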

Week 7

The Field Goal First Conundrum: (html)

Week 8

During this week we will introduce the concept of schedule strength and statistical adjustments that control for differences in schedule strength. We will also introduce the ideas behind expected points per play in the NFL. (ppt)
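
One common way to adjust for schedule strength is the so-called Simple Rating System, in which a team's rating equals its average point margin plus the average rating of its opponents. The sketch below is our own illustration of that idea and not necessarily the adjustment covered in the slides; the games are invented, and the fixed-point iteration used here is an assumption that does not converge for every schedule (solving the linear system directly is more robust).

```python
def simple_ratings(games, n_iter=100):
    """games: list of (team_a, team_b, margin), margin = points_a - points_b."""
    margins, opponents = {}, {}
    for a, b, margin in games:
        margins.setdefault(a, []).append(margin)
        margins.setdefault(b, []).append(-margin)
        opponents.setdefault(a, []).append(b)
        opponents.setdefault(b, []).append(a)
    avg_margin = {t: sum(m) / len(m) for t, m in margins.items()}
    ratings = dict(avg_margin)
    for _ in range(n_iter):
        # rating = own average margin + average rating of opponents faced
        ratings = {
            t: avg_margin[t]
            + sum(ratings[o] for o in opponents[t]) / len(opponents[t])
            for t in ratings
        }
    return ratings


# Hypothetical round robin: A beats B by 3, B beats C by 3, A beats C by 6.
games = [("A", "B", 3), ("B", "C", 3), ("A", "C", 6)]
ratings = simple_ratings(games)
print({team: round(r, 2) for team, r in ratings.items()})
```

In this toy schedule the raw average margins (4.5, 0, -4.5) shrink to ratings of about 3, 0 and -3 once opponent strength is taken into account.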

Week 9

During this week we will examine how we can use Bayesian inference to evaluate players for whom we have only a small number of observations. Tangential to this problem is evaluating draft picks and the efficiency of the underlying market. In particular, we will deal with the NFL and NBA drafts through the seminal studies from Massey and Thaler, as well as Winston, Sagarin and Medland. (ppt)
Sample code for Bayesian inference: (github)
C. Massey and R. Thaler, "Overconfidence VS Market Efficiency in the NFL": (pdf)
The making and comparison of draft curves: (html)
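
As a small worked example of Bayesian inference with few observations (a sketch of the standard beta-binomial update, not the code in the repository above), consider estimating a player's three-point percentage after only five attempts. The Beta(7, 13) prior, with mean 0.35, and the shot counts are assumptions chosen for illustration.

```python
def beta_posterior(alpha, beta, successes, attempts):
    """Update a Beta(alpha, beta) prior with binomial data."""
    return alpha + successes, beta + (attempts - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: league-like shooters centered at 35%.
prior_alpha, prior_beta = 7.0, 13.0

# Hypothetical small sample: the player made 4 of 5 threes.
made, attempts = 4, 5

post_alpha, post_beta = beta_posterior(prior_alpha, prior_beta, made, attempts)
raw_rate = made / attempts
shrunk_rate = beta_mean(post_alpha, post_beta)
print(raw_rate, shrunk_rate)  # 0.8 0.44
```

The posterior mean sits between the prior mean and the raw rate, and it moves toward the raw rate as the number of attempts grows.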

Week 10

During this week we will discuss spatial data in sports. We will focus specifically on the NBA, which has been using optical tracking in all of its arenas for several years now. We will introduce matrix factorization techniques (in particular, Singular Value Decomposition and Non-negative Matrix Factorization) that can be used to identify latent patterns in spatial data, as well as metrics that can quantify floor spacing. (ppt)
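
One simple way to quantify floor spacing, offered here as an illustrative assumption rather than as the metric used in the slides, is the area of the convex hull spanned by the five offensive players. The sketch below computes it with the monotone chain algorithm and the shoelace formula; the player coordinates (in feet) are made up.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for i in range(len(vertices)):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def spacing_area(positions):
    """Area covered by the convex hull of the player positions."""
    return polygon_area(convex_hull(positions))


# Hypothetical (x, y) positions of five offensive players, in feet.
lineup = [(5.0, 25.0), (22.0, 3.0), (22.0, 47.0), (28.0, 25.0), (14.0, 30.0)]
print(spacing_area(lineup))
```

A larger hull area suggests the defense has more ground to cover, although it ignores who the shooters are and where the defenders stand.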

Week 11

During this week we will discuss some basic concepts of game theory. We will focus on the notion of pure and mixed strategies, zero-sum games and Nash Equilibrium. We will see examples of game theory being applied to American football, basketball and soccer. (ppt)
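
To illustrate mixed strategies in a zero-sum game, the sketch below solves a 2x2 game in closed form using the indifference conditions: each side mixes so that the opponent is indifferent between its two options. The run/pass payoffs (expected yards for the offense) are invented for illustration, and `solve_2x2_zero_sum` is a hypothetical helper name.

```python
def solve_2x2_zero_sum(a, b, c, d):
    """Mixed-strategy equilibrium of a 2x2 zero-sum game.

    Row player's payoff matrix is [[a, b], [c, d]]. Returns (p, q, value):
    p = probability the row player picks row 1,
    q = probability the column player picks column 1.
    """
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:
        raise ValueError("the game has a pure-strategy saddle point")
    denom = a - b - c + d
    p = (d - c) / denom  # makes the column player indifferent
    q = (d - b) / denom  # makes the row player indifferent
    value = (a * d - b * c) / denom
    return p, q, value


# Hypothetical expected yards for the offense (rows: run, pass;
# columns: defense plays the run, defense plays the pass).
p_run, q_run_defense, value = solve_2x2_zero_sum(2.0, 6.0, 9.0, 3.0)
print(p_run, q_run_defense, value)
```

With these made-up payoffs the offense runs 60% of the time, the defense plays the run 30% of the time, and the offense gains 4.8 yards per play in expectation regardless of what the defense calls.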

Week 12

During this week we will discuss algorithms for identifying clusters in data. We will see k-means and hierarchical clustering, and we will discuss ways of choosing the number of clusters. We will also discuss the curse of (high) dimensionality and Principal Component Analysis for dimensionality reduction. (ppt)
P. Domingos, "A Few Useful Things to Know about Machine Learning": (pdf)
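
A minimal sketch of k-means (Lloyd's algorithm) is given below, assuming two-dimensional player feature vectors that are invented for illustration. The function also returns the inertia (within-cluster sum of squared distances), which can be compared across values of k as one heuristic for choosing the number of clusters. The `init_centers` parameter is our own addition for reproducibility.

```python
import random


def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, init_centers=None, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centers, inertia)."""
    rng = random.Random(seed)
    if init_centers is not None:
        centers = list(init_centers)
    else:
        centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, inertia


# Hypothetical (three-point attempts, rebounds) per game for six players.
players = [(8.1, 3.0), (7.5, 2.6), (9.0, 3.4), (0.5, 11.2), (1.0, 10.1), (0.2, 12.4)]
labels, centers, inertia = kmeans(players, k=2)
print(labels, centers, round(inertia, 2))
```

In practice the features should be put on comparable scales before clustering, and the algorithm is usually restarted from several initializations because it only finds a local optimum.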

Week 13

During this week we will discuss applications of network science and analysis in the realm of sports analytics. In particular, we will see the representation of player interactions through networks, as well as learning representations through network relations. (ppt)
B. Skinner, "The Price of Anarchy in Basketball": (pdf)
K. Pelechrinis, "LinNet: Probabilistic Lineup Evaluation Through Network Embedding": (pdf)

Week 14

Final exam

Every play is a data point!