data visualization
data wrangling
professional ethics
statistical foundation
The ultimate objective in data science is to extract meaning from data.
Data wrangling and visualization are just tools:
We are very good at seeing patterns, but we can easily be misled by accidental patterns that appear in random noise.
Statistical methods quantify patterns and their strength; they are essential tools for separating genuine structure from accidental patterns in noise.
Statistical methodologies refer to the collective knowledge about when and how such methods should be applied.
A statistical methodology such as the z-test is in fact more than the formula! For example, it also encompasses the assumptions under which the test is valid and how its result should be interpreted.
A key to understanding this, and thus to using statistical methods correctly, is to think of the data as a sample from a population.
We have considered data as being fixed (the word “data” stems from the Latin word for “given”).
Statistical methodology is governed by a broader point of view: it assumes that the cases are drawn from a much larger set of potential cases.
The given data are a sample of a larger population.
(from Dr. Jung’s lecture notes for mathematical statistics)
Suppose you were asked to help develop a travel policy for business travelers based in New York City.
We’re going to use the flights dataset in the nycflights13 package, which records airline delays for every flight departing a New York City airport in 2013. So here we actually have the complete population of 336,776 flights.
However, if we had the complete population, we could simply look up the best flights that arrive in time for the meeting. So, more realistically, the problem is to develop a policy for this year based on the sample of data that has already been collected.
We’re going to simulate this situation by drawing a sample from the population of flights into SFO.
First we set up the population.
library(mdsr)
library(nycflights13)
SF <- flights %>%
filter(dest == "SFO", !is.na(arr_delay))

We will work with just a sample from this population. For now, we’ll set the sample size to be \(n = 25\) cases.
set.seed(101)
Sample25 <- SF %>%
sample_n(size = 25)

Here, we used

sample_n() to select random rows from a table,
set.seed() to specify how to mimic random selection with a pseudo-random number generator.

We are interested in the arrival delay. Focus on the variable arr_delay. Compute some useful statistics and visualize the distribution.
favstats( ~ arr_delay, data = Sample25)

## min  Q1 median Q3 max  mean       sd  n missing
## -50 -23     -7  4 124 -2.96 35.33822 25       0
Sample25 %>% ggplot(aes(x = arr_delay)) + geom_density()

Notice that most delays are less than 50 minutes, but the maximum delay is 124 minutes.
A simple way to set the policy is to look for the longest flight delay (in our data), and insist that travel be arranged to deal with this delay.
Naive policy: book a flight that is scheduled to arrive at least 124 minutes before the meeting.
How well does this policy work?
Since in our imaginary situation we have the entire population, we can answer this question.
In particular, we can derive how often the travelers from NYC will be late for a meeting in San Francisco, when the naive policy is used.
SF %>%
mutate(is.late = arr_delay > 124) %>%
summarize(prop.late = mean(is.late))

## # A tibble: 1 x 1
## prop.late
## <dbl>
## 1 0.0294
This seems reasonable, if a 2.9% chance of being late and missing the meeting is acceptable.
Now consider another policy, utilizing a 2-standard deviation rule.
If the arrival delay were normally distributed, then only about 2.28% of flights would have a delay greater than \(\mu + 2\sigma\) (the mean plus two standard deviations).
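This tail probability comes straight from the normal distribution; a quick check in base R:

```r
# Probability that a normal variable exceeds its mean by more than 2 sd:
1 - pnorm(2)                  # 0.02275013, i.e., about 2.3%
pnorm(2, lower.tail = FALSE)  # the same tail probability, directly
```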
However, in the realistic situation, we do not have the population; we only have the sample data in Sample25. Consequently, we do not know \(\mu\) and \(\sigma\) and must estimate them from the sample.
Sample25 %>%
summarize(mean = mean(arr_delay), sd = sd(arr_delay))

## # A tibble: 1 x 2
## mean sd
## <dbl> <dbl>
## 1 -2.96 35.3
m <- mean(Sample25$arr_delay)
s <- sd(Sample25$arr_delay)
m + 2 * s

## [1] 67.71645
Policy #2: book a flight that is scheduled to arrive at least 68 minutes before the meeting.
How does this policy work? Again, since we have the population, we can see how often the travelers will miss their meetings:
SF %>%
mutate(is.late = arr_delay > 68) %>%
summarize(prop.late = mean(is.late))

## # A tibble: 1 x 1
## prop.late
## <dbl>
## 1 0.0693
This is much worse than the nominal 2.28%! About 7% of the time, the traveler will be late and miss the meeting.
What has gone wrong? To answer this, we need to consider the following:
How reliable is a sample statistic?
For example, the sample standard deviation, computed from our sample, is 35 minutes. This is much smaller than the population standard deviation, about 47 minutes.
The sample mean is about -3 minutes, which is close to the population mean, 2.7 minutes.
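Since in this imaginary setting we hold the whole population in SF, both population parameters can be computed directly (something impossible in a realistic situation):

```r
# Population mean and sd of arr_delay; the mean is about 2.7 minutes and
# the sd about 47 minutes, as quoted above.
SF %>%
  summarize(mu = mean(arr_delay), sigma = sd(arr_delay))
```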
If we were to collect a new sample from the population, how similar would the sample statistic on that new sample be to the same statistic calculated on the original sample?
SF %>% sample_n(size = 25) %>%
summarize(mean = mean(arr_delay), sd = sd(arr_delay))

## # A tibble: 1 x 2
## mean sd
## <dbl> <dbl>
## 1 -3.84 38.3
Note that the sample mean and standard deviation of the new sample are -3.84 and 38.3, respectively. They are reasonably similar to those of the original sample, -2.96 and 35.3, respectively.
If we draw many different samples from the population, each of size \(n\), and calculate the sample statistic on each of those samples, how similar would the sample statistic be across all the samples?
With the population in hand, it’s easy to figure this out; use sample_n() many times and calculate the sample statistic on each trial.
Trials <- list()
set.seed(101)
for(i in 1:100){
Trials[[i]] <- SF %>%
sample_n(size = 25) %>%
summarize(mean = mean(arr_delay), sd = sd(arr_delay))
}
Trials_df <-
bind_rows(Trials) %>%
mutate(policy.cut = mean + 2 * sd)

Several new functions are used above: for(){...} and dplyr::bind_rows(). We will talk about them a little later.
head(Trials_df, 10)

## # A tibble: 10 x 3
## mean sd policy.cut
## <dbl> <dbl> <dbl>
## 1 -2.96 35.3 67.7
## 2 -3.84 38.3 72.8
## 3 15.8 61.0 138.
## 4 -10.1 21.7 33.3
## 5 0.12 41.8 83.6
## 6 24.2 80.9 186.
## 7 -2.64 28.0 53.3
## 8 13.1 53.8 121.
## 9 7.8 66.4 141.
## 10 16.6 60.3 137.
These numbers vary a lot! The large variability of the sample statistic policy.cut shows how unreliable the statistic is.
favstats( ~ policy.cut, data = Trials_df)

##      min      Q1   median       Q3      max     mean       sd   n missing
## 14.45086 55.8069 85.53907 107.2871 449.4002 90.02024 54.42503 100       0
Trials_df %>% ggplot(aes(x = policy.cut)) + geom_density()

We were lucky that our first sample, Sample25, yielded only a 7% chance of being late. The chance could have been more than 20% (under a 14-minute policy), or as small as 0.0003% under a ridiculous 449-minute (7.5-hour) policy.
This discussion is only possible because we have the population. In a more realistic situation, ONLY the sample is available to us. Is quantifying the uncertainty ever possible?
Is the normal distribution assumed for the population appropriate?
Are there outliers (unusual or extreme observations)?
Should the policy depend on the airline, the time of year, the day of the week, and the time of day?
The data may contain some missing values. Is discarding the missing values a good practice?
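One partial answer to the question about quantifying uncertainty from a sample alone is the bootstrap, a resampling method we will meet later. A minimal sketch of the idea, using a simulated numeric vector x as a stand-in for Sample25$arr_delay:

```r
# Bootstrap sketch: treat the single observed sample as a stand-in
# population, resample from it WITH replacement many times, and watch the
# statistic (here, mean + 2 sd) vary across the resamples.
set.seed(101)
x <- rnorm(25, mean = -3, sd = 35)   # stand-in for Sample25$arr_delay
boot_cut <- replicate(1000, {
  xstar <- sample(x, size = length(x), replace = TRUE)
  mean(xstar) + 2 * sd(xstar)
})
sd(boot_cut)   # a rough estimate of the policy cutoff's variability
```

The spread of boot_cut plays the same role as the spread of policy.cut in Trials_df above, but it requires only the one sample we actually have.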
Iteration is a powerful tool for reducing duplication when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
Consider a toy data frame consisting of 10 observations of 4 variables.
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)

We want to compute the median of each column. You could do this with copy-and-paste:
median(df$a)
median(df$b)
median(df$c)
median(df$d)
Instead, we could use a for loop.
output <- vector("double", 4) # 1. output
#output <-vector(mode="double",length=4)
for (i in 1:4) { # 2. sequence
output[[i]] <- median(df[[i]]) # 3. body
}
output

## [1] -0.00468744 -0.20602777 -0.43395171  0.46391441
The output: output <- vector("double", 4).
Before you start the loop, you must always allocate sufficient space for the output.
A general way of creating an empty vector of given length is the vector() function.
In the earlier example, we created an empty list, using the list() function.
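For reference, vector(mode, length) returns a vector of the requested mode filled with that mode's default value:

```r
vector("double", 3)     # 0 0 0
vector("integer", 2)    # 0 0
vector("character", 2)  # "" ""
vector("logical", 2)    # FALSE FALSE
vector("list", 2)       # a list of two NULL elements
```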
The sequence: i in 1:4. This determines what to loop over: each run of the for loop will assign i to a different value from the sequence 1:4.
The body: output[[i]] <- median(df[[i]]).
The first iteration will run output[[1]] <- median(df[[1]]), the second will run output[[2]] <- median(df[[2]]), and so on.

Compute the number of unique values in each column of iris. Note that iris is a built-in dataset in base R, so you don’t need to import or load it.
You may use the n_distinct() function (in the tidyverse package) to count the distinct values of a variable.
# compute the number of unique values in each column of iris.
library(tidyverse)
n1 <- ncol(iris)
out1 <- vector("integer", n1)
for (i in 1:n1) {
  out1[i] <- n_distinct(iris[[i]])
}
out1

## [1] 35 23 43 22 3
# Alternative way
n1 <- dim(iris)[2]
out1 <- rep(NA, n1)
for (i in 1:n1) {
  out1[i] <- n_distinct(iris[[i]])
}
out1

## [1] 35 23 43 22 3
Determine the type of each column in nycflights13::flights. You may use the typeof() function to find the type of a column.
library(nycflights13)
n2 <- ncol(flights)

# Method 1 #
out2 <- vector("character", n2)
for (i in 1:n2) {
  out2[i] <- typeof(flights[[i]])
}
out2

## [1] "integer" "integer" "integer" "integer" "integer"
## [6] "double" "integer" "integer" "double" "character"
## [11] "integer" "character" "character" "character" "double"
## [16] "double" "double" "double" "double"
# Method 2 #
out3 <- matrix(NA, n2, 2)
name <- names(flights)
for (i in 1:n2) {
  out3[i, 1] <- name[i]
  out3[i, 2] <- typeof(flights[[i]])
}
out3

## [,1] [,2]
## [1,] "year" "integer"
## [2,] "month" "integer"
## [3,] "day" "integer"
## [4,] "dep_time" "integer"
## [5,] "sched_dep_time" "integer"
## [6,] "dep_delay" "double"
## [7,] "arr_time" "integer"
## [8,] "sched_arr_time" "integer"
## [9,] "arr_delay" "double"
## [10,] "carrier" "character"
## [11,] "flight" "integer"
## [12,] "tailnum" "character"
## [13,] "origin" "character"
## [14,] "dest" "character"
## [15,] "air_time" "double"
## [16,] "distance" "double"
## [17,] "hour" "double"
## [18,] "minute" "double"
## [19,] "time_hour" "double"
# Method 3 #
type <- vector("character", n2)
name <- names(flights)
for (i in 1:n2) {
  type[i] <- typeof(flights[[i]])
}
output <- data.frame(name, type)
output

## name type
## 1 year integer
## 2 month integer
## 3 day integer
## 4 dep_time integer
## 5 sched_dep_time integer
## 6 dep_delay double
## 7 arr_time integer
## 8 sched_arr_time integer
## 9 arr_delay double
## 10 carrier character
## 11 flight integer
## 12 tailnum character
## 13 origin character
## 14 dest character
## 15 air_time double
## 16 distance double
## 17 hour double
## 18 minute double
## 19 time_hour double
# Method 4 #
flights %>%
summarise_all(typeof) %>%
gather()

## # A tibble: 19 x 2
## key value
## <chr> <chr>
## 1 year integer
## 2 month integer
## 3 day integer
## 4 dep_time integer
## 5 sched_dep_time integer
## 6 dep_delay double
## 7 arr_time integer
## 8 sched_arr_time integer
## 9 arr_delay double
## 10 carrier character
## 11 flight integer
## 12 tailnum character
## 13 origin character
## 14 dest character
## 15 air_time double
## 16 distance double
## 17 hour double
## 18 minute double
## 19 time_hour double