data visualization
data wrangling
professional ethics
statistical foundation
The ultimate objective in data science is to extract meaning from data.
Data wrangling and visualization are just tools:
We are very good at seeing patterns, but we can easily be misled by accidental patterns that appear in random noise.
Statistical methods quantify patterns and their strength; they are essential tools for separating genuine structure from accidental patterns in noise.
Statistical methodologies refer to the collective knowledge about when and how such methods should be applied.
A statistical methodology such as the z-test is in fact more than the formula! For example, it also encompasses the assumptions under which the test is valid and how its result should be interpreted.
A key to understanding this, and thus to using statistical methods correctly, is to think of the data as a sample from a population.
We have considered data as being fixed (the word “data” stems from the Latin word for “given”).
Statistical methodology is governed by a broader point of view: it assumes that the cases are drawn from a much larger set of potential cases.
The given data are a sample of a larger population.
(from Dr. Jung’s lecture notes for mathematical statistics)
Suppose you were asked to help develop a travel policy for business travelers based in New York City.
We’re going to use the flights dataset in the nycflights13 package, which records airline delays for every flight departing a New York City airport in 2013. So here we actually have the complete population of 336,776 flights.
However, if we had the complete population, we could simply look up the best flights that arrive in time for the meeting. So, more realistically, the problem is to develop a policy for this year based on the sample of data that has already been collected.
We’re going to simulate this situation by drawing a sample from the population of flights into SFO.
First we set up the population.
library(mdsr)
library(nycflights13)
SF <- flights %>%
filter(dest == "SFO", !is.na(arr_delay))

We will work with just a sample from this population. For now, we’ll set the sample size to be \(n = 25\) cases.
set.seed(101)
Sample25 <- SF %>%
sample_n(size = 25)

Here, we used

sample_n() to select random rows from a table,
set.seed() to specify how to mimic random selection with a pseudo-random number generator.

We are interested in the arrival delay. Focus on the variable arr_delay. Compute some useful statistics and visualize the distribution.
favstats( ~ arr_delay, data = Sample25)

## min  Q1 median Q3 max  mean       sd  n missing
## -50 -23     -7  4 124 -2.96 35.33822 25       0
Sample25 %>% ggplot(aes(x = arr_delay)) + geom_density()

Notice that most delays are less than 50 minutes, but the maximum delay is 124 minutes.
A simple way to set the policy is to look for the longest flight delay (in our data), and insist that travel be arranged to deal with this delay.
Naive policy: book a flight that is scheduled to arrive at least 124 minutes before the meeting.
How well does this policy work?
Since in our imaginary situation we have the entire population, we can answer this question.
In particular, we can derive how often the travelers from NYC will be late for a meeting in San Francisco, when the naive policy is used.
SF %>%
mutate(is.late = arr_delay > 124) %>%
summarize(prop.late = mean(is.late))

## # A tibble: 1 x 1
## prop.late
## <dbl>
## 1 0.0294
This seems reasonable, if a 2.9% chance of being late and missing the meeting is acceptable.
Now consider another policy, utilizing a 2-standard deviation rule.
If the arrival delay were normally distributed, then only about 2.28% of flights would have a delay greater than \(\mu + 2\sigma\) (the mean plus two standard deviations).
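This tail probability comes straight from the normal distribution; a quick check in base R:

```r
# Probability that a normal variable exceeds its mean by more than 2 sd:
1 - pnorm(2)                  # 0.02275013, i.e., about 2.3%
pnorm(2, lower.tail = FALSE)  # the same tail probability, directly
```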
However, in the realistic situation, we do not have the population; we only have the sample data in Sample25. Consequently, we do not know \(\mu\) and \(\sigma\) and must estimate them from the sample.
Sample25 %>%
summarize(mean = mean(arr_delay), sd = sd(arr_delay))

## # A tibble: 1 x 2
## mean sd
## <dbl> <dbl>
## 1 -2.96 35.3
m <- mean(Sample25$arr_delay)
s <- sd(Sample25$arr_delay)
m + 2 * s

## [1] 67.71645
Policy #2: book a flight that is scheduled to arrive at least 68 minutes before the meeting.
How does this policy work? Again, since we have the population, we can see how often the travelers will miss their meetings:
SF %>%
mutate(is.late = arr_delay > 68) %>%
summarize(prop.late = mean(is.late))

## # A tibble: 1 x 1
## prop.late
## <dbl>
## 1 0.0693
This is much worse than the nominal 2.28%! About 7% of the time, the traveler will be late and miss the meeting.
What has gone wrong? To answer this, we need to consider the following:
How reliable is a sample statistic?
For example, the sample standard deviation, computed from our sample, is 35 minutes. This is much smaller than the population standard deviation, about 47 minutes.
The sample mean is about -3 minutes, which is close to the population mean, 2.7 minutes.
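Since in this imaginary setting we hold the whole population in SF, both population parameters can be computed directly (something impossible in a realistic situation):

```r
# Population mean and sd of arr_delay; the mean is about 2.7 minutes and
# the sd about 47 minutes, as quoted above.
SF %>%
  summarize(mu = mean(arr_delay), sigma = sd(arr_delay))
```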
If we were to collect a new sample from the population, how similar would the sample statistic on that new sample be to the same statistic calculated on the original sample?
SF %>% sample_n(size = 25) %>%
summarize(mean = mean(arr_delay), sd = sd(arr_delay))

## # A tibble: 1 x 2
## mean sd
## <dbl> <dbl>
## 1 -3.84 38.3
Note that the sample mean and standard deviation of the new sample are -3.84 and 38.3, respectively. They are reasonably similar to those of the original sample, -2.96 and 35.3, respectively.
If we draw many different samples from the population, each of size \(n\), and calculate the sample statistic on each of those samples, how similar would the sample statistic be across all the samples?
With the population in hand, it’s easy to figure this out; use sample_n() many times and calculate the sample statistic on each trial.
Trials <- list()
set.seed(101)
for(i in 1:100){
Trials[[i]] <- SF %>%
sample_n(size = 25) %>%
summarize(mean = mean(arr_delay), sd = sd(arr_delay))
}
Trials_df <-
bind_rows(Trials) %>%
mutate(policy.cut = mean + 2 * sd)

Several new functions are used above: for(){...} and dplyr::bind_rows(). We will talk about them a little later.
head(Trials_df, 10)

## # A tibble: 10 x 3
## mean sd policy.cut
## <dbl> <dbl> <dbl>
## 1 -2.96 35.3 67.7
## 2 -3.84 38.3 72.8
## 3 15.8 61.0 138.
## 4 -10.1 21.7 33.3
## 5 0.12 41.8 83.6
## 6 24.2 80.9 186.
## 7 -2.64 28.0 53.3
## 8 13.1 53.8 121.
## 9 7.8 66.4 141.
## 10 16.6 60.3 137.
These numbers vary a lot! The large variability of the sample statistic policy.cut shows how unreliable the statistic is.
favstats( ~ policy.cut, data = Trials_df)

##      min      Q1   median       Q3      max     mean       sd   n missing
## 14.45086 55.8069 85.53907 107.2871 449.4002 90.02024 54.42503 100       0
Trials_df %>% ggplot(aes(x = policy.cut)) + geom_density()

We were lucky that our first sample, Sample25, yielded only a 7% chance of being late. The chance could have been more than 20% (under a 14-minute policy), or as small as 0.0003% under a ridiculous 449-minute (7.5-hour) policy.
This discussion is only possible because we have the population. In a more realistic situation, ONLY the sample is available to us. Is quantifying the uncertainty ever possible?
Is the normal distribution assumed for the population appropriate?
Are there outliers (unusual or extreme observations)?
Should the policy depend on the airline, the time of year, the day of the week, and the time of day?
The data may contain some missing values. Is discarding the missing values a good practice?
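One partial answer to the question about quantifying uncertainty from a sample alone is the bootstrap, a resampling method we will meet later. A minimal sketch of the idea, using a simulated numeric vector x as a stand-in for Sample25$arr_delay:

```r
# Bootstrap sketch: treat the single observed sample as a stand-in
# population, resample from it WITH replacement many times, and watch the
# statistic (here, mean + 2 sd) vary across the resamples.
set.seed(101)
x <- rnorm(25, mean = -3, sd = 35)   # stand-in for Sample25$arr_delay
boot_cut <- replicate(1000, {
  xstar <- sample(x, size = length(x), replace = TRUE)
  mean(xstar) + 2 * sd(xstar)
})
sd(boot_cut)   # a rough estimate of the policy cutoff's variability
```

The spread of boot_cut plays the same role as the spread of policy.cut in Trials_df above, but it requires only the one sample we actually have.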
Iteration is a powerful tool for reducing duplication when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
Consider a toy data frame consisting of 10 observations of 4 variables.
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)

We want to compute the median of each column. You could do this with copy-and-paste:
median(df$a)
median(df$b)
median(df$c)
median(df$d)
Instead, we could use a for loop.
output <- vector("double", 4) # 1. output
#output <-vector(mode="double",length=4)
for (i in 1:4) { # 2. sequence
output[[i]] <- median(df[[i]]) # 3. body
}
output

## [1] -0.00468744 -0.20602777 -0.43395171  0.46391441
The output: output <- vector("double", 4).
Before you start the loop, you must always allocate sufficient space for the output.
A general way of creating an empty vector of given length is the vector() function.
In the earlier example, we created an empty list, using the list() function.
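For reference, vector(mode, length) returns a vector of the requested mode filled with that mode's default value:

```r
vector("double", 3)     # 0 0 0
vector("integer", 2)    # 0 0
vector("character", 2)  # "" ""
vector("logical", 2)    # FALSE FALSE
vector("list", 2)       # a list of two NULL elements
```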
The sequence: i in 1:4. This determines what to loop over: each run of the for loop will assign i to a different value from the sequence 1:4.
The body: output[[i]] <- median(df[[i]]).
The first iteration will run output[[1]] <- median(df[[1]]), the second will run output[[2]] <- median(df[[2]]), and so on.

Compute the number of unique values in each column of iris. Note that iris is a built-in dataset in base R, so you don’t need to import or load it.
You may use the n_distinct() function (in the tidyverse package) to count the distinct values of a variable.
# compute the number of unique values in each column of iris.
library(tidyverse)
n1 <- ncol(iris)
out1 <- vector("integer", n1)
for (i in 1:n1) {
  out1[i] <- n_distinct(iris[[i]])
}
out1

## [1] 35 23 43 22 3
# Alternative way
n1 <- dim(iris)[2]
out1 <- rep(NA, n1)
for (i in 1:n1) {
  out1[i] <- n_distinct(iris[[i]])
}
out1

## [1] 35 23 43 22 3
Determine the type of each column in nycflights13::flights. You may use the typeof() function to find the type of a column.
library(nycflights13)
n2 <- ncol(flights)

# Method 1 #
out2 <- vector("character", n2)
for (i in 1:n2) {
  out2[i] <- typeof(flights[[i]])
}
out2

## [1] "integer" "integer" "integer" "integer" "integer"
## [6] "double" "integer" "integer" "double" "character"
## [11] "integer" "character" "character" "character" "double"
## [16] "double" "double" "double" "double"
# Method 2 #
out3 <- matrix(NA, n2, 2)
name <- names(flights)
for (i in 1:n2) {
  out3[i, 1] <- name[i]
  out3[i, 2] <- typeof(flights[[i]])
}
out3

## [,1] [,2]
## [1,] "year" "integer"
## [2,] "month" "integer"
## [3,] "day" "integer"
## [4,] "dep_time" "integer"
## [5,] "sched_dep_time" "integer"
## [6,] "dep_delay" "double"
## [7,] "arr_time" "integer"
## [8,] "sched_arr_time" "integer"
## [9,] "arr_delay" "double"
## [10,] "carrier" "character"
## [11,] "flight" "integer"
## [12,] "tailnum" "character"
## [13,] "origin" "character"
## [14,] "dest" "character"
## [15,] "air_time" "double"
## [16,] "distance" "double"
## [17,] "hour" "double"
## [18,] "minute" "double"
## [19,] "time_hour" "double"
# Method 3 #
type <- vector("character", n2)
name <- names(flights)
for (i in 1:n2) {
  type[i] <- typeof(flights[[i]])
}
output <- data.frame(name, type)
output

## name type
## 1 year integer
## 2 month integer
## 3 day integer
## 4 dep_time integer
## 5 sched_dep_time integer
## 6 dep_delay double
## 7 arr_time integer
## 8 sched_arr_time integer
## 9 arr_delay double
## 10 carrier character
## 11 flight integer
## 12 tailnum character
## 13 origin character
## 14 dest character
## 15 air_time double
## 16 distance double
## 17 hour double
## 18 minute double
## 19 time_hour double
# Method 4 #
flights %>%
summarise_all(typeof) %>%
gather()

## # A tibble: 19 x 2
## key value
## <chr> <chr>
## 1 year integer
## 2 month integer
## 3 day integer
## 4 dep_time integer
## 5 sched_dep_time integer
## 6 dep_delay double
## 7 arr_time integer
## 8 sched_arr_time integer
## 9 arr_delay double
## 10 carrier character
## 11 flight integer
## 12 tailnum character
## 13 origin character
## 14 dest character
## 15 air_time double
## 16 distance double
## 17 hour double
## 18 minute double
## 19 time_hour double