Applied Statistical Methods 1000
Extra Credit Exercises

Rules

All extra credit should be your individual work; otherwise, points will be deducted. [Students who wish to work together on these problems must request my permission to do so in advance.]
Hand them in to me in lecture, in a separate pile from regular assignments.

Extra Credit 1 Due in lecture February 2. Worth 5 pts.

Students in a class were classified according to whether their major was undecided or not, and whether they lived on or off campus. 40 students lived off campus and had a decided major; 10 students lived off campus and had an undecided major. 24 students lived on campus and had a decided major; 26 students lived on campus and had an undecided major.

First analyze the relationship:
1. Complete a two-way table for the data.
2. Which group has a higher proportion living on campus---the decided or the undecided majors?
3. Compute a table of counts expected if there were no relationship between living situation and major decided or not.
4. Calculate the chi-squared statistic.
5. Which one of the following is the best way to summarize the situation? (i) There is no statistically significant relationship between living situation and major being decided or not. (ii) Year at Pitt is a confounding variable in the relationship between living situation and major decided or not. (iii) Living on campus prevents students from deciding on a major. (iv) Deciding on a major causes students to move off campus.
[This is the challenging part!] Now create two separate two-way tables for "underclassmen" and "upperclassmen", whose counts together total to those in the original table, but neither of which shows a significant relationship between living situation and major being decided or not. In other words, create a scenario which demonstrates Simpson's Paradox.
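
If you would like to check your hand calculations for parts 3 and 4, the arithmetic can be sketched in Python as follows. This is only a minimal illustration, not part of the assignment: the names `observed`, `expected_counts`, and `chi_squared` are made up for this sketch, and the counts are the ones given in the exercise.

```python
# Counts from the exercise: rows are living situation, columns are major status.
observed = {
    "off": {"decided": 40, "undecided": 10},
    "on": {"decided": 24, "undecided": 26},
}


def expected_counts(table):
    """Expected count for a cell = (row total * column total) / grand total."""
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for col, count in cols.items():
            col_totals[col] = col_totals.get(col, 0) + count
    grand_total = sum(row_totals.values())
    return {
        row: {col: row_totals[row] * col_totals[col] / grand_total
              for col in col_totals}
        for row in table
    }


def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    exp = expected_counts(table)
    return sum(
        (table[row][col] - exp[row][col]) ** 2 / exp[row][col]
        for row in table
        for col in table[row]
    )


expected = expected_counts(observed)
statistic = chi_squared(observed)
print(expected)
print(round(statistic, 2))
```

The same two functions can be applied to each of your "underclassmen" and "upperclassmen" tables to see whether either one shows a large chi-squared statistic.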

Extra Credit 2 Due in lecture March 12. Worth 5 pts.

Extra Credit Exercises 2 through 10 are based on student survey data survey9-21-03.txt, which is taken to be our population. To load it into MINITAB, press ctrl A to highlight all of the data, ctrl C to copy, then start up MINITAB and press ctrl V to paste. If MINITAB asks about delimiters, click OK. The purpose of this exercise is to explore how sample size affects the distribution of sample proportion.

First verify that the population of categorical values for the variable "live" has approximately equal proportions of the two possible values "off" and "on":
- Stat>Tables>Tally
- Variables Live
- Display Counts and Percents
Next take repeated small samples (size 10) from the population of values for the variable "live", for which the population proportion p living off campus you have verified to be approximately 50% or .5. [About half of our population of students live off campus, the other half on campus.] Our theory about the behavior of sampling distributions is for an infinite number of repetitions, but for practical purposes you will take 20 random samples altogether.

Calc>Random Data>Sample from Columns
Sample 10 rows from Column Live
Store samples in Livesmallsample1
Stat>Tables>Tally
Variables Livesmallsample1
Display Counts and Percents
Create a column called "phatliven=10" and type in the sample proportion living off campus (for example, .6 if the sample proportion is 60%)
Calc>Random Data>Sample from Columns
Sample 10 rows from Column Live
Store samples in Livesmallsample2 [The easiest way to do this is to simply change the "1" in the variable name to a "2".]
Stat>Tables>Tally
Variables Livesmallsample2 [Again, just change the "1" to a "2".]
Display Counts and Percents
Type the second sample proportion as the second entry in the column "phatliven=10"
Repeat the process above 20 times altogether, finishing with "Livesmallsample20", for which the proportion living off campus will be the 20th entry in "phatliven=10"

Finally, obtain summaries and display:

Stat>Basic Statistics>Display Descriptive Statistics
Enter the variable phatliven=10
Graph>Stem-and-Leaf
Enter the variable phatliven=10
Summarize the distribution of sample proportion for samples of size 10 by reporting center, spread, and shape.

Now take repeated large samples (size 40) from the population of values for the variable "live" (20 samples altogether):

Calc>Random Data>Sample from Columns
Sample 40 rows from Column Live
Store samples in Livelargesample1
Stat>Tables>Tally
Variables Livelargesample1
Display Counts and Percents
Create a column called "phatliven=40" and type in the sample proportion living off campus (for example, .525 if the sample proportion is 52.5%)
Calc>Random Data>Sample from Columns
Sample 40 rows from Column Live
Store samples in Livelargesample2
Stat>Tables>Tally
Variables Livelargesample2
Display Counts and Percents
Type the second sample proportion as the second entry in the column "phatliven=40"
Repeat the process above 20 times altogether, finishing with "Livelargesample20", for which the proportion living off campus will be the 20th entry in "phatliven=40"

Finally, obtain summaries and display:

Stat>Basic Statistics>Display Descriptive Statistics
Enter the variable phatliven=40
Graph>Stem-and-Leaf
Enter the variable phatliven=40
Summarize the distribution of sample proportion for samples of size 40 by reporting center, spread, and shape.

Lastly, and most importantly, compare the centers, spreads, and shapes for samples of size 10 vs. 40.
- Stat>Basic Statistics>Display Descriptive Statistics
- Enter the variables phatliven=10 and phatliven=40
- Stat>Basic Statistics>2-sample t
- Activate "Samples in Different Columns"
- Enter the variables phatliven=10 and phatliven=40
- Check Graph>Boxplots of Data
Write a few sentences to compare the distribution of sample proportion for small vs. large samples, including mention of center, spread, and shape. Are your results consistent with the theory presented in Chapter 9?
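
For comparison with your MINITAB results, the same repeated-sampling process can be sketched in Python. This is a rough illustration under stated assumptions: the list `population` is a hypothetical stand-in for the "Live" column (400 values, exactly half "off"), not the actual survey data, and the function name `sample_proportions` is made up for this sketch. The last value printed on each line is the standard deviation of sample proportion predicted by theory, the square root of p(1-p)/n.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the run is repeatable

# Hypothetical stand-in for the "Live" column: half "off", half "on".
population = ["off"] * 200 + ["on"] * 200


def sample_proportions(pop, n, repetitions=20, value="off"):
    """Draw `repetitions` samples of size n without replacement.

    Returns the sample proportion of `value` in each sample.
    """
    return [random.sample(pop, n).count(value) / n for _ in range(repetitions)]


p = population.count("off") / len(population)
for n in (10, 40):
    phats = sample_proportions(population, n)
    theory_sd = (p * (1 - p) / n) ** 0.5
    # sample size, mean of the 20 phats, their standard deviation, theory
    print(n,
          round(statistics.mean(phats), 3),
          round(statistics.stdev(phats), 3),
          round(theory_sd, 3))
```

With only 20 repetitions the observed center and spread will bounce around from run to run, just as they will in MINITAB; changing `repetitions` to a larger number shows how the results settle toward the theoretical values.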

Extra Credit 3 Prerequisite: Extra Credit 2. Due in lecture March 12. Worth 5 pts.

The purpose of this exercise is to explore how population shape affects the distribution of sample proportion.

First verify that the population of categorical values for the variable "handed", for which we are interested in the proportion who are ambidextrous, is very skewed: there are only about 3% (.03) who are ambidextrous; the remaining 97% favor either the right or the left hand.
- Stat>Tables>Tally
- Variables Handed
- Display Counts and Percents
Next take repeated small samples (size 10) from the population of values for the variable "handed", for which the population proportion p who are ambidextrous you have verified to be approximately 3% or .03. Our theory about the behavior of sampling distributions is for an infinite number of repetitions, but for practical purposes you will take 20 random samples altogether.

Calc>Random Data>Sample from Columns
Sample 10 rows from Column Handed
Store samples in Handedsmallsample1
Stat>Tables>Tally
Variables Handedsmallsample1
Display Counts and Percents
Create a column called "phathandedn=10" and type in the sample proportion ambidextrous (for example, .1 if the sample proportion is 10%, or 0 if the sample only contains right- and left-handed people)
Calc>Random Data>Sample from Columns
Sample 10 rows from Column Handed
Store samples in Handedsmallsample2 [The easiest way to do this is to simply change the "1" in the variable name to a "2".]
Stat>Tables>Tally
Variables Handedsmallsample2 [Again, just change the "1" to a "2".]
Display Counts and Percents
Type the second sample proportion as the second entry in the column "phathandedn=10"
Repeat the process above 20 times altogether, finishing with "Handedsmallsample20", for which the proportion who are ambidextrous will be the 20th entry in "phathandedn=10"

Finally, obtain summaries and display:

Stat>Basic Statistics>Display Descriptive Statistics
Enter the variable phathandedn=10
Graph>Stem-and-Leaf
Enter the variable phathandedn=10
Summarize the distribution of sample proportion for samples of size 10 by reporting center, spread, and shape.

Now take repeated large samples (size 40) from the population of values for the variable "handed" (20 samples altogether):

Calc>Random Data>Sample from Columns
Sample 40 rows from Column Handed
Store samples in Handedlargesample1
Stat>Tables>Tally
Variables Handedlargesample1
Display Counts and Percents
Create a column called "phathandedn=40" and type in the sample proportion who are ambidextrous (for example, .075 if the sample proportion is 7.5%)
Calc>Random Data>Sample from Columns
Sample 40 rows from Column Handed
Store samples in Handedlargesample2
Stat>Tables>Tally
Variables Handedlargesample2
Display Counts and Percents
Type the second sample proportion as the second entry in the column "phathandedn=40"
Repeat the process above 20 times altogether, finishing with "Handedlargesample20", for which the proportion who are ambidextrous will be the 20th entry in "phathandedn=40"

Finally, obtain summaries and display:

Stat>Basic Statistics>Display Descriptive Statistics
Enter the variable phathandedn=40
Graph>Stem-and-Leaf
Enter the variable phathandedn=40
Summarize the distribution of sample proportion for samples of size 40 by reporting center, spread, and shape.

Next compare the centers, spreads, and shapes for samples of size 10 vs. 40.
- Stat>Basic Statistics>Display Descriptive Statistics
- Enter the variables phathandedn=10 and phathandedn=40
- Stat>Basic Statistics>2-sample t
- Activate "Samples in Different Columns"
- Enter the variables phathandedn=10 and phathandedn=40
- Check Graph>Boxplots of Data
Are your results consistent with the theory presented in Chapter 9?
Lastly, and most importantly, compare the shapes of the distributions of sample proportion for samples of size 10 coming from "Live" vs. from "Handed" and for samples of size 40 coming from "Live" vs. from "Handed". For which population do the distributions of sample proportion for a given sample size tend to be more normal, for the variable "Live" or for the variable "Handed"?
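
To see why the shape is a concern for "Handed", it may help to work out how often a sample contains no ambidextrous students at all. The sketch below is a minimal illustration that assumes a binomial model (independent draws with p=.03), which only approximates sampling without replacement from our population; the function name `binomial_pmf` is made up for this sketch.

```python
from math import comb

p = 0.03  # approximate population proportion ambidextrous, as stated above


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent draws with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


for n in (10, 40):
    expected_ambidextrous = n * p        # expected count of successes, np
    prob_zero = binomial_pmf(0, n, p)    # chance that phat equals 0
    print(n, round(expected_ambidextrous, 1), round(prob_zero, 3))
```

When a large share of the samples have a sample proportion of exactly 0, the distribution of sample proportion piles up at its left edge and cannot look very normal.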

Extra Credit 4 Due in lecture March 12. Worth 5 pts.

Extra Credit Exercises 2 through 10 are based on student survey data survey9-21-03.txt, which is taken to be our population. To load it into MINITAB, press ctrl A to highlight all of the data, ctrl C to copy, then start up MINITAB and press ctrl V to paste. If MINITAB asks about delimiters, click OK. The purpose of this exercise is to explore how sample size affects the distribution of sample mean.

First verify that our population of quantitative values for the variable "math" has mean mu=610.44 and standard deviation sigma=72.14, and that the shape is quite normal:
- Stat>Basic Statistics>Display Descriptive Statistics
- Variables Math
- Graph>Histogram
- Variables Math
Now take repeated small samples (size 10) from the population of quantitative values for the variable "math". Our theory about the behavior of sampling distributions is for an infinite number of repetitions, but for practical purposes you will take 20 random samples altogether.

Calc>Random Data>Sample from Columns
Sample 10 rows from Column Math
Store samples in Mathsmallsample1
Stat>Basic Statistics>Display Descriptive Statistics
Variables Mathsmallsample1
Create a column called "xbarmathn=10" and type in the sample mean Math SAT score
Calc>Random Data>Sample from Columns
Sample 10 rows from Column Math
Store samples in Mathsmallsample2
Stat>Basic Statistics>Display Descriptive Statistics
Variables Mathsmallsample2
Type the second sample mean as the second entry in the column "xbarmathn=10"
Repeat the process above 20 times altogether, finishing with "Mathsmallsample20", for which the sample mean Math SAT score will be the 20th entry in "xbarmathn=10"

Finally, obtain summaries and display:

Stat>Basic Statistics>Display Descriptive Statistics
Enter the variable xbarmathn=10
Graph>Stem-and-Leaf
Enter the variable xbarmathn=10
Summarize the distribution of sample mean for samples of size 10 by reporting center, spread, and shape.

Now take repeated large samples (size 40) from the population of values for the variable "math" (20 samples altogether):

Calc>Random Data>Sample from Columns
Sample 40 rows from Column Math
Store samples in Mathlargesample1
Stat>Basic Statistics>Display Descriptive Statistics
Variables Mathlargesample1
Create a column called "xbarmathn=40" and type in the sample mean Math SAT score
Calc>Random Data>Sample from Columns
Sample 40 rows from Column Math
Store samples in Mathlargesample2
Stat>Basic Statistics>Display Descriptive Statistics
Variables Mathlargesample2
Type the second sample mean as the second entry in the column "xbarmathn=40"
Repeat the process above 20 times altogether, finishing with "Mathlargesample20", for which the sample mean Math SAT score will be the 20th entry in "xbarmathn=40"

Finally, obtain summaries and display:

Stat>Basic Statistics>Display Descriptive Statistics
Enter the variable xbarmathn=40
Graph>Stem-and-Leaf
Enter the variable xbarmathn=40
Summarize the distribution of sample mean for samples of size 40 by reporting center, spread, and shape.

Lastly, and most importantly, compare the centers, spreads, and shapes for samples of size 10 vs. 40.
- Stat>Basic Statistics>Display Descriptive Statistics
- Enter the variables xbarmathn=10 and xbarmathn=40
- Stat>Basic Statistics>2-sample t
- Activate "Samples in Different Columns"
- Enter the variables xbarmathn=10 and xbarmathn=40
- Check Graph>Boxplots of Data
Are your results consistent with the theory presented in Chapter 9? Write a paragraph to explain your answer.
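
As a point of comparison for your paragraph, the sketch below computes the standard deviation of sample mean predicted by theory, sigma divided by the square root of n, and runs a small simulation. It is only an illustration: it draws from a normal distribution with the stated mu and sigma as a stand-in for the actual "Math" column, and the function names are made up for this sketch.

```python
import random
import statistics

random.seed(2)  # arbitrary seed so the run is repeatable

MU, SIGMA = 610.44, 72.14  # population mean and standard deviation stated above


def theoretical_sd(n):
    """Standard deviation of sample mean for samples of size n."""
    return SIGMA / n ** 0.5


def sample_means(n, repetitions=20):
    """Means of `repetitions` samples of size n from a normal stand-in population."""
    return [
        statistics.mean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(repetitions)
    ]


for n in (10, 40):
    means = sample_means(n)
    # sample size, mean of the 20 sample means, their standard deviation, theory
    print(n,
          round(statistics.mean(means), 1),
          round(statistics.stdev(means), 1),
          round(theoretical_sd(n), 1))
```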

Extra Credit 5 Prerequisite: Extra Credit 4. Due in lecture March 12. Worth 5 pts.

The purpose of this exercise is to explore how population shape affects the distribution of sample mean.

First verify that our population of quantitative values for the variable "Earned" has mean mu=3.776 and standard deviation sigma=6.503 (both in thousands), and that the shape is quite skewed to the right:
- Stat>Basic Statistics>Display Descriptive Statistics
- Variables Earned
- Graph>Histogram
- Variables Earned
Now take repeated small samples (size 10) from the population of quantitative values for the variable "earned". Our theory about the behavior of sampling distributions is for an infinite number of repetitions, but for practical purposes you will take 20 random samples altogether.

Calc>Random Data>Sample from Columns
Sample 10 rows from Column Earned
Store samples in Earnedsmallsample1
Stat>Basic Statistics>Display Descriptive Statistics
Variables Earnedsmallsample1
Create a column called "xbarearnedn=10" and type in the sample mean of Earned
Calc>Random Data>Sample from Columns
Sample 10 rows from Column Earned
Store samples in Earnedsmallsample2
Stat>Basic Statistics>Display Descriptive Statistics
Variables Earnedsmallsample2
Type the second sample mean as the second entry in the column "xbarearnedn=10"
Repeat the process above 20 times altogether, finishing with "Earnedsmallsample20", for which the sample mean of Earned will be the 20th entry in "xbarearnedn=10"

Finally, obtain summaries and display:

Stat>Basic Statistics>Display Descriptive Statistics
Enter the variable xbarearnedn=10
Graph>Stem-and-Leaf
Enter the variable xbarearnedn=10
Summarize the distribution of sample mean for samples of size 10 by reporting center, spread, and shape.

Now take repeated large samples (size 40) from the population of values for the variable "earned" (20 samples altogether):

Calc>Random Data>Sample from Columns
Sample 40 rows from Column Earned
Store samples in Earnedlargesample1
Stat>Basic Statistics>Display Descriptive Statistics
Variables Earnedlargesample1
Create a column called "xbarearnedn=40" and type in the sample mean of Earned
Calc>Random Data>Sample from Columns
Sample 40 rows from Column Earned
Store samples in Earnedlargesample2
Stat>Basic Statistics>Display Descriptive Statistics
Variables Earnedlargesample2
Type the second sample mean as the second entry in the column "xbarearnedn=40"
Repeat the process above 20 times altogether, finishing with "Earnedlargesample20", for which the sample mean of Earned will be the 20th entry in "xbarearnedn=40"

Finally, obtain summaries and display:

Stat>Basic Statistics>Display Descriptive Statistics
Enter the variable xbarearnedn=40
Graph>Stem-and-Leaf
Enter the variable xbarearnedn=40
Summarize the distribution of sample mean for samples of size 40 by reporting center, spread, and shape.

Next compare the centers, spreads, and shapes for samples of size 10 vs. 40.
- Stat>Basic Statistics>Display Descriptive Statistics
- Enter the variables xbarearnedn=10 and xbarearnedn=40
- Stat>Basic Statistics>2-sample t
- Activate "Samples in Different Columns"
- Enter the variables xbarearnedn=10 and xbarearnedn=40
- Check Graph>Boxplots of Data
Are your results consistent with the theory presented in Chapter 9? Write a paragraph to explain your answer.
Lastly, and most importantly, compare shapes of the distributions of sample mean for samples of size 10 coming from "Math" vs. from "Earned", and for samples of size 40 coming from "Math" vs. from "Earned". For which population do the distributions of sample mean for a given sample size tend to be more normal, for the variable "Math" or for the variable "Earned"?
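
With only 20 sample means per stem-and-leaf plot, shape can be hard to judge by eye. The sketch below shows, with many more repetitions, how right-skewness in sample mean fades as the sample size grows. It is a rough illustration under a loud assumption: the population is a hypothetical exponential distribution with mean 3.776, chosen only because it is skewed to the right. It is not the "Earned" data and does not match its standard deviation. The function names are made up for this sketch.

```python
import random
import statistics

random.seed(3)  # arbitrary seed so the run is repeatable

POP_MEAN = 3.776  # mean of the hypothetical right-skewed stand-in population


def skewness(values):
    """Average cubed z-score; positive values indicate right skew."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def sample_means(n, repetitions):
    """Means of samples of size n from the exponential stand-in population."""
    # random.expovariate(lambd) has mean 1 / lambd
    return [
        sum(random.expovariate(1 / POP_MEAN) for _ in range(n)) / n
        for _ in range(repetitions)
    ]


skew10 = skewness(sample_means(10, 5000))
skew40 = skewness(sample_means(40, 5000))
print(round(skew10, 2), round(skew40, 2))
```

A skewness near 0 corresponds to a roughly symmetric shape, so the smaller of the two printed values belongs to the sample size whose distribution of sample mean is closer to normal.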

Extra Credit 6 Due in lecture March 30. Worth 5 pts.

Extra Credit Exercises 2 through 10 are based on student survey data survey9-21-03.txt, which is taken to be our population. To load it into MINITAB, press ctrl A to highlight all of the data, ctrl C to copy, then start up MINITAB and press ctrl V to paste. If MINITAB asks about delimiters, click OK. The purpose of this exercise is to understand the long-run behavior of confidence intervals.

First verify that the population proportion p of all survey respondents living off campus is almost exactly .5:
- Stat>Tables>Tally
- Variables Live, Check "Counts and Percents".
Next, take repeated samples (20 altogether) of size 40 from the population of categorical values for the variable "live", obtaining a 90% confidence interval each time for the "unknown" population proportion, based on each sample proportion.

Calc>Random Data>Sample from Columns
Sample 40 rows from Column Live
Store samples in Livesample1
Stat>Basic Statistics>1 proportion
Samples in columns Livesample1
Options>Confidence Level>90 and Check "Use test and interval based on normal distribution"
The first confidence interval is shown in the session window; you will need to examine all 20 intervals together once they've been produced.
Calc>Random Data>Sample from Columns
Sample 40 rows from Column Live
Store samples in Livesample2
Stat>Basic Statistics>1 proportion
Samples in columns Livesample2
[No need to specify 90% confidence and normal distribution again, since these options remain selected from the previous run.]
Repeat the process above 20 times altogether, finishing with "Livesample20".

Now examine all 20 confidence intervals. How many of them contain the actual population proportion .5? In the long run, what percentage of the 90% confidence intervals should contain p? [Note: keep all the results in your session window handy if you intend to do Extra Credit 7, which will focus on p-values rather than on confidence intervals.]
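
If you want to double-check an interval from the session window, the sketch below computes a 90% interval by hand. It assumes that the normal-based interval is the usual one, sample proportion plus or minus z times the square root of phat(1-phat)/n; the function name `proportion_ci` and the example count of 20 out of 40 are made up for this sketch.

```python
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.90):
    """Normal-based confidence interval for a population proportion."""
    phat = successes / n
    # z multiplier that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * (phat * (1 - phat) / n) ** 0.5
    return phat - margin, phat + margin


# Hypothetical sample: 20 of the 40 sampled students live off campus.
low, high = proportion_ci(20, 40)
contains_p = low <= 0.5 <= high
print(round(low, 3), round(high, 3), contains_p)
```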

Extra Credit 7 Prerequisite: Extra Credit 6. Due in lecture March 30. Worth 5 pts.

The purpose of this exercise is to understand the long-run behavior of hypothesis tests.

First recall that the population proportion p of all survey respondents living off campus is almost exactly .5.
Next examine all 20 p-values obtained in Extra Credit 6. [These were produced for the two-sided test of the null hypothesis that the population proportion p living off campus is .5.] How many of them are small enough to reject the null hypothesis at the 10% level? In the long run, what percentage of the tests should reject against the two-sided alternative at the 10% level, when the null hypothesis is in fact true?
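
The long-run rejection rate can also be worked out exactly for samples of size 40, as sketched below. This is only an illustration under stated assumptions: it treats the count living off campus as binomial with p exactly .5, and it assumes the normal-based test uses the z statistic with the null value .5 in the standard error. The function names are made up for this sketch. Because the count can only take whole-number values, the exact rate need not equal the nominal level.

```python
from math import comb
from statistics import NormalDist


def two_sided_p_value(successes, n, p0=0.5):
    """Two-sided p-value for the normal-based test of H0: p = p0."""
    phat = successes / n
    z = (phat - p0) / (p0 * (1 - p0) / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(n, p0=0.5, alpha=0.10):
    """Exact probability of rejecting H0 at level alpha when p really is p0."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(n + 1)
        if two_sided_p_value(k, n, p0) < alpha
    )


print(round(rejection_rate(40), 4))
```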

Extra Credit 8 Due in lecture April 13. Worth 5 pts.