Readings

This assignment is based on the following readings:

Assignment Goals

Examples

library(yarrr) # Load yarrr for the pirates dataframe

# 1 sample t-test

# Are pirate ages different than 30 on average?
t.test(x = pirates$age, 
       mu = 30)

# 2 sample t-test

# Do females and males have different numbers of  tattoos?
t.test(formula = tattoos ~ sex,
       data = pirates, 
       subset = sex %in% c("male", "female"))

# Correlation test

# Is there a relationship between age and height?
 cor.test(formula = ~ age + height,
          data = pirates)

# Chi-Square test
 
# Is there a relationship between college and favorite pirate?
chisq.test(table(pirates$college,
                 pirates$favorite.pirate))

# Assign the results of a test to a new htest object and then acces specific outputs

age.ttest <- t.test(x = pirates$age, 
                   mu = 30)

class(age.ttest)    # sex.ttest is an htest object
names(age.ttest)      # What elements are in the object?

# Now you can access specific named elements with $

age.ttest$statistic   # The test statistic
age.ttest$p.value     # The p-value
age.ttest$conf.int    # Confidence interval for mean

Analyzing results from a survey

In this WPA, you’ll analyze data from a fictional survey of 100 students (in fact, you can even see the code I used to generate the data here: code to generate wpa6 data). In the survey, students were asked a variety of demographic questions (e.g.; sex, age, and country of origin), behavioral self-report questions (e.g.; have you smoked marijuana?), and completed a few cognitive tasks (e.g.; completing a logic problem). Your task is to conduct several hypothesis tests on these data.

Data Description

The data are located in a tab-delimited text file at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentsurvey.txt. The data has 100 rows and 12 columns. Here are descriptions of the columns:

Name Class Description
sex character Sex of the participant. “m” = male, “f” = female.
age integer Age of the participant.
major character Major of the participant.
haircolor character Color of the participant’s hair
iq integer Score on an IQ (intelligence) test
country character Country of origin
logic numeric Amount of time it took the participant to complete a logic problem (smaller is better)
siblings integer Number of siblings the participant has
multitasking integer Participant’s score on a multitasking task (higher is better)
partners integer Number of sexual partners that the participant has had.
marijuana binary Whether or not the participant has ever smoked marijuana (0 = “never”, 1 = “at least once”)
risk binary Would the person play a gamble with a 50% chance of losing 20CHF and a 50% chance of earning 20CHF?(0 means the participant would not play the gamble, 1 means they would)

Table 1: Description of columns in the student survey data

Load the data into R and saving a local copy as a .txt file

Hint: Look at the answers to WPA #4 at https://ndphillips.github.io/IntroductionR_Course/assignments/wpa/wpa_4_answers.html if you get stuck

  1. Open your class R project. Open a new script and enter your name, date, and the wpa number at the top. Save the script in the R folder in your project working directory as wpa_6_LastFirst.R, where Last and First are your last and first names.

  2. The data are stored in a tab–delimited text file located at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentsurvey.txt. Using read.table() load this data into R as a new object called survey.

  3. Using head(), str(), and View() look at the dataset and make sure that it was loaded correctly. If the data don’t look correct (i.e; if there isn’t a header row and 100 rows and 12 columns), then something must have gone wrong when you loaded the data. Try and fix it!

  4. Save a local copy of the data as a tab-separated text file called studentsurvey.txt to the data folder of your project using write.table().

How to report hypothesis tests

For the rest of the assignment you will be conducting, and reporting, several hypothesis tests. Please write your answers to all hypothesis test questions in proper American Pirate Association (APA) style! If your p-value is less than .01, just write p < .01.

Here is the basic structure of APA style for reporting hypothesis tests

Name Skeleton Example
Chi-square X(_) = _, p = __ X(25) = 15.89, p = 0.92
T-test t(_) = _, p = __ (_-tailed), 95% CI = (_, __) t(99) = 3.47, p < .01 (2-tailed), 95% CI = (0.14, 0.53)
Correlation test r = _, t(_) = __, p = __ (_-tailed), 95% CI = (_, __) r = 0.09, t(98) = 0.89, p = 0.37 (2-tailed), 95% CI = (-0.11, 0.28)

For example, here is some output with the appropriate apa conclusion:

# Do pirates with headbands have different numbers of tattoos than those
#  who do not wear headbands?

t.test(formula = tattoos ~ headband, 
       data = pirates,
       alternative = "two.sided",
       mu = 0,
       conf.level = .95)
## 
##  Welch Two Sample t-test
## 
## data:  tattoos by headband
## t = -19.313, df = 146.73, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.878101 -4.786803
## sample estimates:
##  mean in group no mean in group yes 
##          4.699115         10.031567

Answer: Pirates with headbands have significantly more tattoos on average than those who do not wear headbands: t(146.73) = -19.31, p < .01 (2-tailed), 95% CI = (-5.88, -4.79)

t-test

  1. The average global IQ is known to be 100. Do the participants have an IQ different from the general population? Answer this with a one-sample t-test.

  2. A friend of yours claims that students have 2.5 siblings on average. Test this claim with a one-sample t-test.

  3. Do students that have smoked marijuana have different IQ levels than those who have never smoked marijuana? Test this claim with a two-sample t-test (you can either use the vector or the formula notation for a t-test)

correlation test

  1. Do students with higher multitasking skills tend to have more romantic partners than those with lower multitasking skills? Test this with a correlation test:

  2. Do people with higher IQs perform faster on the logic test? Answer this question with a correlation test.

chi-square test

  1. Are some majors more popular than others? Answer this question with a one-sample chi-square test.

  2. In general, were students more likely to take a risk than not? Answer this question with a one-sample chi-square test

  3. Is there a relationship between hair color and students’ academic major? Answer this with a two-sample chi-square test

You pick the test!

  1. Is there a relationship between whether a student has ever smoked marijuana and his/her decision to accept or reject the risky gamble?

  2. Do males and females have different numbers of sexual partners on average?

  3. Do males and females differ in how likely they are to have smoked marijuana?

  4. Do people who have smoked marijuana have different logic scores on average than those who never have smoked marijuana?

Checkpoint!!!

Anscombe’s Famous data quartet

In the next few questions, we’ll explore Anscombe’s famous data quartet. This famous dataset will show you the dangers of interpreting statistical tests (like a correlation test), without first plotting the data!

  1. Run the following code to create the anscombe dataframe. This dataframe contains 11 pairs of data for four different datasets: A, B, C and D
# JUST COPY, PASTE, AND RUN!

anscombe <- data.frame(x = c(c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                             c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                             c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
                             c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)),
                       y = c(c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 4.68),
                             c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74),
                             c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
                             c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89)),
                       set = rep(c("A", "B", "C", "D"), each = 11))
  1. Look at the anscombe dataframe to see how it is structured. Then, calculate the correlation between x and y separately for each dataset. That is, what is the correlation between x and y for dataset A? What about for datasets B, C and then D? You don’t need to report full APA style for these tests, just make a note of the correlation coefficients. What do you notice about the correlation coefficients for each dataset? Would you conclude that these datasets are very similar or very different?

  2. Now it’s time to actually look at the data and see if your prediction holds up! Run the following code to generate a scatterplot of each data pair, what do you find?

# JUST COPY, PASTE, AND RUN!
library(ggplot2)    # Load ggplot2

ggplot(data = anscombe,
       aes(x = x, y = y, col = set)) + 
  geom_point(size = 2) + 
  facet_wrap(~set) + 
  labs(title = "Anscombe's Quartet", 
       subtitle = "Always look at your data before conducting hypothesis tests!",
       caption = "Anscombe Wikipedia page: https://en.wikipedia.org/wiki/Anscombe%27s_quartet") +
  theme_bw() +
  guides(colour = "none")

What you have just seen is the famous Anscombe’s quartet a dataset designed to show you how important is to always plot your data before running a statistical test!!! You can see more at the wikipedia page here: https://en.wikipedia.org/wiki/Anscombe%27s_quartet

You pick the test!

  1. Do people with higher iq scores tend to perform better on the logic test that those with lower iq scores?

  2. Are Germans more likely than not to have tried marijuana? (Hint: this is a one-sample chi-square test on a subset of the original data).

  3. Does the IQ of people with brown hair differ from blondes? (Hint: This is a two-sample t-test that requires you to use the subset() argument to tell R which two groups you want to compare)

  4. Only for men from Switzerland, is there a relationship between age and IQ?

  5. Only for people who chose the risky gamble and have never tried marijuana, is there a relationship between iq and performance on the logic test?

Predict the correlation

  1. What do you predict the correlation will be between the following two vectors x and y? Test your prediction by conducting the appropriate test (hint: you may want to put the vectors into a dataframe by running df <- data.frame(x = x, y = y) before running the correlation test),
x <- rnorm(n = 100, mean = 10, sd = 10)
y <- rnorm(n = 100, mean = 50, sd = 10)
  1. What about the correlation between the following two vectors? Test your prediction by conducting the appropriate test.
x <- rnorm(n = 100, mean = 10, sd = 10)
y <- -5 * x
  1. What about these? Test your prediction by conducting the appropriate test.
x <- rnorm(n = 100, mean = 10, sd = 10)
y <- x + rnorm(n = 100, mean = 50, sd = 10)

Conducting a simulation of p-values

Let’s run some simulations to see what p-values really mean. We’ll do this by drawing repeated samples from a world when the null hypothesis H0 is True, and other samples from a world when the null hypothesis is False, that is, when the alternative hypothesis H1 is True).

  1. First, let’s draw a random sample from the world when the null hypothesis H0 is True. We’ll save the results as sample.H0. Then, we’ll conduct a one-sample t-test to test if the true population mean is really 0:
# Generate a vector of 20 samples from a normal distribution with mu = 0
#  Here, the null hypothesis H0: Mu = 0 IS true

sample.H0 <- rnorm(n = 20, mean = 0, sd = 1)


# Now look at the p-values from one-sample t-tests of each object
t.test(x = sample.H0, mu = 0, alternative = "two.sided")$p.value
  1. Copy and paste your code above several times and see what happens! As you’ll see, the value keeps changing, this is because you’re getting new samples each time. How often do you get a ‘significant’ p-value?

  2. Now, let’s repeat the process in a world where the null hypothesis is False (that is, the true mean is 0.5). Run the following code several times to see what the p-values look like

# Generate a vector of 20 samples from a normal distribution with mu = 0.5
#  Here, the null hypothesis H0: Mu = 0 is FALSE

sample.H1 <- rnorm(n = 20, mean = 0.5, sd = 1)

# Now look at the p-values from one-sample t-tests of each object
t.test(x = sample.H1, mu = 0, alternative = "two.sided")$p.value
  1. It’s time to put it all together. In the code below, you’ll run a simulation where you repeat the process above 100 times for the H0 world, and 100 times for the H1 world

  2. Visualize the distribution of p values from the simulation separately for each world (for example, you could create a boxplot)

  3. Is the mean p-value from the H0 world significantly different from the mean p-value from the H1 world? Conduct the appropriate test!

  4. Use your simulation to answer the following question: “Given that the null hypothesis is true, what is the probability of obtaining a p-value less than 0.05?” In other words, for all of your simulations in the H1 world, what percent resulted in a p-value less than 0.05? (Hint: use aggregate() or dplyr).

  5. Now use your simulation to answer a slightly different question: “Given that a p-value is less than .05, what is the probability that the null hypothesis is really False?” In other words, for all the simulations that resulted in a p-value less than 0.05, what proportion were in the H1 world (and not in the H0 world)?

  6. Now, run the simulation again, but this time instead of performing 100 simulations from both worlds, do 1,000 simulations from the H0 world, and 100 from the H1 world. Based on this simulation, now make the same calculations as the previous two questions. That is, “Given that the null hypothesis is true, what is the probability of obtaining a p-value less than 0.05?” and “Given that a p-value is less than 0.05, what is the probability that the null hypothesis is False?” Are you answers the same or different from before?

  7. Based on what you’ve learned, what is the correct answer to the general question: “Given that the p-value from a hypothesis test is less than 0.05, what is the probability that the null hypothesis is True?”

Submit!