library(yarrr) # Load yarrr for the pirates dataframe
# 1 sample t-test
# Are pirate ages different than 30 on average?
t.test(x = pirates$age,
mu = 30)
# 2 sample t-test
# Do females and males have different numbers of tattoos?
t.test(formula = tattoos ~ sex,
data = pirates,
subset = sex %in% c("male", "female"))
# Correlation test
# Is there a relationship between age and height?
cor.test(formula = ~ age + height,
data = pirates)
# Chi-Square test
# Is there a relationship between college and favorite pirate?
chisq.test(table(pirates$college,
pirates$favorite.pirate))
# Assign the results of a test to a new htest object and then acces specific outputs
age.ttest <- t.test(x = pirates$age,
mu = 30)
class(age.ttest) # sex.ttest is an htest object
names(age.ttest) # What elements are in the object?
# Now you can access specific named elements with $
age.ttest$statistic # The test statistic
age.ttest$p.value # The p-value
age.ttest$conf.int # Confidence interval for mean
In this WPA, you’ll analyze data from a fictional survey of 100 students (in fact, you can even see the code I used to generate the data here: code to generate wpa6 data). In the survey, students were asked a variety of demographic questions (e.g.; sex, age, and country of origin), behavioral self-report questions (e.g.; have you smoked marijuana?), and completed a few cognitive tasks (e.g.; completing a logic problem). Your task is to conduct several hypothesis tests on these data.
The data are located in a tab-delimited text file at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentsurvey.txt. The data has 100 rows and 12 columns. Here are descriptions of the columns:
Name | Class | Description |
---|---|---|
sex |
character | Sex of the participant. “m” = male, “f” = female. |
age |
integer | Age of the participant. |
major |
character | Major of the participant. |
haircolor |
character | Color of the participant’s hair |
iq |
integer | Score on an IQ (intelligence) test |
country |
character | Country of origin |
logic |
numeric | Amount of time it took the participant to complete a logic problem (smaller is better) |
siblings |
integer | Number of siblings the participant has |
multitasking |
integer | Participant’s score on a multitasking task (higher is better) |
partners |
integer | Number of sexual partners that the participant has had. |
marijuana |
binary | Whether or not the participant has ever smoked marijuana (0 = “never”, 1 = “at least once”) |
risk |
binary | Would the person play a gamble with a 50% chance of losing 20CHF and a 50% chance of earning 20CHF?(0 means the participant would not play the gamble, 1 means they would) |
Table 1: Description of columns in the student survey data
Hint: Look at the answers to WPA #4 at https://ndphillips.github.io/IntroductionR_Course/assignments/wpa/wpa_4_answers.html if you get stuck
Open your class R project. Open a new script and enter your name, date, and the wpa number at the top. Save the script in the R
folder in your project working directory as wpa_6_LastFirst.R
, where Last and First are your last and first names.
The data are stored in a tab–delimited text file located at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentsurvey.txt. Using read.table()
load this data into R as a new object called survey
.
Using head()
, str()
, and View()
look at the dataset and make sure that it was loaded correctly. If the data don’t look correct (i.e; if there isn’t a header row and 100 rows and 12 columns), then something must have gone wrong when you loaded the data. Try and fix it!
Save a local copy of the data as a tab-separated text file called studentsurvey.txt
to the data
folder of your project using write.table()
.
For the rest of the assignment you will be conducting, and reporting, several hypothesis tests. Please write your answers to all hypothesis test questions in proper American Pirate Association (APA) style! If your p-value is less than .01, just write p < .01.
Here is the basic structure of APA style for reporting hypothesis tests
Name | Skeleton | Example |
---|---|---|
Chi-square | X(_) = _, p = __ | X(25) = 15.89, p = 0.92 |
T-test | t(_) = _, p = __ (_-tailed), 95% CI = (_, __) | t(99) = 3.47, p < .01 (2-tailed), 95% CI = (0.14, 0.53) |
Correlation test | r = _, t(_) = __, p = __ (_-tailed), 95% CI = (_, __) | r = 0.09, t(98) = 0.89, p = 0.37 (2-tailed), 95% CI = (-0.11, 0.28) |
For example, here is some output with the appropriate apa conclusion:
# Do pirates with headbands have different numbers of tattoos than those
# who do not wear headbands?
t.test(formula = tattoos ~ headband,
data = pirates,
alternative = "two.sided",
mu = 0,
conf.level = .95)
##
## Welch Two Sample t-test
##
## data: tattoos by headband
## t = -19.313, df = 146.73, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.878101 -4.786803
## sample estimates:
## mean in group no mean in group yes
## 4.699115 10.031567
Answer: Pirates with headbands have significantly more tattoos on average than those who do not wear headbands: t(146.73) = -19.31, p < .01 (2-tailed), 95% CI = (-5.88, -4.79)
The average global IQ is known to be 100. Do the participants have an IQ different from the general population? Answer this with a one-sample t-test.
A friend of yours claims that students have 2.5 siblings on average. Test this claim with a one-sample t-test.
Do students that have smoked marijuana have different IQ levels than those who have never smoked marijuana? Test this claim with a two-sample t-test (you can either use the vector or the formula notation for a t-test)
Do students with higher multitasking skills tend to have more romantic partners than those with lower multitasking skills? Test this with a correlation test:
Do people with higher IQs perform faster on the logic test? Answer this question with a correlation test.
Are some majors more popular than others? Answer this question with a one-sample chi-square test.
In general, were students more likely to take a risk than not? Answer this question with a one-sample chi-square test
Is there a relationship between hair color and students’ academic major? Answer this with a two-sample chi-square test
Is there a relationship between whether a student has ever smoked marijuana and his/her decision to accept or reject the risky gamble?
Do males and females have different numbers of sexual partners on average?
Do males and females differ in how likely they are to have smoked marijuana?
Do people who have smoked marijuana have different logic scores on average than those who never have smoked marijuana?
In the next few questions, we’ll explore Anscombe’s famous data quartet. This famous dataset will show you the dangers of interpreting statistical tests (like a correlation test), without first plotting the data!
anscombe
dataframe. This dataframe contains 11 pairs of data for four different datasets: A, B, C and D# JUST COPY, PASTE, AND RUN!
anscombe <- data.frame(x = c(c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)),
y = c(c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 4.68),
c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74),
c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89)),
set = rep(c("A", "B", "C", "D"), each = 11))
Look at the anscombe
dataframe to see how it is structured. Then, calculate the correlation between x and y separately for each dataset. That is, what is the correlation between x and y for dataset A? What about for datasets B, C and then D? You don’t need to report full APA style for these tests, just make a note of the correlation coefficients. What do you notice about the correlation coefficients for each dataset? Would you conclude that these datasets are very similar or very different?
Now it’s time to actually look at the data and see if your prediction holds up! Run the following code to generate a scatterplot of each data pair, what do you find?
# JUST COPY, PASTE, AND RUN!
library(ggplot2) # Load ggplot2
ggplot(data = anscombe,
aes(x = x, y = y, col = set)) +
geom_point(size = 2) +
facet_wrap(~set) +
labs(title = "Anscombe's Quartet",
subtitle = "Always look at your data before conducting hypothesis tests!",
caption = "Anscombe Wikipedia page: https://en.wikipedia.org/wiki/Anscombe%27s_quartet") +
theme_bw() +
guides(colour = "none")
What you have just seen is the famous Anscombe’s quartet a dataset designed to show you how important is to always plot your data before running a statistical test!!! You can see more at the wikipedia page here: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Do people with higher iq scores tend to perform better on the logic test that those with lower iq scores?
Are Germans more likely than not to have tried marijuana? (Hint: this is a one-sample chi-square test on a subset of the original data).
Does the IQ of people with brown hair differ from blondes? (Hint: This is a two-sample t-test that requires you to use the subset()
argument to tell R which two groups you want to compare)
Only for men from Switzerland, is there a relationship between age and IQ?
Only for people who chose the risky gamble and have never tried marijuana, is there a relationship between iq and performance on the logic test?
x
and y
? Test your prediction by conducting the appropriate test (hint: you may want to put the vectors into a dataframe by running df <- data.frame(x = x, y = y)
before running the correlation test),x <- rnorm(n = 100, mean = 10, sd = 10)
y <- rnorm(n = 100, mean = 50, sd = 10)
x <- rnorm(n = 100, mean = 10, sd = 10)
y <- -5 * x
x <- rnorm(n = 100, mean = 10, sd = 10)
y <- x + rnorm(n = 100, mean = 50, sd = 10)
Let’s run some simulations to see what p-values really mean. We’ll do this by drawing repeated samples from a world when the null hypothesis H0 is True, and other samples from a world when the null hypothesis is False, that is, when the alternative hypothesis H1 is True).
sample.H0
. Then, we’ll conduct a one-sample t-test to test if the true population mean is really 0:# Generate a vector of 20 samples from a normal distribution with mu = 0
# Here, the null hypothesis H0: Mu = 0 IS true
sample.H0 <- rnorm(n = 20, mean = 0, sd = 1)
# Now look at the p-values from one-sample t-tests of each object
t.test(x = sample.H0, mu = 0, alternative = "two.sided")$p.value
Copy and paste your code above several times and see what happens! As you’ll see, the value keeps changing, this is because you’re getting new samples each time. How often do you get a ‘significant’ p-value?
Now, let’s repeat the process in a world where the null hypothesis is False (that is, the true mean is 0.5). Run the following code several times to see what the p-values look like
# Generate a vector of 20 samples from a normal distribution with mu = 0.5
# Here, the null hypothesis H0: Mu = 0 is FALSE
sample.H1 <- rnorm(n = 20, mean = 0.5, sd = 1)
# Now look at the p-values from one-sample t-tests of each object
t.test(x = sample.H1, mu = 0, alternative = "two.sided")$p.value
It’s time to put it all together. In the code below, you’ll run a simulation where you repeat the process above 100 times for the H0 world, and 100 times for the H1 world
Visualize the distribution of p values from the simulation separately for each world (for example, you could create a boxplot)
Is the mean p-value from the H0 world significantly different from the mean p-value from the H1 world? Conduct the appropriate test!
Use your simulation to answer the following question: “Given that the null hypothesis is true, what is the probability of obtaining a p-value less than 0.05?” In other words, for all of your simulations in the H1 world, what percent resulted in a p-value less than 0.05? (Hint: use aggregate()
or dplyr
).
Now use your simulation to answer a slightly different question: “Given that a p-value is less than .05, what is the probability that the null hypothesis is really False?” In other words, for all the simulations that resulted in a p-value less than 0.05, what proportion were in the H1 world (and not in the H0 world)?
Now, run the simulation again, but this time instead of performing 100 simulations from both worlds, do 1,000 simulations from the H0 world, and 100 from the H1 world. Based on this simulation, now make the same calculations as the previous two questions. That is, “Given that the null hypothesis is true, what is the probability of obtaining a p-value less than 0.05?” and “Given that a p-value is less than 0.05, what is the probability that the null hypothesis is False?” Are you answers the same or different from before?
Based on what you’ve learned, what is the correct answer to the general question: “Given that the p-value from a hypothesis test is less than 0.05, what is the probability that the null hypothesis is True?”
wpa_6_LastFirst.R
file to me at nathaniel.phillips@unibas.ch.