library(yarrr) # Load yarrr for the pirates dataframe
# 1 sample t-test
# Are pirate ages different than 30 on average?
t.test(x = pirates$age,
mu = 30)
# 2 sample t-test
# Do females and males have different numbers of tattoos?
t.test(formula = tattoos ~ sex,
data = pirates,
subset = sex %in% c("male", "female"))
# Correlation test
# Is there a relationship between age and height?
cor.test(formula = ~ age + height,
data = pirates)
# Chi-Square test
# Is there a relationship between college and favorite pirate?
chisq.test(table(pirates$college,
pirates$favorite.pirate))
# Assign the results of a test to a new htest object and then acces specific outputs
age.ttest <- t.test(x = pirates$age,
mu = 30)
class(age.ttest) # sex.ttest is an htest object
names(age.ttest) # What elements are in the object?
# Now you can access specific named elements with $
age.ttest$statistic # The test statistic
age.ttest$p.value # The p-value
age.ttest$conf.int # Confidence interval for mean
knitr::include_graphics(path = "http://assets.careerspot.com.au/files/news/mischevioussurvyepenguins.jpg")
In this WPA, you’ll analyze data from a fictional survey of 100 students (in fact, you can even see the code I used to generate the data here: code to generate wpa6 data). In the survey, students were asked a variety of demographic questions (e.g.; sex, age, and country of origin), behavioral self-report questions (e.g.; have you smoked marijuana?), and completed a few cognitive tasks (e.g.; completing a logic problem). Your task is to conduct several hypothesis tests on these data.
The data are located in a tab-delimited text file at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentsurvey.txt. The data has 100 rows and 12 columns. Here are descriptions of the columns:
Name | Class | Description |
---|---|---|
sex |
character | Sex of the participant. “m” = male, “f” = female. |
age |
integer | Age of the participant. |
major |
character | Major of the participant. |
haircolor |
character | Color of the participant’s hair |
iq |
integer | Score on an IQ (intelligence) test |
country |
character | Country of origin |
logic |
numeric | Amount of time it took the participant to complete a logic problem (smaller is better) |
siblings |
integer | Number of siblings the participant has |
multitasking |
integer | Participant’s score on a multitasking task (higher is better) |
partners |
integer | Number of sexual partners that the participant has had. |
marijuana |
binary | Whether or not the participant has ever smoked marijuana (0 = “never”, 1 = “at least once”) |
risk |
binary | Would the person play a gamble with a 50% chance of losing 20CHF and a 50% chance of earning 20CHF?(0 means the participant would not play the gamble, 1 means they would) |
Table 1: Description of columns in the student survey data
Hint: Look at the answers to WPA #4 at https://ndphillips.github.io/IntroductionR_Course/assignments/wpa/wpa_4_answers.html if you get stuck
Open your class R project. Open a new script and enter your name, date, and the wpa number at the top. Save the script in the R
folder in your project working directory as wpa_6_LastFirst.R
, where Last and First are your last and first names.
The data are stored in a tab–delimited text file located at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentsurvey.txt. Using read.table()
load this data into R as a new object called survey
.
# Read tab-separated data from the web and store as a new object called survey
survey <- read.table(file = "https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentsurvey.txt", # Where is the file located?
header = TRUE, # Is there a header row?
sep = "\t") # How are columns separated?
Using head()
, str()
, and View()
look at the dataset and make sure that it was loaded correctly. If the data don’t look correct (i.e; if there isn’t a header row and 100 rows and 12 columns), then something must have gone wrong when you loaded the data. Try and fix it!
Save a local copy of the data as a tab-separated text file called studentsurvey.txt
to the data
folder of your project using write.table()
.
# Write survey to a tab-delimited text file in my data folder.
write.table(survey, # Object to be written
file = "data/studentsurvey.txt", # File location and name
sep = "\t") # How should columns be separated?
For the rest of the assignment you will be conducting, and reporting, several hypothesis tests. Please write your answers to all hypothesis test questions in proper American Pirate Association (APA) style! If your p-value is less than .01, just write p < .01.
Here is the basic structure of APA style for reporting hypothesis tests
Name | Skeleton | Example |
---|---|---|
Chi-square | X(_, N = _) = _, p = _ | X(25, N = 100) = 15.89, p = 0.92 |
T-test | t(_) = _, p = __ (_-tailed), 95% CI = (_, __) | t(99) = 3.47, p < .01 (2-tailed), 95% CI = (0.14, 0.53) |
Correlation test | r = _, t(_) = __, p = __ (_-tailed), 95% CI = (_, __) | r = 0.09, t(98) = 0.89, p = 0.37 (2-tailed), 95% CI = (-0.11, 0.28) |
For example, here is some output with the appropriate apa conclusion:
# Do pirates with headbands have different numbers of tattoos than those
# who do not wear headbands?
t.test(formula = tattoos ~ headband,
data = pirates,
alternative = "two.sided",
mu = 0,
conf.level = .95)
##
## Welch Two Sample t-test
##
## data: tattoos by headband
## t = -19.313, df = 146.73, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.878101 -4.786803
## sample estimates:
## mean in group no mean in group yes
## 4.699115 10.031567
Answer: Pirates with headbands have significantly more tattoos on average than those who do not wear headbands: t(146.73) = -19.31, p < .01 (2-tailed), 95% CI = (-5.88, -4.79)
t.test(x = survey$iq,
mu = 100)
##
## One Sample t-test
##
## data: survey$iq
## t = 20.847, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
## 109.0935 111.0065
## sample estimates:
## mean of x
## 110.05
I do find significant evidence that these participants have a mean IQ larger than the general population, t(99) = 20.85, p < .01, 95% CI = (109.09, 111.01)
t.test(x = survey$siblings,
mu = 2.5)
##
## One Sample t-test
##
## data: survey$siblings
## t = -0.9083, df = 99, p-value = 0.3659
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
## 2.181545 2.618455
## sample estimates:
## mean of x
## 2.4
I do not find significant evidence that these participants have an average number of siblings different from 2.5, t(99) = -0.91, p = 0.37, 95% CI = (2.18, 2.62)
t.test(iq ~ marijuana,
data = survey)
##
## Welch Two Sample t-test
##
## data: iq by marijuana
## t = -0.72499, df = 96.078, p-value = 0.4702
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.610103 1.213545
## sample estimates:
## mean in group 0 mean in group 1
## 109.6939 110.3922
I do not find significant evidence that the mean IQ of people who have ever smoked marijuana is different from those who have smoked marijuana, t(96.08) = -0.72, p = 0.47, 95% CI = (-2.61, 1.21)
cor.test(formula = ~ multitasking + partners,
data = survey)
##
## Pearson's product-moment correlation
##
## data: multitasking and partners
## t = -0.79444, df = 98, p-value = 0.4289
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2721353 0.1182836
## sample estimates:
## cor
## -0.07999296
I do not find significant evidence that there is a correlation between multitasking and the number of romantic partners one has had, t(98) = -0.79, p = 0.43, 95% CI = (-0.27, 0.12)
cor.test(formula = ~ iq + logic,
data = survey)
##
## Pearson's product-moment correlation
##
## data: iq and logic
## t = 1.2065, df = 98, p-value = 0.2305
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07727287 0.31003231
## sample estimates:
## cor
## 0.1209815
I do not find significant evidence that people with higher IQs perform faster on a logic test than those with lower IQs, t(98) = 1.21, p = 0.23, 95% CI = (-0.08, 0.31)
chisq.test(table(survey$major))
##
## Chi-squared test for given probabilities
##
## data: table(survey$major)
## X-squared = 130.96, df = 3, p-value < 2.2e-16
I do find significant evidence that some majors are more common than others, X(3, N = 100) = 130.96, p < 0.01
chisq.test(table(survey$risk))
##
## Chi-squared test for given probabilities
##
## data: table(survey$risk)
## X-squared = 17.64, df = 1, p-value = 2.669e-05
table(survey$risk) # To see see which value was more likely!
##
## 0 1
## 71 29
I do find significant evidence that students were LESS likely to take a risk than to not take a risk, X(1, N = 100) = 17.64, p < 0.01
chisq.test(table(survey$major, survey$haircolor))
##
## Pearson's Chi-squared test
##
## data: table(survey$major, survey$haircolor)
## X-squared = 5.9227, df = 9, p-value = 0.7476
I do not find significant evidence that there is a relationship between hair color and students’ academic major, X(9, N = 100) = 5.92, p = 0.75
chisq.test(table(survey$marijuana, survey$risk))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(survey$marijuana, survey$risk)
## X-squared = 0.56829, df = 1, p-value = 0.4509
I do NOT find significant evidence that there is a relationship between marijuana use and people’s decision to accept a risky gamble, X(1, N = 100) = 0.57, p = 0.45
t.test(formula = partners ~ sex,
data = survey)
##
## Welch Two Sample t-test
##
## data: partners by sex
## t = 0.9219, df = 93.765, p-value = 0.3589
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7615073 2.0815073
## sample estimates:
## mean in group f mean in group m
## 7.54 6.88
I do NOT find significant evidence that men and women have different numbers of sexual partners on average, t(93.77) = 0.92, p = 0.36
chisq.test(table(survey$marijuana, survey$sex))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(survey$marijuana, survey$sex)
## X-squared = 0.16006, df = 1, p-value = 0.6891
I do NOT find significant evidence that men and women differ in their likelihood to smoke marijuana, X(1, N = 100) = 0.16, p = 0.69
t.test(formula = logic ~ marijuana,
data = survey)
##
## Welch Two Sample t-test
##
## data: logic by marijuana
## t = 0.95997, df = 90.88, p-value = 0.3396
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4373672 1.2554465
## sample estimates:
## mean in group 0 mean in group 1
## 9.945510 9.536471
I do NOT find significant evidence that people who have smoked marijuana have different logic scores than those who have never smoked marijuana, t(90.88) = 0.96, p = 0.34
In the next few questions, we’ll explore Anscombe’s famous data quartet. This famous dataset will show you the dangers of interpreting statistical tests (like a correlation test), without first plotting the data!
anscombe
dataframe. This dataframe contains 11 pairs of data for four different datasets: A, B, C and D# JUST COPY, PASTE, AND RUN!
anscombe <- data.frame(x = c(c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)),
y = c(c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 4.68),
c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74),
c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89)),
set = rep(c("A", "B", "C", "D"), each = 11))
anscombe
dataframe to see how it is structured. Then, calculate the correlation between x and y separately for each dataset. That is, what is the correlation between x and y for dataset A? What about for datasets B, C and then D? You don’t need to report full APA style for these tests, just make a note of the correlation coefficients. What do you notice about the correlation coefficients for each dataset? Would you conclude that these datasets are very similar or very different?# Correlation for A
cor.test(formula = ~ x + y,
data = anscombe,
subset = set == "A")
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 4.4844, df = 9, p-value = 0.001523
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4612713 0.9549196
## sample estimates:
## cor
## 0.8311601
# Correlation for B
cor.test(formula = ~ x + y,
data = anscombe,
subset = set == "B")
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 4.2386, df = 9, p-value = 0.002179
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4239389 0.9506402
## sample estimates:
## cor
## 0.8162365
# Correlation for C
cor.test(formula = ~ x + y,
data = anscombe,
subset = set == "C")
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 4.2394, df = 9, p-value = 0.002176
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4240623 0.9506547
## sample estimates:
## cor
## 0.8162867
# Correlation for D
cor.test(formula = ~ x + y,
data = anscombe,
subset = set == "D")
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 4.243, df = 9, p-value = 0.002165
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4246394 0.9507224
## sample estimates:
## cor
## 0.8165214
All of the correlations are identical at r = 0.83!
# JUST COPY, PASTE, AND RUN!
library(ggplot2) # Load ggplot2
ggplot(data = anscombe,
aes(x = x, y = y, col = set)) +
geom_point(size = 2) +
facet_wrap(~set) +
labs(title = "Anscombe's Quartet",
subtitle = "Always look at your data before conducting hypothesis tests!",
caption = "Anscombe Wikipedia page: https://en.wikipedia.org/wiki/Anscombe%27s_quartet") +
theme_bw() +
guides(colour = "none")
What you have just seen is the famous Anscombe’s quartet a dataset designed to show you how important is to always plot your data before running a statistical test!!! You can see more at the wikipedia page here: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
cor.test(~ iq + logic,
data = survey)
##
## Pearson's product-moment correlation
##
## data: iq and logic
## t = 1.2065, df = 98, p-value = 0.2305
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07727287 0.31003231
## sample estimates:
## cor
## 0.1209815
I do not find significant evidence that people with higher IQs perform better on a logic test than those with lower IQs, t(98) = 1.21, p = 0.23, 95% CI = (-0.08, 0.31)
survey_germans <- subset(survey, country == "germany")
chisq.test(table(survey_germans$marijuana))
##
## Chi-squared test for given probabilities
##
## data: table(survey_germans$marijuana)
## X-squared = 0.034483, df = 1, p-value = 0.8527
I do NOT find significant evidence that Germans are more likely than not to have smoked marijuana, X(1, N = 29) = 0.03, p = 0.85
subset()
argument to tell R which two groups you want to compare)t.test(formula = iq ~ haircolor,
data = survey,
subset = haircolor %in% c("blonde", "brown"))
##
## Welch Two Sample t-test
##
## data: iq by haircolor
## t = 0.50574, df = 59.053, p-value = 0.6149
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.757470 2.946358
## sample estimates:
## mean in group blonde mean in group brown
## 109.7500 109.1556
I do NOT find significant evidence that the mean IQ of people with blonde hair is different from those with brown hair, t(59.1) = 0.51, p = 0.61, 95% CI = (-1.76, 2.95)
cor.test(formula = ~age + iq,
data = survey,
subset = country == "switzerland")
##
## Pearson's product-moment correlation
##
## data: age and iq
## t = 1.1545, df = 58, p-value = 0.253
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1081605 0.3890006
## sample estimates:
## cor
## 0.1498806
I do NOT find evidence for a correlation between age and IQ in students from Switzerland, t(58) = 1.15, p = 0.25, 95% CI = (-0.11, 0.39)
cor.test(formula = ~ iq + logic,
data = survey,
subset = risk == 1 & marijuana == 0)
##
## Pearson's product-moment correlation
##
## data: iq and logic
## t = 1.4249, df = 10, p-value = 0.1846
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2134024 0.7968450
## sample estimates:
## cor
## 0.4108122
I do NOT find evidence for a correlation between IQ and logic for people who chose the risky gambe and have never tried marijuana, t(10) = 1.42, p = 0.18, 95% CI = (-0.21, 0.80)
x
and y
? Test your prediction by conducting the appropriate test (hint: you may want to put the vectors into a dataframe by running df <- data.frame(x = x, y = y)
before running the correlation test),x <- rnorm(n = 100, mean = 10, sd = 10)
y <- rnorm(n = 100, mean = 50, sd = 10)
df <- data.frame(x = x, y = y)
cor.test(formula = ~ y + x,
data = df)
##
## Pearson's product-moment correlation
##
## data: y and x
## t = -0.28626, df = 98, p-value = 0.7753
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2240506 0.1684700
## sample estimates:
## cor
## -0.02890455
# Nope not significant!
x <- rnorm(n = 100, mean = 10, sd = 10)
y <- -5 * x
df <- data.frame(x = x, y = y)
cor.test(formula = ~ y + x,
data = df)
##
## Pearson's product-moment correlation
##
## data: y and x
## t = -Inf, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -1 -1
## sample estimates:
## cor
## -1
# A perfect negative correlation!
x <- rnorm(n = 100, mean = 10, sd = 10)
y <- x + rnorm(n = 100, mean = 50, sd = 10)
df <- data.frame(x = x, y = y)
cor.test(formula = ~ y + x,
data = df)
##
## Pearson's product-moment correlation
##
## data: y and x
## t = 9.0889, df = 98, p-value = 1.148e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5534003 0.7703865
## sample estimates:
## cor
## 0.6763054
# A large positive correlation!
Let’s run some simulations to see what p-values really mean. We’ll do this by drawing repeated samples from a world when the null hypothesis H0 is True, and other samples from a world when the null hypothesis is False, that is, when the alternative hypothesis H1 is True).
sample.H0
. Then, we’ll conduct a one-sample t-test to test if the true population mean is really 0:# Generate a vector of 20 samples from a normal distribution with mu = 0
# Here, the null hypothesis H0: Mu = 0 IS true
sample.H0 <- rnorm(n = 20, mean = 0, sd = 1)
# Now look at the p-values from one-sample t-tests of each object
t.test(x = sample.H0, mu = 0, alternative = "two.sided")$p.value
## [1] 0.8422217
Copy and paste your code above several times and see what happens! As you’ll see, the value keeps changing, this is because you’re getting new samples each time. How often do you get a ‘significant’ p-value?
Now, let’s repeat the process in a world where the null hypothesis is False (that is, the true mean is 0.5). Run the following code several times to see what the p-values look like
# Generate a vector of 20 samples from a normal distribution with mu = 0.5
# Here, the null hypothesis H0: Mu = 0 is FALSE
sample.H1 <- rnorm(n = 20, mean = 0.5, sd = 1)
# Now look at the p-values from one-sample t-tests of each object
t.test(x = sample.H1, mu = 0, alternative = "two.sided")$p.value
## [1] 0.007164242
It’s time to put it all together. In the code below, you’ll run a simulation where you repeat the process above 100 times for the H0 world, and 100 times for the H1 world
Visualize the distribution of p values from the simulation separately for each world (for example, you could create a boxplot)
ggplot(data = simulation,
aes(x = world, y = p)) +
geom_boxplot()
t.test(formula = p ~ world,
data = simulation)
##
## Welch Two Sample t-test
##
## data: p by world
## t = 10.353, df = 182.65, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3093399 0.4550016
## sample estimates:
## mean in group H0 mean in group H1
## 0.5098306 0.1276598
I do find evidence for a significant difference in p-values in the H0 world and the H1 world, t(182.65) = 10.35, p < .01, 95% CI = (0.31, 0.46)
aggregate()
or dplyr
).library(dplyr)
simulation %>%
filter(world == "H0") %>%
summarise(mean(p < .05))
## mean(p < 0.05)
## 1 0.03
# About 3% of the time (pretty close to 5%)
simulation %>%
filter(p < .05) %>%
summarise(mean(world == "H1"))
## mean(world == "H1")
## 1 0.9491525
# About 95% of the time
simulation %>%
filter(p < .05) %>%
summarise(mean(world == "H1"))
## mean(world == "H1")
## 1 0.5789474
# Only about half the time!
It depends on the prior probability that the null hypothesis was true in the first place!
wpa_6_LastFirst.R
file to me at nathaniel.phillips@unibas.ch.