Readings

This assignment is based on the following readings:

YaRrr: 8
Videos: E, F

Assignment Goals

Create dataframes.
Index dataframes with $, [,], and subset()
Change column names with names() and <-

Examples

# Create a dataframe called study

study <- data.frame(id = c(1:8),
                    sex = c("m", "f", "m", "m", "m", "f", "m", "x34"),
                    age = c(28, 24, 19, 23, 42, 32, 27, 24),
                    eyecolor = c("blue", "brown", "brown", "green", "blue", "brown", "blue", "green"),
                    group = c(1, 1, 1, 1, 2, 2, 2, 2),
                    score = c(78, 65, 94, 92, 84, 86, 92, 86),
                    stringsAsFactors = FALSE)

# Summary statistics from specific columns

mean(study$age)         # Mean age
table(study$sex)        # Counts of each sex
mean(study$sex == "m")  # Percent that are men
mean(study$eyecolor %in% c("blue", "brown")) # Percent of eye colors that are blue or brown

# Indexing

study[1:5,]                           # First 5 rows
study[6:8, c("id", "sex", "score")]   # Rows 6-8 (study has only 8 rows) and columns id, sex and score

# Subsetting

study_men <- subset(study, sex == "m")
study_g1 <- subset(study, group == 1)
study_g2 <- subset(study, group == 2)

# Different ways to do the same subsetting

# Q: What is the mean score of group 2?

study_g2 <- subset(study, group == 2)   # Method 1A: Create study_g2 dataframe
mean(study_g2$score)                    #        1B: Calculate mean of study_g2$score

mean(subset(study, group == 2)$score)        # Method 2: Same as method 1 but in one step
with(subset(study, group == 2), mean(score)) # Method 3: Using with() and subset() 
mean(study$score[study$group == 2])          # Method 4: Using []

# Q: What percent of women over the age of 20 had brown eyes?

study.women <- subset(study, sex == "f" & age > 20)           #  Method 1A: 
mean(study.women$eyecolor == "brown")                         #         1B: 

mean(subset(study, sex == "f" & age > 20)$eyecolor == "brown")         # Method 2: 
with(subset(study, sex == "f" & age > 20), mean(eyecolor == "brown"))  # Method 3: 
mean(study$eyecolor[study$sex == "f" & study$age > 20] == "brown")     # Method 4: 

# Changing values of a vector in a dataframe

# Change sex values that are NOT f or m to NA
study$sex[study$sex %in% c("f", "m") == FALSE] <- NA

# Change "f" to "female", and "m" to "male"
study$sex[study$sex == "f"] <- "female"
study$sex[study$sex == "m"] <- "male"

# Changing column names

# Change name of first column to participant.id
names(study)[1] <- "participant.id"

# Change the name of columns 2 through 4
names(study)[2:4] <- c("gender", "age_years", "eye")

# Change name of group column to condition
names(study)[names(study) == "group"] <- "condition"

A Priming study

In a provocative paper, Bargh, Chen and Burrows (1996) sought to test whether priming people with trait concepts would trigger trait-consistent behavior. In one study, they primed participants with either neutral words (e.g., bat, cookie, pen) or words related to an elderly stereotype (e.g., wise, stubborn, old). Then, unbeknownst to the participants, they used a stopwatch to record how long it took each participant to walk down a hallway at the conclusion of the experiment. They predicted that participants primed with words related to the elderly would walk more slowly than those primed with neutral words.

In this WPA, you will analyze fake data corresponding to this study.

Dataset description

Our fake study has data from the following measures:

Variable	Description	Possible Values
`prime`	What kind of primes was the participant given?	`neutral`, `elderly`
`prime.duration`	How long (in minutes) were primes displayed to participants?	1, 5, 10, or 30
`grandparents`	Did the participant have a close relationship with their grandparents?	`yes` means yes, `no` means no, `none` means they never met their grandparents.
`id`	The order in which participants completed the study	Integers from 1 to 500
`age`	Participants’ age	Integers larger than 18
`sex`	Participant’s sex	`"m"` = male, `"f"` = female
`attention`	Did the participant pass an attention check?	`1` = yes, `0` = no
`walk`	How long (in seconds) did participants take to walk down the hallway?	Positive numbers

Load the data

The text file containing the data is called priming.txt. It is available at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/data/priming.txt. You can load the data into R as a new dataframe called priming by running the following:

priming <- read.table(file = "https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/data/priming.txt",
                      stringsAsFactors = FALSE)

Here is how the data should look:

a	b	c	d	e	f	g	h
1	m	21	1	asdf	1	no	25.4
2	m	21	1	asdf	30	no	23.6
3	f	22	1	asdf	30	none	34.5
4	m	23	1	elderly	1	yes	40.4
5	m	23	1	asdf	10	none	25.0
6	m	22	1	asdf	10	yes	24.7

Understand and clean the data

Get to know the data using View(), summary(), head() and str().

View(priming)
summary(priming)
head(priming)
str(priming)

Look at the names of the dataframe with names(). Those aren’t very informative, are they? Change the names to the correct values (make sure to use the naming scheme described in the dataset description).

names(priming) <- c("id", "sex", "age", "attention", "prime", "prime.duration", "grandparents", "walk")

Apply functions to columns

What was the mean participant age? Answer this in two ways. First, calculate the mean directly from the age column. Second, create a new vector object called age.v that contains the age data, then calculate the mean age from this vector. Do you get the same result?

mean(priming$age)

## [1] 21.996

age.v <- priming$age
mean(age.v)

## [1] 21.996

What was the median walking time?

median(priming$walk)

## [1] 34.7

How many females were there? How many males?

sum(priming$sex == "f")

## [1] 252

sum(priming$sex == "m")

## [1] 248

What percent of participants passed the attention check? (Hint: To calculate a percentage from a 0, 1 variable, use mean())

mean(priming$attention)

## [1] 0.886
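
The hint works because the mean of a vector of 0s and 1s is the share of 1s, so the result above is a proportion (0.886 corresponds to 88.6%). A minimal sketch on a made-up vector (passed and its values are hypothetical, not taken from priming.txt):

```r
# Hypothetical attention-check results for five participants (1 = passed, 0 = failed)
passed <- c(1, 0, 1, 1, 0)

sum(passed)          # Number who passed: 3
mean(passed)         # Proportion who passed: 0.6
mean(passed) * 100   # The same value expressed as a percentage: 60
```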

Walking time is currently in seconds. Add a new column to the dataframe called walk_m that shows the walking time in minutes rather than seconds.

priming$walk_m <- priming$walk / 60

Index and subset

What were the sexes of the first 10 participants?

priming$sex[1:10]

##  [1] "m" "m" "f" "m" "m" "m" "f" "m" "f" "m"

What was the data for the 50th participant?

priming[50,]

##    id sex age attention   prime prime.duration grandparents walk    walk_m
## 50 50   m  21         1 elderly              1         none 34.3 0.5716667

Try answering these questions using one of the methods in the Examples above. The easiest method is Method 1. That is, first create a new dataframe object of the subsetted data, and then calculate the summary data from this new object.

What was the mean walking time for the elderly prime condition?

mean(priming$walk[priming$prime == "elderly"])

## [1] 39.17612

What was the mean walking time for the neutral prime condition?

mean(priming$walk[priming$prime == "neutral"])

## [1] 26.50167

What was the mean walking time for participants less than 23 years old?

mean(priming$walk[priming$age < 23])

## [1] 28.8359

What was the mean walking time for females with a close relationship with their grandparents?

mean(priming$walk[priming$sex == "f" & priming$grandparents == "yes"])

## [1] 34.22515

What was the mean walking time for males over 21 years old without a close relationship with their grandparents?

mean(priming$walk[priming$sex == "m" & priming$age > 21 & priming$grandparents == "none"])

## [1] 25.76126

Checkpoint!!!

Create new dataframe objects

One of your colleagues wants the study data, but only the columns id, prime, and walk. Create a new dataframe called priming_simple that only contains these columns.

priming_simple <- priming[c("id", "prime", "walk")]

Some of the data don’t make any sense. For example, some walking times are negative, some prime values aren’t correct, and some prime.duration values weren’t part of the original study plan. Create a new dataframe called priming_c (i.e., priming clean) that only includes rows with valid values for each column. Do this by looking for a few strange values in each column and by checking the original dataset description. Additionally, only include participants who passed the attention check. Here’s a skeleton of how your code should look:

# Create priming_c, a subset of the original priming data
#  (replace __ with the appropriate values)
priming_c <- subset(priming,
                    subset = sex %in% c(_____) & 
                             age > ____ &
                             attention == ___ &
                             prime %in% c(___) &
                             prime.duration %in% c(___) &
                             grandparents %in% c(___) &
                             walk > ___ )

# Create priming_c, a subset of the original priming data
#  (skeleton filled in with the valid values from the dataset description)
priming_c <- subset(priming,
                    subset = sex %in% c("m", "f") & 
                             age > 18 &
                             attention == 1 &
                             prime %in% c("elderly", "neutral") &
                             prime.duration %in% c(1, 5, 10, 30) &
                             grandparents %in% c("no", "none", "yes") &
                             walk > 0)
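
The %in% operator returns TRUE for each element of the left-hand vector that appears anywhere in the right-hand vector, which makes it convenient for keeping only a known set of valid values. A small sketch with made-up labels (prime_example is a hypothetical vector, not part of the assignment data):

```r
# Hypothetical prime labels, including one invalid entry
prime_example <- c("elderly", "asdf", "neutral", "elderly")

# TRUE where the label is one of the two valid conditions
valid <- prime_example %in% c("elderly", "neutral")

valid                  # TRUE FALSE TRUE TRUE
prime_example[valid]   # Keeps only the valid labels
sum(!valid)            # Number of invalid labels: 1
```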

How many participants gave valid data and passed the attention check? (Hint: Use the result from your previous answer!)

nrow(priming_c)

## [1] 291

Of those participants who gave valid data and passed the attention check, what was the mean walking time of those given the elderly prime and of those given the neutral prime? (Calculate these separately.)

with(subset(priming_c, prime == "elderly"), mean(walk))

## [1] 41.93209

with(subset(priming_c, prime == "neutral"), mean(walk))

## [1] 30.25669

Challenges

Run the following lines of code and look at the resulting objects. Are they the same or different? You can check by printing them and visually exploring the results, or you can use the R function identical() (look at its help page with ?identical to see how it works).

v1 <- priming$walk
v2 <- priming["walk"]
v3 <- priming[,names(priming) == "walk"]

# v1 and v3 are vectors, while v2 is a dataframe
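
One way to check this without printing the full objects is to compare their classes and pass pairs of objects to identical(). A sketch on a small made-up dataframe (toy, t1, t2 and t3 are hypothetical names used only for illustration):

```r
toy <- data.frame(id = 1:3, walk = c(25.4, 23.6, 34.5))

t1 <- toy$walk                      # Numeric vector
t2 <- toy["walk"]                   # One-column dataframe
t3 <- toy[, names(toy) == "walk"]   # Numeric vector (a single selected column is dropped to a vector)

class(t1)           # "numeric"
class(t2)           # "data.frame"
identical(t1, t3)   # TRUE
identical(t1, t2)   # FALSE
```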

Run the following lines of code and look at the resulting objects. Are they the same or different? If they are different, why?

vA <- priming$walk
vB <- subset(priming, select = "walk")

# vA is a vector while vB is a dataframe

Based on what you’ve learned in the previous question, run the following code and see what happens. Can you explain why?

mean(vA)

## [1] 30.09736

mean(vB)

## [1] NA

# mean(vB) returns NA (with a warning) because mean() expects a numeric vector, not a dataframe.
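
To get a mean from a one-column dataframe, the column can first be pulled out as a vector. A sketch on a made-up dataframe (toy and toy_df are hypothetical; the same idea should apply to vB):

```r
toy <- data.frame(walk = c(20, 30, 40))
toy_df <- subset(toy, select = "walk")   # Still a dataframe, like vB

mean(toy_df$walk)        # 30: extract the column with $
mean(toy_df[["walk"]])   # 30: extract the column with [[ ]]
colMeans(toy_df)         # Named vector of column means: walk = 30
```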

Note: The following questions apply to your cleaned dataframe (priming_c)

Did the effect of priming condition on walking times differ between the first 50 and the last 50 participants? That is, what was the difference in the mean walking time between the two priming conditions for the first 50 participants? What about the last 50 participants? (Hint: Make sure to index the data using id!)

mean(priming_c$walk[priming_c$id <= 50 & priming_c$prime == "elderly"]) - mean(priming_c$walk[priming_c$id <= 50 & priming_c$prime == "neutral"])

## [1] 11.04059

mean(priming_c$walk[priming_c$id >= 450 & priming_c$prime == "elderly"]) - mean(priming_c$walk[priming_c$id >= 450 & priming_c$prime == "neutral"])

## [1] 10.21579

Do you find evidence that a participant’s relationship with their grandparents affects how they responded to the primes?

# Strong relationship only
mean(priming_c$walk[priming_c$grandparents == "yes" & priming_c$prime == "elderly"]) - mean(priming_c$walk[priming_c$grandparents == "yes" & priming_c$prime == "neutral"])

## [1] 13.57851

# No relationship only
mean(priming_c$walk[priming_c$grandparents == "none" & priming_c$prime == "elderly"]) - mean(priming_c$walk[priming_c$grandparents == "none" & priming_c$prime == "neutral"])

## [1] 9.667544

Due to a computer error, the data from every participant with an even id number is invalid. Remove these data from your priming_c dataframe.

priming_c <- priming_c[priming_c$id %in% seq(1, 501, by = 2),]
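
An alternative that does not depend on knowing the largest id is the modulo operator %%, which gives the remainder after division. A sketch on a made-up dataframe (toy and toy_odd are hypothetical names):

```r
toy <- data.frame(id = 1:6, walk = c(25.4, 23.6, 34.5, 40.4, 25.0, 24.7))

# id %% 2 is the remainder after dividing by 2: 1 for odd ids, 0 for even ids
toy_odd <- toy[toy$id %% 2 == 1, ]

toy_odd$id   # 1 3 5
```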

Submit!

Save and email your wpa_3_LastFirst.R file to me at nathaniel.phillips@unibas.ch.
Go to https://goo.gl/forms/b9dcRH6Ud3pDagOr1 to confirm your assignment submission.

WPA #3 – Dataframes