A typical 3am scene at Paddy's in Basel.

Readings

This assignment is based on the following readings:

Assignment Goals

Examples

# Create some vectors of data
height <- c(168, 157, 189, 147, 172, 166, 201)
initials <- c("NP", "HT", "AM", "MP", "RH", "MS", "JT")
sex <- c("m", "f", "m", "f", "m", "f", "m")
age <- c(28, 19, 23, 24, 25, 22, 26)

# What was the height of the first entry?
height[1]

# What were the sexes of the first 5 entries?
sex[1:5]

# What was the age of the last entry?
age[length(age)]

# How many were women?
sum(sex == "f")

# How many people were taller than 160cm?
sum(height > 160)

# What percent of people were younger than 25?
mean(age < 25)

# What were the initials of the females?
initials[sex == "f"]

# What were the initials of females older than 22?
initials[sex == "f" & age > 22]

# What was the height, sex, and age of the person with initials "JT"?
height[initials == "JT"]
sex[initials == "JT"]
age[initials == "JT"]

# The height of person 'NP' is incorrect; it should be 172. Change it
height[initials == "NP"] <- 172

# The ages of all females are one year too low, add 1 to each age
age[sex == "f"] <- age[sex == "f"] + 1

Get started

  1. Open RStudio. Open a new R script (File – New File – R Script), and save it as wpa_2_LastFirst.R (where Last and First are your last and first names). At the top of your script, write the assignment number, your name and the date (as comments!). For the rest of the assignment, when you answer a task, indicate which task you are answering with appropriate comments.

Analyzing Bar survey data

In this assignment you will analyse (fictional!) data from a survey of 300 people at one of two bars in Basel (Grenzwert and Paddy’s) last Friday night at 3:00am. The goal of the survey is to see whether cologne affects how long people talk with others at the bar. As each of the 300 people entered the bar, they were secretly given a spray of one of two types of cologne (Acqua Di Gio or CK One) or no cologne at all. For the rest of the night, two (very busy) researchers recorded how long each person spent talking to people at the bar. The data are stored in the following 5 vector objects:

A. First, get the data objects into your R session. Thankfully, you don’t need to type in the data yourself! The objects are stored in an RData file online at https://github.com/ndphillips/IntroductionR_Course/blob/master/data/wpa2.RData?raw=true. Run the following code to load the vectors into your R session.

# Load the data into my current session

load(file = url("https://github.com/ndphillips/IntroductionR_Course/blob/master/data/wpa2.RData?raw=true"))

B. Make sure the objects (id, sex, cologne, bar, time) were loaded correctly and get to know them by running the str() function on each of the 5 vectors.
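As a rough sketch of what to expect from str(), here it is applied to two small made-up vectors (toy_time and toy_bar are hypothetical stand-ins, not the survey data). It prints each object's type, its length, and its first few values.

```r
# Hypothetical stand-in vectors, used only to illustrate str()
toy_time <- c(302, 45, 120)
toy_bar <- c("paddys", "grenzwert", "paddys")

str(toy_time)  # reports a numeric vector of length 3
str(toy_bar)   # reports a character vector of length 3

# class() and length() report the same facts one at a time
class(toy_bar)
length(toy_bar)
```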

Review

  1. How many people were given each type of cologne? (Hint: use table())
table(cologne)
## cologne
## ckone   gio  none 
##   100   100   100
  1. What was the mean time?
mean(time)
## [1] 143.4
  1. What was the standard deviation of times?
sd(time)
## [1] 64.82825
  1. Create time_z, a z-score transformation of time. (Hint: the z-score of x is defined as (x - mean(x)) / sd(x))
time_z <- (time - mean(time)) / sd(time)
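One way to check a z-score transformation is to confirm that the result has a mean of 0 and a standard deviation of 1. The sketch below does this on a small made-up vector (x is hypothetical, not the survey data) and compares the result with base R's scale().

```r
# Hypothetical values, used only to illustrate the transformation
x <- c(10, 20, 30, 40, 50)
x_z <- (x - mean(x)) / sd(x)

mean(x_z)  # 0, up to floating-point rounding
sd(x_z)    # 1, up to floating-point rounding

# scale() performs the same computation but returns a one-column matrix,
# so it is converted back to a plain vector before comparing
all.equal(as.numeric(scale(x)), x_z)
```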

Numerical Indexing

  1. How long did the first person spend talking to others? What cologne did he/she wear? What was his/her sex?
time[1]
## [1] 302
cologne[1]
## [1] "gio"
sex[1]
## [1] "f"
  1. What were the sexes of the first five participants?
sex[1:5]
## [1] "f" "f" "f" "f" "m"
  1. What were the colognes of the 10th through the 20th participants? (Hint: use a:b)
cologne[10:20]
##  [1] "ckone" "ckone" "ckone" "ckone" "ckone" "none"  "ckone" "gio"  
##  [9] "ckone" "gio"   "gio"
  1. Which bar did the last participant go to? (Hint: don’t write the index number directly; instead, index the vector using the length() function with the appropriate argument)
bar[length(bar)]
## [1] "grenzwert"

Logical Indexing on one variable

  1. How many people were given gio? How many were given ckone? How many were given no cologne?
sum(cologne == "gio")
## [1] 100
sum(cologne == "ckone")
## [1] 100
sum(cologne == "none")
## [1] 100
  1. How many people went to Grenzwert? How many went to Paddy’s?
sum(bar == "grenzwert")
## [1] 150
sum(bar == "paddys")
## [1] 150
  1. What percent of people went to Grenzwert? (Hint: use mean() combined with a logical vector)
mean(bar == "grenzwert")
## [1] 0.5
  1. How many people talked to others for more than 30 minutes?
sum(time > 30)
## [1] 267
  1. What percent of people talked longer than 30 minutes?
mean(time > 30)
## [1] 0.89
  1. What percent of talking times were longer than 20 minutes but less than 40 minutes? (Hint: use &)
mean(time > 20 & time < 40)
## [1] 0.1366667
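sum() and mean() work on logical vectors because R treats TRUE as 1 and FALSE as 0. Note that mean() therefore returns a proportion between 0 and 1; multiply by 100 if you want a percentage. A small sketch with made-up values (toy_time is hypothetical, not the survey data):

```r
# Hypothetical talking times, used only for illustration
toy_time <- c(15, 25, 35, 45, 300)

long_talk <- toy_time > 30   # FALSE FALSE TRUE TRUE TRUE

sum(long_talk)          # 3: the number of TRUE values
mean(long_talk)         # 0.6: the proportion of TRUE values
mean(long_talk) * 100   # 60: the same result as a percentage
```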

Logical indexing and two variables

  1. What were the ids of people who went to Grenzwert? (They should all start with g)
id[bar == "grenzwert"]
##   [1] "g.88"  "g.92"  "g.11"  "g.69"  "g.68"  "g.76"  "g.14"  "g.67" 
##   [9] "g.12"  "g.83"  "g.95"  "g.20"  "g.91"  "g.91"  "g.52"  "g.72" 
##  [17] "g.56"  "g.86"  "g.57"  "g.58"  "g.50"  "g.93"  "g.76"  "g.90" 
##  [25] "g.79"  "g.65"  "g.55"  "g.60"  "g.73"  "g.81"  "g.62"  "g.64" 
##  [33] "g.35"  "g.73"  "g.41"  "g.59"  "g.90"  "g.97"  "g.94"  "g.61" 
##  [41] "g.93"  "g.30"  "g.66"  "g.78"  "g.60"  "g.59"  "g.15"  "g.23" 
##  [49] "g.64"  "g.36"  "g.95"  "g.34"  "g.94"  "g.74"  "g.52"  "g.82" 
##  [57] "g.96"  "g.87"  "g.63"  "g.63"  "g.49"  "g.31"  "g.100" "g.47" 
##  [65] "g.38"  "g.86"  "g.28"  "g.72"  "g.26"  "g.99"  "g.96"  "g.99" 
##  [73] "g.98"  "g.69"  "g.74"  "g.89"  "g.78"  "g.77"  "g.19"  "g.79" 
##  [81] "g.75"  "g.27"  "g.54"  "g.48"  "g.77"  "g.16"  "g.82"  "g.39" 
##  [89] "g.94"  "g.29"  "g.81"  "g.92"  "g.51"  "g.80"  "g.70"  "g.53" 
##  [97] "g.83"  "g.13"  "g.100" "g.42"  "g.70"  "g.80"  "g.61"  "g.51" 
## [105] "g.44"  "g.93"  "g.45"  "g.65"  "g.55"  "g.87"  "g.84"  "g.62" 
## [113] "g.66"  "g.33"  "g.53"  "g.54"  "g.67"  "g.68"  "g.95"  "g.71" 
## [121] "g.97"  "g.40"  "g.91"  "g.100" "g.85"  "g.17"  "g.96"  "g.46" 
## [129] "g.56"  "g.85"  "g.25"  "g.24"  "g.92"  "g.98"  "g.18"  "g.75" 
## [137] "g.21"  "g.57"  "g.97"  "g.43"  "g.88"  "g.71"  "g.37"  "g.22" 
## [145] "g.58"  "g.84"  "g.89"  "g.99"  "g.32"  "g.98"
  1. What were the sexes of people who went to Paddy’s? What percentage of these were men?
sex[bar == "paddys"]
##   [1] "f" "f" "m" "m" "f" "m" "m" "m" "m" "m" "f" "f" "m" "m" "m" "f" "m"
##  [18] "f" "m" "f" "m" "m" "f" "f" "m" "m" "m" "m" "f" "m" "m" "m" "f" "f"
##  [35] "m" "m" "m" "f" "m" "f" "m" "m" "m" "m" "f" "f" "m" "f" "m" "m" "f"
##  [52] "f" "m" "m" "f" "f" "f" "f" "m" "f" "m" "m" "f" "f" "m" "f" "m" "m"
##  [69] "f" "f" "f" "f" "f" "m" "f" "m" "m" "f" "f" "m" "f" "f" "m" "m" "f"
##  [86] "f" "f" "f" "m" "f" "m" "m" "m" "m" "f" "m" "f" "m" "m" "f" "m" "f"
## [103] "f" "f" "m" "m" "m" "f" "m" "m" "f" "m" "m" "m" "m" "f" "m" "f" "f"
## [120] "f" "f" "f" "m" "m" "m" "f" "f" "m" "f" "f" "f" "f" "f" "m" "m" "m"
## [137] "f" "f" "m" "f" "f" "m" "m" "m" "m" "m" "m" "m" "f" "f"
mean(sex[bar == "paddys"] == "m")
## [1] 0.5333333
  1. What was the mean talking time of men? What about women?
mean(time[sex == "m"])
## [1] 143.0962
mean(time[sex == "f"])
## [1] 143.7292
  1. What was the mean talking time of people who went to Grenzwert? What about the people who went to Paddy’s?
mean(time[bar == "grenzwert"])
## [1] 98.24
mean(time[bar == "paddys"])
## [1] 188.56
  1. What was the mean talking time of people who were given gio? What about ckone? What about no cologne at all?
mean(time[cologne == "gio"])
## [1] 159.98
mean(time[cologne == "ckone"])
## [1] 170.13
mean(time[cologne == "none"])
## [1] 100.09
  1. Based on what you’ve learned, if someone wants to talk as long as possible, what cologne should they wear? Or should they wear no cologne at all?
# They should wear ckone!

Changing vector values with indexing and assignment a[] <- b

In the next questions, we’ll use indexing and assignment to change the values within a vector. To do this, we’ll start by creating copies of the original data so we can easily recover the data if we screw something up.

  1. Create new objects bar.r, cologne.r and time.r that are copies of the original bar, cologne and time objects (Hint: Just assign the existing vectors to new objects)
bar.r <- bar
cologne.r <- cologne
time.r <- time
  1. In the bar.r vector, change the "grenzwert" values to "g". Now change the "paddys" values to "p"
bar.r[bar.r == "grenzwert"] <- "g"
bar.r[bar.r == "paddys"] <- "p"
  1. In the cologne.r vector, change the "gio" values to "G". Now change the "ckone" values to "C". Now change "none" to "N"
cologne.r[cologne.r == "gio"] <- "G"
cologne.r[cologne.r == "ckone"] <- "C"
cologne.r[cologne.r == "none"] <- "N"
  1. Some of the time values are too large and should not be included in our analysis. Specifically, values of time greater than 280 should be set to just 280. In the time.r vector, change all time values greater than 280 to 280. Confirm that you did it correctly by calculating the maximum time in time.r
time.r[time > 280] <- 280
max(time.r)
## [1] 280
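Capping values with indexing and assignment is the point of this exercise, but base R's pmin() gives the same result in one step. The sketch below compares the two approaches on made-up values (toy_time is hypothetical, not the survey data).

```r
# Hypothetical talking times, used only for illustration
toy_time <- c(120, 281, 350, 280)

# Approach 1: logical indexing and assignment, as in the task above
capped <- toy_time
capped[capped > 280] <- 280

# Approach 2: pmin() takes the element-wise minimum of each value and 280
capped_pmin <- pmin(toy_time, 280)

identical(capped, capped_pmin)  # TRUE
max(capped)                     # 280
```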

Checkpoint!!!

Solving a paradox

  1. Based on what you’ve learned so far, if someone wanted to talk a lot, what cologne should they wear?
# They should wear ckone!

Let’s see if your prediction holds up!

  1. What was the mean time of people who went to Grenzwert and wore gio?
mean(time[bar == "grenzwert" & cologne == "gio"])
## [1] 145.0111
  1. What was the mean time of people who went to Grenzwert and wore ckone?
mean(time[bar == "grenzwert" & cologne == "ckone"])
## [1] 38
  1. What was the mean time of people who went to Paddy’s and wore gio?
mean(time[bar == "paddys" & cologne == "gio"])
## [1] 294.7
  1. What was the mean time of people who went to Paddy’s and wore ckone?
mean(time[bar == "paddys" & cologne == "ckone"])
## [1] 184.8111
  1. Based on what you’ve learned now, if someone’s goal is to talk to people as long as possible, what cologne should they wear?
# They should wear gio!!

You can visualize the data using the following code

# Combine vectors in a dataframe
survey.df <- data.frame(bar, cologne, time)

# Create a pirateplot of the data (requires the yarrr package to be installed)
yarrr::pirateplot(time ~ cologne + bar,
                  data = survey.df)

What you’ve just seen is an example of Simpson’s Paradox: a pattern that appears in the combined data reverses once the data are split into groups. If you want to learn more, check out the Wikipedia page.
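The reversal is easier to see in a tiny example. The sketch below uses made-up data (toy_bar, toy_cologne and toy_time are hypothetical, not the survey data) in which gio wearers talk longer within each bar, yet ckone wearers talk longer overall, because most ckone wearers happen to be at the bar where everyone talks longer. It uses tapply() to compute group means in one call.

```r
# Hypothetical data, constructed only to illustrate the reversal
toy_bar <- c(rep("grenzwert", 4), rep("paddys", 4))
toy_cologne <- c("gio", "gio", "gio", "ckone",
                 "gio", "ckone", "ckone", "ckone")
toy_time <- c(50, 60, 70, 40, 200, 170, 180, 190)

# Mean time by cologne, ignoring bar: ckone (145) beats gio (95)
overall <- tapply(toy_time, toy_cologne, mean)
overall

# Mean time by bar and cologne: gio beats ckone within each bar
# (60 vs 40 at grenzwert, 200 vs 180 at paddys)
by_bar <- tapply(toy_time, list(toy_bar, toy_cologne), mean)
by_bar

# The group sizes explain the reversal: the colognes are unevenly
# spread across the two bars
table(toy_bar, toy_cologne)
```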

Challenges

  1. What percent of women wore ckone?
mean(cologne[sex == "f"] == "ckone")
## [1] 0.3263889
  1. What was the median time of people who went to grenzwert and wore gio but who talked more than 100 minutes?
median(time[bar == "grenzwert" & cologne == "gio" & time > 100])
## [1] 144.5
  1. What percent of participants either went to grenzwert and talked for less than 220 minutes or went to paddys and talked for more than 150 minutes but no longer than 250 minutes?
mean((bar == "grenzwert" & time < 220) | (bar == "paddys" & time > 150 & time <= 250))
## [1] 0.9633333
  1. Let’s make the ckone wearers look better. For all of the ckone wearers, add a random sample from a normal distribution with mean 30 and standard deviation 5 to their original talking times. (Hint: use rnorm())
time[cologne == "ckone"] <- time[cologne == "ckone"] + rnorm(n = sum(cologne == "ckone"), mean = 30, sd = 5)
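Because rnorm() draws random numbers, the adjusted times will differ every time the line is run. If you want the same result each time, one option is to call set.seed() first. A minimal sketch on made-up values (toy_time is hypothetical, and the seed 123 is arbitrary):

```r
# Hypothetical talking times, used only for illustration
toy_time <- c(100, 150, 200)

set.seed(123)  # any fixed integer works; 123 is an arbitrary choice
bump_a <- rnorm(n = length(toy_time), mean = 30, sd = 5)

set.seed(123)  # resetting the seed reproduces the same draws
bump_b <- rnorm(n = length(toy_time), mean = 30, sd = 5)

identical(bump_a, bump_b)  # TRUE: same seed, same random numbers

# Add one random bump to each talking time
toy_time + bump_a
```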

Submit!