Readings

This assignment is based on the following readings:

YaRrr: 9, 10
Videos: G

Assignment Goals

Create a project with organised folders data and R.
Read text files into R with read.table() and save objects as text files with write.table().
Save multiple objects in a single .RData file using save().
Calculate aggregated summary statistics from dataframes with aggregate() and dplyr.

Examples

library(yarrr) # Load yarrr for the pirates dataframe
library(dplyr) # Load dplyr for aggregation

# Start by creating a project with File -- New Project.
# Put the project in the directory you want on your computer.
# Then, create a folder called "data" in that directory
#   by clicking "New Folder" in the Files window in RStudio

# Print my current working directory (this is where you put your project)
getwd()

# Save the pirates dataframe to a tab-delimited txt file called pirates.txt in the data folder of my working directory

write.table(x = pirates,                 # Object to save to a table    
            file = "data/pirates.txt",   # Location and name of the text file to save
            sep = "\t")                  # Separate columns with tabs

# For fun, read the pirates.txt file into R and save as a new dataframe object called pirates2

pirates2 <- read.table(file = "data/pirates.txt",   # Location of file
                       sep = "\t",                  # Columns are separated by tabs
                       header = TRUE,               # There IS a header row
                       stringsAsFactors = FALSE)    # Do NOT convert strings to factors

# Grouped Aggregation

# Q: What is the mean age of pirates of different sexes?

# Using aggregate()
#   (the formula is passed unnamed because the name of this argument changed in R 4.2.0)
aggregate(age ~ sex,             # DV is age, IV is sex
          data = pirates,        # Variables are in the pirates dataframe
          FUN = mean)            # Calculate means

# Using dplyr
pirates %>%           # Start with the pirates dataframe ..AND THEN...
  group_by(sex) %>%   # Group the data by sex ..AND THEN...
  summarise(          # Calculate summary statistics....
    N = n(),               # Number of cases in each group
    age_mean = mean(age)   # Mean age
  )

# Q: ONLY for male pirates, what is the median number of tattoos of pirates who do and do not wear headbands?
#    (sex stays in the grouping so that it appears in the output)

# Using aggregate()
aggregate(tattoos ~ headband + sex,             # DV is tattoos, IVs are headband and sex
          data = pirates,                       # Variables are in the pirates dataframe
          subset = sex == "male",               # Only male pirates
          FUN = median)                         # Calculate medians

# Using dplyr
pirates %>%                    # Start with the pirates dataframe ..AND THEN...
  filter(sex == "male") %>%       # Only include male pirates ..AND THEN..
  group_by(headband, sex) %>%  # Group the data by headband and sex ..AND THEN...
  summarise(                   # Calculate summary statistics....
    N = n(),                          # Number of cases in each group
    tattoos_median = median(tattoos)  # Median number of tattoos
  )


# Q: For each combination of sex and eyepatch, calculate the mean age, median height, mean weight, mean number of tattoos,
#      minimum sword.time, AND the proportion of pirates whose favorite Pixar movie is "Monsters University".
#    Save the result as a dataframe called pirates_agg

pirates_agg <- pirates %>%
                group_by(sex, eyepatch) %>%
                summarise(
                  N = n(), # Number of cases in each group
                  age_mean = mean(age),
                  height_median = median(height),
                  weight_mean = mean(weight),
                  tattoos_mean = mean(tattoos),
                  sword_min = min(sword.time),
                  love_MU = mean(fav.pixar == "Monsters University")
                )

# Save pirates and pirates_agg objects in an .RData file called pirates.RData in the data folder of my working directory

save(pirates, pirates_agg, 
     file = "data/pirates.RData")

Why do we overestimate others’ willingness to pay?

In this WPA, we will analyze data from Matthews et al. (2016): Why do we overestimate others’ willingness to pay? The purpose of this research was to test if our beliefs about other people’s affluence (i.e., wealth) affect how much we think they will be willing to pay for items. You can find the full paper at http://journal.sjdm.org/15/15909/jdm15909.pdf.

In Study 1 of their paper, participants indicated the proportion of other people taking part in the survey who have more money than they do (havemore), and then whether other people would be willing to pay more than them for each of 10 items.

The following table shows the 10 items and the proportion of participants who indicated that others would be more willing to pay for the product than themselves (Table 1 in Matthews et al., 2016).

Values reported in Table 1 of Matthews et al. (2016)

Product Number	Product	Reported p(other > self)
1	A freshly-squeezed glass of apple juice	.695
2	A Parker ballpoint pen	.863
3	A pair of Bose noise-cancelling headphones	.705
4	A voucher giving dinner for two at Applebee’s	.853
5	A 16 oz jar of Planters dry-roasted peanuts	.774
6	A one-month movie pass	.800
7	An Ikea desk lamp	.863
8	A Casio digital watch	.900
9	A large, ripe pineapple	.674
10	A handmade wooden chess set	.732

Table 1: Proportion of participants who indicated that the “typical participant” would pay more than they would for each product in Study 1.

Variable Descriptions

Here are descriptions of the data variables (taken from the authors’ dataset notes available at http://journal.sjdm.org/15/15909/Notes.txt):

id: participant id code
gender: participant gender. 1 = male, 2 = female
age: participant age
income: participant annual household income on categorical scale with 8 categorical options: Less than $15,000; $15,001–$25,000; $25,001–$35,000; $35,001–$50,000; $50,001–$75,000; $75,001–$100,000; $100,001–$150,000; greater than $150,000.
p1-p10: whether the “typical” survey respondent would pay more (coded 1) or less (coded 0) than oneself, for each of the 10 products
task: whether the participant had to judge the proportion of other people who “have more money than you do” (coded 1) or the proportion who “have less money than you do” (coded 0)
havemore: participant’s response when task = 1
haveless: participant’s response when task = 0
pcmore: participant’s estimate of the proportion of people who have more than they do (calculated as 100-haveless when task=0)
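
To see how pcmore relates to havemore and haveless, here is a minimal sketch using made-up values (the dataframe toy is hypothetical and not part of the authors' data). It assumes, following the notes above, that pcmore equals havemore when task = 1 and 100 - haveless when task = 0.

```r
# Toy data, invented for illustration only
toy <- data.frame(task     = c(0, 0, 1, 1),
                  havemore = c(NA, NA, 99, 70),
                  haveless = c(50, 25, NA, NA))

# task == 1: use havemore directly; task == 0: use 100 - haveless
toy$pcmore <- ifelse(toy$task == 1, toy$havemore, 100 - toy$haveless)

toy$pcmore   # 50 75 99 70
```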

Creating a new project, loading and saving data

Open RStudio. Create a new project called rcourse (or anything else you want) in a new working directory on your computer. In the project directory, create two folders, R and data. You can do this either in RStudio (by clicking the “New Folder” icon in the Files window) or outside of RStudio in your computer’s file browser. When you are finished, the project directory should contain the two folders R and data.
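
If you prefer the console, the same folders can probably be created with dir.create(). This is only a sketch: project_dir is a hypothetical stand-in that points to a temporary directory, so replace it with the path of your own project.

```r
# Hypothetical project root; a temporary directory stands in for your project folder
project_dir <- file.path(tempdir(), "rcourse")

# Create the project folder and the two subfolders
dir.create(project_dir, showWarnings = FALSE)
dir.create(file.path(project_dir, "R"), showWarnings = FALSE)
dir.create(file.path(project_dir, "data"), showWarnings = FALSE)

# List the contents of the project folder to check the result
list.files(project_dir)
```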

Open a new R script and save it as wpa4.R in the R folder you just created, using the main RStudio menus “File – Save As”.
At the top of your script, load the dplyr package using library().

library(dplyr)

Using getwd() print the current working directory of your project. This is the directory on your computer where your project is located.

getwd()

Now it’s time to load the data. The data for this WPA are stored at http://journal.sjdm.org/15/15909/data1.csv. Load the data into a new object called matthews using read.table() by running the following code. Once you have done this, look at the first few rows of matthews using head(), and check its structure using str(), to make sure the data were loaded correctly into R.

# Load the comma-separated data1.csv file into R as a new object called matthews

matthews <- read.table(file = "http://journal.sjdm.org/15/15909/data1.csv", # Link to the file
                       sep = ",",                             # File is comma-separated
                       header = TRUE,                         # There IS a header row
                       stringsAsFactors = FALSE)              # Do NOT convert strings to factors

head(matthews)

##                  id gender age income p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 task
## 1 R_3PtNn51LmSFdLNM      2  26      7  1  1  1  1  1  1  1  1  1   1    0
## 2 R_2AXrrg62pgFgtMV      2  32      4  1  1  1  1  1  1  1  1  1   1    0
## 3 R_cwEOX3HgnMeVQHL      1  25      2  0  1  1  1  1  1  1  1  0   0    0
## 4 R_d59iPwL4W6BH8qx      1  33      5  1  1  1  1  1  1  1  1  1   1    0
## 5 R_1f3K2HrGzFGNelZ      1  24      1  1  1  0  1  1  1  1  1  1   1    1
## 6 R_3oN5ijzTfoMy4ca      1  22      2  1  1  0  0  1  1  1  1  0   1    0
##   havemore haveless pcmore
## 1       NA       50     50
## 2       NA       25     75
## 3       NA       10     90
## 4       NA       50     50
## 5       99       NA     99
## 6       NA       20     80

str(matthews)

## 'data.frame':    190 obs. of  18 variables:
##  $ id      : chr  "R_3PtNn51LmSFdLNM" "R_2AXrrg62pgFgtMV" "R_cwEOX3HgnMeVQHL" "R_d59iPwL4W6BH8qx" ...
##  $ gender  : int  2 2 1 1 1 1 1 1 1 1 ...
##  $ age     : int  26 32 25 33 24 22 47 26 29 32 ...
##  $ income  : int  7 4 2 5 1 2 3 4 1 7 ...
##  $ p1      : int  1 1 0 1 1 1 1 1 1 1 ...
##  $ p2      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ p3      : int  1 1 1 1 0 0 0 0 0 1 ...
##  $ p4      : int  1 1 1 1 1 0 0 0 1 1 ...
##  $ p5      : int  1 1 1 1 1 1 1 0 0 1 ...
##  $ p6      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ p7      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ p8      : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ p9      : int  1 1 0 1 1 0 1 0 1 1 ...
##  $ p10     : int  1 1 0 1 1 1 0 0 1 1 ...
##  $ task    : int  0 0 0 0 1 0 1 0 1 1 ...
##  $ havemore: int  NA NA NA NA 99 NA 95 NA 70 25 ...
##  $ haveless: int  50 25 10 50 NA 20 NA 30 NA NA ...
##  $ pcmore  : int  50 75 90 50 99 80 95 70 70 25 ...
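
As a side note, base R also provides read.csv(), which is read.table() with defaults suited to comma-separated files (header = TRUE and sep = ","). The sketch below compares the two on a small invented table passed in through the text argument; csv_text, toy1 and toy2 are hypothetical names.

```r
# A tiny comma-separated table, invented for illustration
csv_text <- "id,gender,age
A1,2,26
B2,1,32"

# Read it with read.table(), spelling out the separator and header
toy1 <- read.table(text = csv_text, sep = ",", header = TRUE,
                   stringsAsFactors = FALSE)

# Read it with read.csv(), which sets these defaults for you
toy2 <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Compare the two dataframes
all.equal(toy1, toy2)
```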

Now that you’ve loaded the data into R, let’s save a local copy of the data as a text file called matthews.txt into your data folder. Using write.table(), save the data as a tab-delimited text file called matthews.txt in the data folder as follows:

# Write the matthews data as a new, tab-delimited text file called matthews.txt in the data folder of
#   your working directory

write.table(x = matthews,                    # Save the object matthews
            file = "data/matthews.txt",      # Write the object to matthews.txt in the data folder
            sep = "\t")                      # Separate columns by tabs
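
By default, write.table() also writes the row names as an extra first column. If you do not want that, one option is row.names = FALSE. The sketch below is a round trip on an invented dataframe; toy, toy_file and toy2 are hypothetical, and a temporary file is used so nothing in your project is overwritten.

```r
# Toy dataframe, invented for illustration
toy <- data.frame(id = c("a", "b"),
                  age = c(26L, 32L),
                  stringsAsFactors = FALSE)

# Hypothetical file location (a temporary file)
toy_file <- tempfile(fileext = ".txt")

# Write a tab-delimited file without the row names column
write.table(x = toy, file = toy_file, sep = "\t", row.names = FALSE)

# Read the file back, using the same separator
toy2 <- read.table(file = toy_file, sep = "\t", header = TRUE,
                   stringsAsFactors = FALSE)

# Check that the round trip preserved the data
all.equal(toy, toy2)
```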

Review

What are the names of the data columns?

names(matthews)

##  [1] "id"       "gender"   "age"      "income"   "p1"       "p2"      
##  [7] "p3"       "p4"       "p5"       "p6"       "p7"       "p8"      
## [13] "p9"       "p10"      "task"     "havemore" "haveless" "pcmore"

What was the mean age?

mean(matthews$age)

## [1] 31.71579

Currently the column gender is coded as 1 (male) and 2 (female). Let’s create a new character column called gender_a that codes the data as "male" and "female" (instead of 1 and 2). Do this by using the following template:

matthews$gender_a <- NA   # Start with a column of NA values.
matthews$gender_a[matthews$gender == __] <- "__" # Change 1 to "male"
matthews$gender_a[matthews$gender == __] <- "__" # Change 2 to "female"

# Create a new column called gender_a that codes gender as a string
matthews$gender_a <- NA  # Start with a column of NA values.
matthews$gender_a[matthews$gender == 1] <- "male"
matthews$gender_a[matthews$gender == 2] <- "female"
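
As an aside, the same recoding can be sketched with a lookup vector. This is an alternative, not the method the template asks for, and it only works because the codes are exactly 1 and 2. The vector gender below is invented toy data.

```r
# Toy gender codes, invented for illustration (1 = male, 2 = female)
gender <- c(2, 2, 1, 1, 1)

# Position 1 holds the label for code 1, position 2 the label for code 2
gender_labels <- c("male", "female")

# Index the labels by the codes
gender_a <- gender_labels[gender]

gender_a   # "female" "female" "male" "male" "male"
```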

What percent of participants were male? (Hint: use the template: mean(__$__ == "__"))

mean(matthews$gender_a == "male")

## [1] 0.6263158

Calculate the mean age for males only.

mean(matthews$age[matthews$gender_a == "male"])

## [1] 29.76471

Calculate the mean age for females only.

mean(matthews$age[matthews$gender_a == "female"])

## [1] 34.98592

Grouped Aggregation

Using aggregate() calculate the mean age of male and female participants separately using the following template. Do you get the same answers as before?

aggregate(age ~ gender_a,
          FUN = mean,
          data = matthews)

##   gender_a      age
## 1   female 34.98592
## 2     male 29.76471

# Yes the answers are the same!

Now use dplyr to do the same calculations using the following template. Do you get the same answers as before?

# Calculate mean age for each gender using dplyr
matthews %>%
  group_by(__) %>%
  summarise(
    N = n(),
    age_mean = mean(__)
  )

The variable pcmore reflects the question: “What percent of people taking part in this survey do you think earn more than you do?”. Using aggregate(), calculate the median value of this variable separately for each level of income. What does the result tell you?

aggregate(pcmore ~ income,
          FUN = median,
          data = matthews)

##   income pcmore
## 1      1     80
## 2      2     75
## 3      3     50
## 4      4     60
## 5      5     50
## 6      6     45
## 7      7     50
## 8      8     50

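# In general, the higher a participant's income, the smaller the percentage of other people they think earn more than they do
#   (the medians drop from 80 at income = 1 and stay between 45 and 60 from income = 3 upward).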

Merging two dataframes

I created a new table containing fictional demographic information about each participant. The data are stored in a tab-delimited text file (with a header row) at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/data/matthews_demographics.txt. Using read.table(), load the data into R as an object called matthews_demo using the following template:

matthews_demo <- read.table(file = "___",          # File location
                            sep = "__",            # How are columns separated?
                            header = __,           # Is there a header row?
                            stringsAsFactors = __) # Should strings be converted to factors?

matthews_demo <- read.table("https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/data/matthews_demographics.txt", 
                      sep = "\t",
                      header = TRUE,
                      stringsAsFactors = FALSE)

Using merge() add the demographic data to the matthews data using the following template:

matthews <- merge(x = __,       # First dataframe
                  y = __,       # Second dataframe
                  by = "__")    # Column to match rows

matthews <- merge(x = matthews,
                  y = matthews_demo,
                  by = "id")
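
By default, merge() keeps only the rows whose id appears in both dataframes. The sketch below illustrates this on two invented dataframes (left and right are hypothetical names), and shows the all.x = TRUE argument, which keeps every row of the first dataframe.

```r
# Two toy dataframes, invented for illustration
left  <- data.frame(id = c("a", "b", "c"),
                    age = c(26, 32, 25),
                    stringsAsFactors = FALSE)
right <- data.frame(id = c("c", "a"),
                    race = c("white", "asian"),
                    stringsAsFactors = FALSE)

# Default: only ids found in BOTH dataframes are kept
inner <- merge(x = left, y = right, by = "id")

# all.x = TRUE: keep every row of x, filling missing y values with NA
outer_left <- merge(x = left, y = right, by = "id", all.x = TRUE)

nrow(inner)        # 2 (ids "a" and "c")
nrow(outer_left)   # 3 (id "b" has race = NA)
```

If the number of rows changes after a merge, comparing nrow() before and after is a quick way to notice unmatched ids.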

Using aggregate(), calculate the mean value of havemore for each combination of gender and race using the following template. Is there a difference between men and women, or people of different races, in how often they think other people earn more money than them?

aggregate(__ ~ __ + __, 
          FUN = __, 
          data = __)

aggregate(havemore ~ gender + race, 
          FUN = mean, 
          data = matthews)

##   gender     race havemore
## 1      1    asian 62.05882
## 2      2    asian 55.00000
## 3      1    black 57.50000
## 4      2    black 65.00000
## 5      1 hispanic 51.87500
## 6      2 hispanic 31.66667
## 7      1    white 64.62963
## 8      2    white 56.00000

Now do the same calculations using dplyr using the following template. Do you get the same answer?

matthews %>% 
  group_by(__, __) %>%
  summarise(
    N = n(),
    havemore_mean = mean(__)
  )

matthews %>% 
  group_by(gender, race) %>%
  summarise(
    N = n(),
    havemore_mean = mean(havemore, na.rm = TRUE)
  )
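
Note the na.rm = TRUE in the dplyr answer: havemore is NA for every participant with task = 0, and mean() returns NA as soon as one value is missing. The formula method of aggregate() drops rows with missing values by default, which is why it needs no extra argument. The sketch below illustrates this on an invented dataframe (toy is hypothetical).

```r
# Toy data with one missing value, invented for illustration
toy <- data.frame(gender   = c(1, 1, 2, 2),
                  havemore = c(60, NA, 40, 50))

# mean() returns NA when any value is missing ...
mean(toy$havemore)                 # NA

# ... unless missing values are removed first
mean(toy$havemore, na.rm = TRUE)   # 50

# aggregate() with a formula omits rows containing NA by default
agg <- aggregate(havemore ~ gender, data = toy, FUN = mean)
agg$havemore                       # 60 45
```

One consequence is that N = n() in the dplyr version counts all rows in a group, including those where havemore is missing.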

Checkpoint!!!

Create a new dataframe called product that contains only columns p1, p2, … p10 from matthews by running the following code. After you run the code, look at it with head().

# Create product, a dataframe containing only columns p1, p2, ... p10
product <- matthews[,paste0("p", 1:10)]

The colMeans() function takes a dataframe as an argument, and returns a vector showing means across rows for each column of data. Using colMeans(), calculate the percentage of participants who indicated that the ‘typical’ participant would be willing to pay more than them for each item. Do your values match what the authors reported in Table 1?

colMeans(product)

##        p1        p2        p3        p4        p5        p6        p7 
## 0.6947368 0.8631579 0.7052632 0.8526316 0.7736842 0.8000000 0.8631579 
##        p8        p9       p10 
## 0.9000000 0.6736842 0.7315789

# Yes the numbers match!!!

The rowMeans() function is like colMeans(), but for calculating means across columns for every row of data. Using rowMeans() calculate for each participant, the percentage of the 10 items that the participant believed other people would spend more on. Save this data as a vector called pall.

pall <- rowMeans(product)

Add the pall vector as a new column called pall to the matthews dataframe using basic assignment (__$__ <- __)

matthews$pall <- pall
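
To make the difference between colMeans() and rowMeans() concrete, here is a minimal sketch on an invented dataframe with two products and four participants (toy is hypothetical).

```r
# Toy responses, invented for illustration: 4 participants (rows), 2 products (columns)
toy <- data.frame(p1 = c(1, 1, 0, 1),
                  p2 = c(1, 0, 0, 1))

# One mean per column: proportion of participants answering 1 for each product
colMeans(toy)   # p1 = 0.75, p2 = 0.50

# One mean per row: proportion of products each participant answered 1 for
rowMeans(toy)   # 1.0, 0.5, 0.0, 1.0
```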

What was the mean value of pall across participants? This value is the answer to the question: “How often does the average participant think that someone else would pay more for an item than themselves?”

mean(matthews$pall)

## [1] 0.7857895

Calculate the mean pall value for male and female participants separately. Which gender tends to think that others would pay more for products than them?

aggregate(pall ~ gender_a,
          FUN = mean,
          data = matthews)

##   gender_a      pall
## 1   female 0.8014085
## 2     male 0.7764706

# Females (0.80) are slightly more likely than males (0.78) to think that others will pay more for items than them.

Calculate the mean pall value of participants for each level of income. Do you find a consistent relationship between pall and income?

aggregate(pall ~ income,
          FUN = mean,
          data = matthews)

##   income      pall
## 1      1 0.9037037
## 2      2 0.8044444
## 3      3 0.7370370
## 4      4 0.7862069
## 5      5 0.7500000
## 6      6 0.6958333
## 7      7 0.8142857
## 8      8 0.8666667

# The values generally decrease from income = 1 to income = 6 (with a small rise at income = 4), then they go up again!

For each level of gender, calculate the summary statistics in the following table using the following template. Save the summary statistics to an object called gender_agg.

variable	description
N	Number of participants
age.mean	Mean age
age.sd	Standard deviation of age
income.mean	Mean income
pcmore.mean	Mean value of pcmore
pall.mean	Mean value of pall

gender_agg <- __ %>%
  group_by(__) %>%
  summarise(
    N = n(),
    age.mean = mean(age),
    age.sd = __,
    income.mean = __,
    pcmore.mean = __,
    pall.mean = __
  )

gender_agg <- matthews %>%
  group_by(gender) %>%
  summarise(
    N = n(),
    age.mean = mean(age),
    age.sd = sd(age),
    income.mean = mean(income),
    pcmore.mean = mean(pcmore),
    pall.mean = mean(pall)
  )

gender_agg

## # A tibble: 2 x 7
##   gender     N age.mean    age.sd income.mean pcmore.mean pall.mean
##    <int> <int>    <dbl>     <dbl>       <dbl>       <dbl>     <dbl>
## 1      1   119 29.76471  7.648757    3.285714    62.25210 0.7764706
## 2      2    71 34.98592 10.430029    3.943662    58.80282 0.8014085

For each level of income, calculate the summary statistics in the following table, only for participants older than 21, and save them to a new object called income_df.

variable	description
N	Number of participants
age.mean	Mean age
male.p	Percent of males
female.p	Percent of females
pcmore.mean	Mean value of pcmore
pall.mean	Mean value of pall

income_df <- matthews %>%
  filter(age > 21) %>%
  group_by(income) %>%
  summarise(
    N = n(),
    age.mean = mean(age),
    male.p = mean(gender == 1),
    female.p = mean(gender == 2),
    pcmore.mean = mean(pcmore),
    pall.mean = mean(pall)
  )

income_df

## # A tibble: 8 x 7
##   income     N age.mean    male.p  female.p pcmore.mean pall.mean
##    <int> <int>    <dbl>     <dbl>     <dbl>       <dbl>     <dbl>
## 1      1    26 29.76923 0.6923077 0.3076923    74.88462 0.9038462
## 2      2    43 33.06977 0.6279070 0.3720930    70.09302 0.8069767
## 3      3    25 31.20000 0.7600000 0.2400000    54.60000 0.7400000
## 4      4    27 31.62963 0.7037037 0.2962963    61.25926 0.7814815
## 5      5    26 33.30769 0.6153846 0.3846154    53.65385 0.7500000
## 6      6    23 32.78261 0.3913043 0.6086957    46.00000 0.6826087
## 7      7     7 39.28571 0.2857143 0.7142857    41.42857 0.8142857
## 8      8     3 33.33333 0.3333333 0.6666667    33.33333 0.8666667

Calculate the maximum age and the mean income at each level of race and gender. Save the results to an object called racegender_agg.

racegender_agg <- matthews %>%
  group_by(race, gender_a) %>%
  summarise(
    N = n(),  # N
    age.max = max(age), # Oldest person
    income.mean = mean(income) # Mean income
  )

racegender_agg

## # A tibble: 8 x 5
## # Groups:   race [?]
##       race gender_a     N age.max income.mean
##      <chr>    <chr> <int>   <dbl>       <dbl>
## 1    asian   female    14      60    3.714286
## 2    asian     male    27      58    3.444444
## 3    black   female    13      49    3.615385
## 4    black     male    24      44    3.291667
## 5 hispanic   female     7      67    4.571429
## 6 hispanic     male    16      52    2.625000
## 7    white   female    37      59    4.027027
## 8    white     male    52      57    3.403846

Calculate the mean value of pcmore and the percent of participants that were black, aggregated at each level of task and gender. Do this only for people with an income greater than 5. Save the results to an object called taskgender_agg.

taskgender_agg <- matthews %>%
  filter(income > 5) %>%
  group_by(task, gender_a) %>%
  summarise(
    N = n(),  # N
    pcmore.mean = mean(pcmore), # mean pcmore
    p.black = mean(race == "black") # Percent black
  )

taskgender_agg

## # A tibble: 4 x 5
## # Groups:   task [?]
##    task gender_a     N pcmore.mean   p.black
##   <int>    <chr> <int>       <dbl>     <dbl>
## 1     0   female    12    50.41667 0.1666667
## 2     0     male     5    39.00000 0.2000000
## 3     1   female     9    41.11111 0.1111111
## 4     1     male     8    46.00000 0.2500000

Using save(), save matthews, gender_agg, income_df, racegender_agg, and taskgender_agg objects to a file called matthews.RData in the data folder in your working directory.

save(matthews, gender_agg, income_df, racegender_agg, taskgender_agg, file = "data/matthews.RData")
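
To get the objects back in a later session, you can use load(), which restores every object stored in the file under its original name. The sketch below shows the round trip with invented objects (a, b and rdata_file are hypothetical) and a temporary file, so it does not touch your matthews.RData.

```r
# Two toy objects, invented for illustration
a <- 1:3
b <- data.frame(x = c("u", "v"))

# Hypothetical file location (a temporary file)
rdata_file <- tempfile(fileext = ".RData")

# Save both objects in one file
save(a, b, file = rdata_file)

# Remove them from the workspace, then restore them from the file
rm(a, b)
loaded <- load(rdata_file)

loaded   # Names of the objects that were restored
a        # The object is available again under its original name
```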

Submit!

Save and email your wpa_4_LastFirst.R file to me at nathaniel.phillips@unibas.ch.
Go to https://goo.gl/forms/b9dcRH6Ud3pDagOr1 to confirm your assignment submission.

WPA #4 – Project management and advanced dataframe manipulation