Simplify your life with R

From sharing and documenting research to modeling simple decision strategies

Nathaniel Phillips, University of Basel
Department Presentation, University of Zurich, Department of Economics


Hermit crabs

Rotjan, R. D., Chabot, J. R., & Lewis, S. M. (2010). Social context of shell acquisition in Coenobita clypeatus hermit crabs. Behavioral Ecology, 21(3), 639–646.

Phillips et al. (2014)

The Competitive Sampling Game (CSG)

Pre-decisional search

Benefits of search

Observed Outcomes

Documenting and sharing data

  • How can I store, document, and share data and analyses in an open and transparent way?

| Method               | Availability | Documentation | Accuracy |
|----------------------|--------------|---------------|----------|
| Don't share          | Low          | Low           | Low      |
| By request only      | Medium       | Low           | Low      |
| Post data online     | High         | Medium        | Medium   |
| Markdown + R package | High         | High          | High     |

Why R?

  • Free, open source, shareable, replicable.
  • Huge developer community.
  • R is a leading language of modern data science.
    • Big data, natural language processing

"To be able to choose between proprietary software packages is to be able to choose your master. Freedom means not having a master. And in the area of computing, freedom means not using proprietary software." -- Richard M. Stallman


"Closed source software is useless crap because it satisfies neither repeatability nor inspectability" -- Titus Brown

R manuscript package

  • Include data, documentation, vignettes and tutorials written in R Markdown.

  • Now everyone (even your future self) can always recover the data and analyses. Anytime. Anywhere.

Phillips et al. (2014)

All data, analyses, and data descriptions are stored in an R manuscript package called phillips2014rivals, available as the package file phillips2014rivals_0.1.0.tar at https://goo.gl/q6GvBk

# Install the phillips2014rivals R package
# Package file: phillips2014rivals_0.1.0.tar
# Link to package file: https://goo.gl/q6GvBk

install.packages("https://goo.gl/q6GvBk", 
                 repos = NULL, 
                 type = "source")

Demo

R Documentation Pros

  • Data are fully organized, documented, and linked to the analyses.
  • Accessible to anyone (like your future self) with one line of code.
    • R becomes not just a statistical engine, but a fully documented data repository.
  • Packages make your research interactive and invite other researchers to get involved.

R Documentation Cons

  • It takes time!

Complexity vs. Simplicity


Decision Strategies

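|                                                       | Compensatory                                       | Non-compensatory                             |
|-------------------------------------------------------|----------------------------------------------------|----------------------------------------------|
| Example                                               | Weighted averaging: expected utility, tally, Bayes | Heuristics: Take the Best, Tit-for-Tat       |
| Information requirements                              | High                                               | Low                                          |
| Search                                                | Comprehensive                                      | Sequential                                   |
| Speed                                                 | Slow                                               | Fast                                         |
| When do people use? (Payne, Bettman, & Johnson, 1993) | Low time pressure, high processing capacity        | High time pressure, low processing capacity  |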
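
The contrast between the two strategy families can be sketched in a few lines of R. This is only an illustration: the cue profiles, the cue weights, and the function names `choose_weighted()` and `choose_take_the_best()` are all hypothetical and do not come from the studies cited above.

```r
# Hypothetical cue profiles for two options (1 = cue present, 0 = absent).
# Cues are ordered from most to least valid; all values are made up.
option_a <- c(cue1 = 1, cue2 = 0, cue3 = 0)
option_b <- c(cue1 = 0, cue2 = 1, cue3 = 1)
weights  <- c(cue1 = 0.5, cue2 = 0.3, cue3 = 0.3)  # hypothetical cue weights

# Compensatory: weighted averaging uses every cue, so several weaker cues
# can outweigh one stronger cue.
choose_weighted <- function(a, b, w) {
  score_a <- sum(a * w)
  score_b <- sum(b * w)
  if (score_a > score_b) "A" else if (score_b > score_a) "B" else "tie"
}

# Non-compensatory: take-the-best checks cues in order and stops at the
# first cue that discriminates; later cues are never inspected.
choose_take_the_best <- function(a, b) {
  for (i in seq_along(a)) {
    if (a[[i]] != b[[i]]) {
      return(if (a[[i]] > b[[i]]) "A" else "B")
    }
  }
  "tie"
}

wadd_choice <- choose_weighted(option_a, option_b, weights)
ttb_choice  <- choose_take_the_best(option_a, option_b)
```

With these made-up values the two rules disagree: the weighted rule prefers option B because two weaker cues add up, while take-the-best stops at the first cue and prefers option A.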

Fast and Frugal Trees (FFTs)

Descriptive

  • Inference (Gigerenzer & Goldstein, 1996)
  • Judges' bail decisions (Dhami, 2003)
  • Competition: "Tit-for-Tat" (Axelrod, 1984)
  • Social learning: "Imitate the successful" (Boyd & Richerson, 2005)

Prescriptive

  • Heart disease (Breiman et al., 1993)
  • Terrorist attacks (Garcia, 2016)
  • Bank failure (Aikman et al., 2014; Neth et al., 2014)
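
To make the idea concrete, a fast and frugal tree can be written by hand as a short chain of if-statements in which every cue offers at least one exit and the last cue offers two. The sketch below is not taken from any of the cited studies: the function name `classify_patient()`, the three cues, and their thresholds are all invented for illustration.

```r
# A hand-written fast and frugal tree with three hypothetical cues.
# Every cue name and threshold here is invented for illustration.
classify_patient <- function(chest_pain, age, cholesterol) {
  if (chest_pain == "none") return("low risk")   # cue 1: exit on one branch
  if (age < 45)             return("low risk")   # cue 2: exit on one branch
  if (cholesterol > 240)    return("high risk")  # cue 3: both branches exit
  "low risk"
}

# Many cases are decided after looking at only one or two cues.
case_1 <- classify_patient("none", 70, 300)     # decided by cue 1
case_2 <- classify_patient("typical", 60, 260)  # needs all three cues
case_3 <- classify_patient("typical", 60, 200)  # needs all three cues
```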

Neth et al. (2014). "Homo heuristicus in the financial world".

FFTrees

  • FFTrees: An easy-to-use R package to create, visualize, and implement fast and frugal decision trees.


Phillips et al. (under review). FFTrees: An R package to create, visualize, and implement fast and frugal decision trees.

# Install the FFTrees package
install.packages("FFTrees")

Patient Release Decisions

  • How can we explain psychiatric patient release decisions?
  • Dataset: Release decisions for 1101 patients described by 46 cues (age, sex, diagnosis, drug history, etc.)
  • Goal: Compare a fast and frugal decision tree model of release decisions to regression.

Creating regression and FFT models

# Creating a regression decision model with glm()
patient.glm <- glm(formula = decision ~ .,
                   data = patient.data,
                   family = "binomial")

Regression

  • 10 significant predictors of release decisions.

| Predictor   | Df | F value | Pr(>F) |
|-------------|----|---------|--------|
| socsit      | 6  | 8.604   | 0.000  |
| schoolqual  | 6  | 5.076   | 0.000  |
| migration   | 6  | 3.166   | 0.005  |
| transfac    | 14 | 2.539   | 0.002  |
| withdr      | 1  | 10.526  | 0.001  |
| offense     | 10 | 3.154   | 0.001  |
| sentence    | 1  | 10.880  | 0.001  |
| prisonprior | 1  | 4.578   | 0.033  |
| raext       | 3  | 8.060   | 0.000  |
| migration2  | 1  | 4.256   | 0.040  |

3 Steps to creating FFTs with FFTrees

# Step 0: Install FFTrees
install.packages("FFTrees")

# Step 1: Load the package
library("FFTrees")

# Step 2: Create an FFT decision model with FFTrees()
patient.fft <- FFTrees(formula = decision ~ .,
                       data = patient.data)

How trees are built with FFTrees

  1. For each cue, calculate the decision threshold that maximizes accuracy.
  2. Rank the cues by their maximum accuracy.
  3. Select the top N (e.g., 4) most accurate cues.
    • If any lower levels contain less than 10% of the data, remove them.
  4. Select the exit structure with the highest accuracy.

# Show the cue accuracies
plot(patient.fft, what = "cues", main = "Patient cues")
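
Steps 1 and 2 of the tree-building procedure can be sketched in base R. This is a simplified illustration on simulated data, not the code used inside FFTrees: the data, the cue names, and the helper `best_threshold()` are hypothetical, and only one threshold direction (cue > threshold predicts 1) is searched.

```r
set.seed(1)

# Simulated data (hypothetical): a binary criterion and three numeric cues
# that differ in how strongly they relate to the criterion.
n <- 200
criterion <- rbinom(n, 1, 0.5)
cues <- data.frame(
  strong = criterion * 2 + rnorm(n),
  weak   = criterion * 0.5 + rnorm(n),
  noise  = rnorm(n)
)

# Step 1: for one cue, try every observed value as a threshold and keep
# the one with the highest classification accuracy.
best_threshold <- function(cue, criterion) {
  candidates <- sort(unique(cue))
  acc <- vapply(candidates,
                function(t) mean((cue > t) == criterion),
                numeric(1))
  c(threshold = candidates[which.max(acc)], accuracy = max(acc))
}

# Apply step 1 to every cue (one row per cue).
cue_stats <- t(sapply(cues, best_threshold, criterion = criterion))

# Step 2: rank the cues by their maximum accuracy.
cue_ranking <- cue_stats[order(cue_stats[, "accuracy"], decreasing = TRUE), ]
cue_ranking
```

The top rows of `cue_ranking` would then be the candidates for the tree's levels in step 3.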

plot(patient.fft, main = "Release Decision FFT", stats = FALSE)

plot(patient.fft)

Comparing decision accuracy

Data fitting accuracy

  • Regression: 79%, FFT: 68%

Prediction simulation

  • 1,000 cross-validation prediction simulations

Cross-validation procedure
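
One way such a prediction simulation could be run is sketched below for the regression model. The details are assumptions rather than the procedure used in the talk: the data are simulated, each simulation uses a random 50/50 split into training and test data, and only 100 simulations are run to keep the example quick.

```r
set.seed(42)

# Simulated data (hypothetical): two cues and a binary decision.
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$decision <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))

n_sim <- 100                # the talk reports 1,000 simulations
test_acc <- numeric(n_sim)  # prediction accuracy in each simulation

for (i in seq_len(n_sim)) {
  # Randomly split the data into training and test halves (assumed split).
  train_rows <- sample(nrow(dat), size = nrow(dat) / 2)
  train <- dat[train_rows, ]
  test  <- dat[-train_rows, ]

  # Fit the model on the training data only.
  fit <- glm(decision ~ ., data = train, family = "binomial")

  # Predict the unseen test data and score the predictions.
  pred <- predict(fit, newdata = test, type = "response") > 0.5
  test_acc[i] <- mean(pred == test$decision)
}

mean_test_acc <- mean(test_acc)
```

The same loop could hold an FFT fitted on `train`, so that both models are compared on identical test cases.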

How accurate can a simple tree be?

Generalizing FFTrees

  • The FFTrees package can be used with any dataset that has a binary criterion.
  • Simulation: 10 diverse datasets taken from the UCI Machine Learning Repository.
  • FFTrees vs. regression, Naive Bayes, Random Forests, and more.

How well can a simple fast and frugal tree predict data?

heart.fft <- FFTrees(diagnosis ~ ., data = heartdisease)

mushrooms.fft <- FFTrees(poisonous ~ ., data = mushrooms)

Prediction accuracy across 10 datasets

FFTrees conclusion

  • With the FFTrees R package, you can create descriptive or prescriptive fast and frugal trees and compare them to compensatory models.
  • If data can be predicted by a very simple model, then that model should be seriously considered, even if it is terribly naive.

If you only remember two things...

  1. Share and document your data and analyses -- R has great tools to do this.
  2. Consider simple heuristics, like fast and frugal trees, in addition to compensatory models.



Collaborators

  • Joerg Rieskamp (University of Basel)

  • Ralph Hertwig (MPI for Human Development)

  • Yaakov Kareev (Hebrew University of Jerusalem)

  • Judith Avrahami (Hebrew University of Jerusalem)

  • Wolfgang Gaissmaier (University of Konstanz)

  • Hansjoerg Neth (University of Konstanz)

  • Jan Woike (MPI for Human Development)

Simplify your life with R: From sharing and documenting research to modeling simple decisions.

6 Steps to creating an R package

Full tutorial: http://rpubs.com/ndphillips/rpackagescience

  1. Create the necessary folders: /data, /R, /vignettes, /inst.
  2. Create a package description file named DESCRIPTION (no file extension).
  3. Put data files in the /data and /inst folders and all R code in /R.
  4. Write documentation files for all data in /data.
  5. Write vignettes in R Markdown and put them in /vignettes.
  6. Build the package with the build() function.
    • Returns a single R package file: phillips2014rivals_0.1.0.tar
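
The first steps can be sketched with base R alone. This is a minimal illustration, not the layout of phillips2014rivals: the package name `mypackage`, the example data, and the DESCRIPTION fields are hypothetical, and the package is created in a temporary folder. It also assumes the data are documented with roxygen2-style comments, which is one common approach.

```r
# Hypothetical package created in a temporary folder.
pkg_dir <- file.path(tempdir(), "mypackage")

# Step 1: create the necessary folders.
for (folder in c("data", "R", "vignettes", "inst")) {
  dir.create(file.path(pkg_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# Step 2: write a minimal DESCRIPTION file (all field values are made up).
writeLines(c(
  "Package: mypackage",
  "Title: Data and Analyses for a Hypothetical Study",
  "Version: 0.1.0",
  "Description: Data, documentation, and vignettes for a hypothetical study.",
  "License: CC0"
), file.path(pkg_dir, "DESCRIPTION"))

# Step 3: save a (made-up) dataset in /data.
study_data <- data.frame(id = 1:3, choice = c("A", "B", "A"))
save(study_data, file = file.path(pkg_dir, "data", "study_data.rda"))

# Step 4: document the dataset with roxygen2-style comments in /R.
writeLines(c(
  "#' Hypothetical study data",
  "#'",
  "#' @format A data frame with 3 rows and 2 columns: id and choice.",
  "\"study_data\""
), file.path(pkg_dir, "R", "study_data.R"))

# Steps 5 and 6 (vignettes and building) are not run here; building could
# be done with, for example, devtools::build(pkg_dir) if devtools is installed.
```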

Efficiency

  • FFTs are very cheap to implement

  • Heart disease data

    • Regression: $300
    • rpart: > $100
    • Heart disease FFT: $75.91

Speed and frugality


A forensic non-frugal tree
