FFTrees

An R package to create, visualize, and implement fast and frugal decision trees

Nathaniel D. Phillips, Economic Psychology, University of Basel
BaselR Meeting, March 2017, ndphillips.github.io/RBasel

A growing problem at the Cook County Hospital in 1996

plot of chunk cookmap

"As the city’s principal public hospital, Cook County was the place of last resort for the hundreds of thousands of Chicagoans without health insurance. Resources were stretched to the limit. The hospital’s cavernous wards were built for another century. There were no private rooms, and patients were separated by flimsy plywood dividers. There was no cafeteria or private telephone—just a payphone for everyone at the end of the hall. In one possibly apocryphal story, doctors once trained a homeless man to do routine lab tests because there was no one else available." Malcolm Gladwell, Blink.

Emergency Room overload

plot of chunk unnamed-chunk-2

  • 250,000 patients per year
  • Not enough space.
  • Complete chaos

Heart attacks (?)

  • 30 people a day worried about a heart attack
  • Coronary care bed ($2,000 a night + 3 day stay) or Regular bed
  • Goal: send true heart attacks to the coronary care bed, and true healthy patients to a normal bed.

Multiple, uncertain measures

  • Electrocardiogram (ECG), Blood pressure, Stethescope, How long? How much? During exercise? History? Cholesterol? Drugs? etc.

plot of chunk unnamed-chunk-3

How good were doctor's intuitive decisions?

  • Task: Estimate from 0 to 100 the probability of a heart attack of 20 separate patients.

plot of chunk unnamed-chunk-4

Answer: Not consistent at all

"In each case the answers we got pretty much ranged from 0 to 100. It was extraordinary" ~ Department of Medicine chairman

The problem

  • Too much inconsistency in doctor's decisions
  • Too many healthy people sent to the coronary care unit.

Solution

  • A fast and frugal decision tree (FFT) developed by a cardiologist named Lee Goldman.

plot of chunk unnamed-chunk-5

Why use a decision tree?

  • Easy to understand, consistent, requires little information, can be calculated 'in the head'

plot of chunk unnamed-chunk-6

The Cook hospital decision tree

  • Over two years, the performance of the tree was compared to the physician's intuitive judgments.

Results

  • Doctor's accuracy: 75-90%
  • Decision tree accuracy: 95%
  • Tree had far fewer false-positives and huge cost savings
  • To this day, the tree is still used at the hospital.

plot of chunk unnamed-chunk-7

Fast and Frugal tree

  • A fast and frugal decision tree (FFT) is a decision tree where each node has exactly two branches, where at least one branch is an exit branch (Martignon et al., 2008).

  • FFTs -> Cheap, easy to understand, and rarely overfit.

plot of chunk unnamed-chunk-8

Standard decision tree

  • "Standard"" decision trees can become very complex.

  • Complexity -> High costs, Difficult to understand, prone to overfitting.

plot of chunk unnamed-chunk-9

Problem

  • There is no off-the-shelf method to construct FFTs.
    • Previous researchers have individually constructed their FFTs.

Task

  • Create an easy-to-use R package that constructs, visualizes, and implements FFTs.

plot of chunk unnamed-chunk-10

FFTrees

# v1.1.8 available on CRAN
install.packages("FFTrees")

# v1.2.0 on github
devtools::github("ndphillips/FFTrees", include_vignette = TRUE)

Heart disease datatset

library(FFTrees)

head(heartdisease)
##   age sex cp trestbps chol fbs     restecg thalach exang oldpeak slope ca
## 1  63   1 ta      145  233   1 hypertrophy     150     0     2.3  down  0
## 2  67   1  a      160  286   0 hypertrophy     108     1     1.5  flat  3
## 3  67   1  a      120  229   0 hypertrophy     129     1     2.6  flat  2
## 4  37   1 np      130  250   0      normal     187     0     3.5  down  0
## 5  41   0 aa      130  204   0 hypertrophy     172     0     1.4    up  0
## 6  56   1 aa      120  236   0      normal     178     0     0.8    up  0
##     thal diagnosis
## 1     fd         0
## 2 normal         1
## 3     rd         1
## 4 normal         0
## 5 normal         0
## 6 normal         0

4 Steps to using FFTrees

# Step 1: Create the trees
heart.fft <- FFTrees(formula = diagnosis ~., 
                     data = heart.train,
                     data.test = heart.test)

# Step 2: View summary statistics
print(heart.fft)

# Step 3: Visualise the tree
plot(heart.fft, data = "train")   # Training statistics
plot(herat.fft, data = "test")    # Test statistics

Printing an FFTrees object

# Step 2: Summary statistics
heart.fft
## [1] "7 FFTs using up to 4 of 13 cues"
## [1] "FFT #4 uses 3 cues {thal,cp,ca} with the following performance:"
##       train   test
## n    150.00 153.00
## pci    0.88   0.88
## mcu    1.74   1.73
## acc    0.80   0.82
## bacc   0.80   0.82
## sens   0.82   0.88
## spec   0.79   0.76

Plotting an FFTrees object

plot(heart.fft, stats = FALSE)

plot of chunk unnamed-chunk-16

3 cues

cue cost description values
thal $102 thallium scintigraphy, a nuclear imaging test that shows how well blood flows into the heart. normal (n), fixed defect (fd), reversible defect (rd)
cp $1 Chest pain type Typical angina (ta), atypical angina (aa), non-anginal pain (np), asymptomatic (a)
ca $101 Number of major vessels colored by flourosopy, a continuous x-ray imaging tool 0, 1, 2 or 3

plot(heart.fft)

plot of chunk unnamed-chunk-17

plot(heart.fft, data = "test")

plot of chunk unnamed-chunk-18

Heart Disease FFT | ROC

plot of chunk unnamed-chunk-19

plot(heart.fft, data = "test")

plot of chunk unnamed-chunk-20

plot(heart.fft, data = "test", tree = 3)

plot of chunk unnamed-chunk-21

plot(heart.fft, data = "test", tree = 6)

plot of chunk unnamed-chunk-22

Comparing FFTs to standard trees

How does the FFT created by FFTrees compare to a 'standard' decision tree created by rpart?

Heart disease: rpart

  • 8 predictors, 3 - 5 required to make decisions

plot of chunk unnamed-chunk-23

Heart disease: FFT

  • 3 predictors, only 1 - 3 required to make decisions

  • The FFT is very cheap to implement

    • Regression: $300
    • rpart: > $100
    • Heart disease FFT: $75.91

plot of chunk unnamed-chunk-24

Heart disease classification accuracy

plot of chunk unnamed-chunk-26

Heart disease classification accuracy

plot of chunk unnamed-chunk-27

Heart disease classification accuracy

plot of chunk unnamed-chunk-28

How accurate are FFTs built by FFTrees?

  • Prediction competition
    • 10 datasets taken from the UCI machine learning database
    • 50% Fitting / 50% Prediction subsample splitting, DV: balanced accuracy = (sensitivity + specificity) / 2
dataset cases cues base.rate
arrhythmia 68 280 0.29
audiology 226 70 0.10
breast 683 10 0.35
bridges 92 10 0.39
cmc 1473 10 0.35

Table: 5 of the 10 prediction datasets

Aggregate simulation prediction results

plot of chunk unnamed-chunk-29

Aggregate simulation prediction results

plot of chunk unnamed-chunk-30

Aggregate simulation prediction results

plot of chunk unnamed-chunk-31

Simulation prediction results by dataset

plot of chunk unnamed-chunk-32

Conclusions

  • FFTrees makes it easy to create simple, effective, transparent fast and frugal decision trees (FFTs).

  • FFTs can predict data "as well" as complex algorithms that use much more information.

Next steps

  • Speed up code with c++ or Julia.
  • Include cue costs into algorithm.
  • Quantify when a tree fails over time.
# Create FFTs in one line of code
FFTrees(diagnosis ~., 
        data = heartdisease)

plot of chunk unnamed-chunk-34

Please help and contribute!

Fitting vs. Prediction

plot of chunk unnamed-chunk-38

Fitting vs. Prediction

plot of chunk unnamed-chunk-39

Fitting vs. Prediction

plot of chunk unnamed-chunk-40

Fitting vs. Prediction

plot of chunk unnamed-chunk-41

FFTrees algorithm

  1. Calculate a decision threshold t for each cue that maximizes the cue’s balanced accuracy bacc in training.

  2. Rank cues in order of their maximum balanced accuracy -- select the top N cues.

  3. Creates all possible 2^{N−1} trees with these cues, using all exit structures.

plot of chunk unnamed-chunk-42

FFForest()

Visualise cue importance and co-occurence

plot of chunk unnamed-chunk-43

Heart disease cue accuracies

plot(heart.fft, what = "cues", main = "Heart Disease")

plot of chunk unnamed-chunk-44