Nathaniel D. Phillips, Economic Psychology, University of Basel
BaselR Meeting, March 2017, ndphillips.github.io/RBasel
"As the city’s principal public hospital, Cook County was the place of last resort for the hundreds of thousands of Chicagoans without health insurance. Resources were stretched to the limit. The hospital’s cavernous wards were built for another century. There were no private rooms, and patients were separated by flimsy plywood dividers. There was no cafeteria or private telephone—just a payphone for everyone at the end of the hall. In one possibly apocryphal story, doctors once trained a homeless man to do routine lab tests because there was no one else available." Malcolm Gladwell, Blink.
A fast and frugal decision tree (FFT) is a decision tree where each node has exactly two branches, where at least one branch is an exit branch (Martignon et al., 2008).
FFTs -> Cheap, easy to understand, and rarely overfit.
"Standard" decision trees can become very complex.
Complexity -> High costs, difficult to understand, prone to overfitting.
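To make the FFT definition concrete, here is a minimal sketch of a three-node FFT written as plain R code. The function name `classify_fft`, the thresholds, and the exit directions are assumptions chosen for illustration (only the cue names come from the heartdisease data); this is not the tree that FFTrees builds.

```r
# Hypothetical three-cue FFT: cue names follow the heartdisease data,
# but the thresholds and exit directions are assumed for illustration.
classify_fft <- function(thal, cp, ca) {
  # Node 1: one exit branch (predict disease), one branch continues
  if (thal %in% c("rd", "fd")) return(TRUE)
  # Node 2: one exit branch (predict healthy), one branch continues
  if (cp != "a") return(FALSE)
  # Node 3 (final node): both branches are exit branches
  ca > 0
}

classify_fft(thal = "normal", cp = "a", ca = 3)
classify_fft(thal = "normal", cp = "np", ca = 0)
```

Every node offers an immediate exit, so many cases are classified without looking at the later cues.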
# v1.1.8 available on CRAN
install.packages("FFTrees")
# v1.2.0 on GitHub
devtools::install_github("ndphillips/FFTrees", build_vignettes = TRUE)
library(FFTrees)
head(heartdisease)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1 63 1 ta 145 233 1 hypertrophy 150 0 2.3 down 0
## 2 67 1 a 160 286 0 hypertrophy 108 1 1.5 flat 3
## 3 67 1 a 120 229 0 hypertrophy 129 1 2.6 flat 2
## 4 37 1 np 130 250 0 normal 187 0 3.5 down 0
## 5 41 0 aa 130 204 0 hypertrophy 172 0 1.4 up 0
## 6 56 1 aa 120 236 0 normal 178 0 0.8 up 0
## thal diagnosis
## 1 fd 0
## 2 normal 1
## 3 rd 1
## 4 normal 0
## 5 normal 0
## 6 normal 0
# Step 1: Create the trees
heart.fft <- FFTrees(formula = diagnosis ~ .,
                     data = heart.train,
                     data.test = heart.test)
# Step 2: View summary statistics
print(heart.fft)
# Step 3: Visualise the tree
plot(heart.fft, data = "train") # Training statistics
plot(heart.fft, data = "test") # Test statistics
# Step 2: Summary statistics
heart.fft
## [1] "7 FFTs using up to 4 of 13 cues"
## [1] "FFT #4 uses 3 cues {thal,cp,ca} with the following performance:"
## train test
## n 150.00 153.00
## pci 0.88 0.88
## mcu 1.74 1.73
## acc 0.80 0.82
## bacc 0.80 0.82
## sens 0.82 0.88
## spec 0.79 0.76
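The accuracy rows in this summary can be related to a 2 x 2 confusion table. The sketch below assumes the standard definitions and uses made-up counts; `hi`, `mi`, `fa` and `cr` are hypothetical names and values for hits, misses, false alarms and correct rejections, not numbers taken from `heart.fft`.

```r
# Hypothetical confusion-table counts (not from the heart.fft output)
hi <- 40  # hits: disease correctly predicted
mi <- 10  # misses: disease predicted as healthy
fa <- 15  # false alarms: healthy predicted as disease
cr <- 35  # correct rejections: healthy correctly predicted

sens <- hi / (hi + mi)                   # sensitivity
spec <- cr / (cr + fa)                   # specificity
acc  <- (hi + cr) / (hi + mi + fa + cr)  # overall accuracy
bacc <- (sens + spec) / 2                # balanced accuracy

round(c(sens = sens, spec = spec, acc = acc, bacc = bacc), 2)
```

In the test column above, (0.88 + 0.76) / 2 = 0.82, which matches the reported bacc.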
plot(heart.fft, stats = FALSE)
cue | cost | description | values |
---|---|---|---|
thal | $102 | Thallium scintigraphy, a nuclear imaging test that shows how well blood flows into the heart | normal (n), fixed defect (fd), reversible defect (rd) |
cp | $1 | Chest pain type | typical angina (ta), atypical angina (aa), non-anginal pain (np), asymptomatic (a) |
ca | $101 | Number of major vessels colored by fluoroscopy, a continuous x-ray imaging tool | 0, 1, 2 or 3 |
plot(heart.fft)
plot(heart.fft, data = "test")
plot(heart.fft, data = "test", tree = 3)
plot(heart.fft, data = "test", tree = 6)
3 predictors in total; each decision requires only 1 to 3 of them
The FFT is very cheap to implement
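A rough way to see what a decision costs is to add up the cue costs up to the node where a case exits. The sketch below uses the costs from the cue table; the lookup order thal, cp, ca and the name `cost_at_exit` are assumptions for illustration.

```r
# Cue costs from the cue table; the order thal -> cp -> ca is assumed
cue_cost <- c(thal = 102, cp = 1, ca = 101)

# Cumulative cost for a case that exits at node 1, 2 or 3
cost_at_exit <- cumsum(cue_cost)
cost_at_exit
```

Under that assumed order, a case classified at the first node incurs only the thal cost, while a case that reaches the final node incurs the cost of all three cues.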
dataset | cases | cues | base.rate |
---|---|---|---|
arrhythmia | 68 | 280 | 0.29 |
audiology | 226 | 70 | 0.10 |
breast | 683 | 10 | 0.35 |
bridges | 92 | 10 | 0.39 |
cmc | 1473 | 10 | 0.35 |
Table: 5 of the 10 prediction datasets
FFTrees makes it easy to create simple, effective, transparent fast and frugal decision trees (FFTs).
FFTs can predict data "as well" as complex algorithms that use much more information.
# Create FFTs in one line of code
FFTrees(diagnosis ~ .,
        data = heartdisease)
I am very happy to receive contributions and bug reports at http://www.github.com/ndphillips/FFTrees.
If you have data you want to try FFTrees on, or can think of new features, let's collaborate!
Calculate a decision threshold t for each cue that maximizes the cue’s balanced accuracy (bacc) in training.
Rank the cues by their maximum balanced accuracy and select the top N cues.
Create all 2^{N−1} possible trees with these cues, using all exit structures.
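The three steps above can be sketched in base R on simulated data. This is a simplified illustration, not the FFTrees implementation: all names (`bacc`, `best_threshold`, `cue_stats`, `exits`) are hypothetical, the data are random, and only the ">" threshold direction is considered.

```r
# Simulated data: a binary criterion and four numeric cues (all hypothetical)
set.seed(1)
n <- 200
criterion <- rbinom(n, 1, 0.4) == 1
cues <- data.frame(
  a = rnorm(n, mean = ifelse(criterion, 1.0, 0)),
  b = rnorm(n, mean = ifelse(criterion, 0.5, 0)),
  c = rnorm(n, mean = ifelse(criterion, 0.2, 0)),
  d = rnorm(n)
)

# Balanced accuracy: mean of sensitivity and specificity
bacc <- function(pred, truth) {
  (mean(pred[truth]) + mean(!pred[!truth])) / 2
}

# Step 1: for one cue, find the threshold t that maximizes bacc
# (simplification: only the rule "predict TRUE if x > t" is tried)
best_threshold <- function(x, truth) {
  candidates <- sort(unique(x))
  scores <- vapply(candidates, function(t) bacc(x > t, truth), numeric(1))
  c(threshold = candidates[which.max(scores)], bacc = max(scores))
}
cue_stats <- t(sapply(cues, best_threshold, truth = criterion))

# Step 2: rank cues by their maximum bacc and keep the top N
N <- 3
ranked <- order(cue_stats[, "bacc"], decreasing = TRUE)
top_cues <- rownames(cue_stats)[ranked][seq_len(N)]

# Step 3: enumerate exit structures. Each of the first N - 1 nodes
# exits on either FALSE (0) or TRUE (1); the final node has both exits.
exits <- expand.grid(rep(list(c(0, 1)), N - 1))
nrow(exits)  # equals 2^(N - 1)
```

Each row of `exits` defines one candidate tree over the same ordered cues, which is where the count 2^{N−1} comes from.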
plot(heart.fft, what = "cues", main = "Heart Disease")