In this WPA, you will conduct regression analyses (lm and glm) with the lm() and glm() functions. You will also use the names() and summary() functions to access specific elements of regression objects.

library(yarrr) # Load the yarrr package for the pirates dataframe
# Predict beard.length as a function of sex, age, weight and tattoos
beard_lm <- lm(formula = beard.length ~ sex + age + weight + tattoos,
data = pirates)
summary(beard_lm) # Look at summary results
names(beard_lm) # Named elements in the object
beard_lm$coefficients # Get coefficients
# Predict tattoos as a function of ALL variables in the pirates dataframe
tattoos_lm <- lm(formula = tattoos ~ .,
data = pirates)
# Calculate model fits
# Directly from lm object
tattoos_fits <- tattoos_lm$fitted.values
# or calculate manually using predict()
tattoos_fits <- predict(tattoos_lm, newdata = pirates)
# Calculate residuals
# Directly from lm object
tattoos_resid <- tattoos_lm$residuals
# or calculate manually
tattoos_resid <- pirates$tattoos - predict(tattoos_lm, newdata = pirates)
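As a quick sanity check (assuming the pirates data and tattoos_lm are in your workspace), the fitted values and residuals should add back up to the observed values:

```r
# Fitted values plus residuals reconstruct the observed tattoo counts
head(tattoos_lm$fitted.values + tattoos_lm$residuals)
head(pirates$tattoos)
```

If lm() had dropped any rows with missing values, the two vectors could differ in length, so this check also confirms no rows were lost.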
# Binary logistic regression
# Create a logical vector indicating which pirates like "hook"
pirates$like_hook <- pirates$favorite.pirate == "Hook"
# Conduct binary logistic regression predicting which pirates like hook
hook_glm <- glm(formula = like_hook ~ . -favorite.pirate, # exclude favorite.pirate
data = pirates,
family = "binomial")
summary(hook_glm) # summary of results
In this WPA, you will analyze data from a study on student performance in two classes: math and Portuguese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#
Here is the data description (taken directly from the original website):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
The data are located in two separate tab-delimited text files at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentmath.txt (the math data), and https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentpor.txt (the Portuguese data).
Both datafiles have 33 columns. Here they are:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
Open your project for this class. Your project working directory should contain a folder called data and one called R. If you do not have these folders already, create them! Open a new script and enter your name, date, and the wpa number at the top. Save the script in the R folder in your project working directory as wpa_8_LastFirst.R, where Last and First are your last and first names.

# Done!
Using read.table(), load the two data files into R as two objects: one called student_m, and one called student_p.
student_m <- read.table(file = "https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentmath.txt",
sep = "\t",
header = TRUE)
student_p <- read.table(file = "https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentpor.txt",
sep = "\t",
header = TRUE)
Use the head() and View() functions to make sure they loaded correctly.

head(student_m)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 3 yes no yes
## 4 home mother 1 3 0 no yes yes
## 5 home father 1 2 0 no yes yes
## 6 reputation mother 1 2 0 no yes yes
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 6 5 6 6
## 2 1 3 4 5 5 6
## 3 3 3 10 7 8 10
## 4 1 5 2 15 14 15
## 5 2 5 4 6 10 10
## 6 2 5 10 15 15 15
head(student_p)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime failures schoolsup famsup paid
## 1 course mother 2 2 0 yes no no
## 2 course father 1 2 0 no yes no
## 3 other mother 1 2 0 yes no no
## 4 home mother 1 3 0 no yes no
## 5 home father 1 2 0 no yes no
## 6 reputation mother 1 2 0 no yes no
## activities nursery higher internet romantic famrel freetime goout Dalc
## 1 no yes yes no no 4 3 4 1
## 2 no no yes yes no 5 3 3 1
## 3 no yes yes yes no 4 3 2 2
## 4 yes yes yes yes yes 3 2 2 1
## 5 no yes yes no no 4 3 2 1
## 6 yes yes yes yes no 5 4 2 1
## Walc health absences G1 G2 G3
## 1 1 3 4 0 11 11
## 2 1 3 2 9 11 11
## 3 3 3 6 12 13 12
## 4 1 5 0 14 14 14
## 5 2 5 0 11 13 13
## 6 2 5 6 12 12 13
Using the names() and str() functions, look at the names and structure of the dataframes to make sure everything looks ok. If the data look strange, you did something wrong with read.table(); diagnose the problem!

str(student_m)
## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
str(student_p)
## 'data.frame': 649 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
names(student_m)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
names(student_p)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
Using write.table(), save a local copy of the two dataframes to text files called student_m and student_p in the data folder of your project. Now you'll always have access to the data.

# Write student_m and student_p to text files in the data folder of my working directory
write.table(student_m, file = "data/student_m.txt", sep = "\t")
write.table(student_p, file = "data/student_p.txt", sep = "\t")
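In later sessions you can then reload the local copies instead of downloading the data again (a sketch, assuming the files were written to your project's data folder as above):

```r
# Reload the locally saved copies
student_m <- read.table(file = "data/student_m.txt", sep = "\t", header = TRUE)
student_p <- read.table(file = "data/student_p.txt", sep = "\t", header = TRUE)
```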
When reporting APA style results from a regression analysis, use the following format: STATEMENT, b = X, t(df) = X, p = X. For example:
x <- lm(weight ~ Time,
data = ChickWeight)
summary(x)
##
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.331 -14.536 0.926 13.533 160.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.4674 3.0365 9.046 <2e-16 ***
## Time 8.8030 0.2397 36.725 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.91 on 576 degrees of freedom
## Multiple R-squared: 0.7007, Adjusted R-squared: 0.7002
## F-statistic: 1349 on 1 and 576 DF, p-value: < 2.2e-16
# There is a significant positive relationship between time and weight, b = 8.80, t(576) = 36.73, p < 0.01.
Create a regression object called lm_6 predicting first period grade (G1) based on age.

# Predict G1 as a function of age
lm_6 <- lm(G1 ~ age,
data = student_m)
Use the names() and summary() functions on lm_6 to see additional information from your regression object. Now, return a vector of the coefficients by running lm_6$coefficients.
summary(lm_6)
##
## Call:
## lm(formula = G1 ~ age, data = student_m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6915 -2.7749 -0.1916 2.3085 8.3085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.6919 2.1926 6.245 1.1e-09 ***
## age -0.1667 0.1309 -1.273 0.204
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.317 on 393 degrees of freedom
## Multiple R-squared: 0.004106, Adjusted R-squared: 0.001572
## F-statistic: 1.62 on 1 and 393 DF, p-value: 0.2038
names(lm_6)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
lm_6$coefficients
## (Intercept) age
## 13.6918906 -0.1666864
# There is a slight negative relationship between age and first period grade (b = -0.17); however, the relationship is not significant, t(393) = -1.27, p = 0.20.
# Predicted G1 for an 18 year-old student, using the coefficients above:
13.6919 + (-0.1667) * 18
## [1] 10.6913
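The same prediction can be obtained with predict(), which avoids copying rounded coefficients by hand (assuming lm_6 is still in your workspace):

```r
# Predicted G1 for an 18 year-old, straight from the model object
predict(lm_6, newdata = data.frame(age = 18))
```

This should match the hand calculation above up to rounding.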
Create a regression object called lm_10 predicting each student's period 3 grade (G3) based on their period 1 grade (G1). Look at the results of the regression analysis with summary().

lm_10 <- lm(G3 ~ G1,
data = student_p)
summary(lm_10)
##
## Call:
## lm(formula = G3 ~ G1, data = student_p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5179 -0.5454 0.3996 0.6196 10.1796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.82040 0.30545 2.686 0.00742 **
## G1 0.97250 0.02605 37.329 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.821 on 647 degrees of freedom
## Multiple R-squared: 0.6829, Adjusted R-squared: 0.6824
## F-statistic: 1393 on 1 and 647 DF, p-value: < 2.2e-16
# There is a strong positive relationship between first period grade and third period grade, b = 0.97, t(647) = 37.33, p < .01.
# Predicted G3 for a student whose first period grade (G1) was 10:
0.8204 + 0.9725 * 10
## [1] 10.5454
cor.test(~ G1 + G3,
data = student_p)
##
## Pearson's product-moment correlation
##
## data: G1 and G3
## t = 37.329, df = 647, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8003265 0.8493311
## sample estimates:
## cor
## 0.8263871
# The test statistic is exactly the same!! t(647) = 37.329
cor.test(~ age + G1,
data = student_m)
##
## Pearson's product-moment correlation
##
## data: age and G1
## t = -1.273, df = 393, p-value = 0.2038
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.16172991 0.03480963
## sample estimates:
## cor
## -0.0640815
# Once again the test statistic is exactly the same!! t(393) = -1.273.
# Apparently correlating two variables is, statistically, exactly the same as running a simple regression with a single predictor.
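This is no coincidence: with a single predictor, the regression slope is just the correlation rescaled by the two standard deviations. A quick sketch of this identity (assuming student_p is loaded):

```r
# Slope of G3 ~ G1 equals cor(G1, G3) * sd(G3) / sd(G1)
r <- cor(student_p$G1, student_p$G3)
b <- r * sd(student_p$G3) / sd(student_p$G1)
b  # should be close to 0.97, the slope reported for lm_10
```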
Create a regression object called lm_15 predicting third period math grade (G3) based on sex, age, internet, and failures. Then, use the summary() function to see a summary table of the output.

lm_15 <- lm(G3 ~ sex + age + internet + failures,
data = student_m)
summary(lm_15)
##
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student_m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2156 -1.9523 0.0965 3.0252 9.4370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.9962 2.9808 4.695 3.69e-06 ***
## sexM 1.0451 0.4282 2.441 0.0151 *
## age -0.2407 0.1735 -1.388 0.1660
## internetyes 0.7855 0.5761 1.364 0.1735
## failures -2.1260 0.2966 -7.167 3.86e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.237 on 390 degrees of freedom
## Multiple R-squared: 0.1533, Adjusted R-squared: 0.1446
## F-statistic: 17.65 on 4 and 390 DF, p-value: 2.488e-13
# Men have significantly higher third period math grades than women (b = 1.05, t(390) = 2.44, p = 0.02), and there is a significant negative relationship between the number of past class failures and third period math grades (b = -2.13, t(390) = -7.17, p < .01). There are no significant relationships for age or internet (ps > 0.05).
Create a regression object called lm_17 using the same variables as question 15; however, this time predict third period scores in the Portuguese dataset. Use the summary() function to understand the results.

lm_17 <- lm(G3 ~ sex + age + internet + failures,
data = student_p)
summary(lm_17)
##
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student_p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8941 -1.8345 0.0522 1.8807 7.8041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.61020 1.68101 6.907 1.19e-11 ***
## sexM -0.71515 0.23625 -3.027 0.002568 **
## age 0.01986 0.10031 0.198 0.843134
## internetyes 0.92639 0.27508 3.368 0.000803 ***
## failures -2.04819 0.20738 -9.877 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.936 on 644 degrees of freedom
## Multiple R-squared: 0.1794, Adjusted R-squared: 0.1743
## F-statistic: 35.19 on 4 and 644 DF, p-value: < 2.2e-16
# In the Portuguese dataset, men do worse than women, and internet access actually helps performance!
Now include ALL variables in the dataframe as predictors (hint: use formula = y ~ . to include all variables!). Which variables are significant? Are any of the variables that were significant before no longer significant?

lm_19 <- lm(G3 ~ .,
data = student_p)
summary(lm_19)
##
## Call:
## lm(formula = G3 ~ ., data = student_p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7618 -0.5148 0.0038 0.6047 5.4973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.63823 0.96361 0.662 0.508011
## schoolMS -0.19797 0.12783 -1.549 0.121992
## sexM -0.12258 0.11778 -1.041 0.298423
## age 0.02869 0.04835 0.593 0.553208
## addressU 0.11446 0.12277 0.932 0.351565
## famsizeLE3 0.01560 0.11505 0.136 0.892197
## PstatusT -0.09746 0.16256 -0.600 0.549055
## Medu -0.09170 0.07097 -1.292 0.196799
## Fedu 0.04962 0.06461 0.768 0.442773
## Mjobhealth 0.26583 0.25225 1.054 0.292379
## Mjobother -0.09351 0.14208 -0.658 0.510720
## Mjobservices 0.17255 0.17510 0.985 0.324808
## Mjobteacher 0.22115 0.23558 0.939 0.348232
## Fjobhealth -0.44420 0.35256 -1.260 0.208189
## Fjobother -0.33805 0.21391 -1.580 0.114544
## Fjobservices -0.47121 0.22477 -2.096 0.036457 *
## Fjobteacher -0.54368 0.31611 -1.720 0.085958 .
## reasonhome -0.07885 0.13366 -0.590 0.555479
## reasonother -0.36174 0.17236 -2.099 0.036251 *
## reasonreputation -0.16934 0.13990 -1.210 0.226584
## guardianmother -0.02513 0.12461 -0.202 0.840252
## guardianother 0.21732 0.24922 0.872 0.383539
## traveltime 0.13859 0.07459 1.858 0.063667 .
## studytime 0.04965 0.06620 0.750 0.453569
## failures -0.25495 0.09900 -2.575 0.010254 *
## schoolsupyes -0.18419 0.17319 -1.064 0.287969
## famsupyes 0.09456 0.10701 0.884 0.377230
## paidyes -0.19166 0.21664 -0.885 0.376663
## activitiesyes 0.01208 0.10482 0.115 0.908275
## nurseryyes -0.09562 0.12722 -0.752 0.452553
## higheryes 0.20749 0.18261 1.136 0.256285
## internetyes 0.08517 0.12955 0.657 0.511152
## romanticyes -0.04209 0.10786 -0.390 0.696483
## famrel -0.01597 0.05471 -0.292 0.770469
## freetime -0.05002 0.05267 -0.950 0.342694
## goout -0.01889 0.05041 -0.375 0.708033
## Dalc -0.05194 0.07185 -0.723 0.469977
## Walc -0.01693 0.05553 -0.305 0.760521
## health -0.05522 0.03633 -1.520 0.129064
## absences 0.01359 0.01173 1.158 0.247198
## G1 0.12933 0.03762 3.438 0.000626 ***
## G2 0.87037 0.03495 24.906 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.249 on 607 degrees of freedom
## Multiple R-squared: 0.86, Adjusted R-squared: 0.8506
## F-statistic: 90.95 on 41 and 607 DF, p-value: < 2.2e-16
# Compared to the analysis in question 17, sex is no longer significant, internet is no longer significant, but failures is still significant.
# Note: The original version of this question asked for predicting FIRST period grades for the MATH data. I've now changed it to the third period grade for the Portuguese data as that's the last analysis we did in question 17.
lm_20 <- lm(G1 ~ goout * school,
data = student_m)
summary(lm_20)
##
## Call:
## lm(formula = G1 ~ goout * school, data = student_m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0027 -2.5938 -0.0027 2.1595 8.1220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.6896 0.5165 24.570 < 2e-16 ***
## goout -0.5623 0.1561 -3.601 0.000358 ***
## schoolMS -3.8376 1.5975 -2.402 0.016758 *
## goout:schoolMS 1.1525 0.4897 2.354 0.019088 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.27 on 391 degrees of freedom
## Multiple R-squared: 0.0366, Adjusted R-squared: 0.02921
## F-statistic: 4.951 on 3 and 391 DF, p-value: 0.002195
# Yes, there is a significant interaction! b = 1.15, t(391) = 2.35, p = 0.02
lm_21 <- lm(G1 ~ goout * school,
data = student_p)
summary(lm_21)
##
## Call:
## lm(formula = G1 ~ goout * school, data = student_p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8365 -2.0115 -0.1864 1.8136 8.5880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.53625 0.37143 33.752 < 2e-16 ***
## goout -0.17493 0.11085 -1.578 0.11504
## schoolMS -1.94744 0.62282 -3.127 0.00185 **
## goout:schoolMS 0.08652 0.18160 0.476 0.63394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.625 on 645 degrees of freedom
## Multiple R-squared: 0.08968, Adjusted R-squared: 0.08544
## F-statistic: 21.18 on 3 and 645 DF, p-value: 4.273e-13
# Nope. For the Portuguese grades there is no significant interaction, b = 0.09, t(645) = 0.48, p = 0.63
Create a new variable called grade_improve that shows this (hint: you can easily do this by creating a logical vector comparing period 1 and period 3 grades).

student_m$grade_improve <- student_m$G3 > student_m$G1
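Before modeling, it can be worth a quick look at how many students actually improved (assuming grade_improve was just created as above):

```r
# Counts and proportion of students whose G3 exceeds their G1
table(student_m$grade_improve)
mean(student_m$grade_improve)  # TRUE/FALSE is treated as 1/0
```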
Using the glm() function, conduct a binary logistic regression analysis that answers the main research question above. Use the summary() function to understand the results. What do you conclude?

glm_23 <- glm(grade_improve ~ .,
data = student_m,
family = "binomial")
summary(glm_23)
##
## Call:
## glm(formula = grade_improve ~ ., family = "binomial", data = student_m)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.882e-05 -2.100e-08 -2.100e-08 2.100e-08 2.846e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.800e+01 1.730e+05 0.000 1.000
## schoolMS -9.059e-01 2.628e+04 0.000 1.000
## sexM -1.358e+00 1.614e+04 0.000 1.000
## age -3.501e-01 6.824e+03 0.000 1.000
## addressU 1.392e+00 1.909e+04 0.000 1.000
## famsizeLE3 -5.622e-01 1.491e+04 0.000 1.000
## PstatusT -1.306e-01 1.756e+04 0.000 1.000
## Medu -3.296e-01 8.867e+03 0.000 1.000
## Fedu -1.303e-01 8.747e+03 0.000 1.000
## Mjobhealth 2.981e-01 2.999e+04 0.000 1.000
## Mjobother -2.086e-01 2.344e+04 0.000 1.000
## Mjobservices 5.164e-01 2.742e+04 0.000 1.000
## Mjobteacher 1.425e+00 3.011e+04 0.000 1.000
## Fjobhealth -1.663e+00 4.490e+04 0.000 1.000
## Fjobother -1.166e+00 2.882e+04 0.000 1.000
## Fjobservices -1.544e+00 2.973e+04 0.000 1.000
## Fjobteacher -1.047e+00 3.385e+04 0.000 1.000
## reasonhome -4.007e-01 1.597e+04 0.000 1.000
## reasonother 6.018e-01 2.466e+04 0.000 1.000
## reasonreputation -3.331e-01 1.721e+04 0.000 1.000
## guardianmother -7.786e-01 1.549e+04 0.000 1.000
## guardianother -1.512e+00 3.476e+04 0.000 1.000
## traveltime -1.485e-01 1.191e+04 0.000 1.000
## studytime 4.544e-02 8.594e+03 0.000 1.000
## failures -4.477e-01 1.178e+04 0.000 1.000
## schoolsupyes -1.891e+00 2.575e+04 0.000 1.000
## famsupyes -1.509e-01 1.461e+04 0.000 1.000
## paidyes -9.176e-01 1.465e+04 0.000 1.000
## activitiesyes 3.261e-01 1.336e+04 0.000 1.000
## nurseryyes 4.537e-01 1.714e+04 0.000 1.000
## higheryes 3.628e+00 9.400e+04 0.000 1.000
## internetyes -1.299e+00 2.243e+04 0.000 1.000
## romanticyes -4.369e-01 1.418e+04 0.000 1.000
## famrel 6.976e-01 8.242e+03 0.000 1.000
## freetime -9.229e-02 7.311e+03 0.000 1.000
## goout -1.477e-01 6.808e+03 0.000 1.000
## Dalc 7.961e-01 1.029e+04 0.000 1.000
## Walc -3.250e-02 6.878e+03 0.000 1.000
## health -1.765e-01 5.231e+03 0.000 1.000
## absences 2.572e-03 1.050e+03 0.000 1.000
## G1 -4.529e+01 1.057e+04 -0.004 0.997
## G2 1.522e-01 8.702e+03 0.000 1.000
## G3 4.509e+01 1.094e+04 0.004 0.997
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5.1711e+02 on 394 degrees of freedom
## Residual deviance: 3.1834e-08 on 352 degrees of freedom
## AIC: 86
##
## Number of Fisher Scoring iterations: 25
# Actually we have a problem here! The problem is that the model can perfectly predict grade improvement because it currently has access to both G1 and G3. We need to rerun the analysis but EXCLUDING G1 and G3...
glm_23 <- glm(grade_improve ~ .,
data = subset(student_m, select = c(-G1, -G3)), # Exclude columns G1 and G3!
family = "binomial")
summary(glm_23)
##
## Call:
## glm(formula = grade_improve ~ ., family = "binomial", data = subset(student_m,
## select = c(-G1, -G3)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6835 -0.8774 -0.5195 1.0131 2.5495
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.170258 2.638855 -0.065 0.9486
## schoolMS -0.802432 0.513457 -1.563 0.1181
## sexM -0.612143 0.290820 -2.105 0.0353 *
## age -0.162788 0.125540 -1.297 0.1947
## addressU 0.490124 0.344860 1.421 0.1553
## famsizeLE3 -0.202798 0.278695 -0.728 0.4668
## PstatusT -0.222202 0.411606 -0.540 0.5893
## Medu -0.013632 0.188422 -0.072 0.9423
## Fedu -0.117795 0.159411 -0.739 0.4599
## Mjobhealth -0.085128 0.649117 -0.131 0.8957
## Mjobother 0.294336 0.419185 0.702 0.4826
## Mjobservices 0.240903 0.467361 0.515 0.6062
## Mjobteacher 0.424096 0.615256 0.689 0.4906
## Fjobhealth 0.261414 0.850298 0.307 0.7585
## Fjobother 0.699142 0.635129 1.101 0.2710
## Fjobservices 0.268166 0.657341 0.408 0.6833
## Fjobteacher -0.012721 0.791620 -0.016 0.9872
## reasonhome -0.193825 0.311622 -0.622 0.5340
## reasonother 0.961705 0.456002 2.109 0.0349 *
## reasonreputation -0.322455 0.327102 -0.986 0.3242
## guardianmother -0.179797 0.299377 -0.601 0.5481
## guardianother -0.801163 0.620545 -1.291 0.1967
## traveltime -0.138520 0.198514 -0.698 0.4853
## studytime -0.142796 0.161538 -0.884 0.3767
## failures 0.368359 0.202141 1.822 0.0684 .
## schoolsupyes 0.483901 0.367486 1.317 0.1879
## famsupyes 0.442378 0.275913 1.603 0.1089
## paidyes -0.264673 0.270126 -0.980 0.3272
## activitiesyes -0.052579 0.252336 -0.208 0.8349
## nurseryyes -0.127564 0.311228 -0.410 0.6819
## higheryes -0.020668 0.641993 -0.032 0.9743
## internetyes -0.150675 0.352512 -0.427 0.6691
## romanticyes -0.347262 0.271162 -1.281 0.2003
## famrel 0.279927 0.143638 1.949 0.0513 .
## freetime -0.123018 0.136262 -0.903 0.3666
## goout -0.059728 0.128480 -0.465 0.6420
## Dalc 0.254498 0.187445 1.358 0.1746
## Walc 0.107042 0.140697 0.761 0.4468
## health -0.005229 0.091439 -0.057 0.9544
## absences -0.054196 0.022106 -2.452 0.0142 *
## G2 0.189013 0.041559 4.548 5.41e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 517.11 on 394 degrees of freedom
## Residual deviance: 436.58 on 354 degrees of freedom
## AIC: 518.58
##
## Number of Fisher Scoring iterations: 4
# Now that looks better!
# Men improve their scores LESS than women (b = -0.61, z = -2.11, p = 0.04), people whose reason for choosing the school is "other" improve more than those with another reason (b = 0.96, z = 2.11, p = 0.03), the more absences a student has the less they improve their scores (b = -0.05, z = -2.45, p = 0.01), and the higher their period 2 grades, the more they improve their scores (b = 0.19, z = 4.55, p < .01). (Note: logistic regression summaries report z rather than t statistics.)
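Because logistic regression coefficients are on the log-odds scale, one optional way to make them more interpretable is to exponentiate them into odds ratios (a sketch, assuming glm_23 from the corrected analysis above):

```r
# Odds ratios: e.g., how each extra G2 point multiplies the odds of improving
exp(glm_23$coefficients["G2"])
exp(glm_23$coefficients["sexM"])
```

Values above 1 indicate the variable increases the odds of improvement; values below 1 indicate it decreases them.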
student_p$grade_improve <- student_p$G3 > student_p$G1
glm_24 <- glm(grade_improve ~ .,
data = student_p,
family = "binomial")
summary(glm_24)
##
## Call:
## glm(formula = grade_improve ~ ., family = "binomial", data = student_p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.351e-05 -6.245e-06 2.110e-08 6.383e-06 1.306e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.394e+01 1.061e+05 0.000 1.000
## schoolMS -7.397e-02 1.344e+04 0.000 1.000
## sexM 1.032e-01 1.257e+04 0.000 1.000
## age 9.049e-02 5.433e+03 0.000 1.000
## addressU -2.492e-01 1.385e+04 0.000 1.000
## famsizeLE3 1.425e-01 1.212e+04 0.000 1.000
## PstatusT -1.212e-01 1.750e+04 0.000 1.000
## Medu 3.002e-02 7.278e+03 0.000 1.000
## Fedu 6.272e-02 6.870e+03 0.000 1.000
## Mjobhealth 1.528e-01 2.748e+04 0.000 1.000
## Mjobother 2.017e-01 1.640e+04 0.000 1.000
## Mjobservices 2.734e-03 1.900e+04 0.000 1.000
## Mjobteacher 2.065e-01 2.522e+04 0.000 1.000
## Fjobhealth 2.105e-01 3.647e+04 0.000 1.000
## Fjobother -1.762e-01 2.282e+04 0.000 1.000
## Fjobservices -2.526e-01 2.392e+04 0.000 1.000
## Fjobteacher -2.295e-01 3.278e+04 0.000 1.000
## reasonhome 4.040e-01 1.473e+04 0.000 1.000
## reasonother 4.026e-02 1.927e+04 0.000 1.000
## reasonreputation 1.894e-01 1.515e+04 0.000 1.000
## guardianmother 2.657e-01 1.381e+04 0.000 1.000
## guardianother 6.944e-01 2.978e+04 0.000 1.000
## traveltime -3.476e-02 9.126e+03 0.000 1.000
## studytime -1.922e-01 7.278e+03 0.000 1.000
## failures 1.966e-01 1.167e+04 0.000 1.000
## schoolsupyes 5.889e-02 1.798e+04 0.000 1.000
## famsupyes -1.009e-01 1.169e+04 0.000 1.000
## paidyes -2.499e-01 2.064e+04 0.000 1.000
## activitiesyes 8.426e-02 1.137e+04 0.000 1.000
## nurseryyes 2.259e-02 1.388e+04 0.000 1.000
## higheryes -3.155e-01 2.117e+04 0.000 1.000
## internetyes -6.539e-02 1.432e+04 0.000 1.000
## romanticyes -1.630e-01 1.209e+04 0.000 1.000
## famrel -4.483e-02 6.312e+03 0.000 1.000
## freetime 6.285e-02 5.674e+03 0.000 1.000
## goout -1.263e-01 5.730e+03 0.000 1.000
## Dalc 3.437e-02 7.886e+03 0.000 1.000
## Walc 1.819e-03 6.109e+03 0.000 1.000
## health -2.544e-02 3.949e+03 0.000 1.000
## absences -3.494e-02 1.382e+03 0.000 1.000
## G1 -4.799e+01 1.085e+04 -0.004 0.996
## G2 1.920e-01 7.192e+03 0.000 1.000
## G3 4.776e+01 1.132e+04 0.004 0.997
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.9736e+02 on 648 degrees of freedom
## Residual deviance: 2.7744e-08 on 606 degrees of freedom
## AIC: 86
##
## Number of Fisher Scoring iterations: 25
# Same problem as before. We'll need to rerun the analysis, this time EXCLUDING G1 and G3...
glm_24 <- glm(grade_improve ~ .,
data = subset(student_p, select = c(-G1, -G3)), # Exclude columns G1 and G3!
family = "binomial")
summary(glm_24)
##
## Call:
## glm(formula = grade_improve ~ ., family = "binomial", data = subset(student_p,
## select = c(-G1, -G3)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0499 -1.0879 0.6097 1.0513 1.8200
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.883496 1.630964 -2.994 0.002751 **
## schoolMS 0.367518 0.217474 1.690 0.091040 .
## sexM 0.216094 0.202176 1.069 0.285143
## age 0.284149 0.082682 3.437 0.000589 ***
## addressU -0.181494 0.209593 -0.866 0.386525
## famsizeLE3 -0.062713 0.195401 -0.321 0.748254
## PstatusT -0.075552 0.276326 -0.273 0.784533
## Medu -0.067137 0.120314 -0.558 0.576837
## Fedu 0.098707 0.109996 0.897 0.369524
## Mjobhealth 0.237134 0.426528 0.556 0.578236
## Mjobother 0.076978 0.241251 0.319 0.749667
## Mjobservices -0.083437 0.296503 -0.281 0.778402
## Mjobteacher 0.282297 0.398199 0.709 0.478364
## Fjobhealth -0.382061 0.607351 -0.629 0.529310
## Fjobother -0.697897 0.383796 -1.818 0.069002 .
## Fjobservices -0.892196 0.400758 -2.226 0.025996 *
## Fjobteacher -1.218474 0.554623 -2.197 0.028025 *
## reasonhome 0.201758 0.227397 0.887 0.374944
## reasonother -0.551785 0.294514 -1.874 0.060993 .
## reasonreputation -0.057891 0.237428 -0.244 0.807366
## guardianmother 0.313602 0.211172 1.485 0.137528
## guardianother 0.827878 0.447420 1.850 0.064265 .
## traveltime 0.081250 0.126782 0.641 0.521613
## studytime -0.219660 0.113142 -1.941 0.052204 .
## failures -0.027489 0.169461 -0.162 0.871138
## schoolsupyes 0.238238 0.293098 0.813 0.416316
## famsupyes 0.042976 0.182762 0.235 0.814093
## paidyes -0.123058 0.362639 -0.339 0.734353
## activitiesyes 0.014425 0.178307 0.081 0.935521
## nurseryyes 0.059858 0.216298 0.277 0.781979
## higheryes -0.373691 0.313295 -1.193 0.232957
## internetyes 0.082019 0.220764 0.372 0.710246
## romanticyes -0.097444 0.183958 -0.530 0.596316
## famrel 0.082980 0.093015 0.892 0.372328
## freetime 0.027011 0.089909 0.300 0.763853
## goout -0.205177 0.086302 -2.377 0.017434 *
## Dalc 0.086699 0.122850 0.706 0.480354
## Walc -0.009490 0.094190 -0.101 0.919744
## health -0.134846 0.062064 -2.173 0.029803 *
## absences -0.009268 0.019936 -0.465 0.641999
## G2 0.153448 0.036845 4.165 3.12e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 897.36 on 648 degrees of freedom
## Residual deviance: 823.31 on 608 degrees of freedom
## AIC: 905.31
##
## Number of Fisher Scoring iterations: 4
# There are some differences. For the Portuguese class, age, father's job, how often students go out with friends, and current health status each significantly predict score improvement. Sex, reason for choosing the school, and absences are no longer significant. Period 2 scores are still significant.
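Instead of dropping columns from the data with subset(), you can also exclude predictors directly in the formula with the - operator (the same trick as ~ . -favorite.pirate earlier). A minimal sketch using the built-in mtcars data as a stand-in:

```r
# Two equivalent ways to fit "all predictors except cyl and disp" on mtcars:

# 1) Drop the columns from the data before fitting
fit_a <- lm(mpg ~ ., data = subset(mtcars, select = c(-cyl, -disp)))

# 2) Subtract the predictors in the formula itself
fit_b <- lm(mpg ~ . - cyl - disp, data = mtcars)

all.equal(coef(fit_a), coef(fit_b))  # both approaches give identical coefficients
```

The formula version is handy when you want to keep the excluded columns in the dataframe for later use (as with G1 and G3 here).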
Create a regression object called lm_25 predicting a student’s first period math grades based on all variables in the dataset.
lm_25 <- lm(G1 ~ .,
data = student_m)
Using the names() of the elements in the lm_25 object, find the vector of fitted values from the regression object. This is a vector of the predicted first period grades of all students based on the regression analysis. Add these predictions as a new column in the student_m dataframe called G1_predicted.
names(lm_25)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
# it's fitted.values!
lm_25$fitted.values[1:10]
## 1 2 3 4 5 6 7
## 6.601191 6.224862 6.477085 14.666764 8.708676 14.806084 12.617109
## 8 9 10
## 7.703304 15.127780 12.737022
student_m$G1_predicted <- lm_25$fitted.values
Calculate the mean absolute difference between the G1 and G1_predicted vectors. You may want to use the abs() function, which will return the absolute value of a vector of values.
mean(abs(student_m$G1 - student_m$G1_predicted))
## [1] 0.9016461
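A shortcut worth knowing: because an lm object's residuals are already actual minus fitted values, the same mean absolute error can be computed directly from residuals(). A sketch using the built-in mtcars data as a stand-in:

```r
# Mean absolute error of an lm fit, two equivalent ways (mtcars stand-in)
mpg_lm <- lm(mpg ~ ., data = mtcars)

mae_1 <- mean(abs(mtcars$mpg - mpg_lm$fitted.values))  # manual, as above
mae_2 <- mean(abs(residuals(mpg_lm)))                  # via residuals()

all.equal(mae_1, mae_2)  # identical: residuals ARE actual minus fitted
```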
library(ggplot2)
ggplot(data = student_m,
mapping = aes(x = G1, y = G1_predicted)) +
geom_point(alpha = .4) + # Slightly transparent points
theme_bw() + # Use the black and white theme
labs(title = 'Predicted versus actual Period 1 math grades',
y = "Predicted Grades",
x = "Actual Grades")
# The model does a pretty good job I think
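One optional addition to a predicted-versus-actual plot like this is a 45-degree reference line, which makes over- and under-prediction visible at a glance. A sketch of the same idea using the built-in mtcars data (since the student data aren't loaded here):

```r
library(ggplot2)

# Predicted vs actual values with a 45-degree reference line (mtcars stand-in)
mpg_lm <- lm(mpg ~ ., data = mtcars)
mtcars$mpg_predicted <- mpg_lm$fitted.values

ggplot(data = mtcars,
       mapping = aes(x = mpg, y = mpg_predicted)) +
  geom_point(alpha = .4) +
  geom_abline(intercept = 0, slope = 1,
              linetype = "dashed") +  # points on this line are perfectly predicted
  theme_bw() +
  labs(title = "Predicted versus actual mpg",
       y = "Predicted mpg",
       x = "Actual mpg")
```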
I will create a dataframe called df that contains four predictors (A, B, C, D) and some random noise (noise). I will then create dv, a dependent variable that is a linear combination of the four predictors, plus the noise. Run the following chunk.
library(tidyverse) # For dplyr
set.seed(100) # Fix the randomisation
# Create a dataframe with 4 predictors (A, B, C and D) and noise
df <- data.frame(A = rnorm(n = 100, mean = 0, sd = 1),
B = rnorm(n = 100, mean = 0, sd = 1),
C = rnorm(n = 100, mean = 0, sd = 1),
D = rnorm(n = 100, mean = 0, sd = 1),
noise = rnorm(n = 100, mean = 0, sd = 10))
# Calculate dv, a linear combination of A, B and C (D has a true weight of 0), plus noise
df <- df %>%
mutate(
dv = 20 + A + 5 * B - 4 * C + 0 * D + noise
)
If you conducted a regression predicting dv as a function of the 4 predictors, what coefficients would you expect to get from the regression?
# I would expect to get 20, 1, 5, -4 and 0
lm_31 <- lm(dv ~ A + B + C + D,
data = df)
summary(lm_31)
##
## Call:
## lm(formula = dv ~ A + B + C + D, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.2071 -5.7369 0.9528 7.4823 20.6544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.6159 1.0804 17.231 <2e-16 ***
## A 0.7842 1.0748 0.730 0.4674
## B 4.1584 1.4213 2.926 0.0043 **
## C -2.7358 1.0886 -2.513 0.0137 *
## D -0.7390 1.0107 -0.731 0.4665
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.77 on 95 degrees of freedom
## Multiple R-squared: 0.1722, Adjusted R-squared: 0.1374
## F-statistic: 4.942 on 4 and 95 DF, p-value: 0.001157
# I wasn't perfectly correct, but the numbers aren't too far off. They aren't exact because there is a lot of noise in the data.
library(tidyverse) # For dplyr
set.seed(100) # Fix the randomisation
# Create a dataframe with 4 predictors (A, B, C and D) and noise
df <- data.frame(A = rnorm(n = 100, mean = 0, sd = 1),
B = rnorm(n = 100, mean = 0, sd = 1),
C = rnorm(n = 100, mean = 0, sd = 1),
D = rnorm(n = 100, mean = 0, sd = 1),
noise = rnorm(n = 100, mean = 0, sd = .01))
# Calculate dv, a linear combination of A, B and C (D has a true weight of 0), plus noise
df <- df %>%
mutate(
dv = 20 + A + 5 * B - 4 * C + 0 * D + noise
)
lm_32 <- lm(dv ~ A + B + C + D,
data = df)
summary(lm_32)
##
## Call:
## lm(formula = dv ~ A + B + C + D, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0292071 -0.0057369 0.0009528 0.0074823 0.0206544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.998616 0.001080 18511.175 <2e-16 ***
## A 0.999784 0.001075 930.180 <2e-16 ***
## B 4.999158 0.001421 3517.273 <2e-16 ***
## C -3.998736 0.001089 -3673.147 <2e-16 ***
## D -0.000739 0.001011 -0.731 0.466
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01077 on 95 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.741e+06 on 4 and 95 DF, p-value: < 2.2e-16
# Now they are almost exactly as I predicted! That's because there's almost no noise in the data now...
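Noise isn't the only lever: with the original sd = 10 noise, simply increasing the sample size also pulls the estimates toward the true values, because standard errors shrink roughly in proportion to 1 / sqrt(n). A quick sketch of the same data-generating process with n = 10,000 instead of 100 (base R only, no tidyverse needed):

```r
set.seed(100) # Fix the randomisation

# Same noisy process as above, but with 10,000 observations instead of 100
n <- 10000
df_big <- data.frame(A = rnorm(n), B = rnorm(n), C = rnorm(n), D = rnorm(n),
                     noise = rnorm(n, mean = 0, sd = 10))
df_big$dv <- 20 + df_big$A + 5 * df_big$B - 4 * df_big$C + 0 * df_big$D + df_big$noise

coef(lm(dv ~ A + B + C + D, data = df_big))
# With n = 10,000 the estimates land much closer to the true 20, 1, 5, -4 and 0
# than they did with n = 100, even though the noise is just as large.
```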
Save and email your wpa_8_LastFirst.R file to me at nathaniel.phillips@unibas.ch.