Readings

This assignment is based on the following readings:

Assignment Goals

Examples

library(yarrr)   #Load the yarrr package for the pirates dataframe

# Predict beard.length as a function of sex, age, weight and tattoos

beard_lm <- lm(formula = beard.length ~ sex + age + weight + tattoos,
               data = pirates)

summary(beard_lm)          # Look at summary results
names(beard_lm)            # Named elements in the object
beard_lm$coefficients      # Get coefficients

# Predict tattoos as a function of ALL variables in the pirates dataframe

tattoos_lm <- lm(formula = tattoos ~.,
                 data = pirates)

# Calculate model fits

# Directly from lm object
tattoos_fits <- tattoos_lm$fitted.values

# or calculate manually using predict()
tattoos_fits <- predict(tattoos_lm, newdata = pirates)

# Calculate residuals

# Directly from lm object
tattoos_resid <- tattoos_lm$residuals

# or calculate manually
tattoos_resid <- pirates$tattoos - predict(tattoos_lm, newdata = pirates)


# Binary logistic regression

# Create a logical vector indicating which pirates like "hook"
pirates$like_hook <- pirates$favorite.pirate == "Hook"

# Conduct binary logistic regression predicting which pirates like hook

hook_glm <- glm(formula = like_hook ~ . -favorite.pirate,   # exclude favorite.pirate
                data = pirates, 
                family = "binomial")

summary(hook_glm)  # summary of results

Student Performance

In this WPA, you will analyze data from a study on student performance in two classes: math and portuguese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#

Here is the data description (taken directly from the original website

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The data are located in two separate tab-delimited text files at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentmath.txt (the math data), and https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentpor.txt (the portugese data).

Datafile description

Both datafiles have 33 columns. Here they are:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

31 G1 - first period grade (numeric: from 0 to 20)

31 G2 - second period grade (numeric: from 0 to 20)

32 G3 - final grade (numeric: from 0 to 20, output target)

Data loading and preparation

  1. OPEN YOUR CLASS R PROJECT!!!. This project should have (at least) two folders, one called data and one called R. If you do not have these folders already, create them! Open a new script and enter your name, date, and the wpa number at the top. Save the script in the R folder in your project working directory as wpa_8_LastFirst.R, where Last and First are your last and first names.
# Done!
  1. The two data files you ned for this assignment are located at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentmath.txt (the math data) and https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentpor.txt (the portugese data). Using read.table() load this two data files into R as two objects, one called student_m, and one called student_p
student_m <- read.table(file = "https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentmath.txt",
                      sep = "\t",
                      header = TRUE)

student_p <- read.table(file = "https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentpor.txt",
                      sep = "\t",
                      header = TRUE)

Understand the data

  1. Look at the first few rows of the dataframes with the head() and View() functions to make sure they loaded correctly.
head(student_m)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        3       yes     no  yes
## 4       home   mother          1         3        0        no    yes  yes
## 5       home   father          1         2        0        no    yes  yes
## 6 reputation   mother          1         2        0        no    yes  yes
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        6  5  6  6
## 2    1      3        4  5  5  6
## 3    3      3       10  7  8 10
## 4    1      5        2 15 14 15
## 5    2      5        4  6 10 10
## 6    2      5       10 15 15 15
head(student_p)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason guardian traveltime studytime failures schoolsup famsup paid
## 1     course   mother          2         2        0       yes     no   no
## 2     course   father          1         2        0        no    yes   no
## 3      other   mother          1         2        0       yes     no   no
## 4       home   mother          1         3        0        no    yes   no
## 5       home   father          1         2        0        no    yes   no
## 6 reputation   mother          1         2        0        no    yes   no
##   activities nursery higher internet romantic famrel freetime goout Dalc
## 1         no     yes    yes       no       no      4        3     4    1
## 2         no      no    yes      yes       no      5        3     3    1
## 3         no     yes    yes      yes       no      4        3     2    2
## 4        yes     yes    yes      yes      yes      3        2     2    1
## 5         no     yes    yes       no       no      4        3     2    1
## 6        yes     yes    yes      yes       no      5        4     2    1
##   Walc health absences G1 G2 G3
## 1    1      3        4  0 11 11
## 2    1      3        2  9 11 11
## 3    3      3        6 12 13 12
## 4    1      5        0 14 14 14
## 5    2      5        0 11 13 13
## 6    2      5        6 12 12 13
  1. Using the names() and str() functions, look at the names and structure of the dataframes to make sure everything looks ok. If the data look strange, you did something wrong with read.table(), diagnose the problem!
str(student_m)
## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
str(student_p)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...
names(student_m)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"
names(student_p)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"
  1. Using write.table(), save a local copy of the two dataframes to text files called student_m and student_p in the data folder of your project. Now, you’ll always have access to the data.
# Write student_m and student_p to text files in the data folder of my working directory

write.table(student_m, file = "data/student_m.txt", sep = "\t")
write.table(student_p, file = "data/student_m.txt", sep = "\t")

Standard Regression with lm()

When reporting APA style results from a regression analysis, use the following format: STATEMENT, b = X, t(df) = X, p = X: For example:

x <- lm(weight ~ Time,
        data = ChickWeight)

summary(x)
## 
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -138.331  -14.536    0.926   13.533  160.669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.4674     3.0365   9.046   <2e-16 ***
## Time          8.8030     0.2397  36.725   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.91 on 576 degrees of freedom
## Multiple R-squared:  0.7007, Adjusted R-squared:  0.7002 
## F-statistic:  1349 on 1 and 576 DF,  p-value: < 2.2e-16
# There is a significant positive relationship between time and weight, b = 8.80, t(576) = 36.73, p < 0.01.

One IV

  1. For the math data, create a regression object called lm_6 predicting first period grade (G1) based on age.
# Predict G1 as a function of age

lm_6 <- lm(G1 ~ age, 
           data = student_m)
  1. Run names() and summary() on lm_6 to see additional information from your regression object. Now, return a vector of the coefficients by running lm_5$coefficients
summary(lm_6)
## 
## Call:
## lm(formula = G1 ~ age, data = student_m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6915 -2.7749 -0.1916  2.3085  8.3085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.6919     2.1926   6.245  1.1e-09 ***
## age          -0.1667     0.1309  -1.273    0.204    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.317 on 393 degrees of freedom
## Multiple R-squared:  0.004106,   Adjusted R-squared:  0.001572 
## F-statistic:  1.62 on 1 and 393 DF,  p-value: 0.2038
names(lm_6)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
lm_6$coefficients
## (Intercept)         age 
##  13.6918906  -0.1666864
  1. How do you interpret the relationship between age and first period grade? Give an APA style conclusion.
# There is a slight negative relationship between age and first period grade (b = -0.17), however the relationship is not significant t(393) = -1.27, p = 0.20
  1. By hand (that is, typing the calculation manually), calculate the predicted first period math grade of a student who is 18 years old based on the regression equation (if the coefficient is non-significant, just use it anyway).
13.6919 + (-0.1667) * 18
## [1] 10.6913
  1. For the portugese data, create a regression object called lm_10 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1). Look at the results of the regression analysis with summary().
lm_10 <- lm(G3 ~ G1, 
            data = student_p)

summary(lm_10)
## 
## Call:
## lm(formula = G3 ~ G1, data = student_p)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5179  -0.5454   0.3996   0.6196  10.1796 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.82040    0.30545   2.686  0.00742 ** 
## G1           0.97250    0.02605  37.329  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.821 on 647 degrees of freedom
## Multiple R-squared:  0.6829, Adjusted R-squared:  0.6824 
## F-statistic:  1393 on 1 and 647 DF,  p-value: < 2.2e-16
  1. What is the relationship between first and third period portugese grades? Give an APA style conclusion.
# There is a strong positive relationship between first period grade and third period grade, b = 0.97, t(647) = 37.33, p < .01.
  1. By hand, calculate the calculate the predicted third period grade of a student who had a first period grade of 10.
0.8204 + 0.9725 * 10
## [1] 10.5454

Regression vs. Correlation

  1. In task 10 you calculated a regression equation predicting students’ third period portugese grades by their first period grades. Now let’s see if a simple correlation test gives you the same answer. Conduct a correlation test between first and third period portugese grades. (Hint: Refer to WPA 6 https://ndphillips.github.io/IntroductionR_Course/assignments/wpa/wpa_6_answers.html). Compare the t-value for this test to the regression analysis you did in question 10. What do you see?
cor.test(~ G1 + G3, 
         data = student_p)
## 
##  Pearson's product-moment correlation
## 
## data:  G1 and G3
## t = 37.329, df = 647, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8003265 0.8493311
## sample estimates:
##       cor 
## 0.8263871
# The test statistic is exactly the same!! t(647) = 37.329
  1. Now conduct a correlation test testing the relationship between age and first period grade for the math data. Compare the t-value from this test to the result you obtained in the regression analysis in question 7. What do you see?
cor.test(~ age + G1, 
         data = student_m)
## 
##  Pearson's product-moment correlation
## 
## data:  age and G1
## t = -1.273, df = 393, p-value = 0.2038
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.16172991  0.03480963
## sample estimates:
##        cor 
## -0.0640815
# Once again the test statistic is exactly the same!! t(393) = -1.273.
# Apparently correlating two variables is, statistically, exactly the same as performing a regression analysis.

Multiple IVs

  1. For the math data, create a regression object called lm_15 predicting third period math grade (G3) based on sex, age, internet, and failures. Then, use the summary() function to see a summary table of the output
lm_15 <- lm(G3 ~ sex + age + internet + failures, 
            data = student_m)


summary(lm_15)
## 
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student_m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2156  -1.9523   0.0965   3.0252   9.4370 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.9962     2.9808   4.695 3.69e-06 ***
## sexM          1.0451     0.4282   2.441   0.0151 *  
## age          -0.2407     0.1735  -1.388   0.1660    
## internetyes   0.7855     0.5761   1.364   0.1735    
## failures     -2.1260     0.2966  -7.167 3.86e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.237 on 390 degrees of freedom
## Multiple R-squared:  0.1533, Adjusted R-squared:  0.1446 
## F-statistic: 17.65 on 4 and 390 DF,  p-value: 2.488e-13
  1. Interpret the results!
# Men have significantly higher third period math grades than women (b = 1.05, t(390) = 2.441, p = 0.02), there is a sigificant negative relationship between the number of past class failures and third period math grades (b = -2.13, t(390) = -7.17, p < .01). There are no significant relationships for age or internet (ps > 0.05).
  1. Create a new regression object called lm_17 using the same variables as question 15, however, this time predict third period scores in the portugese dataset dataset. Use the summary() function to understand the results.
lm_17 <- lm(G3 ~ sex + age + internet + failures, 
            data = student_p)

summary(lm_17)
## 
## Call:
## lm(formula = G3 ~ sex + age + internet + failures, data = student_p)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8941  -1.8345   0.0522   1.8807   7.8041 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.61020    1.68101   6.907 1.19e-11 ***
## sexM        -0.71515    0.23625  -3.027 0.002568 ** 
## age          0.01986    0.10031   0.198 0.843134    
## internetyes  0.92639    0.27508   3.368 0.000803 ***
## failures    -2.04819    0.20738  -9.877  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.936 on 644 degrees of freedom
## Multiple R-squared:  0.1794, Adjusted R-squared:  0.1743 
## F-statistic: 35.19 on 4 and 644 DF,  p-value: < 2.2e-16
  1. What are the key differences between the math and portugese datasets in which variables predict third period scores?
# In the portugese datset, men do worse than women, and internet actually helps performance!
  1. Now, create a regression analysis predicting third period portugese grades using all variables in the dataset (Hint: use the notation formula = y ~ . to include all variables!). Which variables are significant? Are any of the variables that were significant before no longer significant.
lm_19 <- lm(G3 ~ ., 
            data = student_p)

summary(lm_19)
## 
## Call:
## lm(formula = G3 ~ ., data = student_p)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7618 -0.5148  0.0038  0.6047  5.4973 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.63823    0.96361   0.662 0.508011    
## schoolMS         -0.19797    0.12783  -1.549 0.121992    
## sexM             -0.12258    0.11778  -1.041 0.298423    
## age               0.02869    0.04835   0.593 0.553208    
## addressU          0.11446    0.12277   0.932 0.351565    
## famsizeLE3        0.01560    0.11505   0.136 0.892197    
## PstatusT         -0.09746    0.16256  -0.600 0.549055    
## Medu             -0.09170    0.07097  -1.292 0.196799    
## Fedu              0.04962    0.06461   0.768 0.442773    
## Mjobhealth        0.26583    0.25225   1.054 0.292379    
## Mjobother        -0.09351    0.14208  -0.658 0.510720    
## Mjobservices      0.17255    0.17510   0.985 0.324808    
## Mjobteacher       0.22115    0.23558   0.939 0.348232    
## Fjobhealth       -0.44420    0.35256  -1.260 0.208189    
## Fjobother        -0.33805    0.21391  -1.580 0.114544    
## Fjobservices     -0.47121    0.22477  -2.096 0.036457 *  
## Fjobteacher      -0.54368    0.31611  -1.720 0.085958 .  
## reasonhome       -0.07885    0.13366  -0.590 0.555479    
## reasonother      -0.36174    0.17236  -2.099 0.036251 *  
## reasonreputation -0.16934    0.13990  -1.210 0.226584    
## guardianmother   -0.02513    0.12461  -0.202 0.840252    
## guardianother     0.21732    0.24922   0.872 0.383539    
## traveltime        0.13859    0.07459   1.858 0.063667 .  
## studytime         0.04965    0.06620   0.750 0.453569    
## failures         -0.25495    0.09900  -2.575 0.010254 *  
## schoolsupyes     -0.18419    0.17319  -1.064 0.287969    
## famsupyes         0.09456    0.10701   0.884 0.377230    
## paidyes          -0.19166    0.21664  -0.885 0.376663    
## activitiesyes     0.01208    0.10482   0.115 0.908275    
## nurseryyes       -0.09562    0.12722  -0.752 0.452553    
## higheryes         0.20749    0.18261   1.136 0.256285    
## internetyes       0.08517    0.12955   0.657 0.511152    
## romanticyes      -0.04209    0.10786  -0.390 0.696483    
## famrel           -0.01597    0.05471  -0.292 0.770469    
## freetime         -0.05002    0.05267  -0.950 0.342694    
## goout            -0.01889    0.05041  -0.375 0.708033    
## Dalc             -0.05194    0.07185  -0.723 0.469977    
## Walc             -0.01693    0.05553  -0.305 0.760521    
## health           -0.05522    0.03633  -1.520 0.129064    
## absences          0.01359    0.01173   1.158 0.247198    
## G1                0.12933    0.03762   3.438 0.000626 ***
## G2                0.87037    0.03495  24.906  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.249 on 607 degrees of freedom
## Multiple R-squared:   0.86,  Adjusted R-squared:  0.8506 
## F-statistic: 90.95 on 41 and 607 DF,  p-value: < 2.2e-16
# Compared to the analysis in question 17, sex is no longer significant, internet is no longer significant, but failures is still significant.

# Note: The original version of this question asked for predicting FIRST period grades for the MATH data. I've now changed it to the third period grade for the portugese data as that's the last analysis we did in question 17.

Checkpoint!!!

Interactions

  1. Is the relationship between whether or not students go out with friends and period 1 math scores different between the two schools (BP or MS)? Answer this by conducting a regression analysis with the appropriate interaction term.
lm_20 <- lm(G1 ~ goout * school,
            data = student_m)

summary(lm_20)
## 
## Call:
## lm(formula = G1 ~ goout * school, data = student_m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0027 -2.5938 -0.0027  2.1595  8.1220 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12.6896     0.5165  24.570  < 2e-16 ***
## goout           -0.5623     0.1561  -3.601 0.000358 ***
## schoolMS        -3.8376     1.5975  -2.402 0.016758 *  
## goout:schoolMS   1.1525     0.4897   2.354 0.019088 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.27 on 391 degrees of freedom
## Multiple R-squared:  0.0366, Adjusted R-squared:  0.02921 
## F-statistic: 4.951 on 3 and 391 DF,  p-value: 0.002195
# yes there is a significant interaction! b = 1.15, t(391) = 1.15, p = 0.02
  1. Is the relationship you found above the same for the portugese period 1 grades?
lm_21 <- lm(G1 ~ goout * school,
            data = student_p)

summary(lm_21)
## 
## Call:
## lm(formula = G1 ~ goout * school, data = student_p)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8365  -2.0115  -0.1864   1.8136   8.5880 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    12.53625    0.37143  33.752  < 2e-16 ***
## goout          -0.17493    0.11085  -1.578  0.11504    
## schoolMS       -1.94744    0.62282  -3.127  0.00185 ** 
## goout:schoolMS  0.08652    0.18160   0.476  0.63394    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.625 on 645 degrees of freedom
## Multiple R-squared:  0.08968,    Adjusted R-squared:  0.08544 
## F-statistic: 21.18 on 3 and 645 DF,  p-value: 4.273e-13
# Nope. For the portugese grades there is no significant interaction, b = 0.09, t(645) = 0.48, p = 0.63

Binary Logistic regression

  1. Let’s create a logistic regression analysis that answers the question: “What predicts whether or not a student improves his/her math grade from period 1 to period 3?” To do this, we need to start by creating a new logical variable which indicates whether or not a student’s period 3 grade is larger than his/her period 1 grade. Add a new variable to the math dataframe called grade_improve that shows this (Hint: You can easily do this by creating a logical vector comparing period 1 and period 3 grades).
student_m$grade_improve <- student_m$G3 > student_m$G1
  1. Using the glm() function, conduct a binary logistic regression analysis that answers the main research question above. Use the summary() function to understand the results. What do you conclude?
glm_23 <- glm(grade_improve ~ .,
              data = student_m,
              family = "binomial")

summary(glm_23)
## 
## Call:
## glm(formula = grade_improve ~ ., family = "binomial", data = student_m)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -3.882e-05  -2.100e-08  -2.100e-08   2.100e-08   2.846e-05  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -1.800e+01  1.730e+05   0.000    1.000
## schoolMS         -9.059e-01  2.628e+04   0.000    1.000
## sexM             -1.358e+00  1.614e+04   0.000    1.000
## age              -3.501e-01  6.824e+03   0.000    1.000
## addressU          1.392e+00  1.909e+04   0.000    1.000
## famsizeLE3       -5.622e-01  1.491e+04   0.000    1.000
## PstatusT         -1.306e-01  1.756e+04   0.000    1.000
## Medu             -3.296e-01  8.867e+03   0.000    1.000
## Fedu             -1.303e-01  8.747e+03   0.000    1.000
## Mjobhealth        2.981e-01  2.999e+04   0.000    1.000
## Mjobother        -2.086e-01  2.344e+04   0.000    1.000
## Mjobservices      5.164e-01  2.742e+04   0.000    1.000
## Mjobteacher       1.425e+00  3.011e+04   0.000    1.000
## Fjobhealth       -1.663e+00  4.490e+04   0.000    1.000
## Fjobother        -1.166e+00  2.882e+04   0.000    1.000
## Fjobservices     -1.544e+00  2.973e+04   0.000    1.000
## Fjobteacher      -1.047e+00  3.385e+04   0.000    1.000
## reasonhome       -4.007e-01  1.597e+04   0.000    1.000
## reasonother       6.018e-01  2.466e+04   0.000    1.000
## reasonreputation -3.331e-01  1.721e+04   0.000    1.000
## guardianmother   -7.786e-01  1.549e+04   0.000    1.000
## guardianother    -1.512e+00  3.476e+04   0.000    1.000
## traveltime       -1.485e-01  1.191e+04   0.000    1.000
## studytime         4.544e-02  8.594e+03   0.000    1.000
## failures         -4.477e-01  1.178e+04   0.000    1.000
## schoolsupyes     -1.891e+00  2.575e+04   0.000    1.000
## famsupyes        -1.509e-01  1.461e+04   0.000    1.000
## paidyes          -9.176e-01  1.465e+04   0.000    1.000
## activitiesyes     3.261e-01  1.336e+04   0.000    1.000
## nurseryyes        4.537e-01  1.714e+04   0.000    1.000
## higheryes         3.628e+00  9.400e+04   0.000    1.000
## internetyes      -1.299e+00  2.243e+04   0.000    1.000
## romanticyes      -4.369e-01  1.418e+04   0.000    1.000
## famrel            6.976e-01  8.242e+03   0.000    1.000
## freetime         -9.229e-02  7.311e+03   0.000    1.000
## goout            -1.477e-01  6.808e+03   0.000    1.000
## Dalc              7.961e-01  1.029e+04   0.000    1.000
## Walc             -3.250e-02  6.878e+03   0.000    1.000
## health           -1.765e-01  5.231e+03   0.000    1.000
## absences          2.572e-03  1.050e+03   0.000    1.000
## G1               -4.529e+01  1.057e+04  -0.004    0.997
## G2                1.522e-01  8.702e+03   0.000    1.000
## G3                4.509e+01  1.094e+04   0.004    0.997
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5.1711e+02  on 394  degrees of freedom
## Residual deviance: 3.1834e-08  on 352  degrees of freedom
## AIC: 86
## 
## Number of Fisher Scoring iterations: 25
# Actually we have a problem here! The problem is that the model can perfectly predict grade improvement because it currently has access to both G1 and G3. We need to rerun the analysis but EXCLUDING G1 and G3...

glm_23 <- glm(grade_improve ~ .,
              data = subset(student_m, select = c(-G1, -G3)),   # Exclude columns G1 and G3!
              family = "binomial")

summary(glm_23)
## 
## Call:
## glm(formula = grade_improve ~ ., family = "binomial", data = subset(student_m, 
##     select = c(-G1, -G3)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6835  -0.8774  -0.5195   1.0131   2.5495  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.170258   2.638855  -0.065   0.9486    
## schoolMS         -0.802432   0.513457  -1.563   0.1181    
## sexM             -0.612143   0.290820  -2.105   0.0353 *  
## age              -0.162788   0.125540  -1.297   0.1947    
## addressU          0.490124   0.344860   1.421   0.1553    
## famsizeLE3       -0.202798   0.278695  -0.728   0.4668    
## PstatusT         -0.222202   0.411606  -0.540   0.5893    
## Medu             -0.013632   0.188422  -0.072   0.9423    
## Fedu             -0.117795   0.159411  -0.739   0.4599    
## Mjobhealth       -0.085128   0.649117  -0.131   0.8957    
## Mjobother         0.294336   0.419185   0.702   0.4826    
## Mjobservices      0.240903   0.467361   0.515   0.6062    
## Mjobteacher       0.424096   0.615256   0.689   0.4906    
## Fjobhealth        0.261414   0.850298   0.307   0.7585    
## Fjobother         0.699142   0.635129   1.101   0.2710    
## Fjobservices      0.268166   0.657341   0.408   0.6833    
## Fjobteacher      -0.012721   0.791620  -0.016   0.9872    
## reasonhome       -0.193825   0.311622  -0.622   0.5340    
## reasonother       0.961705   0.456002   2.109   0.0349 *  
## reasonreputation -0.322455   0.327102  -0.986   0.3242    
## guardianmother   -0.179797   0.299377  -0.601   0.5481    
## guardianother    -0.801163   0.620545  -1.291   0.1967    
## traveltime       -0.138520   0.198514  -0.698   0.4853    
## studytime        -0.142796   0.161538  -0.884   0.3767    
## failures          0.368359   0.202141   1.822   0.0684 .  
## schoolsupyes      0.483901   0.367486   1.317   0.1879    
## famsupyes         0.442378   0.275913   1.603   0.1089    
## paidyes          -0.264673   0.270126  -0.980   0.3272    
## activitiesyes    -0.052579   0.252336  -0.208   0.8349    
## nurseryyes       -0.127564   0.311228  -0.410   0.6819    
## higheryes        -0.020668   0.641993  -0.032   0.9743    
## internetyes      -0.150675   0.352512  -0.427   0.6691    
## romanticyes      -0.347262   0.271162  -1.281   0.2003    
## famrel            0.279927   0.143638   1.949   0.0513 .  
## freetime         -0.123018   0.136262  -0.903   0.3666    
## goout            -0.059728   0.128480  -0.465   0.6420    
## Dalc              0.254498   0.187445   1.358   0.1746    
## Walc              0.107042   0.140697   0.761   0.4468    
## health           -0.005229   0.091439  -0.057   0.9544    
## absences         -0.054196   0.022106  -2.452   0.0142 *  
## G2                0.189013   0.041559   4.548 5.41e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 517.11  on 394  degrees of freedom
## Residual deviance: 436.58  on 354  degrees of freedom
## AIC: 518.58
## 
## Number of Fisher Scoring iterations: 4
# Now that looks better!

# Men improve their scores LESS than women (b = -0.61, t(354) = -2.11, p = 0.04), people whose reason to choose the school is "other" improve more than those with another reason to choose the school (b = 0.96, t(354) = 2.11, p = 0.03), the more absences a student has the less they improve their scores (b = -0.05, t(354) = -2.45, p = 0.01), and the higher their period 2 grades, the more they improve their scores (b = 0.19, t(354) = 4.55, p < .01).
  1. Repeat this analyses, but now use the portugese data. Do you find differences in which variables predict grade improvement between the two datasets?
student_p$grade_improve <- student_p$G3 > student_p$G1

glm_24 <- glm(grade_improve ~ .,
              data = student_p,
              family = "binomial")

summary(glm_24)
## 
## Call:
## glm(formula = grade_improve ~ ., family = "binomial", data = student_p)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -1.351e-05  -6.245e-06   2.110e-08   6.383e-06   1.306e-05  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -2.394e+01  1.061e+05   0.000    1.000
## schoolMS         -7.397e-02  1.344e+04   0.000    1.000
## sexM              1.032e-01  1.257e+04   0.000    1.000
## age               9.049e-02  5.433e+03   0.000    1.000
## addressU         -2.492e-01  1.385e+04   0.000    1.000
## famsizeLE3        1.425e-01  1.212e+04   0.000    1.000
## PstatusT         -1.212e-01  1.750e+04   0.000    1.000
## Medu              3.002e-02  7.278e+03   0.000    1.000
## Fedu              6.272e-02  6.870e+03   0.000    1.000
## Mjobhealth        1.528e-01  2.748e+04   0.000    1.000
## Mjobother         2.017e-01  1.640e+04   0.000    1.000
## Mjobservices      2.734e-03  1.900e+04   0.000    1.000
## Mjobteacher       2.065e-01  2.522e+04   0.000    1.000
## Fjobhealth        2.105e-01  3.647e+04   0.000    1.000
## Fjobother        -1.762e-01  2.282e+04   0.000    1.000
## Fjobservices     -2.526e-01  2.392e+04   0.000    1.000
## Fjobteacher      -2.295e-01  3.278e+04   0.000    1.000
## reasonhome        4.040e-01  1.473e+04   0.000    1.000
## reasonother       4.026e-02  1.927e+04   0.000    1.000
## reasonreputation  1.894e-01  1.515e+04   0.000    1.000
## guardianmother    2.657e-01  1.381e+04   0.000    1.000
## guardianother     6.944e-01  2.978e+04   0.000    1.000
## traveltime       -3.476e-02  9.126e+03   0.000    1.000
## studytime        -1.922e-01  7.278e+03   0.000    1.000
## failures          1.966e-01  1.167e+04   0.000    1.000
## schoolsupyes      5.889e-02  1.798e+04   0.000    1.000
## famsupyes        -1.009e-01  1.169e+04   0.000    1.000
## paidyes          -2.499e-01  2.064e+04   0.000    1.000
## activitiesyes     8.426e-02  1.137e+04   0.000    1.000
## nurseryyes        2.259e-02  1.388e+04   0.000    1.000
## higheryes        -3.155e-01  2.117e+04   0.000    1.000
## internetyes      -6.539e-02  1.432e+04   0.000    1.000
## romanticyes      -1.630e-01  1.209e+04   0.000    1.000
## famrel           -4.483e-02  6.312e+03   0.000    1.000
## freetime          6.285e-02  5.674e+03   0.000    1.000
## goout            -1.263e-01  5.730e+03   0.000    1.000
## Dalc              3.437e-02  7.886e+03   0.000    1.000
## Walc              1.819e-03  6.109e+03   0.000    1.000
## health           -2.544e-02  3.949e+03   0.000    1.000
## absences         -3.494e-02  1.382e+03   0.000    1.000
## G1               -4.799e+01  1.085e+04  -0.004    0.996
## G2                1.920e-01  7.192e+03   0.000    1.000
## G3                4.776e+01  1.132e+04   0.004    0.997
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.9736e+02  on 648  degrees of freedom
## Residual deviance: 2.7744e-08  on 606  degrees of freedom
## AIC: 86
## 
## Number of Fisher Scoring iterations: 25
# Same problem as before, We'll need to rerun the analysis but EXCLUDING G1 and G3...

glm_24 <- glm(grade_improve ~ .,
              data = subset(student_p, select = c(-G1, -G3)),   # Exclude columns G1 and G3!
              family = "binomial")

summary(glm_24)
## 
## Call:
## glm(formula = grade_improve ~ ., family = "binomial", data = subset(student_p, 
##     select = c(-G1, -G3)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0499  -1.0879   0.6097   1.0513   1.8200  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -4.883496   1.630964  -2.994 0.002751 ** 
## schoolMS          0.367518   0.217474   1.690 0.091040 .  
## sexM              0.216094   0.202176   1.069 0.285143    
## age               0.284149   0.082682   3.437 0.000589 ***
## addressU         -0.181494   0.209593  -0.866 0.386525    
## famsizeLE3       -0.062713   0.195401  -0.321 0.748254    
## PstatusT         -0.075552   0.276326  -0.273 0.784533    
## Medu             -0.067137   0.120314  -0.558 0.576837    
## Fedu              0.098707   0.109996   0.897 0.369524    
## Mjobhealth        0.237134   0.426528   0.556 0.578236    
## Mjobother         0.076978   0.241251   0.319 0.749667    
## Mjobservices     -0.083437   0.296503  -0.281 0.778402    
## Mjobteacher       0.282297   0.398199   0.709 0.478364    
## Fjobhealth       -0.382061   0.607351  -0.629 0.529310    
## Fjobother        -0.697897   0.383796  -1.818 0.069002 .  
## Fjobservices     -0.892196   0.400758  -2.226 0.025996 *  
## Fjobteacher      -1.218474   0.554623  -2.197 0.028025 *  
## reasonhome        0.201758   0.227397   0.887 0.374944    
## reasonother      -0.551785   0.294514  -1.874 0.060993 .  
## reasonreputation -0.057891   0.237428  -0.244 0.807366    
## guardianmother    0.313602   0.211172   1.485 0.137528    
## guardianother     0.827878   0.447420   1.850 0.064265 .  
## traveltime        0.081250   0.126782   0.641 0.521613    
## studytime        -0.219660   0.113142  -1.941 0.052204 .  
## failures         -0.027489   0.169461  -0.162 0.871138    
## schoolsupyes      0.238238   0.293098   0.813 0.416316    
## famsupyes         0.042976   0.182762   0.235 0.814093    
## paidyes          -0.123058   0.362639  -0.339 0.734353    
## activitiesyes     0.014425   0.178307   0.081 0.935521    
## nurseryyes        0.059858   0.216298   0.277 0.781979    
## higheryes        -0.373691   0.313295  -1.193 0.232957    
## internetyes       0.082019   0.220764   0.372 0.710246    
## romanticyes      -0.097444   0.183958  -0.530 0.596316    
## famrel            0.082980   0.093015   0.892 0.372328    
## freetime          0.027011   0.089909   0.300 0.763853    
## goout            -0.205177   0.086302  -2.377 0.017434 *  
## Dalc              0.086699   0.122850   0.706 0.480354    
## Walc             -0.009490   0.094190  -0.101 0.919744    
## health           -0.134846   0.062064  -2.173 0.029803 *  
## absences         -0.009268   0.019936  -0.465 0.641999    
## G2                0.153448   0.036845   4.165 3.12e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 897.36  on 648  degrees of freedom
## Residual deviance: 823.31  on 608  degrees of freedom
## AIC: 905.31
## 
## Number of Fisher Scoring iterations: 4
#  There are some differences. For the portugese class, age, fathers job, how often students go out with friends and current health status each significantly predict score improvement. Sex, reason to choose the school, and absences are no longer significant. Period 2 scores are still significant.

Predicting values

  1. For the math dataset, create a regression object called lm_25 predicting a student’s first period math grades based on all variables in the dataset.
lm_25 <- lm(G1 ~., 
            data = student_m)
  1. By looking at the names() of the elements in the lm_25 object, find the vector of fitted values from the regression object. This is a vector of the predicted first period grades of all students based on the regression analysis. Add these predictions as a new column in the student_m dataframe called G1_predicted.
names(lm_25)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
# it's fitted.values!

lm_25$fitted.values[1:10]
##         1         2         3         4         5         6         7 
##  6.601191  6.224862  6.477085 14.666764  8.708676 14.806084 12.617109 
##         8         9        10 
##  7.703304 15.127780 12.737022
student_m$G1_predicted <- lm_25$fitted.values
  1. On average, how far away were the regression model predictions from the true first period math grades? To answer this, do basic arithmetic operations on the G1 and G1_predicted vectors. You may want to use the abs() function, which will return the absolute value of a vector of values.
mean(abs(student_m$G1 - student_m$G1_predicted))
## [1] 0.9016461
  1. Create a scatterplot showing the relationship between the true first period math grades and the predicted first period math grades. How well does the regression model capture the true first period math grades?
library(ggplot2)

ggplot(data = student_m, 
       mapping = aes(x = G1, y = G1_predicted)) + 
  geom_point(alpha = .4) +   # Slightly transparent points
  theme_bw() +               # Use the black and white theme
  labs(title = 'Predicted versus actual Period 1 math grades',
       y = "Predicted Grades",
       x = "Actual Grades")

# The model does a pretty good job I think

Simulating results

  1. In the following code chunk, I will create a dataframe called df that contains four predictors (A, B, C, D) and some random noise (noise). I will then create dv, a dependent variable that is a linear combination of the four predictors, plus the noise. Run the following chunk.
library(tidyverse) # For dplyr

set.seed(100)  # Fix the randomisation

# Create a dataframe with 4 predictors (A, B, C and D) and noise

df <- data.frame(A = rnorm(n = 100, mean = 0, sd = 1),
                 B = rnorm(n = 100, mean = 0, sd = 1),
                 C = rnorm(n = 100, mean = 0, sd = 1),
                 D = rnorm(n = 100, mean = 0, sd = 1),
                 noise = rnorm(n = 100, mean = 0, sd = 10))

# Calculate y, a linear combination of A, B, C plus noise

df <- df %>%
  mutate(
    dv = 20 + A + 5 * B - 4 * C + 0 * D + noise
  )
  1. If you were to conduct a regression analysis predicting dv as a function of the 4 predictors, what coefficients would you expect to get from the regression?
# I would expect to get 20, 1, 5, -4 and 0
  1. Test your prediction by conducting the appropriate analysis (don’t include the noise). Were you correct?
lm_31 <- lm(dv ~ A + B + C + D,
            data = df)

summary(lm_31)
## 
## Call:
## lm(formula = dv ~ A + B + C + D, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.2071  -5.7369   0.9528   7.4823  20.6544 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.6159     1.0804  17.231   <2e-16 ***
## A             0.7842     1.0748   0.730   0.4674    
## B             4.1584     1.4213   2.926   0.0043 ** 
## C            -2.7358     1.0886  -2.513   0.0137 *  
## D            -0.7390     1.0107  -0.731   0.4665    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.77 on 95 degrees of freedom
## Multiple R-squared:  0.1722, Adjusted R-squared:  0.1374 
## F-statistic: 4.942 on 4 and 95 DF,  p-value: 0.001157
# I wasn't perfectly correct, but the numbers aren't too far off. The reason why they aren't perfect is because there is a lot of noise in the data.
  1. Now repeat the analysis, but first change the standard deviation of the noise to something really small, like 0.01. What happens to the final regression coefficients?
library(tidyverse) # For dplyr

set.seed(100)  # Fix the randomisation

# Create a dataframe with 4 predictors (A, B, C and D) and noise

df <- data.frame(A = rnorm(n = 100, mean = 0, sd = 1),
                 B = rnorm(n = 100, mean = 0, sd = 1),
                 C = rnorm(n = 100, mean = 0, sd = 1),
                 D = rnorm(n = 100, mean = 0, sd = 1),
                 noise = rnorm(n = 100, mean = 0, sd = .01))

# Calculate y, a linear combination of A, B, C plus noise

df <- df %>%
  mutate(
    dv = 20 + A + 5 * B - 4 * C + 0 * D + noise
  )

lm_32 <- lm(dv ~ A + B + C + D,
            data = df)

summary(lm_32)
## 
## Call:
## lm(formula = dv ~ A + B + C + D, data = df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0292071 -0.0057369  0.0009528  0.0074823  0.0206544 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 19.998616   0.001080 18511.175   <2e-16 ***
## A            0.999784   0.001075   930.180   <2e-16 ***
## B            4.999158   0.001421  3517.273   <2e-16 ***
## C           -3.998736   0.001089 -3673.147   <2e-16 ***
## D           -0.000739   0.001011    -0.731    0.466    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01077 on 95 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.741e+06 on 4 and 95 DF,  p-value: < 2.2e-16
# Now they are almost exactly as I predicted! That's because there's almost no noise in the data now...

Submit!