lm and glm) with the lm() and glm() functions
names() and summary() functions to access specific elements of regression objects.
library(yarrr) # Load the yarrr package for the pirates dataframe
# Predict beard.length as a function of sex, age, weight and tattoos
beard_lm <- lm(formula = beard.length ~ sex + age + weight + tattoos,
               data = pirates)
summary(beard_lm) # Look at summary results
names(beard_lm) # Named elements in the object
beard_lm$coefficients # Get coefficients
# Predict tattoos as a function of ALL variables in the pirates dataframe
tattoos_lm <- lm(formula = tattoos ~ .,
                 data = pirates)
# Calculate model fits
# Directly from lm object
tattoos_fits <- tattoos_lm$fitted.values
# or calculate manually using predict()
tattoos_fits <- predict(tattoos_lm, newdata = pirates)
# Calculate residuals
# Directly from lm object
tattoos_resid <- tattoos_lm$residuals
# or calculate manually
tattoos_resid <- pirates$tattoos - predict(tattoos_lm, newdata = pirates)
# Binary logistic regression
# Create a logical vector indicating which pirates like "hook"
pirates$like_hook <- pirates$favorite.pirate == "Hook"
# Conduct binary logistic regression predicting which pirates like hook
hook_glm <- glm(formula = like_hook ~ . - favorite.pirate, # exclude favorite.pirate
                data = pirates,
                family = "binomial")
summary(hook_glm) # summary of results
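The example code above gets fitted values and residuals in two ways: directly from the regression object, and manually with predict(). Here is a minimal sketch checking that both routes agree. It uses the ChickWeight data that ships with R instead of pirates so it runs without extra packages; the object names (chick_lm and so on) are made up for illustration.

```r
# ChickWeight is a built-in dataset, used here only as a stand-in
chick_lm <- lm(formula = weight ~ Time,
               data = ChickWeight)

# Fitted values: stored in the object vs. recalculated with predict()
fits_stored <- chick_lm$fitted.values
fits_predict <- predict(chick_lm, newdata = ChickWeight)

# Residuals: stored in the object vs. true value minus prediction
resid_stored <- chick_lm$residuals
resid_manual <- ChickWeight$weight - fits_predict

# unname() drops the row labels so only the numbers are compared
all.equal(unname(fits_stored), unname(fits_predict))
all.equal(unname(resid_stored), unname(resid_manual))
```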
In this WPA, you will analyze data from a study on student performance in two classes: math and Portuguese. These data come from the UCI Machine Learning database at http://archive.ics.uci.edu/ml/datasets/Student+Performance#
Here is the data description (taken directly from the original website):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
The data are located in two separate tab-delimited text files at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentmath.txt (the math data), and https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentpor.txt (the Portuguese data).
Both datafiles have 33 columns. Here they are:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
OPEN YOUR CLASS R PROJECT!!! This project should have (at least) two folders, one called data and one called R. If you do not have these folders already, create them! Open a new script and enter your name, date, and the wpa number at the top. Save the script in the R folder in your project working directory as wpa_8_LastFirst.R, where Last and First are your last and first names.
The two data files you need for this assignment are located at https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentmath.txt (the math data) and https://raw.githubusercontent.com/ndphillips/IntroductionR_Course/master/assignments/wpa/data/studentpor.txt (the Portuguese data). Using read.table(), load these two data files into R as two objects, one called student_m and one called student_p.
Look at the first few rows of the dataframes with the head() and View() functions to make sure they loaded correctly.
Using the names() and str() functions, look at the names and structure of the dataframes to make sure everything looks ok. If the data look strange, you did something wrong with read.table(). Diagnose the problem!
Using write.table(), save a local copy of the two dataframes to text files called student_m and student_p in the data folder of your project. Now, you’ll always have access to the data.
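If read.table() and write.table() are giving you trouble, the round trip below may help. It is a minimal sketch on a made-up toy dataframe (toy, toy_path and toy_loaded are hypothetical names, not part of the assignment) that writes a tab-delimited file to a temporary location and reads it back. The arguments that usually matter for tab-delimited files with a header row are sep = "\t" and header = TRUE.

```r
# Hypothetical toy dataframe standing in for the real student data
toy <- data.frame(school = c("GP", "MS", "GP"),
                  age = c(15, 17, 16),
                  G1 = c(12, 9, 14))

# Write to a temporary tab-delimited file
# (in your project you would use a path inside the data folder instead)
toy_path <- tempfile(fileext = ".txt")
write.table(x = toy,
            file = toy_path,
            sep = "\t",
            row.names = FALSE)

# Read it back; header = TRUE treats the first row as column names
toy_loaded <- read.table(file = toy_path,
                         sep = "\t",
                         header = TRUE,
                         stringsAsFactors = FALSE)

head(toy_loaded) # First few rows
str(toy_loaded)  # Column types
```

If the loaded dataframe has a single column, or column names like V1 and V2, the sep or header argument is the first thing to check.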
When reporting APA style results from a regression analysis, use the following format: STATEMENT, b = X, t(df) = X, p = X. For example:
x <- lm(weight ~ Time,
        data = ChickWeight)
summary(x)
##
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.331 -14.536 0.926 13.533 160.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.4674 3.0365 9.046 <2e-16 ***
## Time 8.8030 0.2397 36.725 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.91 on 576 degrees of freedom
## Multiple R-squared: 0.7007, Adjusted R-squared: 0.7002
## F-statistic: 1349 on 1 and 576 DF, p-value: < 2.2e-16
# There is a significant positive relationship between time and weight, b = 8.80, t(576) = 36.73, p < 0.01.
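You can also pull the numbers for an APA style statement out of the regression object instead of copying them from the printed summary. The following is a minimal sketch using the same ChickWeight regression; the variable names and the way the string is assembled are illustrative choices, not a required format.

```r
# Same regression as in the example above
x <- lm(weight ~ Time,
        data = ChickWeight)

# The coefficient table is a matrix with one row per term
coef_table <- summary(x)$coefficients

b <- coef_table["Time", "Estimate"]
t_value <- coef_table["Time", "t value"]
p_value <- coef_table["Time", "Pr(>|t|)"]
df_resid <- x$df.residual # Residual degrees of freedom

# Report very small p-values as an inequality, otherwise round to 2 digits
p_text <- ifelse(p_value < 0.01,
                 "p < 0.01",
                 paste0("p = ", round(p_value, 2)))

# Assemble the statistics into one line
apa <- paste0("b = ", round(b, 2),
              ", t(", df_resid, ") = ", round(t_value, 2),
              ", ", p_text)
apa
```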
For the math data, create a regression object called lm_6 predicting first period grade (G1) based on age.
Run names() and summary() on lm_6 to see additional information from your regression object. Now, return a vector of the coefficients by running lm_6$coefficients.
How do you interpret the relationship between age and first period grade? Give an APA style conclusion.
By hand (that is, typing the calculation manually), calculate the predicted first period math grade of a student who is 18 years old based on the regression equation (if the coefficient is non-significant, just use it anyway).
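To see what a by-hand prediction looks like, here is a minimal sketch on the built-in ChickWeight data (not the student data). It plugs Time = 10 into the regression equation, intercept + slope * value, and then compares the result to predict(). The object names are made up for illustration.

```r
# Regression of weight on Time using a built-in dataset
chick_lm <- lm(formula = weight ~ Time,
               data = ChickWeight)

# Pull the two coefficients out of the object by name
intercept <- chick_lm$coefficients["(Intercept)"]
slope <- chick_lm$coefficients["Time"]

# By hand: intercept + slope * predictor value
by_hand <- intercept + slope * 10

# Same prediction using predict() with a one-row dataframe
by_predict <- predict(chick_lm, newdata = data.frame(Time = 10))

# unname() removes the labels so only the numbers are printed
unname(by_hand)
unname(by_predict)
```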
For the Portuguese data, create a regression object called lm_10 predicting each student’s period 3 grade (G3) based on their period 1 grade (G1). Look at the results of the regression analysis with summary().
What is the relationship between first and third period Portuguese grades? Give an APA style conclusion.
By hand, calculate the predicted third period grade of a student who had a first period grade of 10.
In task 10 you calculated a regression equation predicting students’ third period Portuguese grades by their first period grades. Now let’s see if a simple correlation test gives you the same answer. Conduct a correlation test between first and third period Portuguese grades. (Hint: Refer to WPA 6 https://ndphillips.github.io/IntroductionR_Course/assignments/wpa/wpa_6_answers.html). Compare the t-value for this test to the regression analysis you did in question 10. What do you see?
Now conduct a correlation test testing the relationship between age and first period grade for the math data. Compare the t-value from this test to the result you obtained in the regression analysis in question 7. What do you see?
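If you are unsure how to get the two t-values side by side, the sketch below shows one way on the built-in ChickWeight data (not the student data). It assumes a Pearson correlation test with cor.test(), which is the default; the object names are hypothetical.

```r
# Simple regression with one predictor
chick_lm <- lm(formula = weight ~ Time,
               data = ChickWeight)

# t-value of the slope from the regression coefficient table
t_regression <- summary(chick_lm)$coefficients["Time", "t value"]

# Correlation test between the same two variables
chick_cor <- cor.test(x = ChickWeight$Time,
                      y = ChickWeight$weight)

# cor.test() stores its test statistic in the statistic element
t_correlation <- unname(chick_cor$statistic)

# Print both so you can compare them
c(regression = t_regression, correlation = t_correlation)
```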
For the math data, create a regression object called lm_15 predicting third period math grade (G3) based on sex, age, internet, and failures. Then, use the summary() function to see a summary table of the output.
Interpret the results!
Create a new regression object called lm_17 using the same variables as question 15; however, this time predict third period scores in the Portuguese dataset. Use the summary() function to understand the results.
What are the key differences between the math and Portuguese datasets in which variables predict third period scores?
Now, create a regression analysis predicting first period math grades using all variables in the dataset (Hint: use the notation formula = y ~ . to include all variables!). Which variables are significant? Are any of the variables that were significant before no longer significant?
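Here is a minimal sketch of the y ~ . notation on the built-in mtcars data (not the student data). The dot stands for every column in the dataframe other than the dependent variable, and you can drop a column again with a minus sign, as the pirates example at the top of this page does. The object names are made up for illustration.

```r
# mtcars has 11 columns: mpg plus 10 other variables

# Predict mpg from ALL other columns
all_lm <- lm(formula = mpg ~ .,
             data = mtcars)

# Predict mpg from all other columns EXCEPT carb
fewer_lm <- lm(formula = mpg ~ . - carb,
               data = mtcars)

# Compare which terms ended up in each model
names(all_lm$coefficients)
names(fewer_lm$coefficients)
```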
Is the relationship between whether or not students go out with friends and period 1 math scores different between the two schools (GP or MS)? Answer this by conducting a regression analysis with the appropriate interaction term.
Is the relationship you found above the same for the Portuguese period 1 grades?
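As a reminder of how interaction terms are written in a formula, here is a minimal sketch on the built-in mtcars data (not the student data). In a formula, a * b expands to a + b + a:b, where a:b is the interaction. The transmission column and the object names are created here just for illustration.

```r
# Copy mtcars and turn the 0/1 am column into a labelled factor
mt <- mtcars
mt$transmission <- factor(mt$am,
                          levels = c(0, 1),
                          labels = c("automatic", "manual"))

# wt * transmission = wt + transmission + wt:transmission
int_lm <- lm(formula = mpg ~ wt * transmission,
             data = mt)

summary(int_lm)

# The interaction shows up as its own coefficient
names(int_lm$coefficients)
```

The row for the interaction term tests whether the slope of the numeric predictor differs between the two groups.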
Let’s create a logistic regression analysis that answers the question: “What predicts whether or not a student improves his/her math grade from period 1 to period 3?” To do this, we need to start by creating a new logical variable which indicates whether or not a student’s period 3 grade is larger than his/her period 1 grade. Add a new variable to the math dataframe called grade_improve that shows this (Hint: You can easily do this by creating a logical vector comparing period 1 and period 3 grades).
Using the glm() function, conduct a binary logistic regression analysis that answers the main research question above. Use the summary() function to understand the results. What do you conclude?
Repeat this analysis, but now use the Portuguese data. Do you find differences in which variables predict grade improvement between the two datasets?
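The general pattern (create a logical outcome by comparing two columns, then pass it to glm() with family = "binomial") can be sketched on simulated data. Everything below is made up for illustration: the sim dataframe, its columns, and the seed are assumptions, and the numbers have nothing to do with the student data.

```r
set.seed(1) # Arbitrary seed so the simulation is reproducible

# Simulated scores at two time points plus one extra predictor
n <- 200
sim <- data.frame(score_1 = rnorm(n = n, mean = 10, sd = 3),
                  hours = runif(n = n, min = 0, max = 10))
sim$score_2 <- sim$score_1 + rnorm(n = n, mean = 0, sd = 2)

# Logical vector: TRUE when the second score is larger than the first
sim$improved <- sim$score_2 > sim$score_1

# Binary logistic regression predicting the logical outcome
improved_glm <- glm(formula = improved ~ hours + score_1,
                    data = sim,
                    family = "binomial")

summary(improved_glm)
```

When you use the formula = y ~ . shortcut for this kind of analysis, remember to exclude the columns that were used to construct the outcome, just as favorite.pirate is excluded in the pirates example.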
For the math dataset, create a regression object called lm_25 predicting a student’s first period math grades based on all variables in the dataset.
By looking at the names() of the elements in the lm_25 object, find the vector of fitted values from the regression object. This is a vector of the predicted first period grades of all students based on the regression analysis. Add these predictions as a new column in the student_m dataframe called G1_predicted.
On average, how far away were the regression model predictions from the true first period math grades? To answer this, do basic arithmetic operations on the G1 and G1_predicted vectors. You may want to use the abs() function, which will return the absolute value of a vector of values.
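Here is a minimal sketch of that kind of calculation on the built-in mtcars data (not the student data): subtract the predictions from the true values, take absolute values, and average them. The column name mpg_predicted and the other object names are made up for illustration.

```r
# Regression on a built-in dataset
mpg_lm <- lm(formula = mpg ~ wt + hp,
             data = mtcars)

# Store the fitted values as a new column in a copy of the data
mt <- mtcars
mt$mpg_predicted <- mpg_lm$fitted.values

# Average absolute distance between truth and prediction
mean_abs_error <- mean(abs(mt$mpg - mt$mpg_predicted))
mean_abs_error
```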
Create a scatterplot showing the relationship between the true first period math grades and the predicted first period math grades. How well does the regression model capture the true first period math grades?
df that contains four predictors (A, B, C, D) and some random noise (noise). I will then create dv, a dependent variable that is a linear combination of the four predictors, plus the noise. Run the following chunk.
library(tidyverse) # For dplyr
set.seed(100) # Fix the randomisation
# Create a dataframe with 4 predictors (A, B, C and D) and noise
df <- data.frame(A = rnorm(n = 100, mean = 0, sd = 1),
                 B = rnorm(n = 100, mean = 0, sd = 1),
                 C = rnorm(n = 100, mean = 0, sd = 1),
                 D = rnorm(n = 100, mean = 0, sd = 1),
                 noise = rnorm(n = 100, mean = 0, sd = 10))
# Calculate dv, a linear combination of A, B, C and D plus noise
df <- df %>%
  mutate(
    dv = 20 + A + 5 * B - 4 * C + 0 * D + noise
  )
If you were to conduct a regression analysis predicting dv as a function of the 4 predictors, what coefficients would you expect to get from the regression?
Test your prediction by conducting the appropriate analysis (don’t include the noise). Were you correct?
Now repeat the analysis, but first change the standard deviation of the noise to something really small, like 0.01. What happens to the final regression coefficients?
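If you want to practise the same idea on something smaller first, here is a minimal sketch with a single made-up predictor. The true intercept (3) and slope (2), the seed, and all object names are assumptions chosen for illustration; the only thing that changes between the two simulated datasets is the standard deviation of the noise.

```r
set.seed(42) # Arbitrary seed so the simulation is reproducible

n <- 100
x1 <- rnorm(n = n, mean = 0, sd = 1)

# Helper (hypothetical): true relationship is y = 3 + 2 * x1, plus noise
make_y <- function(noise_sd) {
  3 + 2 * x1 + rnorm(n = n, mean = 0, sd = noise_sd)
}

# Same predictor, two different noise levels
sim_noisy <- data.frame(x1 = x1, y = make_y(noise_sd = 10))
sim_clean <- data.frame(x1 = x1, y = make_y(noise_sd = 0.01))

coef_noisy <- lm(formula = y ~ x1, data = sim_noisy)$coefficients
coef_clean <- lm(formula = y ~ x1, data = sim_clean)$coefficients

# Compare the two rows to the true values of 3 and 2
rbind(noisy = coef_noisy, clean = coef_clean)
```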
wpa_8_LastFirst.R file to me at nathaniel.phillips@unibas.ch.