Introduction

In this set of Exercises, we will explore linear regression, variable selection, model diagnostics and prediction.

Part A

A.1 Open a new R script and save it in the directory you created in Part A.1 of Exercise 1. Then, load the Auto MPG data using the load() function and file path from the end of Exercise 1.

load(here::here("data", "auto_mpg_v2.Rda"))

A.2 Regress MPG on horsepower and look at the model results with the summary() function. Interpret the meaning of the coefficient of horsepower.

linFit <- lm(mpg ~ hp, data=auto)
summary(linFit)
## 
## Call:
## lm(formula = mpg ~ hp, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## hp          -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

We estimate that, on average, fuel economy decreases by 0.16 mpg for every one unit increase in horsepower.

A.3 Plot the model diagnostics. Do you think this model fits the data adequately? Why or why not?

par(mfrow=c(2,2))
plot(linFit)

No, this model does not fit the data well. The residuals do not appear to be normally distributed.

Part B

B.1 Add in a quadratic term for horsepower and look at the model fit results. HINT: Use the indicator function I() along with update().

quadFit <- update(linFit, ~ . + I(hp^2), data=auto)
summary(quadFit)
## 
## Call:
## lm(formula = mpg ~ hp + I(hp^2), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.9000997  1.8004268   31.60   <2e-16 ***
## hp          -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(hp^2)      0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

B.2 Plot the model diagnostics. Do you think this model fits the data adequately? Why or why not?

par(mfrow=c(2,2))
plot(quadFit)

This model appears to fit the data better than the model without the quadratic horsepower term. The residuals seem approximately normally distributed.

B.3 Compare the model from Part A to the model you just fit using an F-test. What model do you conclude fits the data better?

anova(linFit, quadFit)
## Analysis of Variance Table
## 
## Model 1: mpg ~ hp
## Model 2: mpg ~ hp + I(hp^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    390 9385.9                                  
## 2    389 7442.0  1    1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null hypothesis and conclude that the full model with horsepower squared fits better than the reduced model.

B.4 Make a scatterplot of mpg versus horsepower. Add the estimated regression line from Part A using the abline() function and color it red. Add in the estimated regression line from Part B and color it blue. HINT: You will need to use the predict() and curve() functions.

plot(auto$hp, auto$mpg, pch=20, xlab="Horsepower", ylab="MPG")
abline(reg = linFit, col="red", lwd=2)
curve(predict(quadFit, data.frame(hp=x)), add=TRUE, col="blue", lwd=2)
legend("topright", legend=c("Linear Model", "Quadratic Model"), col=c("red","blue"), lty=1, lwd=2, bty="n")

Part C

C.1 Using what you’ve learned so far, fit the best possible linear model you can to predict MPG. Answers will vary. HINT: You can use automatic variable selection methods, or do so manually and compare models via adjusted \(R^2\) and F-tests.

modelC1 <- lm(mpg ~ weight + model.yr, data=auto)
summary(modelC1)
## 
## Call:
## lm(formula = mpg ~ weight + model.yr, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8777 -2.3140 -0.1211  2.0591 14.3330 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.420e+01  3.968e+00  -3.578 0.000389 ***
## weight      -6.664e-03  2.139e-04 -31.161  < 2e-16 ***
## model.yr     7.566e-01  4.898e-02  15.447  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.435 on 395 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.8069 
## F-statistic: 830.4 on 2 and 395 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(modelC1)

auto[c(330, 334, 333), ]  # look at potential outliers... top 3 mpg values
##      mpg cyl disp hp weight  acc model.yr origin                 name
## 330 46.6   4   86 65   2110 17.9       80      3            mazda glc
## 334 43.4   4   90 48   2335 23.7       80      2   vw dasher (diesel)
## 333 44.3   4   90 48   2085 21.7       80      2 vw rabbit c (diesel)
##     diesel
## 330      0
## 334      1
## 333      1
auto[52, c("mpg", "weight", "model.yr")]  # look at leverage point... heaviest vehicle
##    mpg weight model.yr
## 52  13   5140       71
modelC2 <- lm(mpg ~ weight + I(weight^2) + model.yr, data=auto)
summary(modelC2)
## 
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + model.yr, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4771 -1.7156 -0.1581  1.5327 13.1433 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.397e+00  3.823e+00   0.627    0.531    
## weight      -2.182e-02  1.426e-03 -15.296   <2e-16 ***
## I(weight^2)  2.388e-06  2.228e-07  10.718   <2e-16 ***
## model.yr     8.308e-01  4.370e-02  19.010   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.026 on 394 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8512, Adjusted R-squared:  0.8501 
## F-statistic: 751.5 on 3 and 394 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(modelC2)

auto[c(330, 334, 396), c("mpg", "weight", "model.yr")]  # look at potential outliers... top 3 mpg values
##      mpg weight model.yr
## 330 46.6   2110       80
## 334 43.4   2335       80
## 396 38.0   3015       82
anova(modelC1, modelC2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ weight + model.yr
## Model 2: mpg ~ weight + I(weight^2) + model.yr
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    395 4659.8                                  
## 2    394 3607.8  1      1052 114.88 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
finalModel <- lm(mpg ~ weight + I(weight^2) + model.yr + diesel, data=auto)
summary(finalModel)
## 
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + model.yr + diesel, 
##     data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4833 -1.6161 -0.0464  1.6154 13.5027 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.708e+00  3.503e+00   1.915   0.0562 .  
## weight      -2.215e-02  1.296e-03 -17.098   <2e-16 ***
## I(weight^2)  2.440e-06  2.024e-07  12.058   <2e-16 ***
## model.yr     7.784e-01  4.009e-02  19.416   <2e-16 ***
## diesel1      9.776e+00  1.061e+00   9.214   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.748 on 393 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8777, Adjusted R-squared:  0.8764 
## F-statistic: 704.9 on 4 and 393 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(finalModel)

auto[c(330, 403, 119), ]  # look at potential outliers...
##      mpg cyl disp hp weight  acc model.yr origin      name diesel
## 330 46.6   4   86 65   2110 17.9       80      3 mazda glc      0
## 403 44.0   4   97 52   2130 24.6       82      2 vw pickup      0
## 119 18.0   3   70 90   2124 13.5       73      3 maxda rx3      0
par(mfrow=c(1,1)); plot(auto$weight, auto$mpg, pch=20, xlab="Weight (lbs)", ylab="MPG", col=auto$diesel)

anova(modelC2, finalModel)
## Analysis of Variance Table
## 
## Model 1: mpg ~ weight + I(weight^2) + model.yr
## Model 2: mpg ~ weight + I(weight^2) + model.yr + diesel
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    394 3607.8                                  
## 2    393 2966.9  1    640.95 84.902 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

C.2 Using the model you just fit, predict the fuel economy of the 8 vehicles with missing mpg values.

missing.mpg <- auto[is.na(auto$mpg), ]
cbind(missing.mpg, fit = predict(finalModel, newdata = missing.mpg))
##     mpg cyl disp  hp weight  acc model.yr origin
## 11   NA   4  133 115   3090 17.5       70      2
## 12   NA   8  350 165   4142 11.5       70      1
## 13   NA   8  351 153   4034 11.0       70      1
## 14   NA   8  383 175   4166 10.5       70      1
## 15   NA   8  360 175   3850 11.0       70      1
## 18   NA   8  302 140   3353  8.0       70      1
## 40   NA   4   97  48   1978 20.0       71      2
## 368  NA   4  121 110   2800 15.4       81      2
##                                 name diesel      fit
## 11              citroen ds-21 pallas      0 16.03857
## 12  chevrolet chevelle concours (sw)      0 11.29794
## 13                  ford torino (sw)      0 11.53585
## 14           plymouth satellite (sw)      0 11.25280
## 15                amc rebel sst (sw)      0 12.07229
## 18             ford mustang boss 302      0 14.34709
## 40       volkswagen super beetle 117      0 27.69954
## 368                        saab 900s      0 26.85690