Linear Regression

CSULB Intro to R

April 27, 2018

Agenda

  1. Statistical Distributions (very brief)

  2. T-Tests

  3. Linear Regression

    • Fitting Models
    • Interpretation
    • Comparing Models & Variable Selection
    • Diagnostics
    • Prediction

The Normal Distribution

Statistical Distributions in R follow a common naming scheme: d (density), p (CDF), q (quantile), and r (random generation), combined with a distribution name, e.g. dnorm, pnorm, qnorm, rnorm.

Standard Normal Distribution

Calculate the value of the probability density function at \(X = 0\)

str(dnorm) # normal pdf
## function (x, mean = 0, sd = 1, log = FALSE)
dnorm(x = 0, mean = 0, sd = 1)
## [1] 0.3989423
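As a quick sanity check (not from the slides), the density at 0 should match the closed-form standard normal density \(1/\sqrt{2\pi} \approx 0.3989\):

```r
# dnorm(0) equals the closed form 1/sqrt(2*pi) up to floating-point tolerance
all.equal(dnorm(0), 1 / sqrt(2 * pi))
```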

Standard Normal Distribution

Plot the density function using dnorm()

x <- seq(from = -3, to = 3, by = 0.05)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l", ylab = "Density", lwd = 2)

Standard Normal Distribution

Generate 10 independent random numbers from a standard normal distribution

str(rnorm) # generate random numbers from a normal dist
## function (n, mean = 0, sd = 1)
rnorm(10, mean = 0, sd = 1)
##  [1]  0.61262205  0.05963103  0.87009410  1.05050864 -0.67983398
##  [6]  0.25674646 -0.91089705 -0.09773639  0.94322955 -0.79700942

Standard Normal Distribution

Calculate the probability that \(X \leq 0\)

str(pnorm) # normal CDF
## function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
pnorm(0, mean = 0, sd = 1) # Pr[X <= 0] = ?
## [1] 0.5

Standard Normal Distribution

Find the quantile \(q\) such that \(\Pr[X \leq q] = 0.975\)

str(qnorm) # normal quantile function
## function (p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(0.975)  # Pr[X <= ?] = 0.975
## [1] 1.959964
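Note that qnorm() and pnorm() are inverses of one another; a minimal check (assuming a standard normal):

```r
# qnorm() inverts pnorm(): Pr[X <= qnorm(p)] = p
p <- 0.975
q <- qnorm(p)  # ~1.96, the familiar 95% critical value
pnorm(q)       # recovers 0.975
```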

T-Tests

T-tests can be used to draw statistical conclusions about parameters of interest in the data (here, population means).

T-tests can be categorized into two groups:

  1. One-sample t-test

  2. Two-sample t-test
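The slides below walk through the one-sample case. As a sketch of the two-sample case (with simulated data, not from the slides), t.test() accepts two vectors and tests whether their means differ:

```r
set.seed(1)                            # reproducible simulated data
groupA <- rnorm(50, mean = 0)          # group A drawn with mean 0
groupB <- rnorm(50, mean = 1)          # group B drawn with mean 1
twoSampTest <- t.test(groupA, groupB)  # H0: equal means (Welch's test by default)
twoSampTest$p.value                    # small p-value here: reject H0
```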

One-Sample T-Test (Create Data)

set.seed(123)
oneSampData <- rnorm(100, mean = 0, sd = 1)
mean(oneSampData)
## [1] 0.09040591
sd(oneSampData)
## [1] 0.9128159

One-Sample T-Test (\(H_0: \mu = 0\))

oneSampTest.0 <- t.test(oneSampData) 
oneSampTest.0
## 
##  One Sample t-test
## 
## data:  oneSampData
## t = 0.99041, df = 99, p-value = 0.3244
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.09071657  0.27152838
## sample estimates:
##  mean of x 
## 0.09040591
names(oneSampTest.0) 
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"

One-Sample T-Test (\(H_0: \mu = 0\))

oneSampTest.0$statistic
##         t 
## 0.9904068
oneSampTest.0$p.value
## [1] 0.3243898

One-Sample T-Test (\(H_0: \mu = 0.3\))

oneSampTest.mu <- t.test(oneSampData, mu = 0.3)
oneSampTest.mu
## 
##  One Sample t-test
## 
## data:  oneSampData
## t = -2.2961, df = 99, p-value = 0.02378
## alternative hypothesis: true mean is not equal to 0.3
## 95 percent confidence interval:
##  -0.09071657  0.27152838
## sample estimates:
##  mean of x 
## 0.09040591

Linear Regression

Linear Regression in R

Linear Regression - Prestige Data

We will fit a linear model using the prestige data. Recall the available variables: education, income, women, prestige, census, and type.

Load Data

prestige <- read.csv(file = here::here("data", "prestige_v2.csv"), 
                     row.names=1)
str(prestige)
## 'data.frame':    101 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
head(prestige)
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof

Another look at the scatterplot matrix

library(car)
scatterplotMatrix(prestige[,c("prestige","education","income","women")])

Linear Regression - Fit the Model

myReg <- lm(prestige ~ education + income + women, data = prestige)
myReg
## 
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
## 
## Coefficients:
## (Intercept)    education       income        women  
##   -6.801684     4.182482     0.001315    -0.008304
names(myReg)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Linear Regression - Summary of Fit

summary(myReg)
## 
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.7831  -5.4102  -0.1204   5.1821  17.5318 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.8016842  3.2543983  -2.090   0.0392 *  
## education    4.1824817  0.3907842  10.703  < 2e-16 ***
## income       0.0013153  0.0002791   4.712  8.2e-06 ***
## women       -0.0083038  0.0306187  -0.271   0.7868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.883 on 97 degrees of freedom
## Multiple R-squared:  0.798,  Adjusted R-squared:  0.7917 
## F-statistic: 127.7 on 3 and 97 DF,  p-value: < 2.2e-16

Linear Regression - Summary Contents

myReg.summary <- summary(myReg)
names(myReg.summary) # show different contents
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
names(myReg) # this is what we had previously
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Linear Regression - Confidence Intervals

confint(myReg, 'income', level=0.95)
##              2.5 %      97.5 %
## income 0.000761253 0.001869315
confint(myReg, level=0.95)
##                     2.5 %       97.5 %
## (Intercept) -13.260764035 -0.342604455
## education     3.406883236  4.958080173
## income        0.000761253  0.001869315
## women        -0.069073423  0.052465913

Linear Regression - Adding Variables

Recall that type of occupation has a relationship with prestige score.

So let’s add the predictor type to our regression model:

mod <- update(myReg, ~ . + type)
summary(mod)
## 
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1126  -4.9844   0.0942   5.1318  19.4630 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.7378124  5.5112850   0.497   0.6205    
## education    3.1246575  0.6644750   4.702 8.70e-06 ***
## income       0.0011764  0.0002702   4.355 3.36e-05 ***
## women        0.0050863  0.0308647   0.165   0.8695    
## typeprof     8.5502774  4.0724806   2.100   0.0384 *  
## typewc      -1.1457779  2.7280276  -0.420   0.6754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared:  0.8192, Adjusted R-squared:  0.8097 
## F-statistic:  86.1 on 5 and 95 DF,  p-value: < 2.2e-16

Linear Regression - Comparing Models

Suppose we want to test which of our two models is better when one model is nested within the other (i.e., both models contain the same terms and one has at least one additional term).

formula(myReg)
## prestige ~ education + income + women
formula(mod)
## prestige ~ education + income + women + type

We say that myReg is nested within mod, or equivalently that myReg is the reduced model and mod is the full model.

Hypothesis Test: \(H_0\): the reduced model is adequate (the coefficients for all levels of type are 0) vs. \(H_a\): at least one type coefficient is nonzero.

Linear Regression - Comparing Models, contd.

anova(myReg, mod)
## Analysis of Variance Table
## 
## Model 1: prestige ~ education + income + women
## Model 2: prestige ~ education + income + women + type
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     97 6028.2                                
## 2     95 5394.6  2     633.6 5.5789 0.005119 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So we reject the null hypothesis and conclude that at least one of the levels of type of occupation contributes information about the prestige score.

Linear Regression - Relevel a Factor

Say we want to change the reference level for the covariate type so that the intercept of the model corresponds to professionals. We simply relevel the variable type and re-fit the model.

levels(prestige$type)
## [1] "bc"   "prof" "wc"
prestige$type <- relevel(prestige$type, "prof")
levels(prestige$type)
## [1] "prof" "bc"   "wc"

Note: This does not change the results of the model (e.g., predictions, F-statistics) but does change the interpretation and p-values of the type coefficients.
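This invariance can be checked directly; a minimal sketch with simulated data (the variable names here are illustrative, not from the prestige data):

```r
set.seed(42)
f <- factor(rep(c("bc", "prof", "wc"), each = 20))  # three-level factor
y <- rnorm(60) + as.integer(f)                      # outcome shifts by level
fit1 <- lm(y ~ f)                                   # reference level: "bc"
fit2 <- lm(y ~ relevel(f, "prof"))                  # reference level: "prof"
all.equal(fitted(fit1), fitted(fit2))               # identical fitted values
```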

Linear Regression - Relevel a Factor, contd.

mod <- update(myReg, ~ . + type)
summary(mod)
## 
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1126  -4.9844   0.0942   5.1318  19.4630 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.2880898  9.1659054   1.232  0.22116    
## education    3.1246575  0.6644750   4.702 8.70e-06 ***
## income       0.0011764  0.0002702   4.355 3.36e-05 ***
## women        0.0050863  0.0308647   0.165  0.86946    
## typebc      -8.5502774  4.0724806  -2.100  0.03842 *  
## typewc      -9.6960553  2.9391152  -3.299  0.00137 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared:  0.8192, Adjusted R-squared:  0.8097 
## F-statistic:  86.1 on 5 and 95 DF,  p-value: < 2.2e-16

Linear Regression - More on Variable Selection

Stepwise Regression

null <- lm(prestige ~ 1, data=prestige)  # most basic model, intercept only
full <- lm(prestige ~ education + income + women + type, data=prestige)  # saturated model, all predictors
step(null, scope=list(lower=null, upper=full), direction='forward')
## Start:  AIC=576.54
## prestige ~ 1
## 
##             Df Sum of Sq     RSS    AIC
## + education  1   21567.5  8274.6 448.99
## + type       2   20685.1  9157.0 461.22
## + income     1   15236.2 14605.9 506.38
## <none>                   29842.1 576.54
## + women      1     400.9 29441.1 577.18
## 
## Step:  AIC=448.99
## prestige ~ education
## 
##          Df Sum of Sq    RSS    AIC
## + income  1   2241.78 6032.8 419.07
## + type    2   1428.45 6846.1 433.85
## + women   1    866.64 7407.9 439.81
## <none>                8274.6 448.99
## 
## Step:  AIC=419.07
## prestige ~ education + income
## 
##         Df Sum of Sq    RSS    AIC
## + type   2    636.63 5396.2 411.81
## <none>               6032.8 419.07
## + women  1      4.57 6028.2 421.00
## 
## Step:  AIC=411.81
## prestige ~ education + income + type
## 
##         Df Sum of Sq    RSS    AIC
## <none>               5396.2 411.81
## + women  1    1.5421 5394.6 413.78
## 
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
## 
## Coefficients:
## (Intercept)    education       income       typebc       typewc  
##   11.520005     3.135291     0.001153    -8.646693    -9.655940
step(full, scope=list(lower=null, upper=full), direction='backward')
## Start:  AIC=413.78
## prestige ~ education + income + women + type
## 
##             Df Sum of Sq    RSS    AIC
## - women      1      1.54 5396.2 411.81
## <none>                   5394.6 413.78
## - type       2    633.60 6028.2 421.00
## - income     1   1076.80 6471.4 430.16
## - education  1   1255.70 6650.3 432.92
## 
## Step:  AIC=411.81
## prestige ~ education + income + type
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   5396.2 411.81
## - type       2    636.63 6032.8 419.07
## - education  1   1276.30 6672.5 431.25
## - income     1   1449.96 6846.1 433.85
## 
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
## 
## Coefficients:
## (Intercept)    education       income       typebc       typewc  
##   11.520005     3.135291     0.001153    -8.646693    -9.655940
step(null, scope=list(lower=null, upper=full), direction='both')
## Start:  AIC=576.54
## prestige ~ 1
## 
##             Df Sum of Sq     RSS    AIC
## + education  1   21567.5  8274.6 448.99
## + type       2   20685.1  9157.0 461.22
## + income     1   15236.2 14605.9 506.38
## <none>                   29842.1 576.54
## + women      1     400.9 29441.1 577.18
## 
## Step:  AIC=448.99
## prestige ~ education
## 
##             Df Sum of Sq     RSS    AIC
## + income     1    2241.8  6032.8 419.07
## + type       2    1428.5  6846.1 433.85
## + women      1     866.6  7407.9 439.81
## <none>                    8274.6 448.99
## - education  1   21567.5 29842.1 576.54
## 
## Step:  AIC=419.07
## prestige ~ education + income
## 
##             Df Sum of Sq     RSS    AIC
## + type       2     636.6  5396.2 411.81
## <none>                    6032.8 419.07
## + women      1       4.6  6028.2 421.00
## - income     1    2241.8  8274.6 448.99
## - education  1    8573.1 14605.9 506.38
## 
## Step:  AIC=411.81
## prestige ~ education + income + type
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   5396.2 411.81
## + women      1      1.54 5394.6 413.78
## - type       2    636.63 6032.8 419.07
## - education  1   1276.30 6672.5 431.25
## - income     1   1449.96 6846.1 433.85
## 
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
## 
## Coefficients:
## (Intercept)    education       income       typebc       typewc  
##   11.520005     3.135291     0.001153    -8.646693    -9.655940

Linear Regression - Diagnostics

par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(myReg)

Linear Regression - Prediction

Predict the response for a new observation

newData <- data.frame(education = 13.2, income = 12000, women = 12)
predict(myReg, newData, interval = "prediction")
##        fit     lwr      upr
## 1 64.09084 48.2459 79.93578
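Note that interval = "prediction" returns a prediction interval for a single new observation, which is wider than the confidence interval for the mean response at the same inputs. A minimal sketch with simulated data (not the prestige model):

```r
set.seed(7)
d <- data.frame(x = runif(50, 0, 10))
d$y <- 2 + 3 * d$x + rnorm(50)                      # simulated linear relationship
fit <- lm(y ~ x, data = d)
new <- data.frame(x = 5)
ci <- predict(fit, new, interval = "confidence")    # CI for the mean response
pred <- predict(fit, new, interval = "prediction")  # wider: adds new-observation noise
```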

End of Session 2

Next up:

  1. Exercise 2
  2. Break

Return at 3:00 to discuss solutions to Exercise 2!