April 27, 2018
Statistical Distributions (very brief)
T-Tests
Linear Regression
Very common distribution
Sometimes called the bell curve due to its shape
Two parameters: mean (\(\mu\)) and standard deviation (\(\sigma\))
Many real-life applications
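For reference, the density that dnorm() evaluates below is \[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]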
d<dist>(): evaluates the probability density/mass function at a given value
r<dist>(): generates random numbers
p<dist>(): returns the cumulative distribution function (CDF) for a given quantile
q<dist>(): returns the quantile for a given probability
Calculate the value of the probability density function at \(X = 0\)
str(dnorm) # normal pdf
## function (x, mean = 0, sd = 1, log = FALSE)
dnorm(x = 0, mean = 0, sd = 1)
## [1] 0.3989423
Plot the density function using dnorm()
x <- seq(from = -3, to = 3, by = 0.05)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l", ylab = "Density", lwd = 2)
Generate 10 independent random numbers from a standard normal distribution
str(rnorm) # generate random number from normal dist
## function (n, mean = 0, sd = 1)
rnorm(10, mean = 0, sd = 1)
## [1] 0.61262205 0.05963103 0.87009410 1.05050864 -0.67983398
## [6] 0.25674646 -0.91089705 -0.09773639 0.94322955 -0.79700942
Calculate the probability that \(X \leq 0\)
str(pnorm) # normal CDF
## function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
pnorm(0, mean = 0, sd = 1) # Pr[X <= 0] = ?
## [1] 0.5
Calculate the quantile \(q\) such that \(\Pr[X \leq q] = 0.975\)
str(qnorm) # normal quantile function
## function (p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(0.975) # Pr[X <= ?] = 0.975
## [1] 1.959964
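The same d/r/p/q naming convention holds for R's other built-in distributions. For example, the binomial family (a brief illustration, not part of the original demo):
dbinom(3, size = 10, prob = 0.5)    # Pr[X = 3], probability mass function
pbinom(3, size = 10, prob = 0.5)    # Pr[X <= 3], CDF
qbinom(0.5, size = 10, prob = 0.5)  # median (smallest x with CDF >= 0.5)
rbinom(5, size = 10, prob = 0.5)    # five random draws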
T-tests can be used to draw statistical conclusions about parameters of interest in the data
T-tests can be categorized into two groups:
One-sample t-test
Two-sample t-test
set.seed(123)
oneSampData <- rnorm(100, mean = 0, sd = 1)
mean(oneSampData)
## [1] 0.09040591
sd(oneSampData)
## [1] 0.9128159
oneSampTest.0 <- t.test(oneSampData) # by default tests H0: mu = 0
oneSampTest.0
##
## One Sample t-test
##
## data: oneSampData
## t = 0.99041, df = 99, p-value = 0.3244
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.09071657 0.27152838
## sample estimates:
## mean of x
## 0.09040591
names(oneSampTest.0)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
oneSampTest.0$statistic
## t
## 0.9904068
oneSampTest.0$p.value
## [1] 0.3243898
oneSampTest.mu <- t.test(oneSampData, mu = 0.3) # test H0: mu = 0.3
oneSampTest.mu
##
## One Sample t-test
##
## data: oneSampData
## t = -2.2961, df = 99, p-value = 0.02378
## alternative hypothesis: true mean is not equal to 0.3
## 95 percent confidence interval:
## -0.09071657 0.27152838
## sample estimates:
## mean of x
## 0.09040591
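Only the one-sample test is demonstrated above; a minimal two-sample sketch on simulated data (the group objects here are made up for illustration):
set.seed(123)
groupA <- rnorm(50, mean = 0, sd = 1)    # simulated group 1
groupB <- rnorm(50, mean = 0.5, sd = 1)  # simulated group 2
t.test(groupA, groupB) # Welch two-sample t-test (unequal variances by default)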
Approach to model the relationship between a response variable \(Y\) and one or more predictor variables \(X\).
\(Y\) is a noisy linear combination of \(X\): \[Y = X \beta + \epsilon\]
Linear regression is equivalent to modeling the mean of \(Y\); namely, \(\mu = E(Y) = X\beta\)
There are multiple ways to estimate \(\beta\), but we will focus on the most basic method: ordinary least squares (OLS) estimation.
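The OLS estimate has the closed form \(\hat{\beta} = (X^\top X)^{-1} X^\top Y\), the minimizer of the residual sum of squares \(\lVert Y - X\beta \rVert^2\). As a minimal sketch with simulated data (all object names are illustrative):
set.seed(42)
n <- 50
X <- cbind(1, runif(n))                  # design matrix: intercept column plus one predictor
beta <- c(2, 3)                          # true coefficients
Y <- X %*% beta + rnorm(n)               # noisy linear combination
betaHat <- solve(t(X) %*% X, t(X) %*% Y) # OLS: (X'X)^{-1} X'Y
betaHat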
We use lm() to fit linear regression models, which requires us to pass in a regression formula.
We don't need to create the design matrix \(X\) ourselves (R does this for us); we just tell lm() which predictors to include.
Some notes on formula syntax:
The response goes on the left of the ~ and the predictors on the right.
Multiple predictors are separated with + symbols (e.g., y ~ x + z).
A . on the right-hand side stands for all other variables in the data (e.g., y ~ .).
To fit a model without an intercept, use y ~ 0 + . or y ~ . - 1.
To regress y on x and x^2, we simply use y ~ x + I(x^2).
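As a quick illustration of the last point, a minimal sketch fitting a quadratic trend to simulated data (object names are illustrative):
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x - 0.1 * x^2 + rnorm(50)
quadFit <- lm(y ~ x + I(x^2)) # I() protects the arithmetic inside the formula
coef(quadFit)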
We will fit a linear model using the prestige data. Recall that we have the following variables:
education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation, a factor with levels bc (Blue Collar), prof (Professional, Managerial, and Technical), and wc (White Collar).
prestige <- read.csv(file = here::here("data", "prestige_v2.csv"),
                     row.names = 1)
str(prestige)
## 'data.frame': 101 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
head(prestige)
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
library(car)
scatterplotMatrix(prestige[,c("prestige","education","income","women")])
myReg <- lm(prestige ~ education + income + women, data = prestige)
myReg
##
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
##
## Coefficients:
## (Intercept) education income women
## -6.801684 4.182482 0.001315 -0.008304
names(myReg)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(myReg)
##
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.7831 -5.4102 -0.1204 5.1821 17.5318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.8016842 3.2543983 -2.090 0.0392 *
## education 4.1824817 0.3907842 10.703 < 2e-16 ***
## income 0.0013153 0.0002791 4.712 8.2e-06 ***
## women -0.0083038 0.0306187 -0.271 0.7868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.883 on 97 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.7917
## F-statistic: 127.7 on 3 and 97 DF, p-value: < 2.2e-16
myReg.summary <- summary(myReg)
names(myReg.summary) # show different contents
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
names(myReg) # this is what we had previously
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Compute a 95% confidence interval for the coefficient on income
confint(myReg, 'income', level=0.95)
## 2.5 % 97.5 %
## income 0.000761253 0.001869315
And for all coefficients at once
confint(myReg, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -13.260764035 -0.342604455
## education 3.406883236 4.958080173
## income 0.000761253 0.001869315
## women -0.069073423 0.052465913
Recall that type of occupation has a relationship with prestige score, so let's add the predictor type into our regression model:
mod <- update(myReg, ~ . + type) # keep the current formula and add type
summary(mod)
##
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1126 -4.9844 0.0942 5.1318 19.4630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7378124 5.5112850 0.497 0.6205
## education 3.1246575 0.6644750 4.702 8.70e-06 ***
## income 0.0011764 0.0002702 4.355 3.36e-05 ***
## women 0.0050863 0.0308647 0.165 0.8695
## typeprof 8.5502774 4.0724806 2.100 0.0384 *
## typewc -1.1457779 2.7280276 -0.420 0.6754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared: 0.8192, Adjusted R-squared: 0.8097
## F-statistic: 86.1 on 5 and 95 DF, p-value: < 2.2e-16
Suppose we want to test which of our two models is better when one model is nested within the other (i.e., the reduced model's terms are a subset of the full model's terms).
formula(myReg)
## prestige ~ education + income + women
formula(mod)
## prestige ~ education + income + women + type
We say that myReg is nested within mod, or equivalently that myReg is the reduced model and mod is the full model.
Hypothesis Test:
Null: Reduced model fits as well as the full model (formally, coefficients for the omitted terms are all zero).
Alternative: Full model fits better than the reduced model (formally, at least one omitted coefficient is non-zero).
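The F-statistic reported by anova() below compares the residual sums of squares (RSS) of the two fits: \[F = \frac{(RSS_{reduced} - RSS_{full}) / (df_{reduced} - df_{full})}{RSS_{full} / df_{full}}\] where \(df\) denotes residual degrees of freedom. Plugging in the output below, \(((6028.2 - 5394.6)/2) / (5394.6/95) \approx 5.58\), matching the reported statistic.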
anova(myReg, mod)
## Analysis of Variance Table
##
## Model 1: prestige ~ education + income + women
## Model 2: prestige ~ education + income + women + type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 97 6028.2
## 2 95 5394.6 2 633.6 5.5789 0.005119 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So we reject the null hypothesis and conclude that at least one of the levels of type of occupation contributes information about the prestige score.
Say we want to change the reference level for the covariate type so that the intercept of the model corresponds to professionals. We simply relevel the variable type and re-fit the model.
levels(prestige$type)
## [1] "bc" "prof" "wc"
prestige$type <- relevel(prestige$type, "prof")
levels(prestige$type)
## [1] "prof" "bc" "wc"
Note: This does not change the results of the model (e.g., predictions, F-statistics) but does change the interpretation and p-values of the type coefficients.
mod <- update(myReg, ~ . + type)
summary(mod)
##
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1126 -4.9844 0.0942 5.1318 19.4630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2880898 9.1659054 1.232 0.22116
## education 3.1246575 0.6644750 4.702 8.70e-06 ***
## income 0.0011764 0.0002702 4.355 3.36e-05 ***
## women 0.0050863 0.0308647 0.165 0.86946
## typebc -8.5502774 4.0724806 -2.100 0.03842 *
## typewc -9.6960553 2.9391152 -3.299 0.00137 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared: 0.8192, Adjusted R-squared: 0.8097
## F-statistic: 86.1 on 5 and 95 DF, p-value: < 2.2e-16
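As a quick sanity check of the note above (a sketch; modBC is an illustrative name), the fitted values agree regardless of the reference level:
modBC <- lm(prestige ~ education + income + women + relevel(type, "bc"),
            data = prestige)
all.equal(unname(fitted(mod)), unname(fitted(modBC))) # TRUE: identical fit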
Automated variable-selection algorithms exist, but they should be used with caution.
It is better to start with an a priori set of variables you need to include and then compare models, especially when the goal is association rather than prediction.
Stepwise regression: start with a model and add (or remove) predictors until some criterion is met.
null <- lm(prestige ~ 1, data=prestige) # most basic model, intercept only
full <- lm(prestige ~ education + income + women + type, data=prestige) # saturated model, all predictors
step(null, scope=list(lower=null, upper=full), direction='forward')
## Start: AIC=576.54
## prestige ~ 1
##
## Df Sum of Sq RSS AIC
## + education 1 21567.5 8274.6 448.99
## + type 2 20685.1 9157.0 461.22
## + income 1 15236.2 14605.9 506.38
## <none> 29842.1 576.54
## + women 1 400.9 29441.1 577.18
##
## Step: AIC=448.99
## prestige ~ education
##
## Df Sum of Sq RSS AIC
## + income 1 2241.78 6032.8 419.07
## + type 2 1428.45 6846.1 433.85
## + women 1 866.64 7407.9 439.81
## <none> 8274.6 448.99
##
## Step: AIC=419.07
## prestige ~ education + income
##
## Df Sum of Sq RSS AIC
## + type 2 636.63 5396.2 411.81
## <none> 6032.8 419.07
## + women 1 4.57 6028.2 421.00
##
## Step: AIC=411.81
## prestige ~ education + income + type
##
## Df Sum of Sq RSS AIC
## <none> 5396.2 411.81
## + women 1 1.5421 5394.6 413.78
##
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
##
## Coefficients:
## (Intercept) education income typebc typewc
## 11.520005 3.135291 0.001153 -8.646693 -9.655940
step(full, scope=list(lower=null, upper=full), direction='backward')
## Start: AIC=413.78
## prestige ~ education + income + women + type
##
## Df Sum of Sq RSS AIC
## - women 1 1.54 5396.2 411.81
## <none> 5394.6 413.78
## - type 2 633.60 6028.2 421.00
## - income 1 1076.80 6471.4 430.16
## - education 1 1255.70 6650.3 432.92
##
## Step: AIC=411.81
## prestige ~ education + income + type
##
## Df Sum of Sq RSS AIC
## <none> 5396.2 411.81
## - type 2 636.63 6032.8 419.07
## - education 1 1276.30 6672.5 431.25
## - income 1 1449.96 6846.1 433.85
##
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
##
## Coefficients:
## (Intercept) education income typebc typewc
## 11.520005 3.135291 0.001153 -8.646693 -9.655940
step(null, scope=list(lower=null, upper=full), direction='both')
## Start: AIC=576.54
## prestige ~ 1
##
## Df Sum of Sq RSS AIC
## + education 1 21567.5 8274.6 448.99
## + type 2 20685.1 9157.0 461.22
## + income 1 15236.2 14605.9 506.38
## <none> 29842.1 576.54
## + women 1 400.9 29441.1 577.18
##
## Step: AIC=448.99
## prestige ~ education
##
## Df Sum of Sq RSS AIC
## + income 1 2241.8 6032.8 419.07
## + type 2 1428.5 6846.1 433.85
## + women 1 866.6 7407.9 439.81
## <none> 8274.6 448.99
## - education 1 21567.5 29842.1 576.54
##
## Step: AIC=419.07
## prestige ~ education + income
##
## Df Sum of Sq RSS AIC
## + type 2 636.6 5396.2 411.81
## <none> 6032.8 419.07
## + women 1 4.6 6028.2 421.00
## - income 1 2241.8 8274.6 448.99
## - education 1 8573.1 14605.9 506.38
##
## Step: AIC=411.81
## prestige ~ education + income + type
##
## Df Sum of Sq RSS AIC
## <none> 5396.2 411.81
## + women 1 1.54 5394.6 413.78
## - type 2 636.63 6032.8 419.07
## - education 1 1276.30 6672.5 431.25
## - income 1 1449.96 6846.1 433.85
##
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
##
## Coefficients:
## (Intercept) education income typebc typewc
## 11.520005 3.135291 0.001153 -8.646693 -9.655940
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0)) # 2x2 grid for the four diagnostic plots
plot(myReg) # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
Predict the response for a new input
newData <- data.frame(education = 13.2, income = 12000, women = 12)
predict(myReg, newData, interval = "prediction") # prediction interval for a new observation
## fit lwr upr
## 1 64.09084 48.2459 79.93578
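To get an interval for the mean response at those predictor values (rather than for a single new observation), request a confidence interval instead:
predict(myReg, newData, interval = "confidence")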