April 27, 2018
On January 28, 1986, the USA Space Shuttle Challenger exploded 73 seconds into flight, killing all 7 crew members. The explosion was traced to failure of O-ring seals in the solid rocket booster at liftoff.
It was hypothesized that there is a inverse relationship between O-ring failure and temperature.
The night before the flight, engineers had to make a decision whether or not to lauch Challenger the next day with a forecasted temperature of 31 degrees Fahrenheit.
The engineers had data available on 23 previous launches to aid in their decision with the following varaibles
More information about the O-ring data is available here.
oring <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/space-shuttle/o-ring-erosion-only.data")
names(oring) <- c("n_risk", "n_fail", "temp", "psi", "order")
head(oring)
## n_risk n_fail temp psi order
## 1 6 0 66 50 1
## 2 6 1 70 50 2
## 3 6 0 69 50 3
## 4 6 0 68 50 4
## 5 6 0 67 50 5
## 6 6 0 72 50 6
summary(oring)
## n_risk n_fail temp psi
## Min. :6 Min. :0.0000 Min. :53.00 Min. : 50.0
## 1st Qu.:6 1st Qu.:0.0000 1st Qu.:67.00 1st Qu.: 75.0
## Median :6 Median :0.0000 Median :70.00 Median :200.0
## Mean :6 Mean :0.3043 Mean :69.57 Mean :152.2
## 3rd Qu.:6 3rd Qu.:0.5000 3rd Qu.:75.00 3rd Qu.:200.0
## Max. :6 Max. :2.0000 Max. :81.00 Max. :200.0
## order
## Min. : 1.0
## 1st Qu.: 6.5
## Median :12.0
## Mean :12.0
## 3rd Qu.:17.5
## Max. :23.0
oring <- oring[order(oring$temp), ] # sort by ascending temp
# some flights have the same number of failures & launch temp, so make point size vary
pt.size <- rep(1, nrow(oring))
dups <- oring[duplicated(cbind(oring$temp, oring$n_fail)), c("n_fail", "temp")]
pt.size[as.numeric(rownames(dups))] <- 2
trips <- dups[duplicated(dups), ]
pt.size[as.numeric(rownames(trips))] <- 3
plot(oring$temp, oring$n_fail, pch = 20, cex = pt.size,
xlab = "Temperature (deg F)", ylab = "Number of Failures")
## Fit a linear model
linearReg <- lm(n_fail ~ temp, data = oring)
summary(linearReg)
##
## Call:
## lm(formula = n_fail ~ temp, data = oring)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50921 -0.30810 0.00794 0.20905 0.74381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.30159 0.83110 5.176 3.96e-05 ***
## temp -0.05746 0.01189 -4.833 8.90e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3935 on 21 degrees of freedom
## Multiple R-squared: 0.5266, Adjusted R-squared: 0.5041
## F-statistic: 23.36 on 1 and 21 DF, p-value: 8.895e-05
## Predict the output for the Challenger launch day
launchDay <- data.frame(temp = 31)
predict(linearReg, launchDay, interval="predict") # point estimate & 95% CI
## fit lwr upr
## 1 2.520317 1.252255 3.78838
## Plot the fit extrapolating to launch day
plot(oring$temp, oring$n_fail, pch = 20, cex = pt.size/2,
xlab = "Temperature (deg F)", ylab = "Num O-Ring Failures",
xlim = c(30, 85), ylim = c(0, 5))
abline(reg = linearReg, col = "red", lwd = 2)
points(x=31, y=2.52, pch=4, col="black", lwd=2)
x <- seq(25,90,.1)
pred <- predict(linearReg, data.frame(temp=x), interval="predict")
lines(x, pred[,2], lty=2, col="red")
lines(x, pred[,3], lty=2, col="red")
Linear regression clearly isn’t the proper technique for this type of data, but GLMs can overcome these issues.
Flexible extension of linear regression that allows response variable to have non-Gaussian error distributions.
Instead of modeling the mean μ of the response variable directly as a linear combination of the predictors, we model a link function g(μ). g[E(Y)]=Xβ⇒E(Y)=g−1(Xβ)
Y must have a distribution from the exponential family, which includes the Normal, binomial, Poisson, gamma, and many others.
Choice of link function g depends on the distributional form of Y. There can be multiple link functions for any singe distribution.
Specific GLM when dealing with binary response data Yi∈{0,1} Yi∼Bernoulli(pi) ⇒f(Y|p)=n∏i=1[pyii(1−pi)1−yi] where pi∈[0,1] is the probability that Yi=1.
Link function must map values in (−∞,∞) to the range [0,1]
Logit link function: g(p)=log(p1−p)=Xβ
Inverse of logit function is the logistic function: p=g−1(Xβ)=eXβeXβ+1=11+e−Xβ
glm()
is used to fit generalized linear models.
The formula
argument is equivalent to that of lm()
used for linear regression
New argument: family
family = binomial(link="logit")
Summary()
, update()
, predict()
, variable selection methods, etc. all work the same
## Reformat the response variable
oring$fail <- as.numeric(oring$n_fail > 0)
head(oring)
## n_risk n_fail temp psi order fail
## 14 6 2 53 200 14 1
## 9 6 1 57 200 9 1
## 23 6 1 58 200 23 1
## 10 6 1 63 200 10 1
## 1 6 0 66 50 1 0
## 5 6 0 67 50 5 0
plot(oring$temp, oring$fail, pch = 20, cex = pt.size,
xlab = "Temperature (deg F)", ylab = "O-Ring Failure")
## Fit a GLM with logistic link
logisticReg <- glm(fail ~ temp, data = oring, family=binomial(link="logit"))
summary(logisticReg)
##
## Call:
## glm(formula = fail ~ temp, family = binomial(link = "logit"),
## data = oring)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.00214 -0.57815 -0.21802 0.04401 2.01706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 23.7750 11.8204 2.011 0.0443 *
## temp -0.3667 0.1752 -2.093 0.0363 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26.402 on 22 degrees of freedom
## Residual deviance: 14.426 on 21 degrees of freedom
## AIC: 18.426
##
## Number of Fisher Scoring iterations: 6
names(logisticReg)
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
plot(oring$temp, oring$fail, pch = 20, cex = pt.size,
xlab = "Temperature (deg F)", ylab = "O-Ring Failure")
curve(predict(logisticReg, data.frame(temp=x), type="response"), add=TRUE, col="red", lwd=2)
To get probabilities, be sure to use type = "response"
.
launchDay
## temp
## 1 31
predict(logisticReg, launchDay, type="response")
## 1
## 0.9999959
predict()
simply computes the linear combination & evaulates the inverse link
XB <- as.numeric(coef(logisticReg)[1] + coef(logisticReg)[2]*31)
XB # estimated log-odds of o-ring failure on launch day
## [1] 12.40723
exp(XB)/(1 + exp(XB)) # estimated probability of o-ring failure on launch day
## [1] 0.9999959
exp(β) is the multiplicative change in the odds of O-ring failure for a one degree increase in temperature.
95% CI for exp(β) (profile likelihood method)
exp(confint(logisticReg, "temp", level=0.95))
## 2.5 % 97.5 %
## 0.4159216 0.8853986
exp(confint.default(logisticReg, "temp", level=0.95))
## 2.5 % 97.5 %
## temp 0.4916331 0.9768919
logisticReg2 = update(logisticReg, . ~ . + psi, data=oring)
summary(logisticReg2)
##
## Call:
## glm(formula = fail ~ temp + psi, family = binomial(link = "logit"),
## data = oring)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.00837 -0.54507 -0.28098 0.02512 2.21488
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 21.843631 11.936459 1.830 0.0673 .
## temp -0.350098 0.172977 -2.024 0.0430 *
## psi 0.006007 0.009749 0.616 0.5378
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26.402 on 22 degrees of freedom
## Residual deviance: 14.033 on 20 degrees of freedom
## AIC: 20.033
##
## Number of Fisher Scoring iterations: 6
anova(logisticReg, logisticReg2, test='LRT')
## Analysis of Deviance Table
##
## Model 1: fail ~ temp
## Model 2: fail ~ temp + psi
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 21 14.426
## 2 20 14.033 1 0.3929 0.5308
Make sure to submit your course evaluation survey!
Space, Right Arrow or swipe left to move to next slide, click help below for more details