The first set of exercises will deal with reading a dataset into R, exploring various structural and content-related features of the data, and manipulating the dataset so that it is in a form we can use later for analyses.
We will be using the Auto MPG Data Set, available on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Auto+MPG
The data consists of a collection of automobile records from 1970 to 1982 containing the following variables:
Miles per Gallon (mpg)
Number of Cylinders
Engine Displacement (in cubic inches)
Horsepower
Weight (in pounds)
Acceleration
Model Year
Origin: where the data originated from (ignore this)
Car Name
We will be focusing on the relationships between miles per gallon (mpg) and various other features of the car (such as model year, weight, number of cylinders, etc.).
A.1 Open a new R script to write and save your code for the exercises. Save this file in your local copy of the CSULB_Intro_R folder, e.g., CSULB_Intro_R/my_exercise_1.R.
A.2 Read the Auto MPG data into a data frame named auto from the following URL using read.table(): https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original
auto <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original")
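To see how read.table() handles this kind of file without downloading anything, here is a minimal sketch that reads two records supplied inline through the text argument. The values are taken from the head() output shown in A.5; the exact layout of the lines and the object names sample_lines and toy are assumptions made for illustration.

```r
# Two whitespace-separated records with a quoted car name (values from the
# head() output in A.5; the layout here is an illustrative assumption).
sample_lines <- c(
  '18 8 307 130 3504 12.0 70 1 "chevrolet chevelle malibu"',
  '15 8 350 165 3693 11.5 70 1 "buick skylark 320"'
)

# read.table() splits on whitespace by default and keeps quoted text together.
toy <- read.table(text = sample_lines)

dim(toy)     # 2 rows, 9 columns
names(toy)   # default names V1, ..., V9
```

Because the file has no header row, read.table() assigns the default names V1 through V9, which is why the columns are renamed in A.3.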
A.3 Rename the variables (columns) using the following conventions: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”, “origin”, “name”.
names(auto) <- c("mpg", "cyl", "disp", "hp", "weight", "acc", "model.yr", "origin", "name")
A.4 Convert cyl into a factor variable using factor(). Convert name into a character vector using as().
auto$cyl <- factor(auto$cyl)
auto$name <- as(auto$name, "character")
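It can help to see what a factor stores before moving on. The sketch below uses a small made-up vector (cyl_num is hypothetical, not part of the data set) to show the difference between a factor's labels and its internal codes.

```r
# A small numeric vector of cylinder counts (hypothetical values).
cyl_num <- c(8, 4, 6, 4, 8)

cyl_fac <- factor(cyl_num)
levels(cyl_fac)        # "4" "6" "8": the distinct values, stored as labels
as.integer(cyl_fac)    # 3 1 2 1 3: internal level codes, not cylinder counts

# To recover the original numbers, convert to character first.
as.numeric(as.character(cyl_fac))   # 8 4 6 4 8

# as(x, "character") and as.character(x) can be compared directly.
all(as(cyl_num, "character") == as.character(cyl_num))
```

The practical consequence is that as.numeric() applied directly to a factor returns the level codes, so a factor such as cyl should not be treated as a number later on without converting it back.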
A.5 Use the head() function to look at the first few rows of the data and make sure it looks like it was correctly loaded. You can compare the output here to the raw data by opening the URL in A.2. Without looking at the help file, what does the argument n do?
head(auto)
## mpg cyl disp hp weight acc model.yr origin name
## 1 18 8 307 130 3504 12.0 70 1 chevrolet chevelle malibu
## 2 15 8 350 165 3693 11.5 70 1 buick skylark 320
## 3 18 8 318 150 3436 11.0 70 1 plymouth satellite
## 4 16 8 304 150 3433 12.0 70 1 amc rebel sst
## 5 17 8 302 140 3449 10.5 70 1 ford torino
## 6 15 8 429 198 4341 10.0 70 1 ford galaxie 500
head(auto, n=10)
## mpg cyl disp hp weight acc model.yr origin name
## 1 18 8 307 130 3504 12.0 70 1 chevrolet chevelle malibu
## 2 15 8 350 165 3693 11.5 70 1 buick skylark 320
## 3 18 8 318 150 3436 11.0 70 1 plymouth satellite
## 4 16 8 304 150 3433 12.0 70 1 amc rebel sst
## 5 17 8 302 140 3449 10.5 70 1 ford torino
## 6 15 8 429 198 4341 10.0 70 1 ford galaxie 500
## 7 14 8 454 220 4354 9.0 70 1 chevrolet impala
## 8 14 8 440 215 4312 8.5 70 1 plymouth fury iii
## 9 14 8 455 225 4425 10.0 70 1 pontiac catalina
## 10 15 8 390 190 3850 8.5 70 1 amc ambassador dpl
B.1 Locate the observations with diesel engines using the grep() function. The following command returns the row indices of all cars whose name contains “diesel”.
diesel.index <- grep("diesel", auto$name)
diesel.index
## [1] 252 333 334 335 367 369 396
B.2 Create a new variable (column) in the auto data frame called diesel such that auto$diesel is 1 if the car has a diesel engine and 0 otherwise.
auto$diesel <- 0 # creates new column of all 0s
auto$diesel[diesel.index] <- 1 # assigns 1s to all autos with diesel in the name
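As an alternative sketch, grepl() returns one TRUE/FALSE value per element instead of a vector of positions, so the indicator can be built in a single step. The vector toy_names below is made up for illustration and is not taken from the data set.

```r
# Hypothetical car names for illustration.
toy_names <- c("car a", "car b (diesel)", "car c", "car d diesel")

grep("diesel", toy_names)    # positions of the matches: 2 4
grepl("diesel", toy_names)   # FALSE TRUE FALSE TRUE

# Convert the logical vector to a 0/1 indicator.
toy_diesel <- as.integer(grepl("diesel", toy_names))
toy_diesel                   # 0 1 0 1
```

Both approaches give the same indicator; the two-step version with grep() makes the matched row numbers visible, which is handy for checking them by eye.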
B.3 Coerce auto$diesel into a factor variable using as.factor().
auto$diesel <- as.factor(auto$diesel)
B.4 Look at the structure of the auto data frame using str() to make sure that this was done correctly.
str(auto)
## 'data.frame': 406 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cyl : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ disp : num 307 350 318 304 302 429 454 440 455 390 ...
## $ hp : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acc : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model.yr: num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## $ diesel : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
B.5 Save your data set as an R data (.Rda) file in the data directory (i.e., "CSULB_Intro_R/data/auto_mpg_v2.Rda") using the save() function.
save(auto, file=here::here("data", "auto_mpg_v2.Rda"))
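A quick way to convince yourself that save() worked is to load the file back and compare. The following is a minimal sketch using a temporary file and a made-up data frame (toy, original, and rda_path are hypothetical names), so it does not touch your project folder.

```r
# A tiny made-up data frame standing in for auto.
toy <- data.frame(mpg = c(18, 15), name = c("car a", "car b"))

rda_path <- tempfile(fileext = ".Rda")
save(toy, file = rda_path)

# load() restores objects under their original names, so keep a copy and
# remove toy first to show that it really comes back from the file.
original <- toy
rm(toy)
loaded_names <- load(rda_path)   # returns the names of the restored objects
loaded_names                     # "toy"
isTRUE(all.equal(toy, original))
```

Note that load() does not need an assignment: the object reappears in the workspace under the name it had when it was saved.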
See ?save for details. Data are often saved as a .csv or .txt file so that the data can be read easily by other programs. However, since we are only dealing with R, we will save the data as an .Rda file.
C.1 Using the summary() function, look at descriptive statistics for the Auto MPG data. What do you notice? Jot down or comment in your code some interesting features of the data.
summary(auto)
## mpg cyl disp hp weight
## Min. : 9.00 3: 4 Min. : 68.0 Min. : 46.00 Min. :1613
## 1st Qu.:17.50 4:207 1st Qu.:105.0 1st Qu.: 75.75 1st Qu.:2226
## Median :23.00 5: 3 Median :151.0 Median : 95.00 Median :2822
## Mean :23.51 6: 84 Mean :194.8 Mean :105.08 Mean :2979
## 3rd Qu.:29.00 8:108 3rd Qu.:302.0 3rd Qu.:130.00 3rd Qu.:3618
## Max. :46.60 Max. :455.0 Max. :230.00 Max. :5140
## NA's :8 NA's :6
## acc model.yr origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:406
## 1st Qu.:13.70 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.52 Mean :75.92 Mean :1.569
## 3rd Qu.:17.18 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
##
## diesel
## 0:399
## 1: 7
##
##
##
##
##
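One feature worth noticing in the summary is the NA's rows for mpg and hp. As a small sketch of how missing values can be counted and how they affect calculations, the example below uses a made-up data frame (toy is hypothetical).

```r
# A made-up data frame with a few missing values.
toy <- data.frame(
  mpg = c(18, NA, 16, 17),
  hp  = c(130, 165, NA, NA)
)

colSums(is.na(toy))           # missing values per column: mpg 1, hp 2
mean(toy$mpg)                 # NA, because one value is missing
mean(toy$mpg, na.rm = TRUE)   # 17
```

Many summary functions return NA when any input is missing, which is why na.rm = TRUE appears in the plotting exercises.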
D.1 We will be interested in predicting/estimating a car’s MPG. Plot a relative frequency histogram (y-axis has proportions, not counts) of the response variable, MPG. Color the boxes with a color of your choosing. Make sure to name the plot and axes (main=..., xlab=...).
hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
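A note on what freq=FALSE plots: the bar heights are densities, scaled so that the bar areas sum to 1. They match the per-bin proportions only when the bins have width 1. The sketch below checks this on a made-up vector (x and the breaks are hypothetical choices) without drawing anything.

```r
x <- c(9, 14, 15, 18, 18, 22, 23, 27, 31, 44)   # hypothetical values

# Compute the histogram without plotting it.
h <- hist(x, breaks = seq(0, 50, by = 10), plot = FALSE)

h$counts                          # 1 4 3 1 1
h$density                         # counts / (n * bin width)
sum(h$density * diff(h$breaks))   # bar areas sum to 1
h$counts / sum(h$counts)          # proportion of observations in each bin
```

Using densities on the y-axis is what lets a density curve be overlaid on the same scale.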
D.2 Add a density curve to the histogram you plotted in D.1 using the lines() and density() functions. Color it red with transparency set to 0.5 using the col argument and the adjustcolor() function. In the density() function, you will need to add the argument na.rm=TRUE to remove missing values.
hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
lines(density(auto$mpg, na.rm=TRUE), col=adjustcolor("red", .5))
D.3 Add a vertical line to the plot from D.2 at the median of MPG using abline(). Within your median() call, you will need to use na.rm=TRUE again. In the abline() call, use the argument lty=... and set it equal to a number of your choice. What happens with different values for this argument?
hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
lines(density(auto$mpg, na.rm=TRUE), col=adjustcolor("red", .5))
abline(v = median(auto$mpg, na.rm=TRUE), col = "red", lwd = 2, lty=2)
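To compare line types side by side, one option is to draw one line per lty value on an empty plot. This is a minimal sketch; the plot title and axis limits are arbitrary choices.

```r
# Set up an empty plotting region.
plot(NULL, xlim = c(0, 1), ylim = c(0.5, 6.5),
     main = "Line types 1 to 6", xlab = "", ylab = "lty")

# Draw one horizontal line per line type at the height of its number.
for (k in 1:6) {
  abline(h = k, lty = k, lwd = 2)
}
```

In base R graphics, values 1 through 6 correspond to solid, dashed, dotted, dotdash, longdash, and twodash lines.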
E.1 How many unique cylinder values exist in this data set?
# Several options for getting this answer.
summary(auto)
## mpg cyl disp hp weight
## Min. : 9.00 3: 4 Min. : 68.0 Min. : 46.00 Min. :1613
## 1st Qu.:17.50 4:207 1st Qu.:105.0 1st Qu.: 75.75 1st Qu.:2226
## Median :23.00 5: 3 Median :151.0 Median : 95.00 Median :2822
## Mean :23.51 6: 84 Mean :194.8 Mean :105.08 Mean :2979
## 3rd Qu.:29.00 8:108 3rd Qu.:302.0 3rd Qu.:130.00 3rd Qu.:3618
## Max. :46.60 Max. :455.0 Max. :230.00 Max. :5140
## NA's :8 NA's :6
## acc model.yr origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 Length:406
## 1st Qu.:13.70 1st Qu.:73.00 1st Qu.:1.000 Class :character
## Median :15.50 Median :76.00 Median :1.000 Mode :character
## Mean :15.52 Mean :75.92 Mean :1.569
## 3rd Qu.:17.18 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :24.80 Max. :82.00 Max. :3.000
##
## diesel
## 0:399
## 1: 7
##
##
##
##
##
# or
summary(auto$cyl)
## 3 4 5 6 8
## 4 207 3 84 108
# or
unique(auto$cyl)
## [1] 8 4 6 3 5
## Levels: 3 4 5 6 8
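The options above list the distinct values; to get the count itself, length(unique()) or nlevels() can be used. The sketch below uses a made-up factor (cyl_fac and sub are hypothetical) and shows one case where the two disagree.

```r
cyl_fac <- factor(c(8, 4, 6, 4, 8, 3))   # hypothetical values

length(unique(cyl_fac))   # 4
nlevels(cyl_fac)          # 4

# After subsetting, unused levels are kept, so the two counts can differ.
sub <- cyl_fac[cyl_fac != "3"]
length(unique(sub))       # 3
nlevels(sub)              # 4
nlevels(droplevels(sub))  # 3
```

For the full auto data frame the two approaches agree, since every level of cyl occurs at least once.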
E.2 Run the following code to create a table of counts for each cylinder class:
cylCounts <- table(auto$cyl)
E.3 Create a boxplot of MPG grouped by the number of cylinders with boxplot(mpg ~ cyl, data=auto,...). Using the ifelse() function, color the boxes red if the number of data points in that class is below 5, and blue otherwise. Make sure to name the plot and axes.
boxplot(mpg ~ cyl, data = auto, col = ifelse(cylCounts < 5, 'red', 'blue'),
main = "Distribution of MPG by Number of Cylinders",
xlab = "Cylinders", ylab = "MPG")
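To see what the ifelse() call hands to col, here is a small sketch on made-up groups (groups, counts, and box_cols are hypothetical names). ifelse() works element by element, so it returns one color per group.

```r
# Made-up group labels: 6 in "a", 2 in "b", 5 in "c".
groups <- rep(c("a", "b", "c"), times = c(6, 2, 5))
counts <- table(groups)

# One color per group: red if fewer than 5 observations, blue otherwise.
box_cols <- ifelse(counts < 5, "red", "blue")
as.vector(box_cols)   # "blue" "red" "blue"
```

The colors line up with the boxes because table() and boxplot() both order the groups by factor level.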
Create a scatterplot matrix of the Auto MPG data. Include the following variables: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”. What relationships do you see? Which variables have a strong relationship with each other? Which have a strong relationship with MPG? Which variables would you include in trying to predict a car’s MPG rating?
library(car)
scatterplotMatrix(auto[, c("mpg", "cyl", "disp", "hp", "weight", "acc", "model.yr")])
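A correlation matrix can back up what you see in the scatterplots with numbers. The sketch below uses R's built-in mtcars data rather than the Auto MPG data, so that it runs on its own; the choice of columns is an assumption for illustration. It also shows base R's pairs() as an alternative that does not require the car package.

```r
# Built-in example data (not the Auto MPG data set).
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pairwise correlations, rounded for readability.
cor_mat <- round(cor(vars), 2)
cor_mat

# Base R scatterplot matrix of the same columns.
pairs(vars, main = "Scatterplot matrix (mtcars)")
```

For a data frame with missing values, such as auto, cor() needs an extra argument like use = "complete.obs"; otherwise any pair involving a column with NAs comes back as NA.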