Introduction

The first set of exercises will deal with reading a dataset into R, exploring various structural and content-related features of the data, and manipulating the dataset so that it is in a form we can use later for analyses.

We will be using the Auto MPG Data Set, available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Auto+MPG

The data consists of a collection of automobile records from 1970 to 1982 containing the following variables:

  1. Miles per Gallon (mpg)

  2. Number of Cylinders

  3. Engine Displacement (in cubic inches)

  4. Horsepower

  5. Weight (in pounds)

  6. Acceleration

  7. Model Year

  8. Origin: region where the car originated (ignore this)

  9. Car Name

We will be focusing on the relationships between miles per gallon (mpg) and various other features of the car (such as model year, weight, number of cylinders, etc.).

Part A - Data Input

A.1 Open a new R script to write and save your code for the exercises. Save this file in your local copy of the CSULB_Intro_R folder, e.g., CSULB_Intro_R/my_exercise_1.R.

A.2 Read the Auto MPG data into a data frame named auto from the following URL using read.table(): https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original

auto <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original")
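To see what read.table() does with this kind of file without needing a network connection, here is a minimal sketch on a few toy lines. It assumes the raw file is whitespace-delimited with the car name in double quotes and missing values written as NA; the second row and the object names raw and toy are made up for illustration.

```r
# Toy stand-in for a few lines of a whitespace-delimited file (not the real data).
# The second row is hypothetical and has a missing mpg value.
raw <- '18.0 8. 307.0 130.0 3504. 12.0 70. 1. "chevrolet chevelle malibu"
NA 4. 120.0 90.0 2500. 16.0 70. 2. "hypothetical car"
15.0 8. 350.0 165.0 3693. 11.5 70. 1. "buick skylark 320"'

# text= lets read.table() parse a string instead of a file or URL
toy <- read.table(text = raw, stringsAsFactors = FALSE)

dim(toy)        # 3 rows and 9 columns
names(toy)      # default column names V1, V2, ..., V9
is.na(toy$V1)   # the literal NA is read as a missing value
class(toy$V9)   # quoted names with spaces stay in one character column
```

Because no header is supplied, the columns get the default names V1 to V9, which is why the renaming step in A.3 is needed.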

A.3 Rename the variables (columns) using the following conventions: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”, “origin”, “name”.

names(auto) <- c("mpg", "cyl", "disp", "hp", "weight", "acc", "model.yr", "origin", "name")

A.4 Convert cyl into a factor variable using factor(). Convert name into a character vector using as().

auto$cyl <- factor(auto$cyl)
auto$name <- as(auto$name, "character")

A.5 Use the head() function to look at the first few rows of the data and make sure it was loaded correctly. You can compare the output here to the raw data by opening the URL in A.2. Without looking at the help file, what does the argument n do?

head(auto)
##   mpg cyl disp  hp weight  acc model.yr origin                      name
## 1  18   8  307 130   3504 12.0       70      1 chevrolet chevelle malibu
## 2  15   8  350 165   3693 11.5       70      1         buick skylark 320
## 3  18   8  318 150   3436 11.0       70      1        plymouth satellite
## 4  16   8  304 150   3433 12.0       70      1             amc rebel sst
## 5  17   8  302 140   3449 10.5       70      1               ford torino
## 6  15   8  429 198   4341 10.0       70      1          ford galaxie 500
head(auto, n=10)
##    mpg cyl disp  hp weight  acc model.yr origin                      name
## 1   18   8  307 130   3504 12.0       70      1 chevrolet chevelle malibu
## 2   15   8  350 165   3693 11.5       70      1         buick skylark 320
## 3   18   8  318 150   3436 11.0       70      1        plymouth satellite
## 4   16   8  304 150   3433 12.0       70      1             amc rebel sst
## 5   17   8  302 140   3449 10.5       70      1               ford torino
## 6   15   8  429 198   4341 10.0       70      1          ford galaxie 500
## 7   14   8  454 220   4354  9.0       70      1          chevrolet impala
## 8   14   8  440 215   4312  8.5       70      1         plymouth fury iii
## 9   14   8  455 225   4425 10.0       70      1          pontiac catalina
## 10  15   8  390 190   3850  8.5       70      1        amc ambassador dpl

Part B - String Manipulation

B.1 Locate the observations with diesel engines using the grep() function. The following command searches every car name for the string “diesel” and returns the row positions of the matches.

diesel.index <- grep("diesel", auto$name)
diesel.index
## [1] 252 333 334 335 367 369 396

B.2 Create a new variable (column) in the auto data frame called diesel such that auto$diesel = 1 if the car has a diesel engine and 0 otherwise.

auto$diesel <- 0 # creates new column of all 0s
auto$diesel[diesel.index] <- 1 # assigns 1s to all autos with diesel in the name
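As an alternative to the two-step approach above, grepl() returns a logical vector that can be converted to 0/1 in one step. This is only a sketch on a toy data frame; the car names and the object name toy are hypothetical.

```r
# Toy data frame with made-up names (not the Auto MPG data)
toy <- data.frame(name = c("car a", "car b (diesel)", "car c", "car d diesel"),
                  stringsAsFactors = FALSE)

grep("diesel", toy$name)    # positions of the matches: 2 4
grepl("diesel", toy$name)   # one TRUE/FALSE per row

# TRUE/FALSE converts to 1/0, so the indicator can be built directly
toy$diesel <- as.integer(grepl("diesel", toy$name))
toy$diesel                  # 0 1 0 1
```

Both grep() and grepl() are case sensitive by default; ignore.case = TRUE would also match "Diesel".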

B.3 Coerce auto$diesel into a factor variable using as.factor().

auto$diesel <- as.factor(auto$diesel)

B.4 Look at the structure of the auto data frame using str() to make sure that this was done correctly.

str(auto)
## 'data.frame':    406 obs. of  10 variables:
##  $ mpg     : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cyl     : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ disp    : num  307 350 318 304 302 429 454 440 455 390 ...
##  $ hp      : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight  : num  3504 3693 3436 3433 3449 ...
##  $ acc     : num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model.yr: num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  $ diesel  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
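The str() output above shows cyl as 5 5 5 ... even though the first cars have 8 cylinders. That is because a factor stores integer codes that point into its levels. A small sketch on a toy vector (the values are chosen for illustration):

```r
# Toy vector of cylinder counts (not the full data set)
cyl <- factor(c(8, 4, 6, 3, 5, 8))

levels(cyl)                     # "3" "4" "5" "6" "8"
as.integer(cyl)                 # internal codes: 5 2 4 1 3 5
as.numeric(as.character(cyl))   # original values: 8 4 6 3 5 8
```

Calling as.integer() or as.numeric() directly on a factor returns the codes, not the labels, so go through as.character() first when the numeric values are needed.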

B.5 Save your data set as an R data (.Rda) file in the data directory (i.e., "CSULB_Intro_R/data/auto_mpg_v2.Rda") using the save() function.

save(auto, file=here::here("data", "auto_mpg_v2.Rda"))
?save
  • In general, it is better practice to save as a .csv or .txt file so that the data can be read easily by other programs. However, since we are only dealing with R, we will save the data as an .Rda file.
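The round trip for both file types can be sketched as follows. The example writes to temporary files so that it does not touch the course folder; the object names toy, rda.path, csv.path and from.csv are made up for illustration.

```r
# Toy data frame (not the Auto MPG data)
toy <- data.frame(mpg = c(18, 15), name = c("a", "b"), stringsAsFactors = FALSE)

# .Rda: save() stores the object together with its name
rda.path <- tempfile(fileext = ".Rda")
save(toy, file = rda.path)
rm(toy)
restored <- load(rda.path)   # load() returns the names of the restored objects
restored                     # "toy"
exists("toy")                # TRUE: the object is back under its original name

# .csv: plain text that other programs can read as well
csv.path <- tempfile(fileext = ".csv")
write.csv(toy, csv.path, row.names = FALSE)
from.csv <- read.csv(csv.path, stringsAsFactors = FALSE)
```

Note that a .csv file does not store R-specific types such as factors, so those conversions would need to be repeated after reading the file back in.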

Part C

C.1 Using the summary() function, look at descriptive statistics for the Auto MPG data. What do you notice? Jot down or comment in your code some interesting features of the data.

summary(auto)
##       mpg        cyl          disp             hp             weight    
##  Min.   : 9.00   3:  4   Min.   : 68.0   Min.   : 46.00   Min.   :1613  
##  1st Qu.:17.50   4:207   1st Qu.:105.0   1st Qu.: 75.75   1st Qu.:2226  
##  Median :23.00   5:  3   Median :151.0   Median : 95.00   Median :2822  
##  Mean   :23.51   6: 84   Mean   :194.8   Mean   :105.08   Mean   :2979  
##  3rd Qu.:29.00   8:108   3rd Qu.:302.0   3rd Qu.:130.00   3rd Qu.:3618  
##  Max.   :46.60           Max.   :455.0   Max.   :230.00   Max.   :5140  
##  NA's   :8                               NA's   :6                      
##       acc           model.yr         origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:406        
##  1st Qu.:13.70   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.52   Mean   :75.92   Mean   :1.569                     
##  3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000                     
##                                                                    
##  diesel 
##  0:399  
##  1:  7  
##         
##         
##         
##         
## 
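One feature worth noting in the summary is the NA's row for mpg and hp. A minimal sketch of how missing values can be counted and how they affect calculations, using a toy data frame with made-up values:

```r
# Toy data frame with some missing values (not the Auto MPG data)
toy <- data.frame(mpg    = c(18, NA, 15, NA),
                  hp     = c(130, 165, NA, 150),
                  weight = c(3504, 3693, 3436, 3433))

colSums(is.na(toy))            # missing values per column: mpg 2, hp 1, weight 0
mean(toy$mpg)                  # NA, because missing values propagate
mean(toy$mpg, na.rm = TRUE)    # 16.5
sum(complete.cases(toy))       # 1 row has no missing values
```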

Part D

D.1 We will be interested in predicting/estimating a car’s MPG. Plot a relative frequency histogram of the response variable, MPG (the y-axis shows densities rather than counts, so the bar areas sum to 1). Color the bars with a color of your choosing. Make sure to name the plot and axes (main=..., xlab=...).

hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
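To check what freq=FALSE puts on the y-axis, the histogram object can be inspected without drawing it. This sketch uses simulated values rather than the Auto MPG data; the names x and h are arbitrary.

```r
set.seed(1)
x <- rnorm(200, mean = 23, sd = 8)   # simulated values, not the Auto MPG data

h <- hist(x, plot = FALSE)           # compute the histogram without plotting

# With freq=FALSE the bar heights are densities: height times bin width is
# the proportion of observations in the bin, so the areas add up to 1.
sum(h$density * diff(h$breaks))

# The counts are what freq=TRUE would plot; they add up to the sample size.
sum(h$counts) == length(x)
```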

D.2 Add a density curve to the histogram you plotted in D.1 using the lines() and density() functions. Color it red with transparency set to 0.5 using the col argument and adjustcolor() function. In the density() function, you will need to add the argument na.rm=TRUE to remove missing values.

hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
lines(density(auto$mpg, na.rm=TRUE), col=adjustcolor("red", .5))

D.3 Add a vertical line at the median of MPG to the plot from D.2 using abline(). Within your median() call, you will need to use na.rm=TRUE again. In the abline() call, set the argument lty=... to a number of your choice. What happens with different values for this argument?

hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
lines(density(auto$mpg, na.rm=TRUE), col=adjustcolor("red", .5))
abline(v = median(auto$mpg, na.rm=TRUE), col = "red", lwd = 2, lty=2)

Part E

E.1 How many unique cylinder values exist in this data set?

# Several options for getting this answer. 
summary(auto)
##       mpg        cyl          disp             hp             weight    
##  Min.   : 9.00   3:  4   Min.   : 68.0   Min.   : 46.00   Min.   :1613  
##  1st Qu.:17.50   4:207   1st Qu.:105.0   1st Qu.: 75.75   1st Qu.:2226  
##  Median :23.00   5:  3   Median :151.0   Median : 95.00   Median :2822  
##  Mean   :23.51   6: 84   Mean   :194.8   Mean   :105.08   Mean   :2979  
##  3rd Qu.:29.00   8:108   3rd Qu.:302.0   3rd Qu.:130.00   3rd Qu.:3618  
##  Max.   :46.60           Max.   :455.0   Max.   :230.00   Max.   :5140  
##  NA's   :8                               NA's   :6                      
##       acc           model.yr         origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:406        
##  1st Qu.:13.70   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.52   Mean   :75.92   Mean   :1.569                     
##  3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000                     
##                                                                    
##  diesel 
##  0:399  
##  1:  7  
##         
##         
##         
##         
## 
# or 
summary(auto$cyl)
##   3   4   5   6   8 
##   4 207   3  84 108
# or
unique(auto$cyl)
## [1] 8 4 6 3 5
## Levels: 3 4 5 6 8

E.2 Run the following code to create a table of counts for each cylinder class:

cylCounts <- table(auto$cyl)

E.3 Create a boxplot of MPG grouped by the number of cylinders with boxplot(mpg ~ cyl, data=auto,...). Using the ifelse() function, color the boxes red if the number of data points in that class is below 5, and blue otherwise. Make sure to name the plot and axes.

boxplot(mpg ~ cyl, data = auto, col = ifelse(cylCounts < 5, 'red', 'blue'),
        main = "Distribution of MPG by Number of Cylinders",
        xlab = "Cylinders", ylab = "MPG")
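The coloring works because table() and boxplot() both follow the order of the factor levels, so ifelse() yields one color per box. A sketch on a toy factor (the values and the names counts and box.cols are made up):

```r
# Toy factor with 5, 2 and 6 observations per level (not the Auto MPG data)
cyl <- factor(c(4, 4, 4, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8))

counts <- table(cyl)       # counts in level order: 4 -> 5, 6 -> 2, 8 -> 6
counts < 5                 # FALSE TRUE FALSE

# ifelse() is vectorized: one color per level, in the same order as the boxes
box.cols <- ifelse(counts < 5, "red", "blue")
as.vector(box.cols)        # "blue" "red" "blue"
```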

Part F

F.1 Create a scatterplot matrix of the Auto MPG data. Include the following variables: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”. What relationships do you see? Which variables have a strong relationship with each other? Which have a strong relationship with MPG? Which variables would you include in trying to predict a car’s MPG rating?

library(car)
scatterplotMatrix( auto[, c("mpg", "cyl", "disp", "hp", "weight", "acc", "model.yr")] )
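A correlation matrix can serve as a numeric companion to the scatterplot matrix. The sketch below is not part of the exercise solution: it uses simulated data in which mpg is constructed to decrease with weight, and all names and coefficients are made up for illustration.

```r
set.seed(42)
weight <- runif(50, min = 1600, max = 5000)            # simulated weights
toy <- data.frame(weight = weight,
                  mpg    = 45 - 0.007 * weight + rnorm(50, sd = 2),
                  acc    = rnorm(50, mean = 15, sd = 2.5))
toy$mpg[c(3, 10)] <- NA                                # mimic missing mpg values

# use= controls how missing values are handled; with the default
# (use = "everything") any pair involving mpg would come back as NA.
cor.mat <- cor(toy, use = "pairwise.complete.obs")
round(cor.mat, 2)
```

Correlation only measures linear association, so curved relationships that are visible in the scatterplots can still be understated here.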