Exploratory Data Analysis

CSULB Intro to R

April 27, 2018

Agenda

  1. Brief Review

  2. Exploratory Data Analysis

    • Numeric Summary Statistics
    • Histograms
    • Boxplots
    • Scatterplot Matrices

Vector and Matrix Review

Data Frame Review

Function Review

Reading Data

Writing Data

Some Useful Functions

str()

str() gives a compact summary of what an object is and what it contains.

str(mean)
## function (x, ...)
str(matrix)
## function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
str(sample)
## function (x, size, replace = FALSE, prob = NULL)

str(), contd.

After loading a data frame, it is often useful to use str() in order to understand the structure of your data.

str(myDF)
## 'data.frame':    3 obs. of  3 variables:
##  $ names : Factor w/ 3 levels "Albert","Bianca",..: 1 2 3
##  $ height: num  71 57 64
##  $ female: num  0 1 1
# Read the prestige data; column 1 of the CSV supplies the row names
prestige <- read.table(file = here::here("data", "prestige_v2.csv"),
                       sep = ",",
                       header = TRUE,
                       row.names = 1)
str(prestige)
## 'data.frame':    101 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
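
The definition of myDF is not shown on this slide. A minimal sketch of how such a data frame might be built, assuming the values printed by str() above. The third name is truncated in that output, so "Carla" is a hypothetical placeholder:

```r
# Hypothetical reconstruction of myDF; "Carla" is a placeholder because
# the str() output above elides the third level.
myDF <- data.frame(
  names  = c("Albert", "Bianca", "Carla"),
  height = c(71, 57, 64),
  female = c(0, 1, 1),
  stringsAsFactors = TRUE   # store names as a factor, as in the output above
)

str(myDF)   # one line per column: name, type, first few values
```

Setting stringsAsFactors explicitly keeps the result the same across R versions, since the default changed in R 4.0.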

summary()

summary() is another useful function for understanding your data: it gives a numeric summary of each attribute (column).

summary(prestige)
##    education         income          women          prestige    
##  Min.   : 6.38   Min.   :  611   Min.   : 0.00   Min.   :14.80  
##  1st Qu.: 8.43   1st Qu.: 4075   1st Qu.: 3.59   1st Qu.:35.20  
##  Median :10.51   Median : 5902   Median :13.62   Median :43.50  
##  Mean   :10.73   Mean   : 6784   Mean   :29.19   Mean   :46.76  
##  3rd Qu.:12.71   3rd Qu.: 8131   3rd Qu.:52.27   3rd Qu.:59.60  
##  Max.   :15.97   Max.   :25879   Max.   :97.51   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :47  
##  1st Qu.:3117   prof:31  
##  Median :5137   wc  :23  
##  Mean   :5422            
##  3rd Qu.:8313            
##  Max.   :9517
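
The output above shows that summary() adapts to the column type: quartiles and the mean for numeric columns, counts per level for factors. A small sketch of the same behavior, using the built-in iris data rather than the prestige file so that it runs anywhere:

```r
# iris ships with R: four numeric columns and one factor (Species)
num_summary <- summary(iris$Sepal.Length)  # min, quartiles, median, mean, max
fac_summary <- summary(iris$Species)       # one count per factor level
na_summary  <- summary(c(1, NA, 3))        # adds an "NA's" entry when values are missing

num_summary
fac_summary
na_summary
```

Checking the "NA's" entry is a quick way to spot missing values right after loading a file.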

Exploratory Data Analysis

Plotting in R

Useful plotting resources:

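Five-Number Summary

summary() reports the five-number summary (minimum, first quartile, median, third quartile, maximum) plus the mean.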

summary(prestige$prestige)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.80   35.20   43.50   46.76   59.60   87.20
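
Base R also has fivenum(), which returns only the five numbers. It uses Tukey's hinges, so its quartiles can differ slightly from those of summary(). A short sketch on a made-up vector (not the prestige data):

```r
x <- 1:10   # hypothetical data, chosen so the two methods disagree

five <- fivenum(x)                         # min, lower hinge, median, upper hinge, max
q    <- quantile(x, c(0.25, 0.75))         # the quartiles that summary() reports

five   # 1.0 3.0 5.5 8.0 10.0
q      # 25% = 3.25, 75% = 7.75
```

For large samples the difference is usually negligible; it matters mostly when reading boxplots of small data sets, since boxplot() draws the hinges.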

Boxplots

boxplot(prestige$prestige, horizontal=TRUE, xlab='Prestige Scores')
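
To see the numbers behind a boxplot, boxplot.stats() returns the values that boxplot() draws. A minimal sketch on a made-up vector, assuming the default rule that points more than 1.5 box lengths beyond the hinges are plotted as outliers:

```r
x <- c(1:10, 30)   # hypothetical data with one extreme value

bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points drawn individually as outliers

boxplot(x, horizontal = TRUE, xlab = "x")
```

Here the hinges are 3.5 and 8.5, so the upper fence sits at 8.5 + 1.5 * 5 = 16 and the value 30 is flagged.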

Histograms

hist(prestige$prestige, freq = FALSE, 
     col = "grey",
     main = "Histogram of Prestige Score", 
     xlab = "Prestige Score")

Histograms, contd.

hist(prestige$prestige, freq = FALSE,
     col = "grey",
     main = "",
     xlab = "Prestige Score",
     ylim = c(0, 0.022))
lines(density(prestige$prestige, bw='nrd0'), col = "red", lwd = 2)
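
The density curve only lines up with the bars because freq = FALSE rescales the histogram so that the total bar area is 1. A sketch that checks this on simulated data (the seed and the distribution are arbitrary assumptions, not taken from the prestige data):

```r
set.seed(1)                            # arbitrary seed, for reproducibility
x <- rnorm(200, mean = 50, sd = 15)    # hypothetical scores

# hist() returns the bin information invisibly
h <- hist(x, freq = FALSE, col = "grey", main = "", xlab = "x")
lines(density(x), col = "red", lwd = 2)

# Bar height times bar width, summed over all bins
bar_area <- sum(h$density * diff(h$breaks))
bar_area
```

With freq = TRUE the bars show counts instead, and a density curve drawn on top would sit almost flat along the x-axis.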

Violin Plots

library(vioplot)  # vioplot() comes from the vioplot package, not base R
vioplot(prestige$prestige, horizontal = TRUE, col = 'dodgerblue2',
  names = 'Prestige Scores')

Exploring more than one variable

Discrete - Discrete

          Male    Female
Smoke     16%     12%
Nah       41%     31%
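
A table like this can be produced in R with table() and prop.table(). The sketch below treats the percentages above as counts out of 100 hypothetical respondents, which is an assumption made only for illustration:

```r
# Hypothetical counts matching the percentages in the table above
counts <- matrix(c(16, 41, 12, 31), nrow = 2,
                 dimnames = list(smoking = c("Smoke", "Nah"),
                                 sex     = c("Male", "Female")))
tab <- as.table(counts)

joint  <- prop.table(tab)              # cell proportions, summing to 1 overall
by_sex <- prop.table(tab, margin = 2)  # proportions within each column (sex)

joint
by_sex
mosaicplot(tab, main = "Smoking by Sex")  # simple plot of two discrete variables
```

With raw data in a data frame, table(df$smoking, df$sex) would build the same kind of object directly from the two columns (df is a hypothetical name).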

Overlay Histograms

pp <- prestige$prestige
hist(pp[prestige$type=='bc'], breaks=10, main='Histograms of Prestige Scores by Job Type',
  xlab='Prestige Scores', col=adjustcolor('dodgerblue2',.4), xlim=c(0,100))
hist(pp[prestige$type=='wc'], breaks=10, col=adjustcolor('green', .4), add=TRUE)
hist(pp[prestige$type=='prof'], breaks=10, col=adjustcolor('purple', .4), add=TRUE)

Adding Lines

abline(v=mean(pp[prestige$type=='bc']), lwd=2, col='dodgerblue2')
abline(v=mean(pp[prestige$type=='wc']), lwd=2, col='green')
abline(v=mean(pp[prestige$type=='prof']), lwd=2, col='purple')
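
The three group means can also be computed in a single call with tapply(), which saves repeating the subsetting expression. A sketch on the built-in iris data, used here only so the example is self-contained:

```r
# Mean of a numeric column within each level of a factor
grp_means <- tapply(iris$Sepal.Length, iris$Species, mean)
grp_means

# abline() accepts a vector, so all group means are drawn at once
cols <- c("dodgerblue2", "green", "purple")
hist(iris$Sepal.Length, col = "grey", main = "", xlab = "Sepal Length")
abline(v = grp_means, lwd = 2, col = cols)
```

The same pattern would apply to the prestige data with tapply(prestige$prestige, prestige$type, mean), assuming type is a factor as in the str() output earlier.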

Adding a Legend

legend("topright", legend = c("BC", "WC", "PROF"),
       col = c("dodgerblue2", "green", "purple"), lty = 1, lwd = 2, bty = "n")

Grouped Violin Plots

vioplot(pp[prestige$type=='bc'], pp[prestige$type=='wc'], pp[prestige$type=='prof'],
  horizontal=TRUE, names=c('BC', 'WC', 'PROF'))

Scatterplots

plot(x = prestige$education, y = prestige$prestige, pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg. Years of Education", ylab = "Prestige Score")

Scatterplots, contd.

plot(prestige$education, prestige$prestige, pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg. Years of Education", ylab = "Prestige Score")

abline(reg = lm(prestige ~ education, data = prestige),
  col = "green", lwd = 2)  # linear regression

lines(lowess(x = prestige$education, y = prestige$prestige),
  col = "red", lwd = 2)  # smoother

legend("topleft", legend = c("Regression Line", "Smoother"),
  col = c("green", "red"), lwd = c(2,2), lty = 1, bty = "n")
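
The fitted line can be inspected as well as drawn. A sketch of the same plot-then-fit pattern on the built-in cars data (stopping distance against speed), chosen only so the example runs without the prestige file:

```r
fit <- lm(dist ~ speed, data = cars)   # simple linear regression
coef(fit)                              # intercept and slope of the fitted line
r <- cor(cars$speed, cars$dist)        # strength of the linear association

plot(cars$speed, cars$dist, pch = 20,
     xlab = "Speed", ylab = "Stopping Distance")
abline(fit, col = "green", lwd = 2)                            # regression line
lines(lowess(cars$speed, cars$dist), col = "red", lwd = 2)     # smoother
```

When the smoother bends away from the regression line, that is a hint the relationship may not be linear.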

Scatterplot Matrices

library(car)
scatterplotMatrix( prestige[ ,c("prestige","education","income","women")] )
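
If the car package is not installed, base R's pairs() draws a plainer scatterplot matrix, and cor() gives the matching numeric summary. A sketch on the numeric columns of the built-in iris data:

```r
num_cols <- iris[, c("Sepal.Length", "Sepal.Width",
                     "Petal.Length", "Petal.Width")]

pairs(num_cols, pch = 20)   # one scatterplot per pair of variables

cor_mat <- cor(num_cols)    # pairwise correlations, same layout as the plot
round(cor_mat, 2)
```

Unlike scatterplotMatrix(), pairs() does not add smoothers or diagonal density plots by default.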

Summary

Some common commands and arguments

Next up

  1. Exercise 1
  2. Lunch

Return at XXXX to discuss solutions to Exercise 2!