Exploratory Data Analysis

Vector and Matrix Review

Vectors and Matrices are data structures used to hold data of the same type
- myVec <- c(1,2,3)
- myMatrix <- matrix(c('Ariel', 'Bill', 'Chris', 'Devon', ncol=3, nrow=2)
If the data is not of the same type, R will coerce the data to the most flexible type
We subset these structures using the [] operator
- myVec[2], myVec[-3], myVec[myVec >= 2]
- myMatrix[2,1], myMatrix[3,], myMatrix[,1]

Data Frame Review

Data Frames are special lists in R that hold different data types
- myDF <- data.frame(names=c('Albert', 'Bianca', 'Cat'), height=c(71, 57, 64), female=c(0,1,1))
Each column must be the same length and must have the same data type within that column
We subset (named) data frames using the $ operator. This can be combined with the [] operator.
- myDF$names
- myDF$height[myDF$height > 60]
- myDF$height[myDF$female==1]

Function Review

Functions, function(), are blocks of code or instructions that take some input and produce a desired output.
The input is known as the arguments or parameters
In R, many functions have preset/default argument values so that you don’t have to specify them each time. However, it’s usually good practice to specify several arguments to make your output more readable.

Reading Data

Before beginning a data analysis, we need to read in the appropriate data
This is generally handled with 1 of 2 functions: read.table(), or read.csv()
Auto MPG data
- autos <- read.table(file='https://archive.ics.uci.edu/ml/ machine-learning-databases/auto-mpg/auto-mpg.data', header=FALSE)
Abalone Whale Data
- whales <- read.csv(file='https://archive.ics.uci.edu/ml/ machine-learning-databases/abalone/abalone.data', header=FALSE)

Writing Data

Writing data is useful when you need to save a copy of your dataset to your local machine
This is usually done with write.table() and write.csv()
- write.table(object, filename, sep)
- write.csv(object, filename)

Some Useful Functions

str(): a function to explain internal structure of an object
summary(): a function that summarizes variables in a data frame
- Note: this function is also used to summarize results of model fitting functions, which we will go over in the afternoon.

`str()`

Compact way of understanding what an object is and what it contains

str(mean)

## function (x, ...)

str(matrix)

## function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

str(sample)

## function (x, size, replace = FALSE, prob = NULL)

`str()`

After loading a data frame, it is often useful to use str() in order to understand the structure of your data.

str(myDF)

## 'data.frame':    3 obs. of  3 variables:
##  $ names : Factor w/ 3 levels "Albert","Bianca",..: 1 2 3
##  $ height: num  71 57 64
##  $ female: num  0 1 1

prestige <- read.table(file = here::here("data", "prestige_v2.csv"),
                       sep=",",
                       header=TRUE,
                       row.names=1)
str(prestige)

## 'data.frame':    101 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...

`summary()`

Another useful function for understanding your data by providing a numeric summary of each attribute (column).

summary(prestige)

##    education         income          women          prestige    
##  Min.   : 6.38   Min.   :  611   Min.   : 0.00   Min.   :14.80  
##  1st Qu.: 8.43   1st Qu.: 4075   1st Qu.: 3.59   1st Qu.:35.20  
##  Median :10.51   Median : 5902   Median :13.62   Median :43.50  
##  Mean   :10.73   Mean   : 6784   Mean   :29.19   Mean   :46.76  
##  3rd Qu.:12.71   3rd Qu.: 8131   3rd Qu.:52.27   3rd Qu.:59.60  
##  Max.   :15.97   Max.   :25879   Max.   :97.51   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :47  
##  1st Qu.:3117   prof:31  
##  Median :5137   wc  :23  
##  Mean   :5422            
##  3rd Qu.:8313            
##  Max.   :9517

Exploratory Data Analysis

Before performing statistical analyses on your data, it is important to do exploratory data analysis (EDA) in order to better understand the variables and the relationships between them.
This can be done in many ways
- Numeric summaries (e.g., using str() and summary())
- Plots, plots, and more plots
- More advanced methods (variograms, empirical covariance matrices, etc)
We will cover some of the basic plotting functions.

Plotting in R

To explore the distribution of one variable:
- 5-number summary
- Boxplots
- Histograms and Density Estimation
- Violin plots
To explore relationships between variables:
- Grouped Violin Plots
- Scatterplots
- Scatterplot Matrices

Useful plotting resources:

Quick-R Graphical Parameters
Colors in R
Points and Line Types
ggplot2
Search for (or post) specific questions on stackoverflow, which is a community that will answer questions & chose the best solutions via voting
NOTE: This course provides a basic introduction to R’s plotting capabilities. You can do much, much more elegant plots in R!

5 number summary

As seen in the summary() function, the distribution of a single numeric variable can be summed up wth 5 numbers. These are the minimum and max values, the $1^{st}$ and $3^{rd}$ quartiles (25% and 75%), and the median (50%).
R also throws in the mean to create a “6 number summary.” We will focus on the 5 without the mean.

summary(prestige$prestige)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.80   35.20   43.50   46.76   59.60   87.20

Box plots

A box plot, boxplot(), is a handy graphical representation of the 5 number summary.
Just as with the 5 number summary, the box plot tells us the min$^*$ and max$^*$ values, the $1^{st}$ and $3^{rd}$ quartiles, and the median.
- If outliers are present, by default defined as outside of 1.5$\times$inter-quartile range, they will be shown outside of the whiskers.

boxplot(prestige$prestige, horizontal=TRUE, xlab='Prestige Scores')

Histograms

Another useful graphic to visualize the distribution of a variable is a histogram, hist().
Reduced version of a “stem-and-leaf” plot
For each possible value of your variable (x-axis), a histogram displays the number of times that value appears in your data (y-axis). Counts can also be converted into proportions.
This is a natural plot for a discrete variable. For continuous variables, a histogram “bins” the observations into discrete units/intervals.

hist(prestige$prestige, freq = FALSE, 
     col = "grey",
     main = "Histogram of Prestige Score", 
     xlab = "Prestige Score")

Histograms, contd.

Sometimes we would rather have a smoothed version of a histogram (i.e., a density function).
- Not susceptible to number/location of bins like histograms are.
We can include this using the density() and lines() functions.
- lines(): takes coordinate pairs (in multiple input formats) and adds them to current figure connected by line segments
- density(): computes kernel density estimates (a smoothed histogram); see ?density for more details

hist(prestige$prestige, freq=FALSE, col = "grey", main = "", xlab = "Prestige Score", ylim=c(0,0.022))
lines(density(prestige$prestige, bw='nrd0'), col = "red", lwd = 2)

Violin Plots

So far we’ve seen how a distribution can be summarized by a 5 number summary, which can be visualized with a box plot.
We’ve also seen how to view the shape of a distribution using histograms and density estimates.
🤔 Can we combine these into one graphic?
- Obviously. We made a whole slide for it.
Violin plots create boxplots, but overlay them with a mirrored density estimate

vioplot(prestige$prestige, horizontal=TRUE, col='dodgerblue2',
  names='Prestige Scores')

Exploring more than one variable

We’ve looked at ways to view the distribution of single variables, but how do we examine the relationship between 2 variables? 3? 42?
We’ll focus on 2 for now.
Types of relationships:
- Discrete - Discrete
- Discrete - Continuous
- Continuous - Continuous

Discrete - Discrete

Many times, we’ll be interested in a discrete response across classes of subjects, e.g. proportion of smokers to non-smokers across males/females.
This can usually be summed up with a table

	Male	Female
Smoke	16%	12%
Nah	41%	31%

We could also overlay histograms, though this is more beneficial for ordered factors or continuous covariates.

Overlay Histograms

For Discrete - Continuous types
When examining the distribution of a single variable, we can easily create a histogram using hist(). We can extend this to creating a histogram of a single variable, but for multiple groups.
Let’s examine the prestige scores grouped by type.

pp <- prestige$prestige
hist(pp[prestige$type=='bc'], breaks=10, main='Histograms of Prestige Scores by Job Type',
  xlab='Prestige Scores', col=adjustcolor('dodgerblue2',.4), xlim=c(0,100))
hist(pp[prestige$type=='wc'], breaks=10, col=adjustcolor('green', .4), add=TRUE)
hist(pp[prestige$type=='prof'], breaks=10, col=adjustcolor('purple', .4), add=TRUE)

Adding Lines

While looking at these histograms (or any plot for that matter), we may wish to add lines indicating the location of the mean, median, quartiles, etc.
We can do this using the abline() function.
With abline(), we can plot a vertical line, abline(v=...), horizontal line, abline(h=...), generic lines, abline(a=intercept, b=slope,...), and others.

abline(v=mean(pp[prestige$type=='bc']), lwd=2, col='dodgerblue2')
abline(v=mean(pp[prestige$type=='wc']), lwd=2, col='green')
abline(v=mean(pp[prestige$type=='prof']), lwd=2, col='purple')

Adding a Legend

With multiple features on one figure, a legend can help clearly convey what is plotted.
Use legend() to do this in R.
- Very versatile function–look at its documentation
- Best to play around with this on your own, iterating through multiple plots until you get the legend to appear how you want it

legend("topright", legend = c("BC", "WC", "PROF"),
       col = c("dodgerblue2", "green", "purple"), lty = 1, lwd = 2, bty = "n")

Grouped Violin plots

Used to investigate relationships between variables:
- Continuous variable & factor (most common)
- Two continuous variables (group one of them)
Let’s look at violin plots of prestige grouped by type

vioplot(pp[prestige$type=='bc'], pp[prestige$type=='wc'], pp[prestige$type=='prof'],
  horizontal=TRUE, names=c('BC', 'WC', 'PROF'))

Scatterplots

We can also use scatterplots to visualize the relationship between two (continuous) variables.
We use the plot() function to do this
- Extremely flexible–there are plot methods for a variety of objects! (e.g., plotting a regression object returns diagnostic plots… more on this later)
- See the help documentation (?plot) and Graphical Parameters for more details
For now, focus on the most simple plot where we specify the x and y coordinates.
- x and y must be the same dimension

plot(x = prestige$education, y = prestige$prestige, pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg. Years of Education", ylab = "Prestige Score")

Scatterplots, contd.

Let’s overlay both a linear fit and smoother to our scatterplot.
We can include this using the lowess(), lines(), and abline() functions.
- Need to use lm() to fit a linear regression, more on this later
- lowess(): LOcally WEighted Scatterplot Smoother; computes a smoothed fit using locally-weighted polynomial regression.

plot(prestige$education, prestige$prestige, pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg. Years of Education", ylab = "Prestige Score")

abline(reg = lm(prestige ~ education, data = prestige),
  col = "green", lwd = 2)  # linear regression

lines(lowess(x = prestige$education, y = prestige$prestige),
  col = "red", lwd = 2)  # smoother

legend("topleft", legend = c("Regression Line", "Smoother"),
  col = c("green", "red"), lwd = c(2,2), lty = 1, bty = "n")

Scatterplot Matrices

The function scatterplotMatrix() (found in the car package we installed earlier) produces scatterplots between all variables in a data frame.
We can use direct ordering of the variables to control the order in which they are plotted.

library(car)
scatterplotMatrix( prestige[ ,c("prestige","education","income","women")] )

Summary

We’ve seen how to explore single variables with boxplots, histograms and density estimatates, and a combination of these known as a violin plot.
For 2 variables, we created grouped violin plots, overlayed histograms, and scatterplots.
Visualizations can be enhanced by adding lines, colors, labels, etc
There is always more than one way to view your data. Each perspective can give you a new understanding.

Some common commands and arguments

Plots
- boxplot()
  - horizontal=TRUE
- hist()
  - freq=TRUE
- density()
  - bw=c(‘nrd0’, ‘SJ’, 5, 0.5,…)
- vioplot::vioplot()
  - horizontal=TRUE, names=c(‘name1’, ‘name2’,…)
- plot()
Common plot arguments (for most plots)
- labeling: main, xlab, ylab
- limits: xlim, ylim
- add to existing plot: add=TRUE
- point and line types and shapes: pch, lty, lwd
- color: col
Other useful functions
- To add curves to existing plot: lines()
- To add straight line to existing plot: abline()
- To adjust color transparency: adjustcolor()

Exploratory Data Analysis

CSULB Intro to R

Agenda

Vector and Matrix Review

Data Frame Review

Function Review

Reading Data

Writing Data

Some Useful Functions

`str()`

`str()`

`summary()`

Exploratory Data Analysis

Plotting in R

Useful plotting resources:

5 number summary

Box plots

Histograms

Histograms, contd.

Violin Plots

Exploring more than one variable

Discrete - Discrete

Overlay Histograms

Adding Lines

Adding a Legend

Grouped Violin plots

Scatterplots

Scatterplots, contd.

Scatterplot Matrices

Summary

Some common commands and arguments

Next up