April 27, 2018
Brief Review
Exploratory Data Analysis
myVec <- c(1,2,3)
myMatrix <- matrix(c('Ariel', 'Bill', 'Chris', 'Devon'),
ncol=2, nrow=2) # 2 x 2, to match the four names listed
Elements of vectors and matrices are accessed with the [] operator.
For vectors: myVec[2], myVec[-3], myVec[myVec >= 2]
For matrices: myMatrix[2,1], myMatrix[2,], myMatrix[,1]
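The indexing styles above (positive, negative, logical, and row/column) can be sketched on a small example. The vector v and matrix m below are hypothetical and used only for illustration.

```r
# Hypothetical vector and matrix, used only for illustration
v <- c(10, 20, 30, 40)
m <- matrix(1:6, nrow = 2, ncol = 3)  # filled column by column

second      <- v[2]        # positive index: the 2nd element
without_3rd <- v[-3]       # negative index: everything except the 3rd element
at_least_20 <- v[v >= 20]  # logical index: elements that meet a condition

cell  <- m[2, 1]           # row 2, column 1
row_2 <- m[2, ]            # all of row 2
col_1 <- m[, 1]            # all of column 1

print(without_3rd)
print(row_2)
```

Leaving the row or column position empty, as in m[2, ], selects everything along that dimension.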
myDF <- data.frame(names=c('Albert', 'Bianca', 'Cat'),
height=c(71, 57, 64),
female=c(0,1,1))
Columns of a data frame are accessed with the $ operator. This can be combined with the [] operator.
myDF$names
myDF$height[myDF$height > 60]
myDF$height[myDF$female==1]
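The same two operators can also filter whole rows or apply more than one condition at once. A minimal sketch using the myDF data frame from above; the names tall_rows, tall_female_names, and mean_female_height are introduced here for illustration only.

```r
# Same data frame as above, repeated so the example runs on its own
myDF <- data.frame(names  = c('Albert', 'Bianca', 'Cat'),
                   height = c(71, 57, 64),
                   female = c(0, 1, 1))

# A logical vector in the row position keeps every column for matching rows
tall_rows <- myDF[myDF$height > 60, ]

# Conditions can be combined with & (and) or | (or)
tall_female_names <- myDF$names[myDF$height > 60 & myDF$female == 1]

# The selected values can be passed straight to another function
mean_female_height <- mean(myDF$height[myDF$female == 1])

print(tall_rows)
print(mean_female_height)
```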
Functions, called as function(), are blocks of code or instructions that take some input and produce a desired output.
Data can be read into R with read.table() or read.csv().
autos <-
read.table(file='https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data',
header=FALSE)
whales <-
read.csv(file='https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',
header=FALSE)
Data can be written to a file with write.table() and write.csv().
write.table(object, filename, sep)
write.csv(object, filename)
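As a rough sketch of how reading and writing fit together, the example below writes a small data frame to temporary files and reads it back. The data frame df and the use of tempfile() are assumptions for illustration; substitute your own object and file name.

```r
# Hypothetical data frame, used only for illustration
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Comma-separated: write.csv() and read.csv() are a matching pair
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)  # skip the row-name column
df_csv <- read.csv(csv_path)

# Tab-separated: write.table() and read.table() need sep (and header) spelled out
tab_path <- tempfile(fileext = ".txt")
write.table(df, tab_path, sep = "\t", row.names = FALSE)
df_tab <- read.table(tab_path, header = TRUE, sep = "\t")

print(all.equal(df, df_csv))
print(all.equal(df, df_tab))
```

Setting row.names = FALSE is a choice made here so that the file read back has the same columns as the original.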
str(): a function to explain the internal structure of an object
summary(): a function that summarizes variables in a data frame
str()
Compact way of understanding what an object is and what it contains
str(mean)
## function (x, ...)
str(matrix)
## function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
str(sample)
## function (x, size, replace = FALSE, prob = NULL)
str()
After loading a data frame, it is often useful to call str() to understand the structure of your data.
str(myDF)
## 'data.frame': 3 obs. of 3 variables:
## $ names : Factor w/ 3 levels "Albert","Bianca",..: 1 2 3
## $ height: num 71 57 64
## $ female: num 0 1 1
prestige <- read.table(file = here::here("data", "prestige_v2.csv"),
sep=",",
header=TRUE,
row.names=1)
str(prestige)
## 'data.frame': 101 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
summary()
Another useful function for understanding your data; it provides a numeric summary of each attribute (column).
summary(prestige)
## education income women prestige
## Min. : 6.38 Min. : 611 Min. : 0.00 Min. :14.80
## 1st Qu.: 8.43 1st Qu.: 4075 1st Qu.: 3.59 1st Qu.:35.20
## Median :10.51 Median : 5902 Median :13.62 Median :43.50
## Mean :10.73 Mean : 6784 Mean :29.19 Mean :46.76
## 3rd Qu.:12.71 3rd Qu.: 8131 3rd Qu.:52.27 3rd Qu.:59.60
## Max. :15.97 Max. :25879 Max. :97.51 Max. :87.20
## census type
## Min. :1113 bc :47
## 1st Qu.:3117 prof:31
## Median :5137 wc :23
## Mean :5422
## 3rd Qu.:8313
## Max. :9517
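What summary() reports depends on the type of each column: quantiles and the mean for numeric columns, counts per level for factors. A small sketch of this, using R's built-in iris data set as a stand-in (an assumption, since the prestige file is not bundled with R).

```r
# iris ships with R: four numeric columns and one factor column (Species)
data(iris)

col_classes <- sapply(iris, class)            # the class of every column
num_summary <- summary(iris$Sepal.Length)     # numeric: min, quartiles, mean, max
fac_summary <- summary(iris$Species)          # factor: number of rows per level

print(col_classes)
print(num_summary)
print(fac_summary)
```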
Before performing statistical analyses on your data, it is important to do exploratory data analysis (EDA) in order to better understand the variables and the relationships between them.
This can be done in many ways, for example with numeric summaries (str() and summary()) or with plots.
We will cover some of the basic plotting functions.
To explore the distribution of one variable: boxplot(), hist(), density(), vioplot()
To explore relationships between variables: plot(), scatterplotMatrix()
ggplot2
Search for (or post) specific questions on Stack Overflow, a community that answers questions and chooses the best solutions via voting.
NOTE: This course provides a basic introduction to R's plotting capabilities. You can make much, much more elegant plots in R!
As seen with the summary() function, the distribution of a single numeric variable can be summed up with 5 numbers. These are the minimum and maximum values, the \(1^{st}\) and \(3^{rd}\) quartiles (25% and 75%), and the median (50%).
summary(prestige$prestige)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.80 35.20 43.50 46.76 59.60 87.20
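The five numbers can also be requested directly with quantile(). A minimal sketch on a hypothetical vector x (not the prestige data), checking that it agrees with summary().

```r
# Hypothetical data, used only for illustration
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# 0%, 25%, 50%, 75%, 100% = min, 1st quartile, median, 3rd quartile, max
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

print(five_num)
print(summary(x))  # the same five values, plus the mean
```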
A boxplot, created with boxplot(), is a handy graphical representation of the 5-number summary.
boxplot(prestige$prestige, horizontal=TRUE, xlab='Prestige Scores')
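The numbers behind a boxplot can be inspected without drawing anything by calling boxplot.stats(). The sketch below uses a hypothetical vector with one extreme value. Note that the box edges are hinges, which can differ slightly from the quartiles that quantile() reports.

```r
# Hypothetical data with one extreme value, used only for illustration
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 40)

bs <- boxplot.stats(x)  # default coef = 1.5, the same rule boxplot() uses

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# out: points more than 1.5 box lengths beyond the box, drawn individually
print(bs$out)
```

The whiskers stop at the most extreme points that are not flagged as outliers, so here the upper whisker ends at 12 rather than 40.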
A histogram can be created with hist().
hist(prestige$prestige, freq = FALSE,
col = "grey",
main = "Histogram of Prestige Score",
xlab = "Prestige Score")
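With freq = FALSE the bar heights are densities rather than counts, so the bar areas add up to 1. A minimal sketch on simulated data (an assumption; x is not the prestige variable), using plot = FALSE to get the histogram object without drawing it.

```r
# Simulated data, used only for illustration
set.seed(42)
x <- rnorm(200, mean = 50, sd = 15)

h <- hist(x, plot = FALSE)  # compute the bins without drawing

# counts: observations per bin; density: height of each bar when freq = FALSE
bar_areas  <- h$density * diff(h$breaks)  # height times bin width
total_area <- sum(bar_areas)

print(sum(h$counts))  # every observation falls in exactly one bin
print(total_area)     # bar areas sum to 1 on the density scale
```

This is why a density curve can be overlaid on a histogram drawn with freq = FALSE: both are on the same scale.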
A density curve can be added to a histogram using the density() and lines() functions.
lines(): takes coordinate pairs (in multiple input formats) and adds them to the current figure, connected by line segments
density(): computes kernel density estimates (a smoothed histogram); see ?density for more details
hist(prestige$prestige, freq=FALSE, col = "grey", main = "", xlab = "Prestige Score", ylim=c(0,0.022))
lines(density(prestige$prestige, bw='nrd0'), col = "red", lwd = 2)
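It can help to look at what density() returns before handing it to lines(). The sketch below uses simulated data (an assumption) and shows that the result is a grid of x and y coordinates, and that the adjust argument scales the bandwidth.

```r
# Simulated data, used only for illustration
set.seed(42)
x <- rnorm(200, mean = 50, sd = 15)

d_default <- density(x, bw = "nrd0")              # default bandwidth rule
d_smooth  <- density(x, bw = "nrd0", adjust = 2)  # twice the bandwidth: smoother curve

# The result holds coordinate pairs, which is what lines() draws
print(length(d_default$x))  # 512 grid points by default
print(d_default$bw)
print(d_smooth$bw)

# Approximate area under the curve with the trapezoid rule; should be close to 1
area <- sum(diff(d_default$x) * (head(d_default$y, -1) + tail(d_default$y, -1)) / 2)
print(area)
```

A larger bandwidth gives a smoother curve that hides detail; a smaller one follows the data more closely but looks bumpier.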
library(vioplot) # provides vioplot()
vioplot(prestige$prestige, horizontal=TRUE, col='dodgerblue2',
names='Prestige Scores')
| | Male | Female |
|---|---|---|
| Smoke | 16% | 12% |
| Nah | 41% | 31% |
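A table like this can be produced from counts with prop.table(). The sketch below assumes, purely for illustration, a sample of 100 people so that the counts equal the percentages shown; the object names are hypothetical.

```r
# Hypothetical counts (assumes 100 people, so counts match the percentages above)
counts <- matrix(c(16, 41, 12, 31), nrow = 2,
                 dimnames = list(c("Smoke", "Nah"), c("Male", "Female")))

joint      <- prop.table(counts)  # each cell divided by the grand total
by_sex     <- colSums(joint)      # marginal distribution of sex
by_smoking <- rowSums(joint)      # marginal distribution of smoking

# margin = 2 divides each column by its own total: smoking status within each sex
within_sex <- prop.table(counts, margin = 2)

print(joint)
print(by_sex)
print(round(within_sex, 3))
```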
We previously created a histogram of a single variable with hist(). We can extend this to creating a histogram of a single variable, but for multiple groups.
For example, a histogram of prestige scores grouped by type.
pp <- prestige$prestige
hist(pp[prestige$type=='bc'], breaks=10, main='Histograms of Prestige Scores by Job Type',
xlab='Prestige Scores', col=adjustcolor('dodgerblue2',.4), xlim=c(0,100))
hist(pp[prestige$type=='wc'], breaks=10, col=adjustcolor('green', .4), add=TRUE)
hist(pp[prestige$type=='prof'], breaks=10, col=adjustcolor('purple', .4), add=TRUE)
Lines can be added to an existing plot with the abline() function.
With abline(), we can plot a vertical line, abline(v=...), a horizontal line, abline(h=...), generic lines, abline(a=intercept, b=slope, ...), and others.
abline(v=mean(pp[prestige$type=='bc']), lwd=2, col='dodgerblue2')
abline(v=mean(pp[prestige$type=='wc']), lwd=2, col='green')
abline(v=mean(pp[prestige$type=='prof']), lwd=2, col='purple')
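The three group means can be computed in one call with tapply() instead of subsetting once per group. A minimal sketch using the built-in iris data as a stand-in for the prestige data (an assumption), with Species playing the role of type.

```r
# iris ships with R; Species is a factor with three levels
data(iris)

# Apply mean() to Sepal.Length separately within each level of Species
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
print(group_means)

# The same value, computed the long way for one group
setosa_mean <- mean(iris$Sepal.Length[iris$Species == 'setosa'])
print(setosa_mean)
```

Because abline(v=...) accepts a vector of positions, a result like group_means can be passed to it in a single call once a plot is open.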
To label the groups, add a legend. Use legend() to do this in R.
legend("topright", legend = c("BC", "WC", "PROF"),
col = c("dodgerblue2", "green", "purple"), lty = 1, lwd = 2, bty = "n")
Violin plots of prestige grouped by type:
vioplot(pp[prestige$type=='bc'], pp[prestige$type=='wc'], pp[prestige$type=='prof'],
horizontal=TRUE, names=c('BC', 'WC', 'PROF'))
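The repeated subsetting above, one expression per group, can be replaced with split(), which returns one vector per factor level. The sketch below uses the built-in iris data as a stand-in (an assumption) and passes the resulting list to boxplot(), which accepts a list of vectors; plot = FALSE returns the statistics without drawing.

```r
# iris ships with R; Species is a factor with three levels
data(iris)

# One numeric vector per level of Species, returned as a named list
groups <- split(iris$Sepal.Length, iris$Species)
print(names(groups))
print(sapply(groups, length))

# boxplot() accepts the list directly
bp <- boxplot(groups, plot = FALSE)
print(bp$names)
print(bp$stats[3, ])  # third row holds the median of each group
```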
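Scatterplots show the relationship between two variables; use the plot() function to create one.
See the help page (?plot) and Graphical Parameters for more details.
plot() takes x and y coordinates.
x and y must be the same dimension.
plot(x = prestige$education, y = prestige$prestige, pch = 20,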
main = "Prestige Score by Education",
xlab = "Avg. Years of Education", ylab = "Prestige Score")
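The same-dimension requirement can be seen without drawing anything by calling xy.coords(), the helper that base plotting functions use to interpret their x and y arguments. The vectors below are hypothetical.

```r
# Hypothetical coordinates: x has 3 values, y has 4
x <- 1:3
y <- c(2, 4, 6, 8)

# Mismatched lengths raise an error; capture its message instead of stopping
msg <- tryCatch({
  xy.coords(x, y)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# With matching lengths the coordinates are accepted
ok <- xy.coords(x, y[1:3])
print(ok$x)
print(ok$y)
```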
Trend lines can be added to a scatterplot using the lowess(), lines(), and abline() functions.
lm(): used to fit a linear regression; more on this later
lowess(): LOcally WEighted Scatterplot Smoother; computes a smoothed fit using locally-weighted polynomial regression
plot(prestige$education, prestige$prestige, pch = 20,
main = "Prestige Score by Education",
xlab = "Avg. Years of Education", ylab = "Prestige Score")
abline(reg = lm(prestige ~ education, data = prestige),
col = "green", lwd = 2) # linear regression
lines(lowess(x = prestige$education, y = prestige$prestige),
col = "red", lwd = 2) # smoother
legend("topleft", legend = c("Regression Line", "Smoother"),
col = c("green", "red"), lwd = c(2,2), lty = 1, bty = "n")
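To see what the two fits contain before they are drawn, the sketch below uses the built-in cars data (speed and stopping distance) as a stand-in for the prestige data; that substitution is an assumption made so the example runs on its own.

```r
# cars ships with R: 50 rows, columns speed and dist
data(cars)

# lm() returns a model object; abline(reg = ...) draws the line from its coefficients
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))  # intercept and slope of the straight line

# lowess() returns a list of x and y coordinates, which lines() connects
smooth <- lowess(x = cars$speed, y = cars$dist)
print(names(smooth))
print(length(smooth$x))
```

The straight line is described by two numbers, while the smoother is a curve stored point by point, which is why one is drawn with abline() and the other with lines().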
scatterplotMatrix() (found in the car package we installed earlier) produces scatterplots between all variables in a data frame.
library(car)
scatterplotMatrix( prestige[ ,c("prestige","education","income","women")] )
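A numeric companion to a scatterplot matrix is the correlation matrix from cor(), which gives one number per pair of variables. A minimal sketch on the built-in iris data (an assumption; any data frame of numeric columns would do).

```r
# iris ships with R; the first four columns are numeric
data(iris)
num_cols <- iris[, 1:4]

# Pairwise Pearson correlations between all numeric columns
cor_mat <- cor(num_cols)
print(round(cor_mat, 2))
```

Base R's pairs() function draws a plain scatterplot matrix of the same columns without any extra packages.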
boxplot()
hist()
density()
vioplot::vioplot()
plot()
lines()
abline()
adjustcolor()