Introduction

Practice subsetting the data frame you created in Exercise 2!

Part A

Open the R script that you used for Exercise 2, and re-run the code there to ensure that data is loaded into your current environment. Check that this worked properly by typing data and str(data) at the prompt and inspecting the output.

Part B

B.1 Use a logical index vector to locate the rows for women. Name this vector femaleIndex and use it to look at the rows of data for women. Then use the same vector with a logical operator to look at the rows of data for men. HINT: you may find the exclamation point ! useful.

femaleIndex <- data$female == TRUE  # logical index vector selecting rows with females
data[femaleIndex, ]  # women
##       name female edu salary
## 1    Sally      1  HS     30
## 2 Michelle      1  BS     38
## 3     Lupe      1  BS     53
## 4    Wendy      1  MS     65
## 5  Maritza      1 PhD     89
data[!femaleIndex, ]  # men
##     name female edu salary
## 6  Jorge      0  HS     29
## 7    Sam      0  HS     33
## 8    Joe      0  BS     42
## 9  Glenn      0  MS    107
## 10  Buck      0 PhD    246
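As a quick sanity check, the two index vectors can be counted and compared. The sketch below rebuilds data from the printed output above. Storing female as the numbers 0 and 1 and edu as character strings is an assumption, since the Exercise 2 script is not shown here, and maleIndex is a name introduced only for illustration.

```r
# Data reconstructed from the printed output above (column types are an assumption)
data <- data.frame(
  name   = c("Sally", "Michelle", "Lupe", "Wendy", "Maritza",
             "Jorge", "Sam", "Joe", "Glenn", "Buck"),
  female = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
  edu    = c("HS", "BS", "BS", "MS", "PhD", "HS", "HS", "BS", "MS", "PhD"),
  salary = c(30, 38, 53, 65, 89, 29, 33, 42, 107, 246)
)

femaleIndex <- data$female == TRUE  # 1 == TRUE is TRUE in R, 0 == TRUE is FALSE
maleIndex   <- !femaleIndex         # hypothetical name: the negated index selects men

n_women <- sum(femaleIndex)  # sum() counts the TRUE entries
n_men   <- sum(maleIndex)

# Negating the index gives the same vector as testing for FALSE directly
same_index <- identical(maleIndex, data$female == FALSE)
```

Here n_women and n_men should each be 5, and same_index should be TRUE, so !femaleIndex and data$female == FALSE can be used interchangeably.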

B.2 What is the mean salary for women? Is it more or less than that of men? HINT: you may want to repeat step B.1 for men, subset the column salary, and use the mean() function.

mean(data$salary[femaleIndex])  # mean salary for women
## [1] 55
mean(data$salary[data$female == FALSE])  # mean salary for men
## [1] 91.4

In this example, men have a higher mean salary than women.

B.3 Is the mean salary an appropriate measure of the difference between the two groups? Compare their median salaries and comment. HINT: use the function median().

median(data$salary[femaleIndex])  # median salary for women
## [1] 53
median(data$salary[!femaleIndex])  # median salary for men
## [1] 42

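In this example, women have a higher median salary than men (53 versus 42), even though men have the higher mean salary. The mean salary for men is pulled upward by Glenn and Buck’s large salaries, so the median gives a better picture of a typical salary in each group.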
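The group means and medians can also be computed in one step with tapply(), which applies a function to salary separately for each value of female. This is only a sketch of an alternative to the index vectors used above. The data frame is rebuilt from the printed output, with column types assumed, and group_means and group_medians are names introduced for illustration.

```r
# Data reconstructed from the printed output above (column types are an assumption)
data <- data.frame(
  name   = c("Sally", "Michelle", "Lupe", "Wendy", "Maritza",
             "Jorge", "Sam", "Joe", "Glenn", "Buck"),
  female = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
  edu    = c("HS", "BS", "BS", "MS", "PhD", "HS", "HS", "BS", "MS", "PhD"),
  salary = c(30, 38, 53, 65, 89, 29, 33, 42, 107, 246)
)

# One value per group; the result is named "0" (men) and "1" (women)
group_means   <- tapply(data$salary, data$female, mean)
group_medians <- tapply(data$salary, data$female, median)

group_means
group_medians
```

The two results side by side show the pattern discussed above: the ordering of the groups flips when moving from the mean to the median.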

Part C

C.1 Look at the rows of data whose salaries are less than the mean salary.

data[data$salary < mean(data$salary), ]  # entries with lower than mean salary
##       name female edu salary
## 1    Sally      1  HS     30
## 2 Michelle      1  BS     38
## 3     Lupe      1  BS     53
## 4    Wendy      1  MS     65
## 6    Jorge      0  HS     29
## 7      Sam      0  HS     33
## 8      Joe      0  BS     42

C.2 Repeat step C.1 using the median salary. Are there more or fewer people with salaries lower than the median than with salaries lower than the mean? Why do you think this is?

data[data$salary < median(data$salary), ]  # entries with lower than median salary
##       name female edu salary
## 1    Sally      1  HS     30
## 2 Michelle      1  BS     38
## 6    Jorge      0  HS     29
## 7      Sam      0  HS     33
## 8      Joe      0  BS     42

In this example, there are 7 people with salaries less than the mean salary of $73,200 and 5 people with salaries less than the median of $47,500. This is due to the skew in the data: the two abnormally large salaries pull the mean well above the median.
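Rather than counting rows by eye, the counts can be checked with sum(), which treats each TRUE in a logical vector as 1. The sketch below uses only the salary column, copied from the printed output above. The names n_below_mean and n_below_median are introduced for illustration.

```r
# Salary column copied from the printed output above (in thousands of dollars)
salary <- c(30, 38, 53, 65, 89, 29, 33, 42, 107, 246)

mean(salary)    # 73.2
median(salary)  # 47.5

# Each comparison yields a logical vector; sum() counts its TRUE entries
n_below_mean   <- sum(salary < mean(salary))    # 7
n_below_median <- sum(salary < median(salary))  # 5
```

With an even number of distinct salaries, exactly half of the values fall below the median, while the number below the mean depends on how far the largest salaries pull the mean upward.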

Part D

D.1 On average, do people with a high school education make less annually than people with bachelor’s degrees? Use the data to answer this question.

mean(data$salary[data$edu == "HS"]) < mean(data$salary[data$edu == "BS"])  # is the mean HS salary below the mean BS salary?
## [1] TRUE

Yes, in this data set people with a high school education make less than those with a bachelor’s degree on average.
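To see the size of the difference and not only the TRUE/FALSE answer, the mean salary can be tabulated for every education level at once. The sketch below uses aggregate() with a formula, which is one possible approach and not necessarily the one intended by the exercise. The data frame is rebuilt from the printed output, with column types assumed, and edu_means is a name introduced for illustration.

```r
# Data reconstructed from the printed output above (column types are an assumption)
data <- data.frame(
  name   = c("Sally", "Michelle", "Lupe", "Wendy", "Maritza",
             "Jorge", "Sam", "Joe", "Glenn", "Buck"),
  female = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
  edu    = c("HS", "BS", "BS", "MS", "PhD", "HS", "HS", "BS", "MS", "PhD"),
  salary = c(30, 38, 53, 65, 89, 29, 33, 42, 107, 246)
)

# One row per education level, with the mean salary of that group
edu_means <- aggregate(salary ~ edu, data = data, FUN = mean)
edu_means
```

The HS row should show a mean of about 30.7 and the BS row about 44.3, which agrees with the comparison above.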

D.2 Does the proportion of women differ between people with a high school education and people with at least a bachelor’s degree? HINT: make a logical index vector to select people with a HS education (name it hs_index) and use the ! operator to select people with at least a BS. You can use the sum() function to get the total number of TRUEs from a logical index vector (e.g., the total number of people in the data frame with a high school education is sum(hs_index)).

hs_index <- data$edu == "HS"  # logical index vector of HS education or not
n_hs <- sum(hs_index)  # number of people in data with HS education
sum(data$female[hs_index]) / n_hs  # proportion of females among HS grads
## [1] 0.3333333
sum(data$female[!hs_index]) / (nrow(data) - n_hs)  # proportion of females among college grads
## [1] 0.5714286

In this example, the proportion of women is higher among college graduates (about 57%) than among high school graduates (about 33%).
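Because female is coded as 0 and 1, the same proportions can be obtained with mean(), since the mean of a 0/1 vector is the share of 1s. The sketch below copies the female and edu columns from the printed output above. The 0/1 coding is an assumption based on that output, and prop_hs and prop_college are names introduced for illustration.

```r
# Columns copied from the printed output above (0/1 coding of female is an assumption)
female <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
edu    <- c("HS", "BS", "BS", "MS", "PhD", "HS", "HS", "BS", "MS", "PhD")

hs_index <- edu == "HS"  # TRUE for people with a high school education only

# The mean of a 0/1 vector equals the proportion of 1s
prop_hs      <- mean(female[hs_index])   # 1 of 3 HS graduates is a woman
prop_college <- mean(female[!hs_index])  # 4 of 7 college graduates are women
```

This gives the same values as dividing sum() by the group size, without having to compute the group sizes separately.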