First, run the following R code to install the AER package and load the data:
install.packages("AER")   # one-time installation
library(AER)
data(CPSSW9204)
N <- 2
# Draw 200 + N rows at random, without replacement
rows <- sample(1:nrow(CPSSW9204), 200 + N, replace = FALSE)
data <- as.data.frame(CPSSW9204[rows, ])
# Check the type and length of column i (column 5 is age)
i <- 5; is.factor(data[, i]); is.numeric(data[, i]); length(data[, i])
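# Multiple regression of earnings on year, age, gender and degree
lm(earnings ~ year + age + gender + degree, data = data)
# Four plots in a 2 x 2 grid
# Columns: 1 = year, 2 = earnings, 3 = degree, 4 = gender, 5 = age
par(mfrow = c(2, 2))
plot(data[, 2] ~ data[, 5], main = "Income by Age")
abline(lm(data[, 2] ~ data[, 5]), lwd = 3, col = "red")   # OLS line
plot(data[, 2] ~ data[, 1], main = "Income by Year")
plot(data[, 2] ~ data[, 3], main = "Income by Degree")
plot(data[, 2] ~ data[, 4], main = "Income by Gender")
# Simple regression of earnings on gender
summary(lm(earnings ~ gender, data = data))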
Exercise #1
The regression output is:
Coefficients:
   (Intercept)        year2004             age    genderfemale  degreebachelor
        3.1335          5.0934          0.2701         -3.7692          5.4232
This means that:
Earnings = 3.1335 + 5.0934*year + 0.2701*age - 3.7692*gender + 5.4232*degree
where year is 1 for 2004 and 0 for 1992; gender is 1 for female and 0 for male; and degree is 1 for a bachelor’s degree and 0 for high school. Age is measured in years.
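To see how R turns factors into the 0/1 indicators used in this equation, a small sketch may help. The data frame toy below is made up for illustration and is not drawn from CPSSW9204; only the level names and their order are assumed to match the data set.

```r
# Hypothetical four-row data frame; the values are illustrative only.
toy <- data.frame(
  gender = factor(c("male", "female", "female", "male"),
                  levels = c("male", "female")),
  degree = factor(c("highschool", "bachelor", "highschool", "bachelor"),
                  levels = c("highschool", "bachelor"))
)

# model.matrix() builds the design matrix that lm() uses internally.
# The first level of each factor is the reference category (coded 0).
X <- model.matrix(~ gender + degree, data = toy)
print(X)
```

The column names genderfemale and degreebachelor follow the same pattern as the coefficient names in the regression output, which is why the reference categories (male, high school) do not get a coefficient of their own.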
Exercise #2
Holding all other variables constant, each additional year of age is associated with an increase of 0.2701 in earnings. Holding all other variables constant, females earn 3.7692 less than males, holders of a bachelor’s degree earn 5.4232 more than those who completed only high school, and participants in 2004 earned 5.0934 more than participants in 1992.
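The "all other variables constant" reading can be checked by plugging numbers into the fitted equation. The sketch below hard-codes the coefficients from the Exercise #1 output; the helper predict_earnings and the example person (age 30, bachelor’s degree, year 2004) are assumptions introduced only for illustration.

```r
# Coefficients copied from the Exercise #1 output.
b <- c(intercept = 3.1335, year2004 = 5.0934, age = 0.2701,
       female = -3.7692, bachelor = 5.4232)

# Hypothetical helper: fitted earnings for given regressor values.
# year2004, female and bachelor are 0/1 indicators.
predict_earnings <- function(year2004, age, female, bachelor) {
  unname(b["intercept"] + b["year2004"] * year2004 + b["age"] * age +
           b["female"] * female + b["bachelor"] * bachelor)
}

# Two 30-year-olds with a bachelor's degree in 2004 who differ only by gender.
male_30   <- predict_earnings(1, 30, 0, 1)
female_30 <- predict_earnings(1, 30, 1, 1)
gender_gap <- female_30 - male_30      # equals the genderfemale coefficient

# One extra year of age, everything else fixed.
age_effect <- predict_earnings(1, 31, 1, 1) - female_30   # equals the age coefficient

c(gender_gap = gender_gap, age_effect = age_effect)
```

Whatever values are chosen for the other regressors, the two differences stay the same, because the model is additive and has no interaction terms.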
Exercise #3
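The code produces a 2 x 2 grid of four graphs: Income by Age (a scatterplot with the OLS line in red), Income by Year, Income by Degree, and Income by Gender (box plots).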
Exercise #4
The first plot shows income against age. Since both variables are numeric, R draws a scatterplot. The OLS line slopes slightly upward, so income tends to increase with age; on average, older employees earn more than younger ones.
The second plot is a box plot of income by year. Income in 1992 was clearly lower than in 2004. There are a few outliers in both years (extremely high values, shown on the graph as small circles).
The third plot is a box plot of income by degree. Holders of a bachelor’s degree earn more than high school graduates. There are numerous outliers (extremely high values) in both groups.
The fourth plot is a box plot of income by gender. On average, males earn more than females. There are a few extremely high values for males and several for females.
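The small circles in a box plot follow a fixed rule: by default, R marks any point lying more than 1.5 times the box height beyond the box. A minimal sketch with a made-up vector (the values are hypothetical, not taken from the sample) shows the rule at work:

```r
# Hypothetical earnings values; the last one is deliberately extreme.
x <- c(10, 12, 13, 14, 15, 16, 18, 60)

s <- boxplot.stats(x)   # the statistics that boxplot() draws
s$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out                   # points that would be plotted as circles
```

Assuming the sample from the setup code is still in memory, something like boxplot.stats(data$earnings[data$gender == "female"])$out would list the outlying values for one group.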
Exercise #5
R returns the following:
Call:
lm(formula = earnings ~ gender, data = data)
Residuals:
    Min      1Q  Median      3Q     Max
-11.380  -5.426  -2.066   3.802  43.757
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   15.3781     0.8424   18.25   <2e-16 ***
genderfemale  -2.7108     1.2157   -2.23   0.0269 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.632 on 200 degrees of freedom
Multiple R-squared:  0.02426,  Adjusted R-squared:  0.01938
F-statistic: 4.972 on 1 and 200 DF,  p-value: 0.02687
This regression supports the claim that males earn more than females: the coefficient on genderfemale is negative (-2.7108) and significant at the 5% level (p = 0.0269). The model as a whole is significant (F-test: F = 4.972, p = 0.02687). However, R-squared is only 0.02426, meaning that gender explains just 2.43% of the variance in earnings, so more explanatory variables should be included.
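The figures in the summary() output are tied together, and a short check can make that visible. The sketch below only re-uses numbers printed above (it does not refit the model); the variable names are chosen here for illustration.

```r
# Figures copied from the summary() output above.
r2     <- 0.02426   # Multiple R-squared
df_res <- 200       # residual degrees of freedom
b_fem  <- -2.7108   # genderfemale estimate
se_fem <- 1.2157    # its standard error
b0     <- 15.3781   # intercept: mean earnings of males in the sample

# With a single regressor, F = R^2 / (1 - R^2) * residual df, and F = t^2.
f_from_r2 <- r2 / (1 - r2) * df_res
t_stat    <- b_fem / se_fem
f_from_t  <- t_stat^2

# p-value of the F test (1 numerator degree of freedom).
p_val <- pf(f_from_r2, df1 = 1, df2 = df_res, lower.tail = FALSE)

# Implied group means: males at the intercept, females at intercept + slope.
mean_female <- b0 + b_fem

round(c(F_from_R2 = f_from_r2, F_from_t = f_from_t, p = p_val,
        mean_female = mean_female), 4)
```

Up to rounding in the printed output, both routes reproduce the reported F-statistic, which is why the p-value of the F-test matches the p-value of the genderfemale coefficient in a one-regressor model.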