// Open DiagnosticsWithStatex77 in R
> state<-as.data.frame(state.x77); colnames(state) <- c("Population", "Income",
"Illiteracy", "LifeExp", "Murder", "HSGrad", "Frost", "Area"); attach(state)
// As the student’s number is 292302, we have to delete observations #02 and #30.
// Hence, we delete Alaska and New Jersey:
> newstate<-state[-2,]       # drop Alaska (observation #2)
> newstate<-newstate[-29,]   # New Jersey (#30) has become row 29 after the first deletion
> 292302%%4+1
// It gives 3, so we select the “LifeExp” variable and cross out the Life Expectancy row in the given table.
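As a sanity check, the variant computation and the two row deletions can be reproduced in one step (a self-contained sketch using the built-in state.x77 data):

```r
# Sanity check: variant index and one-step deletion of observations #2 and #30.
state <- as.data.frame(state.x77)
colnames(state) <- c("Population", "Income", "Illiteracy", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")
292302 %% 4 + 1                 # variant index: 3 -> "LifeExp"
rownames(state)[c(2, 30)]       # "Alaska" "New Jersey"
newstate <- state[-c(2, 30), ]  # equivalent to the two sequential deletions above
nrow(newstate)                  # 48 states remain
```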
// Create a table (this should be done not in R but in the Word document):
// Develop an unrestricted model U:
> U<-lm(LifeExp ~Population+Income+Illiteracy+Murder+HSGrad+Frost+Area, data=newstate)
> summary(U)
// Record the results in the table:
//Delete Illiteracy, as its p-value is the highest:
> newstate1<-newstate[-c(3)]   # drop column 3 (Illiteracy)
// Compute new model – R1:
> R1<-lm(LifeExp ~Population+Income+Murder+HSGrad+Frost+Area, data=newstate1)
> summary(R1)
//The table is filled with the new results:
First, the adjusted R-squared increased, which supports our decision to exclude the Illiteracy variable. Recall that the plain R-squared is always at least as high when more variables are added, so even including an insignificant variable increases it; the adjusted R-squared, by contrast, penalizes the number of predictors and shows the adjusted share of variance explained by the model. Second, the F-statistic also increased, which again supports the claim that model R1 is better than model U. Third, we have obtained a new significant factor: Frost. In model U, the Frost variable was insignificant; this is further evidence that the Illiteracy variable was distorting the estimates (the correlation matrix below shows that Illiteracy and Frost are strongly correlated, about −0.69).
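The point about R-squared versus adjusted R-squared is easy to demonstrate: add a purely random predictor (a hypothetical `Noise` column, not part of the data) and compare the two statistics.

```r
# Plain R-squared never decreases when a predictor is added -- even pure noise --
# while adjusted R-squared penalizes the extra parameter.
state <- as.data.frame(state.x77)
colnames(state) <- c("Population", "Income", "Illiteracy", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")
set.seed(1)
state$Noise <- rnorm(nrow(state))  # hypothetical predictor, unrelated to LifeExp
base  <- lm(LifeExp ~ Murder + HSGrad + Frost, data = state)
noisy <- lm(LifeExp ~ Murder + HSGrad + Frost + Noise, data = state)
summary(noisy)$r.squared - summary(base)$r.squared          # >= 0, always
summary(noisy)$adj.r.squared - summary(base)$adj.r.squared  # usually < 0 for noise
```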
// Delete Income and create R2:
> newstate2<-newstate1[-c(2)]   # drop column 2 of newstate1 (Income)
> R2<-lm(LifeExp ~Population+Murder+HSGrad+Frost+Area, data=newstate2)
> summary(R2)
//The table is filled with the new results:
The adjusted R-squared increased again, the F-statistic increased, and a new factor (HSGrad) became significant. Everything indicates that we are on the right path.
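Rather than retyping numbers from each summary() into the table, the adjusted R-squared and F-statistic can be extracted programmatically (a sketch that refits U and R2 so the snippet is self-contained):

```r
# Extract adjusted R-squared and the F statistic for several models at once.
state <- as.data.frame(state.x77)
colnames(state) <- c("Population", "Income", "Illiteracy", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")
newstate <- state[-c(2, 30), ]
U  <- lm(LifeExp ~ Population + Income + Illiteracy + Murder + HSGrad + Frost + Area,
         data = newstate)
R2 <- lm(LifeExp ~ Population + Murder + HSGrad + Frost + Area, data = newstate)
stats <- sapply(list(U = U, R2 = R2), function(m) {
  s <- summary(m)
  c(adj.R2 = s$adj.r.squared, F = unname(s$fstatistic["value"]))
})
round(stats, 3)
```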
// Delete “Area” as its p-value is the highest and create R3:
> newstate3<-newstate2[-c(6)]   # drop column 6 of newstate2 (Area)
> R3<-lm(LifeExp ~Population+Murder+HSGrad+Frost, data=newstate3)
> summary(R3)
//The table is filled with the new results:
The adjusted R-squared increased again, and the F-statistic increased as well. Everything indicates that we are still on the right path. Only one variable (Population) remains insignificant now.
// Delete Population, as this is the last insignificant variable and create R4:
> newstate4<-newstate3[-c(1)]   # drop column 1 of newstate3 (Population)
> R4<-lm(LifeExp ~Murder+HSGrad+Frost, data=newstate4)
> summary(R4)
//The table is filled with the new results:
The adjusted R-squared decreased, indicating that we have lost part of the variance explained by the Population variable. In contrast, the F-statistic increased again, indicating that this model is even more significant than R3. However, Population may be associated with LifeExp in a different, non-linear way that a linear term cannot capture. On balance, I think it is better to select R4 as the best linear multiple regression model for predicting LifeExp.
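The choice between R3 and R4 can also be framed as a partial F-test on the nested models: anova() tests whether dropping Population significantly worsens the fit (a self-contained sketch; a large p-value supports the smaller model R4).

```r
# Partial F-test: does dropping Population from R3 significantly worsen the fit?
state <- as.data.frame(state.x77)
colnames(state) <- c("Population", "Income", "Illiteracy", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")
newstate <- state[-c(2, 30), ]
R3 <- lm(LifeExp ~ Population + Murder + HSGrad + Frost, data = newstate)
R4 <- lm(LifeExp ~ Murder + HSGrad + Frost, data = newstate)
anova(R4, R3)  # row 2 gives F and Pr(>F) for the single dropped term (Population)
```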
// Create a correlation matrix for the trimmed data set “newstate” (48 observations):
> cor(newstate)
Population Income Illiteracy LifeExp Murder HSGrad Frost Area
Population 1.00000000 0.28279501 0.1207902 -0.09187425 0.3814257 -0.07042366 -0.32581189 0.25702921
Income 0.28279501 1.00000000 -0.5369962 0.48009890 -0.3266125 0.60519602 0.18647923 0.02347333
Illiteracy 0.12079019 -0.53699616 1.0000000 -0.58548665 0.7026650 -0.69934654 -0.69034150 0.01943720
LifeExp -0.09187425 0.48009890 -0.5854866 1.00000000 -0.7776354 0.65175358 0.29108280 0.06456313
Murder 0.38142571 -0.32661252 0.7026650 -0.77763537 1.0000000 -0.55030171 -0.57038817 0.17467234
HSGrad -0.07042366 0.60519602 -0.6993465 0.65175358 -0.5503017 1.00000000 0.34848780 0.24813787
Frost -0.32581189 0.18647923 -0.6903415 0.29108280 -0.5703882 0.34848780 1.00000000 -0.09112056
Area 0.25702921 0.02347333 0.0194372 0.06456313 0.1746723 0.24813787 -0.09112056 1.00000000
This matrix indicates weak correlations between LifeExp and Population, Frost, and Area; moderate-to-strong correlations between LifeExp and Income, Illiteracy, and HSGrad; and a very strong (negative) correlation between LifeExp and Murder.
Some variables show only weak correlations with the dependent variable, so they were reasonable candidates for exclusion. The other excluded variables (Illiteracy, Income) correlate with variables that remain in the final model, so much of their information is retained. Thus, I think we have removed most of the avoidable bias from the linear model.
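Multicollinearity among the predictors can also be quantified with variance inflation factors. Without extra packages, the VIFs are the diagonal of the inverse of the predictors' correlation matrix (a sketch; values much above roughly 5 would flag strong collinearity):

```r
# Variance inflation factors for the predictors of the unrestricted model U:
# VIF_j = j-th diagonal entry of the inverse of the predictors' correlation matrix.
state <- as.data.frame(state.x77)
colnames(state) <- c("Population", "Income", "Illiteracy", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")
newstate <- state[-c(2, 30), ]
predictors <- newstate[, c("Population", "Income", "Illiteracy",
                           "Murder", "HSGrad", "Frost", "Area")]
vifs <- diag(solve(cor(predictors)))
round(vifs, 2)  # VIF is always >= 1; higher means more redundancy among predictors
```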