- Pick a variable you are interested in predicting. This becomes your dependent (Y) variable. Explain why this interests you.
- Pick several candidate independent (X) variables, including some that I did not provide. It would also be good to have an interaction variable and a dummy variable. Explain why you think these may be good predictors of your Y variable.
I chose the following independent (explanatory) variables: median income and the percentage of housing for people with income higher than 30% of the mean. My theory is that commuting time depends on people's capacity to buy homes. A person with a high income can afford a nice house in a suburban area and would therefore have a longer commute. Based on this relationship, I think that paying capacity and the amount of income above the average are good indicators of a person's means and therefore of his or her commuting time. An error term (denoted E) is also part of the regression equation. It is not a dummy variable: a dummy variable is a 0/1 indicator for a category, whereas the error term captures the variation in commuting time that the chosen variables leave unexplained. The error term is not observed directly; it is estimated by the residuals once the regression is run.
- Use statistical methods (correlation matrix and scatter plots) to see whether these are related to your Y variable and whether they are related to each other.
Scatter plots were drawn in Microsoft Excel to look for any visual correlation between the dependent variable (Commuting Time) and each independent variable (Average Income; Housing Percentage for People that have Income that is 30% higher than the Mean).
Figure 1 Scatterplot Diagram, Commuting Time and Median Income
The scatterplot for the first pairing shows no clear visual correlation between the two variables. The second pairing shows no clear visual correlation either.
Figure 2 Scatterplot Diagram, Commuting Time and Percentage with Greater Income
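The assignment also asks for a correlation matrix. A minimal sketch of how one could be computed in Python with NumPy is given here. The arrays hold made-up values for illustration only (they are not the data behind the figures), and the variable and function names are hypothetical.

```python
import numpy as np

# ASSUMPTION: hypothetical illustrative values, NOT the data used in this essay.
commute_minutes = np.array([18.0, 22.5, 25.0, 27.5, 31.0, 24.0])
median_income = np.array([41000.0, 48000.0, 52000.0, 50000.0, 61000.0, 45000.0])
pct_higher_income = np.array([14.0, 19.0, 21.0, 26.0, 30.0, 20.0])


def correlation_matrix(*columns):
    """Return the Pearson correlation matrix, one row and column per variable."""
    # np.corrcoef treats each row of its input as one variable.
    return np.corrcoef(np.vstack(columns))


corr = correlation_matrix(commute_minutes, median_income, pct_higher_income)

# Print a small labelled table: the first row shows how each X relates to Y,
# and the off-diagonal X-to-X entries show how the predictors relate to each other.
labels = ["commute", "income", "pct_higher"]
for label, row in zip(labels, corr):
    print(f"{label:>12}", " ".join(f"{value:6.2f}" for value in row))
```

Reading the first row gives the correlation of each candidate X with commuting time, while a large entry between the two X variables would be a warning that they carry overlapping information.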
- Run several regressions using your potential X variables against your Y variable. These should be of increasing size (some with 1 X, some with 2 Xs, etc.)
Regression analysis was conducted for each pairing. The results are:
- Commuting Time and Average Income: y = 769.64x + 23533, R² = 0.1791
- Commuting Time and Percentage of People with Income Greater than 30% of Average Income: y = 0.7228x + 8.9169, R² = 0.2686
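The regression results show a positive slope for both independent variables. Taken at face value, a 1-unit increase in average income is associated with a 769.64-minute increase in commuting time, while a 1-unit increase in the percentage variable is associated with a 0.72-minute increase. The size of the first slope and intercept is implausible for commuting time measured in minutes, which suggests that the two variables may have been swapped on the axes in that fit; this should be checked before the coefficient is interpreted. The results have not been tested for significance, so caution must be employed in their analysis. The values 23533 and 8.9169 are the intercepts of the two fitted lines, that is, the predicted value of y when x is zero. They are not the error terms (E), which are estimated by the residuals of each regression.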
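To make the roles of the slope, the intercept, and the residuals concrete, here is a minimal sketch of a one-variable least-squares fit in Python. It is not the Excel procedure used for the results reported here; the helper `fit_line` and the sample values are hypothetical and serve only as an illustration.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept, plus R-squared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: co-variation of x and y divided by the variation of x.
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Intercept: predicted y when x is zero.
    intercept = y.mean() - slope * x.mean()
    # Residuals: observed minus predicted y; these estimate the error term E.
    residuals = y - (slope * x + intercept)
    # R-squared: share of the variation in y explained by the fitted line.
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, residuals, r_squared


# ASSUMPTION: hypothetical values, NOT the data used in this essay.
pct = [14.0, 19.0, 21.0, 26.0, 30.0, 20.0]
commute = [18.0, 22.5, 25.0, 27.5, 31.0, 24.0]

slope, intercept, residuals, r_squared = fit_line(pct, commute)
print(f"commute = {slope:.4f} * pct + {intercept:.4f}, R^2 = {r_squared:.4f}")
```

The sketch returns the intercept and the residuals as separate outputs, which shows that the constant in the fitted equation and the error term are different quantities.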
- Pick your favorite model (regression result) and explain why this is your favorite model (Rsquared, adjusted Rsquared, F, and significance of the Bs)
I use the R-squared statistic to compare the two models. R-squared gives an idea of the goodness of fit of a regression equation. Sadly, the goodness of fit of both regression equations is low. For the first pairing, only 17.91% of the variation in commuting time is explained by income; for the second pairing, only 26.86% is explained by the percentage of people with incomes higher than the mean. Because R-squared measures association rather than causation, neither figure shows that income causes longer commutes. Comparing the models, I think the second equation is superior to the first because it captures more of the variation in the dependent variable, as evidenced by its higher R-squared.
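The assignment also mentions adjusted R-squared, which penalizes a model for each extra predictor. The sketch below applies the standard formula to the two R-squared values reported above. The sample size `N_OBS` is a placeholder assumption, because the number of observations is not reported here, so the printed figures are illustrative only.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# ASSUMPTION: hypothetical sample size; the essay does not report the real one.
N_OBS = 50

# R-squared values taken from the two one-variable regressions in the essay.
models = [("income model", 0.1791), ("percentage model", 0.2686)]
for name, r2 in models:
    print(name, round(adjusted_r_squared(r2, N_OBS, 1), 4))
```

With one predictor in each model and the same number of observations, the adjustment shrinks both values slightly but cannot change their ranking. It becomes more informative when models with different numbers of X variables are compared.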