About the paper
This paper is commissioned to delineate linear and log-linear multiple regression models in which the dependent variable, earnings per hour, is regressed on two independent variables, education and experience. As part of this analysis, we cite multiple journal articles, books and other related sources to learn more about regression analysis and how statisticians perform data analysis today.
Both regression models, i.e. the linear and the log-linear multiple regression, will be performed in Excel, and the output will be attached in the appendix section of the paper. In addition, we will test the significance of both independent variables by performing hypothesis tests, using the t-test and the F-test, at the 5% level of significance. Thereafter, we will recommend some additional variables whose relationship with the dependent variable, i.e. earnings per hour, can be tested. The paper will culminate in a concise conclusion discussing the outcomes of the linear and log-linear regression models and the hypothesis testing.
Defining Multiple Regression
Regression is a statistical technique that attempts to ascertain the relationship between a dependent variable and one or more independent variables. A regression in which the relationship between a dependent variable and more than one independent variable is ascertained is called multiple regression (Draper and Smith, 1981). In this way, the analyst is able to learn which variables influence the dependent variable and to what extent they do so.
Regression Equation
Y = b0 + b1X1 + b2X2
Here:
Y = Dependent variable, i.e. earnings per hour
b0 = Intercept term
b1 = Slope coefficient of education (independent variable)
b2 = Slope coefficient of experience (independent variable)
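The model above can be estimated by ordinary least squares. As a minimal sketch, using a small, entirely hypothetical data set (the paper's actual data set is in Appendix 1), the three coefficients can be obtained in Python with NumPy:

```python
import numpy as np

# Entirely hypothetical sample (the paper's actual data set is in Appendix 1):
education  = np.array([12.0, 16.0, 14.0, 18.0, 10.0, 15.0])  # years of education
experience = np.array([ 5.0, 10.0,  8.0, 15.0,  3.0, 12.0])  # years of experience
earnings   = np.array([15.0, 28.0, 20.0, 35.0, 10.0, 27.0])  # earnings per hour

# Design matrix with an intercept column: Y = b0 + b1*X1 + b2*X2
X = np.column_stack([np.ones_like(education), education, experience])

# Ordinary least squares estimates of b0, b1, b2
coeffs, *_ = np.linalg.lstsq(X, earnings, rcond=None)
b0, b1, b2 = coeffs
print(f"b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}")
```

This is the same computation Excel's regression tool performs; only the data here is invented for illustration.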
Expected sign of regression coefficients
We expect a positive coefficient for both independent variables; that is, education and experience should each share a positive relationship with the dependent variable, earnings per hour. We believe this forecast is justified because the higher the number of years of experience, or the higher the education level, the better an individual's earnings should be.
Linear Multiple Regression Model
Regressing the earnings per hour data on the data for the two independent variables, we found that both variables indeed share a positive relationship with the dependent variable. While 'education' has a coefficient of 3.422, 'experience' has a coefficient of 0.390. In addition, the model also has a good fit, as is evident from the adjusted R-square. Referring to the table attached in Appendix 1, we can see that the linear multiple regression model fitted with these two independent variables has an adjusted R-square of 80.26%. This indicates that nearly 80.26% of the total variation in the dependent variable, i.e. earnings per hour, is explained by education and experience. Therefore, with such a high adjusted R-square, we can be reasonably confident that the model has a good fit and can be used for further analysis (Dufour, 2011).
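The adjusted R-square follows from the ordinary R-square via a standard adjustment for the number of predictors. The sketch below uses a hypothetical R-square value chosen so that, with 80 observations and 2 predictors, it yields roughly the 80.26% quoted above; the exact figure is in the Appendix 1 output:

```python
# Adjusted R-square penalizes R-square for the number of predictors k:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n, k = 80, 2            # observations and independent variables
r_squared = 0.8076      # hypothetical R-square; the exact figure is in Appendix 1
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(f"adjusted R-square = {adj_r_squared:.2%}")
```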
F-Test
The F-test assesses how well the set of independent variables, as a group, explains the variation in the dependent variable. Stated otherwise, by performing the F-test, one can ascertain whether at least one of the independent variables explains a significant portion of the variation in the dependent variable (Montgomery, Peck and Vining, 2012). Below we have performed the hypothesis test, using the F-test, on both independent variables:
i) Null Hypothesis(H0): b1= b2= 0
ii) Alternative Hypothesis(Ha): At least one of b1 or b2 ≠ 0
iii) F-Statistic: MSR/MSE
MSR= Mean regression sum of squares
MSE= Mean squared error
Referring to Appendix 1, we found that MSR= 4004.166 and MSE= 24.77, therefore:
F-Statistic: 4004.166/24.77= 161.649
iv) Critical Value:
In order to determine whether at least one of the coefficients is statistically significant, the calculated F-statistic is compared with the critical value, which we determine at the 5% level of significance. The degrees of freedom are:
Numerator: k (number of independent variables)= 2
Denominator: n-k-1= 80-2-1= 77
Therefore, referring to the F-table, we found that at the 5% level of significance, with 2 degrees of freedom in the numerator and 77 degrees of freedom in the denominator, the critical value is 3.115. Since the F-statistic is greater than the critical value, we reject the null hypothesis and conclude that at least one of the independent variables contributes significantly to explaining the variation in the dependent variable.
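The F-test decision above can be checked programmatically. A minimal sketch, using the MSR and MSE figures quoted from Appendix 1 and SciPy for the critical value (small rounding differences from the reported statistics are expected, since the quoted inputs are themselves rounded):

```python
from scipy.stats import f

msr, mse = 4004.166, 24.77        # from the Appendix 1 ANOVA table
f_stat = msr / mse                # close to the reported 161.649

n, k = 80, 2                      # observations and independent variables
crit = f.ppf(0.95, dfn=k, dfd=n - k - 1)  # 5% significance, df = (2, 77)

reject = f_stat > crit
print(f"F = {f_stat:.3f}, critical value = {crit:.3f}, reject H0: {reject}")
```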
T-test
The t-test is used to test the significance of each independent variable individually (Kasza and Wolfe, 2013). Below we have performed the hypothesis testing, using the t-test, for both independent variables at the 5% level of significance.
-Testing significance of Education at the 5% level of significance
i) Null Hypothesis: b1= 0
ii) Alternative Hypothesis: b1 ≠0
iii) T-Statistic: (Coefficient-Hypothesized Value)/ Standard Error
=(3.422-0)/0.3362
=10.17
iv) Degree of freedom: n-k-1= 80-2-1= 77
v) Critical Value at 77 degrees of freedom at 5% for a two tailed test: 1.99
Therefore, since the t-statistic falls outside the region between -1.99 and +1.99, we reject the null hypothesis and conclude that b1, i.e. education, has a significant relationship with the dependent variable at the 5% level of significance.
-Testing significance of Experience at the 5% level of significance
i) Null Hypothesis: b2= 0
ii) Alternative Hypothesis: b2 ≠0
iii) T-Statistic: (Coefficient-Hypothesized Value)/ Standard Error
=(0.390-0)/0.053
=7.31
iv) Degree of freedom: n-k-1= 80-2-1= 77
v) Critical Value at 77 degrees of freedom at 5% for a two tailed test: 1.99
Therefore, since the t-statistic falls outside the region between -1.99 and +1.99, we reject the null hypothesis and conclude that b2, i.e. experience, has a significant relationship with the dependent variable at the 5% level of significance.
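Both t-tests can be reproduced in the same programmatic way. A sketch using the coefficients and standard errors quoted above (because these inputs are rounded, the computed t-statistics differ very slightly from the reported 10.17 and 7.31):

```python
from scipy.stats import t

n, k = 80, 2
df = n - k - 1                      # 77 degrees of freedom
crit = t.ppf(0.975, df)             # two-tailed critical value at 5%, about 1.99

# Coefficients and standard errors as quoted from the Appendix 1 output
inputs = {"education": (3.422, 0.3362), "experience": (0.390, 0.053)}

results = {}
for name, (coef, se) in inputs.items():
    results[name] = (coef - 0) / se  # t-statistic against H0: coefficient = 0
    print(f"{name}: t = {results[name]:.2f}, reject H0: {abs(results[name]) > crit}")
```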
Types of errors
Type 1 Error:
A Type 1 error is the rejection of the null hypothesis when it is actually true, or in other words, the acceptance of the alternative hypothesis when it is false. For instance, assume that our null hypothesis was that a particular independent variable has no significant relationship with the dependent variable. If, erroneously or through negligence, the researcher rejects the null hypothesis and concludes that the independent variable has a significant relationship with the dependent variable when in reality there is no relationship, a Type 1 (false positive) error has occurred (Banerjee et al., 2009).
Type 2 Error:
A Type 2 error is the failure to reject the null hypothesis when it is actually false. Stated otherwise, a Type 2, or false negative, error occurs when a researcher accepts the null hypothesis when it is actually false. For instance, assume again that our null hypothesis was that a particular independent variable has no significant relationship with the dependent variable. If the researcher accepts the null hypothesis and concludes that the independent variable has no significant relationship with the dependent variable when in reality there is a relationship, a Type 2 error has occurred (Banerjee et al., 2009).
Regression based prediction
Equation:
Y= b0 + b1X1 + b2X2
Y = -42.2019 + 3.422(15) + 0.390(25)
= -42.2019 + 51.33 + 9.75
= 18.878
Therefore, with 15 years of education and 25 years of experience, the predicted earnings will be $18.878 per hour.
In the equation above:
Y = Dependent variable, i.e. earnings per hour
b0 = Intercept term
b1 = Slope coefficient of education (independent variable)
b2 = Slope coefficient of experience (independent variable)
X1 = Estimated years of education
X2 = Estimated years of experience
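The prediction above is straightforward arithmetic and can be sketched as:

```python
# Fitted intercept and slope coefficients (from the Appendix 1 output)
b0, b1, b2 = -42.2019, 3.422, 0.390

education_years, experience_years = 15, 25
predicted = b0 + b1 * education_years + b2 * experience_years
print(f"predicted earnings per hour: ${predicted:.3f}")
```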
Other possible independent variables
In addition to years of education and years of experience, we could also use the number of projects initiated at the previous job, the size of the workforce handled as a team leader, and days of leave taken on the job as three further independent variables, and test their relationship with the dependent variable. While the current independent variables, years of education and years of experience, explain nearly 80% of the total variation in earnings per hour, these three variables may help explain part of the remaining 20% of the total variation.
As for the number of projects initiated and the size of the workforce handled, these two variables would help the employer gauge the leadership potential of the individual and would help in ascertaining an appropriate level of earnings per hour for the employee. Therefore, the number of projects initiated and the size of the workforce handled should share a positive relationship with the dependent variable. On the other hand, attendance at work, which can be measured by the number of days in a year an employee was absent from work, would help the employer learn about the employee's discipline at work. The more days of absenteeism, the lower the earnings per hour for the individual; therefore, this variable should share a negative relationship with the dependent variable.
Log-linear multiple regression model
This is another way of regressing the dependent variable on the independent variables. As part of the log-linear model, the current observations are transformed to log-based observations and the regression is then performed. Importantly, the log-linear model is particularly suited to data sets that display exponential growth or decline, such as financial data (Montgomery, Peck and Vining, 2012). While the output of the log-linear model is available in Appendix 2, below we have performed hypothesis testing using the F-test and t-test to test the significance of both independent variables in explaining the variation in the dependent variable.
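A minimal sketch of the log-linear approach, again on a hypothetical data set (the paper's actual data set is in Appendix 2): the dependent variable is log-transformed before fitting, and predictions are exponentiated back to dollars per hour:

```python
import numpy as np

# Entirely hypothetical sample (the paper's actual data set is in Appendix 2):
education  = np.array([12.0, 16.0, 14.0, 18.0, 10.0, 15.0])
experience = np.array([ 5.0, 10.0,  8.0, 15.0,  3.0, 12.0])
earnings   = np.array([15.0, 28.0, 20.0, 35.0, 10.0, 27.0])

# Log-linear model: ln(Y) = b0 + b1*X1 + b2*X2, so transform Y before fitting
log_earnings = np.log(earnings)
X = np.column_stack([np.ones_like(education), education, experience])
coeffs, *_ = np.linalg.lstsq(X, log_earnings, rcond=None)

# Predictions come back in log units; exponentiate to recover dollars per hour
predicted = np.exp(X @ coeffs)
```

Under this model, each coefficient is interpreted approximately as the proportional change in earnings per additional year, rather than a dollar change.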
F-test
i) Null Hypothesis(H0): b1= b2= 0
ii) Alternative Hypothesis(Ha): At least one of b1 or b2 ≠ 0
iii) F-Statistic: MSR/MSE
MSR= Mean regression sum of squares
MSE= Mean squared error
Referring to Appendix 2, we found that MSR= 10.11826 and MSE= 0.0649, therefore:
F-Statistic: 10.11826/0.0649= 155.8158
iv) Critical Value:
While performing the linear regression, we determined the critical value, at the 5% level of significance with 2 degrees of freedom in the numerator and 77 in the denominator, to be 3.115. Here too, the F-statistic (155.8158) is greater than the critical value. Hence, we reject the null hypothesis and conclude that at least one of the independent variables contributes significantly to explaining the variation in the dependent variable. This result aligns with the outcome of the linear regression model.
T-test
-Testing significance of Education at the 5% level of significance
i) Null Hypothesis: b1= 0
ii) Alternative Hypothesis: b1 ≠0
iii) T-Statistic: (Coefficient-Hypothesized Value)/ Standard Error
=(3.0758-0)/0.2464
=12.48
iv) Degree of freedom: n-k-1= 80-2-1= 77
v) Critical Value at 77 degrees of freedom at 5% for a two tailed test: 1.99
Since the t-statistic falls outside the critical region between -1.99 and +1.99, we reject the null hypothesis and conclude that b1, i.e. education, has a significant relationship with the dependent variable at the 5% level of significance.
-Testing significance of Experience at the 5% level of significance
i) Null Hypothesis: b2= 0
ii) Alternative Hypothesis: b2 ≠0
iii) T-Statistic: (Coefficient-Hypothesized Value)/ Standard Error
=(0.216184-0)/0.03262
=6.62
iv) Degree of freedom: n-k-1= 80-2-1= 77
v) Critical Value at 77 degrees of freedom at 5% for a two tailed test: 1.99
Here too, the t-statistic falls outside the region between -1.99 and +1.99. Hence, we reject the null hypothesis and conclude that b2, i.e. experience, has a significant relationship with the dependent variable at the 5% level of significance. This outcome also aligns with that of the linear model.
Linear model v/s Log-linear model
As discussed above, both regression models, i.e. the linear model and the log-linear model, have confirmed that the two independent variables are significant in explaining the variation in the dependent variable. In addition, both models have a good fit, with coefficients of determination (adjusted R-square) of 80.26% and 79.67%, respectively. As we can see from the graph for the linear model, the residuals for both variables are scattered fairly evenly above and below the trendline. However, after taking the log of the observations, the regression loses some fit, as the adjusted R-square decreases from 80.26% to 79.67%. Therefore, we recommend that the linear model be followed, as it offers a better fit of the regression line than the log-linear model. Stated otherwise, compared to the log-linear model, the linear model explains a larger percentage of the variation in the dependent variable and should thus be preferred.
Conclusion
In conclusion, while the linear and log-linear models both confirm that years of education and years of experience are significant in explaining the variation in the dependent variable, i.e. earnings per hour, the former provides a better adjusted R-square and thus signals a better fit than the log-linear model. We therefore recommend using the linear model to perform the regression analysis of the dependent variable against the two independent variables. In addition, we recommend adding three other independent variables, namely the number of projects initiated, the size of the workforce handled as a team leader, and days of leave taken on the job, in order to ascertain the relationship of earnings per hour with these variables.
References
Banerjee, A., Chitnis, U., Jadhav, S., Bhawalkar, J. and Chaudhury, S. (2009). Hypothesis testing, type I and type II errors. [online] Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996198/ [Accessed 4 Jan. 2017].
Draper, N. and Smith, H. (1981). Applied regression analysis. 1st ed. New York: Wiley.
Dufour, J. (2011). Coefficients of determination. [online] McGill University, pp.1-4. Available at: https://www2.cirano.qc.ca/~dufourj/Web_Site/ResE/Dufour_1983_R2_W.pdf [Accessed 4 Jan. 2017].
Kasza, J. and Wolfe, R. (2013). Interpretation of commonly used statistical regression models. Respirology, 19(1), pp.14-21.
Montgomery, D., Peck, E. and Vining, G. (2012). Introduction to Linear Regression Analysis. 5th ed. John Wiley and Sons.
Appendix 1: Linear Multiple Regression Model
Appendix 2: Log-Linear Multiple Regression Model