Introduction
In this paper we analyze data collected by the National Household Education Surveys Program (NHES), which conducts random telephone surveys in the United States. In 2005 the following surveys were performed:
- Early Childhood Program Participation Survey. This survey was related to educational programs and nonparental care arrangements of preschool children.
- After-School Programs and Activities Survey. This survey was about the care of middle-school children during the after-school hours.
- Adult Education Survey. This survey collected data on adults' educational activities.
Thus, there were three populations of interest: children younger than 6 years, school-age children, and adults (older than 16 years). The data set contains the respondents' answers to the questionnaires. The questions covered household characteristics, characteristics of the mother and father, the selection of care, parents' perceptions of care, and similar topics. There were 11684 observations in total.
Descriptive Statistics
As usual, the statistical study begins with descriptive statistics. The full set of observations or experimental results is the raw material of any statistical study, and the simplest way to process it is to calculate a group of basic measures known as descriptive statistics. Descriptive statistics summarize the initial results obtained by observation or experiment. The calculations come down to grouping the data by value, constructing the frequency distribution, identifying the central tendency of the distribution, and finally estimating the variance of the data around that central tendency.
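The steps named above (frequency distribution, central tendency, variance) can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name describe and the grade codes are hypothetical and are not taken from the NHES file.

```python
import statistics
from collections import Counter

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),
        "frequencies": dict(sorted(Counter(values).items())),
    }

# Hypothetical grade codes, NOT taken from the NHES data.
grades = [1, 2, 2, 1, 3, 2, 4, 1, 2, 3]
summary = describe(grades)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The frequency table corresponds to grouping the data by value, the mean, median and mode describe the central tendency, and the variance measures the spread around it.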
We begin with the descriptive statistics of our data sample:
The next step is to construct a multiple regression model to explain grades in school.
We take SEGRADES as the dependent variable and all other variables as independent variables and fit a regression model. The output is below:
According to the model summary, the coefficient of determination R-squared is very low (0.122). This means that only 12.2% of the variance in the School Grades variable is explained by this model. The adjusted R-squared is 0.12. The model is therefore of little use for predicting the dependent variable we have chosen.
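The two quantities from the model summary can be reproduced outside SPSS. The following is a minimal sketch on synthetic data, assuming numpy is available; the function name r_squared and the simulated predictors are our own and do not come from the SPSS output.

```python
import numpy as np

def r_squared(X, y):
    """Fit OLS with an intercept and return (R^2, adjusted R^2)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])          # add the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)  # least-squares coefficients
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)              # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())        # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)  # penalty for k predictors
    return r2, adj_r2

# Synthetic data: one weak predictor, two irrelevant ones, a lot of noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 0.3 * X[:, 0] + rng.normal(size=500)
r2, adj_r2 = r_squared(X, y)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

The adjusted value is always slightly below R-squared because it corrects for the number of predictors; with a large sample and few predictors the two are nearly equal, as in the model summary above.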
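However, the ANOVA results give a p-value below 0.001 (F-value of 62.102), which means that the model as a whole is significant at the 1% level: at least one independent variable has a nonzero linear relationship with school grades. With a sample this large, even a weak relationship is statistically significant, so this result does not contradict the low R-squared.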
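The link between R-squared, sample size and the overall F statistic can be made explicit with the standard formula F = (R²/k) / ((1 − R²)/(n − k − 1)). The sketch below is illustrative only: the number of predictors k = 26 is a hypothetical value (the SPSS table with the actual count is not reproduced here), and we assume that all 11684 observations entered the regression.

```python
def f_statistic(r2, n, k):
    """Overall F statistic of an OLS model with k predictors and n cases."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

n = 11684   # observations reported in the introduction (assumed all used)
k = 26      # ASSUMPTION: hypothetical number of predictors
f_large = f_statistic(0.122, n, k)
f_small = f_statistic(0.122, 100, k)   # same R^2 in a hypothetical sample of 100
print(f"n = {n}: F = {f_large:.1f}")
print(f"n = 100:   F = {f_small:.2f}")
```

Under these assumptions the same R-squared of 0.122 yields a very large F for about twelve thousand cases and an F below 1 for one hundred cases, which is why a significant ANOVA says little about predictive quality.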
The most informative table is the last one, which shows the significance of each independent variable. We set the significance level at 5% and look at the p-value of each variable. If the p-value is higher than 0.05, we exclude the variable from the model. The following variables remain:
CHILD'S SEX
PW19-TOTAL HH INCOME RANGE 2
D-EDUC ATTAINMT OF CHILD'S FATHER/GUARD
D-PARENTS IN HH/ INCLUDES SAMESEX HH
SE3-TCHRS CONTACT FAM RE BEH PRBLMS
SE4-TCHRS CONTACT FAM RE SCH WORK PRBLMS
SE5-IN-SCHOOL SUSPENSION
SE5-HAS CHILD EVER BEEN EXPELLED?
SI1-PARTICIPATING IN ANY ACTIVITIES
SK2-HOMEWORK/EDUCATIONAL/READING
SK2-COMPUTERS
SK2-ARTS
SK2-CHORES/WORK
SK2-INDOOR PLAY
D-CHILD CURRENTLY HAS A DISABILITY
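The selection rule described above (drop every variable with a p-value above 0.05, then refit) can be sketched as follows. This is not the SPSS procedure itself: the function names are hypothetical, the data are synthetic, and the p-values use a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Two-sided p-values of the slopes (normal approximation, large n)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = float(residuals @ residuals) / (n - k - 1)   # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)       # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    # erfc(|t| / sqrt(2)) is the two-sided tail probability of a standard normal
    p = [math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats]
    return p[1:]                                          # skip the intercept

def drop_insignificant(X, y, names, alpha=0.05):
    """Refit repeatedly, removing every predictor whose p-value exceeds alpha."""
    names = list(names)
    while names:
        p = ols_p_values(X, y)
        keep = [i for i, value in enumerate(p) if value <= alpha]
        if len(keep) == len(names):
            break                                         # everything is significant
        X = X[:, keep]
        names = [names[i] for i in keep]
    return names

# Synthetic data: only x0 and x2 truly influence y.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
y = 1.0 * X[:, 0] - 0.8 * X[:, 2] + rng.normal(size=2000)
kept = drop_insignificant(X, y, ["x0", "x1", "x2", "x3"])
print(kept)
```

Because each test is run at the 5% level, an irrelevant variable can occasionally survive by chance, so the list of retained variables should be read with that in mind.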
Now we run the same regression analysis with the remaining variables:
Three variables have a p-value higher than 0.05 and can be excluded from the model, after which the procedure is repeated once more. The result is below:
All variables in this model are significant. However, the coefficient of determination is still low (0.117), so the model is still poor at making predictions.
Let’s check the assumptions for this model.
We start with multicollinearity:
Variables with a VIF higher than 10 are affected by multicollinearity and should be excluded from the model. We rerun the regression without the variables marked in yellow.
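The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. A minimal sketch with numpy, using synthetic data and a hypothetical function name vif, shows how the threshold of 10 flags a nearly duplicated variable:

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of X."""
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ beta
        r2 = 1.0 - (residuals @ residuals) / ((target - target.mean()) ** 2).sum()
        factors.append(float(1.0 / (1.0 - r2)))
    return factors

# Synthetic predictors: b is almost a copy of a, c is unrelated.
rng = np.random.default_rng(2)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)
c = rng.normal(size=1000)
values = vif(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], values):
    flag = "exclude" if value > 10 else "keep"
    print(f"{name}: VIF = {value:.1f} ({flag})")
```

In this example the two nearly identical columns receive a VIF far above 10, while the unrelated column stays close to 1.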
The next step is testing for normality and heteroskedasticity. To do this, we plot the residuals:
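For readers without SPSS, the coordinates of a normal P-P plot can be computed directly: the observed cumulative proportion of each sorted residual is compared with the cumulative probability expected under a normal distribution with the same mean and standard deviation. The sketch below uses only the standard library; the function names and both samples are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def pp_points(residuals):
    """Return (expected, observed) cumulative probabilities for a normal P-P plot."""
    normal = NormalDist(mean(residuals), stdev(residuals))
    ordered = sorted(residuals)
    n = len(ordered)
    observed = [(i + 0.5) / n for i in range(n)]   # empirical cumulative proportion
    expected = [normal.cdf(r) for r in ordered]    # probability under the fitted normal
    return expected, observed

def max_gap(residuals):
    """Largest vertical distance of the P-P points from the diagonal."""
    expected, observed = pp_points(residuals)
    return max(abs(e - o) for e, o in zip(expected, observed))

# Hypothetical residuals: one sample close to normal, one strongly right-skewed.
near_normal = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
skewed = [math.exp(x) for x in near_normal]
print(f"near-normal sample: max gap = {max_gap(near_normal):.3f}")
print(f"skewed sample:      max gap = {max_gap(skewed):.3f}")
```

Points that stay close to the diagonal (a small gap) indicate approximately normal residuals; a large gap corresponds to the kind of departure seen in a skewed histogram.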
According to the histogram, the residuals are not normally distributed. The P-P plot also shows that the cumulative probabilities do not lie close to the diagonal reference line. Now we test for heteroskedasticity statistically using the Breusch-Pagan test and the Koenker test (using a special macro from spsstools.net):
Run MATRIX procedure:
BP&K TESTS
==========
Bytes requested = 1092126848
>Error encountered in source line # 462
>Error # 12477
>MATRIX could not allocate memory for an object. Reduce problem size, or
>release unused matrices using RELEASE statement. Use DISPLAY statement to
>list all allocated objects. .
>This command not executed.
(The same error # 12477, with Bytes requested = 1092126848, is reported again for source line # 463. Source lines # 464 to # 481 then fail with error # 12492, "An attempt has been made to use previously undefined matrix (or scalar)", for the matrices II_1, M0, TSS, REGSS, R_SQ, BP_TEST, K_TEST and SIG, accompanied by the follow-on errors # 12347, # 12343, # 12332 and # 12428. The only values printed are the following.)
Residual SS
10341,79
Sample size (N)
****
1
------ END MATRIX -----
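The error appeared because the data set is very large. The requested 1092126848 bytes equal 11684 × 11684 × 8, that is, one square matrix of 8-byte numbers with a row and a column for each of the 11684 observations. The macro could not allocate its working matrices, so neither test statistic was computed.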
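Both statistics can also be obtained from an auxiliary regression of the squared residuals on the predictors, which needs only arrays with one row per observation and no square matrix of that size. The following is a sketch under stated assumptions, not a replacement for the spsstools.net macro: the function names are ours, the data are synthetic, and only the sample size of 11684 is taken from this paper.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals of an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def bp_koenker(X, residuals):
    """Breusch-Pagan and Koenker statistics from an auxiliary regression.

    Only n x (k + 1) arrays are created, never an n x n matrix.
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    e2 = residuals ** 2
    g = e2 / e2.mean()                        # squared residuals scaled by sigma^2
    beta, *_ = np.linalg.lstsq(design, g, rcond=None)
    fitted = design @ beta
    explained_ss = float(((fitted - g.mean()) ** 2).sum())
    total_ss = float(((g - g.mean()) ** 2).sum())
    bp = 0.5 * explained_ss                   # Breusch-Pagan LM statistic
    koenker = n * explained_ss / total_ss     # n * R^2 of the auxiliary regression
    return bp, koenker

# Synthetic data with the sample size of this study (values are NOT NHES data).
rng = np.random.default_rng(3)
n = 11684
X = rng.uniform(1.0, 5.0, size=(n, 2))
noise = rng.normal(size=n)
y_homo = 1.0 + X[:, 0] + noise                # constant error variance
y_hetero = 1.0 + X[:, 0] + X[:, 1] * noise    # error spread grows with X[:, 1]

bp_homo, k_homo = bp_koenker(X, ols_residuals(X, y_homo))
bp_hetero, k_hetero = bp_koenker(X, ols_residuals(X, y_hetero))
print(f"homoskedastic:   BP = {bp_homo:.2f}, Koenker = {k_homo:.2f}")
print(f"heteroskedastic: BP = {bp_hetero:.2f}, Koenker = {k_hetero:.2f}")
```

Each statistic is compared with a chi-square distribution whose degrees of freedom equal the number of predictors; with the two predictors of this example the 5% critical value is about 5.99.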
Conclusions
After all these procedures, the goodness of fit of the model is still very low: R-squared is 0.092, so only 9.2% of the variance in school grades is explained by the model. The model is nevertheless significant according to the ANOVA results, all coefficients are significant, and multicollinearity does not affect the model. The residuals, however, do not appear to be normally distributed, and heteroskedasticity could not be tested.
The most likely reason for the low R-squared is that the initial choice of variables was inadequate: some important variables were not included in the model.
Sources
Argyrous, G. Statistics for Research: With a Guide to SPSS, SAGE, London, ISBN 1-4129-1948-7