Research Question
Questions about health cannot be answered on the basis of knowledge and experience alone. In other words, physiology, biochemistry, and clinical experience are not enough to draw conclusions about the factors that cause changes in the human body. Medical science nevertheless requires credible answers to numerous questions. These answers are obtained through statistical analysis of large datasets containing patients’ social and medical characteristics. The larger the dataset, the more reliable the conclusions.
The dataset used for the study contains 17 variables and 5209 observations. With a sample of this size, the sample statistics can be treated as close approximations of the population parameters, so the results can be used to draw conclusions about the population.
The research question concerns the relation of cholesterol level to other factors. It is commonly accepted that a high cholesterol level is caused by obesity. The study tests whether the relation of cholesterol level to weight and systolic pressure is significant. Systolic pressure is often, though not always, associated with obesity, so both variables are considered. Therefore, the research question is: “Is there a significant relation between cholesterol level, systolic pressure, and weight?” If a significant relation is found, a regression equation can be built. The cholesterol level is the dependent variable, while weight and systolic pressure are the independent variables. The regression equation is obtained from the regression analysis.
The hypothesis is tested with inferential statistics. The null hypothesis is that there is no relation between cholesterol level and weight or systolic pressure. The alternative hypothesis is the opposite: there is a significant relation between the variables. The decision is based on the p-value at the 0.05 significance level. If the p-value of the test is greater than 0.05, the null hypothesis is not rejected; if p < 0.05, the null hypothesis is rejected in favor of the alternative hypothesis.
Sample descriptive statistics
The variables of interest should be examined before running the inferential statistics. Descriptive statistics were therefore computed for cholesterol level, weight, and systolic pressure. Table 1 presents the summary descriptive statistics, and Table 2 lists the quantile values.
Table 1. Descriptive statistics of cholesterol level, weight, and systolic pressure
Table 2. Minimum, maximum, and quantile values for cholesterol level, weight, and systolic pressure
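The summary statistics and quantiles described here can be reproduced for the three variables of interest alone. The following is a minimal sketch using PROC MEANS, assuming the SASHELP.HEART data set listed in the SAS Code section; the selection of statistics and the MAXDEC= rounding are illustrative choices, not taken from the original analysis.

```sas
/* Summary statistics for the three variables of interest.      */
/* N and NMISS show how many observations are available/missing. */
PROC MEANS DATA=SASHELP.HEART
           N NMISS MEAN MEDIAN MODE STD VAR SKEWNESS KURTOSIS
           MAXDEC=2;
    VAR Cholesterol Weight Systolic;
RUN;

/* Minimum, maximum, and selected quantiles (illustrative choice). */
PROC MEANS DATA=SASHELP.HEART
           MIN P5 P10 Q1 MEDIAN Q3 P90 P95 MAX
           MAXDEC=1;
    VAR Cholesterol Weight Systolic;
RUN;
```

Restricting the VAR statement to three variables keeps the output short compared with the full PROC UNIVARIATE listing.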
The descriptive statistics show that the three variables have different numbers of observations (N), which means that cholesterol level is not available for every observation. However, the number of observations for cholesterol level exceeds 5000, which is enough to draw reliable statistical conclusions.
The measures of central tendency are the mean, median, and mode. The mean and median values are close for all three variables, whereas the mode values are lower. This indicates that many people have values below the mean. Lower values indicate better health, so a large share of the people in the sample are healthy in terms of pressure, weight, and cholesterol level. It should be noted that a systolic pressure of 120 is a normal, healthy pressure. The mode weight of 138 pounds (about 63 kg) is also a healthy weight for a woman or a man of average height.
The variance and standard deviation indicate how widely the data are scattered around the mean. For all three variables, the standard deviation is about 15-20% of the mean. There is also a large difference between the minimum and maximum values, which indicates that the sample is heterogeneous.
The skewness values for all three variables are positive. The distributions of cholesterol level and weight are approximately symmetric (Skewness < 1), while systolic pressure has a longer right tail (Skewness > 1), which indicates that a minority of people have pressures far above the mean.
The kurtosis values for cholesterol level and weight are in the range expected for normal distributions (Kurtosis < 3), while the systolic pressure distribution is more peaked than the normal distribution. However, for the purposes of this analysis all three distributions can be approximated as normal, and the tests applicable to normal distributions can be used.
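The normality assumption discussed above can also be checked directly. The sketch below is one possible way to do so, assuming the SASHELP.HEART data set from the SAS Code section; the plots and options are illustrative and were not part of the original analysis.

```sas
/* Normality diagnostics for the three variables of interest.     */
/* Note: PROC UNIVARIATE reports excess kurtosis, so a normal     */
/* distribution has kurtosis 0 (not 3) in this output.            */
PROC UNIVARIATE DATA=SASHELP.HEART NORMAL;
    VAR Cholesterol Weight Systolic;
    /* Histogram with a fitted normal curve overlaid */
    HISTOGRAM Cholesterol Weight Systolic / NORMAL;
    /* Q-Q plot against a normal with estimated mean and std. dev. */
    QQPLOT Cholesterol Weight Systolic / NORMAL(MU=EST SIGMA=EST);
RUN;
```

With more than 2000 observations, SAS does not compute the Shapiro-Wilk statistic, so the NORMAL option reports only the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests. In samples of this size these tests tend to reject normality even for small deviations, so the plots are usually the more informative check.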
The quantiles indicate that some people have extreme values: for example, 5% have a systolic pressure above 180, 5% weigh more than 203 pounds, and 5% have a cholesterol level above 308.
Inferential statistics
The inferential test applied is a regression ANOVA with two independent variables entered as main effects (the model contains no interaction term). The proposed model is as follows:
Cholesterol Level = A ∙ Systolic + B ∙ Weight + C.
The number of values used for the test is 5051. Two tests were run: one for females, and the other for males.
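The code listed in the SAS Code section fits a single model to the whole sample. If separate fits for females and males are wanted, one possible sketch is shown below. It assumes the Sex variable of SASHELP.HEART; the data set name work.heart_sorted is a hypothetical name introduced for illustration.

```sas
/* BY-group processing requires the data to be sorted by the BY variable. */
PROC SORT DATA=SASHELP.HEART OUT=work.heart_sorted;
    BY Sex;
RUN;

/* Fit the same regression model separately for each value of Sex. */
PROC GLM DATA=work.heart_sorted;
    BY Sex;
    MODEL Cholesterol = Systolic Weight;
RUN;
QUIT;
```

Each BY group produces its own ANOVA table, fit statistics, and parameter estimates, so the coefficients for the two groups can be compared side by side.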
Table 3. ANOVA Data for Cholesterol Level Model
Table 4. Goodness of Fit for Cholesterol Level Model
Table 5. Regression
Table 6. Sum of Squares at Model Reduction (Sequential Design)
Table 7. Sum of Squares at Model Reduction (Unbalanced Design)
Discussion
The ANOVA results (Table 3) are used to assess the validity of the model. The regression model is statistically significant since its p-value is below 0.05 (0.0001 < 0.05).
The coefficient values are presented in Table 5. The regression model is:
Cholesterol Level = 0.37 ∙ Systolic + 0.034 ∙ Weight + 172.01.
The coefficients of the model have to be tested for significance. For each coefficient, the null hypothesis states that the coefficient equals zero, whereas the alternative hypothesis states that it differs from zero. The decision is based on the p-values. For the intercept and systolic pressure, p < 0.05, so these coefficients differ significantly from zero. For the weight coefficient, p > 0.05, so it cannot be distinguished from zero and the term is dropped. Taking the significance of the coefficients into account, the modified model is:
Cholesterol Level = 0.37 ∙ Systolic + 172.01.
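In the model, 172.01 is the intercept, that is, the cholesterol level predicted at a systolic pressure of zero. This value lies outside the observed range of pressures, so it serves as a baseline rather than as an attainable minimum. Each additional unit of systolic pressure adds 0.37 units to the predicted cholesterol level. Hence, if the systolic pressure increases by 10, the predicted cholesterol level increases by 3.7.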
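The arithmetic of the reduced model can be illustrated with a short DATA step. This is a sketch only: the coefficients 172.01 and 0.37 come from the model above, while the three pressure values and the names work.pred and Predicted_Chol are hypothetical and chosen for illustration.

```sas
/* Apply the reduced model to a few example systolic pressures. */
DATA work.pred;
    INPUT Systolic;
    /* Reduced model: Cholesterol Level = 0.37 * Systolic + 172.01 */
    Predicted_Chol = 172.01 + 0.37 * Systolic;
    DATALINES;
120
130
180
;
RUN;

PROC PRINT DATA=work.pred NOOBS;
RUN;
```

For these inputs the model gives 216.41, 220.11, and 238.61, respectively. The difference between the first two values is 3.7, which corresponds to the 10-unit increase in systolic pressure.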
The quantiles table (Table 2) indicates that 90% of people have a cholesterol level above 174, so the values predicted by the regression model for realistic pressures lie within the observed range.
The quantitative characteristics of goodness of fit are presented in Table 4. The goodness-of-fit analysis indicates that the model explains only 4% of the variability in cholesterol level: the R2-value is 0.04, which is low (for a well-fitting model, R2 is equal or close to 1). Despite the low R2-value, the model is statistically significant; it can be used to predict the average cholesterol level, although predictions for individual people will be imprecise.
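The goodness-of-fit values can be captured in a data set for reporting. The sketch below assumes the same model as in the SAS Code section and uses the ODS table named FitStatistics produced by PROC GLM; the output data set name work.fit is a hypothetical name.

```sas
/* Store the fit statistics (R-square, coefficient of variation, */
/* root MSE, dependent mean) in a data set.                      */
ODS OUTPUT FitStatistics=work.fit;
PROC GLM DATA=SASHELP.HEART;
    MODEL Cholesterol = Systolic Weight;
RUN;
QUIT;

PROC PRINT DATA=work.fit NOOBS;
RUN;
```

R-square is the model sum of squares divided by the corrected total sum of squares, so a value of 0.04 means that about 96% of the variation in cholesterol level is left unexplained by the two predictors.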
The analysis shows that weight is not a significant factor in predicting cholesterol level once systolic pressure is in the model. Systolic pressure, on the other hand, is significantly associated with cholesterol level. The estimates are stable because they are based on a large number of observations.
The model reduction designs indicate the significance level of each variable. For systolic pressure, p < 0.0001 < 0.05, while for the weight variable p = 0.13 > 0.05. These values are consistent across the two designs (Tables 6 and 7). Type I (sequential) sums of squares test each variable after adjusting only for the variables entered before it, whereas Type III sums of squares test each variable after adjusting for all other variables and are typically preferred for unbalanced data. For the studied data set, the Type I sequential design is used, because the data set was reduced to the observations in which both variables are available.
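Both kinds of sums of squares can be requested explicitly in PROC GLM. The following sketch assumes the model from the SAS Code section; the SS1 and SS3 options are standard MODEL statement options, and PROC GLM prints both tables by default even without them.

```sas
/* Request Type I (sequential) and Type III (partial) sums of squares. */
/* Type I results depend on the order of the terms in the MODEL        */
/* statement; Type III results do not.                                 */
PROC GLM DATA=SASHELP.HEART;
    MODEL Cholesterol = Systolic Weight / SS1 SS3;
RUN;
QUIT;
```

Because Weight is entered last, its Type I and Type III tests coincide: in both cases it is adjusted for Systolic. Swapping the order of the two variables in the MODEL statement shows how the sequential results change.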
The significance analysis of the reduction designs indicates that the weight variable is not useful for predicting cholesterol level, while systolic pressure can be used.
A possible explanation is that weight is correlated with systolic pressure; for example, a high body weight may cause a high systolic pressure. In that case, weight adds little information beyond systolic pressure, and there is no reason to include both variables in the cholesterol level equation.
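The suggested correlation between weight and systolic pressure can be examined directly. The sketch below is one possible check, assuming the SASHELP.HEART data set; it was not part of the original analysis.

```sas
/* Pairwise Pearson correlations among the three variables. */
PROC CORR DATA=SASHELP.HEART;
    VAR Cholesterol Systolic Weight;
RUN;

/* Variance inflation factors for the two predictors. */
PROC REG DATA=SASHELP.HEART;
    MODEL Cholesterol = Systolic Weight / VIF;
RUN;
QUIT;
```

A sizeable correlation between Systolic and Weight, or variance inflation factors well above 1, would support the explanation that the two predictors carry overlapping information.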
The research question was tested using a data set with 5051 observations. Descriptive statistics for cholesterol level, weight, and systolic pressure were obtained. Then, regression analysis was applied to the data. The analysis included ANOVA, goodness of fit, regression coefficients, and model reduction designs. The sequential reduction design was applied. It was found that the cholesterol level can be predicted from the systolic pressure values. The regression model is:
Cholesterol Level = 0.37 ∙ Systolic + 172.01.
The model is significant, yet the R2-value is 0.04, which indicates that the systolic pressure is able to explain only 4% of the variation in cholesterol level.
SAS Code and Output
SAS Code
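/* Inspect the structure of the data set */
PROC CONTENTS DATA = SASHELP.HEART;
RUN;

/* Descriptive statistics for the numeric variables */
PROC UNIVARIATE DATA = SASHELP.HEART;
    VAR AgeAtDeath AgeAtStart AgeCHDdiag Cholesterol Diastolic Height MRW Systolic Weight;
RUN;

/* Frequency tables for the categorical variables */
PROC FREQ DATA = SASHELP.HEART;
    TABLES BP_Status Chol_Status DeathCause Sex Smoking Smoking_Status Status Weight_Status;
RUN;

/* Regression of cholesterol level on systolic pressure and weight */
PROC GLM DATA = SASHELP.HEART;
    MODEL Cholesterol = Systolic Weight;
RUN;
QUIT;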
SAS Output
Descriptive Statistics
The UNIVARIATE Procedure
Variables: AgeAtDeath (Age at Death), AgeAtStart (Age at Start), AgeCHDdiag (Age CHD Diagnosed), Cholesterol, Diastolic, Height, MRW (Metropolitan Relative Weight), Systolic, Weight
ANOVA test Inferential Statistics