Findings
Based on the 2010 HSE data, we carried out a statistical analysis of the variables that could be contributing factors to asthma. Salary, smoking habits, ethnicity, drinking habits and sex were analyzed. Salary was found to be negatively correlated with asthma; in simple words, as the income of a household increases, the chances of asthma reduce. Drinking was also found to be a contributing factor: heavy drinkers are less prone to asthma than non-drinkers. There was not enough evidence that smoking contributes to asthma, and asthma was not found to be biased towards either sex. The data for some of the ethnic groups were not sufficient to conclude that a relationship exists, but ethnicity may also be a contributing factor in whether a person has a higher chance of asthma.
Introduction
The Health Survey for England (HSE) started in 1991, and is a series of cross-sectional annual surveys sponsored by the Department of Health. It collects information about the health of people living in England and combines questionnaire-based answers from a face-to-face interview with physical measurements taken by a trained nurse and the analysis of blood samples. The data for this project come from the 2010 HSE. In 2010 a total of 8,420 adults and 5,692 children were interviewed.
Participants were selected using a random probability sample. The survey design ensures that every address has an equal chance of being included in the survey and the results are representative of the English population living in private households. All persons at the selected address were eligible for interview. All adults were selected for interview.
Dependent and Independent Variables (Explanatory Variables)
Data were collected on the following parameters:
- Asthma (Dependent variable): The survey asked whether the person had ever been diagnosed with asthma. The respondent has asthma now, had it in the past, or never had it.
- Income (Explanatory variable): The survey also asked about the total annual household income.
- Income Band (Derived Explanatory variable): The data were divided into three distinct bands: Lowest Tertile (less than £15,600), Middle Tertile (between £15,600 and £33,050.85) and Highest Tertile (above £33,050.85).
- Ethnicity (Derived Explanatory variable): White, Black, Mixed, Asian, Other.
- Cigarette Smoking Band (Derived Explanatory variable, derived from the daily cigarette count DALCIG): categorized as never regular smoker, past regular smoker and current regular smoker.
- Sex (Explanatory variable): male or female
- Drinker (Explanatory variable)
Data were also collected for more detailed ethnic groups. However, for our analysis we will use only the above 8 explanatory variables and try to find out which of them are able to explain the dependent variable (Asthma).
A few more variables were collected as part of the survey:
- Condr (diagnosed with asthma): This is almost the same variable as asthma. If past and current asthma are put in the diagnosed category, it becomes identical to CONDR, so I have excluded condr.
- Origin: This gives a more detailed description of ethnicity. However, I have not included it, as the ethnicity data are sufficient to see whether any relationship exists between ethnicity and asthma.
- TopQual2 and TopQual3: I have excluded the qualification variables from the analysis, as education was assumed to have no direct bearing on a person's health, and including it could introduce unnecessary error and clutter the other explanatory variables.
Initial Data Analysis
First we will make some assumptions and check whether the data provide any evidence for them.
- Assumption 1: Asthma and salary are negatively correlated. This means that when salary goes up, the percentage of asthma patients should go down.
If we plot a scatter diagram, it looks like the one below,
where the X axis is asthma (0 means no asthma, 1 is past or current asthma) and the Y axis shows the salary.
The plot shows that as we go to the right (asthma patients), the points are concentrated at the lower salaries and there is much less representation from the higher salaries. There seems to be a negative correlation between salary and asthma.
If we do a correlation test, we get the result below.
The correlation test shows that there is a negative correlation between the two. However, the value is not very large, so the relationship does not seem very strong. Let us look at the table below, which shows the salary band vs. asthma analysis.
The table shows that in the data set 18% of the low income group have or have had asthma. The figure is 17% for the mid income group and only 14.84% for the high income group. The three groups are also almost the same size. So the data suggest that the chances of asthma reduce as income increases.
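The band-by-band percentages in such a table can be computed with a simple tally. The sketch below is in Python rather than the R used for this report, and it runs on made-up toy records, not on the HSE data; the function name and the band labels are illustrative assumptions.

```python
from collections import defaultdict

def asthma_rate_by_band(records):
    """Return {band: share of respondents with past or current asthma}."""
    counts = defaultdict(lambda: [0, 0])  # band -> [asthma cases, respondents]
    for band, has_asthma in records:
        counts[band][0] += has_asthma
        counts[band][1] += 1
    return {band: cases / total for band, (cases, total) in counts.items()}

# Made-up toy records of (income band, asthma flag); NOT the HSE data.
toy_records = (
    [("low", 1)] * 2 + [("low", 0)] * 8
    + [("mid", 1)] * 1 + [("mid", 0)] * 9
    + [("high", 1)] * 1 + [("high", 0)] * 19
)

rates = asthma_rate_by_band(toy_records)
for band in ("low", "mid", "high"):
    print(f"{band}: {rates[band]:.1%}")
```

Comparing the shares rather than the raw counts keeps the bands comparable even when they differ in size.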
- Assumption 2: Drinking and asthma are positively correlated.
If we do a correlation test between the two, we get the result below.
It shows a negative correlation between the two, which is contrary to our assumption. We will investigate further to see whether drinking is a significant variable for explaining asthma at all.
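For reference, the correlation between a 0/1 variable such as asthma and another variable can be computed with the ordinary Pearson formula. The following is a minimal Python sketch (the report itself used R); the toy values are invented for illustration and are not taken from the survey.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented toy data: asthma flag (0/1) and annual salary; NOT the HSE data.
toy_asthma = [0, 0, 0, 0, 1, 1, 0, 1]
toy_salary = [42000, 38000, 51000, 29000, 18000, 24000, 33000, 15000]

r = pearson_r(toy_asthma, toy_salary)
print(f"correlation: {r:.3f}")  # negative when asthma cases sit at lower salaries
```

A coefficient near zero indicates a weak linear relationship even when its sign agrees with the assumption, which is why the later model is used to judge significance.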
Generalized Linear Model
We start by assuming a linear relationship between asthma and the independent variables.
For our testing we will take asthma as the dependent variable, and Salary, Cigarette Band (used instead of the raw cigarette count, as some of the counts are outliers and may cause problems in the model), Sex, Drinking habit and Ethnicity as the explanatory variables.
The classical linear model will be
Asthma = C1 + C2*Salary + C3*Cigarette Band + C4*Sex + C5*Drinking habit + C6*Ethnicity
However, in our case the output variable is whether the person interviewed has or had asthma. It can take only the values 0 or 1: 0 = no asthma and 1 = asthma in the past or current asthma.
With a classical linear model Y can take any value, including values that are not feasible for a 0/1 outcome, so estimating asthma with a classical model is not appropriate.
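The problem can be seen with a one-variable version of the linear model. In the Python sketch below the coefficients C1 and C2 are hypothetical values chosen only for illustration, not estimates from the data. The linear prediction drifts below 0 for a high salary, whereas passing the same value through the logistic function always gives a number between 0 and 1.

```python
import math

def linear_prediction(c1, c2, salary):
    """Classical linear model with a single explanatory variable."""
    return c1 + c2 * salary

def logistic(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only.
C1, C2 = 0.30, -0.000005

for salary in (10_000, 50_000, 100_000):
    raw = linear_prediction(C1, C2, salary)
    print(f"salary={salary}: linear={raw:.3f}, logistic={logistic(raw):.3f}")
```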
Choosing a Non Classical Model:
The value of Y (whether a person has asthma or not) is a discrete variable. It can take only two values (0 = no asthma, 1 = past or current asthma). It can be modeled as N trials with success probability S (a Binomial distribution). We have a sample of around 8000, among which, say, 1900 people have a history of asthma (success). This can be compared with tossing a biased coin 8000 times and getting 1900 heads: from that we can estimate the chance of a head the next time we toss the coin.
The asthma data can be modeled in the same way. Given the explanatory variables, we want to know the probability that a person will have asthma. We are not given a definite rate of asthma (a rate based on a huge population); instead, a probability can be estimated from the sample, from which we can predict the probability or the number of asthma patients. That is why we use the Binomial distribution and not the Poisson. We could have used the Poisson as well, treating the estimated probability as the rate of asthma (the assumption would be that N is very large). However, here we will only do the analysis based on the asthma outcome.
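The coin-toss comparison amounts to the following small calculation, sketched here in Python. The counts 8000 and 1900 are the illustrative round numbers used in the text above, not exact survey totals; the standard error formula is the usual one for a binomial proportion.

```python
import math

n_trials = 8000    # approximate sample size quoted in the text
n_successes = 1900  # illustrative number of people with an asthma history

# Estimated probability of "success" (asthma history) for the next trial.
p_hat = n_successes / n_trials

# Standard error of a binomial proportion: sqrt(p * (1 - p) / n).
std_error = math.sqrt(p_hat * (1 - p_hat) / n_trials)

print(f"estimated probability: {p_hat:.4f}")
print(f"standard error: {std_error:.4f}")
```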
The correlations between each continuous variable and its banded version (SALARY with SALBAND, and DALCIG with CIGBAND) are as below.
In both cases there is a strong correlation (0.73 and 0.67, as expected), so for our model we will take only one variable from each pair.
Note: I will choose SALARY and DALCIG, as they are continuous and should give a better estimate of asthma than the discrete variables SALBAND and CIGBAND. However, if some values are extremely large or small compared to the others, they will bias the model. So although I have chosen the continuous variables, it would not be a bad option to check the discrete variables as well.
So our final model will be
logit(p(Salary; Sex; Alcohol; Cigarette; Ethnicity; Age)) = B0 + B1*Salary + B2*Sex + B3*Alcohol + B4*Cigarette + B5*Ethnicity + B6*Age
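To show how this equation turns into a probability, the Python sketch below inverts the logit for a single hypothetical person. All coefficient values and the person's values are invented for illustration; they are not the fitted estimates from the R output. Ethnicity is treated as a numeric code, as it is in this report.

```python
import math

def asthma_probability(coeffs, person):
    """Invert the logit: p = 1 / (1 + exp(-(B0 + sum of Bi * xi)))."""
    z = coeffs["Intercept"] + sum(coeffs[name] * value for name, value in person.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients B0..B6, NOT the fitted values.
toy_coeffs = {
    "Intercept": -1.0,
    "Salary": -0.00001,
    "Sex": 0.05,
    "Alcohol": -0.02,
    "Cigarette": 0.01,
    "Ethnicity": -0.1,
    "Age": -0.005,
}

# Hypothetical respondent.
toy_person = {"Salary": 25000, "Sex": 1, "Alcohol": 2, "Cigarette": 0, "Ethnicity": 1, "Age": 40}

p = asthma_probability(toy_coeffs, toy_person)
print(f"predicted probability of asthma: {p:.3f}")
```

A negative coefficient lowers the log-odds, and therefore the probability, as the corresponding variable increases.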
The basic statistics for the dependent variables are as below
After running the binomial regression in R we get the following result
Minus two times the log-likelihood for this full model fit is 5867.6, which the R output calls the residual deviance. Note that the p-values of the Wald tests for the coefficients of Daily Cigarette and Sex are 0.32323 and 0.31582 respectively, indicating that both of these explanatory variables are redundant in this model: we cannot reject the null hypothesis that their coefficients are zero. In simpler words, for the above model we cannot reject H0: B2 = B4 = 0.
So our new reduced model becomes as below
Minus two times the log-likelihood for this reduced model fit is 5869.5, which the R output calls the residual deviance. The value of the log-likelihood ratio statistic, the difference between the reduced and full model deviances, is X^2 = 1.86. Since the full and reduced models differ by 2 parameters, we can compare this test statistic to a chi-square distribution on 2 degrees of freedom. The p-value for this test is 0.3954. Thus we conclude that there is insufficient evidence that the coefficients B2 and B4 differ from zero. This allows us to settle on a logistic regression model involving only Salary, Alcohol, Ethnicity and Age as the regressors.
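The likelihood ratio test above can be checked by hand. For a chi-square distribution on 2 degrees of freedom the upper-tail probability has the closed form exp(-x/2), so no statistical library is needed. The Python sketch below uses the deviances and the statistic reported above; the small gap between 1.86 and the difference of the rounded deviances (about 1.9) is presumably due to rounding in the printed output.

```python
import math

def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

full_deviance = 5867.6     # residual deviance of the full model
reduced_deviance = 5869.5  # residual deviance of the reduced model

# Difference of the rounded deviances, roughly 1.9.
lr_from_rounded = reduced_deviance - full_deviance

# Statistic as reported in the text.
lr_reported = 1.86
p_value = chi2_upper_tail_2df(lr_reported)

print(f"LR statistic from rounded deviances: {lr_from_rounded:.2f}")
print(f"p-value for reported statistic: {p_value:.4f}")
```

A p-value this far above 0.05 is what justifies dropping Sex and Daily Cigarette from the model.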
Conclusion
Based on the 2010 HSE data, we carried out a statistical analysis of the variables that could be contributing factors to asthma.
Salary was found to be a factor that influences the chances of asthma. The higher a person's income, the less likely they are to have asthma.
Ethnicity seems to have an effect on the chances of asthma. Ethnicity was coded numerically during data cleansing (for example 1 = White, 2 = Mixed, 3 = Asian, etc.), and ethnicity and asthma are negatively related in our generalized linear model. Groups with higher codes (say the Black and Asian populations) therefore have lower chances of asthma, while the White population (1) seems to be more prone to asthma.
Drinking habit is also a contributing factor to the chances of asthma. Surprisingly, the model found that a person who drinks less has a higher chance of having asthma than a heavy drinker.