Final Case Study Project
Introduction
In this paper, we describe and discuss the application of statistics and probability theory to a real-world problem. Our task is to study how a variety of factors influence house prices in Prescott, Arizona. As mentioned in previous studies, there are five lakes in the Prescott metropolitan area that provide a number of benefits to those who live or spend their holidays there. These benefits include scenic beauty, recreational opportunities, clear water, and fresh air. It is assumed that property prices depend on how far the property is from the lakes: easy access to the lakes should have a positive impact on house prices in the Prescott area. On the other hand, a high level of sedimentation in rivers or lakes degrades water quality, because it raises turbidity and prevents sunlight from penetrating the water sufficiently. Hence, sedimentation should be a negative factor in the prices of properties near the lakes.
We are given a data set of 1999 observations. Each observation represents a property in the Prescott metropolitan area in Arizona. The houses in the data set are characterized by nine variables: the sale price and eight explanatory factors. The variables and their definitions are listed below:
Price - Residential property sale price in dollars (year: 2003)
Land_fcv - Land full cash value in dollars (a reflection of the market value of the property; consists of land and improvements)
Age - The age of the property
Time - Travel time to the nearest lake in minutes
Pop_sqml - Population per square mile in 2000 (measured at the Census tract level)
Gr_fl_ar - The area of the ground floor (sq)
Patio_Floor - The ratio of patio area to total area (total area = ground floor area + patio area)
sedi_per_lake - Sediment load in tons per lake acre in the lake nearest to each residential property
Imp_fcv - Improved full cash value
The summary descriptive statistics of the data are given in the table below ("Descriptive Statistics In Excel"):
The average price in the area considered is $192,043, with a standard deviation of $78,046.08. The minimum price is $106,293.40 and the maximum is $924,382.60. Prices vary substantially around the mean: the standard deviation is about 41% of the mean.
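The spread described above can be quantified with a few lines of arithmetic. The sketch below uses only the four summary figures quoted in the text; the variable names are ours and are not part of the original Excel output.

```python
# Summary figures quoted in the text (2003 sale prices, in dollars).
mean_price = 192_043.00
std_price = 78_046.08
min_price = 106_293.40
max_price = 924_382.60

# Coefficient of variation: the standard deviation as a fraction of the mean.
cv = std_price / mean_price

# Distance of each extreme from the mean, measured in standard deviations.
z_min = (min_price - mean_price) / std_price
z_max = (max_price - mean_price) / std_price

print(f"coefficient of variation: {cv:.1%}")
print(f"minimum: {z_min:.2f} SD from the mean")
print(f"maximum: {z_max:.2f} SD from the mean")
```

The minimum lies only about one standard deviation below the mean, while the maximum lies more than nine standard deviations above it. This suggests that the price distribution is skewed to the right by a small number of very expensive properties.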
Model Specification and Development
In this part of the paper, we define and describe the regression model used to examine our research hypothesis. A regression model is an equation in which the response (dependent) variable is represented as a function of explanatory variables (also called "independent variables"). Depending on the type of function, regression models are classified as linear or nonlinear. Linear regression models are the most thoroughly studied and are therefore the most common in econometric and statistical analysis. According to the instructions, the model should be linear.
We first define a regression model that includes all the variables. The dependent variable is the price, and the independent variables are all the other variables. The linear regression therefore has the following form ("Multiple Linear Regression"):
Price = constant + b1*Land_fcv + b2*Age + b3*Time + b4*Pop_sqml + b5*Gr_fl_ar + b6*Patio_Floor + b7*sedi_per_lake + b8*Imp_fcv
Once the first model is estimated, we examine the significance of the coefficients and eliminate the variables whose coefficients are not significant. The final model includes only significant factors.
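The procedure described above is commonly called backward elimination. The analysis in this paper was carried out in Excel, so the following is only a minimal Python sketch of the same idea. It runs on synthetic data rather than the Prescott data set. The helper names (invert, ols, backward_eliminate) and the variable names x1, x2, and noise are hypothetical. As a further assumption, the critical value comes from the normal distribution, which approximates the t distribution well only for large samples.

```python
import random
from statistics import NormalDist

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity, then reduce the left half to it.
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols(rows, y):
    """Least-squares fit; returns the coefficients and their t statistics."""
    n, k = len(rows), len(rows[0])
    cols = list(zip(*rows))
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    t_stats = [beta[i] / (sigma2 * inv[i][i]) ** 0.5 for i in range(k)]
    return beta, t_stats

def backward_eliminate(rows, y, names, alpha=0.05):
    """Drop the least significant predictor until all remaining ones pass."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # normal approximation to t
    rows, names = [list(r) for r in rows], list(names)
    while True:
        beta, t_stats = ols(rows, y)
        if len(names) == 1:  # only the constant is left
            return names, beta
        # Index 0 is the constant, which is always kept.
        worst = min(range(1, len(names)), key=lambda i: abs(t_stats[i]))
        if abs(t_stats[worst]) >= critical:
            return names, beta
        rows = [r[:worst] + r[worst + 1:] for r in rows]
        del names[worst]

# Synthetic data (not the Prescott data set): y depends on x1 and x2 only.
random.seed(0)
names = ["constant", "x1", "x2", "noise"]
rows, y = [], []
for _ in range(200):
    x1, x2, x3 = (random.uniform(0, 10) for _ in range(3))
    rows.append([1.0, x1, x2, x3])
    y.append(10 + 3 * x1 - 2 * x2 + random.gauss(0, 1))

kept, coefficients = backward_eliminate(rows, y, names)
print(kept)
print([round(c, 3) for c in coefficients])
```

Removing one variable at a time, rather than all insignificant variables at once, matters because the significance of the remaining coefficients can change after each refit.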
Interpretation of Results
We constructed the initial regression model in Excel and got the following results:
The analysis of variance indicates that the coefficients of the regression equation are jointly significant (F=614.3775, p<0.0001). The model is useful and can be used for making forecasts ("Regression - Statistics Solutions"). The value of R-squared (the coefficient of determination) is relatively high at 0.711804, indicating that approximately 71.18% of the variance in house prices is explained by the independent variables included in this model. However, not all coefficients are individually significant. The sediment load (tons per lake acre in the lake nearest to each property) is not a significant factor in property prices (t=0.539746, p=0.589433). This variable is therefore excluded, and we examine the new model:
Price = constant + b1*Land_fcv + b2*Age + b3*Time + b4*Pop_sqml + b5*Gr_fl_ar + b6*Patio_Floor + b7*Imp_fcv
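The F statistic reported for the initial model can be cross-checked from its R-squared, the number of observations, and the number of predictors, using the standard identity F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below assumes a model with an intercept; the function name overall_f is ours.

```python
def overall_f(r_squared, n_obs, n_predictors):
    """Overall F statistic of a linear regression that includes an intercept."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Figures reported for the initial model: 1999 observations, 8 predictors.
f_initial = overall_f(0.711804, 1999, 8)
print(round(f_initial, 2))
```

The result agrees with the reported F=614.3775 up to the rounding of R-squared, which confirms that the regression output is internally consistent.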
The regression output is below:
The new regression model appears to be better. The analysis of variance indicates that the coefficients of the regression equation are jointly significant (F=702.3541, p<0.0001), and the F-value has increased. The model is useful and can be used for making forecasts. The value of R-squared (the coefficient of determination) barely changed: it is 0.711762, indicating that approximately 71.18% of the variance in house prices is explained by the independent variables included in this model. Thus, the sediment load in the nearest lake contributed almost nothing to the linear regression model. Finally, all the independent variables are now significant (p-values are lower than 0.05). Hence, we obtain the following final regression equation:
Price = -128137 + 1.500275*Land_fcv - 155.31*Age - 441.971*Time - 5.08722*Pop_sqml + 97.55091*Gr_fl_ar + 930254.8*Patio_Floor + 0.276739*Imp_fcv
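To illustrate how the final equation would be used for forecasting, the sketch below wraps the reported coefficients in a small function. The coefficients are taken from the text and matched to the variables in the order of the model specification. The function name predict_price and the example house are hypothetical and do not come from the data set.

```python
# Coefficients of the final model as reported in the text.
INTERCEPT = -128137.0
COEFFICIENTS = {
    "Land_fcv": 1.500275,
    "Age": -155.31,
    "Time": -441.971,
    "Pop_sqml": -5.08722,
    "Gr_fl_ar": 97.55091,
    "Patio_Floor": 930254.8,
    "Imp_fcv": 0.276739,
}

def predict_price(house):
    """Return the fitted price in dollars for a dict of predictor values."""
    return INTERCEPT + sum(coef * house[name] for name, coef in COEFFICIENTS.items())

# Hypothetical house, invented for illustration only.
example = {
    "Land_fcv": 60000, "Age": 15, "Time": 10, "Pop_sqml": 500,
    "Gr_fl_ar": 1800, "Patio_Floor": 0.05, "Imp_fcv": 120000,
}
print(round(predict_price(example), 2))

# Marginal effect: one extra minute of travel time to the nearest lake.
farther = dict(example, Time=example["Time"] + 1)
print(round(predict_price(farther) - predict_price(example), 2))
```

Because the model is linear, the change in the fitted price for a one-unit change in any predictor equals that predictor's coefficient, regardless of the values of the other variables.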
Policy Recommendation and Conclusions
In this paper, we have studied how a variety of factors influence house prices in Prescott, Arizona. Using the data set provided, we developed a multiple linear regression model to predict property prices from factors that affect residential property values. It was assumed that a high level of sedimentation in rivers or lakes would be a negative factor in the prices of properties near the lakes. However, our analysis did not support this claim: the sedimentation level is not a significant determinant of property prices.
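The conclusion about sedimentation can be cross-checked from the two reported R-squared values with a partial F-test. When a single variable is dropped, the partial F statistic should equal the square of that variable's t statistic in the full model. The sketch below uses only figures quoted in the text; the variable names are ours, and small discrepancies are expected because R-squared is reported to six decimal places.

```python
n_obs = 1999
r2_full, k_full = 0.711804, 8        # initial model, with sedi_per_lake
r2_reduced, k_reduced = 0.711762, 7  # final model, without sedi_per_lake
t_sediment = 0.539746                # reported t statistic of sedi_per_lake

# Partial F: explained variance lost per dropped predictor, relative to
# the unexplained variance per residual degree of freedom of the full model.
df_resid_full = n_obs - k_full - 1
partial_f = ((r2_full - r2_reduced) / (k_full - k_reduced)) / (
    (1 - r2_full) / df_resid_full
)

print(round(partial_f, 3))
print(round(t_sediment ** 2, 3))
```

Both values are close to 0.29, far below the 5% critical value of about 3.84 for an F distribution with 1 numerator degree of freedom and a large number of denominator degrees of freedom. Removing the sediment variable therefore costs the model essentially no explanatory power.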
The signs of the slopes in the regression equation provide important information for decision making ("Regression Slope: Confidence Interval"). According to the analysis, the land full cash value in dollars (a reflection of the market value of the property) is a positive factor in house prices. Naturally, land value is part of the price, so when it increases, it usually pushes the house price up. The age of a property is a negative factor: older houses usually cost less than new buildings. Travel time to the nearest lake is a negative factor, because a long travel time means that the house is far from the lake, and its price should be lower. Population per square mile in 2000 is also a negative factor. People usually prefer to live in areas with low population density, so high density tends to push property prices down. The area of the ground floor is a positive factor, since larger houses sell for more. The ratio of patio area to total area is a strongly positive factor. Properties with large patio areas are usually luxury homes and are expensive. Finally, the improved full cash value is a positive factor in prices.
Works Cited
"Descriptive Statistics In Excel". Excel-easy.com. N.p., 2016. Web. 23 June 2016.
"Multiple Linear Regression". Stat.yale.edu. N.p., 2016. Web. 23 June 2016.
"Regression - Statistics Solutions". Statistics Solutions. N.p., 2016. Web. 23 June 2016.
"Regression Slope: Confidence Interval". Stattrek.com. N.p., 2016. Web. 23 June 2016.