Introduction
In this paper, we apply statistics and probability theory to a real-world problem. Our goal is to provide a clear and meaningful statistical exploration of National Football League data for the 2005 season. We first present descriptive statistics to summarize the given data set. We then develop a multiple linear regression model to obtain the best prediction of the percentage of games won, based on the given variables (factors).
Data
We are given a data set of 32 observations, each representing one NFL team. According to the instructions given, the following variables are used in this research: WinPct (percentage of games won), TakeInt (takeaway interceptions), TakeFum (takeaway fumbles), GiveInt (giveaway interceptions), GiveFum (giveaway fumbles), DefYds/G (average number of yards per game given up on defense), RushYds/G (average number of rushing yards per game), PassYds/G (average number of passing yards per game) and FGPct (percentage of field goals).
Since the purpose of this research is to develop a multiple linear regression that predicts the percentage of games won, we must first identify the dependent and independent variables. Dependent variable: WinPct. Independent variables: all other variables.
The multiple linear regression will be developed according to the following scheme: the initial regression includes all available variables as independent factors. Then, we will exclude all insignificant variables and repeat the regression procedure. In the end, we will obtain a regression consisting only of significant factors that affect the response variable (the percentage of games won).
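The scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw team data are not reproduced here, so the hypothetical helper fit_pvalues does not refit anything and simply looks up the p-values that Minitab reports in the Results section for each set of predictors. The names REPORTED_P, fit_pvalues and eliminate are ours.

```python
ALPHA = 0.10  # significance level used in this paper

FULL = ["TakeInt", "TakeFum", "GiveFum", "GiveInt",
        "DefYds/G", "RushYds/G", "PassYds/G", "FGPct"]

# p-values copied from the three Minitab outputs in the Results section,
# keyed by the set of predictors in each fit. This table is a stand-in
# (an assumption) for actually refitting the model on the raw data.
REPORTED_P = {
    frozenset(FULL): {
        "TakeInt": 0.001, "TakeFum": 0.167, "GiveFum": 0.547,
        "GiveInt": 0.021, "DefYds/G": 0.007, "RushYds/G": 0.117,
        "PassYds/G": 0.088, "FGPct": 1.000,
    },
    frozenset(["TakeInt", "GiveInt", "DefYds/G", "PassYds/G"]): {
        "TakeInt": 0.001, "GiveInt": 0.000, "DefYds/G": 0.005,
        "PassYds/G": 0.130,
    },
    frozenset(["TakeInt", "GiveInt", "DefYds/G"]): {
        "TakeInt": 0.002, "GiveInt": 0.000, "DefYds/G": 0.002,
    },
}


def fit_pvalues(predictors):
    """Hypothetical 'fit': return the reported p-values for this predictor set."""
    return REPORTED_P[frozenset(predictors)]


def eliminate(predictors, alpha=ALPHA):
    """Drop every predictor with p >= alpha, refit, and repeat until none remain."""
    current = list(predictors)
    history = []  # predictors dropped in each round
    while True:
        pvals = fit_pvalues(current)
        dropped = [v for v in current if pvals[v] >= alpha]
        if not dropped:
            return current, history
        history.append(dropped)
        current = [v for v in current if v not in dropped]


final, history = eliminate(FULL)
print("dropped per round:", history)
print("final predictors:", final)
```

Note that this scheme removes all insignificant variables in a round at once, rather than one variable at a time as in classical backward elimination.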
Method
Two statistical techniques will be used in this research: descriptive statistics and multiple linear regression analysis.
Descriptive statistics is a group of statistical methods for preliminary data description. It includes measures of central tendency and measures of variability, as well as data grouping, tabulation, graphical representation of different kinds of data and quantitative description. It allows the researcher to compile and summarize the initial results obtained through observation or experiment ("Descriptive Statistics").
Multiple regression analysis models the relationship between several independent variables (also called covariates or predictors) and a dependent variable. In multiple linear regression, the assumptions and the procedure are the same as in simple linear regression. A distinctive feature of multiple regression is the possible correlation between the independent variables ("Multiple Linear Regression").
In our analysis, we will use a 10% level of significance (alpha = 0.10).
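To make the estimation step concrete, the following is a minimal sketch of how least-squares coefficients can be obtained from the normal equations (X'X)b = X'y. It is not the routine Minitab uses. The function name fit_ols and the six-row data set are made up for illustration; the data are generated exactly from y = 1 + 2*x1 - 3*x2, so the recovered coefficients should be close to 1, 2 and -3.

```python
def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    # Design matrix with a leading column of ones for the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Build X'X (k x k) and X'y (length k).
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Solve (X'X) b = X'y by Gauss-Jordan elimination with partial pivoting.
    # No check for a singular matrix is made in this sketch.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2 (no noise).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (3, 2)]
y = [1 + 2 * a - 3 * b for a, b in rows]

coefs = fit_ols(rows, y)
print([round(c, 6) for c in coefs])
```

With strongly correlated predictors the matrix X'X becomes nearly singular, which is one reason the correlation between independent variables mentioned above matters in practice.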
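The decision rule can be illustrated with a short sketch. It assumes the usual t-test for a regression coefficient: T = Coef / SE Coef, compared with a t distribution on n - k - 1 degrees of freedom (32 - 8 - 1 = 23 for the full model). The two coefficient and standard-error pairs are taken from the full-model output in the Results section. The function names and the Simpson's-rule integration are illustrative choices on our part, not a description of Minitab's internals.

```python
import math

ALPHA = 0.10  # significance level used in this paper


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the central area between -|t| and |t|."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3  # area between 0 and |t|
    return 1 - 2 * half_area


DF_FULL = 32 - 8 - 1  # residual degrees of freedom of the full model

# (Coef, SE Coef) as reported for the full model.
t_pass = 0.0011864 / 0.0006653   # PassYds/G
t_rush = 0.002098 / 0.001290     # RushYds/G

p_pass = two_sided_p(t_pass, DF_FULL)
p_rush = two_sided_p(t_rush, DF_FULL)

for name, t, p in [("PassYds/G", t_pass, p_pass), ("RushYds/G", t_rush, p_rush)]:
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{name}: T = {t:.2f}, p = {p:.3f} -> {verdict} at alpha = {ALPHA}")
```

At the more common 5% level neither of these two predictors would be retained, so the choice of alpha = 0.10 directly affects which variables survive the first round.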
Results
The descriptive statistics were generated in Minitab 16. They are presented in the tables and graphs below:
Variable    N  N*    Mean  SE Mean   StDev  Minimum      Q1  Median      Q3
WinPct     32   0  0,5003   0,0375  0,2119   0,1250  0,3130  0,5630  0,6880
TakeInt    32   0  15,844    0,950   5,377    5,000  12,250  16,000  18,750
TakeFum    32   0  12,125    0,531   3,003    7,000  10,000  11,500  14,000
GiveInt    32   0  15,844    0,880   4,978    6,000  13,250  15,500  18,000
GiveFum    32   0  12,156    0,580   3,283    6,000   9,000  12,000  14,000
DefYds/G   32   0  315,94     4,43   25,05   277,80  296,18  316,80  327,95
RushYds/G  32   0  112,46     4,17   23,62    71,10   94,05  106,00  130,77
PassYds/G  32   0  203,45     6,64   37,56   118,60  181,17  205,30  228,55
FGPct      32   0   80,77     1,20    6,81    66,70   76,05   80,50   86,75

Variable   Maximum     IQR  Mode              N for Mode  Skewness  Kurtosis
WinPct      0,8750  0,3750  0,688                      6     -0,05     -1,28
TakeInt     31,000   6,500  16                         5      0,45      1,02
TakeFum     19,000   4,000  11                         6      0,61      0,10
GiveInt     30,000   4,750  14; 16                     5      0,58      1,19
GiveFum     19,000   5,000  9; 14                      5      0,19     -0,45
DefYds/G    391,20   31,78  316,8                      2      0,83      1,41
RushYds/G   159,10   36,72  *                          0      0,49     -0,63
PassYds/G   277,30   47,38  *                          0     -0,39      0,07
FGPct        95,60   10,70  76,5; 83,3; 87,5           2      0,15     -0,37
The distribution histograms describe the frequency distribution of each variable. The bell-shaped curve on each graph represents the normal (Gaussian) distribution.
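Two columns of the descriptive table can be recomputed from the others, which gives a quick consistency check. The sketch below assumes the standard definitions SE Mean = StDev / sqrt(N) and IQR = Q3 - Q1; the numbers are the reported Minitab values for three of the variables (written with decimal points), and the variable names in the code are ours.

```python
import math

N = 32  # number of teams

# (StDev, Q1, Q3) as reported by Minitab for three of the variables.
reported = {
    "WinPct": (0.2119, 0.3130, 0.6880),
    "TakeInt": (5.377, 12.250, 18.750),
    "FGPct": (6.81, 76.05, 86.75),
}

checks = {}
for name, (stdev, q1, q3) in reported.items():
    se_mean = stdev / math.sqrt(N)  # standard error of the mean
    iqr = q3 - q1                   # interquartile range
    checks[name] = (se_mean, iqr)
    print(f"{name}: SE Mean = {se_mean:.4f}, IQR = {iqr:.4f}")
```

Up to rounding, the results agree with the SE Mean and IQR columns reported above.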
The initial linear regression includes all remaining variables as independent variables. The Minitab output is given below:
The regression equation is
WinPct = 1,13 + 0,0162 TakeInt - 0,0125 TakeFum - 0,00521 GiveFum
- 0,0158 GiveInt - 0,00285 DefYds/G + 0,00210 RushYds/G
+ 0,00119 PassYds/G - 0,00000 FGPct
Predictor Coef SE Coef T P
Constant 1,1313 0,4727 2,39 0,025
TakeInt 0,016234 0,004418 3,67 0,001
TakeFum -0,012549 0,008798 -1,43 0,167
GiveFum -0,005213 0,008524 -0,61 0,547
GiveInt -0,015804 0,006369 -2,48 0,021
DefYds/G -0,0028476 0,0009524 -2,99 0,007
RushYds/G 0,002098 0,001290 1,63 0,117
PassYds/G 0,0011864 0,0006653 1,78 0,088
FGPct -0,000000 0,003408 -0,00 1,000
S = 0,120866 R-Sq = 75,9% R-Sq(adj) = 67,5%
Analysis of Variance
Source DF SS MS F P
Regression 8 1,05575 0,13197 9,03 0,000
Residual Error 23 0,33600 0,01461
Total 31 1,39175
ANOVA results indicate that the coefficients are jointly significant (F=9.03, p<0.001). The coefficient of determination R-square is 0.759, showing that approximately 75.9% of the variance in WinPct is explained by this model. However, not all factors have a significant impact on the response variable at the 10% level. TakeFum (p=0.167), GiveFum (p=0.547), RushYds/G (p=0.117) and FGPct (p=1.000) are not significant predictors of WinPct, so we exclude these factors from the regression equation.
We then fit a new linear regression predicting WinPct from the following independent variables: TakeInt, GiveInt, DefYds/G and PassYds/G. The results are given in the tables below:
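The summary figures quoted above follow directly from the ANOVA table. As a hedged check, the sketch below recomputes them using the standard definitions R-Sq = SS(Regression) / SS(Total), R-Sq(adj) = 1 - MS(Error) / (SS(Total) / DF(Total)) and F = MS(Regression) / MS(Error). The function name anova_summary is ours; the inputs are the sums of squares and degrees of freedom reported for the full model.

```python
def anova_summary(ss_reg, ss_err, df_reg, df_err):
    """Return (R-Sq, adjusted R-Sq, F) from the entries of an ANOVA table."""
    ss_total = ss_reg + ss_err
    df_total = df_reg + df_err
    ms_reg = ss_reg / df_reg
    ms_err = ss_err / df_err
    r_sq = ss_reg / ss_total
    adj_r_sq = 1 - ms_err / (ss_total / df_total)
    f_stat = ms_reg / ms_err
    return r_sq, adj_r_sq, f_stat


# Full model: Regression SS and DF, Residual Error SS and DF.
r_sq, adj_r_sq, f_stat = anova_summary(1.05575, 0.33600, 8, 23)
print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {adj_r_sq:.1%}, F = {f_stat:.2f}")
```

The adjusted R-square penalizes each additional predictor, which makes it the more suitable figure when comparing models with different numbers of variables.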
The regression equation is
WinPct = 1,28 + 0,0156 TakeInt - 0,0225 GiveInt - 0,00272 DefYds/G
+ 0,000922 PassYds/G
Predictor Coef SE Coef T P
Constant 1,2813 0,3490 3,67 0,001
TakeInt 0,015550 0,004232 3,67 0,001
GiveInt -0,022510 0,004502 -5,00 0,000
DefYds/G -0,0027168 0,0008841 -3,07 0,005
PassYds/G 0,0009217 0,0005907 1,56 0,130
S = 0,119510 R-Sq = 72,3% R-Sq(adj) = 68,2%
Analysis of Variance
Source DF SS MS F P
Regression 4 1,00612 0,25153 17,61 0,000
Residual Error 27 0,38563 0,01428
Total 31 1,39175
The analysis of variance shows that the model remains jointly significant, with a larger F statistic (F=17.61, p<0.001). The R-square value decreases slightly: the model explains approximately 72.3% of the variance in the response variable, while the adjusted R-square rises from 67.5% to 68.2%. One factor is no longer significant in the reduced model: PassYds/G (p=0.130). We exclude this variable and run the regression analysis once again:
The regression equation is
WinPct = 1,56 + 0,0146 TakeInt - 0,0219 GiveInt - 0,00298 DefYds/G
Predictor Coef SE Coef T P
Constant 1,5571 0,3085 5,05 0,000
TakeInt 0,014552 0,004289 3,39 0,002
GiveInt -0,021871 0,004596 -4,76 0,000
DefYds/G -0,0029780 0,0008901 -3,35 0,002
S = 0,122533 R-Sq = 69,8% R-Sq(adj) = 66,6%
Analysis of Variance
Source DF SS MS F P
Regression 3 0,97135 0,32378 21,56 0,000
Residual Error 28 0,42040 0,01501
Total 31 1,39175
The F statistic for joint significance increases again (F=21.56, p<0.001), while the R-square value decreases to 69.8% (adjusted R-square 66.6%). All independent variables are significant predictors of the response variable, as all p-values are less than 0.10. The final multiple linear regression has the following form:
WinPct = 1,56 + 0,0146 TakeInt - 0,0219 GiveInt - 0,00298 DefYds/G
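As an illustration of how the final equation could be used, the sketch below evaluates it with the unrounded coefficients from the last Minitab output. The function name predict_win_pct is ours. The inputs are the sample means of TakeInt, GiveInt and DefYds/G from the descriptive statistics, chosen because a least-squares fit with an intercept should return approximately the mean of WinPct (0.5003) at the predictor means.

```python
def predict_win_pct(take_int, give_int, def_yds_per_game):
    """Predicted WinPct from the final three-predictor model."""
    return (1.5571
            + 0.014552 * take_int
            - 0.021871 * give_int
            - 0.0029780 * def_yds_per_game)


# Sample means from the descriptive statistics table.
at_means = predict_win_pct(15.844, 15.844, 315.94)
print(f"Predicted WinPct at the sample means: {at_means:.4f}")

# Effect of one additional giveaway interception, other inputs unchanged.
one_more_giveaway = predict_win_pct(15.844, 16.844, 315.94)
print(f"Change per extra giveaway interception: {one_more_giveaway - at_means:.4f}")
```

Because the model is linear, predictions for extreme inputs are not constrained to lie between 0 and 1, so the equation is best applied within the range of the observed data.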
Conclusions
In this paper, we have provided the descriptive statistics of the National Football League 2005 data and developed a multiple linear regression analysis to predict the percentage of games won based on the given factors. It appears that the percentage of games won depends on takeaway interceptions, giveaway interceptions and the average number of yards per game given up on defense. The mathematical model is given in the Results section. We also found no significant impact of takeaway fumbles, giveaway fumbles, average number of rushing yards per game, average number of passing yards per game or percentage of field goals on the percentage of games won.
Works Cited
"Descriptive Statistics". Socialresearchmethods.net. N.p., 2016. Web. 13 Apr. 2016.
"Multiple Linear Regression". Stat.yale.edu. N.p., 2016. Web. 13 Apr. 2016.