Abstract
Regression analysis estimates the value of a random variable (the dependent variable) for a given value of an associated independent variable. The regression equation provides the formula for this calculation. The present case concerns how total monthly sales depend on monthly customer traffic. The consultant first examines whether the two variables are correlated by plotting the data points on a scatterplot. The scatterplot convinces the consultant that a strong positive linear correlation exists between the two variables. The consultant therefore uses simple linear regression to find the slope and intercept of the regression equation, and then uses these values to predict total monthly sales for the next year from any given monthly customer traffic. At the end of the year, the management of ABC Furniture Company compares these predictions with the actual values to calculate the variance.
Brief Introduction and Problem Background
Scatter Plot
Regression analysis studies the relationship between variables. It is a useful tool for a business analyst because it applies to many situations, most of which involve determining how a single variable depends on other relevant variables. This is an important issue in real life. For example, if a company can determine how advertising affects sales, it can better tailor its advertising campaigns, including their frequency and funding. This is possible because, if a relationship does exist, regression analysis allows the company to predict likely sales for any given advertising spend. This makes regression analysis one of the most pervasive methods of statistical analysis in the business world. Regression analysis serves two purposes: understanding relationships and making predictions. A company can use regression models for what-if analysis, that is, for predicting the effect that different values of one variable (such as the advertising budget) have on another variable (such as sales).
The data used for regression analysis can be cross-sectional or time-series data. Cross-sectional data is gathered from a population over the same time period; government wage statistics are one example. Time-series data is obtained by observing a particular variable at several, usually equally spaced, points in time. Regression analysis applies to both types of data, but it treats time-series data somewhat differently. Time-series data often exhibits autocorrelation, which means that the values of a variable are related to its own past values. In a regression analysis, the different types of variables are:
Dependent variable: Also called the response or target variable, this is the variable that one is trying to predict or explain.
Explanatory variable: Also called the independent or predictor variable, this is a variable on which the dependent variable depends. If there is more than one explanatory variable, the analysis is a multiple regression; if there is only one, it is a simple regression.
When the relationship between the variables follows a straight line, the relationship is linear; a non-linear relationship generally implies a curve. Linear regression can also be used to estimate non-linear relationships by applying suitable mathematical transformations. It is best to start any regression analysis by drawing a scatterplot.
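As an illustration of the transformation idea, the following minimal sketch fits an exponential curve y = a * exp(b * x) by regressing ln(y) on x with ordinary least squares. The data, the function name fit_line, and the choice of an exponential model are assumptions made for this example only; they do not come from the case.

```python
import math

def fit_line(xs, ys):
    """Return least-squares (intercept, slope) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data that follow y = 2 * exp(0.5 * x) exactly.
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]

# Taking logs gives ln(y) = ln(a) + b * x, which is linear in x.
log_a, b = fit_line(xs, [math.log(y) for y in ys])
a = math.exp(log_a)
print(round(a, 4), round(b, 4))
```

Because the curve becomes a straight line after the log transformation, the same least-squares machinery used for linear relationships recovers both parameters.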
In the present case there are two sets of data, and the first step is to find out whether they are correlated. Since the values in the second set are expected to depend on the corresponding values in the first set, linear regression may be a good fit for analyzing the data. To confirm this, we used a scatter diagram, or scatterplot. A scatterplot helps to determine whether a relationship exists between two quantitative variables and is also useful for identifying outliers in the data. In a scatterplot, each point marks the intersection of an x value (horizontal axis) and a y value (vertical axis), where x is the independent variable and y is the dependent variable. The intent of the scatterplot is to determine whether y and x are correlated, so that one can predict the value of y for any given x.
After drawing the scatter diagram, as in Figure 1, we fit a line to it that reasonably approximates the points on the graph. (Microsoft Excel can draw a scatterplot and add a trend line to achieve the same effect.) The best-fit line shows whether the relationship between the variables is:
Positive or direct when the slope of the best-fit line (trend line) is positive with both y and x increasing together
Negative or inverse when the slope of the best-fit line is negative and with y decreasing as x increases
Curvilinear when the best-fit line is a curve, which can be either positive or negative
No relationship when the best-fit line is horizontal and knowing the value of x does not help in predicting the value of y
The scatterplot in Figure 1 indicates that there is indeed a positive relationship between the two variables, monthly customer traffic and total monthly sales. The points tend to rise from bottom left to top right. However, the relationship is not perfect. If it were, a given value of monthly customer traffic would determine total monthly sales exactly, and all the points would fall on a straight line. The correlation coefficient, which expresses the strength of the linear relationship, is 0.8473, so one can conclude that there is a strong positive correlation between the two variables.
Figure 1: Scatter Diagram
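The correlation coefficient mentioned above can be computed as in the following minimal sketch. The monthly figures are hypothetical and are not the ABC Furniture Company data, so the result will not equal 0.8473; the function name pearson_r is likewise an assumption for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly customer traffic and total monthly sales.
traffic = [100, 120, 140, 160, 180, 200]
sales = [70, 95, 90, 120, 115, 140]

r = pearson_r(traffic, sales)
print(round(r, 4))
```

A value close to +1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.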
In the current scenario, the relationship is positive and approximately linear, so simple linear regression is a reasonable choice for analyzing it. Figure 2 gives the formula for the simple regression line.
Figure 2: Formula for simple regression line
Source:
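For reference, the simple regression line is conventionally written in the form below. The symbols b_0 (intercept) and b_1 (slope) follow the b0 and b1 used in the Forecasting section; the hat on y, marking a predicted rather than an observed value, is a standard notation assumed here.

```latex
\hat{y} = b_0 + b_1 x
```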
Linear Regression
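The standard way to fit a linear regression is the least-squares method, which chooses the line that minimizes the sum of the squared vertical distances between the data points and the line. Figure 3 gives the formula.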
Figure 3: Formula for Linear Regression using least-squares method
Source:
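In the usual textbook notation, the least-squares estimates of the slope and intercept can be written as follows. This is the standard result for simple linear regression, stated here with n data pairs (x_i, y_i) and sample means \bar{x} and \bar{y}; the exact layout of the original figure may differ.

```latex
\begin{align}
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
     = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \\
b_0 &= \bar{y} - b_1 \bar{x}
\end{align}
```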
One caveat about linear regression concerns causation. It is tempting to conclude from the above results that greater customer traffic causes better sales. However, unless the data was collected under carefully controlled conditions, it is not possible to be sure of causation. One cannot be entirely sure whether x caused y or whether it was the other way around, and some other variable (or variables) may have caused the variation in both x and y.
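The following small simulation sketches how a third variable can produce a strong correlation without any causal link between x and y. Everything in it is hypothetical: z stands in for some unobserved common driver, and x and y are each generated from z plus independent noise, so neither influences the other.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 500

# z is a hypothetical common driver; x and y depend on z, not on each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.3) for zi in z]
y = [zi + rng.gauss(0, 0.3) for zi in z]

r_xy = pearson_r(x, y)

# Subtracting z leaves only the independent noise in each variable.
x_noise = [xi - zi for xi, zi in zip(x, z)]
y_noise = [yi - zi for yi, zi in zip(y, z)]
r_noise = pearson_r(x_noise, y_noise)

print(round(r_xy, 3), round(r_noise, 3))
```

In this setup x and y are strongly correlated, yet once the common driver is removed the remaining correlation is close to zero, which is why correlation alone cannot establish that customer traffic causes sales.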
Forecasting
In the current scenario, we substitute the computed values of b0 and b1 into the equation for the simple regression line. Using the resulting formula, we predict the value of y for each value of x in year 2. Table 2 tabulates these values. At the end of the year, comparing the actual values with the forecasts gives the variance. One can also observe that the slope of the regression line, 0.6480, shows that total monthly sales tend to increase by 0.6480 units for each one-unit increase in customer traffic. The intercept in this equation should be treated only as an anchor value that positions the line.
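The forecasting and variance steps can be sketched as follows. The slope 0.6480 is the value reported above; the intercept, the year-2 traffic figures, and the actual sales figures are hypothetical placeholders, since the text does not state them, and the names forecast_sales, B0, and B1 are chosen for this example only.

```python
B1 = 0.6480  # slope reported in the text
B0 = 50.0    # hypothetical intercept; the text does not state its value

def forecast_sales(traffic):
    """Predicted total monthly sales for a given monthly customer traffic."""
    return B0 + B1 * traffic

# Hypothetical year-2 monthly traffic and the sales actually observed.
year2_traffic = [1000, 1200, 1500]
year2_actual = [700.0, 830.0, 1010.0]

forecasts = [forecast_sales(t) for t in year2_traffic]

# Variance here means actual minus forecast for each month.
variances = [actual - f for actual, f in zip(year2_actual, forecasts)]

for t, f, v in zip(year2_traffic, forecasts, variances):
    print(t, round(f, 2), round(v, 2))

# One extra unit of traffic raises the forecast by exactly the slope.
slope_effect = forecast_sales(1001) - forecast_sales(1000)
```

A positive variance means actual sales exceeded the forecast for that month, and a negative variance means they fell short.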
References
Albright, S. C., Winston, W. L., & Broadie, M. (2015). Business analytics: Data analysis and decision making (5th ed.). Stamford, CT: Cengage Learning.
Kazmier, L. J. (2004). Schaum's outline of theory and problems of business statistics (4th ed.). New York, NY: The McGraw-Hill Companies, Inc.
Weiers, R. M., Gray, J. B., & Peters, L. H. (2011). Introduction to business statistics (7th ed.). Mason, OH: South-Western Cengage Learning.