Regression analysis is a statistical technique used to explore and model the relationship between two or more variables. For example, an analyst may want to know whether there is a relationship between road accidents and the age of the driver. Regression analysis forms an important part of the statistical analysis of data obtained from designed experiments and is discussed briefly in this chapter. Every experiment analyzed in DOE++ includes regression results for each of the responses. These results, along with the results from the analysis of variance (explained in our "Analysis of Experiments" discussion), help to identify the significant factors in an experiment and to explore the nature of the relationship between these factors and the response. Regression analysis forms the basis for all DOE++ calculations related to the sum of squares used in the analysis of variance. The reason for this is explained in the last section of Chapter 6, Use of Regression to Calculate Sum of Squares. DOE++ also includes a regression tool to check whether two or more variables are related and to explore the nature of the relationship between them. This chapter discusses simple linear regression analysis, while Chapter 5 focuses on multiple linear regression analysis.
A linear regression model attempts to explain the relationship
between two or more variables using a straight line. Consider the data
obtained from a chemical process where the yield of the process is thought
to be related to the reaction temperature (see Table 4.1). These data can
be entered in DOE++ as shown in Figure 4.1, and
a scatter plot can be obtained as shown in Figure 4.2.
In the scatter plot, yield, $y_i$, is plotted for different temperature
values, $x_i$. It is clear that no line can be found
to pass through all points of the plot. Thus no functional relation
exists between the two variables $x$ and $Y$.
However, the scatter plot does indicate that a straight line
may exist such that all the points on the plot are scattered randomly
around this line. A statistical relation is said to exist in
this case. The statistical relation between $x$ and $Y$ may be expressed as follows:
$$Y = \beta_0 + \beta_1 x + \epsilon \qquad (1)$$
Table 4.1: Yield data observations of a chemical process at different values of reaction temperature.
Figure 4.1: Data entry in DOE++ for the observations in Table 4.1.
Figure 4.2: Scatter plot for the data in Table 4.1.
The actual values of $Y$ (which are observed as yield from
the chemical process from time to time and are random in nature) are
assumed to be the sum of the mean value, $E(Y)$, and a random error term, $\epsilon$:
$$Y = E(Y) + \epsilon = \beta_0 + \beta_1 x + \epsilon$$
The regression model here is called a simple linear regression model because there is just one independent variable, $x$, in the model. In regression models, the independent variables are also referred to as regressors or predictor variables. The dependent variable, $Y$, is also referred to as the response. The slope, $\beta_1$, and the intercept, $\beta_0$, of the line are called regression coefficients. The slope, $\beta_1$, can be interpreted as the change in the mean value of $Y$ for a unit change in $x$.
The random error term, $\epsilon$, is assumed to follow the normal distribution with a mean of 0 and a variance of $\sigma^2$. Since $Y$ is the sum of this random term and the mean value, $E(Y)$ (which is a constant), the variance of $Y$ at any given value of $x$ is also $\sigma^2$. Therefore, at any given value of $x$, say $x_i$, the dependent variable $Y$ follows a normal distribution with a mean of $\beta_0 + \beta_1 x_i$ and a standard deviation of $\sigma$. This is illustrated in Figure 4.3.
Figure 4.3: The normal distribution of $Y$ for two values of $x$. Also shown are the true regression line and the values of the random error term, $\epsilon$, corresponding to the two $x$ values. The true regression line and $\epsilon$ are usually not known.
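The distributional assumption described above can be checked with a small simulation. This is only a sketch: the values of `beta0`, `beta1` and `sigma` are hypothetical and are not taken from Table 4.1, and the helper `simulate_response` is introduced purely for illustration.

```python
import numpy as np

# Hypothetical "true" parameters, chosen only for illustration
# (in practice the true regression line and sigma are not known).
beta0, beta1, sigma = 10.0, 2.0, 3.0

rng = np.random.default_rng(seed=0)

def simulate_response(x, n_draws):
    """Draw n_draws values of Y = beta0 + beta1*x + eps, with eps ~ N(0, sigma^2)."""
    eps = rng.normal(loc=0.0, scale=sigma, size=n_draws)
    return beta0 + beta1 * x + eps

# At each fixed x, the simulated Y values should scatter around
# beta0 + beta1*x with a standard deviation close to sigma.
for x in (50.0, 100.0):
    y = simulate_response(x, n_draws=100_000)
    print(f"x = {x}: sample mean = {y.mean():.2f} "
          f"(theory {beta0 + beta1 * x:.2f}), "
          f"sample std = {y.std(ddof=1):.2f} (theory {sigma:.2f})")
```

The spread of the simulated responses is the same at both values of the predictor; only the mean shifts along the line.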
The true regression line corresponding to Eqn. (1) is usually not
known. However, the regression line can be estimated by estimating the
coefficients $\beta_0$ and $\beta_1$ for an observed data set. The estimates,
$\hat{\beta}_0$ and $\hat{\beta}_1$, are calculated using least squares.
(For details on least square estimates, refer to [19].) The estimated regression
line, obtained using the values of $\hat{\beta}_0$ and $\hat{\beta}_1$, is called the fitted line.
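As a brief sketch of where least square estimates come from (a standard calculus argument, summarized here and not necessarily the treatment given in [19]), the estimates are the values of the coefficients that minimize the sum of squared deviations of the $n$ observations from the line:

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$

Setting the partial derivatives of $S$ with respect to $\beta_0$ and $\beta_1$ equal to zero at the minimum gives:

$$\sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0, \qquad \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0$$

Rearranging these conditions gives the so-called normal equations:

$$n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$

Solving the two normal equations simultaneously yields $\hat{\beta}_0$ and $\hat{\beta}_1$.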
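The least square estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, are obtained using the following
equations:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad (2)$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{\left( \sum_{i=1}^{n} y_i \right) \left( \sum_{i=1}^{n} x_i \right)}{n}}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (3)$$
where $\bar{y}$ is the mean of all the observed values and $\bar{x}$ is the mean of all values of the predictor variable at which the observations were taken. $\bar{y}$ is calculated using $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x}$ is calculated using $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.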
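The calculation of the least square estimates can be sketched in a few lines of Python. The data below are hypothetical and are not the Table 4.1 observations; the function name `least_squares_fit` is likewise an assumption made for illustration. The result is cross-checked against NumPy's `np.polyfit`, which fits the same straight line by least squares.

```python
import numpy as np

# Hypothetical observations for illustration only (NOT the Table 4.1 data).
x = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 100.0])         # predictor values
y = np.array([122.0, 139.0, 155.0, 181.0, 196.0, 218.0])    # observed responses

def least_squares_fit(x, y):
    """Return (b0, b1), the least square estimates of intercept and slope."""
    n = len(x)
    x_bar = x.sum() / n
    y_bar = y.sum() / n
    # Slope: [sum(y*x) - sum(y)*sum(x)/n] / sum((x - x_bar)^2)
    b1 = (np.sum(y * x) - y.sum() * x.sum() / n) / np.sum((x - x_bar) ** 2)
    # Intercept: y_bar - b1 * x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares_fit(x, y)

# Cross-check: np.polyfit returns coefficients from highest degree down,
# i.e. (slope, intercept) for a degree-1 fit.
slope, intercept = np.polyfit(x, y, deg=1)
assert np.isclose(b1, slope) and np.isclose(b0, intercept)

print(f"intercept estimate = {b0:.4f}, slope estimate = {b1:.4f}")
```

Because the intercept is computed from the two means, the fitted line always passes through the point $(\bar{x}, \bar{y})$.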
Once $\hat{\beta}_0$ and $\hat{\beta}_1$ are known, the fitted regression line
can be written as:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \qquad (4)$$
where $\hat{y}$ is the fitted or estimated
value based on the fitted regression model. It is an estimate of the mean
value, $E(Y)$. The fitted value, $\hat{y}_i$, for a given value of the predictor
variable, $x_i$, may be different from the corresponding
observed value, $y_i$. The difference between the two values
is called the residual, $e_i$:
$$e_i = y_i - \hat{y}_i \qquad (5)$$
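Fitted values and residuals can be computed directly from these definitions. The following is a minimal sketch on hypothetical data (not the Table 4.1 observations), using `np.polyfit` to obtain the least squares coefficients. It also checks a general property of a least squares line with an intercept: the residuals sum to zero, apart from floating-point rounding.

```python
import numpy as np

# Hypothetical observations for illustration only (NOT the Table 4.1 data).
x = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 100.0])
y = np.array([122.0, 139.0, 155.0, 181.0, 196.0, 218.0])

# Least squares estimates of the slope and intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Fitted value at each observed x: intercept + slope * x_i
y_hat = intercept + slope * x

# Residual at each point: observed value minus fitted value
residuals = y - y_hat

for xi, yi, fi, ei in zip(x, y, y_hat, residuals):
    print(f"x = {xi:6.1f}  observed = {yi:7.2f}  fitted = {fi:7.2f}  residual = {ei:6.2f}")

# With an intercept in the model, the residuals sum to (numerically) zero.
print("sum of residuals:", residuals.sum())
```

A positive residual means the observation lies above the fitted line, and a negative residual means it lies below.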
The least square estimates of the regression coefficients for the data in
Table 4.1 are obtained by applying Eqns. (2) and (3) to the tabulated observations.
Substituting the resulting $\hat{\beta}_0$ and $\hat{\beta}_1$ into Eqn. (4) gives the fitted regression line.
This line is shown in Figure 4.4.
Figure 4.4: Fitted regression line for the data in Table 4.1. Also shown is the residual for the 21st observation.
Once the fitted regression line is known, the fitted value of $Y$ corresponding to any observed data
point can be calculated. For example, the fitted value corresponding to
the 21st observation in Table 4.1 is $\hat{y}_{21} = \hat{\beta}_0 + \hat{\beta}_1 x_{21}$.
The observed response at this point
is $y_{21}$. Therefore, the residual at this point
is $e_{21} = y_{21} - \hat{y}_{21}$.
In DOE++, fitted values and residuals
are available using the Diagnostic icon in the Control Panel. The values
are shown in Figure 4.5.
Figure 4.5: Fitted values and residuals for the data in Table 4.1.
See Also:
Statistical Inference for Two Samples
Hypothesis Tests in Simple Linear Regression
Multiple Linear Regression Analysis
Use of Regression to Calculate Sum of Squares