As in the case of simple linear regression, analysis of a fitted multiple linear regression model is important before inferences based on the model are undertaken. This section presents some techniques that can be used to check the appropriateness of the multiple linear regression model.
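The coefficient of multiple determination, $R^2$, is similar to the coefficient of determination used in the case of simple linear regression. It is defined as:

$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T} \qquad (30)$$

where $SS_R$ is the regression sum of squares, $SS_E$ is the error sum of squares and $SS_T$ is the total sum of squares. $R^2$ indicates the proportion of total variability explained by the regression model. The positive square root of $R^2$ is called the multiple correlation coefficient and measures the linear association between $Y$ and the predictor variables, $x_1$, $x_2$, ..., $x_k$.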
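The value of $R^2$ increases as more terms are added to the model, even if the new term does not contribute significantly to the model. An increase in the value of $R^2$ therefore cannot be taken as a sign that the new model is superior to the older model. A better statistic to use is the adjusted $R^2$ statistic, defined as follows:

$$R^2_{adj} = 1 - \frac{MS_E}{MS_T} = 1 - \frac{SS_E/(n-(k+1))}{SS_T/(n-1)} = 1 - \left(\frac{n-1}{n-(k+1)}\right)(1-R^2) \qquad (31)$$

where $n$ is the number of observations and $k$ is the number of predictor variables. The adjusted $R^2$ only increases when significant terms are added to the model. Addition of unimportant terms may lead to a decrease in the value of $R^2_{adj}$.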
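The arithmetic behind the two statistics can be sketched in a few lines of Python. This is a minimal illustration, not the DOE++ implementation: the function names are hypothetical, and the observed and fitted values are made-up numbers chosen only to exercise the formulas (they are not the Table 5.1 data, and the fitted values are assumed to come from an earlier least squares fit with an intercept).

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_E / SS_T for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    ss_t = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error sum of squares
    return 1.0 - ss_e / ss_t


def adjusted_r_squared(y, y_hat, k):
    """Adjusted R^2 for a model with k predictor variables plus an intercept."""
    n = len(y)
    return 1.0 - (n - 1) / (n - (k + 1)) * (1.0 - r_squared(y, y_hat))


# Hypothetical observed and fitted values (assumption, for illustration only).
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.5, 1.5, 3.0, 4.5, 4.5]

r2 = r_squared(y, y_hat)
r2_adj = adjusted_r_squared(y, y_hat, k=2)
print(f"R-sq = {r2:.3f}, R-sq(adj) = {r2_adj:.3f}")
```

Because the factor $(n-1)/(n-(k+1))$ is greater than one whenever $k \geq 1$, the adjusted value is never larger than $R^2$ for the same fit.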
In DOE++, $R^2$ and $R^2_{adj}$ values are displayed as R-sq and R-sq(adj), respectively. Other values displayed along with these values are S, PRESS and R-sq(pred). As explained in Chapter 4, the value of S is the square root of the error mean square, $MS_E$, and represents the "standard error of the model."
PRESS is an abbreviation for prediction error sum of squares. It is the error sum of squares calculated using the PRESS residuals in place of the residuals, $e_i$, in Eqn. (19). The PRESS residual, $e_{(i)}$, for a particular observation, $y_i$, is obtained by fitting the regression model to the remaining $n-1$ observations. Then the value for a new observation, $\hat{y}_{(i)}$, corresponding to the observation in question, $y_i$, is obtained based on the new regression model. The difference between $y_i$ and $\hat{y}_{(i)}$ gives $e_{(i)}$. The PRESS residual, $e_{(i)}$, can also be obtained using $h_{ii}$, the $i$th diagonal element of the hat matrix, $H$, as follows:

$$e_{(i)} = \frac{e_i}{1-h_{ii}} \qquad (32)$$
R-sq(pred), also referred to as prediction $R^2$, is obtained using PRESS as shown next:

$$R^2_{pred} = 1 - \frac{PRESS}{SS_T} \qquad (33)$$
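The equivalence between the leave-one-out definition of the PRESS residual and the hat-matrix shortcut can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a single predictor (the $k = 1$ special case) so that the fit and the leverages have closed forms, the data are hypothetical, and `fit_line` is a helper introduced here rather than anything from DOE++.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    return y_bar - b1 * x_bar, b1


# Hypothetical data (assumption, for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
n = len(x)

b0, b1 = fit_line(x, y)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
# For one predictor the hat matrix diagonal has this closed form.
leverages = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Shortcut: e_(i) = e_i / (1 - h_ii).
press_shortcut = [e / (1.0 - h) for e, h in zip(residuals, leverages)]

# Definition: refit without observation i, then predict it.
press_refit = []
for i in range(n):
    c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    press_refit.append(y[i] - (c0 + c1 * x[i]))

press = sum(e ** 2 for e in press_shortcut)
ss_t = sum((yi - y_bar) ** 2 for yi in y)
r2_pred = 1.0 - press / ss_t
print(f"PRESS = {press:.4f}, R-sq(pred) = {r2_pred:.4f}")
```

The shortcut avoids refitting the model $n$ times, which is why it is the form normally used in practice.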
The values of R-sq, R-sq(adj) and S are indicators of how well the regression model fits the observed data. The values of PRESS and R-sq(pred) are indicators of how well the regression model predicts new observations. For example, higher values of PRESS or lower values of R-sq(pred) indicate a model that predicts poorly. Figure 5.19 shows these values for the data in Table 5.1. The values indicate that the regression model fits the data well and also predicts well.
Figure 5.19: Coefficient of multiple determination and related results for the data in Table 5.1.
Plots of residuals, $e_i$, similar to the ones discussed in the previous chapter for simple linear regression, are used to check the adequacy of a fitted multiple linear regression model. The residuals are expected to be normally distributed with a mean of zero and a constant variance of $\sigma^2$. In addition, they should not show any patterns or trends when plotted against any variable or in a time or run-order sequence. Residual plots may also be obtained using standardized and studentized residuals. Standardized residuals, $d_i$, are obtained using the following equation:

$$d_i = \frac{e_i}{\sqrt{\hat{\sigma}^2}} = \frac{e_i}{\sqrt{MS_E}} \qquad (34)$$
Standardized residuals are scaled so that the standard deviation of the residuals is approximately equal to one. This helps to identify possible outliers or unusual observations. However, standardized residuals may understate the true residual magnitude, hence studentized residuals, $r_i$, are used in their place. Studentized residuals are calculated as follows:

$$r_i = \frac{e_i}{\sqrt{\hat{\sigma}^2(1-h_{ii})}} = \frac{e_i}{\sqrt{MS_E(1-h_{ii})}} \qquad (35)$$
where $h_{ii}$ is the $i$th diagonal element of the hat matrix, $H$. External studentized (or studentized deleted) residuals may also be used. These residuals are based on the PRESS residuals mentioned above in the Coefficient of Multiple Determination, $R^2$ section. The reason for using the external studentized residuals is that if the $i$th observation is an outlier, it may influence the fitted model. In this case, the residual $e_i$ will be small and may not disclose that the $i$th observation is an outlier. The external studentized residual for the $i$th observation, $t_i$, is obtained as follows:

$$t_i = e_i \left[ \frac{n-k-2}{SS_E(1-h_{ii}) - e_i^2} \right]^{0.5} \qquad (36)$$

where $n-k-2$ is the error degrees of freedom, $n-(k+1)$, reduced by one for the deleted observation.
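The three scaled residuals can be computed side by side to see how they differ for an off-trend point. The following is a minimal sketch under stated assumptions: the data are hypothetical, a single predictor ($k = 1$) is used so that the least squares fit and the leverages can be written in closed form, and the variable names `d`, `r` and `t` simply mirror the symbols in Eqns. (34) to (36).

```python
import math

# Hypothetical one-predictor data (k = 1), not the Table 5.1 data.
# The last response is deliberately off-trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 9.5]
k = 1
n = len(x)

# Least squares fit of y = b0 + b1 * x.
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]     # ordinary residuals
h = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]  # leverages h_ii
ss_e = sum(ei ** 2 for ei in e)
dof_e = n - (k + 1)                                   # error degrees of freedom
ms_e = ss_e / dof_e

# Standardized residuals, Eqn. (34).
d = [ei / math.sqrt(ms_e) for ei in e]
# Studentized residuals, Eqn. (35).
r = [ei / math.sqrt(ms_e * (1.0 - hi)) for ei, hi in zip(e, h)]
# External studentized residuals, Eqn. (36); dof_e - 1 equals n - k - 2.
t = [ei * math.sqrt((dof_e - 1) / (ss_e * (1.0 - hi) - ei ** 2))
     for ei, hi in zip(e, h)]

for i in range(n):
    print(f"{i + 1}: e={e[i]: .3f} d={d[i]: .3f} r={r[i]: .3f} t={t[i]: .3f}")
```

Since $1 - h_{ii} < 1$, each studentized residual is at least as large in magnitude as the corresponding standardized residual, which is the understatement the text refers to.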
Residual values for the data of Table 5.1 are shown in Figure 5.20. These values are available using the Diagnostics icon in the Control Panel. Standardized residual plots for the data are shown in Figures 5.21 to 5.23. DOE++ compares the residual values to the critical values on the $t$ distribution for studentized and external studentized residuals. For other residuals the normal distribution is used. For example, for the data in Table 5.1, the critical values on the $t$ distribution at a significance of 0.1 are the lower and upper values calculated in Example 5.3 (Chapter 5, Test on Individual Regression Coefficients). The studentized residual values corresponding to the 3rd and 17th observations lie outside the critical values. Therefore, the 3rd and 17th observations are flagged as outliers. This can also be seen on the residual plots in Figures 5.22 and 5.23.
Figure 5.20: Residual values for the data in Table 5.1.

Figure 5.21: Residual probability plot for the data in Table 5.1.

Figure 5.22: Residual versus fitted values plot for the data in Table 5.1.

Figure 5.23: Residual versus run order plot for the data in Table 5.1.
Residuals help to identify outlying $y$ observations. Outlying $x$ observations can be detected using leverage. Leverage values are the diagonal elements of the hat matrix, $h_{ii}$. The $h_{ii}$ values always lie between 0 and 1. Values of $h_{ii}$ greater than $2(k+1)/n$ are considered to be indicators of outlying $x$ observations.
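Leverage values depend only on the predictor settings, not on the response, so they can be computed before any data are collected. The sketch below is one possible way to obtain them without a matrix library, assuming an intercept term in the model; `solve` and `leverages` are hypothetical helper names and the predictor values are made up for illustration.

```python
def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    p = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix (copy)
    for col in range(p):
        pivot = max(range(col, p), key=lambda row_idx: abs(m[row_idx][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row_idx in range(p):
            if row_idx != col:
                f = m[row_idx][col] / m[col][col]
                m[row_idx] = [u - f * v for u, v in zip(m[row_idx], m[col])]
    return [m[i][p] / m[i][i] for i in range(p)]


def leverages(predictors):
    """Diagonal elements h_ii of the hat matrix H = X (X'X)^-1 X'."""
    X = [[1.0] + list(row) for row in predictors]  # prepend intercept column
    p = len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(p)]
           for a in range(p)]
    # h_ii = x_i' (X'X)^-1 x_i, computed by solving (X'X) z = x_i.
    return [sum(xi * zi for xi, zi in zip(row, solve(xtx, row))) for row in X]


# Hypothetical settings of k = 2 predictors (assumption, for illustration).
predictors = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0),
              (5.0, 6.0), (6.0, 5.0), (12.0, 1.0)]
h = leverages(predictors)
n, k = len(predictors), len(predictors[0])
cutoff = 2 * (k + 1) / n
flagged = [i + 1 for i, hi in enumerate(h) if hi > cutoff]
print("leverages:", [round(hi, 3) for hi in h])
print("cutoff:", round(cutoff, 3), "flagged observations:", flagged)
```

A useful sanity check is that the leverages sum to $k+1$, the number of estimated coefficients, which is also why the cutoff is twice the average leverage.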
Once an outlier is identified, it is important to determine if the outlier has a significant effect on the regression model. One measure to detect influential observations is Cook's distance measure, $D_i$, which is computed as follows:

$$D_i = \frac{r_i^2}{(k+1)}\,\frac{h_{ii}}{(1-h_{ii})} \qquad (37)$$

To use Cook's distance measure, the $D_i$ values are compared to percentile values on the $F$ distribution with $(k+1,\ n-(k+1))$ degrees of freedom. If the percentile value is less than 10 or 20 percent, then the $i$th case has little influence on the fitted values. However, if the percentile value is close to 50 percent or greater, the $i$th case is influential, and fitted values with and without the $i$th case will differ substantially. [10]
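Once the studentized residuals and leverages are available, Cook's distance is a one-line calculation. The sketch below assumes those two quantities have already been computed; the function name and the input values are hypothetical. The comparison against $F$ distribution percentiles requires an $F$ table or a statistics library and is not shown.

```python
def cooks_distance(studentized, leverages, k):
    """D_i = r_i^2 / (k + 1) * h_ii / (1 - h_ii), per Eqn. (37)."""
    return [r * r / (k + 1) * h / (1.0 - h)
            for r, h in zip(studentized, leverages)]


# Hypothetical studentized residuals and leverages for a model with
# k = 2 predictors (assumption, for illustration only).
r = [1.0, -0.5, 2.0, 0.3]
h = [0.5, 0.2, 0.6, 0.25]

D = cooks_distance(r, h, k=2)
for i, d_i in enumerate(D, start=1):
    print(f"observation {i}: D = {d_i:.4f}")
```

The formula shows that an observation becomes influential through a combination of a large studentized residual and a high leverage; either one alone may not be enough.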
Example 5.6
Cook's distance measure can be calculated as shown next. The distance measure is calculated for the first observation of the data in Table 5.1. The remaining values along with the leverage values are shown in Figure 5.24.
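The studentized residual corresponding to the first observation, $r_1$, follows from Eqn. (35):

$$r_1 = \frac{e_1}{\sqrt{MS_E(1-h_{11})}}$$

Cook's distance measure for the first observation can now be calculated as:

$$D_1 = \frac{r_1^2}{(k+1)}\,\frac{h_{11}}{(1-h_{11})}$$

The 50th percentile value for the $F$ distribution with $(k+1,\ n-(k+1))$ degrees of freedom is 0.83. Since all $D_i$ values are less than this value there are no influential observations.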
Figure 5.24: Leverage and Cook's distance measure for the data in Table 5.1.
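The lack-of-fit test for simple linear regression discussed in Chapter 4 may also be applied to multiple linear regression to check the appropriateness of the fitted response surface and see if a higher order model is required. Data for $m$ replicates may be collected at each of the $n$ levels of the predictor variables, with $y_{ij}$ denoting the $j$th replicate observed at the $i$th level and $\bar{y}_i$ the mean of the replicates at that level. Note that in this subsection $n$ counts the distinct levels, so the total number of observations is $nm$.

The sum of squares due to pure error, $SS_{PE}$, can be obtained as discussed in the previous chapter as:

$$SS_{PE} = \sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \bar{y}_i)^2$$

The number of degrees of freedom associated with $SS_{PE}$ is:

$$dof(SS_{PE}) = nm - n$$

Knowing $SS_{PE}$, the sum of squares due to lack-of-fit, $SS_{LOF}$, can be obtained as:

$$SS_{LOF} = SS_E - SS_{PE}$$

The number of degrees of freedom associated with $SS_{LOF}$ is:

$$dof(SS_{LOF}) = dof(SS_E) - dof(SS_{PE}) = [nm - (k+1)] - (nm - n) = n - (k+1)$$

The test statistic for the lack-of-fit test is:

$$F_0 = \frac{SS_{LOF}/dof(SS_{LOF})}{SS_{PE}/dof(SS_{PE})} = \frac{MS_{LOF}}{MS_{PE}}$$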
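The lack-of-fit calculation can be sketched as follows. This is an illustration under stated assumptions rather than the DOE++ procedure: the function `lack_of_fit` is a hypothetical helper, the replicated data are made up, and a single predictor ($k = 1$) is used so that the least squares fit can be written in closed form. The resulting statistic would then be compared with a critical value from the $F$ distribution, which is not computed here.

```python
def lack_of_fit(replicates, fitted, k):
    """Return SS_PE, SS_LOF, their degrees of freedom and F0.

    replicates: one list of replicate responses per level
    fitted:     fitted value from the regression model at each level
    k:          number of predictor variables in the model
    """
    n_levels = len(replicates)
    n_obs = sum(len(group) for group in replicates)
    ss_pe = sum((y - sum(group) / len(group)) ** 2
                for group in replicates for y in group)
    ss_e = sum((y - f) ** 2
               for group, f in zip(replicates, fitted) for y in group)
    ss_lof = ss_e - ss_pe
    dof_pe = n_obs - n_levels
    dof_lof = n_levels - (k + 1)
    f0 = (ss_lof / dof_lof) / (ss_pe / dof_pe)
    return ss_pe, ss_lof, dof_pe, dof_lof, f0


# Hypothetical data: 4 levels of one predictor, 2 replicates per level.
levels = [1.0, 2.0, 3.0, 4.0]
replicates = [[1.0, 1.2], [2.1, 1.9], [2.8, 3.2], [4.5, 4.7]]

# Least squares line fitted to all observations.
xs = [x for x, group in zip(levels, replicates) for _ in group]
ys = [y for group in replicates for y in group]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in levels]

ss_pe, ss_lof, dof_pe, dof_lof, f0 = lack_of_fit(replicates, fitted, k=1)
print(f"SS_PE = {ss_pe:.4f} (dof {dof_pe}), "
      f"SS_LOF = {ss_lof:.4f} (dof {dof_lof}), F0 = {f0:.4f}")
```

A large value of the statistic relative to the $F$ critical value suggests that the variation of the level means about the fitted surface is more than pure error can account for, pointing to a higher order model.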