It is important to analyze the regression model before inferences based on the model are undertaken. The following sections present some techniques that can be used to check the appropriateness of the model for the given data. These techniques help to determine if any of the model assumptions have been violated.
Calculation of Lack-of-Fit Mean Square
The coefficient of determination ($R^2$) is a measure of the amount of variability in the data accounted for by the regression model. As mentioned previously, the total variability of the data is measured by the total sum of squares, $SS_T$. The amount of this variability explained by the regression model is the regression sum of squares, $SS_R$. The coefficient of determination is the ratio of the regression sum of squares to the total sum of squares:
$$R^2 = \frac{SS_R}{SS_T} \qquad (22)$$
$R^2$ can take on values between 0 and 1 since $0 \le SS_R \le SS_T$. For the yield data example, $R^2$ can be calculated as:
$$R^2 = \frac{SS_R}{SS_T} \approx 0.98$$
Therefore, 98% of the variability in the yield data is explained by the regression model, indicating a very good fit of the model. It may appear that larger values of $R^2$ indicate a better fitting regression model. However, $R^2$ should be used cautiously as this is not always the case. The value of $R^2$ increases as more terms are added to the model, even if the new term does not contribute significantly to the model. Therefore, an increase in the value of $R^2$ cannot be taken as a sign to conclude that the new model is superior to the older model. Adding a new term may make the regression model worse if the error mean square, $MS_E$, for the new model is larger than the $MS_E$ of the older model, even though the new model will show an increased value of $R^2$. In the results obtained from DOE++, $R^2$ is displayed as R-sq under the ANOVA table (as shown in Figure 4.12, which displays the complete analysis sheet for the data in Table 4.1).
The other values displayed with $R^2$ are S, R-sq(adj), PRESS and R-sq(pred). These values measure different aspects of the adequacy of the regression model. For example, the value of S is the square root of the error mean square, $MS_E$, and represents the "standard error of the model." A lower value of S indicates a better fitting model. The values of S, R-sq and R-sq(adj) indicate how well the model fits the observed data. The values of PRESS and R-sq(pred) are indicators of how well the regression model predicts new observations. R-sq(adj), PRESS and R-sq(pred) are explained in Chapter 5, Multiple Linear Regression Analysis.
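The two fit measures described above, R-sq and S, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fit_simple_linear` and `adequacy_measures` are hypothetical, the data are made up (they are not the yield data of Table 4.1), and the error mean square is taken as $SS_E/(n-2)$ for a simple linear regression.

```python
import math

def fit_simple_linear(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def adequacy_measures(x, y):
    """Return (R-sq, S) for a simple linear regression of y on x."""
    n = len(x)
    b0, b1 = fit_simple_linear(x, y)
    y_bar = sum(y) / n
    fitted = [b0 + b1 * xi for xi in x]
    ss_t = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # error sum of squares
    ss_r = ss_t - ss_e                                        # regression sum of squares
    r_sq = ss_r / ss_t
    ms_e = ss_e / (n - 2)                                     # error mean square
    return r_sq, math.sqrt(ms_e)

# Hypothetical data for illustration only (NOT the yield data of Table 4.1)
x_demo = [50, 60, 70, 80, 90, 100]
y_demo = [122, 139, 155, 181, 198, 215]
r_sq_demo, s_demo = adequacy_measures(x_demo, y_demo)
print(f"R-sq = {r_sq_demo:.4f}, S = {s_demo:.3f}")
```

Computing $SS_R$ as $SS_T - SS_E$ keeps the three sums of squares consistent with each other, which is why $R^2$ cannot leave the interval from 0 to 1 in this sketch.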
Figure 4.12: Complete analysis for the data in Table 4.1.
In the simple linear regression model the true error terms, $\epsilon_i$, are never known. The residuals, $e_i$, may be thought of as the observed error terms that are similar to the true error terms. Since the true error terms, $\epsilon_i$, are assumed to be normally distributed with a mean of zero and a variance of $\sigma^2$, in a good model the observed error terms (i.e. the residuals, $e_i$) should also follow these assumptions. Thus the residuals in the simple linear regression should be normally distributed with a mean of zero and a constant variance of $\sigma^2$. Residuals are usually plotted against the fitted values, $\hat{y}_i$, against the predictor variable values, $x_i$, and against time or run-order sequence, in addition to the normal probability plot. Plots of residuals are used to check for normality of the residuals, constant variance, linearity of the regression function, patterns in time or run-order sequence, and outliers.
Examples of residual plots are shown in Figure 4.13. The plot of Figure 4.13 (a) is a satisfactory plot with the residuals falling in a horizontal band with no systematic pattern. Such a plot indicates an appropriate regression model. The plot of Figure 4.13 (b) shows residuals falling in a funnel shape. Such a plot indicates an increase in the variance of the residuals, so the assumption of constant variance is violated here. A transformation on $y$ may be helpful in this case (see Chapter 4, Transformations). If the residuals follow the pattern of Figure 4.13 (c) or (d) then this is an indication that the linear regression model is not adequate. Addition of higher order terms to the regression model or a transformation on $x$ or $y$ may be required in such cases. A plot of residuals may also show a pattern as seen in Figure 4.13 (e), indicating that the residuals increase (or decrease) as the run order sequence or time progresses. This may be due to factors such as operator-learning or instrument-creep and should be investigated further.
Figure 4.13: Possible residual plots (against fitted values, time or run-order) that can be obtained from simple linear regression analysis.
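As a rough numerical companion to the residual plots, the sketch below computes the residuals of a simple linear fit and a crude indicator of the funnel shape of Figure 4.13 (b). It is an assumption-laden illustration, not a formal test: the helper names `fit_and_residuals` and `spread_ratio` are hypothetical, the data are synthetic with deliberately growing scatter, and the ratio simply compares the mean squared residual in the upper half of the fitted values with that in the lower half.

```python
def fit_and_residuals(x, y):
    """Fit y = b0 + b1*x by least squares; return (fitted values, residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid

def spread_ratio(fitted, resid):
    """Crude funnel indicator: mean squared residual for the largest fitted
    values divided by that for the smallest fitted values.
    A value far above (or below) 1 hints at non-constant variance."""
    pairs = sorted(zip(fitted, resid))          # order residuals by fitted value
    half = len(pairs) // 2
    ms_low = sum(e * e for _, e in pairs[:half]) / half
    ms_high = sum(e * e for _, e in pairs[-half:]) / half
    return ms_high / ms_low

# Synthetic data (hypothetical): the scatter grows with x to mimic a funnel
x_demo = list(range(1, 11))
y_demo = [1 + 2 * xi + ((-1) ** (xi + 1)) * 0.1 * xi for xi in x_demo]

fitted_demo, resid_demo = fit_and_residuals(x_demo, y_demo)
print("sum of residuals:", sum(resid_demo))
print("spread ratio:", spread_ratio(fitted_demo, resid_demo))
```

The residuals of a least-squares fit with an intercept always sum to zero, so a mean of zero by itself says nothing about model adequacy; the plots (or a summary such as the ratio above) are what reveal patterns.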
Example 4.4
Residual plots for the data of Table 4.1 are shown in Figures 4.14 to 4.16. Figure 4.14 is the normal probability plot. It can be observed that the residuals follow the normal distribution and the assumption of normality is valid here. In Figure 4.15 the residuals are plotted against the fitted values, $\hat{y}_i$, and in Figure 4.16 the residuals are plotted against the run order. Both of these plots show that the 21st observation seems to be an outlier. Further investigations are needed to study the cause of this outlier.
Figure 4.14: Normal probability plot of residuals for the data in Table 4.1.
Figure 4.15: Plot of residuals against fitted values for the data in Table 4.1.
Figure 4.16: Plot of residuals against run order for the data in Table 4.1.
As mentioned in Chapter 4, Analysis of Variance Approach to Test the Significance of Regression, a perfect regression model results in a fitted line that passes exactly through all observed data points. This perfect model will give us a zero error sum of squares ($SS_E = 0$). Thus, no error exists for the perfect model. However, if you record the response values for the same values of $x_i$ for a second time, in conditions maintained as strictly identical as possible to the first time, observations from the second time will not all fall along the perfect model. The deviations in observations recorded for the second time constitute the "purely" random variation or noise. The sum of squares due to pure error (abbreviated $SS_{PE}$) quantifies these variations. $SS_{PE}$ is calculated by taking repeated observations at some or all values of $x_i$ and adding up the square of deviations at each level of $x$ using the respective repeated observations at that $x$ value. Assume that there are $n$ levels of $x$ and $m_i$ repeated observations are taken at each $i$th level. The data is collected as shown next:
$$\begin{array}{ll}
y_{11}, y_{12}, \ldots, y_{1m_1} & \text{repeated observations at } x_1 \\
y_{21}, y_{22}, \ldots, y_{2m_2} & \text{repeated observations at } x_2 \\
\vdots & \\
y_{i1}, y_{i2}, \ldots, y_{im_i} & \text{repeated observations at } x_i \\
\vdots & \\
y_{n1}, y_{n2}, \ldots, y_{nm_n} & \text{repeated observations at } x_n
\end{array}$$
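The sum of squares of the deviations from the mean of the observations at the $i$th level of $x$, $x_i$, can be calculated as:
$$\sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2$$
where $\bar{y}_i$ is the mean of the $m_i$ repeated observations corresponding to $x_i$ ($\bar{y}_i = (1/m_i)\sum_{j=1}^{m_i} y_{ij}$). The number of degrees of freedom for these deviations is ($m_i - 1$) as there are $m_i$ observations at the $i$th level of $x$ but one degree of freedom is lost in calculating the mean, $\bar{y}_i$.
The total sum of square deviations (or $SS_{PE}$) for all levels of $x$ can be obtained by summing the deviations for all $x_i$ as shown next:
$$SS_{PE} = \sum_{i=1}^{n} \sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2 \qquad (23)$$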
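The total number of degrees of freedom associated with $SS_{PE}$ is:
$$\mathrm{dof}(SS_{PE}) = \sum_{i=1}^{n} (m_i - 1) = \sum_{i=1}^{n} m_i - n$$
If all $m_i = m$ (i.e. $m$ repeated observations are taken at all levels of $x$), then $\sum_{i=1}^{n} m_i = nm$ and the degrees of freedom associated with $SS_{PE}$ are:
$$\mathrm{dof}(SS_{PE}) = nm - n$$
The corresponding mean square in this case will be:
$$MS_{PE} = \frac{SS_{PE}}{nm - n} \qquad (24)$$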
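The pure error calculation can be sketched in Python as follows. This is a minimal illustration, assuming the repeated observations are stored as one list per level of $x$; the function name `pure_error` and the numbers are hypothetical. The degrees of freedom are accumulated as $\sum (m_i - 1)$, so the sketch also covers unequal numbers of repeated observations.

```python
def pure_error(groups):
    """Return (SS_PE, dof, MS_PE) for repeated observations.

    groups: list of lists; groups[i] holds the m_i observations
            recorded at the i-th level of x.
    """
    ss_pe = 0.0
    dof = 0
    for obs in groups:
        m_i = len(obs)
        y_bar_i = sum(obs) / m_i                        # mean at this level
        ss_pe += sum((y - y_bar_i) ** 2 for y in obs)   # squared deviations
        dof += m_i - 1                                  # one dof lost per mean
    return ss_pe, dof, ss_pe / dof

# Hypothetical repeated observations at n = 3 levels with m = 2 each
groups_demo = [[10, 12], [15, 13], [20, 24]]
ss_pe_demo, dof_pe_demo, ms_pe_demo = pure_error(groups_demo)
print(ss_pe_demo, dof_pe_demo, ms_pe_demo)
```

With $n = 3$ and $m = 2$ the degrees of freedom equal $nm - n = 3$, in agreement with the expression used in Eqn. (24).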
When repeated observations are used for a perfect regression model, the sum of squares due to pure error, $SS_{PE}$, is also considered as the error sum of squares, $SS_E$. For the case when repeated observations are used with imperfect regression models, there are two components of the error sum of squares, $SS_E$. One portion is the pure error due to the repeated observations. The other portion is the error that represents variation not captured because of the imperfect model. The second portion is termed the sum of squares due to lack-of-fit (abbreviated $SS_{LOF}$) to point to the deficiency in fit due to departure from the perfect-fit model. Thus, for an imperfect regression model:
$$SS_E = SS_{PE} + SS_{LOF} \qquad (25)$$
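Knowing $SS_E$ and $SS_{PE}$, the previous equation can be used to obtain $SS_{LOF}$:
$$SS_{LOF} = SS_E - SS_{PE}$$
The degrees of freedom associated with $SS_{LOF}$ can be obtained in a similar manner using subtraction. For the case when $m$ repeated observations are taken at all levels of $x$, the number of degrees of freedom associated with $SS_{PE}$ is:
$$\mathrm{dof}(SS_{PE}) = nm - n$$
Since there are $nm$ total observations, the number of degrees of freedom associated with $SS_E$ is:
$$\mathrm{dof}(SS_E) = nm - 2$$
Therefore, the number of degrees of freedom associated with $SS_{LOF}$ is:
$$\mathrm{dof}(SS_{LOF}) = (nm - 2) - (nm - n) = n - 2$$
The corresponding mean square, $MS_{LOF}$, can now be obtained as:
$$MS_{LOF} = \frac{SS_{LOF}}{n - 2} \qquad (26)$$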
The magnitude of $SS_{LOF}$ or $MS_{LOF}$ will provide an indication of how far the regression model is from the perfect model. An $F$ test exists to examine the lack-of-fit at a particular significance level.
The quantity $MS_{LOF}/MS_{PE}$ follows an $F$ distribution with $(n - 2)$ degrees of freedom in the numerator and $(nm - n)$ degrees of freedom in the denominator when all $m_i$ equal $m$. The test statistic for the lack-of-fit test is:
$$F_0 = \frac{MS_{LOF}}{MS_{PE}}$$
If the critical value $f_{\alpha, n-2, nm-n}$ is such that:
$$F_0 > f_{\alpha, n-2, nm-n}$$
it will lead to the rejection of the hypothesis that the model adequately fits the data.
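The steps of the lack-of-fit test can be put together in a short Python sketch. Everything specific here is an assumption made for illustration: the function name `lack_of_fit`, the returned dictionary keys and the data are hypothetical, and the critical value is passed in by hand from an $F$ table rather than computed, so that only the standard library is needed.

```python
def lack_of_fit(levels, groups):
    """Split SS_E into SS_PE and SS_LOF for a simple linear regression.

    levels: the n distinct x values.
    groups: groups[i] holds the repeated responses recorded at levels[i].
    """
    x = [xi for xi, obs in zip(levels, groups) for _ in obs]
    y = [yi for obs in groups for yi in obs]
    total, n = len(y), len(levels)

    # Least-squares fit of y = b0 + b1*x using every observation
    x_bar, y_bar = sum(x) / total, sum(y) / total
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    # Error sum of squares about the fitted line
    ss_e = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Pure error: deviations from the mean at each level
    ss_pe = sum(
        sum((yi - sum(obs) / len(obs)) ** 2 for yi in obs) for obs in groups
    )
    ss_lof = ss_e - ss_pe

    dof_pe = total - n   # equals nm - n when every level has m observations
    dof_lof = n - 2
    ms_pe = ss_pe / dof_pe
    ms_lof = ss_lof / dof_lof
    return {
        "ss_e": ss_e, "ss_pe": ss_pe, "ss_lof": ss_lof,
        "dof_pe": dof_pe, "dof_lof": dof_lof,
        "ms_pe": ms_pe, "ms_lof": ms_lof, "f0": ms_lof / ms_pe,
    }

# Hypothetical data: n = 4 levels of x, m = 2 observations per level
levels_demo = [1, 2, 3, 4]
groups_demo = [[2.1, 1.9], [4.6, 4.4], [5.4, 5.6], [8.1, 7.9]]
result_demo = lack_of_fit(levels_demo, groups_demo)

# Critical value f(0.05; 2, 4) taken from an F table (about 6.94)
f_crit_demo = 6.94
reject_demo = result_demo["f0"] > f_crit_demo
print(result_demo)
print("reject adequacy of the linear model:", reject_demo)
```

In this made-up data set the level means deviate from a straight line by much more than the scatter within each level, so the statistic exceeds the tabulated critical value and the hypothesis of an adequate fit would be rejected.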
Example 4.5
Assume that a second set of observations is taken for the yield data of Table 4.1. The resulting observations are recorded in Table 4.2. To conduct a lack-of-fit test on this data, the statistic $F_0 = MS_{LOF}/MS_{PE}$ can be calculated as shown next.
Table 4.2: Yield data from the first and second observation sets for the chemical process example in Chapter 4.1.
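The parameters of the fitted regression model, $\hat{\beta}_0$ and $\hat{\beta}_1$, can be obtained using Eqns. (3) and (2). Knowing $\hat{\beta}_0$ and $\hat{\beta}_1$, the fitted values, $\hat{y}_i$, can be calculated. Using the fitted values, the sums of squares can be obtained; in particular, the error sum of squares is:
$$SS_E = \sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - \hat{y}_i)^2$$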
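The error sum of squares, $SS_E$, can now be split into the sum of squares due to pure error, $SS_{PE}$, and the sum of squares due to lack-of-fit, $SS_{LOF}$. $SS_{PE}$ can be calculated as follows, considering that in this example $m = 2$ repeated observations are available at each of the $n$ levels of $x$:
$$SS_{PE} = \sum_{i=1}^{n} \sum_{j=1}^{2} (y_{ij} - \bar{y}_i)^2$$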
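The number of degrees of freedom associated with $SS_{PE}$ is:
$$\mathrm{dof}(SS_{PE}) = nm - n$$
The corresponding mean square, $MS_{PE}$, can now be obtained as:
$$MS_{PE} = \frac{SS_{PE}}{nm - n}$$
$SS_{LOF}$ can be obtained by subtraction from $SS_E$ as:
$$SS_{LOF} = SS_E - SS_{PE}$$
Similarly, the number of degrees of freedom associated with $SS_{LOF}$ is:
$$\mathrm{dof}(SS_{LOF}) = \mathrm{dof}(SS_E) - \mathrm{dof}(SS_{PE}) = n - 2$$
The lack-of-fit mean square is:
$$MS_{LOF} = \frac{SS_{LOF}}{n - 2}$$
The test statistic for the lack-of-fit test can now be calculated as:
$$F_0 = \frac{MS_{LOF}}{MS_{PE}}$$
The critical value for this test is:
$$f_{0.05,\, n-2,\, nm-n}$$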
Since $F_0 < f_{0.05,\, n-2,\, nm-n}$, we fail to reject the hypothesis that the model adequately fits the data. The $p$ value for this case is:
$$p \text{ value} = 1 - P(F \le F_0)$$
Therefore, at a significance level of 0.05 we conclude that the simple linear regression model, $y = \beta_0 + \beta_1 x + \epsilon$, is adequate for the observed data. Table 4.3 presents a summary of the ANOVA calculations for the lack-of-fit test.
Table 4.3: ANOVA table for the lack-of-fit test of the yield data example.
Confidence Intervals in Simple Linear Regression