The following sections discuss hypothesis tests on the regression coefficients in simple linear regression. These tests can be carried out if it can be assumed that the random error term, $\epsilon$, is normally and independently distributed with a mean of zero and a variance of $\sigma^2$.
The $t$ tests are used to conduct hypothesis tests on the regression coefficients obtained in simple linear regression. A statistic based on the $t$ distribution is used to test the two-sided hypothesis that the true slope, $\beta_1$, equals some constant value, $\beta_{1,0}$.
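The statements for the hypothesis test are expressed as:
$$H_0: \beta_1 = \beta_{1,0}$$
$$H_1: \beta_1 \neq \beta_{1,0}$$
The test statistic used for this test is:
$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)} \qquad (6)$$
where $\hat{\beta}_1$ is the least squares estimate of $\beta_1$, and $se(\hat{\beta}_1)$ is its standard error. The value of $se(\hat{\beta}_1)$ can be calculated as follows, where $e_i = y_i - \hat{y}_i$ are the residuals:
$$se(\hat{\beta}_1) = \sqrt{\frac{\sum_{i=1}^{n} e_i^2 / (n-2)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \qquad (7)$$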
The test statistic, $T_0$, follows a $t$ distribution with $(n-2)$ degrees of freedom, where $n$ is the total number of observations. The null hypothesis, $H_0$, is rejected if the calculated value of the test statistic is such that:
$$T_0 < -t_{\alpha/2,\,n-2} \quad \text{or} \quad T_0 > t_{\alpha/2,\,n-2}$$
where $t_{\alpha/2,\,n-2}$ and $-t_{\alpha/2,\,n-2}$ are the critical values for the two-sided hypothesis. $t_{\alpha/2,\,n-2}$ is the percentile of the $t$ distribution corresponding to a cumulative probability of $(1-\alpha/2)$, and $\alpha$ is the significance level.
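The slope test can be sketched in a few lines of Python. This is only an illustration: the data are hypothetical (they are not the Table 4.1 data), the function name `slope_t_test` is made up for this sketch, the least squares estimates use the usual closed-form expressions, and the critical value is read from a standard $t$ table rather than computed.

```python
import math

def slope_t_test(x, y, beta1_0=0.0):
    """Return (beta1_hat, se_beta1, t0) for H0: beta1 = beta1_0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx                 # least squares slope
    beta0_hat = y_bar - beta1_hat * x_bar   # least squares intercept
    # Sum of squared residuals e_i = y_i - y_hat_i
    ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
    se_beta1 = math.sqrt((ss_e / (n - 2)) / s_xx)   # standard error, Eqn. (7)
    t0 = (beta1_hat - beta1_0) / se_beta1           # test statistic, Eqn. (6)
    return beta1_hat, se_beta1, t0

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta1_hat, se_beta1, t0 = slope_t_test(x, y)
t_crit = 2.353  # t_{0.05, 3} from a t table: alpha = 0.1, n - 2 = 3
reject_h0 = abs(t0) > t_crit
print(f"beta1_hat = {beta1_hat:.4f}, se = {se_beta1:.4f}, t0 = {t0:.2f}, reject H0: {reject_h0}")
```

Passing a nonzero `beta1_0` tests the slope against any other hypothesized constant.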
If the value of $\beta_{1,0}$ used in Eqn. (6) is zero, then the hypothesis tests for the significance of regression. In other words, the test indicates whether the fitted regression model is of value in explaining variations in the observations or whether you are trying to impose a regression model when no true relationship exists between $x$ and $y$. Failure to reject $H_0: \beta_1 = 0$ implies that no linear relationship exists between $x$ and $y$. This result may be obtained when the scatter plots of $y$ against $x$ are as shown in Figure 4.6 (a) and (b). Figure 4.6 (a) represents the case where no model exists for the observed data. In this case you would be trying to fit a regression model to noise or random variation. Figure 4.6 (b) represents the case where the true relationship between $x$ and $y$ is not linear. Figure 4.6 (c) and (d) represent the case when $H_0: \beta_1 = 0$ is rejected, implying that a model does exist between $x$ and $y$. Figure 4.6 (c) represents the case where the linear model is sufficient. Figure 4.6 (d) represents the case where a higher order model may be needed.
Figure 4.6: Possible scatter plots of $y$ against $x$. Plots (a) and (b) represent cases when $H_0: \beta_1 = 0$ is not rejected. Plots (c) and (d) represent cases when $H_0: \beta_1 = 0$ is rejected.
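A similar procedure can be used to test the hypothesis on the intercept, $\beta_0$. The test statistic used in this case is:
$$T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{se(\hat{\beta}_0)} \qquad (8)$$
where $\hat{\beta}_0$ is the least squares estimate of $\beta_0$, $\beta_{0,0}$ is the hypothesized value of the intercept, and $se(\hat{\beta}_0)$ is the standard error of the estimate, which is calculated using:
$$se(\hat{\beta}_0) = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n-2}\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]} \qquad (9)$$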
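The intercept test follows the same pattern. The sketch below again uses hypothetical data and a made-up function name, `intercept_t_test`, and takes the critical value from a $t$ table. It is meant to show how Eqns. (8) and (9) fit together, not to reproduce any particular software output.

```python
import math

def intercept_t_test(x, y, beta0_0=0.0):
    """Return (beta0_hat, se_beta0, t0) for H0: beta0 = beta0_0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
    # Standard error of the intercept, Eqn. (9)
    se_beta0 = math.sqrt((ss_e / (n - 2)) * (1.0 / n + x_bar ** 2 / s_xx))
    t0 = (beta0_hat - beta0_0) / se_beta0   # test statistic, Eqn. (8)
    return beta0_hat, se_beta0, t0

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0_hat, se_beta0, t0 = intercept_t_test(x, y)
t_crit = 2.353  # t_{0.05, 3} from a t table: alpha = 0.1, n - 2 = 3
reject_h0 = abs(t0) > t_crit
print(f"beta0_hat = {beta0_hat:.4f}, se = {se_beta0:.4f}, t0 = {t0:.3f}, reject H0: {reject_h0}")
```

For these data the intercept estimate is small relative to its standard error, so $H_0: \beta_0 = 0$ is not rejected at a significance level of 0.1.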
Example 4.1
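The test for the significance of regression for the data in Table 4.1 is illustrated in this example. The test is carried out using the $t$ test on the coefficient $\beta_1$. The hypothesis to be tested is $H_0: \beta_1 = 0$. To calculate the statistic to test $H_0$, the estimate, $\hat{\beta}_1$, and the standard error, $se(\hat{\beta}_1)$, are needed. The value of $\hat{\beta}_1$ was obtained in Chapter 4, Fitted Regression Line. The standard error can be calculated using Eqn. (7) as follows:
$$se(\hat{\beta}_1) = \sqrt{\frac{\sum_{i=1}^{n} e_i^2 / (n-2)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
Then, the test statistic can be calculated using the following equation:
$$t_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)} = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)}$$
The $p$ value corresponding to this statistic based on the $t$ distribution with $n - 2 = 23$ degrees of freedom can be obtained as follows:
$$p \text{ value} = 2 \times \left(1 - P(T \le |t_0|)\right)$$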
Assuming that the desired significance level is 0.1, since the $p$ value < 0.1, $H_0: \beta_1 = 0$ is rejected, indicating that a relation exists between temperature and yield for the data in Table 4.1. Using this result along with the scatter plot of Figure 4.2, it can be concluded that the relationship between temperature and yield is linear.
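The two-sided $p$ value, $2 \times (1 - P(T \le |t_0|))$, can be approximated without a statistics library. The sketch below is one possible approach, not the method used by DOE++: it integrates the $t$ density numerically with Simpson's rule. The function names and the number of integration steps are arbitrary choices made for this illustration.

```python
import math

def t_pdf(t, dof):
    """Density of the t distribution with dof degrees of freedom."""
    coef = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return coef * (1 + t * t / dof) ** (-(dof + 1) / 2)

def two_sided_p_value(t0, dof, steps=10_000):
    """Approximate 2 * (1 - P(T <= |t0|)) using Simpson's rule on [0, |t0|].

    steps must be even for Simpson's rule.
    """
    b = abs(t0)
    h = b / steps
    total = t_pdf(0.0, dof) + t_pdf(b, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_area = total * h / 3            # P(0 <= T <= |t0|)
    # By symmetry, the two tails together hold 1 - 2 * central_area
    return max(0.0, 1.0 - 2.0 * central_area)

# 1.714 is the tabulated t_{0.05, 23}, so the result should be close to 0.1
p = two_sided_p_value(1.714, 23)
print(f"p value = {p:.4f}")
```

A calculated statistic larger in magnitude than the tabulated critical value gives a $p$ value below the significance level, which is the rejection rule used in the example.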
In DOE++, information related to the $t$ test is displayed in the Regression Information table as shown in Figure 4.7. In this table the $t$ test for $\beta_1$ is displayed in the row for the term Temperature because $\beta_1$ is the coefficient that represents the variable temperature in the regression model. The columns labeled Standard Error, T Value and P Value represent the standard error, the test statistic for the $t$ test and the $p$ value for the $t$ test, respectively. These values have been calculated for $\beta_1$ in this example. The Coefficient column represents the estimate of regression coefficients. For $\beta_1$, this value was calculated using Eqn. (2). The Effect column represents values obtained by multiplying the coefficients by a factor of 2. This value is useful in the case of two factor experiments and is explained in Chapter 7, Two Level Factorial Experiments. Columns Low CI and High CI represent the limits of the confidence intervals for the regression coefficients and are explained in Chapter 4, Confidence Interval on Regression Coefficients. The Variance Inflation Factor column displays values that give a measure of multicollinearity. The concept of multicollinearity is only applicable to multiple linear regression models and is explained in Chapter 5, Multiple Linear Regression Analysis.
Figure 4.7: Regression results for the data in Table 4.1.
The analysis of variance (ANOVA) is another method to test for the significance of regression. As the name implies, this approach uses the variance of the observed data to determine if a regression model can be applied to the observed data. The observed variance is partitioned into components that are then used in the test for significance of regression.
The total variance (i.e., the variance of all of the observed data) is estimated using the observed data. As mentioned in Chapter 3, Statistical Background, the variance of a population can be estimated using the sample variance, which is calculated using the following relationship:
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$$
The quantity in the numerator of the previous equation is called the sum of squares. It is the sum of the squared deviations of all the observations, $y_i$, from their mean, $\bar{y}$. In the context of ANOVA this quantity is called the total sum of squares (abbreviated $SS_T$) because it relates to the total variance of the observations. Thus:
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (10)$$
The denominator in the relationship of the sample variance is the number of degrees of freedom associated with the sample variance. Therefore, the number of degrees of freedom associated with $SS_T$, $dof(SS_T)$, is $(n-1)$.
The sample variance is also referred to as a mean square because it is obtained by dividing the sum of squares by the respective degrees of freedom. Therefore, the total mean square (abbreviated $MS_T$) is:
$$MS_T = \frac{SS_T}{dof(SS_T)} = \frac{SS_T}{n-1} \qquad (11)$$
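As a quick check of these definitions, the total sum of squares and total mean square can be computed directly and compared with the sample variance from Python's standard `statistics` module. The observations below are hypothetical and serve only to illustrate the arithmetic.

```python
import statistics

# Hypothetical observations, for illustration only
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(y)
y_bar = sum(y) / n
ss_t = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares, Eqn. (10)
dof_t = n - 1                               # degrees of freedom of SS_T
ms_t = ss_t / dof_t                         # total mean square, Eqn. (11)

# The total mean square is the ordinary sample variance of the observations
assert abs(ms_t - statistics.variance(y)) < 1e-12
print(f"SS_T = {ss_t:.4f}, dof = {dof_t}, MS_T = {ms_t:.4f}")
```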
When you attempt to fit a regression model to the observations, you are trying to explain some of the variation of the observations using this model. If the regression model is such that the resulting fitted regression line passes through all of the observations, then you would have a "perfect" model (see Figure 4.8 (a)). In this case the model would explain all of the variability of the observations. Therefore, the model sum of squares (also referred to as the regression sum of squares and abbreviated $SS_R$) equals the total sum of squares; i.e., the model explains all of the observed variance:
$$SS_R = SS_T$$
Figure 4.8: A perfect regression model will pass through all observed data points as shown in (a). Most models are imperfect and do not fit perfectly to all data points as shown in (b).
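In general, the regression sum of squares is obtained from the deviations of the fitted values, $\hat{y}_i$, from the mean of the observations, $\bar{y}$:
$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \qquad (12)$$
The number of degrees of freedom associated with $SS_R$, $dof(SS_R)$, is one.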
Based on the preceding discussion of ANOVA, a perfect regression model exists when the fitted regression line passes through all observed points. However, this is not usually the case, as seen in Figure 4.8 (b) or Figure 4.4. In both of these plots, a number of points do not follow the fitted regression line. This indicates that a part of the total variability of the observed data still remains unexplained. This portion of the total variability, or the total sum of squares, that is not explained by the model is called the residual sum of squares or the error sum of squares (abbreviated $SS_E$). The deviation for this sum of squares is obtained at each observation in the form of the residuals, $e_i$. The error sum of squares can be obtained as the sum of squares of these deviations:
$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (13)$$
The number of degrees of freedom associated with $SS_E$, $dof(SS_E)$, is $(n-2)$.
The total variability of the observed data (i.e., the total sum of squares, $SS_T$) can be written using the portion of the variability explained by the model, $SS_R$, and the portion unexplained by the model, $SS_E$, as:
$$SS_T = SS_R + SS_E \qquad (14)$$
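The above equation is also referred to as the analysis of variance identity and can be expanded as follows:
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (15)$$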
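The analysis of variance identity can be verified numerically for any least squares fit. The following sketch uses hypothetical data and a made-up helper name, `anova_sums_of_squares`. It computes the three sums of squares from their definitions and checks that the total equals the sum of the other two.

```python
def anova_sums_of_squares(x, y):
    """Return (ss_t, ss_r, ss_e) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    fitted = [beta0_hat + beta1_hat * xi for xi in x]
    ss_t = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ss_r = sum((fi - y_bar) ** 2 for fi in fitted)            # explained by the model
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual (unexplained)
    return ss_t, ss_r, ss_e

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ss_t, ss_r, ss_e = anova_sums_of_squares(x, y)
# The identity holds up to floating-point rounding
assert abs(ss_t - (ss_r + ss_e)) < 1e-9
print(f"SS_T = {ss_t:.4f}, SS_R = {ss_r:.4f}, SS_E = {ss_e:.4f}")
```

The identity holds only for the least squares line with an intercept. For an arbitrary line through the data the cross-product term does not vanish.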
The deviations for the three sums of squares are shown in Figure 4.9.
Figure 4.9: Scatter plots showing the deviations for the sums of squares used in ANOVA. (a) shows deviations for $SS_T$, (b) shows deviations for $SS_R$, and (c) shows deviations for $SS_E$.
As mentioned previously, mean squares are obtained by dividing the sum of squares by the respective degrees of freedom. For example, the error mean square, $MS_E$, can be obtained as:
$$MS_E = \frac{SS_E}{dof(SS_E)} = \frac{SS_E}{n-2} \qquad (16)$$
The error mean square is an estimate of the variance, $\sigma^2$, of the random error term, $\epsilon$, and can be written as:
$$\hat{\sigma}^2 = MS_E = \frac{SS_E}{n-2}$$
Similarly, the regression mean square, $MS_R$, can be obtained by dividing the regression sum of squares by the respective degrees of freedom as follows:
$$MS_R = \frac{SS_R}{dof(SS_R)} = \frac{SS_R}{1}$$
To test the hypothesis $H_0: \beta_1 = 0$, the statistic used is based on the $F$ distribution. It can be shown that if the null hypothesis $H_0$ is true, then the statistic:
$$F_0 = \frac{MS_R}{MS_E} = \frac{SS_R / 1}{SS_E / (n-2)} \qquad (17)$$
follows the $F$ distribution with $1$ degree of freedom in the numerator and $(n-2)$ degrees of freedom in the denominator. $H_0$ is rejected if the calculated statistic, $F_0$, is such that:
$$F_0 > f_{\alpha,\,1,\,n-2}$$
where $f_{\alpha,\,1,\,n-2}$ is the percentile of the $F$ distribution corresponding to a cumulative probability of $(1-\alpha)$, and $\alpha$ is the significance level.
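The $F$ test can be sketched in the same way as the $t$ test. The data and the function name `f_statistic` below are hypothetical, and the critical value is taken from a standard $F$ table rather than computed.

```python
def f_statistic(x, y):
    """Return (f0, dof_error) for the significance-of-regression F test."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    fitted = [beta0_hat + beta1_hat * xi for xi in x]
    ss_r = sum((fi - y_bar) ** 2 for fi in fitted)
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ms_r = ss_r / 1            # regression mean square, one degree of freedom
    ms_e = ss_e / (n - 2)      # error mean square, Eqn. (16)
    return ms_r / ms_e, n - 2  # F0 from Eqn. (17)

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

f0, dof_e = f_statistic(x, y)
f_crit = 5.54  # f_{0.1, 1, 3} from an F table: alpha = 0.1, dof = (1, 3)
reject_h0 = f0 > f_crit
print(f"F0 = {f0:.2f} with (1, {dof_e}) degrees of freedom, reject H0: {reject_h0}")
```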
Example 4.2
The analysis of variance approach to test the significance of regression can be applied to the yield data in Table 4.1. To calculate the statistic, $F_0$, for the test, the sums of squares have to be obtained. The sums of squares can be calculated as shown next.
The total sum of squares can be calculated as:
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
The regression sum of squares can be calculated as:
$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
The error sum of squares can be calculated as:
$$SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Knowing the sums of squares, the statistic to test $H_0: \beta_1 = 0$ can be calculated as follows:
$$f_0 = \frac{MS_R}{MS_E} = \frac{SS_R / 1}{SS_E / (n-2)}$$
The critical value at a significance level of 0.1 is $f_{0.1,\,1,\,23} \approx 2.94$. Since $f_0 > f_{0.1,\,1,\,23}$, $H_0: \beta_1 = 0$ is rejected and it is concluded that $\beta_1$ is not zero. Alternatively, the $p$ value can also be used. The $p$ value corresponding to the test statistic, $f_0$, based on the $F$ distribution with one degree of freedom in the numerator and 23 degrees of freedom in the denominator is:
$$p \text{ value} = 1 - P(F \le f_0)$$
Assuming that the desired significance level is 0.1, since the $p$ value < 0.1, $H_0: \beta_1 = 0$ is rejected, implying that a relation does exist between temperature and yield for the data in Table 4.1. Using this result along with the scatter plot of Figure 4.2, it can be concluded that the relationship that exists between temperature and yield is linear. This result is displayed in the ANOVA table as shown in Figure 4.10. Note that this is the same result that was obtained from the $t$ test in Chapter 4, Confidence Interval on Fitted Values. The ANOVA and Regression Information tables in DOE++ represent two different ways to test for the significance of the regression model. In the case of multiple linear regression models these tables are expanded to allow tests on individual variables used in the model. This is done using the extra sum of squares. Multiple linear regression models and the application of the extra sum of squares in the analysis of these models are discussed in Chapter 5, Multiple Linear Regression Analysis. The term Partial appearing in Figure 4.10 relates to the extra sum of squares and is also explained in Chapter 5.
Figure 4.10: ANOVA table for the data in Table 4.1.
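The agreement between the $F$ test and the $t$ test for the significance of regression is not specific to this data set. A short derivation, sketched here for the simple linear regression model with the hypothesized slope set to zero, shows that the $F$ statistic is the square of the $t$ statistic. The symbol $S_{xx}$ is introduced only as shorthand for this sketch and is not used elsewhere in the chapter.

```latex
\begin{align*}
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\hat{y}_i - \bar{y} &= \hat{\beta}_1 (x_i - \bar{x})
  \quad \text{since } \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
SS_R &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}_1^2 \, S_{xx} \\
F_0 &= \frac{MS_R}{MS_E} = \frac{\hat{\beta}_1^2 \, S_{xx}}{MS_E}
     = \left( \frac{\hat{\beta}_1}{\sqrt{MS_E / S_{xx}}} \right)^2
     = \left( \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} \right)^2 = T_0^2
\end{align*}
```

Because the critical values satisfy $f_{\alpha,\,1,\,n-2} = t_{\alpha/2,\,n-2}^2$, the two tests reach the same decision at any significance level in simple linear regression.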
See Also:
Simple Linear Regression Analysis
Confidence Intervals in Simple Linear Regression
Multiple Linear Regression Analysis
Two Level Factorial Experiments