Hypothesis Tests in Simple Linear Regression

The following sections discuss hypothesis tests on the regression coefficients in simple linear regression. These tests can be carried out if it can be assumed that the random error term, $\epsilon$, is normally and independently distributed with a mean of zero and a variance of $\sigma^2$.

t Tests

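The $t$ tests are used to conduct hypothesis tests on the regression coefficients obtained in simple linear regression. A statistic based on the $t$ distribution is used to test the two-sided hypothesis that the true slope, $\beta_1$, equals some constant value, $\beta_{1,0}$. The statements for the hypothesis test are expressed as:

$$H_0: \beta_1 = \beta_{1,0} \qquad H_1: \beta_1 \neq \beta_{1,0}$$

The test statistic used for this test is:

$$T_0 = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)} \tag{6}$$

where $\hat{\beta}_1$ is the least squares estimate of $\beta_1$, and $se(\hat{\beta}_1)$ is its standard error. The value of $se(\hat{\beta}_1)$ can be calculated as follows, where $e_i = y_i - \hat{y}_i$ are the residuals:

$$se(\hat{\beta}_1) = \sqrt{\frac{\sum_{i=1}^{n} e_i^2 / (n-2)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \tag{7}$$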

The test statistic, $T_0$, follows a $t$ distribution with $(n-2)$ degrees of freedom, where $n$ is the total number of observations. The null hypothesis, $H_0$, is rejected if the calculated value of the test statistic is such that:

$$T_0 < -t_{\alpha/2,n-2} \quad \text{or} \quad T_0 > t_{\alpha/2,n-2}$$

where $t_{\alpha/2,n-2}$ and $-t_{\alpha/2,n-2}$ are the critical values for the two-sided hypothesis. $t_{\alpha/2,n-2}$ is the percentile of the $t$ distribution corresponding to a cumulative probability of $(1-\alpha/2)$, and $\alpha$ is the significance level.
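The slope test can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The function name `slope_t_statistic`, the five data points, and the tabulated critical value are assumptions made for the example; they are not taken from Table 4.1 or from DOE++.

```python
import math

def slope_t_statistic(x, y, beta1_null=0.0):
    """Return (beta1_hat, se_beta1, t0) for H0: beta1 = beta1_null."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx                  # least squares slope
    beta0_hat = y_bar - beta1_hat * x_bar    # least squares intercept
    # Sum of squared residuals, sum of e_i^2
    ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
    se_beta1 = math.sqrt((ss_e / (n - 2)) / s_xx)   # standard error, Eqn. (7)
    t0 = (beta1_hat - beta1_null) / se_beta1        # test statistic, Eqn. (6)
    return beta1_hat, se_beta1, t0

# Hypothetical observations (assumption: NOT the data of Table 4.1)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta1_hat, se_beta1, t0 = slope_t_statistic(x, y)

# Critical value t_{0.05,3} for alpha = 0.1 and n - 2 = 3, read from a t table
t_crit = 2.353
reject_h0 = abs(t0) > t_crit
print(beta1_hat, se_beta1, t0, reject_h0)
```

Passing a nonzero `beta1_null` tests the slope against any other constant; the default of zero corresponds to the test for significance of regression.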

 

If the value of $\beta_{1,0}$ used in Eqn. (6) is zero, then the hypothesis tests for the significance of regression. In other words, the test indicates whether the fitted regression model is of value in explaining variations in the observations or whether you are trying to impose a regression model when no true relationship exists between $x$ and $Y$. Failure to reject $H_0: \beta_1 = 0$ implies that no linear relationship exists between $x$ and $Y$. This result may be obtained when the scatter plots of $Y$ against $x$ are as shown in Figure 4.6 (a) and (b). Figure 4.6 (a) represents the case where no model exists for the observed data. In this case you would be trying to fit a regression model to noise or random variation. Figure 4.6 (b) represents the case where the true relationship between $x$ and $Y$ is not linear. Figure 4.6 (c) and (d) represent the case when $H_0: \beta_1 = 0$ is rejected, implying that a model does exist between $x$ and $Y$. Figure 4.6 (c) represents the case where the linear model is sufficient. Figure 4.6 (d) represents the case where a higher order model may be needed.

Figure 4.6: Possible scatter plots of $Y$ against $x$. Plots (a) and (b) represent cases when $H_0: \beta_1 = 0$ is not rejected. Plots (c) and (d) represent cases when $H_0: \beta_1 = 0$ is rejected.

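A similar procedure can be used to test the hypothesis on the intercept, $\beta_0$. The test statistic used in this case is:

$$T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{se(\hat{\beta}_0)} \tag{8}$$

where $\hat{\beta}_0$ is the least squares estimate of $\beta_0$, $\beta_{0,0}$ is the hypothesized value of the intercept, and $se(\hat{\beta}_0)$ is the standard error of the estimate, which is calculated using:

$$se(\hat{\beta}_0) = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n-2}\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]} \tag{9}$$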
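The intercept test follows the same pattern as the slope test. The sketch below is illustrative only: `intercept_t_statistic` is a hypothetical helper, the data are made up, and the critical value is read from a $t$ table for $\alpha = 0.1$ and 3 degrees of freedom.

```python
import math

def intercept_t_statistic(x, y, beta0_null=0.0):
    """Return (beta0_hat, se_beta0, t0) for H0: beta0 = beta0_null."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    # Error mean square: sum of squared residuals over (n - 2)
    ss_e = sum((yi - (beta0_hat + beta1_hat * xi)) ** 2 for xi, yi in zip(x, y))
    ms_e = ss_e / (n - 2)
    se_beta0 = math.sqrt(ms_e * (1.0 / n + x_bar ** 2 / s_xx))  # Eqn. (9)
    t0 = (beta0_hat - beta0_null) / se_beta0                     # Eqn. (8)
    return beta0_hat, se_beta0, t0

# Hypothetical observations (assumption: NOT the data of Table 4.1)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0_hat, se_beta0, t0 = intercept_t_statistic(x, y)

t_crit = 2.353                      # t_{0.05,3} from a t table
reject_h0 = abs(t0) > t_crit        # False here: the intercept is not distinguishable from zero
print(beta0_hat, se_beta0, t0, reject_h0)
```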

 

Example 4.1

 

The test for the significance of regression for the data in Table 4.1 is illustrated in this example. The test is carried out using the $t$ test on the coefficient $\beta_1$. The hypothesis to be tested is $H_0: \beta_1 = 0$. To calculate the statistic to test $H_0$, the estimate, $\hat{\beta}_1$, and the standard error, $se(\hat{\beta}_1)$, are needed. The value of $\hat{\beta}_1$ was obtained in Chapter 4, Fitted Regression Line. The standard error can be calculated using Eqn. (7) as follows:

$$se(\hat{\beta}_1) = \sqrt{\frac{\sum_{i=1}^{n} e_i^2 / (n-2)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

Then, the test statistic can be calculated using the following equation:

$$t_0 = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)}$$

The $p$ value corresponding to this statistic, based on the $t$ distribution with $(n-2) = 23$ degrees of freedom, can be obtained as follows:

$$p \text{ value} = 2\left(1 - P(T \leq |t_0|)\right)$$

Assuming that the desired significance level is 0.1, since the $p$ value < 0.1, $H_0: \beta_1 = 0$ is rejected, indicating that a relation exists between temperature and yield for the data in Table 4.1. Using this result along with the scatter plot of Figure 4.2, it can be concluded that the relationship between temperature and yield is linear.
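When no statistical library is at hand, the two-sided $p$ value of a $t$ statistic can be approximated by integrating the $t$ density numerically. The following is a rough sketch under stated assumptions: the helper names are hypothetical, composite Simpson integration is a choice made for this example (not the method used by DOE++), and the statistic and degrees of freedom passed in at the end are illustrative values rather than the results for Table 4.1.

```python
import math

def t_pdf(t, dof):
    """Density of the t distribution with `dof` degrees of freedom."""
    log_coeff = math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
    coeff = math.exp(log_coeff) / math.sqrt(dof * math.pi)
    return coeff * (1 + t * t / dof) ** (-(dof + 1) / 2)

def two_sided_p_value(t0, dof, steps=2000):
    """Approximate 2 * (1 - P(T <= |t0|)) with composite Simpson integration."""
    upper = abs(t0)
    if upper == 0:
        return 1.0
    h = upper / steps                     # `steps` must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for k in range(1, steps):
        weight = 4 if k % 2 else 2
        total += weight * t_pdf(k * h, dof)
    area = total * h / 3                  # P(0 <= T <= |t0|)
    return max(0.0, 1.0 - 2.0 * area)     # the density is symmetric about zero

# Illustrative values only (assumption): a statistic of 33.32 with 3 degrees of freedom
p_value = two_sided_p_value(33.32, 3)
alpha = 0.1
print(p_value, p_value < alpha)
```

A useful sanity check is the case of one degree of freedom, where the $t$ distribution is the Cauchy distribution and a statistic of 1 gives a two-sided $p$ value of exactly 0.5.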

 

In DOE++, information related to the $t$ test is displayed in the Regression Information table as shown in Figure 4.7. In this table the $t$ test for $\beta_1$ is displayed in the row for the term Temperature because $\beta_1$ is the coefficient that represents the variable temperature in the regression model. The columns labeled Standard Error, T Value and P Value represent the standard error, the test statistic for the $t$ test and the $p$ value for the $t$ test, respectively. These values have been calculated for $\beta_1$ in this example. The Coefficient column represents the estimate of regression coefficients. For $\beta_1$, this value was calculated using Eqn. (2). The Effect column represents values obtained by multiplying the coefficients by a factor of 2. This value is useful in the case of two factor experiments and is explained in Chapter 7, Two Level Factorial Experiments. Columns Low CI and High CI represent the limits of the confidence intervals for the regression coefficients and are explained in Chapter 4, Confidence Interval on Regression Coefficients. The Variance Inflation Factor column displays values that give a measure of multicollinearity. The concept of multicollinearity is only applicable to multiple linear regression models and is explained in Chapter 5, Multiple Linear Regression Analysis.

Figure 4.7: Regression results for the data in Table 4.1.

 

Analysis of Variance Approach to Test the Significance of Regression

The analysis of variance (ANOVA) is another method to test for the significance of regression. As the name implies, this approach uses the variance of the observed data to determine if a regression model can be applied to the observed data. The observed variance is partitioned into components that are then used in the test for significance of regression.

Sum of Squares

The total variance (i.e., the variance of all of the observed data) is estimated using the observed data. As mentioned in Chapter 3, Statistical Background, the variance of a population can be estimated using the sample variance, which is calculated using the following relationship:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$$

The quantity in the numerator of the previous equation is called the sum of squares. It is the sum of the squared deviations of all the observations, $y_i$, from their mean, $\bar{y}$. In the context of ANOVA this quantity is called the total sum of squares (abbreviated $SS_T$) because it relates to the total variance of the observations. Thus:

$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{10}$$

The denominator in the relationship of the sample variance is the number of degrees of freedom associated with the sample variance. Therefore, the number of degrees of freedom associated with $SS_T$, $dof(SS_T)$, is $(n-1)$. The sample variance is also referred to as a mean square because it is obtained by dividing the sum of squares by the respective degrees of freedom. Therefore, the total mean square (abbreviated $MS_T$) is:

$$MS_T = \frac{SS_T}{dof(SS_T)} = \frac{SS_T}{n-1} \tag{11}$$
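The equivalence between the total mean square and the sample variance is easy to confirm numerically. The short sketch below uses made-up observations (an assumption, not the data of Table 4.1) and compares the hand-computed value with `statistics.variance` from the Python standard library, which also divides by $(n-1)$.

```python
import statistics

def total_sum_of_squares(y):
    """Sum of squared deviations of the observations from their mean, Eqn. (10)."""
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y)

# Hypothetical observations (assumption: NOT the data of Table 4.1)
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ss_t = total_sum_of_squares(y)
dof_t = len(y) - 1                  # degrees of freedom of SS_T
ms_t = ss_t / dof_t                 # total mean square, Eqn. (11)

sample_variance = statistics.variance(y)   # uses the (n - 1) denominator
print(ss_t, ms_t, sample_variance)
```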

When you attempt to fit a regression model to the observations, you are trying to explain some of the variation of the observations using this model. If the regression model is such that the resulting fitted regression line passes through all of the observations, then you would have a "perfect" model (see Figure 4.8 (a)). In this case the model would explain all of the variability of the observations. Therefore, the model sum of squares (also referred to as the regression sum of squares and abbreviated $SS_R$) equals the total sum of squares; i.e., the model explains all of the observed variance:

$$SS_R = SS_T$$

Figure 4.8: A perfect regression model will pass through all observed data points as shown in (a). Most models are imperfect and do not fit perfectly to all data points as shown in (b).

 
For the perfect model, the regression sum of squares, $SS_R$, equals the total sum of squares, $SS_T$, because all estimated values, $\hat{y}_i$, will equal the corresponding observations, $y_i$. $SS_R$ can be calculated using a relationship similar to the one for obtaining $SS_T$ by replacing $y_i$ by $\hat{y}_i$ in the relationship of $SS_T$. Therefore:

$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \tag{12}$$

The number of degrees of freedom associated with $SS_R$, $dof(SS_R)$, is one.

Based on the preceding discussion of ANOVA, a perfect regression model exists when the fitted regression line passes through all observed points. However, this is not usually the case, as seen in Figure 4.8 (b) or Figure 4.4. In both of these plots, a number of points do not follow the fitted regression line. This indicates that a part of the total variability of the observed data still remains unexplained. This portion of the total variability, or the total sum of squares, that is not explained by the model is called the residual sum of squares or the error sum of squares (abbreviated $SS_E$). The deviation for this sum of squares is obtained at each observation in the form of the residuals, $e_i$. The error sum of squares can be obtained as the sum of squares of these deviations:

$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{13}$$

The number of degrees of freedom associated with $SS_E$, $dof(SS_E)$, is $(n-2)$.

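The total variability of the observed data (i.e., the total sum of squares, $SS_T$) can be written using the portion of the variability explained by the model, $SS_R$, and the portion unexplained by the model, $SS_E$, as:

$$SS_T = SS_R + SS_E \tag{14}$$

The above equation is also referred to as the analysis of variance identity and can be expanded as follows:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{15}$$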
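The analysis of variance identity can be checked numerically on any least squares fit. The sketch below is a minimal illustration with made-up data (an assumption, not Table 4.1); `anova_sums_of_squares` is a hypothetical helper name.

```python
def anova_sums_of_squares(x, y):
    """Return (ss_t, ss_r, ss_e) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    fitted = [beta0_hat + beta1_hat * xi for xi in x]

    ss_t = sum((yi - y_bar) ** 2 for yi in y)                 # total, Eqn. (10)
    ss_r = sum((fi - y_bar) ** 2 for fi in fitted)            # regression, Eqn. (12)
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # error, Eqn. (13)
    return ss_t, ss_r, ss_e

# Hypothetical observations (assumption: NOT the data of Table 4.1)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_t, ss_r, ss_e = anova_sums_of_squares(x, y)

# The identity SS_T = SS_R + SS_E holds up to floating point rounding
print(ss_t, ss_r + ss_e)
```

The identity holds only for the least squares estimates; with any other line through the data the cross term between the two deviations does not vanish.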

 

The deviations for the three sums of squares are shown in Figure 4.9.

Figure 4.9: Scatter plots showing the deviations for the sums of squares used in ANOVA. (a) shows deviations for $SS_T$, (b) shows deviations for $SS_R$, and (c) shows deviations for $SS_E$.

 

Mean Squares

As mentioned previously, mean squares are obtained by dividing the sum of squares by the respective degrees of freedom. For example, the error mean square, $MS_E$, can be obtained as:

$$MS_E = \frac{SS_E}{dof(SS_E)} = \frac{SS_E}{n-2} \tag{16}$$

The error mean square is an estimate of the variance, $\sigma^2$, of the random error term, $\epsilon$, and can be written as:

$$\hat{\sigma}^2 = MS_E = \frac{SS_E}{n-2}$$

Similarly, the regression mean square, $MS_R$, can be obtained by dividing the regression sum of squares by the respective degrees of freedom as follows:

$$MS_R = \frac{SS_R}{dof(SS_R)} = \frac{SS_R}{1}$$

F Test

To test the hypothesis $H_0: \beta_1 = 0$, the statistic used is based on the $F$ distribution. It can be shown that if the null hypothesis $H_0$ is true, then the statistic:

$$F_0 = \frac{MS_R}{MS_E} = \frac{SS_R/1}{SS_E/(n-2)} \tag{17}$$

follows the $F$ distribution with $1$ degree of freedom in the numerator and $(n-2)$ degrees of freedom in the denominator. $H_0$ is rejected if the calculated statistic, $F_0$, is such that:

$$F_0 > f_{\alpha,1,n-2}$$

where $f_{\alpha,1,n-2}$ is the percentile of the $F$ distribution corresponding to a cumulative probability of $(1-\alpha)$ and $\alpha$ is the significance level.
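The $F$ test can be sketched as follows. The helper name `f_statistic`, the data, and the tabulated critical value are assumptions chosen for illustration; they are not the values for Table 4.1 and not DOE++ output.

```python
def f_statistic(x, y):
    """Return (ms_r, ms_e, f0) for the significance of regression test."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    fitted = [beta0_hat + beta1_hat * xi for xi in x]

    ss_r = sum((fi - y_bar) ** 2 for fi in fitted)
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ms_r = ss_r / 1            # regression mean square, one degree of freedom
    ms_e = ss_e / (n - 2)      # error mean square, Eqn. (16)
    return ms_r, ms_e, ms_r / ms_e   # F0 = MS_R / MS_E, Eqn. (17)

# Hypothetical observations (assumption: NOT the data of Table 4.1)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
ms_r, ms_e, f0 = f_statistic(x, y)

# Critical value f_{0.1,1,3} for alpha = 0.1, read from an F table
f_crit = 5.538
reject_h0 = f0 > f_crit
print(ms_r, ms_e, f0, reject_h0)
```

Unlike the two-sided $t$ test, the $F$ test rejects only for large values of the statistic, so a single upper critical value is used.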

 

Example 4.2

 

The analysis of variance approach to test the significance of regression can be applied to the yield data in Table 4.1. To calculate the statistic, $F_0$, for the test, the sums of squares have to be obtained. The sums of squares can be calculated as shown next.

The total sum of squares can be calculated as:

$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

The regression sum of squares can be calculated as:

$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

The error sum of squares can be calculated as:

$$SS_E = SS_T - SS_R = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Knowing the sums of squares, the statistic to test $H_0: \beta_1 = 0$ can be calculated as follows:

$$f_0 = \frac{MS_R}{MS_E} = \frac{SS_R/1}{SS_E/(n-2)}$$

 

The critical value at a significance level of 0.1 is $f_{0.1,1,23} \approx 2.937$. Since $f_0 > f_{0.1,1,23}$, $H_0: \beta_1 = 0$ is rejected and it is concluded that $\beta_1$ is not zero. Alternatively, the $p$ value can also be used. The $p$ value corresponding to the test statistic, $f_0$, based on the $F$ distribution with one degree of freedom in the numerator and 23 degrees of freedom in the denominator is:

$$p \text{ value} = 1 - P(F \leq f_0)$$

Assuming that the desired significance level is 0.1, since the $p$ value < 0.1, $H_0: \beta_1 = 0$ is rejected, implying that a relation does exist between temperature and yield for the data in Table 4.1. Using this result along with the scatter plot of Figure 4.2, it can be concluded that the relationship that exists between temperature and yield is linear. This result is displayed in the ANOVA table as shown in Figure 4.10. Note that this is the same result that was obtained from the $t$ test in Chapter 4, Confidence Interval on Fitted Values.

The ANOVA and Regression Information tables in DOE++ represent two different ways to test for the significance of the regression model. In the case of multiple linear regression models these tables are expanded to allow tests on individual variables used in the model. This is done using extra sum of squares. Multiple linear regression models and the application of extra sum of squares in the analysis of these models are discussed in Chapter 5, Multiple Linear Regression Analysis. The term Partial appearing in Figure 4.10 relates to the extra sum of squares and is also explained in Chapter 5.
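The agreement between the $F$ test and the $t$ test for the significance of regression is expected: in simple linear regression the $F$ statistic equals the square of the $t$ statistic for the slope. A short derivation is given below. The shorthand $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$ is introduced here for brevity, and the derivation relies on the least squares results $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.

$$\begin{aligned}
\hat{y}_i - \bar{y} &= \hat{\beta}_1 (x_i - \bar{x}) \\
SS_R &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}_1^2 S_{xx} \\
se(\hat{\beta}_1)^2 &= \frac{SS_E/(n-2)}{S_{xx}} = \frac{MS_E}{S_{xx}} \\
T_0^2 &= \frac{\hat{\beta}_1^2}{se(\hat{\beta}_1)^2} = \frac{\hat{\beta}_1^2 S_{xx}}{MS_E} = \frac{SS_R/1}{MS_E} = \frac{MS_R}{MS_E} = F_0
\end{aligned}$$

The critical values are related in the same way, $f_{\alpha,1,n-2} = t_{\alpha/2,n-2}^2$, so the two tests always lead to the same decision at a given significance level.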

 

Figure 4.10: ANOVA table for the data in Table 4.1.

 

See Also:

 

Simple Linear Regression Analysis

Confidence Intervals in Simple Linear Regression

Multiple Linear Regression Analysis

Two Level Factorial Experiments