
Analysis of Variance
[Editor's Note: This article has been updated since its original publication to reflect a more recent version of the software interface.]
Analysis of variance, or ANOVA, is a powerful statistical technique that involves partitioning the observed variance into different components to conduct various significance tests. This article discusses the application of ANOVA to a data set that contains one independent variable and explains how ANOVA can be used to examine whether a linear relationship exists between a dependent variable and an independent variable.
Sum of Squares and Mean
Squares
The total variance of an observed data set can be estimated
using the following relationship:
where:
 s is the standard deviation.
 y_{i} is the ith observation.
 n is the number of observations.
 is the mean of the n observations.
The quantity in the numerator of the previous equation is called the sum of squares. It is the sum of the squares of the deviations of all the observations, y_{i}, from their mean, . In the context of ANOVA, this quantity is called the total sum of squares (abbreviated SS_{T}) because it relates to the total variance of the observations. Thus:
The denominator in the relationship of the sample variance is the number of degrees of freedom associated with the sample variance. Therefore, the number of degrees of freedom associated with SS_{T}, dof(SS_{T}), is (n1). The sample variance is also referred to as a mean square because it is obtained by dividing the sum of squares by the respective degrees of freedom. Therefore, the total mean square (abbreviated MS_{T}) is:
When you attempt to fit a model to the observations, you are trying to explain some of the variation of the observations using this model. For the case of simple linear regression, this model is a line. In other words, you would be trying to see if the relationship between the independent variable and the dependent variable is a straight line. If the model is such that the resulting line passes through all of the observations, then you would have a "perfect" model, as shown in Figure 1.
Figure 1: Perfect Model Passing
Through All Observed Data Points
The model explains all of the variability of the observations. Therefore, in this case, the model sum of squares (abbreviated SS_{R}) equals the total sum of squares:
For the perfect model, the model sum of squares, SS_{R}, equals the total sum of squares, SS_{T}, because all estimated values obtained using the model, , will equal the corresponding observations, y_{i}.
The model sum of squares, SS_{R}, can be calculated using a relationship similar to the one used to obtain SS_{T}. For SS_{R}, we simply replace the y_{i} in the relationship of SS_{T} with :
The number of degrees of freedom associated with SS_{R}, dof(SS_{R}), is 1. (For details, click here.)
Therefore, the model mean square, MS_{R}, is:
Figure 2 shows a case where the model is not a perfect model.
Figure 2: Most Models Do Not
Fit All Data Points Perfectly
You can see that a number of observed data points do not follow the fitted line. This indicates that a part of the total variability of the observed data still remains unexplained. This portion of the total variability, or the total sum of squares that is not explained by the model, is called the residual sum of squares or the error sum of squares (abbreviated SS_{E}). The deviation for this sum of squares is obtained at each observation in the form of the residuals, e_{i}:
The error sum of squares can be obtained as the sum of squares of these deviations:
The number of degrees of freedom associated with SS_{E}, dof(SS_{E}), is (n2). (For details, click here.)
Therefore the residual or error mean square, MS_{E}, is:
Analysis of Variance
Identity
The total variability of the observed data (i.e., the
total sum of squares, SS_{T}) can be written
using the portion of the variability explained by the model,
SS_{R}, and the portion unexplained by the model,
SS_{E}, as:
The above equation is referred to as the analysis of variance identity.
F Test
To test if a relationship exists between the dependent and
independent variable, a statistic based on the F
distribution is used. (For details,
click here.) The statistic is a ratio of the model mean
square and the residual mean square.
For simple linear regression, the statistic follows the F distribution with 1 degree of freedom in the numerator and (n2) degrees of freedom in the denominator.
Example
Table 1 shows the observed yield data obtained at various
temperature settings of a chemical process. We can analyze this
data set using ANOVA to determine if a linear relationship
exists between the independent variable, temperature, and the
dependent variable, yield.
Table 1: Yield Data Observations of a Chemical Process at
Different Values of Reaction Temperature
The parameters of the assumed linear model are obtained using least square estimation. (For details, click here.) These parameters are then used to obtain the estimated values, . The model sum of squares for this model can be obtained as follows:
The corresponding number of degrees of freedom for SS_{R} for the present data set is 1.
The residual sum of squares can be obtained as follows:
The corresponding number of degrees of freedom for SS_{E} for the present data set, having 25 observations, is n2 = 252 = 23.
The F statistic can be obtained as follows:
The P value corresponding to this statistic, based on the F distribution with 1 degree of freedom in the numerator and 23 degrees of freedom in the denominator, is 4.17E22. In this context, the P value is the probability that an equal amount of variation in the dependent variable would be observed in the case that the independent variable does not affect the dependent variable. (For more details about P values, click here.) Since this value is very small, we can conclude that a linear relationship exists between the dependent variable, yield, and the independent variable, temperature.
DOE++
The above analysis can be easily carried out in ReliaSoft's
DOE++ software using
the Multiple Linear Regression Tool. Figure 3 shows the data
from Table 1 entered into DOE++ and Figure 3 shows the
results obtained from DOE++. You can see that the results
shown in Figure 4 match the calculations shown previously and
indicate that a linear relationship does exist between yield and
temperature.
Figure 3: Data Entry in DOE++
for the Observations in Table 1
Figure 4: ANOVA Table for the Data in Table 1
References
[1] ReliaSoft Corporation, Experiment Design
and Analysis Reference, Tucson, AZ: ReliaSoft Publishing,
2008.
[2] Montgomery, D., Design and Analysis of
Experiments, 5th edition, 2001, New York: John Wiley & Sons,
2001.
[3] Kutner, Michael H., Nachtsheim, Christopher
J., Neter, John, and Li, William, Applied Linear Statistical
Models, New York: McGrawHill/Irwin, 2005.