In Chapters 4 and 5, methods were presented to model the relationship between a response and the associated factors (referred to as predictor variables in the context of regression) based on an observed data set. Studies of this kind, in which observed values of the response are used to establish an association between the response and the factors, are called observational studies. In observational studies it is difficult to establish a cause-and-effect relationship between the observed factors and the response, because a number of alternative explanations can account for the observed change in the response values. For example, a regression model fitted to data on the population of cities and road accidents might show a positive relation. However, this relation does not imply that an increase in a city's population causes an increase in road accidents. Other factors, such as road conditions, traffic control and the degree to which residents follow the traffic rules, may affect the number of road accidents, and the increase in accidents seen in the study may be caused by these factors. Since the observational study does not take the effect of these factors into account, the conclusion that an increase in a city's population will lead to an increase in road accidents is not valid. For example, the population of a city may increase while road accidents decrease because of better traffic control. To establish a cause-and-effect relationship, the study should be conducted in such a way that the effect of all other factors is excluded from the investigation.
The studies that enable the establishment of a cause-and-effect relationship are called experiments. In experiments the response is investigated by studying only the effect of the factor(s) of interest and excluding all other effects that may provide alternative justifications to the observed change in response. This is done in two ways. First, the levels of the factors to be investigated are carefully selected and then strictly controlled during the execution of the experiment. The aspect of selecting what factor levels should be investigated in the experiment is called the design of the experiment. The second distinguishing feature of experiments is that observations in an experiment are recorded in a random order. By doing this, it is hoped that the effect of all other factors not being investigated in the experiment will get cancelled out so that the change in the response is the result of only the investigated factors. Using these two techniques, experiments tend to ensure that alternative justifications to observed changes in the response are voided, thereby enabling the establishment of a cause-and-effect relationship between the response and the investigated factors.
The aspect of recording observations in an experiment in a random order is referred to as randomization. Specifically, randomization is the process of assigning the various levels of the investigated factors to the experimental units in a random fashion. An experiment is said to be completely randomized if the probability of an experimental unit being subjected to any level of a factor is equal for all the experimental units. The importance of randomization can be illustrated using an example. Consider an experiment where the effect of the speed of a lathe machine on the surface finish of a product is being investigated. In order to save time, the experimenter records surface finish values by running the lathe machine continuously and recording observations in the order of increasing speeds. The analysis of the experiment data shows that an increase in lathe speed causes a decrease in the quality of surface finish. However, the results of the experiment are disputed by the lathe operator, who claims that he has been able to obtain better surface finish quality by operating the lathe machine at higher speeds. It is later found that the faulty results were caused by overheating of the tool used in the machine. Since the lathe was run continuously in the order of increasing speeds, the observations were also recorded in the order of increasing tool temperatures, so the effect of speed could not be separated from the effect of tool temperature. This problem could have been avoided if the experimenter had randomized the experiment and taken readings at the various lathe speeds in a random order. This would require the experimenter to stop and restart the machine at every observation, thereby keeping the temperature of the tool within a reasonable range. Randomization would have ensured that the effect of heating of the machine tool was not confused with the effect of speed.
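The randomization step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used by any particular software: the function name `randomized_run_order`, the speed values and the number of replicates are assumptions chosen for the example.

```python
import random


def randomized_run_order(levels, replicates, seed=None):
    """Return every (level, replicate) run as a list of levels in random order.

    `levels` and `replicates` are hypothetical inputs for illustration.
    """
    # One entry per run: each level appears `replicates` times.
    runs = [level for level in levels for _ in range(replicates)]
    # A dedicated generator keeps the shuffle reproducible for a given seed.
    rng = random.Random(seed)
    rng.shuffle(runs)
    return runs


if __name__ == "__main__":
    # Assumed example: three lathe speeds, four runs at each speed.
    order = randomized_run_order([500, 600, 700], replicates=4, seed=1)
    for run_number, speed in enumerate(order, start=1):
        print(run_number, speed)
```

Running the experiment in the printed order, rather than in order of increasing speed, spreads any slow drift (such as tool heating) across all speeds instead of aligning it with one of them.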
As explained in Chapters 4 and 5, the analysis of observational studies involves the use of regression models. The analysis of experimental studies involves the use of analysis of variance (ANOVA) models. For a comparison of the two models, see Fitting ANOVA Models. In single factor experiments, ANOVA models are used to compare the mean response values at different levels of the factor. Each level of the factor is investigated to see if the response is significantly different from the response at other levels of the factor. The analysis of single factor experiments is often referred to as one-way ANOVA.
To illustrate the use of ANOVA models in the analysis of experiments, consider a single factor experiment where the analyst wants to see if the surface finish of certain parts is affected by the speed of a lathe machine. Data is collected for three speeds (or three treatments). Each treatment is replicated four times, so every treatment has the same number of observations and the experimental design is balanced. Surface finish values recorded using randomization are shown in Table 6.1.
The ANOVA model for this experiment can be stated as follows:

$$Y_{ij} = \mu_i + \epsilon_{ij} \qquad (1)$$

The ANOVA model assumes that the response at each factor level, $i$, is the sum of the mean response at the $i$th level, $\mu_i$, and a random error term, $\epsilon_{ij}$. The subscript $i$ denotes the factor level while the subscript $j$ denotes the replicate. If there are $n_a$ levels of the factor and $m$ replicates at each level then $i = 1, 2, \ldots, n_a$ and $j = 1, 2, \ldots, m$. The random error terms, $\epsilon_{ij}$, are assumed to be normally and independently distributed with a mean of zero and variance of $\sigma^2$. Therefore, the response at each level can be thought of as a normally distributed population with a mean of $\mu_i$ and constant variance of $\sigma^2$. Eqn. (1) is referred to as the means model.
The ANOVA model of Eqn. (1) can also be written using $\mu_i = \mu + \tau_i$, where $\mu$ represents the overall mean and $\tau_i$ represents the effect due to the $i$th treatment:

$$Y_{ij} = \mu + \tau_i + \epsilon_{ij} \qquad (2)$$

Such an ANOVA model is called the effects model. In the effects model the treatment effects, $\tau_i$, represent the deviations from the overall mean, $\mu$. Therefore, the following constraint exists on the $\tau_i$s:

$$\sum_{i=1}^{n_a} \tau_i = 0 \qquad (3)$$
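As a rough numerical illustration of the effects model, the sketch below estimates the overall mean and the treatment effects from level means and shows that the estimated effects sum to zero. The response values are made up for the example and are not the data of Table 6.1; the function name `effects_from_data` is likewise an assumption.

```python
from statistics import mean


def effects_from_data(data):
    """Estimate the overall mean and treatment effects for a balanced design.

    `data` maps each factor level to its list of replicate responses.
    """
    level_means = {level: mean(values) for level, values in data.items()}
    # For a balanced design the overall mean equals the mean of the level means.
    overall_mean = mean(list(level_means.values()))
    # Each effect is the deviation of a level mean from the overall mean.
    effects = {level: m - overall_mean for level, m in level_means.items()}
    return overall_mean, level_means, effects


if __name__ == "__main__":
    # Hypothetical surface finish values (NOT the values of Table 6.1).
    example = {
        500: [10.0, 12.0, 11.0, 11.0],
        600: [14.0, 15.0, 13.0, 14.0],
        700: [18.0, 17.0, 19.0, 18.0],
    }
    overall, means, effects = effects_from_data(example)
    print("overall mean:", overall)
    print("effects:", effects)
    print("sum of effects:", sum(effects.values()))
```

Because each estimated effect is a deviation from the overall mean, the effects cancel when summed, up to floating-point rounding.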
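To fit ANOVA models and carry out hypothesis testing in single factor experiments, it is convenient to express the effects model of Eqn. (2) in the form $y = X\beta + \epsilon$ (that was used for multiple linear regression models in Chapter 5). This can be done as shown next. Using Eqn. (2), the ANOVA model for the single factor experiment in Table 6.1 can be expressed as:

$$Y = \mu + \tau_i + \epsilon \qquad (4)$$

where $\mu$ represents the overall mean and $\tau_i$ represents the $i$th treatment effect. There are three treatments in Table 6.1 (500, 600 and 700). Therefore, there are three treatment effects, $\tau_1$, $\tau_2$ and $\tau_3$. Per Eqn. (3), the following constraint exists for these effects:

$$\tau_1 + \tau_2 + \tau_3 = 0 \qquad (5)$$

For the first treatment, Eqn. (4) can be written as:

$$Y = \mu + \tau_1 + \epsilon$$

Using $\tau_3 = -(\tau_1 + \tau_2)$ from Eqn. (5), the model for the first treatment is:

$$Y = \mu + 1 \cdot \tau_1 + 0 \cdot \tau_2 + \epsilon \qquad (6)$$

Models for the second and third treatments can be obtained in a similar way. The models for the three treatments are:

$$\begin{aligned}
\text{First treatment:} \quad & Y = \mu + 1 \cdot \tau_1 + 0 \cdot \tau_2 + \epsilon & (7) \\
\text{Second treatment:} \quad & Y = \mu + 0 \cdot \tau_1 + 1 \cdot \tau_2 + \epsilon & (8) \\
\text{Third treatment:} \quad & Y = \mu - 1 \cdot \tau_1 - 1 \cdot \tau_2 + \epsilon & (9)
\end{aligned}$$

The coefficients of the treatment effects $\tau_1$ and $\tau_2$ in Eqns. (7) to (9) can be expressed using two indicator variables, $x_1$ and $x_2$, as follows:

$$\begin{aligned}
\text{Treatment 1 (500):} \quad & x_1 = 1, & x_2 = 0 \\
\text{Treatment 2 (600):} \quad & x_1 = 0, & x_2 = 1 \\
\text{Treatment 3 (700):} \quad & x_1 = -1, & x_2 = -1
\end{aligned}$$

Using the indicator variables $x_1$ and $x_2$, the ANOVA model of Eqn. (4) for the data in Table 6.1 now becomes:

$$Y = \mu + x_1 \cdot \tau_1 + x_2 \cdot \tau_2 + \epsilon$$

The equation can be rewritten by including subscripts $i$ (for the level of the factor) and $j$ (for the replicate number) as:

$$Y_{ij} = \mu + x_{i1} \cdot \tau_1 + x_{i2} \cdot \tau_2 + \epsilon_{ij} \qquad (10)$$

Eqn. (10) represents the "regression version" of the ANOVA model.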
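The indicator-variable coding behind the regression version of the model can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `effect_coding` and `design_matrix` are hypothetical, levels are indexed from zero, the last level is the one coded as -1 in every indicator column, and rows are grouped by level.

```python
def effect_coding(level_index, n_levels):
    """Return the n_levels - 1 indicator values for a 0-based level index.

    The last level is coded -1 in every column, which enforces the
    constraint that the treatment effects sum to zero.
    """
    if level_index == n_levels - 1:
        return [-1] * (n_levels - 1)
    return [1 if k == level_index else 0 for k in range(n_levels - 1)]


def design_matrix(n_levels, replicates):
    """Build the rows [1, x_1, ..., x_{n_levels-1}], grouped by level."""
    rows = []
    for i in range(n_levels):
        for _ in range(replicates):
            # Leading 1 multiplies the overall mean; the rest multiply the effects.
            rows.append([1] + effect_coding(i, n_levels))
    return rows


if __name__ == "__main__":
    # Three levels with four replicates each gives a 12 x 3 matrix.
    for row in design_matrix(n_levels=3, replicates=4):
        print(row)
```

With three levels, the rows for the first, second and third level carry the indicator pairs (1, 0), (0, 1) and (-1, -1) respectively, and in a balanced design each indicator column sums to zero.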
It can be seen from Eqn. (10) that in an ANOVA model each factor is treated as a qualitative factor. In the present example the factor, lathe speed, is a quantitative factor with three levels. But the ANOVA model treats this factor as a qualitative factor with three levels. Therefore, two indicator variables, $x_1$ and $x_2$, are required to represent this factor. Note that if a regression model were to be fitted to the data in Table 6.1, a distinction would be made between quantitative and qualitative factors. For regression models, the factor, lathe speed, would be used as a quantitative factor and represented with a single predictor variable, denoted here by $x$. For example, if a first order regression model were to be fitted to the data in Table 6.1, then the regression model would take the form $Y = \beta_0 + \beta_1 x + \epsilon$. If a second order regression model were to be fitted, the regression model would be $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$. Notice that unlike these regression models, the ANOVA model of Eqn. (10) does not make any assumption about the nature of the relationship between the response and the factor being investigated.
The choice between the two models for a particular data set depends on the objective of the experimenter. In the case of the data of Table 6.1, the objective of the experimenter is to compare the levels of the factor to see if a change in the levels leads to a significant change in the response. The objective is not to make predictions of the response for a given level of the factor. Therefore, the ANOVA model is used in this case. If the objective of the experimenter were prediction or optimization, the experimenter would use a regression model and focus on aspects such as the nature of the relationship between the factor, lathe speed, and the response, surface finish, so that the regression model obtained is able to make accurate predictions.
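The ANOVA model of Eqn. (10) can be expanded for the three treatments and four replicates of the data in Table 6.1 as follows, with $j = 1, 2, 3, 4$ denoting the replicate:

$$\begin{aligned}
Y_{1j} &= \mu + 1 \cdot \tau_1 + 0 \cdot \tau_2 + \epsilon_{1j} \\
Y_{2j} &= \mu + 0 \cdot \tau_1 + 1 \cdot \tau_2 + \epsilon_{2j} \\
Y_{3j} &= \mu - 1 \cdot \tau_1 - 1 \cdot \tau_2 + \epsilon_{3j}
\end{aligned}$$

The corresponding matrix notation, with the observations grouped by level, is:

$$y = X\beta + \epsilon$$

where

$$y = \begin{bmatrix} Y_{11} \\ Y_{12} \\ Y_{13} \\ Y_{14} \\ Y_{21} \\ Y_{22} \\ Y_{23} \\ Y_{24} \\ Y_{31} \\ Y_{32} \\ Y_{33} \\ Y_{34} \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & -1 & -1 \\ 1 & -1 & -1 \\ 1 & -1 & -1 \\ 1 & -1 & -1 \end{bmatrix}, \quad
\beta = \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \end{bmatrix}, \quad
\epsilon = \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{13} \\ \epsilon_{14} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{23} \\ \epsilon_{24} \\ \epsilon_{31} \\ \epsilon_{32} \\ \epsilon_{33} \\ \epsilon_{34} \end{bmatrix}$$

The matrices $y$, $X$ and $\beta$ are used in the calculation of the sum of squares in the next section. The data in Table 6.1 can be entered into DOE++ as shown in Figure 6.1.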
The hypothesis test in single factor experiments examines the ANOVA model to see if the response at any level of the investigated factor is significantly different from that at the other levels. If this is not the case and the response at all levels is not significantly different, then it can be concluded that the investigated factor does not affect the response. The test on the ANOVA model is carried out by checking to see if any of the treatment effects, $\tau_i$, are non-zero. The test is similar to the test of significance of regression mentioned in Chapters 4 and 5 in the context of regression models. The hypothesis statements for this test are:

$$\begin{aligned}
H_0 &: \tau_1 = \tau_2 = \cdots = \tau_{n_a} = 0 \\
H_1 &: \tau_i \neq 0 \text{ for at least one } i
\end{aligned}$$

The test for $H_0$ is carried out using the following statistic:

$$F_0 = \frac{MS_{TR}}{MS_E} \qquad (11)$$

where $MS_{TR}$ represents the mean square for the ANOVA model and $MS_E$ is the error mean square. Note that in the case of ANOVA models we use the notation $MS_{TR}$ (treatment mean square) for the model mean square and $SS_{TR}$ (treatment sum of squares) for the model sum of squares (instead of $MS_R$, regression mean square, and $SS_R$, regression sum of squares, used in Chapters 4 and 5). This is done to indicate that the model under consideration is the ANOVA model and not the regression model. The calculations to obtain $MS_{TR}$ and $SS_{TR}$ are identical to the calculations to obtain $MS_R$ and $SS_R$ explained in Chapter 5.
The sum of squares to obtain the statistic $F_0$ can be calculated as explained in Chapter 5. Using the data in Table 6.1, the model sum of squares, $SS_{TR}$, can be calculated as:

$$SS_{TR} = y' \left[ H - \frac{1}{n_a \cdot m} J \right] y$$

In the previous equation, $n_a$ represents the number of levels of the factor, $m$ represents the replicates at each level, $y$ represents the vector of the response values, $H$ represents the hat matrix and $J$ represents the matrix of ones. (For details on each of these terms, refer to Chapter 5.)
Since two effect terms, $\tau_1$ and $\tau_2$, are used in the model of Eqn. (10), the degrees of freedom associated with the model sum of squares, $SS_{TR}$, is two. The total sum of squares, $SS_T$, can be obtained as follows:

$$SS_T = y' \left[ I - \frac{1}{n_a \cdot m} J \right] y$$

In the previous equation, $I$ is the identity matrix. Since there are 12 data points in all, the number of degrees of freedom associated with $SS_T$ is 11. Knowing $SS_T$ and $SS_{TR}$, the error sum of squares is:

$$SS_E = SS_T - SS_{TR}$$

The number of degrees of freedom associated with $SS_E$ is:

$$dof(SS_E) = dof(SS_T) - dof(SS_{TR}) = 11 - 2 = 9$$

The test statistic can now be calculated using Eqn. (11) as:

$$f_0 = \frac{MS_{TR}}{MS_E} = \frac{SS_{TR}/2}{SS_E/9}$$

The $p$ value corresponding to this statistic, based on the $F$ distribution with 2 degrees of freedom in the numerator and 9 degrees of freedom in the denominator, is:

$$p \text{ value} = 1 - P(F \le f_0)$$
Assuming that the desired significance level is 0.1, since $p$ value < 0.1, $H_0$ is rejected and it is concluded that a change in the lathe speed has a significant effect on the surface finish. DOE++ displays these results in the ANOVA table, as shown in Figure 6.2. The values of S and R-sq are the standard error and the coefficient of determination for the model, respectively. These values are explained in Chapter 5 and indicate how well the model fits the data. The values in Figure 6.2 indicate that the fit of the ANOVA model is fair.