Polynomial regression models are used when the response is curvilinear.
The equation shown next presents a second order polynomial regression
model with one predictor variable:

$$Y = \beta_0 + \beta_1 x_1 + \beta_{11} x_1^2 + \epsilon$$
Usually, coded values are used in these models. Values of the variables
are coded by centering (expressing the levels of the variable as
deviations from the mean value of the variable) and then scaling
(dividing the deviations obtained by half of the range of the variable):

$$\text{coded value} = \frac{\text{actual value} - \text{mean}}{\text{half of range}} \qquad (38)$$
The reason for using coded predictor variables is that $x_1$ and $x_1^2$ are often highly correlated and, if uncoded values are used, there may be computational difficulties in calculating the $(X'X)^{-1}$ matrix needed to obtain the estimates, $\hat{\beta}$, of the regression coefficients using Eqn. (8).
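The effect of coding can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `code_values` and the predictor levels are made up for illustration and do not come from the text. It codes a predictor as described above and compares the correlation between the linear and quadratic terms before and after coding.

```python
import numpy as np

def code_values(x):
    """Center at the mean, then divide by half of the range."""
    x = np.asarray(x, dtype=float)
    half_range = (x.max() - x.min()) / 2.0
    return (x - x.mean()) / half_range

# Hypothetical, evenly spaced predictor levels (illustrative only).
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
x_coded = code_values(x)

# Correlation between the linear and quadratic terms, uncoded vs. coded.
corr_raw = np.corrcoef(x, x**2)[0, 1]
corr_coded = np.corrcoef(x_coded, x_coded**2)[0, 1]

print("coded levels:", x_coded)
print("corr(x, x^2) uncoded:", corr_raw)
print("corr(x, x^2) coded:  ", corr_coded)
```

For levels that are symmetric about their mean, as in this sketch, the coded linear and quadratic terms are essentially uncorrelated. For asymmetric levels the correlation is usually reduced rather than removed.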
The multiple linear regression model also supports the use of qualitative
factors.
For example, gender may need to be included as a factor in a regression
model. One of the ways to include qualitative factors in a regression
model is to employ indicator variables. Indicator variables take
on values of 0 or 1. For example, an indicator variable may be used with
a value of 1 to indicate female and a value of 0 to indicate male.
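In general, ($n-1$) indicator variables are required
to represent a qualitative factor with $n$ levels. As an example, a qualitative
factor representing three types of machines may be represented as follows
using two indicator variables:

$x_1 = 1,\ x_2 = 0$ for machine type I
$x_1 = 0,\ x_2 = 1$ for machine type II
$x_1 = 0,\ x_2 = 0$ for machine type III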
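An alternative coding scheme for this
example is to use a value of -1 for all indicator variables when representing
the last level of the factor:

$x_1 = 1,\ x_2 = 0$ for machine type I
$x_1 = 0,\ x_2 = 1$ for machine type II
$x_1 = -1,\ x_2 = -1$ for machine type III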
Indicator variables are also referred
to as dummy variables or binary variables.
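Both coding schemes can be generated mechanically. The sketch below is illustrative only: the function `indicator_columns`, its parameters, and the sample list of machine types are hypothetical names and data introduced here, not part of DOE++.

```python
import numpy as np

def indicator_columns(levels, categories, last_level_value=0):
    """Code a factor with k categories into k-1 indicator columns.

    The last category is coded as `last_level_value` in every column:
    0 gives the 0/1 scheme, -1 gives the alternative scheme.
    """
    k = len(categories)
    coded = np.zeros((len(levels), k - 1))
    for row, level in enumerate(levels):
        idx = categories.index(level)
        if idx < k - 1:
            coded[row, idx] = 1.0
        else:
            coded[row, :] = last_level_value
    return coded

# Hypothetical observations of a three-level machine-type factor.
machines = ["I", "II", "III", "I"]
categories = ["I", "II", "III"]

zero_one = indicator_columns(machines, categories)
minus_one = indicator_columns(machines, categories, last_level_value=-1)

print(zero_one)
print(minus_one)
```

With the 0/1 scheme, each coefficient measures the difference between a level and the last level. With the -1 scheme, each coefficient measures a deviation from the average over all levels.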
Example 5.7
Consider data from two types of reactors of a chemical process shown in Table 5.3, where the yield values are recorded for various levels of factor $x_1$. Assuming there are no interactions between the reactor type and $x_1$, a regression model can be fitted to this data as shown next.
Since the reactor type is a qualitative factor with two levels, it can
be represented by using one indicator variable. Let $x_2$ be the indicator variable representing
the reactor type, with 0 representing the first type of reactor and 1
representing the second type of reactor.
Data entry in DOE++ for this example is shown in Figure 5.25.
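The regression model for this data is:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

where $x_1$ is the quantitative factor and $x_2$ is the indicator variable for the reactor type.

Table 5.3: Yield data from two types of reactors for a chemical process.

Figure 5.25: Data from Table 5.3 as entered in DOE++.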
The estimated regression coefficients
for the model can be obtained using Eqn. (8) as:
Therefore, the fitted regression model is:
Note that since $x_2$ represents a qualitative predictor variable, the fitted regression model cannot be plotted simultaneously against $x_1$ and $x_2$ in a two dimensional space (because the resulting surface plot will be meaningless for the dimension in $x_2$). To illustrate this, a scatter plot of the data in Table 5.3 against $x_2$ is shown in Figure 5.26. In the case of qualitative factors, the nature of the relationship between the response (yield) and the qualitative factor (reactor type) cannot be categorized as linear, quadratic, cubic, etc. The only conclusion that can be drawn for these factors is whether they contribute significantly to the regression model. This can be determined by employing the partial $F$ test of Chapter 5, Test on Subsets of Regression Coefficients (using the extra sum of squares of the indicator variables representing these factors). The results of the test for the present example are shown in the ANOVA table of Figure 5.27. The results show that $x_2$ (reactor type) contributes significantly to the fitted regression model.
Figure 5.26: Scatter plot of the observed yield values in Table 5.3 against $x_2$ (reactor type).

Figure 5.27: DOE++ results for the data in Table 5.3.
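The partial $F$ test on an indicator variable can be sketched as follows. This is a minimal illustration, assuming NumPy. The data are invented for the sketch and are not the values of Table 5.3, and the helper `sse_and_coefficients` is a hypothetical name. The extra sum of squares is taken as the drop in the error sum of squares when the indicator variable is added to a model that already contains the quantitative factor.

```python
import numpy as np

def sse_and_coefficients(X, y):
    """Least squares fit; returns the error sum of squares and coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), beta

# Made-up data: x1 is a quantitative factor, x2 is a 0/1 reactor indicator.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0])
y = np.array([10.2, 12.1, 13.9, 16.2, 17.8, 15.1, 17.0, 19.2, 20.9, 23.1])

ones = np.ones_like(x1)
X_full = np.column_stack([ones, x1, x2])   # model with the indicator
X_reduced = np.column_stack([ones, x1])    # model without the indicator

sse_full, beta_full = sse_and_coefficients(X_full, y)
sse_reduced, _ = sse_and_coefficients(X_reduced, y)

r = 1                                      # number of coefficients tested
dof_error = len(y) - X_full.shape[1]       # error degrees of freedom, full model
extra_ss = sse_reduced - sse_full          # extra sum of squares due to x2
f0 = (extra_ss / r) / (sse_full / dof_error)

print("coefficients (b0, b1, b2):", beta_full)
print("extra sum of squares:", extra_ss)
print("partial F statistic:", f0)
```

The statistic `f0` would then be compared against the $F$ distribution with `r` and `dof_error` degrees of freedom. A large value suggests that the indicator variable contributes significantly to the model.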
At times the predictor variables included in a multiple linear regression model may be found to be dependent on each other. Multicollinearity is said to exist in a multiple regression model with strong dependencies between the predictor variables.
Multicollinearity affects the regression coefficients and the extra sum of squares of the predictor variables. In a model with multicollinearity, the estimate of the regression coefficient of a predictor variable depends on what other predictor variables are included in the model. The dependence may even lead to a change in the sign of the regression coefficient. In such models, an estimated regression coefficient may not be found to be significant individually (when using the $t$ test on the individual coefficient or looking at the $p$ value) even though a statistical relation is found to exist between the response variable and the set of predictor variables (when using the $F$ test for the set of predictor variables). Therefore, you should be careful when looking at individual predictor variables in models that have multicollinearity. Care should also be taken when looking at the extra sum of squares for a predictor variable that is correlated with other variables, because in models with multicollinearity the extra sum of squares is not unique and depends on the other predictor variables included in the model.
Multicollinearity can be detected using the variance inflation factor
(abbreviated $VIF$). $VIF$ for a coefficient $\beta_j$ is defined as:

$$VIF_j = \frac{1}{1 - R_j^2} \qquad (39)$$

where $R_j^2$ is the coefficient of multiple determination
resulting from regressing the $j$th predictor variable, $x_j$, on the remaining $k-1$ predictor variables. Mean values
of $VIF$ considerably greater than 1 indicate
multicollinearity problems.
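The definition above translates directly into a loop over the predictors. The following is a sketch under stated assumptions: it uses NumPy, the function name `variance_inflation_factors` is introduced here for illustration, and the three predictor columns are made-up values in which the second column is roughly twice the first.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors, with an intercept.
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        ss_error = resid @ resid
        ss_total = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_error / ss_total
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Hypothetical predictors: the second column is close to 2 * the first.
X = np.column_stack([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
    [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
])
vifs = variance_inflation_factors(X)
print("VIF per predictor:", vifs)
```

In this sketch the two nearly proportional columns receive large variance inflation factors. If one predictor were an exact linear combination of the others, $R_j^2$ would equal 1 and the division would fail, so real use would need a guard for that case.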
A few methods of dealing with multicollinearity include increasing the number of observations in a way designed to break up dependencies among predictor variables, combining the linearly dependent predictor variables into one variable, eliminating unimportant variables from the model, or using coded variables.
Example 5.8
Variance inflation factors can be obtained for the data in Table 5.1.
To calculate the variance inflation factor for $x_1$, $R_1^2$ has to be calculated. $R_1^2$ is the coefficient of determination
for the model when $x_1$ is regressed on the remaining variables.
In the case of this example there is just one remaining variable, which
is $x_2$. If a regression model is fit to the
data, taking $x_1$ as the response variable and $x_2$ as the predictor variable, then the
design matrix, $X$, consists of a column of ones and a column of the $x_2$ values, and the vector of observations, $y$, contains the $x_1$ values.
The regression sum of squares for this model can be obtained using Eqn.
(17) as:

$$SS_R = y'\left[H - \frac{1}{n}J\right]y$$

where $H$ is the hat matrix (calculated
using $H = X(X'X)^{-1}X'$), $J$ is the matrix of ones and $n$ is the number of observations. The total sum
of squares for the model can be calculated using Eqn. (31)
as:

$$SS_T = y'\left[I - \frac{1}{n}J\right]y$$

where $I$ is the identity matrix. Therefore:

$$R_1^2 = \frac{SS_R}{SS_T}$$

Then the variance inflation factor
for $x_1$ is:

$$VIF_1 = \frac{1}{1 - R_1^2}$$

The variance inflation factor for
$x_2$, $VIF_2$, can be obtained in a similar manner.
In DOE++, the variance inflation factors are displayed in the VIF column
of the Regression Information Table as shown in Figure 5.28.
Since the values of the variance inflation factors obtained are considerably
greater than 1, multicollinearity is an issue for the data in Table 5.1.
Figure 5.28: Variance inflation factors for the data in Table 5.1.
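The hat matrix calculation used in this example can be sketched as follows. The values of `x1` and `x2` are invented for illustration and are not the data of Table 5.1; the helper `r_squared_via_hat_matrix` is likewise a hypothetical name. The sketch assumes NumPy and forms the hat matrix, the matrix of ones and the identity matrix explicitly, which is only practical for small data sets.

```python
import numpy as np

def r_squared_via_hat_matrix(response, predictor):
    """R^2 from regressing `response` on `predictor`, via SS_R / SS_T."""
    n = len(response)
    X = np.column_stack([np.ones(n), predictor])   # design matrix
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
    J = np.ones((n, n))                            # matrix of ones
    I = np.eye(n)                                  # identity matrix
    ss_r = response @ (H - J / n) @ response       # regression sum of squares
    ss_t = response @ (I - J / n) @ response       # total sum of squares
    return ss_r / ss_t

# Hypothetical predictor values (illustrative only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])

r2_1 = r_squared_via_hat_matrix(x1, x2)   # x1 regressed on x2
r2_2 = r_squared_via_hat_matrix(x2, x1)   # x2 regressed on x1

vif_1 = 1.0 / (1.0 - r2_1)
vif_2 = 1.0 / (1.0 - r2_2)
print("VIF_1:", vif_1, "VIF_2:", vif_2)
```

With only two predictors, both coefficients of determination equal the squared correlation between the two variables, so the two variance inflation factors coincide.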