Simple Linear Regression Analysis

Regression analysis is a statistical technique that explores and models the relationship between two or more variables. For example, an analyst may want to know whether there is a relationship between road accidents and the age of the driver. Regression analysis forms an important part of the statistical analysis of data obtained from designed experiments and is discussed briefly in this chapter. Every experiment analyzed in DOE++ includes regression results for each of the responses. These results, along with the results from the analysis of variance (explained in our "Analysis of Experiments" discussion), help identify the significant factors in an experiment and explore the nature of the relationship between these factors and the response.

Regression analysis also forms the basis for all DOE++ calculations related to the sum of squares used in the analysis of variance. The reason for this is explained in the last section of Chapter 6, Use of Regression to Calculate Sum of Squares. In addition, DOE++ includes a regression tool to check whether two or more variables are related and to explore the nature of the relationship between them. This chapter discusses simple linear regression analysis, while Chapter 5 focuses on multiple linear regression analysis.

 

A linear regression model attempts to explain the relationship between two or more variables using a straight line. Consider the data obtained from a chemical process where the yield of the process is thought to be related to the reaction temperature (see Table 4.1). This data can be entered in DOE++ as shown in Figure 4.1, and a scatter plot can be obtained as shown in Figure 4.2. In the scatter plot, yield, $y_i$, is plotted for different temperature values, $x_i$. It is clear that no line can be found that passes through all points of the plot. Thus no functional relation exists between the two variables $x$ and $Y$. However, the scatter plot does indicate that a straight line may exist such that all the points on the plot are scattered randomly around it. A statistical relation is said to exist in this case. The statistical relation between $x$ and $Y$ may be expressed as follows:

$$Y = \beta_0 + \beta_1 x + \epsilon \tag{1}$$

 

 

Table 4.1: Yield data observations of a chemical process at different values of reaction temperature.

 

Figure 4.1: Data entry in DOE++ for the observations in Table 4.1.

 

Figure 4.2: Scatter plot for the data in Table 4.1.

 
Eqn. (1) is the linear regression model that can be used to explain the relation between $x$ and $Y$ that is seen on the scatter plot above. In this model, the mean value of $Y$ (abbreviated as $E(Y)$) is assumed to follow the linear relation:

$$E(Y) = \beta_0 + \beta_1 x$$

The actual values of $Y$ (which are observed as yield from the chemical process from time to time and are random in nature) are assumed to be the sum of the mean value, $E(Y)$, and a random error term, $\epsilon$:

$$Y = E(Y) + \epsilon = \beta_0 + \beta_1 x + \epsilon$$

 

The regression model here is called a simple linear regression model because there is just one independent variable, $x$, in the model. In regression models, the independent variables are also referred to as regressors or predictor variables. The dependent variable, $Y$, is also referred to as the response. The slope, $\beta_1$, and the intercept, $\beta_0$, of the line are called regression coefficients. The slope, $\beta_1$, can be interpreted as the change in the mean value of $Y$ for a unit change in $x$.
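The interpretation of the slope can be checked directly from the model for the mean response. The short derivation below is a sketch that assumes the notation $E(Y \mid x)$ for the mean value of the response at a given predictor value; that conditional notation is introduced here only for illustration.

```latex
\begin{align*}
E(Y \mid x + 1) - E(Y \mid x)
  &= \left[\beta_0 + \beta_1 (x + 1)\right] - \left[\beta_0 + \beta_1 x\right] \\
  &= \beta_1
\end{align*}
```

Under the same notation, setting $x = 0$ gives $E(Y \mid 0) = \beta_0$, so the intercept is the mean response at $x = 0$, provided that value lies within the range where the linear model is reasonable.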

 

The random error term, $\epsilon$, is assumed to follow the normal distribution with a mean of 0 and variance of $\sigma^2$. Since $Y$ is the sum of this random term and the mean value, $E(Y)$ (which is a constant), the variance of $Y$ at any given value of $x$ is also $\sigma^2$. Therefore, at any given value of $x$, say $x_i$, the dependent variable $Y$ follows a normal distribution with a mean of $\beta_0 + \beta_1 x_i$ and a standard deviation of $\sigma$. This is illustrated in the following figure.

 

Figure 4.3: The normal distribution of $Y$ for two values of $x$. Also shown is the true regression line and the values of the random error term, $\epsilon$, corresponding to the two $x$ values. The true regression line and $\epsilon$ are usually not known.
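The distributional assumption can be illustrated with a small simulation. The sketch below is not part of DOE++; the function name `simulate_responses` and the parameter values are hypothetical and chosen only for illustration. It draws many responses at two predictor values and compares the sample mean and standard deviation with the intercept-plus-slope mean and the error standard deviation.

```python
import random
import statistics


def simulate_responses(beta0, beta1, sigma, x, n, seed=0):
    """Draw n responses Y = beta0 + beta1*x + eps, with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    mean_y = beta0 + beta1 * x  # E(Y) at this x; a constant
    return [mean_y + rng.gauss(0.0, sigma) for _ in range(n)]


# Hypothetical "true" parameters (assumptions, not values from Table 4.1).
BETA0, BETA1, SIGMA = 10.0, 2.0, 3.0

for x in (50.0, 80.0):
    ys = simulate_responses(BETA0, BETA1, SIGMA, x, n=20000)
    # The sample mean should be close to BETA0 + BETA1*x,
    # and the sample standard deviation close to SIGMA, at both x values.
    print(x, statistics.mean(ys), statistics.stdev(ys))
```

In practice the true coefficients and the error variance are unknown, which is why they have to be estimated from the observed data.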

 

Fitted Regression Line

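The true regression line corresponding to Eqn. (1) is usually not known. However, the regression line can be estimated by estimating the coefficients $\beta_0$ and $\beta_1$ for an observed data set. The estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, are calculated using least squares. (For details on least squares estimates, refer to [19].) The estimated regression line, obtained using the values of $\hat{\beta}_0$ and $\hat{\beta}_1$, is called the fitted line. The least squares estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, are obtained using the following equations:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{3}$$

where $\bar{y}$ is the mean of all the observed values and $\bar{x}$ is the mean of all values of the predictor variable at which the observations were taken. $\bar{y}$ is calculated using $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x}$ is calculated using $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.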
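The least squares calculation can be sketched in a few lines of Python. This is a minimal illustration, not the DOE++ implementation: the function name `least_squares_fit` and the five data points are hypothetical (they are not the observations of Table 4.1). The slope is computed as the sum of cross-products, corrected for the means, divided by the sum of squared deviations of the predictor; the intercept then follows from the two sample means.

```python
def least_squares_fit(x, y):
    """Return (beta0_hat, beta1_hat) for the line y = beta0 + beta1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Numerator: sum(y_i * x_i) - (sum(y_i) * sum(x_i)) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(y) * sum(x) / n
    # Denominator: sum((x_i - x_bar)^2)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Hypothetical temperature/yield pairs, for illustration only.
temps = [50, 60, 70, 80, 90]
yields = [122, 139, 157, 181, 196]

b0, b1 = least_squares_fit(temps, yields)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

The guard on the denominator reflects a practical point: if every observation is taken at the same predictor value, no slope can be estimated.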

 

Once $\hat{\beta}_0$ and $\hat{\beta}_1$ are known, the fitted regression line can be written as:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \tag{4}$$

where $\hat{y}$ is the fitted or estimated value based on the fitted regression model. It is an estimate of the mean value, $E(Y)$. The fitted value, $\hat{y}_i$, for a given value of the predictor variable, $x_i$, may be different from the corresponding observed value, $y_i$. The difference between the two values is called the residual, $e_i$:

$$e_i = y_i - \hat{y}_i \tag{5}$$
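Fitted values and residuals follow mechanically from the estimated coefficients. The sketch below assumes hypothetical data and hypothetical coefficient estimates (an intercept of 26.0 and a slope of 1.9, which are not the estimates for Table 4.1), and the helper name `fitted_and_residuals` is likewise made up for illustration. Each residual is the observed value minus the fitted value.

```python
def fitted_and_residuals(x, y, beta0_hat, beta1_hat):
    """Return (fitted values, residuals) for a fitted simple linear model."""
    fitted = [beta0_hat + beta1_hat * xi for xi in x]
    # Residual = observed response minus fitted response.
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


# Hypothetical data and coefficient estimates, for illustration only.
temps = [50, 60, 70, 80, 90]
yields = [122, 139, 157, 181, 196]
fitted, residuals = fitted_and_residuals(temps, yields, 26.0, 1.9)

for xi, yi, fi, ei in zip(temps, yields, fitted, residuals):
    print(f"x={xi}  observed={yi}  fitted={fi:.2f}  residual={ei:+.2f}")
```

When the coefficients are the least squares estimates for the same data and the model includes an intercept, the residuals sum to zero up to rounding, which is a quick sanity check on a hand calculation.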

 

Calculation of the Fitted Line Using Least Square Estimates

The least squares estimates of the regression coefficients can be obtained for the data in Table 4.1 using Eqns. (2) and (3) as follows: MATH

Knowing $\hat{\beta}_0$ and $\hat{\beta}_1$, the fitted regression line is: MATH

 

This line is shown in Figure 4.4.

 

Figure 4.4: Fitted regression line for the data in Table 4.1. Also shown is the residual for the 21st observation.

 

Once the fitted regression line is known, the fitted value, $\hat{y}_i$, corresponding to any observed data point, $y_i$, can be calculated. For example, the fitted value corresponding to the 21st observation in Table 4.1 is: MATH

The observed response at this point is $y_{21}$. Therefore, the residual at this point is: MATH

In DOE++, fitted values and residuals are available using the Diagnostic icon in the Control Panel. The values are shown in Figure 4.5.

 

 

Figure 4.5: Fitted values and residuals for the data in Table 4.1.

 

See Also:

 

Statistical Inference for Two Samples

Hypothesis Tests in Simple Linear Regression

Analysis of Experiments

Multiple Linear Regression Analysis

Use of Regression to Calculate Sum of Squares