
The Importance of Looking at the Probability Plot When Fitting a Model Using Least Squares Regression



The correlation coefficient, ρ, provides a measure of how well a model fits a set of data points. For most real data sets, the absolute value of ρ lies between 0 and 1, with a higher absolute value indicating a better fit of the model to the data (i.e., when ρ = -1 or ρ = 1 the model fits the data perfectly, and when ρ = 0 it does not fit at all). In fact, some practicing reliability engineers use a threshold value of ρ (e.g., 0.9) to decide that a model adequately fits the data. However, a high value of ρ is not sufficient on its own: it is important to evaluate the probability plot as well before concluding that the proposed model is a good fit to the data.

The equation used to calculate the correlation coefficient was presented in HotWire Issue 71, along with an example of using the value of ρ to determine which of two models fits a single data set better. In this article, we will examine the correlation coefficient from a different angle. We will use two data sets with similar parameter estimates and similar values of ρ to illustrate the importance of examining the probability plots when assessing model fit.

Consider the two data sets given in Table 1. Both data sets contain 10 exact time-to-failure data points measured in hours. The data sets correspond to failure modes that are due to wearout (though not necessarily the same failure mode for the two data sets).

Table 1 – Times to Failure Data in Hours

Data Set 1    Data Set 2
12410         13083
14763         15450
16799         17467
20741         19196
22770         20753
22885         22649
27609         24345
28246         27256
29650         30707
31132         36059

The data sets were entered into the Weibull++ software and analyzed using a 2-parameter Weibull distribution. The results are shown in Table 2.

Table 2 – Calculated Parameters Using 2-Parameter Weibull Distribution

                              Data Set 1    Data Set 2
Beta                          3.61          3.63
Eta (hours)                   25200         25100
Correlation Coefficient, ρ    0.986         0.986
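
The calculation behind Table 2 can be approximated with a short Python sketch. It is not the Weibull++ implementation: it assumes rank regression on X (regressing ln(t) on ln(-ln(1 - F))) and uses Benard's approximation, (i - 0.3)/(n + 0.4), for the median ranks. The function names are hypothetical and chosen only for illustration.

```python
import math

DATA_SET_1 = [12410, 14763, 16799, 20741, 22770, 22885, 27609, 28246, 29650, 31132]
DATA_SET_2 = [13083, 15450, 17467, 19196, 20753, 22649, 24345, 27256, 30707, 36059]


def median_ranks(n):
    """Benard's approximation to the median ranks (an assumption of this sketch)."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]


def weibull_rank_regression(times):
    """Least squares fit of a 2-parameter Weibull on the linearized plot.

    Linearization: ln(-ln(1 - F)) = beta * ln(t) - beta * ln(eta).
    Regressing x = ln(t) on y = ln(-ln(1 - F)) gives a slope of 1/beta
    and an intercept of ln(eta). Returns (beta, eta, rho).
    """
    t = sorted(times)
    n = len(t)
    x = [math.log(v) for v in t]
    y = [math.log(-math.log(1.0 - f)) for f in median_ranks(n)]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / syy                  # x = intercept + slope * y
    intercept = x_bar - slope * y_bar
    beta = 1.0 / slope
    eta = math.exp(intercept)
    rho = sxy / math.sqrt(sxx * syy)   # Pearson correlation of x and y
    return beta, eta, rho


for name, data in (("Data Set 1", DATA_SET_1), ("Data Set 2", DATA_SET_2)):
    beta, eta, rho = weibull_rank_regression(data)
    print(f"{name}: beta = {beta:.2f}, eta = {eta:.0f} hours, rho = {rho:.3f}")
```

Because the median ranks are approximated here, the estimates should land close to the values in Table 2 but may differ slightly in the last reported digit.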

The values of ρ for Data Set 1 and Data Set 2 are equal, which indicates that, relative to the spread of the data, the data points lie about equally close to the model line on the probability plot for each data set. In addition, the values of ρ are close to 1, indicating that there is little difference between the data points and the models. Since it is already known that the data sets are associated with wearout failure modes, there is no reason, on the basis of ρ alone, to suspect that the 2-parameter Weibull model might not be the best choice. However, inspection of the probability plots tells a different story.

One of the assumptions of the least squares regression method is that the error (i.e., the difference between the measured and predicted values of unreliability) is normally distributed with a mean of 0 and a constant variance. In other words, the data points should be scattered evenly about the fitted line without any obvious trends. When that is the case, the model adequately fits the data set and the differences between the model and the data points can be attributed to variability in the sample. The probability plot for Data Set 1, shown in Figure 1, illustrates a case in which the model describes the data well.

Figure 1 – 2-Parameter Weibull Probability Plot for Data Set 1

Alternatively, if the error follows a pattern, such as in the probability plot for Data Set 2 shown in Figure 2, then the model is not adequate to describe the data. In this case, a different model should be chosen in spite of the fact that the value of the correlation coefficient is close to 1.

Figure 2 – 2-Parameter Weibull Probability Plot for Data Set 2
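
A pattern in the error can also be screened for numerically. The sketch below is a rough aid and not a substitute for looking at the plot: it fits the 2-parameter Weibull line under the same assumptions as an ordinary rank regression on X with Benard's median ranks, computes the horizontal residuals in ln(t), and counts the runs of consecutive residuals with the same sign. Few runs relative to the number of points suggests that the points curve systematically around the line. The function names are hypothetical, and the run count is not a formal statistical test.

```python
import math

DATA_SET_1 = [12410, 14763, 16799, 20741, 22770, 22885, 27609, 28246, 29650, 31132]
DATA_SET_2 = [13083, 15450, 17467, 19196, 20753, 22649, 24345, 27256, 30707, 36059]


def linearized_weibull_residuals(times):
    """Residuals in ln(t) about the fitted Weibull line, in failure order.

    Assumes rank regression on X and Benard's median ranks, which may
    differ from the settings used to produce the figures in the article.
    """
    t = sorted(times)
    n = len(t)
    x = [math.log(v) for v in t]
    y = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / syy
    intercept = x_bar - slope * y_bar
    return [a - (intercept + slope * b) for a, b in zip(x, y)]


def count_sign_runs(residuals):
    """Number of runs of consecutive residuals that share the same sign."""
    signs = [r > 0 for r in residuals]
    return 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)


for name, data in (("Data Set 1", DATA_SET_1), ("Data Set 2", DATA_SET_2)):
    res = linearized_weibull_residuals(data)
    pattern = "".join("+" if r > 0 else "-" for r in res)
    print(f"{name}: signs {pattern}, runs = {count_sign_runs(res)}")
```

For the data in Table 1, Data Set 2 should produce fewer runs than Data Set 1, with positive residuals at both ends and negative residuals in the middle, which is the kind of curvature that a high value of ρ can hide.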

Since the failures in both data sets are assumed to be due to wearout-type failure modes, another logical choice of model would be the lognormal distribution. For comparison purposes, the analysis was run again for both sets of data using a lognormal model. The results are shown in Table 3.

Table 3 – Calculated Parameters Using Lognormal Distribution

                              Data Set 1    Data Set 2
Log-Mean (hours)              9.99          9.99
Log-Std                       0.331         0.341
Correlation Coefficient, ρ    0.967         0.9998
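
The lognormal results in Table 3 can be approximated in the same way. The sketch below assumes rank regression on X with Benard's approximation for the median ranks, and uses the standard normal quantile of each median rank as the plotting position, so that ln(t) = log-mean + log-std × z. It is an illustration with hypothetical function names, not the Weibull++ implementation.

```python
import math
from statistics import NormalDist

DATA_SET_1 = [12410, 14763, 16799, 20741, 22770, 22885, 27609, 28246, 29650, 31132]
DATA_SET_2 = [13083, 15450, 17467, 19196, 20753, 22649, 24345, 27256, 30707, 36059]


def lognormal_rank_regression(times):
    """Least squares fit of a lognormal on the linearized plot.

    Regresses x = ln(t) on z = inverse standard normal CDF of the median
    rank; the slope estimates the log-std and the intercept the log-mean.
    Returns (log_mean, log_std, rho).
    """
    t = sorted(times)
    n = len(t)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    x = [math.log(v) for v in t]
    # Benard's approximation for the median ranks (an assumption).
    z = [std_normal.inv_cdf((i - 0.3) / (n + 0.4)) for i in range(1, n + 1)]
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    szz = sum((b - z_bar) ** 2 for b in z)
    sxz = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z))
    log_std = sxz / szz                 # x = log_mean + log_std * z
    log_mean = x_bar - log_std * z_bar
    rho = sxz / math.sqrt(sxx * szz)
    return log_mean, log_std, rho


for name, data in (("Data Set 1", DATA_SET_1), ("Data Set 2", DATA_SET_2)):
    mu, sigma, rho = lognormal_rank_regression(data)
    print(f"{name}: log-mean = {mu:.2f}, log-std = {sigma:.3f}, rho = {rho:.4f}")
```

As with the Weibull sketch, small differences from Table 3 in the last reported digit are to be expected because the median ranks are approximated.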

Using the lognormal model, the parameters calculated from the two data sets are close to each other but the correlation coefficients are quite different. The probability plot for Data Set 1 is shown in Figure 3. Here, you can see that the data points are scattered around the model without any clear trend, so this model does not violate the assumption of normally distributed error made by using least squares regression. Nevertheless, the analyst would choose the 2-parameter Weibull model in this case, because the correlation coefficient is higher for the 2-parameter Weibull model than for the lognormal model.

Figure 3 – Lognormal Probability Plot for Data Set 1

The probability plot for Data Set 2 is shown in Figure 4. Here, you can see that the data are explained almost perfectly by the model, which is reflected in a correlation coefficient that is very close to 1. In addition, there is no discernible trend in the scatter of the data points around the line. Thus, the lognormal model would be preferred for this data set. If the analyst had based the choice of a model on the fact that the 2-parameter Weibull had a high correlation coefficient without looking at the probability plot, they would have missed the fact that the lognormal model describes Data Set 2 better.

Figure 4 – Lognormal Probability Plot for Data Set 2


In this article, we examined how to use the correlation coefficient, ρ, in conjunction with the probability plot to assess the fit of two models commonly used for life data analysis to two different data sets. For each data set, the 2-parameter Weibull model provided a high value of the correlation coefficient. However, upon examination of the probability plots, a trend in the scatter of the data points around the model was observed for one data set/model combination. Thus, one data set/model combination violated an underlying assumption of least squares regression, indicating that the model was inappropriate for the data. Repeating the analysis with the lognormal model showed that the lognormal distribution was a better choice for that data set.