# The Importance of Looking at the Probability Plot When Fitting a Model Using Least Squares Regression

The correlation coefficient, ρ, provides a measure of how well a model fits a set of data points. For most real data sets, the absolute value of ρ lies strictly between 0 and 1, with a higher absolute value indicating a better fit of the model to the data (i.e., when ρ = -1 or ρ = 1 the model fits the data perfectly, and when ρ = 0 there is no linear relationship between the plotted points and the model line). In fact, some practicing reliability engineers use a threshold value of ρ (e.g., 0.9) to decide whether their model adequately fits the data. However, when fitting a model to data, it is important to evaluate the probability plot in addition to the value of ρ before concluding that the proposed model is a good fit to the data.

The equation used to calculate the correlation coefficient was presented in HotWire Issue 71, along with an example of using the value of ρ to determine which of two models fits a single data set better. In this article, we will examine the correlation coefficient from a different angle. We will use two data sets with similar parameter estimates and similar values of ρ to illustrate the importance of examining the probability plots when assessing model fit.

Consider the two data sets given in Table 1. Both data sets contain 10 exact time-to-failure data points measured in hours. The data sets correspond to failure modes that are due to wearout (though not necessarily the same failure mode for the two data sets).

Table 1 – Times to Failure Data in Hours

| Data Set 1 | Data Set 2 |
|---|---|
| 12410 | 13083 |
| 14763 | 15450 |
| 16799 | 17467 |
| 20741 | 19196 |
| 22770 | 20753 |
| 22885 | 22649 |
| 27609 | 24345 |
| 28246 | 27256 |
| 29650 | 30707 |
| 31132 | 36059 |

The data sets were entered into the Weibull++ software and analyzed using a 2-parameter Weibull distribution. The results are shown in Table 2.

Table 2 – Calculated Parameters Using 2-Parameter Weibull Distribution

| | Data Set 1 | Data Set 2 |
|---|---|---|
| Beta | 3.61 | 3.63 |
| Eta (hours) | 25200 | 25100 |
| Correlation Coefficient, ρ | 0.986 | 0.986 |
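The Table 2 values can be approximately reproduced with a few lines of Python. The sketch below is not the Weibull++ implementation. It assumes Benard's approximation for the median ranks and a regression of ln(t) on the transformed unreliability (rank regression on X), neither of which is stated in the article. The function and variable names are illustrative only.

```python
import math

# Times to failure in hours, copied from Table 1.
DATA_SET_1 = [12410, 14763, 16799, 20741, 22770, 22885, 27609, 28246, 29650, 31132]
DATA_SET_2 = [13083, 15450, 17467, 19196, 20753, 22649, 24345, 27256, 30707, 36059]


def median_ranks(n):
    """Benard's approximation to the median rank of the i-th of n failures (assumed)."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]


def weibull_rank_regression(times):
    """Least squares fit of a 2-parameter Weibull on the probability plot.

    Linearization: ln(-ln(1 - F)) = beta * ln(t) - beta * ln(eta).
    Regresses x = ln(t) on y = ln(-ln(1 - F)) and returns (beta, eta, rho).
    """
    t = sorted(times)
    n = len(t)
    x = [math.log(v) for v in t]
    y = [math.log(-math.log(1.0 - f)) for f in median_ranks(n)]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / syy                  # x = intercept + slope * y, slope = 1 / beta
    intercept = x_bar - slope * y_bar  # intercept = ln(eta)
    rho = sxy / math.sqrt(sxx * syy)
    return 1.0 / slope, math.exp(intercept), rho


for name, data in (("Data Set 1", DATA_SET_1), ("Data Set 2", DATA_SET_2)):
    beta, eta, rho = weibull_rank_regression(data)
    print(f"{name}: beta={beta:.2f}, eta={eta:.0f} h, rho={rho:.3f}")
```

Under these assumptions the estimates should land close to the values in Table 2. Small differences are expected, since the software may use exact median ranks rather than Benard's approximation.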

The values of ρ for Data Set 1 and Data Set 2 are equal to three decimal places, which indicates that the model line accounts for essentially the same proportion of the variation in the plotted points for each data set. In addition, the values of ρ are close to 1, indicating that there is little difference between the data points and the models. Since it is already known that the data sets are associated with wearout failure modes, there is no obvious reason to suspect that the 2-parameter Weibull model is a poor choice. However, inspection of the probability plots tells a different story.

One of the assumptions of the least squares regression method is that the error (i.e., the difference between the measured and predicted values of unreliability) is normally distributed with a mean of 0 and a constant variance. In other words, the data points should be scattered evenly about the model without any obvious trends. This implies that the model adequately fits the data set and the difference between the model and the data points is due to variability in the sample. The probability plot for Data Set 1, shown in Figure 1, illustrates a case in which the model describes the data well.

Figure 1 – 2-Parameter Weibull Probability Plot for Data Set 1

Alternatively, if the error follows a pattern, such as in the probability plot for Data Set 2 shown in Figure 2, then the model is not adequate to describe the data. In this case, a different model should be chosen in spite of the fact that the value of the correlation coefficient is close to 1.

Figure 2 – 2-Parameter Weibull Probability Plot for Data Set 2
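A rough numerical companion to inspecting the plots is to list the residuals of each 2-parameter Weibull fit in order of failure time and see how often their sign changes. The sketch below is an illustration, not a method from the article. It assumes Benard's approximation for the median ranks and a regression of ln(t) on the transformed unreliability, and the helper names `weibull_plot_residuals` and `sign_runs` are hypothetical.

```python
import math

# Times to failure in hours, copied from Table 1.
DATA_SET_1 = [12410, 14763, 16799, 20741, 22770, 22885, 27609, 28246, 29650, 31132]
DATA_SET_2 = [13083, 15450, 17467, 19196, 20753, 22649, 24345, 27256, 30707, 36059]


def weibull_plot_residuals(times):
    """Residuals in ln(t) about the fitted line on a Weibull probability plot.

    Uses Benard's median ranks and regresses x = ln(t) on y = ln(-ln(1 - F)),
    both of which are assumptions made for this sketch.
    """
    t = sorted(times)
    n = len(t)
    x = [math.log(v) for v in t]
    y = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((b - y_bar) ** 2 for b in y))
    intercept = x_bar - slope * y_bar
    return [a - (intercept + slope * b) for a, b in zip(x, y)]


def sign_runs(residuals):
    """Count runs of consecutive residuals that share the same sign."""
    signs = [r > 0 for r in residuals]
    return 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)


for name, data in (("Data Set 1", DATA_SET_1), ("Data Set 2", DATA_SET_2)):
    res = weibull_plot_residuals(data)
    print(name, "runs:", sign_runs(res))
    print("  residuals:", " ".join(f"{r:+.3f}" for r in res))
```

Residuals that alternate in sign frequently are consistent with random scatter, while a few long runs (for example positive, then negative, then positive again) suggest curvature that a straight line cannot capture. With only 10 points this count is a crude indicator and does not replace looking at the plot itself.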

Since the data sets are assumed to be due to a wearout type failure mode, another logical choice of model would be the lognormal distribution. For comparison purposes, the analysis was run again for both sets of data using a lognormal model. The results are shown in Table 3.

Table 3 – Calculated Parameters Using Lognormal Distribution

| | Data Set 1 | Data Set 2 |
|---|---|---|
| Log-Mean (hours) | 9.99 | 9.99 |
| Log-Std | 0.331 | 0.341 |
| Correlation Coefficient, ρ | 0.967 | 0.9998 |
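The lognormal results can be approximated in the same spirit using only the Python standard library. This is a sketch under stated assumptions, not the Weibull++ implementation: it assumes Benard's approximation for the median ranks and a regression of ln(t) on the standard normal quantile of the median rank, and the function name is illustrative.

```python
import math
from statistics import NormalDist

# Times to failure in hours, copied from Table 1.
DATA_SET_1 = [12410, 14763, 16799, 20741, 22770, 22885, 27609, 28246, 29650, 31132]
DATA_SET_2 = [13083, 15450, 17467, 19196, 20753, 22649, 24345, 27256, 30707, 36059]


def lognormal_rank_regression(times):
    """Least squares fit of a lognormal on the probability plot.

    Linearization: ln(t) = log_mean + log_std * z, where z is the standard
    normal quantile of the median rank (Benard's approximation, assumed).
    Returns (log_mean, log_std, rho).
    """
    t = sorted(times)
    n = len(t)
    std_normal = NormalDist()
    x = [math.log(v) for v in t]
    z = [std_normal.inv_cdf((i - 0.3) / (n + 0.4)) for i in range(1, n + 1)]
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    szz = sum((b - z_bar) ** 2 for b in z)
    sxz = sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z))
    log_std = sxz / szz                  # slope of ln(t) against z
    log_mean = x_bar - log_std * z_bar   # intercept at z = 0
    rho = sxz / math.sqrt(sxx * szz)
    return log_mean, log_std, rho


for name, data in (("Data Set 1", DATA_SET_1), ("Data Set 2", DATA_SET_2)):
    mu, sigma, rho = lognormal_rank_regression(data)
    print(f"{name}: log-mean={mu:.2f}, log-std={sigma:.3f}, rho={rho:.4f}")
```

Under these assumptions the estimates should land close to the values in Table 3, with small differences possible if the software uses exact median ranks.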

Using the lognormal model, the parameters calculated from the two data sets are close to each other but the correlation coefficients are quite different. The probability plot for Data Set 1 is shown in Figure 3. Here, you can see that the data points are scattered around the model without any clear trend, so this model does not violate the assumption of normally distributed error made by using least squares regression. Nevertheless, the analyst would choose the 2-parameter Weibull model in this case, because the correlation coefficient is higher for the 2-parameter Weibull model than for the lognormal model.

Figure 3 – Lognormal Probability Plot for Data Set 1

The probability plot for Data Set 2 is shown in Figure 4. Here, you can see that the data are explained almost perfectly by the model, which is reflected in a correlation coefficient very close to 1. In addition, there is no discernible trend in the scatter of the data points around the line. Thus, the lognormal model would be preferred for this data set. An analyst who chose the 2-parameter Weibull model on the strength of its high correlation coefficient alone, without looking at the probability plot, would have missed the fact that the lognormal model describes Data Set 2 better.

Figure 4 – Lognormal Probability Plot for Data Set 2

## Summary

In this article, we examined how to use the correlation coefficient, ρ, in conjunction with the probability plot to assess the fit of two models commonly used for life data analysis to two different data sets. For each data set, the 2-parameter Weibull model provided a high value of the correlation coefficient. However, upon examination of the probability plots, a trend in the scatter of the data points around the model was observed for Data Set 2 fitted with the 2-parameter Weibull model. This combination therefore violated an underlying assumption of least squares regression, indicating that the model was inappropriate for the data. Repeating the analysis with the lognormal model showed that the lognormal distribution was a better choice for that data set.