Rank Regression Parameter Estimation
In the last two editions of Reliability Basics, we looked at the probability plotting and maximum likelihood methods of parameter estimation. In this edition, we will examine the rank regression method for parameter estimation, also known as the least squares method. This is, in essence, a more formalized version of the manual probability plotting technique, in that it provides a mathematical method for fitting a line to plotted failure data points. This eliminates some of the guesswork inherent in the probability plotting method, and allows for computer-based solution techniques that are less complicated than those of the maximum likelihood method.
The initial process of using rank regression to analyze life data is identical to the process outlined for probability plotting. First, failure data must be obtained. (For the sake of simplicity, we will assume complete data, i.e. no suspensions.) The data can then be plotted on a special type of plotting paper that linearizes the unreliability function for a particular distribution. The x-axis coordinates represent the failure times, while the y-axis coordinates represent unreliability estimates. These unreliability estimates are usually obtained via median ranks, hence the term rank regression. (For more information on probability plotting paper, linearization of unreliability functions and unreliability estimates, see the article on probability plotting at https://www.weibull.com/hotwire/issue8/relbasics8.htm.)
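The preparation of the plotting coordinates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the median ranks are computed with Benard's approximation, (i - 0.3) / (N + 0.4), which is a common shortcut and not necessarily the method used in the original article; the two-parameter Weibull distribution is assumed as the example to be linearized; and the function names and failure times are hypothetical.

```python
import math

def median_ranks(n):
    """Approximate median ranks for n ordered failures (Benard's approximation).

    Assumption: Benard's formula (i - 0.3) / (n + 0.4) stands in for exact
    median ranks.
    """
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

def weibull_plot_coordinates(failure_times):
    """Return linearized (x, y) points for a two-parameter Weibull plot.

    x = ln(t) and y = ln(-ln(1 - F)), where F is the unreliability estimate.
    Assumes complete data (no suspensions) and strictly positive times.
    """
    times = sorted(failure_times)
    ranks = median_ranks(len(times))
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(1.0 - f)) for f in ranks]
    return list(zip(xs, ys))

# Hypothetical complete data set (hours to failure), for illustration only.
times = [16.0, 34.0, 53.0, 75.0, 93.0, 120.0]
for x, y in weibull_plot_coordinates(times):
    print(f"x = {x:.4f}, y = {y:.4f}")
```

The resulting (x, y) pairs are the points to which a straight line is then fitted.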
It now remains to develop a model for the unreliability function based on the placement of the points on the plot. In probability plotting, this is done manually, by "eyeballing" a straight line through the points. Obviously, this method is subject to a great deal of error and does not provide repeatability. Rank regression accomplishes this modeling with more accuracy and repeatability by employing a common mathematical technique known as least squares regression analysis.
Least squares (or least sum of squares) regression fits a straight line to a set of data points such that the sum of the squares of the distances from the points to the fitted line is minimized. This minimization can be performed in either the vertical or the horizontal direction. If the regression is on x, the line is fitted so that the sum of the squared horizontal deviations from the points to the line is minimized. If the regression is on y, the line is fitted so that the sum of the squared vertical deviations from the points to the line is minimized. This is illustrated in the following figure.
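As a rough numeric illustration of the two directions, the following Python sketch evaluates both sums of squares for a single candidate line of the form y = ax + b. The function names, the data points and the candidate values of a and b are all hypothetical and chosen only for illustration.

```python
def sum_sq_vertical(points, a, b):
    """Sum of squared vertical deviations from each point to y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in points)

def sum_sq_horizontal(points, a, b):
    """Sum of squared horizontal deviations from each point to y = a*x + b.

    The line is solved for x, giving x = (y - b) / a, so a must be nonzero.
    """
    return sum((x - (y - b) / a) ** 2 for x, y in points)

# Hypothetical points and a hypothetical candidate line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
a, b = 2.0, 0.0
print("vertical:  ", sum_sq_vertical(points, a, b))
print("horizontal:", sum_sq_horizontal(points, a, b))
```

For any one fixed line the two sums differ only by a factor of a squared, but the line that minimizes one is in general not the line that minimizes the other, which is why regression on x and regression on y can give different parameter estimates.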
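At this point, we have data plotted on a probability plot, and we want to regress a straight line through the data points. This straight line will take the form $y = ax + b$, and we want to find the values of $a$ and $b$ that minimize the sum of the squares of the distances from the points to the line. For rank regression on y, this can be expressed mathematically as:
$$\sum_{i=1}^{N} \left(\hat{a} x_i + \hat{b} - y_i\right)^2 = \min_{a,\,b} \sum_{i=1}^{N} \left(a x_i + b - y_i\right)^2$$
where $\hat{a}$ and $\hat{b}$ are the least squares estimates of $a$ and $b$. This sum is minimized with the following values of $\hat{a}$ and $\hat{b}$:
$$\hat{a} = \frac{\sum_{i=1}^{N} x_i y_i - \dfrac{\sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N}}{\sum_{i=1}^{N} x_i^2 - \dfrac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}$$
$$\hat{b} = \frac{\sum_{i=1}^{N} y_i}{N} - \hat{a}\,\frac{\sum_{i=1}^{N} x_i}{N} = \bar{y} - \hat{a}\,\bar{x}$$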
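and $N$ is the number of $(x_i, y_i)$ data coordinates. Likewise, for rank regression on x, the line is written as $x = ay + b$, and the expression for minimizing the sum of the squares of the horizontal distances between the points and the line follows the form:
$$\sum_{i=1}^{N} \left(\hat{a} y_i + \hat{b} - x_i\right)^2 = \min_{a,\,b} \sum_{i=1}^{N} \left(a y_i + b - x_i\right)^2$$
As with regression on y, this sum is minimized with the following values of $\hat{a}$ and $\hat{b}$:
$$\hat{a} = \frac{\sum_{i=1}^{N} x_i y_i - \dfrac{\sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N}}{\sum_{i=1}^{N} y_i^2 - \dfrac{\left(\sum_{i=1}^{N} y_i\right)^2}{N}}$$
$$\hat{b} = \frac{\sum_{i=1}^{N} x_i}{N} - \hat{a}\,\frac{\sum_{i=1}^{N} y_i}{N} = \bar{x} - \hat{a}\,\bar{y}$$
Note that in this case $\hat{a}$ and $\hat{b}$ describe x as a function of y. Solving for y gives a line on the probability plot with slope $1/\hat{a}$ and intercept $-\hat{b}/\hat{a}$.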
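The closed-form estimates for both directions can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. It assumes that the regression-on-y line is written as y = ax + b and, as an assumption of this sketch, that the regression-on-x line is written as x = ay + b. The function names and data points are hypothetical.

```python
def _sums(points):
    """Return N and the sums needed by both regression directions."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_yy = sum(y * y for _, y in points)
    return n, sum_x, sum_y, sum_xy, sum_xx, sum_yy

def rank_regression_on_y(points):
    """Least squares estimates (a_hat, b_hat) for the line y = a*x + b."""
    n, sum_x, sum_y, sum_xy, sum_xx, _ = _sums(points)
    a_hat = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    b_hat = sum_y / n - a_hat * sum_x / n
    return a_hat, b_hat

def rank_regression_on_x(points):
    """Least squares estimates (a_hat, b_hat) for the line x = a*y + b."""
    n, sum_x, sum_y, sum_xy, _, sum_yy = _sums(points)
    a_hat = (sum_xy - sum_x * sum_y / n) / (sum_yy - sum_y ** 2 / n)
    b_hat = sum_x / n - a_hat * sum_y / n
    return a_hat, b_hat

def as_slope_intercept(a_hat, b_hat):
    """Rewrite x = a_hat*y + b_hat as y = slope*x + intercept (a_hat != 0)."""
    return 1.0 / a_hat, -b_hat / a_hat

# Hypothetical linearized plot coordinates, for illustration only.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
print("regression on y:", rank_regression_on_y(points))
print("regression on x:", as_slope_intercept(*rank_regression_on_x(points)))
```

Both results are printed as (slope, intercept) pairs of a line in the x-y plane, so they can be compared directly. The two fits coincide only when the points lie exactly on a straight line.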
One of the advantages of the rank regression method is that it can provide a good measure for the fit of the line to the data points. This measure is known as the correlation coefficient, and is commonly represented by the Greek letter rho, ρ. In the case of life data analysis, it is a measure for the strength of the linear relation between the median ranks (y-axis values) and the failure time data (x-axis values). The population correlation coefficient has the following form:
$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
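where $\sigma_{xy}$ is the covariance of x and y, $\sigma_x$ is the standard deviation of x, and $\sigma_y$ is the standard deviation of y. The estimate of the correlation coefficient is given by:
$$\hat{\rho} = \frac{\sum_{i=1}^{N} x_i y_i - \dfrac{\sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N}}{\sqrt{\left(\sum_{i=1}^{N} x_i^2 - \dfrac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}\right)\left(\sum_{i=1}^{N} y_i^2 - \dfrac{\left(\sum_{i=1}^{N} y_i\right)^2}{N}\right)}}$$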
where $\hat{\rho}$ is the estimator for ρ. The closer the absolute value of $\hat{\rho}$ is to 1, the better the linear fit. Note that +1 indicates a perfect fit with a positive slope, while -1 indicates a perfect fit with a negative slope. A "perfect fit" means that all of the points fall exactly on a straight line. A correlation coefficient value of zero indicates that there is no linear relationship between the x and y values, so a straight line does not describe the data at all.
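The sample correlation coefficient can be computed from the same sums used for the line fit. The Python sketch below is illustrative only; the function name and the data points are hypothetical.

```python
import math

def correlation_coefficient(points):
    """Sample correlation coefficient of a list of (x, y) points.

    Requires at least two points, with x and y each taking more than one
    distinct value; otherwise the denominator is zero.
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_yy = sum(y * y for _, y in points)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt(
        (sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n)
    )
    return numerator / denominator

# Hypothetical linearized plot coordinates, for illustration only.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
print("rho_hat =", correlation_coefficient(points))
```

Because the formula is symmetric in x and y, the same value is obtained whether the line was fitted by regression on x or by regression on y.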
The rank regression estimation method is quite good for functions that can be linearized. As was discussed in the article on probability plotting, most of the distributions used in life data analysis are capable of being linearized. For these distributions, the calculations are relatively easy and straightforward, having closed-form solutions that can readily yield an answer without having to resort to numerical techniques or tables. Further, this technique provides a good measure of the goodness-of-fit of the chosen distribution in the correlation coefficient. Least squares is generally best used with data sets containing complete data, that is, data consisting only of single times-to-failure with no censored or interval data. For data sets containing large quantities of suspended data points, maximum likelihood estimation may be the preferable form of analysis.
Copyright © 2001 ReliaSoft Corporation, ALL RIGHTS RESERVED