Reliability HotWire

Issue 78, August 2007

Reliability Basics

Reliability Growth Analysis with Missing or Erroneous Data

Most of the reliability growth models used for estimating and tracking reliability growth based on test data assume that the data set represents all actual system failure times (complete data), consistent with a uniform definition of failure. In practice, unfortunately, things do not always work out this way. Training issues, oversight, biases, misreporting, human error, technical difficulties, loss of data and the like may render a portion of the data erroneous or completely missing. Without applying "corrections" to the way the data set is handled and the way the models and their parameters are derived, "standard" analysis may result in distorted estimates of the growth rate and of actual system reliability. This article discusses a practical reliability growth estimation and analysis procedure for treating data that contains anomalies over an interval of the test period. ReliaSoft's RGA software is used to perform the analysis.

Procedure
To use the Crow-AMSAA model for reliability growth analysis of data containing missing or abnormal observations over a certain interval, we assume that the problematic interval occurs independently of the underlying reliability growth process. The data from the problematic interval are not used in the analysis, but the interval's contribution to the total test time is retained, and the (unknown) failures in the interval are assumed to be consistent with the rest of the failure data. This approach is often referred to as "gap analysis."

Consider the case where a system is tested for time T and the actual failure times are recorded. The time T may possibly be an observed failure time. Also, the end points of the gap interval may or may not correspond to recorded failure times. The underlying assumption is that the data used in the maximum likelihood (ML) estimation follow the Crow-AMSAA model with Weibull intensity function λβt^(β−1). It is assumed that the actual number of failures over the gap interval is unknown; hence, no information regarding these failures is used in any way to estimate λ and β.

Let S1 and S2 (S1 < S2) denote the end points of the gap interval. Let 0 < X1 < X2 < ... < X_N1 ≤ S1 be the failure times over (0, S1) and let S2 < X_(N1+1) < X_(N1+2) < ... < X_N ≤ T be the failure times over (S2, T), where N = N1 + N2 is the total number of failures outside the gap.

The ML estimates of λ and β are obtained from the following equations:

λ̂ = N / (S1^β̂ + T^β̂ − S2^β̂)

N/β̂ + Σ ln Xi − λ̂ (S1^β̂ · ln S1 + T^β̂ · ln T − S2^β̂ · ln S2) = 0

where the sum runs over the N failure times outside the gap (and the S1 term is taken as zero when S1 = 0). In general, the second equation cannot be solved explicitly for β̂; it is solved using numerical methods, after which λ̂ follows from the first equation.

Example

Consider a system under development that was subjected to a reliability growth test for T = 300 hours. The next table shows the N = 35 successive failure times that were reported over the course of the test.

Reported Failure Times (hr)

 10      25      28.1    36.5    49      52.5    53.9
 56.5    63.1    63.5    65.4    65.9    69.6    70.6
 73      75.3    77.7    88.5    89.4    93.9    95.44
 95.5    98.1    101.1   132     142.2   147.7   149
 167.2   190.7   193     198.7   251.9   282.5   286.1

The above data set was entered into an RGA data sheet configured for the Failure Times data type.

The analyst used RGA to estimate the following Crow-AMSAA parameters (obtained without applying gap analysis concepts) and demonstrated MTBF.

 β = 0.8001     λ = 0.3648     Demonstrated MTBF = 10.7129 hr
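For time-terminated data without a gap, the Crow-AMSAA ML estimates have the familiar closed form β̂ = N / (N ln T − Σ ln t_i) and λ̂ = N / T^β̂, with the demonstrated (instantaneous) MTBF given by 1/(λ̂ β̂ T^(β̂−1)). As a minimal sketch (plain Python, not RGA's internal implementation), these formulas reproduce the values above from the reported failure times:

```python
import math

# Failure times (hr) reported over the T = 300 hr test (from the table above).
times = [10, 25, 28.1, 36.5, 49, 52.5, 53.9, 56.5, 63.1, 63.5,
         65.4, 65.9, 69.6, 70.6, 73, 75.3, 77.7, 88.5, 89.4, 93.9,
         95.44, 95.5, 98.1, 101.1, 132, 142.2, 147.7, 149, 167.2,
         190.7, 193, 198.7, 251.9, 282.5, 286.1]
T = 300.0
N = len(times)  # 35

# Closed-form Crow-AMSAA (NHPP) ML estimates for time-terminated data:
beta = N / (N * math.log(T) - sum(math.log(t) for t in times))
lam = N / T**beta
# Demonstrated (instantaneous) MTBF at the end of the test:
mtbf = 1.0 / (lam * beta * T**(beta - 1))

print(f"beta = {beta:.4f}, lambda = {lam:.4f}, MTBF = {mtbf:.4f}")
```

Running this yields β ≈ 0.8001, λ ≈ 0.3648 and MTBF ≈ 10.71 hr, in agreement with the RGA output.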

The next figure shows a plot of the cumulative number of failures versus time.

The above figure does not show a good fit of the model to the data set. RGA also indicated that the Cramér-von Mises goodness-of-fit test failed. Therefore, there were concerns that the data set did not follow the Crow-AMSAA reliability growth model well.

The data set was then broken into 50 hour segments; the following table gives a breakdown of the number of reported failures by segment.

 Time Period (hr)    Number of Reported Failures
 0 - 50              5
 50 - 100            18
 100 - 150           5
 150 - 200           4
 200 - 250           0
 250 - 300           3
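The segment counts can be reproduced directly from the reported failure times; a quick check in plain Python (the variable names here are illustrative, not part of any RGA interface):

```python
# Reported failure times (hr) from the table above.
times = [10, 25, 28.1, 36.5, 49, 52.5, 53.9, 56.5, 63.1, 63.5,
         65.4, 65.9, 69.6, 70.6, 73, 75.3, 77.7, 88.5, 89.4, 93.9,
         95.44, 95.5, 98.1, 101.1, 132, 142.2, 147.7, 149, 167.2,
         190.7, 193, 198.7, 251.9, 282.5, 286.1]

# Count failures in each 50 hr segment of the 300 hr test.
counts = [sum(1 for t in times if lo < t <= lo + 50) for lo in range(0, 300, 50)]
print(counts)  # one count per segment: 0-50, 50-100, ..., 250-300
# → [5, 18, 5, 4, 0, 3]
```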

The number of reported failures during the second 50 hour segment is quite high in comparison to the number of failures reported in the other segments. A quick investigation revealed that a number of new data collectors were assigned to the project during that period. It was also discovered that considerable design changes were made during this period, involving the removal of a large number of parts. It is possible that these removals, which were not failures, were incorrectly reported as failed parts. Based on knowledge of the system and the test program, it was clear that a quantity of actual system failures this large was extremely unlikely. The consensus was that this anomaly resulted from reporting failures inconsistently with the failure definition used throughout the program. It was therefore decided that, for this analysis, the actual number of failures over this interval would be assumed to be unknown but consistent with the remaining data and the Crow-AMSAA reliability growth model.

Considering the 50 hour to 100 hour interval as a problem interval and treating it as a gap interval, the analysis was repeated. In RGA, the gap is set by entering its beginning and ending times in the Gap Interval frame on the Analysis page of the control panel, as shown next.

The new Crow-AMSAA parameters and demonstrated MTBF are:

 β = 0.8774     λ = 0.1381     Demonstrated MTBF = 16.6184 hr
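These gap-analysis estimates can be cross-checked by solving the ML equations numerically. The sketch below (plain Python; the bisection bounds and iteration count are arbitrary choices for illustration, not RGA's algorithm) drops the 18 failures reported in the (50, 100] gap but keeps the interval's contribution to the total test time:

```python
import math

# Failure times (hr) outside the gap interval (50, 100]; the 18 failures
# reported inside the gap are excluded from the likelihood.
times = [10, 25, 28.1, 36.5, 49,
         101.1, 132, 142.2, 147.7, 149, 167.2,
         190.7, 193, 198.7, 251.9, 282.5, 286.1]
T, S1, S2 = 300.0, 50.0, 100.0
N = len(times)  # 17 failures outside the gap

def lam_hat(beta):
    # lambda expressed in terms of beta: N / (S1^b + T^b - S2^b)
    return N / (S1**beta + T**beta - S2**beta)

def score(beta):
    # Derivative of the log-likelihood with respect to beta; its root is the MLE.
    return (N / beta + sum(math.log(t) for t in times)
            - lam_hat(beta) * (S1**beta * math.log(S1)
                               + T**beta * math.log(T)
                               - S2**beta * math.log(S2)))

# Simple bisection for the root of score(beta); score is decreasing on [0.1, 5].
lo, hi = 0.1, 5.0
for _ in range(200):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
beta = (lo + hi) / 2
lam = lam_hat(beta)
mtbf = 1.0 / (lam * beta * T**(beta - 1))  # demonstrated (instantaneous) MTBF

print(f"beta = {beta:.4f}, lambda = {lam:.4f}, MTBF = {mtbf:.4f}")
```

This reproduces β ≈ 0.8774, λ ≈ 0.1381 and MTBF ≈ 16.62 hr, matching the RGA gap analysis results above.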

The next figure shows a plot of the cumulative number of failures versus time. This plot indicates a good fit of the model to the remaining data.

Comment:

Note that the mere fact that the model does not fit the data well is not, by itself, justification for eliminating some of the data from the analysis. An engineering explanation is needed to justify the use of gap analysis. In the above example, such an investigation made the elimination of a portion of the data justifiable.