Reliability Growth Analysis with Missing or Erroneous Data

[Editor's Note: This article has been updated
since its original publication to reflect a more recent version
of the software interface.]

Most of the reliability growth
models used for estimating and tracking reliability growth based
on test data assume that the data set represents all actual
system failure times (complete data) consistent with a uniform
definition of failure. In practice, unfortunately, things do not
always work out this way. There might be cases in which training
issues, oversight, biases, misreporting, human error, technical
difficulties, loss of data, etc. might render a portion of the
data erroneous or completely missing. Without applying
"corrections" to the way the data set is handled and the way the
models and their parameters are derived, "standard" analysis may
result in distorted estimates of the growth rate and actual
system reliability. This article discusses a practical
reliability growth estimation and analysis procedure to treat
data that contains anomalies over an interval of the test
period. ReliaSoft's
RGA software
is used to perform the analysis.

Procedure

To use the Crow-AMSAA model for reliability growth analysis
of a data set containing missing or abnormal data over a certain
interval, we assume that the problematic interval occurs independently of
the underlying reliability growth process. Also, the problematic
interval data is not used in the analysis, but the contribution
of the interval to the total test time is retained and the
failures in the interval are assumed to be consistent with the
rest of the failure data. This is often referred to as "gap
analysis."

Consider the case where a
system is tested for time T and the actual failure times
are recorded. The time T may or may not be an observed
failure time, and the end points of the gap interval may or
may not correspond to recorded failure times. The underlying
assumption is that the data used in the maximum likelihood (ML)
estimation follows the Crow-AMSAA model with a Weibull intensity
function λβt^(β-1).
It is assumed that the actual number of failures over the gap
interval is unknown, and hence no information regarding these
failures is used in any way to estimate
λ
and β.

Let S_1 and S_2 (S_1 < S_2) denote the end
points of the gap interval. Let 0 < X_1 < X_2
< ... < X_{N_1} ≤ S_1 be the failure
times over (0, S_1), and let S_2
< X_1 < X_2 < ... < X_{N_2} ≤ T
be the failure times over (S_2, T).

The ML estimates of λ and β are obtained using the following
equations, where N = N_1 + N_2 is the total number of recorded
failures and the sum is taken over all recorded failure times X_i:

λ = N / (T^β - S_2^β + S_1^β)

N/β + Σ ln(X_i) - λ [T^β ln(T) - S_2^β ln(S_2) + S_1^β ln(S_1)] = 0

In general, these equations cannot be solved explicitly. They
are solved using numerical methods.
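A minimal numerical sketch of such a solution in pure Python, using bisection on β with λ profiled out of the likelihood. The failure times below are hypothetical, for illustration only; they are not the article's data set.

```python
import math

def gap_mle(times, T, s1, s2):
    """ML estimates of (lambda, beta) for a Crow-AMSAA model observed
    over (0, s1) and (s2, T), with the gap (s1, s2) excluded.

    `times` are the recorded failure times outside the gap; the test
    is time terminated at T. Assumes 0 < s1 < s2 < T.
    """
    n = len(times)
    sum_log = sum(math.log(x) for x in times)

    def span(beta):
        # Term playing the role of T**beta once the gap is removed
        return T**beta - s2**beta + s1**beta

    def g(beta):
        # d(log-likelihood)/d(beta), with lambda = n / span(beta)
        # already substituted (profiled out)
        lam = n / span(beta)
        grad = (T**beta * math.log(T)
                - s2**beta * math.log(s2)
                + s1**beta * math.log(s1))
        return n / beta + sum_log - lam * grad

    # g is positive for small beta and negative for large beta,
    # so bisection over a wide bracket locates the root.
    lo, hi = 1e-3, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    beta_hat = 0.5 * (lo + hi)
    return n / span(beta_hat), beta_hat

# Hypothetical failure times outside a (50, 100) gap, T = 300
lam_hat, beta_hat = gap_mle([10, 25, 48, 110, 140, 190, 230, 280],
                            T=300.0, s1=50.0, s2=100.0)
```

Given a real data set, the same scheme recovers the gap-analysis parameter estimates that RGA reports; here it only illustrates the numerical solution of the equations.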

Example

Consider a system under
development that was subjected to a reliability growth test for
T = 300 hours. The next table shows the successive
N = 35 failure times that were reported during the 300
hours of test.

The above data set was entered
into an RGA
data sheet configured for the Failure Times data type.

The analyst used RGA
to estimate the following Crow-AMSAA parameters (obtained
without applying gap analysis concepts) and
demonstrated MTBF:

β
= 0.8001

λ
= 0.3648

MTBF = 10.7129

The next
figure shows a plot of the cumulative number of failures versus
time.

The above figure does not show
a good fit of the model to the data set. RGA also
indicated that the
Cramér-von Mises goodness-of-fit test failed. Therefore,
there were concerns that the data set did not follow the
Crow-AMSAA reliability growth model well.

The data set was then broken
into 50-hour segments; the following table is a breakdown of the
number of reported failures by segment.

Time Period    Number of Reported Failures
0 - 50                      5
50 - 100                   18
100 - 150                   5
150 - 200                   4
200 - 250                   0
250 - 300                   3
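The segment counts above amount to bucketing the recorded failure times into fixed-width intervals. A minimal sketch of that bookkeeping (the failure times below are hypothetical, not the article's data set):

```python
from collections import Counter

def failures_per_segment(times, width=50):
    """Count reported failures per consecutive fixed-width time segment.

    Segment k covers [k*width, (k+1)*width); a failure exactly on a
    boundary is counted in the later segment.
    """
    return Counter(int(t // width) for t in times)

# Hypothetical failure times, for illustration only
counts = failures_per_segment([12, 47, 63, 88, 130, 210, 275])
```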

The number of reported failures during the second 50-hour
segment is quite high in comparison to the number of failures
reported in the other segments. A quick investigation revealed
that a number of new data collectors were assigned to the
project during that period. It was also discovered that
considerable design changes were made during this period
involving the removal of a large number of parts. It is possible
that these removals, which were not failures, were incorrectly
reported as failed parts. Based on knowledge of the system and
test program, it was clear that a quantity of actual system
failures this large was extremely unlikely. The consensus was
that this anomaly was due to reporting failures inconsistently
with the failure definition used throughout the program. It was
decided that the actual number of failures over this interval would
be assumed, for this analysis, to be unknown but consistent with
the remaining data and the Crow-AMSAA reliability growth model.

Considering the 50-hour to 100-hour interval as a problem interval
and treating it as a gap
interval, the analysis was repeated. In RGA, the gap is
set by entering the beginning time and ending time in the Gap
Interval frame on the Analysis page of the control panel, as shown
next.

The new
Crow-AMSAA parameters and
demonstrated MTBF are:

β
= 0.8774

λ
= 0.1381

MTBF = 16.6184
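As a check, the demonstrated MTBF follows directly from the parameters: for the Crow-AMSAA model, the demonstrated (instantaneous) MTBF at test time T is 1/(λβT^(β-1)). A short sketch using the rounded parameters reported above; small differences from RGA's values are due to that rounding.

```python
import math

def demonstrated_mtbf(lam, beta, T):
    """Instantaneous MTBF at time T for a Crow-AMSAA (NHPP) model
    with Weibull intensity lam * beta * t**(beta - 1)."""
    return 1.0 / (lam * beta * T ** (beta - 1))

T = 300.0
mtbf_raw = demonstrated_mtbf(0.3648, 0.8001, T)  # without gap analysis, ~10.71
mtbf_gap = demonstrated_mtbf(0.1381, 0.8774, T)  # with gap analysis, ~16.61
```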

The next
figure shows a plot of the cumulative number of failures versus
time. This plot indicates a good fit of the model.

Comment:

Note that the mere fact that the
model does not fit the data well is not justification to
eliminate some of the data from the analysis. Engineering
explanations need to be made to justify the use of gap analysis.
In the above example, such investigations made the elimination
of some of the data justifiable.