Reliability Growth Analysis with Missing or Erroneous Data
Most of the reliability growth models used for estimating and tracking reliability growth based on test data assume that the data set represents all actual system failure times (complete data), consistent with a uniform definition of failure. In practice, unfortunately, things do not always work out this way. Training issues, oversight, biases, misreporting, human error, technical difficulties, loss of data and similar problems can render a portion of the data erroneous or missing altogether. Without "corrections" to the way the data set is handled and the way the models and their parameters are derived, a "standard" analysis may produce distorted estimates of the growth rate and of the actual system reliability. This article discusses a practical reliability growth estimation and analysis procedure for data that contains anomalies over an interval of the test period. ReliaSoft's RGA 6 software is used to perform the analysis.
Procedure

To apply the Crow-AMSAA model to reliability growth data that is missing or abnormal over a certain interval, we assume that the problematic interval occurs independently of the underlying reliability growth process. The data from the problematic interval is not used in the analysis, but the interval's contribution to the total test time is retained, and the failures in the interval are assumed to be consistent with the rest of the failure data. This is often referred to as "gap analysis."
Consider the case where a system is tested for time T and the actual failure times are recorded. T may itself be an observed failure time, and the end points of the gap interval may or may not correspond to recorded failure times. The underlying assumption is that the data used in the maximum likelihood (ML) estimation follows the Crow-AMSAA model with the Weibull intensity function $\lambda\beta t^{\beta-1}$. The actual number of failures over the gap interval is assumed to be unknown, so no information about these failures is used in any way to estimate λ and β.
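Let $S_1$ and $S_2$ ($S_1 < S_2$) denote the end points of the gap interval. Let $0 < X_1 < X_2 < \dots < X_{N_1} \le S_1$ be the failure times over $(0, S_1)$, and let $S_2 < Y_1 < Y_2 < \dots < Y_{N_2} \le T$ be the failure times over $(S_2, T)$. The ML estimates of λ and β are obtained from the following equations:

$$\hat{\beta} = \frac{N_1 + N_2}{\hat{\lambda}\left(S_1^{\hat{\beta}}\ln S_1 - S_2^{\hat{\beta}}\ln S_2 + T^{\hat{\beta}}\ln T\right) - \sum_{i=1}^{N_1}\ln X_i - \sum_{i=1}^{N_2}\ln Y_i}$$

$$\hat{\lambda} = \frac{N_1 + N_2}{S_1^{\hat{\beta}} - S_2^{\hat{\beta}} + T^{\hat{\beta}}}$$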
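For reference, the ML equations can be traced to the log-likelihood of the data observed outside the gap. The sketch below assumes the usual nonhomogeneous Poisson process (NHPP) form of the Crow-AMSAA model, writes $X_i$ for the $N_1$ failure times before the gap and $Y_j$ for the $N_2$ failure times after it, and simply leaves the gap $(S_1, S_2)$ out of both the product of intensities and the expected number of failures:

```latex
\ln L(\lambda, \beta) = (N_1 + N_2)\ln(\lambda\beta)
  + (\beta - 1)\left(\sum_{i=1}^{N_1} \ln X_i + \sum_{j=1}^{N_2} \ln Y_j\right)
  - \lambda\left(S_1^{\beta} - S_2^{\beta} + T^{\beta}\right)
```

The last term is $\int_0^{S_1}\lambda\beta t^{\beta-1}\,dt + \int_{S_2}^{T}\lambda\beta t^{\beta-1}\,dt$. Setting $\partial \ln L/\partial\lambda = 0$ and $\partial \ln L/\partial\beta = 0$ yields the two estimating equations for $\hat\lambda$ and $\hat\beta$.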
In general, these equations cannot
be solved explicitly. They are solved using numerical methods.
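One way to solve them numerically is to substitute the closed-form expression for λ̂ into the β equation, which leaves a single equation in β that any one-dimensional root finder can handle. The Python sketch below does this with plain bisection. The function name and signature are hypothetical, it assumes 0 < S1 ≤ S2 < T, and it drops any failure time in (S1, S2]; it illustrates the method and is not RGA 6's implementation.

```python
import math


def crow_amsaa_gap_mle(failure_times, s1, s2, total_time, tol=1e-12):
    """Hypothetical helper: ML estimates (lam, beta) for Crow-AMSAA with a gap (s1, s2].

    Only failure times in (0, s1] and (s2, total_time] enter the likelihood;
    times inside the gap are discarded. Assumes 0 < s1 <= s2 < total_time.
    """
    if not 0 < s1 <= s2 < total_time:
        raise ValueError("need 0 < s1 <= s2 < total_time")
    kept = [t for t in failure_times if 0 < t <= s1 or s2 < t <= total_time]
    n = len(kept)
    if n == 0 or all(t == total_time for t in kept):
        raise ValueError("need at least one retained failure before total_time")
    sum_log = sum(math.log(t) for t in kept)
    ln_s1, ln_s2, ln_t = math.log(s1), math.log(s2), math.log(total_time)

    def profile_score(beta):
        # d(lnL)/d(beta) after substituting lam(beta) = n / (S1^b - S2^b + T^b).
        # Powers are divided by T^b so large trial values of beta cannot overflow.
        r1 = (s1 / total_time) ** beta
        r2 = (s2 / total_time) ** beta
        weighted = (r1 * ln_s1 - r2 * ln_s2 + ln_t) / (r1 - r2 + 1.0)
        return n / beta + sum_log - n * weighted

    lo, hi = 1e-6, 1.0            # score is positive near beta = 0
    while profile_score(hi) > 0:  # grow the bracket until the sign changes
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol * hi:     # plain bisection on the profile score
        mid = 0.5 * (lo + hi)
        if profile_score(mid) > 0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    r1, r2 = (s1 / total_time) ** beta, (s2 / total_time) ** beta
    lam = n * math.exp(-beta * ln_t) / (r1 - r2 + 1.0)  # = n / (S1^b - S2^b + T^b)
    return lam, beta
```

Bisection is chosen over Newton's method for robustness: the profile score is positive for small β and, as long as at least one retained failure occurs before T, becomes negative for large β, so a bracket always exists. With S1 = S2 the function reduces to the standard time-terminated Crow-AMSAA estimate. Given the article's failure times, a call with s1=50, s2=100 and total_time=300 would be the analogue of the gap analysis in the example below; it has not been checked against RGA 6 output.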
Example
Consider a system under development that was subjected to a reliability growth test for T = 300 hours. The next table shows the N = 35 successive failure times that were reported during the 300 hours of testing.
The above data set was entered
into an RGA 6
data sheet created by selecting the Developmental Testing > Time-to-Failure
Data > Failure Times option in the software's Data Type Expert
window.
Without applying gap analysis, the analyst used RGA 6 to estimate the following Crow-AMSAA parameters and demonstrated MTBF:

β = 0.8001
λ = 0.3648
MTBF = 10.7129 hours
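The demonstrated MTBF above is consistent with the Crow-AMSAA instantaneous MTBF at the end of test, $1/(\lambda\beta T^{\beta-1})$. The short Python check below assumes that definition (our reading of RGA 6's "demonstrated MTBF", not taken from the software documentation) and plugs in the rounded estimates. It also evaluates the expected cumulative number of failures $\lambda T^{\beta}$, which for an ML fit to time-terminated data should match the observed N = 35. The helper names are illustrative.

```python
def instantaneous_mtbf(lam, beta, t):
    """Crow-AMSAA instantaneous MTBF at time t: 1 / (lam * beta * t**(beta - 1))."""
    return 1.0 / (lam * beta * t ** (beta - 1.0))


def expected_failures(lam, beta, t):
    """Crow-AMSAA expected cumulative number of failures by time t: lam * t**beta."""
    return lam * t ** beta


# Rounded estimates from the standard (no-gap) analysis, T = 300 hours.
lam, beta, T = 0.3648, 0.8001, 300.0
mtbf = instantaneous_mtbf(lam, beta, T)
n_expected = expected_failures(lam, beta, T)
print(f"MTBF ~ {mtbf:.2f} h, expected failures ~ {n_expected:.1f}")
```

Because the inputs are rounded to four decimal places, the computed MTBF differs from RGA 6's 10.7129 only in the third decimal place.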
The next
figure shows a plot of the cumulative number of failures versus
time.
The above figure shows that the model does not fit the data set well. RGA 6 also indicated that the data failed the Cramér-von Mises goodness-of-fit test. There were therefore concerns that the data set does not follow the Crow-AMSAA reliability growth model well.
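The Cramér-von Mises test compares the fitted Crow-AMSAA model with the observed failure times. As a sketch, assuming the time-terminated formulation found in MIL-HDBK-189 and in ReliaSoft's reliability growth documentation as we understand it (unbiased shape estimate $\bar\beta = \frac{N-1}{N}\hat\beta$, $M = N$), the statistic is $C_M^2 = \frac{1}{12M} + \sum_{i=1}^{M}\left[(X_i/T)^{\bar\beta} - \frac{2i-1}{2M}\right]^2$, and the model is rejected when it exceeds a tabulated critical value. The function name is hypothetical, the critical values are not reproduced here, the failure times in the usage lines are made up, and RGA 6 may differ in details.

```python
import math


def cvm_statistic_time_terminated(failure_times, total_time, beta_hat):
    """Cramér-von Mises statistic C_M^2 for a time-terminated Crow-AMSAA fit.

    Assumed formulation: beta_bar = (N - 1) / N * beta_hat, M = N, and
    C_M^2 = 1/(12M) + sum_i [(X_i / T)**beta_bar - (2i - 1)/(2M)]**2.
    Larger values indicate a worse fit; compare with a tabulated critical value.
    """
    times = sorted(failure_times)
    m = len(times)
    if m < 2:
        raise ValueError("need at least two failure times")
    beta_bar = (m - 1) / m * beta_hat
    stat = 1.0 / (12.0 * m)
    for i, x in enumerate(times, start=1):
        stat += ((x / total_time) ** beta_bar - (2 * i - 1) / (2.0 * m)) ** 2
    return stat


# Hypothetical usage with made-up failure times (not the article's data set).
times = [12.0, 30.0, 55.0, 91.0, 140.0, 205.0, 280.0]
T = 300.0
beta_hat = len(times) / sum(math.log(T / t) for t in times)  # time-terminated MLE
print(cvm_statistic_time_terminated(times, T, beta_hat))
```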
The data set was then broken into 50-hour segments; the following table shows the number of reported failures in each segment.
| Time Period (hours) | Number of Reported Failures |
|---|---|
| 0-50 | 5 |
| 50-100 | 18 |
| 100-150 | 5 |
| 150-200 | 4 |
| 200-250 | 0 |
| 250-300 | 3 |
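A quick way to see how unusual the second segment is, sketched here in Python, is to compare each segment's reported count with the number of failures expected under the initial (no-gap) Crow-AMSAA fit, $\lambda(t_2^{\beta} - t_1^{\beta})$ for a segment $(t_1, t_2]$. The code uses the rounded no-gap estimates (λ = 0.3648, β = 0.8001), so the expected counts are approximate; the helper name is illustrative.

```python
def expected_in_segment(lam, beta, t_start, t_end):
    """Crow-AMSAA expected failures in (t_start, t_end]: lam*(t_end**beta - t_start**beta)."""
    return lam * (t_end ** beta - t_start ** beta)


# Reported failures per 50-hour segment, taken from the segment table.
observed = {(0, 50): 5, (50, 100): 18, (100, 150): 5,
            (150, 200): 4, (200, 250): 0, (250, 300): 3}
lam, beta = 0.3648, 0.8001  # rounded no-gap estimates

rows = []
for (a, b), obs in observed.items():
    expected = expected_in_segment(lam, beta, a, b)
    rows.append(((a, b), obs, expected))
    print(f"{a:>3}-{b:<3} h  observed {obs:>2}  expected {expected:5.2f}")

# Segment whose reported count exceeds its expected count by the most.
worst = max(rows, key=lambda r: r[1] - r[2])
print("largest excess of reported over expected failures:", worst[0])
```

With these estimates the 50-100 hour segment stands out (18 reported against roughly 6 expected), while every other segment is at or below its expected count. This is only a descriptive check, since the no-gap fit is itself distorted by the anomaly.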
The number of reported failures during the second 50-hour segment is quite high compared with the other segments. A quick investigation revealed that a number of new data collectors were assigned to the project during that period. It was also discovered that considerable design changes, involving the removal of a large number of parts, were made during this period. It is possible that these removals, which were not failures, were incorrectly reported as failed parts. Based on knowledge of the system and test program, it was clear that such a large number of actual system failures was extremely unlikely. The consensus was that this anomaly was due to failures being reported inconsistently with the failure definition used throughout the program. It was decided that, for this analysis, the actual number of failures over this period would be assumed to be unknown but consistent with the remaining data and the Crow-AMSAA reliability growth model.
Treating the 50-hour to 100-hour interval as the problem (gap) interval, the analysis was repeated. In RGA 6, the gap is set by entering its beginning and ending times in the Gap Interval frame on the Set Analysis tab, as shown next.
The new Crow-AMSAA parameters and demonstrated MTBF are:

β = 0.8774
λ = 0.1381
MTBF = 16.6184 hours
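As a rough consistency check on these numbers, sketched in Python: the ML estimate of λ under gap analysis satisfies $\hat\lambda(S_1^{\hat\beta} - S_2^{\hat\beta} + T^{\hat\beta}) = N_1 + N_2$, so the expected number of failures outside the gap should equal the number of failures actually used. Assuming the segment boundaries coincide with the gap end points, the segment table gives $N_1 + N_2 = 35 - 18 = 17$. The code also evaluates the instantaneous MTBF $1/(\lambda\beta T^{\beta-1})$, assuming that is what the reported demonstrated MTBF represents.

```python
# Rounded gap-analysis estimates and the gap used in the example (hours).
lam, beta = 0.1381, 0.8774
s1, s2, T = 50.0, 100.0, 300.0

# Expected failures outside the gap, i.e., over (0, S1] and (S2, T].
expected_outside_gap = lam * (s1 ** beta - s2 ** beta + T ** beta)
failures_used = 35 - 18  # all reported failures minus those in the 50-100 h segment

# Instantaneous MTBF at the end of test (assumed meaning of "demonstrated MTBF").
mtbf = 1.0 / (lam * beta * T ** (beta - 1.0))

print(f"expected outside gap ~ {expected_outside_gap:.2f} (used: {failures_used})")
print(f"MTBF ~ {mtbf:.2f} h")
```

Both quantities agree with the reported results to within the rounding of λ and β.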
The next figure
shows a plot of the cumulative number of failures versus time. This
plot indicates a good fit of the model.
Comment:
Note that the mere fact that the model does not fit the data well is not, by itself, justification for eliminating some of the data from the analysis. An engineering explanation is needed to justify the use of gap analysis. In the above example, such an investigation made the elimination of some of the data justifiable.