Reliability Growth Test Planning and Management
Larry H. Crow, Ph.D.
An effective reliability growth test planning and management strategy can contribute greatly to successful product design and development through its impact on the ability of the design/development team to meet desired reliability goals on time and within the project budget. An effective reliability growth management program both produces and utilizes important information about the reliability of the product design, such as the demonstrated MTBF through testing, the growth in MTBF that has been achieved through implementation of corrective actions, the maximum potential MTBF that can likely be achieved for the product design and estimates regarding latent failure modes that have not yet been uncovered through testing.
This article presents a brief conceptual overview of a reliability growth test planning/management strategy and data analysis methodology that provide information that can be instrumental to various management decisions for product design/development. Dr. Larry H. Crow, a leading practitioner in the field of Reliability Growth Analysis for over 30 years, developed the approach described in this article and has cooperated with design/development teams in both the military and the private sector to implement, validate and refine the relevant techniques. This article has been written with cooperation from Dr. Crow based on his lectures on the subject and published standards for reliability growth analysis.
Also, as noted in previous articles by Dr. Crow, a comprehensive reliability growth program actually begins in early design, where potential problem failure modes are identified and mitigated before formal testing. This mitigation of potential failure modes in design is highly productive when managed with Failure Mode and Effects Analysis (FMEA), System Reliability Block Diagram (RBD) Analysis and/or Fault Tree Analysis (FTA). The objective of these analyses is to increase the reliability before testing.
Background and Assumptions
Based on the results of each reliability growth testing phase and the subsequent analysis, the project manager may wish to make changes to the design/development approach. Specifically, he/she may choose to revise the program schedule, change the number of products tested and/or the duration of the test and/or increase, decrease or reallocate the program budget and resources. In addition, the design/development team may reevaluate the criteria used to determine which failure modes will receive corrective actions and institute any necessary changes. That is, it may be appropriate to change the management strategy.
Reliability Growth Testing: Test a sample of units according to the established test plan and record failure information for the units under test. In practice, the units may start the test at different times, but it is generally assumed that all test units have the same design configuration at any point in the testing. The methods also apply to discrete (one-shot) success/failure events.
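Categorize Observed Failures: Categorize each observed failure according to whether corrective action will be performed to address the problem that caused the failure. In a "Test-Find-Test" scenario, one of two categories can be assigned to each failure mode: Category A or Category B. Category A covers failure modes that will not receive a corrective action. Category B covers failure modes that will receive a corrective action, implemented as a delayed fix at the end of the test phase.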
Characterize Category B Failure Modes: Identify and characterize the failure mode for each Category B failure. The failure mode description typically provides information about the specific physical cause of the problem. For example, "leaking actuator, worn seal" and "leaking actuator, flange radius crack from fatigue" are two unique failure modes. In this case, the phrase "leaking actuator" is not sufficiently descriptive of the failure mode because there is more than one physical cause that can result in the failure of the item via a leaking actuator.
For bookkeeping purposes, it can be helpful to assign an alphanumeric code to each Category B failure mode according to the sequence in which unique modes are identified. For example, the first Category B failure mode can be identified as B1, the second as B2, and so on. If another failure occurs due to a failure mode that has already been identified, it is given the same code as the first instance of that failure mode.
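The bookkeeping scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name `assign_mode_codes`, the tuple layout of the failure log and the Category A description are assumptions made for the example, not part of the published method.

```python
def assign_mode_codes(failure_log):
    """Label each logged failure with a bookkeeping code.

    failure_log: list of (category, mode_description) tuples in test order
                 (hypothetical layout chosen for this sketch).
    Category B modes receive B1, B2, ... in order of first occurrence;
    repeat occurrences reuse the existing code. Category A failures are
    labeled "A".
    """
    codes = {}
    labels = []
    for category, description in failure_log:
        if category != "B":
            labels.append("A")
            continue
        if description not in codes:
            # First time this physical cause is seen: issue the next code.
            codes[description] = f"B{len(codes) + 1}"
        labels.append(codes[description])
    return labels


log = [
    ("B", "leaking actuator, worn seal"),
    ("A", "connector damaged during handling"),  # hypothetical Category A entry
    ("B", "leaking actuator, flange radius crack from fatigue"),
    ("B", "leaking actuator, worn seal"),
]
print(assign_mode_codes(log))  # ['B1', 'A', 'B2', 'B1']
```

Keying the codes on the full failure mode description, rather than on the symptom alone, keeps the two "leaking actuator" modes distinct.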
Effectiveness of Corrective Actions: For each unique Category B failure mode, examine the likely effectiveness of the corrective action. The effectiveness factor is a number between 0 and 1 that represents the fractional decrease in the failure mode's failure rate due to the corrective action. For example, if the corrective action is expected to reduce the failure rate due to a given mode by 75%, then the effectiveness factor for the corrective action is 0.75. If this mode is expected to be responsible for 8 failures before the fix has been implemented, then after the corrective action has been performed, we would expect to observe 2 failures due to the given mode over the same amount of test time. Numerically, this would be 8 × (1 − 0.75) = 2.
Effectiveness factors are assigned based on engineering judgment, so the quality of any prediction that depends on them is limited by the quality of this assessment. Based on past experience with reliability growth testing, the average effectiveness factor across all modes is likely to be in the range of 0.65 to 0.75. An individual effectiveness factor may be smaller or larger than this average.
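The effectiveness factor arithmetic can be checked with a short Python sketch. The function name and the list of individual factors are hypothetical and used only for illustration.

```python
def expected_failures_after_fix(failures_before, effectiveness_factor):
    """Expected failures from one mode after its corrective action.

    The effectiveness factor is the fractional reduction in the mode's
    failure rate, so the fraction (1 - factor) of failures remains.
    """
    if not 0.0 <= effectiveness_factor <= 1.0:
        raise ValueError("effectiveness factor must be between 0 and 1")
    return failures_before * (1.0 - effectiveness_factor)


# The example from the text: 8 expected failures, 75% effective fix.
print(expected_failures_after_fix(8, 0.75))  # 2.0

# Hypothetical factors assigned to four modes, and their average.
factors = [0.60, 0.75, 0.70, 0.80]
average_factor = sum(factors) / len(factors)
print(0.65 <= average_factor <= 0.75)
```

Comparing the average of the assigned factors with the 0.65 to 0.75 range is a simple sanity check on the engineering judgments, not a rule that every set of factors must satisfy.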
Statistical Model: The Crow (AMSAA) projection model uses a
non-homogeneous Poisson process (NHPP) statistical model to analyze
reliability growth data and incorporate the failure classifications and
effectiveness factors. This model can be used to obtain a variety of plots
and results, including the reliability that has been demonstrated during the
test and the expected reliability of the design after the delayed fixes for
Category B failure modes have been implemented. These results are presented
graphically in Figure 1, which shows the demonstrated MTBF of the current
design as a straight line at 9.55 and the projection for the new design
(which incorporates the delayed fixes) as a point at 15.13 MTBF. The
projection of 15.13 estimates the impact of the proposed delayed corrective
actions and effectiveness factors on the system reliability.
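The article does not state the estimation formulas, so the following Python sketch relies on the form of the projection commonly given for this model in the reliability growth literature: the projected failure intensity is the Category A intensity, plus the residual Category B intensity after the fixes, plus the average effectiveness factor times the rate at which new Category B modes are still being discovered. That form, the function name and all input values are assumptions for illustration, and the data are not the data behind Figure 1.

```python
import math


def crow_projection(total_time, n_a, b_modes):
    """Return (demonstrated_mtbf, projected_mtbf) for a test-find-test phase.

    total_time: cumulative test time T
    n_a: number of Category A failures observed
    b_modes: one (first_occurrence_time, failure_count, effectiveness_factor)
             tuple per distinct Category B mode (hypothetical layout)
    """
    m = len(b_modes)
    if m == 0:
        raise ValueError("at least one Category B mode is required")
    n_b = sum(count for _, count, _ in b_modes)

    # Demonstrated intensity: all failures over the total test time,
    # since the design configuration is constant during the phase.
    lambda_demo = (n_a + n_b) / total_time

    # NHPP shape parameter fitted to the first-occurrence times of the
    # distinct B modes (maximum likelihood form; an unbiased variant
    # uses m - 1 in the numerator).
    beta = m / sum(math.log(total_time / first) for first, _, _ in b_modes)

    # Rate at which new B modes are still being discovered at time T.
    discovery_rate = m * beta / total_time

    mean_ef = sum(ef for _, _, ef in b_modes) / m
    residual_b = sum((1.0 - ef) * count for _, count, ef in b_modes) / total_time

    lambda_proj = n_a / total_time + residual_b + mean_ef * discovery_rate
    return 1.0 / lambda_demo, 1.0 / lambda_proj


# Hypothetical test phase: 400 hours, 10 Category A failures, 4 B modes.
modes = [(15.0, 5, 0.7), (60.0, 3, 0.8), (150.0, 2, 0.6), (300.0, 1, 0.7)]
demonstrated, projected = crow_projection(400.0, 10, modes)
print(f"demonstrated MTBF: {demonstrated:.2f}")
print(f"projected MTBF:    {projected:.2f}")
```

The sketch returns both values together because the comparison between them is what Figure 1 conveys: the gap is the estimated payoff of the delayed fixes.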
Evaluate and Adjust Management Strategy: In addition to the demonstrated and projected MTBF results, the Crow (AMSAA) projection model supports the generation of other results and plots that can be invaluable for evaluating the current design/development management strategy and making any necessary adjustments. The growth potential metric and the analysis of unseen failure modes are important metrics for this purpose.
Growth Potential: The growth potential is an estimate of the maximum system MTBF that can be attained
with the product design and reliability growth management strategy. This can
be displayed with a straight line on the MTBF vs. Test Time plot, as shown
in Figure 2 where the growth potential is identified at 22.45 MTBF. This
metric can help to confirm the manager's expectation that the ultimate
reliability goal for the design is feasible, but it can also provide a clear
warning if the reliability goal cannot be achieved for the current design
under the given conditions. Management can then respond to this warning by
making changes to the management strategy, such as converting some Category
A failure modes to Category B failure modes and/or changing the criteria for
the classification of new modes that are uncovered or adding redundancy.
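As a rough sketch of how this metric responds to a change in management strategy, the Python below assumes the usual form of the growth potential estimate: only Category A failures and the residual portion of Category B failures remain once every Category B mode has been found and fixed. The formula, function name and numbers are assumptions for illustration and are unrelated to Figure 2.

```python
def growth_potential_mtbf(total_time, n_a, b_modes):
    """Estimate the growth potential MTBF.

    total_time: cumulative test time T
    n_a: number of Category A failures observed
    b_modes: one (failure_count, effectiveness_factor) tuple per distinct
             Category B mode (hypothetical layout)
    """
    # Failures that would still be expected over time T after all fixes.
    residual = n_a + sum((1.0 - ef) * count for count, ef in b_modes)
    if residual <= 0:
        raise ValueError("no residual failures; MTBF estimate is unbounded")
    return total_time / residual


# Hypothetical baseline strategy: 10 Category A failures, 4 B modes.
b_modes = [(5, 0.7), (3, 0.8), (2, 0.6), (1, 0.7)]
baseline = growth_potential_mtbf(400.0, 10, b_modes)

# Revised strategy: a mode responsible for 4 of the Category A failures
# is reclassified as Category B with an assumed effectiveness of 0.7.
revised = growth_potential_mtbf(400.0, 6, b_modes + [(4, 0.7)])

print(f"baseline growth potential: {baseline:.2f}")
print(f"revised growth potential:  {revised:.2f}")
```

If the reliability goal lies above the baseline value, it cannot be reached under the current strategy no matter how much more testing is done, which is why reclassifying Category A modes is one of the available responses.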
Analysis of the
unseen failure modes provides another important set of metrics for
evaluating the product design and the reliability growth management
strategy. Based on the failure modes that have been uncovered during the
test, the Crow (AMSAA) projection model can be used to provide estimates
about the failure modes that have not yet occurred. Such metrics include the
current rate of uncovering new Category B failure modes, the estimated
number of unseen Category B failure modes and the estimated failure rate for
unseen Category B failure modes. This analysis can provide an indicator of
how many problems are yet to be discovered in the design and how much test
time will be required to identify and correct those latent causes of
failure. The pie chart in Figure 3 represents one method to display this
information graphically. The pie chart illustrates the quantity and ratio of
seen and unseen failure modes after the completion of a particular phase of testing.
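Two of these metrics can be sketched in Python under the same assumed model form used for the projection: the rate of uncovering new Category B modes at the end of the test is estimated from the first-occurrence times of the distinct modes, and that same rate is taken as the estimated failure intensity of the modes not yet seen. The function names and data are hypothetical, and the sketch does not reproduce Figure 3.

```python
import math


def b_mode_discovery_rate(total_time, first_times):
    """Estimated rate of uncovering new Category B modes at time T.

    first_times: first-occurrence time of each distinct B mode seen so far.
    """
    m = len(first_times)
    if m == 0:
        raise ValueError("at least one Category B mode is required")
    # Maximum likelihood shape parameter of the fitted NHPP.
    beta = m / sum(math.log(total_time / t) for t in first_times)
    return m * beta / total_time


def unseen_share_of_b_intensity(total_time, first_times, n_b):
    """Fraction of the Category B failure intensity attributed to unseen modes.

    n_b: total number of Category B failures observed in the test.
    """
    unseen_intensity = b_mode_discovery_rate(total_time, first_times)
    total_b_intensity = n_b / total_time
    return unseen_intensity / total_b_intensity


# Hypothetical phase: 400 hours, 4 distinct B modes, 11 B failures in total.
first_times = [15.0, 60.0, 150.0, 300.0]
rate = b_mode_discovery_rate(400.0, first_times)
share = unseen_share_of_b_intensity(400.0, first_times, 11)
print(f"new B modes per hour at end of test: {rate:.4f}")
print(f"unseen share of B failure intensity: {share:.1%}")
```

A discovery rate that is still high at the end of a phase suggests that many problems remain hidden and that further test time is likely to be productive.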
Incorporating Category C Failures
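When fixes for some failure modes (Category C) are implemented while the test is in progress, the reliability increases gradually during the test. If the test also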
includes Category B failure modes, then this gradual increase will also be
accompanied by a jump in reliability when the Category B corrective actions
are implemented at the end of the test phase. The Generalized Crow
Projection model accommodates Category A, B and C failure modes and Figure
4 displays the MTBF vs. Time plot for such analyses. This plot is similar to
the ones shown in Figures 1 and 2, except that it includes a gradual
increase in the reliability observed during the test, due to the
implementation of fixes for some failure modes while the test was in progress.
References
United States Department of Defense. MIL-HDBK-189: Reliability Growth Management, February 13, 1981.
International Electrotechnical Commission. IEC 61164: Reliability Growth - Statistical Test and Estimation Methods, June 1995.
NOTE: Two IEC publications on reliability growth, IEC 61164 and IEC 61014, are currently undergoing revision. For more information, search for works in progress at http://www.iec.ch.
Copyright 2006 ReliaSoft Corporation, ALL RIGHTS RESERVED