Reliability HotWire

Issue 66, July 2006

Reliability Basics

Reliability Importance Measures of Components in a Complex System - Identifying the 20% in the 80/20 Rule

When analyzing a system's reliability and availability, measuring the importance of components is often of significant value in prioritizing improvement efforts, performing trade-off analysis in system design or suggesting the most efficient way to operate and maintain a system. Focusing on the most problematic areas in the system results in the most significant gains. This article presents different ways of assessing the importance of non-repairable and repairable components within a system using BlockSim.

Introduction
In 1906, an Italian economist named Vilfredo Pareto noticed that 20% of the people owned 80% of the wealth. This is often referred to as the 80/20 rule, popularized by the quality management pioneer Dr. Joseph Juran, who, during his work in the 1930s and 40s, recognized a universal principle he called the "vital few and trivial many." Juran described the Pareto concept as distinguishing the vital few issues from the trivial many issues, stating that 20% of the defects cause 80% of the problems. Recently, Microsoft discovered that 80% of the errors and crashes in Windows and Office are caused by 20% of the entire pool of bugs detected [1]. Even if your system does not follow the 80/20 rule exactly, it is useful to prioritize the issues in your system before deciding on a plan of attack.

With modern technology and higher reliability requirements, systems are getting more complicated. Therefore, identifying the most problematic components can become difficult. Many systems are repairable systems composed of many components that fail and get repaired based on different distributions. With limitations and constraints (such as spare parts availability, repair crew response time, logistic delays etc.), exact analytical solutions become intractable. In these cases, simulation becomes the tool of choice in modeling repairable systems and identifying weak components and areas where maintainability limitations hinder the availability of the system.

Note: In this article, the cost of improving the reliability of the component is not considered. Cost of improvement is covered in the Reliability Allocation section of the System Analysis Reference.

1. Importance Measures for Non-Repairable Components
In simple systems such as a series system, it is easy to identify the weak components. However, in more complex systems, this becomes quite a difficult task. For complex systems, the analyst needs a mathematical approach that will provide the means of identifying and quantifying the importance of each component in the system.

Using Reliability Importance (IR) measures is one method of identifying the relative importance of each component in a system with respect to the overall reliability of the system. The reliability importance, IRi, of component i in a system of n components is given by Leemis [2]:

$$I_{R_i}(t) = \frac{\partial R_s(t)}{\partial R_i(t)} \tag{1}$$

Where:

  • Rs(t) is the system reliability at a certain time, t

  • Ri(t) is the component reliability at a certain time, t

This metric measures the rate of change (at time t) of the system reliability with respect to a change in the component's reliability. It also measures the probability of a component being responsible for system failure at time t. The value of the reliability importance given by Eqn. (1) depends on both the reliability of a component and its position in the system.
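As a minimal sketch of this measure, consider a hypothetical three-block system (not the Figure 1 system): block X in series with a parallel pair Y and Z. The block names and reliability values are assumptions chosen for illustration. With independent components, the system reliability is linear in each component reliability, so the rate of change with respect to one component equals the system reliability with that component perfect minus the system reliability with that component failed.

```python
def system_reliability(r):
    """Reliability of a hypothetical system: X in series with (Y parallel Z)."""
    parallel_yz = 1.0 - (1.0 - r["Y"]) * (1.0 - r["Z"])
    return r["X"] * parallel_yz


def reliability_importance(r, name):
    """Rate of change of system reliability with respect to one component.

    For independent components Rs is linear in each Ri, so the
    derivative equals Rs(Ri = 1) - Rs(Ri = 0).
    """
    works = dict(r, **{name: 1.0})
    fails = dict(r, **{name: 0.0})
    return system_reliability(works) - system_reliability(fails)


# Assumed component reliabilities at some fixed time t (illustrative values).
r_at_t = {"X": 0.90, "Y": 0.80, "Z": 0.70}

importance = {name: reliability_importance(r_at_t, name) for name in r_at_t}
ranking = sorted(importance, key=importance.get, reverse=True)

for name in ranking:
    print(f"{name}: {importance[name]:.2f}")
```

In this sketch the series block X ranks first even though it is the most reliable block, which illustrates that the measure depends on position in the system as well as on component reliability.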

As an example, let us consider the system described in Figure 1.


Figure 1 Example of Systems Reliability Block Diagram

The failure distributions for the components in the diagram are:

Block Name    Failure Distribution (hr)
A             Weibull (β=1.5, η=200)
B             Weibull (β=4, η=1000)
C             Exponential (λ=0.0008)
D             Weibull (β=2, η=150)
E             Weibull (β=2, η=400)
F             Weibull (β=1.7, η=400)
G             Weibull (β=1.5, η=100)
H             Weibull (β=1.4, η=800)
I             Weibull (β=1.5, η=1000)
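The component reliabilities that feed the importance calculation can be evaluated directly from the failure distributions listed above. The sketch below assumes the two-parameter Weibull form R(t) = exp(-(t/η)^β) and the exponential form R(t) = exp(-λt); the function and variable names are illustrative and are not part of BlockSim.

```python
import math


def weibull_reliability(t, beta, eta):
    """Two-parameter Weibull reliability: exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))


def exponential_reliability(t, failure_rate):
    """Exponential reliability: exp(-failure_rate * t)."""
    return math.exp(-failure_rate * t)


# (beta, eta) pairs taken from the failure distribution table above.
weibull_blocks = {
    "A": (1.5, 200),
    "B": (4, 1000),
    "D": (2, 150),
    "E": (2, 400),
    "F": (1.7, 400),
    "G": (1.5, 100),
    "H": (1.4, 800),
    "I": (1.5, 1000),
}

t = 50.0  # hours
reliability_at_t = {
    name: weibull_reliability(t, beta, eta)
    for name, (beta, eta) in weibull_blocks.items()
}
reliability_at_t["C"] = exponential_reliability(t, 0.0008)

for name in sorted(reliability_at_t):
    print(f"R_{name}({t:.0f} hr) = {reliability_at_t[name]:.4f}")
```

These values are component reliabilities only. Turning them into importance values still requires the system reliability equation for the Figure 1 configuration.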

The system reliability equation for this configuration can be expressed as:

Hence, according to Eqn. (1), the reliability importance of component A, for example, is:

By varying the time value, t, and obtaining the corresponding reliabilities at t for each of the components in the above equation, we can obtain the reliability importance value for different times. For instance, if t=50 hr, IRA=0.936. The same procedure can be applied for every component.

This type of reliability importance measure can be presented graphically in various ways. The following BlockSim plot shows the reliability importance of each block in Figure 1 over time.

The next plot is a snapshot of the previous plot at a specific time value (this is called "static reliability importance").

The following plot also shows static reliability importance, but is presented as a "square pie chart" that shows the breakdown of the components' reliability importance.

The three plots above show the clear dominance of two of the nine components (about 20%), A and I, which are responsible for most of the failures of the system.

2. Importance Measures for Repairable Components
Let us consider the system described in Figure 1 and now assume that it is a repairable system with the following repair distributions and preventive maintenance policies.

Table 1 Maintainability Characteristics of the Figure 1 Example System

Block Name  Repair Distribution (hr)  Preventive Maintenance Policy  Preventive Maintenance Repair Distribution (hr)
A           Normal (μ=20, σ=0.1)      Every 300 hr of Block Age      Normal (μ=6, σ=2)
B           Normal (μ=10, σ=2)        Every 300 hr of Block Age      Normal (μ=6, σ=2)
C           Normal (μ=0.5, σ=0.1)     No PM (Constant Failure Rate)  No PM (Constant Failure Rate)
D           Normal (μ=0.5, σ=0.01)    Every 300 hr of Block Age      Normal (μ=6, σ=2)
E           Normal (μ=10, σ=2)        Every 300 hr of Block Age      Normal (μ=6, σ=2)
F           Normal (μ=10, σ=2)        Every 200 hr of Block Age      Normal (μ=6, σ=2)
G           Normal (μ=1, σ=0.1)       Every 200 hr of Block Age      Normal (μ=6, σ=2)
H           Normal (μ=10, σ=2)        Every 200 hr of Block Age      Normal (μ=6, σ=2)
I           Normal (μ=20, σ=2)        Every 200 hr of Block Age      Normal (μ=6, σ=2)

When dealing with repairable systems, the system reliability (and system availability) depends not only on the components' failure characteristics, but also on other contributing factors such as time-to-repair distributions, maintenance practices, crew and spare part availability, logistic delays, etc.

Through simulation, the histories of the system and its components over time can be captured. The results of the simulation can be used to quantify two other types of reliability importance measures, ReliaSoft's Failure Criticality Index (RS FCI) and ReliaSoft's Downing Event Criticality Index (RS DECI), both available in BlockSim. A discussion of these two metrics is presented next.
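To illustrate the idea of capturing a history through simulation, the sketch below simulates a single block in isolation, using component C's exponential failure rate (λ=0.0008) and Normal (μ=0.5, σ=0.1) repair time. It is not BlockSim's algorithm and it does not model the Figure 1 system; the function name, the number of runs and the random seed are assumptions made for illustration.

```python
import random


def simulate_uptime_fraction(failure_rate, repair_mean, repair_sd, horizon, rng):
    """Simulate one history of alternating up and down periods.

    Returns the fraction of the horizon during which the block was up.
    """
    clock = 0.0
    uptime = 0.0
    while clock < horizon:
        # Time to next failure, truncated at the end of the horizon.
        up = min(rng.expovariate(failure_rate), horizon - clock)
        uptime += up
        clock += up
        if clock >= horizon:
            break
        # Corrective repair; clamp at zero because a normal draw can be negative.
        clock += max(0.0, rng.gauss(repair_mean, repair_sd))
    return uptime / horizon


rng = random.Random(2006)  # fixed seed so the sketch is repeatable
runs = [
    simulate_uptime_fraction(0.0008, 0.5, 0.1, 5000.0, rng)
    for _ in range(1000)
]
mean_availability = sum(runs) / len(runs)
print(f"Mean availability of the block alone: {mean_availability:.5f}")
```

A system-level simulation would additionally track which block caused each system downing event, which is the information the two indices below are built on.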

2.1. ReliaSoft's Failure Criticality Index (RS FCI)
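ReliaSoft's Failure Criticality Index (RS FCI) is a relative index showing the percentage of times that a failure of a component caused a system failure. RS FCI is obtained from:

$$\text{RS FCI}_i = \frac{\text{Number of system failures caused by a failure of component } i \text{ in } (0,t)}{\text{Number of system failures in } (0,t)} \times 100\%$$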

This metric considers only failure events and excludes preventive maintenance and inspection events that cause an interruption in the system's operation.

RS FCI reports the percentage of times that a system failure event was caused (triggered) by a failure of a particular component over the simulation time (0,t). Intuitively, this index has the same meaning and the same application as the Reliability Importance measure, IRi, described in Eqn. (1).
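A minimal sketch of how such a percentage could be tallied from simulation output follows. The block names and counts are hypothetical, and the sketch assumes that each system failure is attributed to exactly one block, so that the total number of system failures is the sum of the per-block tallies.

```python
def failure_criticality_index(downing_failures_by_block):
    """Percentage of system failures triggered by a failure of each block.

    Assumes each system failure is attributed to exactly one block.
    """
    total = sum(downing_failures_by_block.values())
    if total == 0:
        return {name: 0.0 for name in downing_failures_by_block}
    return {
        name: 100.0 * count / total
        for name, count in downing_failures_by_block.items()
    }


# Hypothetical tallies of system failures triggered by each block.
counts = {"X": 30, "Y": 15, "Z": 5}
fci = failure_criticality_index(counts)

for name, value in fci.items():
    print(f"{name}: {value:.1f}%")
```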

For example, if we simulate the system's operation for 5000 hr in BlockSim, we obtain the following Block Summary report.

Figure 2 Blocks Simulation Summary Report for 5000 hr of System Operation

For component A, RS FCI = 73.73%. This implies that 73.73% of the times that the system failed, a component A failure was responsible. Note that the combined RS FCI of A and I is 81.67%. In other words, A and I contributed to about 80% of the system's total downing failures.

The RS FCI results can also be seen in a graphical format.

2.2. ReliaSoft's Downing Event Criticality Index (RS DECI)
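ReliaSoft's Downing Event Criticality Index (RS DECI) is a relative index showing the percentage of times that a downing event on a particular component caused the system to go down. This is obtained from:

$$\text{RS DECI}_i = \frac{\text{Number of system downing events caused by component } i \text{ in } (0,t)}{\text{Number of system downing events in } (0,t)} \times 100\%$$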

This metric considers all downing events (i.e., failures, preventive maintenance and inspection events) that cause an interruption in the system's operation.

In Figure 2, we see that for component A, RS DECI = 51.68%. This implies that 51.68% of the times that the system went down, a downing event on component A was responsible. Note that the combined RS DECI of A and I is 80.05%. Once again we see how the vital few issues, A and I (about 20% of the components), contributed to about 80% of the system's downing events, whereas the trivial many (about 80% of the components) contributed to only about 20%.
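The "vital few" can be picked out mechanically by ranking the blocks by index value and accumulating until a target share is reached. In the sketch below, only A's value (51.68%) is taken from the text. I's value is derived as 80.05 - 51.68 = 28.37, assuming the combined figure is a simple sum. The values for the remaining seven blocks are not given in the text, so the remaining 19.95% is spread evenly purely as a placeholder. The function name and the 80% threshold are likewise illustrative.

```python
def vital_few(index_by_block, threshold=80.0):
    """Return the top-ranked blocks whose cumulative index reaches the threshold."""
    ranked = sorted(index_by_block.items(), key=lambda item: item[1], reverse=True)
    selected = []
    cumulative = 0.0
    for name, value in ranked:
        if cumulative >= threshold:
            break
        selected.append(name)
        cumulative += value
    return selected, cumulative


deci = {"A": 51.68, "I": 80.05 - 51.68}  # I is derived, assuming a simple sum
# Placeholder only: individual values for these blocks are not in the text.
for name in ["B", "C", "D", "E", "F", "G", "H"]:
    deci[name] = 19.95 / 7

selected, cumulative = vital_few(deci)
print(selected, f"{cumulative:.2f}%")
```

The same function applies unchanged to RS FCI values or to static reliability importance values normalized to percentages.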

The RS DECI results can also be seen in a graphical format.

3. FRED Report
FRED stands for Failure Reporting, Evaluation and Display. This report provides a graphical demonstration of the maintainability/availability characteristics of the components/subsystems in a system and helps to identify areas for improvement (i.e. better reliability and/or better maintainability).

For the repairable system example in Figure 1, the FRED report is shown next.



The FRED report shows the average availability, the MTBF (mean time between failures), the MTTR (mean time to repair) and the RS FCI values for each component in the system. In addition, the components are color coded (a color spectrum varying from red for worst reliability to dark green for best reliability) to show the reliability of each component in relation to the other components. For example, we can conclude from the above FRED report that component G's reliability needs improvement and that component B's maintainability needs improvement (MTTR = 678.71 hr).

4. What-If Analysis
Another way to understand the importance of any element in a system is to perform sensitivity analysis using what-if analysis. This allows for more flexible types of importance measures beyond the standard measures described above. The analyst can vary parameters, take out elements, add resources, change preventive maintenance policies, etc. and assess the effect on a reliability/availability metric of interest. Practically any element in the system can be manipulated to assess the impact on the system's reliability/availability; some common examples are listed next.

4.1. Eliminating Problems
The analyst can study the impact on reliability or availability if a problem (failure mode) is eliminated. This can be done by analyzing the system with the problem and without the problem. The next plot shows the reliability of the system in Figure 1 if A is eliminated.

In BlockSim, you can delete a block or set it so that it does not fail; this will eliminate its effect.

4.2. Changing Failure Distribution
Another way of assessing the importance of a component is by varying its failure distribution parameters. For example, you can assess the impact on the reliability of the system of improving a part or switching to a different supplier. The next figure shows the difference in B10 life of the system in Figure 1 if the original C component, C1, which follows an exponential distribution with λ1=0.0008, is replaced by a similar (but more expensive) component, C2, with λ2=0.0004, that can be purchased from a different supplier.

[Figures: B10 Life for the System with C1; B10 Life for the System with C2]

The above analysis can be used to weigh the gains obtained by switching to a more expensive supplier.
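At the component level, the effect of the supplier change can be checked by hand. For an exponential distribution, the B10 life (the time by which 10% of units are expected to fail) solves exp(-λt) = 0.90. The sketch below computes this for C1 and C2 alone; it does not compute the system B10 life, which depends on the full Figure 1 configuration. The function name is illustrative.

```python
import math


def exponential_b10_life(failure_rate):
    """Time at which reliability drops to 0.90 for an exponential distribution."""
    return -math.log(0.90) / failure_rate


b10_c1 = exponential_b10_life(0.0008)  # original component C1
b10_c2 = exponential_b10_life(0.0004)  # alternative component C2

print(f"B10 life of C1 alone: {b10_c1:.1f} hr")
print(f"B10 life of C2 alone: {b10_c2:.1f} hr")
```

Halving the failure rate doubles the component's own B10 life. The gain at the system level will generally be smaller, because the other blocks also limit the system's life.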

4.3. Changing Maintainability Characteristics
This what-if analysis is done by changing maintainability-related factors and assessing the impact on availability or on the total number of failures. Maintainability-related factors include the repair duration, the frequency of preventive maintenance and inspections, the initial stock level of spare parts, the restocking policy, the logistic delay for obtaining parts (which can be addressed by choosing a different distributor or delivery company) and the number of crews.

The next example shows the impact on availability if every preventive maintenance policy applied to the components is changed so that it is performed every 200 hr of component age.

[Figures: Mean Availability of the System with Original PM Policy; Mean Availability of the System with New PM Policy]

5. Conclusion
Before embarking on a lengthy and expensive development or maintenance restructuring program, it is important to identify the most problematic areas in the system. In addition to various standard metrics available in BlockSim, what-if type analysis can be used to address other types of importance measures.


References
1. "Microsoft's CEO: 80-20 Rule Applies To Bugs, Not Just Features," Paula Rooney, CRN, www.crn.com/sections/breakingnews/dailyarchives.jhtml?articleId=18821726, visited on 8/2/2006.
2. Leemis, L.M., Reliability: Probabilistic Models and Statistical Methods, Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1995.
3. Wang, W., Loman, J., Vassiliou, P., Reliability Importance of Components in a Complex System, Proceedings of the Annual Reliability & Maintainability Symposium, 2004.

Copyright 2006 ReliaSoft Corporation, ALL RIGHTS RESERVED