Reliability HotWire

Hot Topics

A common metric of interest for repairable systems is the likelihood (or number of times) that a downtime lasts beyond a certain limit. In some circumstances, failures (or a certain number of failures) might be tolerated as long as they do not cause significant downtime. In this article, we present a simulation-based approach to estimate the probability that a downtime exceeds a certain time.
There are many situations where the risk of excess downtime is a serious problem for companies. For example, oil-drilling tool companies may have customers that penalize them if their tools stay inoperable for more than a certain number of hours. Excessive downtime of computer servers can cause major interruptions to a company's operations and leave customers unhappy because they cannot access Web sites. Manufacturing companies with very time-sensitive processes cannot afford long downtimes. Food processing companies deal with perishable ingredients and cannot accept downtimes longer than a certain time. Electric utility companies can absorb outages that last less than a certain time within their systems, but a longer outage can result in a city-wide outage.
One of the questions in a "Cost of Downtime" survey conducted in 2001 via the ContingencyPlanning.com Web site asked respondents "At What Point is the Survival of Your Company At Risk" (due to the duration of the downtime). 60% of the respondents specified that downtimes ranging from within the first hour to 48 hours would pose a serious risk. The results of this survey (posted on the Web at http://www.contingencyplanningresearch.com/2001 Survey.pdf) highlight the peril of extended downtime to the survival of a company.
When faced with a system downtime analysis problem, it is not enough to evaluate the repair distributions of the parts to determine whether a system downtime would exceed a certain time. Whether a component downtime translates into a system failure depends on the system configuration, on the downtimes of other parts and on when those downtimes happen relative to each other. In addition, when resources (crews, spare parts warehouses, etc.) are shared and/or when additional delays (logistic delays, transportation delays, etc.) are possible, a downtime might take longer than the repair action itself.
A system model is necessary to combine the repair characteristics of parts, logistics and resources into a comprehensive understanding of the system's downtime. Consider a simple example of a system made up of two parts in series. The two parts could fail close to each other (as shown in the next figure), in which case the system could experience a downtime longer than either part's individual downtime. The system downtime then runs from the time the first part fails until the repair of the second part ends. Also, if the parts share the same repair crew, one part may fail while the crew is busy repairing the other. The second part then incurs a longer delay because it has to wait for the crew to become available.
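The two mechanisms described above (overlapping part downtimes and waiting for a shared crew) can be sketched in a few lines of Python. This is a minimal illustration, not the simulation used for this article: the function names `crew_schedule` and `series_system_downtimes`, the first-come, first-served crew rule, and the failure times and repair durations in the usage example are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def crew_schedule(failures: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return (down_start, down_end) for each part failure.

    `failures` holds (failure_time, repair_duration) pairs in hours.
    Assumption: a single crew serves failures first-come, first-served
    and works on only one repair at a time.
    """
    intervals = []
    crew_free_at = 0.0
    for fail_time, repair_duration in sorted(failures):
        # The repair cannot start until the crew has finished its previous job.
        repair_start = max(fail_time, crew_free_at)
        repair_end = repair_start + repair_duration
        # The part is down from its failure until its repair ends.
        intervals.append((fail_time, repair_end))
        crew_free_at = repair_end
    return intervals


def series_system_downtimes(intervals: List[Tuple[float, float]]) -> List[float]:
    """Merge overlapping part downtimes into system downtime durations.

    A series system is down whenever at least one part is down, so
    overlapping or touching intervals form a single system downtime.
    """
    merged: List[List[float]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [end - start for start, end in merged]


# Hypothetical failure log: (failure time, repair duration) in hours.
example_failures = [(100.0, 10.0), (105.0, 12.0), (300.0, 5.0)]
example_downtimes = series_system_downtimes(crew_schedule(example_failures))
print(example_downtimes)  # [22.0, 5.0]
```

In this hypothetical case no single repair takes longer than 12 hours, yet the first system downtime lasts 22 hours because the second part fails during the first repair and must wait for the crew.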
Let us use the following example for illustration. We are interested in the system's operation over 5000 hours. The system downtime should not exceed 18 hours; otherwise the company faces a significant penalty. The following table lists the failure and repair distributions for each part.
A component downtime will cause a system downtime only if an additional part is down during the same time. In this example, all repair actions are performed by the same crew, which can do only one repair action at a time.
If we simulate the system and obtain only system-level summary results such as mean availability, total failures, total downtime or average downtime, we do not have enough information to estimate the probability that the downtime caused by a system failure incident will last longer than a certain time. To assess the downtime caused by each system failure, we must run one simulation and evaluate the logs of each repair action. The resulting log lists the failure time and the repair duration for each system failure event during the simulation run time. This log covers only one simulation run. To have more confidence in the results of our analysis, we can run the simulation multiple times (each time with a different seed number) and keep a record of all the system downtime durations. In this example, we repeated the analysis 10 times, which generated 160 system failures.
The data set including all repair durations was then fitted with a distribution, and the generalized-gamma distribution was found to be a good fit. To calculate the probability that the system stays down longer than 18 hours, we calculate P(System Downtime > 18 hours). This probability corresponds to the area under the repair duration's probability density function to the right of 18 hours. Thus, the probability that the system's downtime exceeds 18 hours is 7.72%.
Because the system downtimes might be too complicated to model with a simple distribution, as we noticed in the previous probability plot, an approach that does not need a distribution assumption might be more appropriate. One such approach is a non-parametric data analysis method such as the Kaplan-Meier estimator. The probability plot is shown next. It shows the probability that a system downtime will exceed a certain time. Using the non-parametric approach, the probability that the system's downtime will exceed 18 hours is 9% (estimated from the above plot). An even simpler method is to count the number of times a system downtime lasted longer than 18 hours and divide that by the total sample size (160), which, in this case, yields an answer of 9.375%. More accurate results can be obtained by increasing the number of simulations.
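The counting method and the Kaplan-Meier estimate can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the eight sample downtimes are hypothetical (the article's 160 simulated downtimes are not reproduced here), and every downtime is assumed to be fully observed, with no censoring. Under that assumption the product-limit estimate reduces to the simple fraction of downtimes above the limit. The only figure taken from the article is the final check, where 9.375% of 160 corresponds to 15 downtimes longer than 18 hours.

```python
from collections import Counter
from typing import Sequence


def exceedance_fraction(downtimes: Sequence[float], limit: float) -> float:
    """Fraction of observed downtimes that lasted longer than `limit`."""
    return sum(1 for d in downtimes if d > limit) / len(downtimes)


def kaplan_meier_exceedance(downtimes: Sequence[float], limit: float) -> float:
    """Product-limit estimate of P(downtime > limit).

    Assumption: the data are complete (no censored downtimes), so each
    recorded duration is the end of a downtime.
    """
    counts = Counter(downtimes)
    still_down = len(downtimes)  # downtimes not yet ended ("at risk")
    survival = 1.0
    for t in sorted(counts):
        if t > limit:
            break
        ended = counts[t]  # downtimes that ended exactly at time t
        survival *= 1.0 - ended / still_down
        still_down -= ended
    return survival


# Hypothetical system downtimes in hours, for illustration only.
sample_downtimes = [4.0, 7.5, 7.5, 12.0, 16.0, 19.0, 25.0, 3.0]
limit_hours = 18.0

simple_estimate = exceedance_fraction(sample_downtimes, limit_hours)
km_estimate = kaplan_meier_exceedance(sample_downtimes, limit_hours)
print(simple_estimate, km_estimate)

# Figure from the article: 15 of 160 downtimes above 18 hours is 9.375%.
article_fraction = 15 / 160
```

With censored observations (for example, a downtime still in progress when a simulation run ends), the two estimates would no longer coincide and the product-limit form would need to account for the censored points.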
*EAGLE ROCK ALLIANCE, 2001 Cost of Downtime Online Survey at http://www.contingencyplanningresearch.com/2001 Survey.pdf.

Copyright 2007 ReliaSoft Corporation, ALL RIGHTS RESERVED