Analyzing Excess System Downtime Through System Modeling
A common metric of interest for repairable systems is the likelihood (or number of times) that a downtime lasts beyond a certain limit. In some circumstances, failures (or a certain number of failures) might be tolerated as long as they do not cause significant downtime. In this article, we will present an approach to estimate the probability that a downtime exceeds a certain time using BlockSim 7 and Weibull++ 7.
There are many situations where the risk of excess downtime is a serious problem for companies. For example, oil-drilling tool companies may have customers that penalize them if their tools stay inoperable for more than a certain number of hours. Excessive downtime of computer servers can cause major interruptions to a company's operations and leave customers unhappy because they cannot access Web sites. Manufacturing companies with very time-sensitive processes cannot afford long downtimes. Food processing companies deal with perishable ingredients and cannot accept downtimes longer than a certain time. Electric utility companies can handle outages that last less than a certain time with their systems, but a longer outage can result in a citywide outage.
One of the questions in a "Cost of Downtime" survey conducted in 2001 via the ContingencyPlanning.com Web site asked respondents "At What Point is the Survival of Your Company At Risk" (due to the duration of the downtime). 60% of the respondents specified that downtimes ranging from within the first hour to 48 hours would pose a serious risk. The results of this survey (posted on the Web at http://www.contingencyplanningresearch.com/2001 Survey.pdf) highlight the peril of extended downtime to the survival of a company.
When faced with a system downtime analysis problem, it is not enough just to evaluate the repair distributions of the parts to determine whether a system downtime would exceed a certain time. Whether a component downtime translates into a system failure depends on the system configuration and the downtimes of other parts and when they happen relative to each other. In addition, in cases where resources (crews, spare warehouse, etc.) are shared and/or when additional delays (logistic delays, transportation delays, etc.) are possible, a downtime might take longer than the repair action.
A system model is necessary to combine the repair characteristics of parts, logistics and resources to get a comprehensive understanding of the system's downtime. Let us use a simple example of a system made up of two parts in series. The two parts could fail close to each other (as shown in the next figure), in which case the system could experience a downtime longer than either part's individual downtime. The system downtime then runs from the time the first part fails until the repair of the second part ends.
Also, if parts share the same repair crew, there is a possibility that while a crew is busy repairing a part, another part fails. The second part will then incur longer delays because it has to wait for the crew to become available.
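The two effects described above (overlapping part downtimes in a series system, and waiting for a shared crew) can be illustrated with a minimal Python sketch. This is not how BlockSim works internally; it is a simplified illustration that assumes the failure times are already known, that the single crew serves failures in first-come, first-served order, and that the function names and the sample numbers are hypothetical.

```python
def schedule_single_crew(failures):
    """Return (failure_time, repair_end) for each part failure.

    failures: iterable of (failure_time, repair_duration) in hours.
    Assumption: one crew, first-come first-served, no other delays.
    """
    crew_free_at = 0.0
    downtimes = []
    for failure_time, repair_duration in sorted(failures):
        # A part that fails while the crew is busy waits for the crew.
        repair_start = max(failure_time, crew_free_at)
        repair_end = repair_start + repair_duration
        downtimes.append((failure_time, repair_end))
        crew_free_at = repair_end
    return downtimes


def series_system_downtimes(part_downtimes):
    """Merge overlapping part downtimes into system downtime intervals.

    A series system is down whenever any of its parts is down.
    """
    merged = []
    for start, end in sorted(part_downtimes):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical numbers: part A fails at 100 h (10 h repair),
# part B fails at 105 h (8 h repair) while the crew is still busy.
part_failures = [(100.0, 10.0), (105.0, 8.0)]
system = series_system_downtimes(schedule_single_crew(part_failures))
durations = [end - start for start, end in system]
print(system, durations)
```

With these hypothetical inputs, the second part waits 5 hours for the crew, and the single system downtime lasts 18 hours even though neither repair action takes more than 10 hours.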
Let us use the following example for illustration. We are interested in the system's operation over 5000 hours. The system downtime should not exceed 18 hours; otherwise the company faces a significant penalty.
The following table lists the failure and repair distributions for each part.
A component downtime will cause a system downtime only if an additional part is down during the same time. In this example, all repair actions are to be performed by the same crew, which can only do one repair action at a time.
If we simulate the system and obtain system level summarized results such as mean availability, total failures, total downtime or average downtime, then we do not have enough information to estimate the probability that the downtime caused by a system's failure incident will last longer than a certain time. To assess the downtime caused by each system failure, we must run one simulation and evaluate the logs of each repair action. The System Failure Event Log report shown next is available in BlockSim 7's Simulation Results Explorer, under the Reports folder.
This report lists the failure time and the repair duration for each system failure event during the simulation run time. This log is presented for only one simulation run. To have more confidence in the results of our analysis, we can run the simulation multiple times (each time with a different seed number) and keep a record of all the system downtime durations. In this example, we repeated the analysis 10 times, allowing us to generate 160 system failures. The data set including all repair durations was then entered in Weibull++ 7. Here, Weibull++ is used as a "general" statistical tool; the random variable is repair time, not failure time.
The generalized-gamma distribution was found to be a good pdf fit for the system downtime data. The model fits the data especially well at high downtimes (right tail of distribution), which is the focus of our probability calculation.
To calculate the probability that the system stays down longer than 18 hours, we calculate P(System Downtime > 18 hours). This probability corresponds to the area under the repair duration's pdf to the right of t = 18. In reliability analysis terms, this area corresponds to the reliability function value for t = 18. Using the Quick Calculation Pad, we calculate P(System Downtime > 18 hours) as follows.
Thus, the probability that the system's downtime exceeds 18 hours is 7.72%.
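The calculation above can be written compactly. The symbols are introduced here for illustration only: T is the system downtime, f is the fitted pdf of the downtime data, F is the corresponding cdf and R = 1 - F plays the role of the reliability function.

```latex
P(T > 18) \;=\; \int_{18}^{\infty} f(t)\,dt \;=\; 1 - F(18) \;=\; R(18) \;\approx\; 0.0772
```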
Because the system downtimes might be too complicated to be modeled with a simple distribution, as we noticed in the previous probability plot, an approach that does not require a distribution assumption might be more appropriate. One option is a non-parametric data analysis method such as the Kaplan-Meier estimator.
The probability plot is shown next. It shows the probability that a system downtime will exceed a certain time.
Using the non-parametric approach, the probability that the system's downtime will exceed 18 hours is 9% (estimated from the above plot).
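For readers who want to reproduce this kind of estimate outside Weibull++, the following is a minimal Python sketch of the Kaplan-Meier (product-limit) estimator. The function names and the small data set are hypothetical and are not the 160 downtimes from the example. When every downtime is fully observed (no censoring), as in the simulation logs, the estimate at a given time reduces to the fraction of downtimes that exceed that time.

```python
def kaplan_meier(durations, observed=None):
    """Return the Kaplan-Meier curve as a list of (time, survival) steps.

    durations: downtime durations in hours.
    observed: optional list of booleans; False marks a censored value.
              If omitted, every duration is treated as fully observed.
    """
    n = len(durations)
    if observed is None:
        observed = [True] * n
    data = sorted(zip(durations, observed))
    at_risk = n
    survival = 1.0
    curve = []
    i = 0
    while i < n:
        t = data[i][0]
        events = 0
        ties = 0
        # Group all records that share the same time.
        while i < n and data[i][0] == t:
            if data[i][1]:
                events += 1
            ties += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= ties
    return curve


def survival_at(curve, t):
    """Estimated probability that a downtime exceeds t hours."""
    value = 1.0
    for time, s in curve:
        if time <= t:
            value = s
        else:
            break
    return value


# Hypothetical downtimes in hours, for illustration only.
sample = [2.0, 5.0, 5.0, 12.0, 20.0, 25.0, 30.0, 40.0]
curve = kaplan_meier(sample)
print(survival_at(curve, 18.0))
```

In this hypothetical sample, four of the eight downtimes exceed 18 hours, so the estimate at 18 hours is one half (up to floating-point rounding).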
An even simpler method would be to count the number of system downtimes that lasted longer than 18 hours and divide that count by the total sample size (160). In this case, the result is 9.375%, which corresponds to 15 of the 160 downtimes. More accurate results can be obtained by increasing the number of simulations.
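To see why more simulations help, the counting estimate can be paired with a rough measure of its uncertainty. The sketch below uses the binomial standard error, which is an assumption made here and not part of the original analysis: it treats the downtimes as independent observations, which is only approximately true when repairs share a crew. The function name is hypothetical.

```python
import math


def exceedance_estimate(exceed_count, sample_size):
    """Return (fraction exceeding the limit, binomial standard error)."""
    p = exceed_count / sample_size
    # Assumption: independent observations, so the count is binomial.
    standard_error = math.sqrt(p * (1.0 - p) / sample_size)
    return p, standard_error


# 15 of 160 downtimes exceed 18 hours, matching the 9.375% above.
p, se = exceedance_estimate(15, 160)
print(f"estimate = {p:.5f}, standard error = {se:.4f}")

# Four times as many observations with the same fraction halves the error.
p4, se4 = exceedance_estimate(60, 640)
print(f"estimate = {p4:.5f}, standard error = {se4:.4f}")
```

Under this assumption the standard error for 160 downtimes is roughly 2.3 percentage points, which is consistent with the spread between the 7.72%, 9% and 9.375% estimates obtained above.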
Note: The above example could be extended to include crew logistic delays, spare part logistic delays and sharing of other resources such as spare pools. In addition, other actions, such as preventive maintenance and inspections, can be added to the analysis. For even more complex logistic scenarios, we suggest using a combination of BlockSim, Weibull++ and RENO.
*EAGLE ROCK ALLIANCE, 2001 Cost of Downtime Online Survey at http://www.contingencyplanningresearch.com/2001 Survey.pdf.
Copyright 2007 ReliaSoft Corporation, ALL RIGHTS RESERVED