Having introduced some of the basic theory and terminology for repairable systems in the Introduction to Repairable Systems chapter of this on-line reference, we will now examine the steps involved in the analysis of such complex systems. We will begin by examining system behavior through a sequence of discrete deterministic events and expand the analysis using discrete event simulation.
To first understand how component failures and simple repairs affect the system and to visualize the steps involved, let's begin with a very simple deterministic example with two components, A and B, in series.
Component A fails every 100 hours and component B fails every 120 hours. Both require 10 hours to get repaired. Furthermore, assume that the surviving component stops operating when the system fails (thus not aging). (Note: When a failure occurs in certain systems, some or all of the system's components may or may not continue to accumulate operating time while the system is down. For example, consider a transmitter-satellite-receiver system. This is a series system and the probability of failure for this system is the probability that any of the subsystems fail. If the receiver fails, the satellite continues to operate even though the receiver is down. In this case, the continued aging of the components during system inoperation must be taken into consideration since this will affect their failure characteristics and have an impact on the overall system downtime and availability.)
The system behavior during an operation from 0 to 300 hours would be as shown in Figure 8.1.
Figure 8.1: Overview of system and components for a simple series system with two components. Component A fails every 100 hours and component B fails every 120 hours. Both require 10 hours to get repaired and do not age (operate through failure) when the system is in a failed state.
Specifically, component A would fail at 100 hours, causing the system to fail. After 10 hours, component A would be restored and so would the system. The next event would be the failure of component B. We know that component B fails every 120 hours (or after an age of 120 hours). Since a component does not age while the system is down, component B would have reached an age of 120 when the clock reaches 130 hours. Thus, component B would fail at 130 hours and be repaired by 140, and so forth. Overall, in this scenario the system would be failed for a total of 40 hours due to four downing events (two due to A and two due to B). The overall system availability (average or mean availability) would be 260/300 = 0.86667. Point availability is the availability at a specific point in time. In this deterministic case, the point availability would always be equal to 1 if the system is up at that time and equal to 0 if the system is down at that time.
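The bookkeeping in this walk-through can be sketched in a few lines of Python. This is only an illustrative sketch and not BlockSim's implementation; the function name `simulate_series` and its arguments are hypothetical. It assumes a pure series system in which every surviving component stops aging while the system is down.

```python
def simulate_series(life, repair, end):
    """Deterministic series system; components do not age while the system is down.

    life   -- operating hours between failures, per component (hypothetical input)
    repair -- clock hours needed to restore each component
    end    -- end of the observation window in clock hours
    """
    remaining = dict(life)          # operating hours left before each next failure
    clock, downtime = 0.0, 0.0
    events = []                     # (clock time of failure, component name)
    while True:
        # In a series system the component with the least remaining life fails next.
        name = min(remaining, key=remaining.get)
        run = remaining[name]
        if clock + run >= end:
            break
        clock += run
        for c in remaining:         # every component aged while the system was up
            remaining[c] -= run
        events.append((clock, name))
        down = min(repair[name], end - clock)
        clock += down               # the system is down; nothing ages
        downtime += down
        remaining[name] = life[name]  # the repaired component starts a new life
    return events, downtime


events, downtime = simulate_series({"A": 100, "B": 120}, {"A": 10, "B": 10}, 300)
availability = (300 - downtime) / 300
print(events, downtime, round(availability, 5))
```

For the two-component example, the sketch lists failures at 100 (A), 130 (B), 220 (A) and 270 (B), for 40 hours of downtime and a mean availability of 260/300.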
In the prior section, we made the assumption that components do not age when the system is down. This assumption applies to most systems. However, under special circumstances, a unit may age even while the system is down. In such cases, the operating profile will be different from the one presented in the prior section. Figure 8.2 illustrates the case where the components operate continuously, regardless of the system status.
Figure 8.2: Overview of up and down states for a simple series system with two components. Component A fails every 100 hours and component B fails every 120 hours. Both require 10 hours to get repaired and age when the system is in a failed state (operate through failure).
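When components operate through failure, each component follows its own fixed cycle of life plus repair, and a series system is down whenever any component is under repair. A minimal sketch of that idea follows. The helper names `down_intervals` and `union_length` are hypothetical, and the sketch assumes that a restored component starts aging immediately, whether or not the system is up.

```python
def down_intervals(life, repair, end):
    """Repair intervals of one component that ages regardless of system state."""
    intervals = []
    t = life                         # first failure after one full life
    while t < end:
        intervals.append((t, min(t + repair, end)))
        t += repair + life           # next failure: repair time plus one full life
    return intervals


def union_length(intervals):
    """Total length covered by a set of intervals (overlaps counted once)."""
    total, covered_until = 0.0, 0.0
    for start, stop in sorted(intervals):
        start = max(start, covered_until)
        if stop > start:
            total += stop - start
            covered_until = stop
    return total


a_down = down_intervals(100, 10, 300)
b_down = down_intervals(120, 10, 300)
system_downtime = union_length(a_down + b_down)   # series: any repair downs the system
print(a_down, b_down, system_downtime)
```

Under these assumptions A is down during 100-110 and 210-220, and B during 120-130 and 250-260. The total downtime over 300 hours is again 40 hours, but the events occur at different clock times than in the case where components do not age while the system is down.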
Consider a component with an increasing failure rate, as shown in Figure 8.3. In the case that the component continues to operate through system failure, then when the system fails at t1 the surviving component's failure rate will be λ1, as illustrated in Figure 8.3. When the system is restored at t2, the component would have aged by t2 - t1 and its failure rate would now be λ2.
If the component does not operate through failure, the surviving component would still be at the same failure rate, λ1, when the system resumes operation.
Figure 8.3: Illustration of a component with a linearly increasing failure rate and the effect of operation through system failure.
Consider the following system where A fails every 100, B every 120, C every 140 and D every 160 time units. Each takes 10 time units to restore. Furthermore, assume that components do not age when the system is down.
Figure 8.4: Overview of simple redundant system with four components.
A deterministic system view is shown in Figure 8.4. The sequence of events is as follows:
At 100, A fails and is repaired by 110. The system is failed.
At 130, B fails and is repaired by 140. The system continues to operate.
At 150, C fails and is repaired by 160. The system continues to operate.
At 170, D fails and is repaired by 180. The system is failed.
At 220, A fails and is repaired by 230. The system is failed.
At 280, B fails and is repaired by 290. The system continues to operate.
End at 300.
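The same event-by-event reasoning can be sketched for a system with redundancy. The block diagram itself is not reproduced in the text, so the sketch below assumes one structure that is consistent with the listed events: A in series with a parallel pair (B, C), in series with D. That structure, the function name `simulate` and the `system_up` callback are assumptions made for illustration only. Operating components age only while the system is up, and repairs run in clock time.

```python
from typing import Callable, Dict, List, Set, Tuple

EPS = 1e-9  # tolerance for floating-point comparisons


def simulate(life: Dict[str, float], repair: Dict[str, float],
             system_up: Callable[[Set[str]], bool], end: float):
    """Deterministic event simulation; components do not age while the system is down."""
    remaining = dict(life)               # operating hours left before next failure
    in_repair: Dict[str, float] = {}     # clock hours of repair left, per failed component
    clock, downtime = 0.0, 0.0
    events: List[Tuple[float, str, bool]] = []   # (time, failed component, system still up?)
    while clock < end - EPS:
        up = system_up(set(in_repair))
        operating = [c for c in life if c not in in_repair]
        waits = list(in_repair.values())
        if up:                           # failures can only occur while units are aging
            waits += [remaining[c] for c in operating]
        step = min(min(waits), end - clock)
        clock += step
        if up:
            for c in operating:
                remaining[c] -= step
        else:
            downtime += step
        for c in list(in_repair):        # repairs progress in clock time
            in_repair[c] -= step
            if in_repair[c] <= EPS:
                del in_repair[c]
                remaining[c] = life[c]
        for c in operating:              # record any failures reached at this time
            if remaining[c] <= EPS:
                in_repair[c] = repair[c]
                events.append((clock, c, system_up(set(in_repair))))
    return events, downtime


def assumed_structure(failed: Set[str]) -> bool:
    # ASSUMPTION: A and D in series with the parallel pair B, C.
    return "A" not in failed and "D" not in failed and not {"B", "C"} <= failed


life = {"A": 100, "B": 120, "C": 140, "D": 160}
repair = {name: 10 for name in life}
events, downtime = simulate(life, repair, assumed_structure, 300)
for time, name, still_up in events:
    print(time, name, "system continues" if still_up else "system failed")
```

With the assumed structure, the sketch reproduces the sequence listed above: failures at 100 (A), 130 (B), 150 (C), 170 (D), 220 (A) and 280 (B), with the system down only for the A and D events (30 time units in total).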
It should be noted that we are dealing with these events deterministically in order to better illustrate the methodology. When dealing with deterministic events, it is possible to create a sequence of events that one would not expect to encounter probabilistically. One such example consists of two units in series that do not operate through failure but both fail at exactly 100, which is highly unlikely in a real-world scenario. In this case, the assumption is that one of the events must occur at least an infinitesimal amount of time (dt) before the other. Probabilistically, this event is extremely rare, since both randomly generated times would have to be exactly equal to each other, to 15 decimal places. In the rare event that this happens, BlockSim would pick the unit with the lowest ID value as the first failure. BlockSim assigns a unique numerical ID when each component is created. These can be viewed by selecting the "Show Block ID" option in the Diagram Options window.
Even though the examples presented are fairly simplistic, the same approach can be repeated for larger and more complex systems. The reader can easily observe/visualize the behavior of more complex systems in BlockSim using the Up/Down plots. These are the same plots used in this chapter. It should be noted that BlockSim makes these plots available only when a single simulation run has been performed for the analysis (i.e. Number of Simulations = 1). These plots are meaningless when doing multiple simulations because each run will yield a different plot.
In a probabilistic case, the failures and repairs do not happen at a fixed time and for a fixed duration, but rather occur randomly and based on an underlying distribution, as shown in Figures 8.5 and 8.6.
Figure 8.5: A single component with a probabilistic failure time and repair duration.
Figure 8.6: A system up/down plot illustrating a probabilistic failure time and repair duration for component A.
We use discrete event simulation in order to analyze (understand) the system behavior. Discrete event simulation looks at each system/component event very similarly to the way we looked at these events in the deterministic example. However, instead of using deterministic (fixed) times for each event occurrence or duration, random times are used. These random times are obtained from the underlying distribution for each event. As an example, consider an event following a 2-parameter Weibull distribution. The cdf of the 2-parameter Weibull distribution is given by:
$$F(T) = 1 - e^{-\left(\frac{T}{\eta}\right)^{\beta}}$$
The Weibull reliability function is given by:
$$R(T) = 1 - F(T) = e^{-\left(\frac{T}{\eta}\right)^{\beta}}$$
Then, to generate a random time from a Weibull distribution with a given η and β, a uniform random number from 0 to 1, UR[0, 1], is first obtained. (Note: BlockSim uses an algorithm based on L'Ecuyer's [14,15] random number generator with a post Bays-Durham shuffle. The RNG's period is approximately 10^18. The RNG passes all currently known statistical tests, within the limits of the machine's precision and for a number of calls (simulation runs) less than the period. If no seed is provided, the algorithm uses the machine's clock to initialize the RNG.)
The random time from a Weibull distribution is then obtained from:
$$T = \eta \left\{ -\ln\left( U_R[0,1] \right) \right\}^{\frac{1}{\beta}} \qquad (1)$$
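The sampling step in Equation (1), T = η(−ln U)^(1/β) with U uniform on (0, 1], can be sketched as follows. The function names are hypothetical, and Python's standard `random` module (a Mersenne Twister generator) stands in for the L'Ecuyer generator described above, so the sketch illustrates the transformation only, not BlockSim's random number stream.

```python
import math
import random


def weibull_time(eta, beta, u):
    """Time T that solves R(T) = u for a 2-parameter Weibull, with 0 < u <= 1."""
    return eta * (-math.log(u)) ** (1.0 / beta)


def random_weibull_time(eta, beta, rng):
    """Draw one random time; 1 - rng.random() lies in (0, 1], which avoids log(0)."""
    return weibull_time(eta, beta, 1.0 - rng.random())


rng = random.Random(42)            # fixed seed so the run is repeatable
eta, beta = 100.0, 1.5             # hypothetical parameters, for illustration only
samples = [random_weibull_time(eta, beta, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
theoretical_mean = eta * math.gamma(1.0 + 1.0 / beta)
print(sample_mean, theoretical_mean)
```

As a sanity check, the sample mean of many draws should be close to the Weibull mean, η·Γ(1 + 1/β).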
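To obtain a conditional time, the Weibull conditional reliability function is given by:
$$R(T,t) = \frac{R(T+t)}{R(T)} = \frac{e^{-\left(\frac{T+t}{\eta}\right)^{\beta}}}{e^{-\left(\frac{T}{\eta}\right)^{\beta}}} \qquad (2)$$
Or:
$$R(T,t) = e^{-\left[\left(\frac{T+t}{\eta}\right)^{\beta} - \left(\frac{T}{\eta}\right)^{\beta}\right]}$$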
The random time would be the solution for t for R(T, t) = UR[0, 1].
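For the Weibull case, this equation can be solved for t in closed form: setting the conditional reliability equal to U and taking logarithms gives t = η[(T/η)^β − ln U]^(1/β) − T. The sketch below uses this closed form; the function names are hypothetical, and T denotes the age already accumulated by the component.

```python
import math


def conditional_reliability(eta, beta, age, t):
    """R(T, t): probability of surviving a further t, given survival to age T."""
    return math.exp(-(((age + t) / eta) ** beta - (age / eta) ** beta))


def conditional_weibull_time(eta, beta, age, u):
    """Additional time t that solves R(T, t) = u, with 0 < u <= 1."""
    return eta * ((age / eta) ** beta - math.log(u)) ** (1.0 / beta) - age


# Hypothetical values: a unit aged 50 hours, eta = 100, beta = 1.5, u = 0.3.
t_more = conditional_weibull_time(100.0, 1.5, 50.0, 0.3)
print(t_more, conditional_reliability(100.0, 1.5, 50.0, t_more))
```

With an age of zero, the expression reduces to the unconditional formula in Equation (1).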
To illustrate the sequence of events, assume a single block with a failure and a repair distribution. The first event, $E_{F_1}$, would be the failure of the component. Its first time-to-failure would be a random number drawn from its failure distribution, $T_{F_1}$. Thus, the first failure event, $E_{F_1}$, would be at $T_{F_1}$. Once failed, the next event would be the repair of the component, $E_{R_1}$. The time to repair the component would now be drawn from its repair distribution, $T_{R_1}$. The component would be restored by time $T_{F_1} + T_{R_1}$. The next event would now be the second failure of the component after the repair, $E_{F_2}$. This event would occur after a component operating time of $T_{F_2}$ after the item is restored (again drawn from the failure distribution), or at $T_{F_1} + T_{R_1} + T_{F_2}$. This process is repeated until the end time.
It is important to note that each run will yield a different sequence of events due to the probabilistic nature of the times. To arrive at the desired result, this process is repeated many times and the results from each run (simulation) are recorded. In other words, if we were to repeat this 1,000 times, we would obtain 1,000 different values for $E_{F_1}$, or $[T_{F_1,1}, T_{F_1,2}, \ldots, T_{F_1,1000}]$. The average of these values, $\frac{1}{1000}\sum_{i=1}^{1000} T_{F_1,i}$, would then be the average time to the first event, $E_{F_1}$, or the mean time to first failure (MTTFF) for the component. Obviously, if the component were to be 100% renewed after each repair, then this value would also be the same for the second failure, etc.
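The alternating failure and repair events for a single block, and the averaging over many runs, can be sketched as follows. This is an illustrative sketch only: the names `block_events` and `estimate_mttff` are hypothetical, both distributions are assumed to be 2-parameter Weibull with made-up parameters, and the block is assumed to be fully renewed by each repair.

```python
import math
import random


def draw(params, rng):
    """Random Weibull time for params = (eta, beta), via inverse transform."""
    eta, beta = params
    return eta * (-math.log(1.0 - rng.random())) ** (1.0 / beta)


def block_events(fail, repair, rng, end=math.inf):
    """Yield ("failure", time) and ("restored", time) events until the end time."""
    clock = 0.0
    while True:
        clock += draw(fail, rng)        # operating time up to the next failure
        if clock >= end:
            return
        yield ("failure", clock)
        clock += draw(repair, rng)      # time spent under repair
        if clock >= end:
            return
        yield ("restored", clock)


def estimate_mttff(fail, repair, runs, seed):
    """Average time of the first failure event over independent runs."""
    rng = random.Random(seed)
    first_failures = [next(block_events(fail, repair, rng))[1] for _ in range(runs)]
    return sum(first_failures) / runs


fail, repair = (100.0, 1.5), (10.0, 2.0)   # hypothetical (eta, beta) pairs
one_run = list(block_events(fail, repair, random.Random(1), end=1000.0))
print(one_run[:4])
print(estimate_mttff(fail, repair, runs=1000, seed=7))
```

Each seed gives a different event sequence for a single run, while the averaged first-failure time should settle near the mean of the assumed failure distribution as the number of runs grows.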
©1999-2007. ReliaSoft Corporation. ALL RIGHTS RESERVED.