Overview of BlockSim 6
Containers in Simulation
In BlockSim 6, the container was
introduced to facilitate the analysis of standby redundancy and load
sharing blocks. But how does the container accomplish this?
This article presents how containers are used for standby and load
sharing analysis during simulation.
Standby Containers
In the case of a standby container, the container acts as the switch mechanism,
as illustrated in Figure 1. In addition, the container defines the standby
relationships and the number of active units that are required. The container's
failure and repair properties are really those of the switch itself. The switch
can fail in two ways: according to a reliability distribution while waiting to
switch, or during the switch action itself. Repair properties restore the switch
regardless of how the switch failed. Failure of the switch itself does not bring
the container down because the switch is not needed unless it is called upon to
switch. The container goes down if the units within the container fail or if the
switch is failed when a switch action is needed. The restoration time in these
cases is based on the repair distributions of the contained units and the switch.
Furthermore, the container is down during any switch process that has a delay.
Figure 1: The standby container acts as the switch, thus the failure distribution
of the container is the failure distribution of the switch. The container
can also fail when called upon to switch.
To better illustrate
this, consider the following deterministic case.
- Units A and B are
contained in a standby container.
- The standby container
is the only item in the diagram, thus failure of the container is the
same as failure of the system.
- Unit A is the active unit
and Unit B is the standby unit.
- Unit A (active) fails
every 100 time units (tu) and takes 10 tu to repair.
- Unit B fails after every 3 tu of active operation
and also takes 10 tu to repair.
- The units cannot fail
while in quiescent (standby) mode.
- Furthermore, assume
that the container (acting as the switch) fails every 30 tu while
waiting to switch and takes 4 tu to repair. If not failed, the container
switches with 100% probability.
- The switch action
takes 7 tu to complete.
- After repair, unit A
is always reactivated.
- The container does not
operate through system failure and thus the components do not either.
Keep in mind that we are
looking at two events on the container: the container being down and the
container switch being down.
Figure 2: Behavior of standby system
The system event log is
shown in Figure 2 and is as follows:
- At 30, the switch
fails and gets repaired by 34. The container switch is failed and being
repaired; however, the container is up during this time.
- At 64, the switch
fails and gets repaired by 68. The container is up during this time.
- At 98, the switch
fails. It will be repaired by 102.
- At 100, unit A fails.
Unit A attempts to activate the switch to go to B; however, the switch
is failed.
- At 102, the switch is
operational.
- From 102 to 109, the
switch is in the process of switching from unit A to unit B. The
container and system are down from 100 to 109.
- By 110, unit A is
fixed and the system is switched back to A from B. The return switch
action brings the container down for 7 tu, from 110 to 117. Note that
unit B has functioned for only 1 tu, from 109 to 110.
- At 141, the switch
fails and gets repaired by 145. The container is up during this time.
- At 175, the switch
fails and gets repaired by 179. The container is up during this time.
- At 209, the switch
fails and gets repaired by 213. The container is up during this time.
- At 217, unit A fails.
The switch is up and the system is switched to unit B within 7 tu. The
container is down from 217 to 224.
- At 224, unit B takes
over. After 2 more tu of operation (3 tu in total, counting the 1 tu from
109 to 110), unit B fails at 226. It will be restored by 236. The container
is failed.
- At 227, unit A is
repaired and the switchback action to unit A is initiated. By 234, the
system is up.
- At 252, the switch
fails and gets repaired by 256. The container is up during this time.
- At 286, the switch
fails and gets repaired by 290. The container is up during this time.
The system results are
presented next in Figure 3.
Figure 3: Simulation results
- System CM Downtime is
24.
- CM downtime
includes all downtime due to failures as well as the delay in
switching from a failed active unit to a standby unit. It does not
include the switchback time from the standby to the restored active.
Thus, the times from 100 to 109, 217 to 224 and 226 to 234 are
included. The time to switchback, 110 to 117, is not included.
- System Total Downtime
is 31.
- It includes the CM
downtime and the switchback downtime.
- Number of System
Failures is 3.
- It includes the
failures at 100, 217 and 226.
- This is the same
as the number of CM downing events.
- The Total Downing
Events are 4.
- This includes the
switchback downing event at 110.
- The Mean Availability
(w/o PM and Inspection) does not include the downtime due to the
switchback event.
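The arithmetic behind these results can be checked with a short sketch. It is only an illustration, not BlockSim's implementation: the interval list is copied from the event log above, and the names `downing_events`, `tally` and the `"cm"`/`"switchback"` labels are made up for this example.

```python
# Each downing event is (start, end, kind). The intervals come from the event log;
# "cm" marks downtime caused by a failure (including the switch delay) and
# "switchback" marks the return switch to unit A. The labels are illustrative only.
downing_events = [
    (100, 109, "cm"),
    (110, 117, "switchback"),
    (217, 224, "cm"),
    (226, 234, "cm"),
]


def tally(events):
    """Sum downtime and count events the way the results list describes them."""
    cm_downtime = sum(end - start for start, end, kind in events if kind == "cm")
    total_downtime = sum(end - start for start, end, _ in events)
    system_failures = sum(1 for _, _, kind in events if kind == "cm")
    return {
        "cm_downtime": cm_downtime,
        "total_downtime": total_downtime,
        "system_failures": system_failures,
        "total_downing_events": len(events),
    }


results = tally(downing_events)
print(results)
```

The switchback interval contributes to the total downtime and to the total downing events, but not to the CM downtime or the number of system failures.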
Additional Rules and Assumptions for Standby Containers
- A container will only attempt to switch if there is an available
non-failed item to switch to. If there is no such item, it will then
switch if and when an item becomes available. The switch will cancel the
action if the failed active unit is restored before an item becomes available.
- As an example,
consider the case of unit A failing active while unit B failed in a
quiescent mode. If unit B gets restored before unit A, then the
switch will be initiated. If unit A is restored before unit B, the
switch action will not occur.
- In cases where not all
active units are required, a switch will only occur if the failed
combination causes the container to fail.
- For example, if A,
B and C are in a container for which one unit is required to be
operating and A and B are active with C on standby, then the failure
of either A or B will not cause a switching action. The container
will switch to C only if both A and B are failed.
- If the container
switch is failed and a switching action is required, the switching
action will occur after the switch has been restored if it is still
required (i.e. if the active unit is still failed).
- If a switch fails
during the delay time of the switching action based on the reliability
distribution (quiescent failure mode), the action is still carried out
unless a failure based on the switch probability/restarts occurs when
attempting to switch.
- During switching
events, the change from the operating to quiescent distribution (and
vice versa) occurs at the end of the delay time.
- The option of whether
components operate while the system is down is defined at the container
level. Contained items inherit this property from the container (the same
is true in a load sharing container). However, regardless of the
container settings:
- If a path inside
the container is down, blocks inside the container that are in that
path do not continue to operate.
- Blocks that are up
do not continue to operate while the container is down.
- A switch can have a
repair distribution and maintenance properties without having a
reliability distribution.
- This is because
maintenance actions are utilized regardless of whether the switch
failed while waiting to switch (reliability distribution) or during
the actual switching process (fixed probability).
- A switch fails during
switching when the restarts are exhausted.
- A restart is executed
every time the switch fails to switch (based on its fixed probability of
switching).
- If a delay is
specified, restarts happen after the delay.
- If a container brings
the system down, the container is responsible for the system going down
(not the blocks inside the container).
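Two of the rules above, when a switch is needed and how restarts are consumed, can be sketched in a few lines. This is a hedged illustration rather than BlockSim's logic: the function names and parameters are hypothetical, the switch delay is not modeled, and the sketch assumes that the initial attempt is counted separately from the restarts (so a switch with n restarts gets 1 + n attempts).

```python
def switch_needed(active_units_up, units_required):
    """A switch to a standby unit is needed only when the failed combination
    would otherwise bring the container down."""
    return active_units_up < units_required


def attempt_switch(draws, switch_probability, max_restarts):
    """Return True if the initial attempt or any restart succeeds.

    draws: one uniform(0, 1) number per attempt, supplied by the caller so the
    outcome is reproducible. An attempt succeeds when its draw is below the
    fixed switch probability. The switch fails once the restarts are exhausted.
    """
    for _, draw in zip(range(1 + max_restarts), draws):
        if draw < switch_probability:
            return True
    return False


# A and B active, C on standby, one unit required:
print(switch_needed(active_units_up=1, units_required=1))  # one of A or B failed
print(switch_needed(active_units_up=0, units_required=1))  # both A and B failed

# First attempt fails (0.95 >= 0.9), the single restart succeeds (0.5 < 0.9).
print(attempt_switch([0.95, 0.5], switch_probability=0.9, max_restarts=1))
```

Passing the random draws in explicitly keeps the example deterministic; in a real simulation they would come from a random number generator.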
Load Sharing Containers
In the case of a load sharing container, the container defines the load that is
shared. A load sharing container has no failure or repair distributions. The
container itself is considered failed if all the blocks inside the container
have failed (or, in a k-out-of-n configuration, if fewer than k blocks remain
operating). To illustrate this, consider the following container with items A
and B in a load sharing redundancy.
Assume that A fails every
100 tu and B every 120 tu if both items are operating and they fail in half
that time if either is operating alone (i.e. the items age twice as fast
when operating alone). They both get repaired in 5 tu.
Figure 4: Behavior of load sharing system
The system event log is
shown in Figure 4 and is as follows:
- At 100, A fails. It
takes 5 tu to restore A.
- From 100 to 105, B is
operating alone and is experiencing a higher load.
- At 115, B fails. B
would normally be expected to fail at 120, however:
- From 0 to 100, it
accumulated the equivalent of 100 tu of damage.
- From 100 to 105,
it accumulated 10 tu of damage, which is twice the damage since it
was operating alone. Put another way, B aged by 10 tu over a period
of 5 tu.
- At 105, A is
restored but B has only 10 tu of life remaining at this point.
- B fails at 115.
- At 120, B is repaired.
- At 200, A fails again.
A would normally be expected to fail at 205; however, A operated alone
while B was down from 115 to 120, which added extra damage to A. In other
words, the age of A at 115 was 10; by 120 it was 20. Thus A reached an age
of 100 at time 200, only 95 tu after it was restored at 105.
- A is restored by 205.
- At 235, B fails. B
would normally be expected to fail at 240; however, B operated alone while
A was down from 200 to 205, which shortened its life.
- At 200, B had an
age of 80.
- By 205, B had an
age of 90.
- B fails 30 tu
later at 235.
- The system itself
never failed.
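The event log above can be reproduced with a small deterministic sketch. It is an illustration under the stated assumptions only (a unit ages twice as fast when it operates alone, and a repaired unit starts again at age zero), not BlockSim's simulation engine; `simulate_load_sharing` and its parameters are names invented for this example.

```python
def simulate_load_sharing(lives, repair_time, horizon):
    """Walk through a deterministic two-unit load sharing case.

    lives: life of each unit (in tu) when both units share the load.
    Returns a list of (time, unit, event) tuples up to the horizon.
    """
    age = {u: 0.0 for u in lives}
    back_up_at = {u: None for u in lives}  # None means the unit is operating
    t, log = 0.0, []
    while True:
        up = [u for u in lives if back_up_at[u] is None]
        rate = 2.0 if len(up) == 1 else 1.0  # alone: ages twice as fast
        # Time until the next failure or the next completed repair.
        waits = [(lives[u] - age[u]) / rate for u in up]
        waits += [back_up_at[u] - t for u in lives if back_up_at[u] is not None]
        dt = min(waits)
        if t + dt > horizon:
            return log
        t += dt
        for u in up:
            age[u] += rate * dt
        for u in lives:
            if back_up_at[u] is None and age[u] >= lives[u]:
                log.append((t, u, "fails"))
                back_up_at[u] = t + repair_time
                age[u] = 0.0  # assumed as good as new after repair
            elif back_up_at[u] is not None and back_up_at[u] <= t:
                log.append((t, u, "restored"))
                back_up_at[u] = None


log = simulate_load_sharing({"A": 100.0, "B": 120.0}, repair_time=5.0, horizon=236.0)
for time, unit, event in log:
    print(f"t={time:g}: {unit} {event}")
```

With these inputs the failures occur at 100 (A), 115 (B), 200 (A) and 235 (B), matching the log, and the two units are never down at the same time.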
Additional Rules and Assumptions for Load Sharing Container
- The option of whether
components operate while the system is down is defined at the container
level. Contained items inherit this property from the container.
However, and regardless of the container settings:
- If a path inside
the container is down, blocks inside the container that are in that
path do not continue to operate.
- Blocks that are up
do not continue to operate while the container is down.
- If a container brings
the system down, the block that brought the container down is
responsible for the system going down. (This is the opposite of standby
containers.)