Reliability HotWire: eMagazine for the Reliability Professional

Issue 31, September 2003

Reliability Basics
Overview of BlockSim 6 Containers in Simulation

In BlockSim 6, the container was introduced to facilitate the analysis of standby redundancy and load sharing blocks. But how does the container accomplish this? This article presents how containers are used for standby and load sharing analysis during simulation.

Standby Containers

In the case of a standby container, the container acts as the switch mechanism, as illustrated in Figure 1. In addition, the container defines the standby relationships and the number of active units that are required. The container's failure and repair properties are really those of the switch itself. The switch can fail in two ways: according to a failure distribution while waiting to switch, or during the switch action itself. Repair properties restore the switch regardless of how it failed. Failure of the switch alone does not bring the container down, because the switch is not needed until it is called upon to switch. The container goes down if the units within the container fail or if the switch is in a failed state when a switch action is needed. The restoration time in that case is based on the repair distributions of the contained units and the switch. Furthermore, the container is down during any switch process that has a delay.

Figure 1: The standby container acts as the switch, thus the failure distribution of the container is the failure distribution of the switch. The container can also fail when called upon to switch.

To better illustrate this, consider the following deterministic case.

  1. Units A and B are contained in a standby container.
  2. The standby container is the only item in the diagram, thus failure of the container is the same as failure of the system.
  3. Unit A is the active unit and Unit B is the standby unit.
  4. Unit A (active) fails every 100 time units (tu) and takes 10 tu to repair.
  5. Unit B fails every 3 tu (active) and also takes 10 tu to repair.
  6. The units cannot fail while in quiescent (standby) mode.
  7. Furthermore, assume that the container (acting as the switch) fails every 30 tu while waiting to switch and takes 4 tu to repair. If not failed, the container switches with 100% probability.
  8. The switch action takes 7 tu to complete.
  9. After repair, unit A is always reactivated.
  10. The container does not operate through system failure and thus the components do not either.

Keep in mind that we are looking at two separate events on the container: the container being down and the container switch being down.

Figure 2: Behavior of standby system

The system event log is shown in Figure 2 and is as follows:

  1. At 30, the switch fails and gets repaired by 34. The container switch is failed and being repaired; however, the container is up during this time.
  2. At 64, the switch fails and gets repaired by 68. The container is up during this time.
  3. At 98, the switch fails. It will be repaired by 102.
  4. At 100, unit A fails. Unit A attempts to activate the switch to go to B; however, the switch is failed.
  5. At 102, the switch is operational.
  6. From 102 to 109, the switch is in the process of switching from unit A to unit B. The container and system are down from 100 to 109.
  7. By 110, unit A is fixed and the system is switched back to A from B. The return switch action brings the container down for 7 tu, from 110 to 117. During this time, note that unit B has only functioned for 1 tu, 109 to 110.
  8. At 141, the switch fails and gets repaired by 145. The container is up during this time.
  9. At 175, the switch fails and gets repaired by 179. The container is up during this time.
  10. At 209, the switch fails and gets repaired by 213. The container is up during this time.
  11. At 217, unit A fails. The switch is up and the system is switched to unit B within 7 tu. The container is down from 217 to 224.
  12. At 224, unit B takes over. After 2 tu of operation, at 226, unit B fails: it fails every 3 tu of operation and had already operated for 1 tu from 109 to 110. It will be restored by 236. The container is failed.
  13. At 227, unit A is repaired and the switchback action to unit A is initiated. By 234, the system is up.
  14. At 252, the switch fails and gets repaired by 256. The container is up during this time.
  15. At 286, the switch fails and gets repaired by 290. The container is up during this time.
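The timing of the two switchovers in the log (events 4 to 6 and event 11) can be checked with a few lines of Python. This is only a sketch of the rule as it reads from the example; the function name `switchover_outage` and its parameters are illustrative and are not part of BlockSim.

```python
def switchover_outage(failure_time, switch_ready_time, switch_delay):
    """Return (down_from, up_at) for a failure of the active unit.

    Assumption taken from the example: the container is down from the
    moment the active unit fails until the switch action completes, and
    the switch action cannot start before the switch is operational.
    """
    switch_start = max(failure_time, switch_ready_time)
    return failure_time, switch_start + switch_delay


# Unit A fails at 100 while the switch is under repair until 102.
first = switchover_outage(failure_time=100, switch_ready_time=102, switch_delay=7)

# Unit A fails at 217; the switch has been operational since 213.
second = switchover_outage(failure_time=217, switch_ready_time=213, switch_delay=7)

print(first, second)
```

With the values from the example, the first outage runs from 100 to 109 and the second from 217 to 224, matching the event log.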

The system results are presented next in Figure 3.

Figure 3: Simulation results

  1. System CM Downtime is 24.
    1. CM downtime includes all downtime due to failures as well as the delay in switching from a failed active unit to a standby unit. It does not include the switchback time from the standby to the restored active. Thus, the times from 100 to 109, 217 to 224 and 226 to 234 are included. The time to switchback, 110 to 117, is not included.
  2. System Total Downtime is 31.
    1. It includes the CM downtime and the switchback downtime.
  3. Number of System Failures is 3.
    1. It includes the failures at 100, 217 and 226.
    2. This is the same as the number of CM downing events.
  4. The Total Downing Events are 4.
    1. This includes the switchback downing event at 110.
  5. The Mean Availability (w/o PM and Inspection) does not include the downtime due to the switchback event.
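The four system-level figures above follow directly from the down intervals in the event log. The sketch below tallies them; the interval labels `"cm"` and `"switchback"` and the function name `summarize` are illustrative choices, not BlockSim terminology.

```python
# Down intervals read from the event log: (start, end, kind).
# "cm" marks downtime caused by a failure (including the switch delay to
# the standby unit); "switchback" marks the return switch to unit A.
DOWN_INTERVALS = [
    (100, 109, "cm"),
    (110, 117, "switchback"),
    (217, 224, "cm"),
    (226, 234, "cm"),
]


def summarize(intervals):
    """Tally downtime and event counts from labeled down intervals."""
    cm_intervals = [(s, e) for s, e, kind in intervals if kind == "cm"]
    return {
        "cm_downtime": sum(e - s for s, e in cm_intervals),
        "total_downtime": sum(e - s for s, e, _ in intervals),
        "system_failures": len(cm_intervals),
        "total_downing_events": len(intervals),
    }


summary = summarize(DOWN_INTERVALS)
print(summary)
```

The tally gives a CM downtime of 24, a total downtime of 31, 3 system failures and 4 downing events, in agreement with Figure 3.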

Additional Rules and Assumptions for Standby Containers

  1. A container will only attempt to switch if there is an available non-failed item to switch to. If there is no such item, it will switch if and when an item becomes available. The switch action is cancelled if the failed active unit is restored before a standby item becomes available.

    1. As an example, consider the case of unit A failing active while unit B failed in a quiescent mode. If unit B gets restored before unit A, then the switch will be initiated. If unit A is restored before unit B, the switch action will not occur.
  2. In cases where not all active units are required, a switch will only occur if the failed combination causes the container to fail.
    1. For example, if A, B and C are in a container for which one unit is required to be operating and A and B are active with C on standby, then the failure of either A or B will not cause a switching action. The container will switch to C only if both A and B are failed.
  3. If the container switch is failed and a switching action is required, the switching action will occur after the switch has been restored if it is still required (i.e. if the active unit is still failed).
  4. If a switch fails during the delay time of the switching action based on the reliability distribution (quiescent failure mode), the action is still carried out unless a failure based on the switch probability/restarts occurs when attempting to switch.
  5. During switching events, the change from the operating to quiescent distribution (and vice versa) occurs at the end of the delay time.
  6. The option of whether components operate while the system is down is defined at the container level. Contained items inherit this property from the container (the same applies in a load sharing container). However, regardless of the container settings:
    1. If a path inside the container is down, blocks inside the container that are in that path do not continue to operate.
    2. Blocks that are up do not continue to operate while the container is down.
  7. A switch can have a repair distribution and maintenance properties without having a reliability distribution.
    1. This is because maintenance actions are utilized regardless of whether the switch failed while waiting to switch (reliability distribution) or during the actual switching process (fixed probability).
  8. A switch fails during switching when the restarts are exhausted.
  9. A restart is executed every time the switch fails to switch (based on its fixed probability of switching).
  10. If a delay is specified, restarts happen after the delay.
  11. If a container brings the system down, the container is responsible for the system going down (not the blocks inside the container).
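Rule 2 (when a switch is triggered) and rules 8 and 9 (restarts) can be sketched in Python as follows. This is one possible reading of the rules and not BlockSim's implementation: the function names are hypothetical, and the sketch assumes that one initial attempt is made and that each failed attempt consumes one restart.

```python
import random


def switch_required(active_units_up, units_required):
    """Rule 2: switch only when the active units can no longer meet the requirement."""
    return active_units_up < units_required


def attempt_switch(p_success, max_restarts, rng=random.random):
    """Rules 8 and 9: retry until success or until the restarts are exhausted.

    Returns (succeeded, attempts_made). Assumes one initial attempt
    followed by up to max_restarts restarts.
    """
    attempts = 0
    for _ in range(max_restarts + 1):
        attempts += 1
        if rng() < p_success:  # each attempt succeeds with fixed probability
            return True, attempts
    return False, attempts  # restarts exhausted: the switch has failed


# A and B active, C on standby, one unit required:
print(switch_required(active_units_up=1, units_required=1))  # one active unit lost
print(switch_required(active_units_up=0, units_required=1))  # both active units lost

# The example in this article uses a 100% switch probability.
print(attempt_switch(p_success=1.0, max_restarts=0))
```

Passing a custom `rng` makes the behavior reproducible, which is convenient when checking a deterministic case by hand.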

Load Sharing Containers

In the case of a load sharing container, the container defines the load that is shared. A load sharing container has no failure or repair distributions. The container itself is considered failed if all the blocks inside the container have failed (or, in a k-out-of-n configuration, if fewer than k blocks remain operating). To illustrate this, consider the following container with items A and B in a load sharing redundancy.

Assume that A fails every 100 tu and B every 120 tu if both items are operating and they fail in half that time if either is operating alone (i.e. the items age twice as fast when operating alone). They both get repaired in 5 tu.

Figure 4: Behavior of load sharing system

The system event log is shown in Figure 4 and is as follows:

  1. At 100, A fails. It takes 5 tu to restore A.
  2. From 100 to 105, B is operating alone and is experiencing a higher load.
  3. At 115, B fails. B would normally be expected to fail at 120, however:
    1. From 0 to 100, it accumulated the equivalent of 100 tu of damage.
    2. From 100 to 105, it accumulated 10 tu of damage, which is twice the damage since it was operating alone. Put another way, B aged by 10 tu over a period of 5 tu.
    3. At 105, A is restored but B has only 10 tu of life remaining at this point.
    4. B fails at 115.
  4. At 120, B is repaired.
  5. At 200, A fails again. A would normally be expected to fail at 205 (100 tu after its restoration at 105); however, the downtime of B from 115 to 120 added extra damage to A. In other words, the age of A at 115 was 10; by 120 it was 20. Its remaining 80 tu of life were used up by 200, only 95 tu after it was restored.
  6. A is restored by 205.
  7. At 235, B fails. B would normally be expected to fail at 240; however, the downtime of A from 200 to 205 shortened B's remaining life.
    1. At 200, B had an age of 80.
    2. By 205, B had an age of 90.
    3. B fails 30 tu later at 235.
  8. The system itself never failed.
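The damage accounting in this log can be reproduced with a small deterministic simulation. The sketch below is an illustration under the assumptions of this example only: fixed lives, a fixed repair time, repair restoring a unit to age zero, and a unit aging at a constant multiple of the normal rate while it operates alone. The function name and parameters are hypothetical and do not correspond to anything in BlockSim.

```python
def simulate_load_sharing(lives, repair_time, alone_factor, horizon):
    """Return the list of (time, unit) failures up to the horizon.

    lives: dict mapping unit name to its life under shared load.
    alone_factor: aging multiplier while a unit operates alone.
    """
    age = {u: 0.0 for u in lives}
    repair_done = {u: None for u in lives}  # None means the unit is up
    t = 0.0
    failures = []
    while True:
        up = [u for u in lives if repair_done[u] is None]
        rate = alone_factor if len(up) == 1 else 1.0
        # Next candidate event for every unit: a failure or a restoration.
        events = []
        for u in lives:
            if repair_done[u] is None:
                events.append((t + (lives[u] - age[u]) / rate, "fail", u))
            else:
                events.append((repair_done[u], "restore", u))
        t_next, kind, unit = min(events)
        if t_next > horizon:
            return failures
        # Operating units accumulate damage at the current rate.
        for u in up:
            age[u] += (t_next - t) * rate
        t = t_next
        if kind == "fail":
            failures.append((t, unit))
            repair_done[unit] = t + repair_time
        else:
            repair_done[unit] = None
            age[unit] = 0.0  # assumed: repair restores the unit to as good as new


log = simulate_load_sharing({"A": 100.0, "B": 120.0}, repair_time=5.0,
                            alone_factor=2.0, horizon=240.0)
print(log)
```

Run with the values above, the simulation reports failures of A at 100, B at 115, A at 200 and B at 235, the same sequence as in Figure 4.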

Additional Rules and Assumptions for Load Sharing Container

  1. The option of whether components operate while the system is down is defined at the container level. Contained items inherit this property from the container. However, regardless of the container settings:
    1. If a path inside the container is down, blocks inside the container that are in that path do not continue to operate.
    2. Blocks that are up do not continue to operate while the container is down.
  2. If a container brings the system down, the block that brought the container down is responsible for the system going down. (This is the opposite of standby containers.)
ReliaSoft Corporation

Copyright 2003 ReliaSoft Corporation, ALL RIGHTS RESERVED