Reliability HotWire: eMagazine for the Reliability Professional
Issue 31, September 2003

Hot Topics

Using Maintenance Policies

In BlockSim 6, there are two methods for analyzing a reliability block diagram: analytically and via simulation. The analytical method is advantageous because you can obtain a closed-form reliability equation and duplicate the results. However, the analytical solution does not take into account any maintenance policies that you may have defined for the blocks in the diagram. One of the most important benefits of simulation is the ability to define how and when the actions are performed. In particular, the actions of interest are the part repairs and replacements. This can be accomplished in BlockSim 6 by defining maintenance policies. There are three different types of maintenance policies that can be defined for actions in BlockSim 6: corrective maintenance, preventive maintenance and inspection.

Corrective Maintenance Policies

A corrective maintenance policy defines when a corrective maintenance (CM) action is performed. Figure 1 shows a corrective maintenance policy that has been assigned to a block.

Figure 1: Corrective maintenance policy

Corrective actions can be performed either immediately upon failure of the item or upon finding that the item has failed (for "hidden" failures that are not detected until an inspection). BlockSim allows the selection of either category. If "Upon Failure" is selected, the CM action is initiated immediately upon failure. If a policy has not been set for a block, then this is the default option. If the "Upon Inspection" option is selected, then the CM action will only be initiated after an inspection is done on the failed component. How and when the inspections are performed is defined by the block's inspection properties and also by the inspection policy. This has the effect of defining a dependency between the corrective maintenance policy and the inspection policy, as shown in Figure 2.

Figure 2: Cascading dependencies present when CM "Upon Inspection" has been specified

Inspection Policies

Figure 2 shows the options available in an inspection policy within BlockSim. Inspections can be performed upon a fixed time interval. This is either based on the item's age (item clock) or the system's age (system clock). Furthermore, inspections can also be set to occur if the system goes down or if another group item goes down. Within BlockSim, items are considered to be in the same group if they have the same non-zero "Item Group #." Note that the default value for this is 0. Zero is a reserved number and it means that the item does not belong to any group. Inspections do not bring the item down by default.
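The grouping rule can be stated compactly in code. The following Python sketch is illustrative only: the function name `in_same_group` and the use of plain integers for the "Item Group #" are assumptions made for this example and are not part of BlockSim.

```python
def in_same_group(group_a: int, group_b: int) -> bool:
    """Return True when two items share a non-zero Item Group #.

    Zero is reserved and means "no group", so two items that both keep
    the default value of 0 are not considered grouped.
    """
    return group_a != 0 and group_a == group_b


print(in_same_group(1, 1))  # both items in Group 1
print(in_same_group(0, 0))  # both items left at the default of 0
print(in_same_group(1, 2))  # items in different groups
```

Only the first call reports a shared group; two items left at the default of 0 are treated as ungrouped.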

Preventive Maintenance Policies

Figure 3 shows the options available in a preventive maintenance (PM) policy within BlockSim. Much like inspections, PMs can be performed upon a fixed time interval. This is either based on the item's age (item clock) or the system's age (system clock). Furthermore, PM actions can also be set to occur if the system goes down or if another group item goes down. Because PM actions always bring the item down, one can also specify whether preventive maintenance will be performed if the action brings the system down.

Figure 3: Preventive maintenance policy options
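The effect of the last option can be sketched as a simple decision. This is a minimal Python illustration under stated assumptions: the class `PMPolicy`, its field `skip_if_action_downs_system` and the function `pm_is_performed` are hypothetical names chosen for this example, not BlockSim identifiers.

```python
from dataclasses import dataclass


@dataclass
class PMPolicy:
    # Hypothetical field mirroring the policy option that suppresses a PM
    # when performing it would bring the system down.
    skip_if_action_downs_system: bool = False


def pm_is_performed(policy: PMPolicy, action_downs_system: bool) -> bool:
    """Decide whether a triggered PM goes ahead under the given policy."""
    if policy.skip_if_action_downs_system and action_downs_system:
        return False
    return True


strict = PMPolicy(skip_if_action_downs_system=True)
print(pm_is_performed(strict, action_downs_system=True))
print(pm_is_performed(strict, action_downs_system=False))
print(pm_is_performed(PMPolicy(), action_downs_system=True))
```

With the option set, a PM that would take the system down is skipped, while the same PM on a block that does not down the system still goes ahead.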

Item and System Ages

It is important to keep in mind that the system and each component of the system maintain separate clocks within the simulation. Figure 4 illustrates system and item clocks. The system clock is the elapsed simulation time, while the item clock is the age of the item since its last renewal. If the system clock is used, the inspection will be performed every X time units. If the item clock is used, the inspection will be performed each time the component reaches that age. As an example, if the inspection is set to be performed at a system age of 100, then an inspection will be performed at 100, 200, 300 and so forth. If the inspection is set based on an item's age of 100, then the inspection will be performed when the item reaches an age of 100.

Figure 4: The system and each block maintain different clocks during each simulation
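The difference between the two clocks can be illustrated with a short Python sketch. It is a simplification, not BlockSim's algorithm: the function names are hypothetical, renewals are treated as instantaneous and the item is assumed never to be down, so that its age advances with the system clock between renewals.

```python
def system_clock_inspections(interval, end_time):
    """Inspection times when the interval is measured on the system clock."""
    times = []
    t = interval
    while t <= end_time:
        times.append(t)
        t += interval
    return times


def item_clock_inspections(interval, end_time, renewals):
    """Inspection times when the interval is measured on the item clock.

    `renewals` lists the system times at which the item is renewed and its
    age is reset to zero. Within each life of the item, the inspection
    occurs once, when the item's age reaches `interval`.
    """
    times = []
    starts = [0] + sorted(renewals)
    ends = sorted(renewals) + [float("inf")]
    for start, end in zip(starts, ends):
        t = start + interval
        if t < end and t <= end_time:
            times.append(t)
    return times


print(system_clock_inspections(100, 300))       # [100, 200, 300]
print(item_clock_inspections(100, 300, [150]))  # [100, 250]
```

In the second call the item is renewed at a system time of 150, so it next reaches an age of 100 at a system time of 250 rather than 200.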

Rules for PMs and Inspections

All the options available on the Maintenance tab of the Block Properties window and the associated policies were designed to maximize the modeling flexibility within BlockSim. However, maximizing the modeling flexibility introduces issues that the user needs to be aware of and requires the user to carefully make selections that do not contradict one another. One obvious case would be to define a PM action on a component in series (which will always bring the system down) and then assign a PM policy to the block that has the "Do not perform maintenance if the action brings the system down" option set. With these settings, no PMs will ever be performed on the component during the BlockSim simulation. The following sections summarize some issues and special cases for the user to consider when defining maintenance properties and policies in BlockSim.

  1. Inspections do not consume spare parts. However, an inspection can have a renewal effect on the component if the restoration factor is set to a number other than the default of 0.
  2. On the inspection tab, if "Inspection brings system down" is selected, this also implies that the inspection brings the item down.
  3. If a PM or an inspection is scheduled based on the item's age, then it will occur exactly when the item reaches that age. However, it is important to note that failed items do not age. Thus, if an item fails before it reaches that age, the action will not be performed. This means that if the item fails before the scheduled inspection (based on item age) and the CM is set to be performed upon inspection, the CM will never take place. The reason that this option is allowed in BlockSim is for the flexibility of specifying renewing inspections.
  4. Downtime due to a failure discovered during a non-downing inspection is included when computing results "w/o PM & Inspections."
  5. If a PM upon item age is scheduled and is not performed because it brings the system down (based on the option in the PM policy) the PM will not happen unless the item reaches that age again (after restoration by CM, inspection or another type of PM).
  6. If the CM policy is upon inspection and a failed component is scheduled for PM prior to the inspection, the PM action will restore the component and the CM will not take place.
  7. In the case of simultaneous events, only one event is executed. The following precedence order is used: inspection, preventive maintenance, corrective maintenance.
  8. The PM option of "Do not perform maintenance if the action brings the system down" is only considered at the time that PM needs to be initiated. If the system is down at that time, due to another item, then the PM will be performed regardless of any future consequences to the system up state. In other words, when the other item is fixed, it is possible that the system will remain down due to this PM action. In this case, the PM time difference is added to the system PM downtime.
  9. If the CM policy is upon inspection, the CM is always initiated after the inspection time has elapsed. If a restoration factor is present for both CM and inspection then this is compounded. In this case, the current age, CA, at the end of the CM is

  10. If a failure or PM occurs during a non-downing inspection and the CM or PM has a restoration factor and the inspection action has a restoration factor, then both restoration factors are used (compounded).
  11. Downing events cannot overlap. If a component is down due to a PM and another PM is suggested based on another trigger, the second call is ignored.
  12. Non-downing events can overlap with downing events. If a non-downing inspection and a downing event happen concurrently, the non-downing event will be dealt with in parallel with the downing event.
  13. A PM or inspection on system down is triggered only if the system was up at the time that the event brought the system down.
  14. A non-downing inspection with a restoration factor restores the block based on the age of the block at the end of the inspection.
  15. A non-downing inspection with a restoration factor of 0 does not affect the block.
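The precedence rule for simultaneous events (rule 7 above) can be sketched as follows. This is an illustration under assumptions: the event labels and the function `resolve_simultaneous` are hypothetical and only encode the stated order of inspection, then preventive maintenance, then corrective maintenance.

```python
# Lower rank wins: inspection, then preventive, then corrective maintenance.
PRECEDENCE = {"inspection": 0, "pm": 1, "cm": 2}


def resolve_simultaneous(events):
    """Return the single event executed when several occur at the same instant."""
    return min(events, key=PRECEDENCE.__getitem__)


print(resolve_simultaneous(["pm", "inspection"]))  # inspection
print(resolve_simultaneous(["cm", "pm"]))          # pm
```

Only the winning event is executed; the others at that instant are dropped, which is why a PM that coincides with an inspection on the same block does not take place.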

Example

To illustrate the use of maintenance policies in BlockSim, consider the following system.

Figure 5: Example of system utilizing maintenance policies

The properties of each block are defined in the following table:

The properties of each crew are defined as follows:

The spare part pool is also defined.

The corrective and inspection settings for blocks A and D are shown in Figure 6.

Figure 6: Corrective and Inspection policies for blocks A and D

Blocks A and D:

  1. Belong to the same group (Group 1).
  2. Corrective maintenance actions are upon inspection (not upon failure) and the inspections are performed every 30 tu based on system time. Inspections have a duration of 1 tu. Furthermore, unlimited free crews are available to perform the inspections.
  3. Whenever either item fails, the other one gets a PM.
  4. The PM has a fixed duration of 10 tu.
  5. The same crews are used for both corrective and preventive maintenance actions.

The preventive maintenance policies for blocks A and D are shown in Figure 7.

Figure 7: Preventive maintenance policies for blocks A and D

System Overview

The item and system behavior from 0 to 300 hours is shown in Figure 8.

Figure 8: Up/down event sequence for the system and the blocks

  1. At 100, block A goes down and brings the system down.
    1. No maintenance action is performed since an "upon inspection" policy was utilized.
    2. The next scheduled inspection is at 120, thus Crew A is called to perform the maintenance by 121 (end of the inspection).
  2. Crew A arrives and initiates the repair on A at 131.
    1. The only part in the pool is utilized and an on-condition restock is triggered.
    2. Pool [on-hand = 0, pending: 150s, 181].
    3. Block A is repaired by 141.
  3. At the same time (121), a PM is initiated for block D because the PM policy called for "PM upon a maintenance action on another group item."
    1. Crew B is called for block D and arrives at 136.
    2. No part is available until 150. An on-condition restock is triggered for 181.
    3. Pool [on-hand = 0, pending: 150s, 181, 181].
    4. At 150, a part becomes available and the PM is completed by 160.
    5. Pool [on-hand = 0, pending: 181, 181].
  4. At 161, block B fails (corrective maintenance upon failure).
    1. Block B gets Crew A, which arrives at 171.
    2. No part is available until 181. An on-condition restock is triggered for 221.
    3. Pool [on-hand = 0, pending: 181, 181, 221].
    4. A part arrives at 181.
    5. The repair is completed by 201.
    6. Pool [on-hand = 0, pending: 181, 221].
  5. At 162, block C fails.
    1. Block C gets Crew B, which arrives at 177.
    2. No part is available until 181. An on-condition restock is triggered for 222.
    3. Pool [on-hand = 0, pending: 181, 221, 222].
    4. A part arrives at 181.
    5. The repair is completed by 201.
    6. Pool [on-hand = 0, pending: 221, 222].
  6. At 163, block F fails and brings the system down.
    1. Block F calls Crew A then B. Both are busy.
    2. Crew A will be the first available so F calls A again and waits.
    3. No part is available until 221. An on-condition restock is triggered for 223.
    4. Pool [on-hand = 0, pending: 221, 222, 223].
    5. Crew A arrives at 211.
    6. Repair begins at 221.
    7. Repair is completed by 241.
    8. Pool [on-hand = 0, pending: 222, 223].
  7. At 298, block A goes down and brings the system down.
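The completion times in the sequence above follow one pattern: work starts once both the crew and a spare part are available, and ends one task duration later. The Python sketch below replays that pattern. Because the block and crew property tables are not reproduced here, the crew delays and repair durations are inferred from the listed event times and should be read as assumptions; only the PM duration of 10 tu is stated explicitly. The class `Job` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Job:
    block: str
    crew_called: int  # time at which the accepting crew takes the call
    crew_delay: int   # crew delay before arrival (inferred from the events)
    part_ready: int   # time at which a spare part is available for this job
    duration: int     # task duration (inferred, except the 10 tu PM)

    def finish(self) -> int:
        # Work begins when both the crew and the part are available.
        start = max(self.crew_called + self.crew_delay, self.part_ready)
        return start + self.duration


jobs = [
    Job("A", crew_called=121, crew_delay=10, part_ready=121, duration=10),
    Job("D", crew_called=121, crew_delay=15, part_ready=150, duration=10),
    Job("B", crew_called=161, crew_delay=10, part_ready=181, duration=20),
    Job("C", crew_called=162, crew_delay=15, part_ready=181, duration=20),
    Job("F", crew_called=201, crew_delay=10, part_ready=221, duration=20),
]

finish_times = {job.block: job.finish() for job in jobs}
print(finish_times)  # {'A': 141, 'D': 160, 'B': 201, 'C': 201, 'F': 241}
```

For blocks D, B, C and F the spare part, not the crew, is the limiting resource, which is why their work starts at the part arrival time.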

The simulation results are shown next in Figure 9.

Figure 9: Simulation results

System Uptimes/Downtimes

  1. System Uptime: This is 200 tu.
    1. This can be obtained by observing the following system up durations, 0 to 100, 160 to 163 and 201 to 298.
  2. System CM Downtime: This is 58 tu.
    1. Observe that even though the system failed at 100, the CM action (on block A) was initiated at 121 and lasted until 141, thus only 20 tu of this downtime are attributed to the CM action.
    2. The next CM action started at 163 when block F failed and lasted until 201 when blocks B and C were restored, thus adding another 38 tu of CM downtime.
  3. System Inspection Downtime: This is 1 tu.
    1. The only time the system was under inspection was from 120 to 121, during the inspection of block A.
  4. System PM Downtime: This is 19 tu.
    1. Note that the entire PM action duration on block D was from 121 to 160.
    2. Until 141, and from the system perspective, the CM on block A was the cause for the downing. Once block A was restored (at 141), then the reason for the system being down became the PM on block D.
    3. Thus, the PM on block D was only responsible for the downtime after block A was restored, or from 141 to 160.
  5. System Total Downtime: This is 100 tu.
    1. This includes all of the above downtimes plus the 20 tu (100 to 120) and the 2 tu (298 to 300) that the system was down due to the undiscovered failure of block A.
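The uptime and downtime figures above can be checked by summing the intervals they are built from. The following Python sketch only restates the interval endpoints given in the text; the variable names are chosen for this example.

```python
SIM_TIME = 300

# System up intervals quoted for the uptime result.
up_intervals = [(0, 100), (160, 163), (201, 298)]
uptime = sum(end - start for start, end in up_intervals)

# Downtime attributed to each cause, from the interval endpoints in the text.
downtime = {
    "cm": (141 - 121) + (201 - 163),
    "inspection": 121 - 120,
    "pm": 160 - 141,
    "undiscovered failure": (120 - 100) + (300 - 298),
}
total_downtime = sum(downtime.values())

print(uptime, downtime, total_downtime)
```

The uptime of 200 tu and the total downtime of 100 tu together account for the full 300 tu of simulation time.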

System Metrics

  1. Mean Availability (All Events):
    1. This is the system uptime divided by the simulation time, or 200/300 = 0.6667.
  2. Mean Availability (w/o PM & Inspection):
    1. This is due to the CM downtime of 58, the undiscovered downtime of 22 and the inspection downtime of 1, or (300 - 81)/300 = 0.73.
    2. It should be noted that the inspection downtime was included even though the definition was "w/o PM & Inspection." The reason for this is that the inspection did not cause the downtime in this case. Only downtimes caused by the PM or inspections are excluded.
  3. Point Availability and Reliability at 300 are zero because the system was down at 300.
  4. Expected Number of Failures is 3.
    1. The system failed at 100, 163 and 298.
  5. The MTTFF is 100 because the example is deterministic.
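The two mean availability values differ only in which downtime is counted. The Python sketch below illustrates this, assuming mean availability is the uncounted share of the simulation time; the function `mean_availability` and the flag attached to each downtime entry are hypothetical constructs for this example. Note that the 1 tu of inspection downtime is flagged as not caused by the inspection, since the system was already down when block A was inspected.

```python
def mean_availability(sim_time, downtimes, exclude_maintenance_caused=False):
    """Mean availability as 1 - (counted downtime / simulation time).

    `downtimes` maps a label to (amount, caused_by_pm_or_inspection).
    When `exclude_maintenance_caused` is True, downtime caused by a PM or
    an inspection is left out of the calculation.
    """
    counted = sum(
        amount
        for amount, caused_by_pm_or_inspection in downtimes.values()
        if not (exclude_maintenance_caused and caused_by_pm_or_inspection)
    )
    return (sim_time - counted) / sim_time


downtimes = {
    "cm": (58, False),
    "undiscovered failure": (22, False),
    "inspection": (1, False),  # system was already down; not the cause
    "pm": (19, True),
}

all_events = mean_availability(300, downtimes)
without_pm_and_inspection = mean_availability(300, downtimes, True)
print(round(all_events, 4), round(without_pm_and_inspection, 4))
```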

System Downing Events

  1. Number of Failures is 3.
    1. The first is the failure of block A, the second is the failure of block F and the third is the failure of block A.
  2. Number of CMs is 2.
    1. The first is the CM on block A and the second is the CM on block F.
  3. Number of Inspections is 1.
  4. Number of PMs is 1.
  5. Total Events are 6. These are events that the downtime can be attributed to. Specifically, the following events were observed:
    1. The failure of block A at 100.
    2. Inspection on block A at 120.
    3. The CM action on block A.
    4. The PM action on block D (after A was fixed).
    5. The failure of block F at 163.
    6. The failure of block A at 298.

Block Details

The details for blocks A, B, C, D and F are shown in Figure 10.

Figure 10: Block details

First note that there are four downing events on block A: initial failure, inspection and CM, plus the last failure at 298. All others have just one. Also, block A had a total downtime of 41 + 2, giving it a mean availability of 0.8567. The first time-to-failure for block A occurred at 100 while the second occurred after 298 - 141 = 157 hours of operation, yielding an average time between failures (MTBF) of 257/2 = 128.5 (note that this is the same as uptime/failures). Block D never failed so its MTBF cannot be determined. Furthermore, MTBDE for each item is determined by dividing the block's uptime by the number of events. The RS FCI and RS DECI metrics are obtained by looking at the SD Failures and SD Events of the item and the number of system failures and events. Specifically, the only items that caused system failure are blocks A and F; A at 100 and 298 and F at 163. It is important to note that even though one could argue that block F alone did not cause the failure (B and C were also failed), the downing was attributed to F because the system reached a failed state only when block F failed.

On the number of inspections, which were scheduled every 30 tu, nine occurred for block A [30, 60, 90, 120, 150, 180, 210, 240, 270] and eight for block D. Block D did not get inspected at 150 because block D was undergoing a PM action at that time.
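The block A figures and the inspection counts can be reproduced with a few lines of Python. This is a sketch based on the numbers quoted above; it assumes that an inspection falling exactly at the end of the simulation (300) is not counted, which matches the inspection times listed for block A, and that block D's PM spans 121 to 160 as in the event sequence.

```python
SIM_TIME = 300

# Block A: downtime of 41 + 2 and two failures (at 100 and 298).
block_a_uptime = SIM_TIME - (41 + 2)
block_a_availability = block_a_uptime / SIM_TIME
block_a_mtbf = block_a_uptime / 2

# Inspections are scheduled every 30 tu on the system clock.
scheduled = list(range(30, SIM_TIME, 30))
inspections_a = scheduled
# Block D is not inspected while it is undergoing its PM (121 to 160).
inspections_d = [t for t in scheduled if not (121 <= t < 160)]

print(round(block_a_availability, 4), block_a_mtbf)
print(len(inspections_a), len(inspections_d))
```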

Crew Details

Figure 11 shows the crew results.

Figure 11: Crew results

Crew A received a total of six calls and accepted three. Specifically,

  1. At 121, the crew was called by block A and the call was accepted.
  2. At 121, block D also called for its PM action and was rejected. Block D then called crew B, which accepted the call.
  3. At 161, block B called crew A. Crew A accepted.
  4. At 162, block C called crew A. Crew A rejected and block C called crew B, which accepted the call.
  5. At 163, block F called crew A and then crew B and both rejected. Block F then waited until a crew became available at 201 and called that crew again. This was crew A, which accepted.

The total wait time is the time that blocks had to wait for the maintenance crew. Block F is the only component that waited, waiting 38 tu for crew A. Also, the costs for crew A were 1 per unit time and 10 per incident, thus the total costs were 30 + 100. The costs for crew B were 2 per unit time and 20 per incident, thus the total costs were 40 + 156.
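The crew cost totals can be checked with a short calculation. In the Python sketch below, each job interval runs from the time the crew accepted the call to the time the work was completed; these intervals are inferred from the event sequence and are an assumption, although they do reproduce the per-time costs of 100 and 156 quoted above.

```python
crews = {
    "A": {"per_time": 1, "per_incident": 10,
          "jobs": [(121, 141), (161, 201), (201, 241)]},  # blocks A, B, F
    "B": {"per_time": 2, "per_incident": 20,
          "jobs": [(121, 160), (162, 201)]},              # blocks D, C
}

costs = {}
for name, crew in crews.items():
    busy_time = sum(end - start for start, end in crew["jobs"])
    incident_cost = crew["per_incident"] * len(crew["jobs"])
    time_cost = crew["per_time"] * busy_time
    costs[name] = (incident_cost, time_cost, incident_cost + time_cost)

print(costs)  # {'A': (30, 100, 130), 'B': (40, 156, 196)}
```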

Pool Details

Figure 12 shows the spare part pool results. The pool started with a stock level of 1 and ended up with 2. Specifically:

  1. At 121, the pool dispensed a part to block A and ordered another to arrive at 181.
  2. At 121, it dispensed a part to block D and ordered another to arrive at 181.
  3. At 150, a scheduled part arrived to restock the pool.
  4. At 161 the pool dispensed a part to block B and ordered another to arrive at 221.
  5. At 181, it dispensed a part to block C and ordered another to arrive at 222.
  6. At 221, it dispensed a part to block F and ordered another to arrive at 223.
  7. The 222 and 223 arrivals remained in stock until the end.

Overall, five parts were dispensed. Blocks had to wait a total of 126 tu to receive parts (B: 181 - 161 = 20, C: 181 - 162 = 19, D: 150 - 121 = 29 and F: 221 - 163 = 58).

Figure 12: Pool details
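The pool totals follow from simple bookkeeping, sketched below in Python. The arrival times are taken from the pending lists in the event sequence (one scheduled restock at 150 and five on-condition restocks); the variable names are chosen for this example.

```python
# Time each block waited for a part: part arrival minus request time.
part_waits = {
    "B": 181 - 161,
    "C": 181 - 162,
    "D": 150 - 121,
    "F": 221 - 163,
}
total_wait = sum(part_waits.values())

initial_stock = 1
arrivals = [150, 181, 181, 221, 222, 223]  # scheduled restock, then on-condition restocks
parts_dispensed = 5                        # blocks A, D, B, C and F
final_stock = initial_stock + len(arrivals) - parts_dispensed

print(total_wait, final_stock)  # 126 2
```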

Special Cases

To illustrate some special cases that one needs to be aware of, consider the following diagram.

In this diagram, blocks A and D have the same properties as before, with the exception that the inspection duration is now set to zero. Furthermore, recall the rule that only one event is executed in the case of simultaneous events. In this case, when block A fails, the inspection on block A at 120 will find the failure of A, which will then trigger a PM event on block D at the same instant that D also gets an inspection. This causes two simultaneous events on block D and results in the cancellation of the PM event on block D. The reason for the cancellation is to avoid the recursive situation where the PM on D would trigger a PM on A, which is undergoing CM, which would trigger a PM on D and so forth. Different options can be used to avoid this. One is to assign a non-zero inspection duration. In this case, the PM on block D would get triggered after the inspection on block A, as seen in the prior example.

ReliaSoft Corporation

Copyright 2003 ReliaSoft Corporation, ALL RIGHTS RESERVED