Reliability HotWire: eMagazine for the Reliability Professional

Issue 39, May 2004

Reliability Basics

Foundations of Reliability

Several "up-front" activities must be undertaken before a reliability program can be implemented successfully. Most of these activities are relatively simple, but they are vital to the design and implementation of the program. They are like the foundation of a house: relatively inexpensive and largely forgotten once the major construction has begun, but vital to the structure nonetheless. These reliability "foundation building" concepts need to be addressed in order to put a strong reliability program in place.

This article discusses the following concepts: fostering a culture of reliability, product mission, reliability specifications and universal failure definitions.

Fostering a Culture of Reliability

The most important part of developing a reliability program is having a culture of reliability in the organization. It is vital that everyone involved in the product's production, from upper management to assembly personnel, understands that a sound reliability program is necessary for the organization's success.

Achieving this culture of reliability may actually be more difficult than it seems, as some organizations may not have the history or background that lends itself to the support of a reliability program. This can be particularly true in situations where the organization has had a niche market or little or no previous competition with the products that it produces. In the past, the organization's customers may have had to accept the reliability of the product, good or bad. As a consequence, the organization may have developed a mentality that tends to overlook the reliability of a product in favor of the "damn-the-torpedoes, full-steam-ahead" method of product development. In this type of organization, reliability engineering methods and practices tend to be viewed as superfluous or even wasteful. "We don't need all of this reliability stuff, we'll just find the problems and fix them," tends to be the attitude in these circumstances. Unfortunately, this attitude often results in poorly tested, unreliable products being shipped to customers.

The first step in developing the necessary culture of reliability is to have the support of the organization's top management. Without this, the implementation of a reliability program will be a difficult and frustrating process. Adequate education as to the benefits of a properly constructed reliability program will go a long way towards building support in upper management for reliability techniques and processes. Most important is the emphasis on the financial benefits that will accrue from a good reliability program, particularly in the form of decreased warranty costs and increased customer goodwill. This latter aspect of the benefits of reliability engineering can sometimes be an elusive concept to appreciate. An adage in the reliability field states that if customers are pleased with a product, they will tell eight other people. But if they are dissatisfied, they will tell 22 other people. While this anecdote is rather eye-opening, it must be put in a financial context to have the full impact. It is possible to construct a model that links reliability levels of a product with the probability of repeat sales. It is therefore possible to calculate a loss of sales revenue based on the unreliability of the product. This type of information is useful in educating upper management on the financial importance of reliability. Once the upper management has been adequately educated and is supportive of the implementation of reliability concepts, it will be a great deal easier to go about implementing those concepts.
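As a rough illustration of the kind of model mentioned above, the following Python sketch estimates the repeat-sales revenue lost to unreliability. It is only a sketch under stated assumptions, not a model taken from this article: the function name, the two repeat-purchase probabilities, and all input figures are hypothetical, and it assumes every customer either experiences a failure or does not during the period of interest.

```python
def expected_lost_repeat_revenue(units_sold, reliability,
                                 repeat_prob_if_ok, repeat_prob_if_failed,
                                 unit_price):
    """Estimate repeat-sales revenue lost because some units fail.

    All inputs are hypothetical planning figures (assumptions):
      reliability           - probability a unit survives the period of interest
      repeat_prob_if_ok     - chance a satisfied customer buys again
      repeat_prob_if_failed - chance a customer who saw a failure buys again
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    unreliability = 1.0 - reliability
    # Each failed unit lowers that customer's repeat-purchase probability.
    lost_repeat_sales = (units_sold * unreliability
                         * (repeat_prob_if_ok - repeat_prob_if_failed))
    return lost_repeat_sales * unit_price


# Hypothetical example: 10,000 units, 90% reliability, $500 per unit.
loss = expected_lost_repeat_revenue(10_000, 0.90, 0.60, 0.20, 500.0)
print(f"Estimated lost repeat revenue: ${loss:,.0f}")
```

With these made-up inputs the estimate is on the order of $200,000. The value of such a calculation lies less in the exact number than in showing management how quickly the loss grows as reliability drops.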

However, one should not stop with upper management when it comes to educating an organization on the benefits of a proposed reliability program. It is vital to have the support and understanding of the rest of the organization as well. Since the implementation of a reliability program will affect the day-to-day activities of middle management, the engineers and the technicians, it is also necessary to convince these groups of the benefits of reliability-oriented activities. It is important to demonstrate to them how these activities, which may initially seem pointless or counterproductive to them, will ultimately benefit the organization. For example, if test technicians have a good understanding of how the test data are going to be put to use, they will be less likely to cut corners while performing the test and recording the data. Overall, a reliability program stands the greatest chance of success if everyone in the organization understands the benefits and supports the techniques involved.

Product Mission

The reliability of a product is characterized in terms of its product mission (e.g., operate for 36 months or complete 1000 cycles). A textbook definition of reliability is:

"The conditional probability, at a given confidence level, that the equipment will perform its intended functions satisfactorily or without failure, i.e., within specified performance limits, at a given age, for a specified length of time, function period, or mission time, when used in the manner and for the purpose intended while operating under the specified application and operation environments with their associated stress levels."

With all of the conditions removed, this boils down to defining reliability as the ability of a product to perform its intended mission without failing. The definition of reliability springs directly from the product mission, in that product failure is the inability of the product to perform its defined mission.

Reliability Specifications

In order to develop a good reliability program for a product, the product must have good reliability specifications. These specifications should address most, if not all, of the conditions in the reliability definition above, including mission time, usage limitations, operating environment, etc. In many instances, this will require a detailed description of how the product is expected to perform reliability-wise. Use of a single metric, such as MTBF, as the sole reliability metric is inadequate. Even worse is the specification that a product will be "no worse" than the previous model. An ambiguous reliability specification leaves a great deal of room for error and this can result in a poorly-understood and unreliable product reaching the field.
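To see why MTBF alone is an inadequate specification, consider the following Python sketch. It assumes, purely for illustration, that product life follows a two-parameter Weibull distribution; the 10,000-hour MTBF, the 1,000-hour mission time, and the shape values are hypothetical. Three products with the same MTBF are compared on their reliability at the mission time.

```python
import math


def weibull_eta_for_mtbf(mtbf, beta):
    """Scale parameter eta that gives the requested mean life (MTBF).

    For a two-parameter Weibull, mean = eta * Gamma(1 + 1/beta).
    """
    return mtbf / math.gamma(1.0 + 1.0 / beta)


def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t: R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))


MTBF_HOURS = 10_000.0     # hypothetical specification
MISSION_HOURS = 1_000.0   # hypothetical mission time

# Same MTBF, different failure behavior (shape parameter beta).
for beta in (0.5, 1.0, 3.0):
    eta = weibull_eta_for_mtbf(MTBF_HOURS, beta)
    r = weibull_reliability(MISSION_HOURS, beta, eta)
    print(f"beta={beta:<3}  eta={eta:>9.1f} h  R({MISSION_HOURS:.0f} h)={r:.4f}")
```

Under these assumptions, the mission reliability ranges from roughly 0.64 (beta = 0.5) through about 0.90 (beta = 1) to above 0.999 (beta = 3), even though all three products meet the same MTBF figure. A specification that also states the mission time and the required reliability at that time removes this ambiguity.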

Of course, there may be situations in which an organization lacks the reliability background or history to easily define specifications for a product's reliability. In these instances, an analysis of existing data from previous products may be necessary. If enough information exists to characterize the reliability performance of a previous product, it should be a relatively simple matter to transform this historical product reliability characterization into specifications of the desired reliability performance of the new product.

Financial concerns will definitely have to be taken into account when formulating reliability specifications. Planning for warranty and production part costs is a significant part of financial planning for the release of a new product. Based on financial inputs such as these, a picture of the required reliability for a new product can be established. However, financial wishful thinking should not be the sole determinant of the reliability specifications. It can lead to problems such as unrealistic goals, specifications that change on a regular basis to fit test results, or test results that get "fudged" in order to conform to unrealistic expectations. It is necessary to couple the financial goals of the product with a good understanding of product performance in order to get a realistic specification for product reliability. A proper balance of financial goals and realistic performance expectations is necessary to develop a detailed and balanced reliability specification.

Universal Failure Definitions

Another important foundation for a reliability program is the development of universally agreed-upon definitions of product failure. This may seem a bit silly, in that it should be fairly obvious whether a product has failed or not, but such a definition is quite necessary for a number of different reasons.

One of the most important reasons is that different groups within the organization may have different definitions as to what sort of behavior actually constitutes a failure. This is often the case when comparing the different practices of design and manufacturing engineering groups. Identical tests performed on the same product by these groups may produce radically different results simply because the two groups have different definitions of product failure. In order for a reliability program to be effective, there must be a commonly accepted definition of failure for the entire organization. Of course, this definition may require a little flexibility depending on the type of product, development phase, etc., but as long as everyone is familiar with the commonly accepted definition of failure, communications will be more effective and the reliability program will be easier to manage.

Another benefit of having universally agreed-upon failure definitions is that it will minimize the tendency to rationalize away failures on certain tests. This can be a problem, particularly during product development, as engineers and managers may tend to overlook or diminish the importance of failure modes that are unfamiliar or not easily replicable. This tendency is only human and a person who has spent a great deal of time developing a product may find justification for writing off oddball failures as a "glitch" or as failure due to some other external error. However, this type of mentality also results in products with poorly-defined but very real failure modes being released into the field. Having a specific failure definition that applies to all or most types of tests will help to alleviate this problem.

However, a degree of flexibility is called for in the definition of failure, particularly with complex products that may have a number of distinct failure modes. For this reason, it may be advisable to have a multi-tiered failure definition structure that can accommodate the behavioral vagaries of complex equipment. The following three-level list of failure categories is provided as an example:

  • Type I - Failure: Severe operational incidents that will definitely result in a service call, such as part failures, unrecoverable equipment hangs, DOAs, consumables that fail/deplete before their specified life, onset of noise and other critical problems. These constitute "hard-core" failure modes that will require the services of a trained repair technician to recover.
  • Type II - Intervention: Any unplanned occurrence or failure of the product mission that requires the user to manually adjust or otherwise intervene with the product or its output. These are "nuisance failures" that can be recovered by the customer or with the aid of phone support. Depending on the nature of the failure mode, groups of Type II failures can be upgraded to Type I if they exceed a predefined frequency of occurrence.
  • Type III - Event: Events include all other occurrences that do not fall into either of the categories above. This might include events that cannot directly be classified as failures, but whose frequency is of engineering interest and is appropriate for statistical analysis. Examples include failures caused by test equipment malfunction or operator error.

During testing, all of these occurrences are logged with codes to separate the three failure types. Other test-process-related issues, such as deviations from test plans, are logged in a separate test log. There should be a timely review of logged occurrences to ensure proper classification prior to metric calculation and reporting.
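The logging and review step can be sketched in Python as follows. This is only one possible arrangement, not a prescribed implementation: the class and field names, the failure-mode labels, and the simple per-mode count threshold used to upgrade Type II occurrences to Type I are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Occurrence(Enum):
    """Log codes for the three-level list above."""
    FAILURE = "I"         # Type I: requires a service call
    INTERVENTION = "II"   # Type II: recoverable by the user
    EVENT = "III"         # Type III: everything else of interest


@dataclass(frozen=True)
class LogEntry:
    test_hours: float     # test time at which the occurrence was logged
    category: Occurrence  # code assigned when the occurrence was logged
    mode: str             # hypothetical failure-mode label


def counts_for_metrics(entries, type2_limit):
    """Count occurrences per category after the review step.

    Assumed upgrade rule: a Type II mode seen more than `type2_limit`
    times in the log is counted as Type I.
    """
    type2_by_mode = Counter(
        e.mode for e in entries if e.category is Occurrence.INTERVENTION
    )
    totals = Counter()
    for e in entries:
        category = e.category
        if (category is Occurrence.INTERVENTION
                and type2_by_mode[e.mode] > type2_limit):
            category = Occurrence.FAILURE
        totals[category] += 1
    return totals


# Hypothetical test log.
log = [
    LogEntry(12.0, Occurrence.INTERVENTION, "paper jam"),
    LogEntry(40.5, Occurrence.FAILURE, "motor seized"),
    LogEntry(55.0, Occurrence.INTERVENTION, "paper jam"),
    LogEntry(71.2, Occurrence.EVENT, "operator error"),
    LogEntry(90.0, Occurrence.INTERVENTION, "paper jam"),
    LogEntry(95.3, Occurrence.INTERVENTION, "manual reset"),
]
totals = counts_for_metrics(log, type2_limit=2)
for category in Occurrence:
    print(f"Type {category.value}: {totals[category]}")
```

Keeping the originally logged code on each entry and applying the upgrade rule only when the counts are computed preserves the raw record for later review.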

ReliaSoft Corporation

Copyright 2004 ReliaSoft Corporation, ALL RIGHTS RESERVED