Part 1: The Probability Density Function
Generally speaking, the object of performing a life data analysis is to be able to predict the future performance of a certain product. This prediction is based on the observed behavior of a relatively small group of these objects that are considered to be representative of the entire population. Occurrences of interest are observed, and then statistics is used to model the frequency and probability of these occurrences. Most frequently, the occurrences of interest are the failure of the units on test. A number of units are placed on test and run until failures occur. This data set is then analyzed and modeled, and predictions regarding the product's failure behavior are calculated. These predictions are made via the familiar life data analysis metrics such as reliability, failure rate, etc. The basis for these metrics is a mathematical function that models how the failure occurrences are distributed over time. This function is called the probability density function or pdf. It is the basis for almost all of the reliability metrics of interest.
In order to get a better idea of how the pdf is formulated, we will start with the concept of the histogram. Suppose we have the following set of data, consisting of 100 data points of an occurrence or event we wish to model or characterize.
These values could represent failure times, or product dimension variations, or any other information. At this point, what the data set represents is immaterial; we just have a representative sample of data from a process that we want to characterize. One way we can begin to do this is through the use of a histogram. In order to construct a histogram, we separate the data into "bins" or ranges, and count how many of the data points fall into each range. This information can then be plotted in a bar chart. Following is a histogram for the data with a range size of 50.
From this initial histogram, we can begin to characterize where the data set falls with respect to its values. Note that the y-axis is in terms of relative proportion rather than the raw number of data points falling into each range. In other words, we can use the graph to get an idea of how the data set is distributed. This graph illustrates that a relatively high proportion of the data, 47%, falls between the values of 0 and 50. The range from 51 to 100 contains 34% of the data, the range from 101 to 150 contains 14% of the data, and values of 151 or greater comprise 5% of the data. These values sum to 100%, indicating that we have accounted for all of our data points in this histogram.
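The binning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article's 100 data points are not reproduced here, so the sample is a hypothetical one drawn from a lognormal distribution with assumed parameters (4.0 and 0.75), and the helper name relative_histogram is invented for this example. The sketch also uses half-open ranges such as 0 up to (but not including) 50, rather than the 0 to 50 and 51 to 100 labels used in the text.

```python
import random

def relative_histogram(data, bin_width):
    """Return {bin_start: proportion} using half-open bins [start, start + bin_width)."""
    counts = {}
    for x in data:
        start = int(x // bin_width) * bin_width  # left edge of the bin holding x
        counts[start] = counts.get(start, 0) + 1
    n = len(data)
    # Divide each count by the sample size so the bars are relative proportions
    return {start: counts[start] / n for start in sorted(counts)}

# Hypothetical stand-in for the article's data: 100 lognormal draws (assumed parameters)
rng = random.Random(0)
sample = [rng.lognormvariate(4.0, 0.75) for _ in range(100)]

hist50 = relative_histogram(sample, 50)
for start, proportion in hist50.items():
    print(f"{start:>4} to {start + 50:<4}: {proportion:.2f}")
print(f"total: {sum(hist50.values()):.2f}")
```

Because every data point lands in exactly one range, the proportions always total 1 regardless of the range size chosen.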
This is a good start, but the histogram can be improved. Our first histogram contains only four ranges, which may not be sufficient to get a good idea of how the data points are distributed. We can try to refine the histogram by creating more ranges that are smaller in width. The following histogram shows the data with a range size of 25.
This represents an improvement over the initial histogram in that it gives us a more refined picture of where the data points tend to fall. It shows that the most likely area for the data is in the range from 26 to 50. Note that the scale of the y-axis has dropped from 0.5 to 0.35. This is because as we divide the same number of data points into smaller ranges, the relative values for the bars in the histogram will decrease. However, the values of all of the bars will still add up to 1, or 100%. We can further refine our histogram by again decreasing the size of the ranges. The following chart shows a histogram for the data with a range size of 10.
This further refines the picture of where the data points tend to fall, showing a sharp increase to a maximum in the 21 to 30 range, which gradually tails off to the right. Note that once again the y-axis range has decreased, although all of the bars in the chart would still add up to 1.
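The effect of shrinking the range size can be checked numerically. The sketch below is a rough illustration, not the article's actual data: it reuses a hypothetical lognormal sample (assumed parameters 4.0 and 0.75) and an invented helper, bin_proportions, and compares range sizes of 50, 25 and 10.

```python
import random

def bin_proportions(data, bin_width):
    """Proportion of the data in each half-open bin [start, start + bin_width)."""
    counts = {}
    for x in data:
        start = int(x // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return {start: count / len(data) for start, count in sorted(counts.items())}

# Hypothetical sample standing in for the article's 100 data points
rng = random.Random(0)
sample = [rng.lognormvariate(4.0, 0.75) for _ in range(100)]

summary = {}
for width in (50, 25, 10):
    props = bin_proportions(sample, width)
    tallest = max(props.values())
    total = sum(props.values())
    summary[width] = (len(props), tallest, total)
    print(f"width {width:>2}: {len(props):>2} non-empty bins, "
          f"tallest bar {tallest:.2f}, total {total:.2f}")
```

Each narrower bin here sits inside one of the width-50 bins, so its bar can never be taller than the bar that contains it, while the total across all bars stays at 1.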
If we had enough data, we could continue to create histograms with smaller and smaller ranges to get a more refined picture of the distribution of the data. The ultimate conclusion of this process would be a histogram that has, in theory, a range that is infinitesimally small. In other words, we would have a continuous function. Such functions do exist, and are the probability density functions, or pdfs, that are commonly used in life data analysis. Following is a plot of the pdf for our data set, using the lognormal distribution.
Note that the pdf for the data has the same shape that was discernible in the latter two histograms: rising rapidly from zero to a peak at a value near 30, then tailing off to the right. Also note that the y-axis extends only to 0.01. This is because the height of a pdf is a probability density rather than a proportion of the data, and the lognormal pdf is defined out to positive infinity. Just as the bars in the histograms always add up to 1, the area under the curve of a pdf is always equal to 1. Since that unit area is spread over a range that extends to infinity, the maximum y-axis value is going to be relatively small.
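The unit-area property can be verified numerically. The following sketch uses the standard two-parameter lognormal pdf with assumed parameters (MU = 4.0 and SIGMA = 0.75, chosen only so that the peak lands near 30; they are not the values fitted in the article). The function names and the upper integration limit of 2000 are likewise choices made for this illustration.

```python
import math

MU, SIGMA = 4.0, 0.75  # assumed log-mean and log-standard deviation, not fitted values

def lognormal_pdf(x, mu=MU, sigma=SIGMA):
    """Two-parameter lognormal density; zero for non-positive x."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def trapezoid_area(f, lo, hi, steps=100_000):
    """Approximate the area under f between lo and hi with the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

# The lognormal pdf peaks at its mode, exp(mu - sigma^2)
mode = math.exp(MU - SIGMA ** 2)
peak_height = lognormal_pdf(mode)

# Integrate far enough into the right tail that the remaining area is negligible
area = trapezoid_area(lognormal_pdf, 0.0, 2000.0)

print(f"mode: {mode:.1f}, peak height: {peak_height:.4f}, area: {area:.4f}")
```

The peak height is a small number because it is a density per unit of x; only the area under the curve is constrained to equal 1.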
The formal mathematical definition of a pdf is given by:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

where f(x) is the pdf of the random variable X.
In other words, the probability that X takes on a value in the interval [a,b] is the area under the density function from a to b. This is represented graphically in the following plot.
In reliability terms, this function gives us the probability that a failure occurs between time a and time b. This function completely describes the distribution, and is the basis for almost all of the familiar reliability and life data functions.
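As a rough numerical illustration of this interpretation, the sketch below computes the probability of a failure between two times in two ways: as the area under an assumed lognormal pdf, and as the difference of the corresponding cumulative distribution function at the two times. The parameters (MU = 4.0, SIGMA = 0.75), the interval from 50 to 100 and all function names are assumptions made for this example and do not come from the article's data.

```python
import math

MU, SIGMA = 4.0, 0.75  # assumed lognormal parameters, for illustration only

def lognormal_pdf(t):
    """Lognormal density at time t; zero for non-positive t."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - MU) / SIGMA
    return math.exp(-0.5 * z * z) / (t * SIGMA * math.sqrt(2.0 * math.pi))

def lognormal_cdf(t):
    """Probability that failure occurs at or before time t."""
    if t <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - MU) / (SIGMA * math.sqrt(2.0))))

def prob_between(a, b, steps=10_000):
    """Area under the pdf from a to b, by the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (lognormal_pdf(a) + lognormal_pdf(b))
    for i in range(1, steps):
        total += lognormal_pdf(a + i * h)
    return total * h

a, b = 50.0, 100.0
by_area = prob_between(a, b)                    # integral of the pdf over [a, b]
by_cdf = lognormal_cdf(b) - lognormal_cdf(a)    # same probability from the cdf

print(f"P({a:g} <= T <= {b:g}) by area: {by_area:.4f}, by cdf: {by_cdf:.4f}")
```

The two results agree to within the accuracy of the numerical integration, which is one way to see that the pdf fully determines interval probabilities.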
In next month's Reliability Basics, we'll look at how the pdf is used to develop other commonly used functions.
Copyright © 2002 ReliaSoft Corporation, ALL RIGHTS RESERVED