Statistical Intervals

What are tolerance and prediction intervals? How are they different from confidence intervals? This month, we will define these three types of statistical intervals, describe two types of statistical studies and summarize when to use each type of interval based on the type of study to be performed.

Confidence intervals are the most commonly used of the three. A confidence interval is computed from a sample by a procedure that, if random samples were drawn repeatedly from the same population, would produce intervals containing the true population (or process) parameter or metric x% of the time.
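The "x% of the time" idea can be checked by simulation. The sketch below is only an illustration under simplifying assumptions: the data are normal, the standard deviation is treated as known (so a z-based interval for the mean applies), and the function names and parameter values are made up for this example.

```python
import random
from statistics import NormalDist, fmean

def z_interval(sample, sigma, confidence=0.95):
    """Two-sided confidence interval for the mean, assuming sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width

def coverage(true_mean=50.0, sigma=4.0, n=20, repeats=2000, seed=1):
    """Fraction of repeated random samples whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma)
        hits += low <= true_mean <= high
    return hits / repeats

if __name__ == "__main__":
    # The observed fraction should land close to the nominal 0.95.
    print(f"observed coverage: {coverage():.3f}")
```

Any single interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds over many repeated samples.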

Prediction intervals apply to the situation in which a statement needs to be made about a future population that does not currently exist. For example, samples are provided of a prototype and we need to know if future versions of the design will exhibit the same characteristic of interest as the prototype.
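As a small illustration, one simple distribution-free prediction interval uses the smallest and largest values in a sample of n units: a single future observation falls between them with probability (n - 1)/(n + 1). This is a swapped-in textbook technique chosen because it needs no tables, not a method taken from this article, and it assumes the future unit comes from the same continuous process as the sample. The function names are hypothetical.

```python
import random

def range_prediction_interval(sample):
    """Use [min, max] of the sample as a prediction interval for one future value."""
    return min(sample), max(sample)

def nominal_level(n):
    """Chance that one future observation lands inside [min, max] of n values."""
    return (n - 1) / (n + 1)

def simulated_level(n=19, repeats=5000, seed=7):
    """Check the nominal level by simulation on a skewed (exponential) process."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        future = rng.expovariate(1.0)  # assumed to come from the same process
        low, high = range_prediction_interval(sample)
        hits += low <= future <= high
    return hits / repeats

if __name__ == "__main__":
    print(f"nominal: {nominal_level(19):.2f}, simulated: {simulated_level():.3f}")
```

If the future process differs from the one that produced the prototype samples, the stated level no longer applies, which is exactly the risk in the prototype example above.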

Tolerance intervals describe, for a sampled population (or process), an interval that contains at least a certain percentage of the population x% of the time. Note that for a normal distribution, a lower one-sided x% tolerance bound to be exceeded by y% of the population is equivalent to a lower one-sided x% confidence bound for the (100 - y)th percentile; an upper one-sided x% tolerance bound to exceed y% of the population is equivalent to an upper one-sided x% confidence bound for the yth percentile.
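The link between a one-sided tolerance bound and a confidence bound on a percentile is easiest to see in the distribution-free case, where the sample minimum serves as the lower bound. The sketch below uses that swapped-in technique, which requires only a continuous distribution rather than a normal one; the function names are hypothetical. The minimum of n observations lies below the percentile that a fraction `coverage` of the population exceeds unless all n observations fall above that percentile, which happens with probability coverage ** n.

```python
import math

def confidence_min_is_lower_bound(n, coverage):
    """Confidence that at least `coverage` of the population exceeds the sample minimum."""
    return 1.0 - coverage ** n

def smallest_n(coverage=0.95, confidence=0.95):
    """Smallest sample size for which the minimum is a lower tolerance bound
    exceeded by `coverage` of the population with the stated confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(coverage))

if __name__ == "__main__":
    n = smallest_n(0.95, 0.95)
    print(n, round(confidence_min_is_lower_bound(n, 0.95), 4))
```

Under these assumptions, 59 units are needed before the sample minimum can be quoted as a bound that 95% of the population exceeds with 95% confidence.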

To help understand the differences among the three types of intervals and their applications, let’s discuss some basic assumptions about sample data.

Statistical intervals represent the uncertainty that exists in the data because we work with samples that are obtained from a larger population or process. As the number of samples we have to work with increases, the width of the confidence interval decreases. Uncertainty can also be introduced through measurement error, so it is important to understand the amount of uncertainty introduced by the measurement system used. Finally, uncertainty can exist in the data when the data is not "representative" of the population or process. We need to remember that any statistical interval we calculate represents only the statistical (sampling) uncertainty in the data and not the uncertainty about whether the samples represent the population of interest.
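How quickly the interval shrinks can be sketched with a few lines of code. The example assumes a z-based interval for a mean with a known standard deviation, and the value sigma = 4.0 is made up for illustration. Under that assumption the width is proportional to 1/sqrt(n), so quadrupling the sample size halves the interval.

```python
from statistics import NormalDist

def ci_width(sigma, n, confidence=0.95):
    """Full width of a z-based confidence interval for the mean (sigma known)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sigma / n ** 0.5

if __name__ == "__main__":
    for n in (10, 40, 160):
        # Each fourfold increase in n cuts the width in half.
        print(n, round(ci_width(sigma=4.0, n=n), 2))
```

Note that a larger sample reduces only this sampling uncertainty; it does nothing for measurement error or for data that are not representative.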

So, what does it mean to have "representative data"? First, we need to make a distinction between enumerative and analytical studies, a concept introduced by Deming (1950). Simply stated, an enumerative study is one with a well-defined, existing population where the purpose of the study is to describe or make inferences about the existing population. An analytical study is one in which the population currently does not exist. The purpose of an analytical study is to make conclusions about the future product or process, not about the materials under current investigation. Analytical studies often involve sampling prototype units, made in the lab or on a pilot line, to draw conclusions about subsequent full-scale production. For example, a study may involve sampling product from inventory to make some type of inference about the product population or process.

To summarize, if the focus of the study is to describe a characteristic about the existing inventory, then the study is enumerative. If the study focuses on the future output from the process, then the study is analytical.

Based on the focus of the study, there are important considerations in the sampling strategies.

Sampling strategy for an enumerative study:

• Explicitly and precisely define the target population.
• Clearly define the specific characteristic(s) to be evaluated.
• Clearly state the operating environment in which the desired characteristic is to be evaluated. For example, in a life test, a "failure" must be precisely defined and "normal operating conditions" must be clearly stated.
• Define the "Frame" from which the samples are to be taken. For example, develop a list or other enumeration of the population from which the samples are to be selected, thus establishing the sampled population.
• Evaluate the differences between the target population and the sampled population, and the consequences of those differences.
• Finally, randomly select samples from the sampled population, that is, from the frame. (Note that there are several methods of random sampling, such as simple, stratified, cluster and systematic. Whichever method is used, it is important to understand the possible differences between the target population and the sampled population.)
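The last step, drawing units from the frame, can be sketched as follows. The frame here is hypothetical (120 serial numbers spread evenly over three production lots), and the helper names are invented for this illustration; cluster sampling is omitted to keep the sketch short.

```python
import random

def simple_random_sample(frame, k, rng):
    """Every subset of k units from the frame is equally likely."""
    return rng.sample(frame, k)

def systematic_sample(frame, k, rng):
    """Take every step-th unit after a random start (requires k <= len(frame))."""
    step = len(frame) // k
    start = rng.randrange(step)
    return [frame[start + i * step] for i in range(k)]

def stratified_sample(frame, stratum_of, k_per_stratum, rng):
    """Draw a simple random sample of k_per_stratum units within each stratum."""
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    chosen = []
    for units in strata.values():
        chosen.extend(rng.sample(units, k_per_stratum))
    return chosen

# Hypothetical frame: (serial number, lot) pairs, 40 units from each of three lots.
frame = [(f"SN{i:03d}", "lot" + "ABC"[i % 3]) for i in range(120)]

if __name__ == "__main__":
    rng = random.Random(42)
    print(simple_random_sample(frame, 12, rng)[:3])
    print(systematic_sample(frame, 12, rng)[:3])
    print(stratified_sample(frame, lambda unit: unit[1], 4, rng)[:3])
```

Each method samples only what is listed in the frame, so any unit missing from that list can never be selected, however the random draw is made.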

Sampling strategy for an analytical study:

• Clearly define the process of interest.
• Determine the possible sources of data that will be useful for making inferences about the process. You may consider including as broad an environment as possible, sampling the process over a longer period of time or deliberately evaluating extreme conditions.
• Clearly state the assumptions required for the results of the study to apply to the process.
• Collect well-targeted data and check the statistical model and assumptions.
• Decide whether there is still value in calculating a statistical interval, or whether the statistical interval may be misleading or give a false sense of security.
• If a statistical interval is calculated, then make sure that the assumptions are fully understood. Please remember, the interval only represents the uncertainty associated with the random sampling. It does not include uncertainties due to the differences between the sampled process and the process of real interest.

Based on the discussion about enumerative and analytical studies, it becomes a little easier to determine what type of statistical interval is appropriate for your application. Hahn and Meeker (1991) provided the table below that summarizes when to use each type of statistical interval.

General Purpose of the Statistical Interval

| Characteristic of Interest | Description (Enumerative) | Prediction (Analytical) |
| --- | --- | --- |
| Location - as measured by the mean or some percentile of the specific distribution | Confidence interval for a population mean or median or a specified distribution percentile. | Prediction interval for a future sample mean, future sample median, or a particular ordered observation from a future sample. |
| Spread - as measured by its standard deviation | Confidence interval for a population standard deviation. | Prediction interval for the standard deviation of a future sample. |
| Enclosure Interval - Tolerance intervals to contain a specific proportion of the population and a prediction interval to contain all, or most, of the observations from a future sample. | Tolerance interval to contain (or cover) at least a specified proportion of a population. | Prediction interval to contain all or most of these observations from a future sample. |
| Probability of an Event - A specific example is a confidence interval for the probability that a measurement will exceed a specified threshold value | Confidence interval for the probability of being less than (or greater than) some specified value. | Prediction interval to contain the proportion of observations in a future sample that exceed a specified limit. |
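As one concrete case from the "Probability of an Event" row, a confidence interval for the probability that a measurement exceeds a threshold can be sketched with the Wilson score interval. This is one common choice among several and is not necessarily the method Hahn and Meeker recommend; the counts below (4 of 50 units exceeding the limit) are made up for illustration.

```python
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * (p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) ** 0.5 / denom
    return centre - half, centre + half

if __name__ == "__main__":
    # Hypothetical enumerative study: 4 of 50 randomly sampled units exceed the limit.
    low, high = wilson_interval(successes=4, n=50)
    print(f"95% interval for P(exceed): {low:.3f} to {high:.3f}")
```

The interval describes the existing population that was sampled; it says nothing about units the process has yet to produce.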

So, there you have it! Confidence intervals and tolerance intervals are used when the population already exists and you want to be able to make inferences about that population based on a random sample taken from the population. Prediction intervals are used when you are trying to make an inference about a population that will exist in the future, based on sample data that "represents" the future population.

References

Deming, W., Some theory of sampling, 1st ed. New York: John Wiley & Sons, 1950.

Hahn, G. and Meeker, W., Statistical intervals, 1st ed. New York: Wiley, 1991.