OC Curves for Various Methods for Comparing Two Sets of Data
As discussed in chapter 5, verification testing can be of two types: test method verification testing that is done on split samples, or process verification testing that is done on independent samples. The procedures are different for each of these types of verification testing.
In chapter 5, two methods were considered for test method verification of split samples: the D2S method, which compares the contractor and agency results from a single split sample, and the paired t—test, which compares contractor and agency results from a number of split samples. OC curves, which plot the probability of detecting a difference versus the actual difference between the two populations, can be developed for either of these methods.
In the D2S method, a test is performed on a single split sample to compare agency and contractor test results. If we assume both of these samples are from normally distributed subpopulations, then we can calculate the variance of the difference and use it to calculate two standard deviation, or approximately 95 percent, limits for the sample difference quantity. Suppose the agency subpopulation has a variance
and the contractor subpopulation has a variance
. Since the variance of the difference in two random variables is the sum of the variances, the variance of the difference in an agency observation and a contractor observation is
. The D2S limits are based on the test standard deviation provided. Let us call this test standard deviation σtest. Under an assumption that
, this variance of a difference becomes
. The D2S limits are set as two times the standard deviation (i.e., approximately 95 percent limits) of the test differences. This therefore sets that the D2S limits at
, which is
, or ±2.8284 σtest. Without loss of generality, we can assume σtest = 1, along with assumption of a mean difference of 0, and use the standard normal distribution with region between -2.8284 and +2.8284 as acceptance region for the difference in an agency test result and a contractor test result. With these two limits fixed, we can calculate power of this decision-making process relative to various true differences in the underlying subpopulation means and/or various ratios of the true underlying subpopulation standard deviations.
These power values can conveniently be displayed as a three-dimensional surface. If we vary the mean difference along the first axis and the standard deviation ratio along a second axis, we can show power on the vertical axis. The agency subpopulation, the contractor subpopulation, or both, could have standard deviations smaller, about the same, or larger than the supplied σtest value. Each of these cases is considered in the technical report for this project.[17] For simplicity, herein we will consider only the case where one of the two subpopulations has standard deviation equal to the supplied σtest. Figure 50 shows the OC curves for this case. Power values are shown where the ratio of the larger of agency or contractor standard deviation to the smaller of agency or contractor standard deviation is varied over the values 0, 1, 2, 3, 4, and 5. The mean difference given along the horizontal axis (values 0, 1, 2, 3) represents the difference in agency and contractor subpopulation means expressed as multiples of σtest.
As can be seen in the figure, even when the ratio of the contractor and agency standard deviations is 5 and the difference between the contractor and agency means is 3 times the value for σtest, there is less than a 70 percent chance of detecting the difference based on the results from a single split sample.
As is the case with any method based on a sample of size one, the D2S method does not have much power to detect differences between the contractor and agency populations. The appeal of the D2S method lies in its simplicity rather than its power.
| Prob. Of Detecting a Difference, % | ![]() |
Std. Dev. Ratio |
| Mean Difference, in σtest Units | ||
Figure 50. OC Surface for the D2S Test Method Verification Method (Assuming the smaller σ= σtest) |
||
As noted in chapter 5, for the case in which it is desirable to compare more than one pair of split sample test results, the t—test for paired measurements can be used. But the question arises, how many pairs of test results should be used? This is where an OC curve is helpful. The OC curve, for a given level of α, plots on the vertical axis either the probability of not detecting, β, or detecting, 1 - β, a difference between two populations. The standardized difference between the two population means is plotted on the horizontal axis.
For a t—test for paired measurements, the standardized difference, d, is measured as:
![]()
The OC curves are developed for a given level of significance, α. It is evident from the OC curves that for any probability of not detecting a difference, β, (value on the vertical axis), the required n will increase as the difference, d, decreases (value on the horizontal axis). In some cases the desired β or d may require prohibitively large sample sizes. In that case a compromise must be made between the discriminating power desired, the cost of the amount of testing required, and the risk of claiming a difference when none exists.
OC curves for paired t—tests for α values of 0.05 and 0.01 appear in figures 51 and 52, respectively.
To use these OC curves the true standard deviation of the signed differences, σd, is assumed to be known, (or approximated based on published literature). After experience is gained with the process, σd can be more accurately defined and a better idea of the required number of tests determined.
Example 1. The number of pairs of split sample tests for verification of laboratory-compacted air voids using the Superpave Gyratory Compactor (SGC) is desired. The probability of not detecting a difference, β, is chosen as 20 percent or 0.20. (Some OC curves use 1 - β, known as the power of the test, on the vertical axis, but the only difference is the scale change, with 1 - β, in this case, being 80 percent). Assume that the absolute difference between μc and μa should not be greater than 1.25 percent, that the standard deviation using the SGC is 0.5 percent, and that α is selected as 0.01. This produces a d value of 1.25 percent/0.5 percent = 2.5. Reading this value on the horizontal axis and a β of 0.20 on the vertical axis in figure 52 shows that about 5 paired split-sample tests are necessary for the comparison.
![]() |
![]() |
| Standardized Difference, d | |
Figure 51. OC Curves for a Two—Sided t—Test (α = 0.05 ) |
|
| (Source: Experimental Statistics, by M. G. Natrella, National Bureau of Standards Handbook 91, 1963) | |
![]() |
![]() |
| Standardized Difference, d | |
Figure 52. OC Curves for a Two-Sided t—Test (α = 0.01) |
|
| (Source: Experimental Statistics, by M. G. Natrella, National Bureau of Standards Handbook 91, 1963) | |
In chapter 5, two methods were considered for process verification using independently obtained samples: the F—test and t—test method, which compares the variances and means of sets of contractor and agency test results, and the single agency test method, which compares a single agency test result with 5 to 10 contractor test results. OC curves, which plot the probability of not detecting a difference, β, or detecting a difference, the power or 1 - β, versus the actual difference between the two populations, can be developed for either of these methods.
One approach for comparing the contractor's test results with the agency's test results is to use the F—test and t—test comparisons of characteristics of the two data sets. To compare two populations that are assumed normally distributed, it is necessary to compare their means and their variabilities. An F—test is used to assess the size of the ratio of the variances, and a t—test is used to assess the degree of difference in the means. A question that needs to be answered is what power do these statistical tests have, when used with small to moderate size samples, to declare various differences in means and variances to be statistically significant differences. Some OC curves and examples of their use in power analysis follow.
F—test for Variances——Equal Sample Sizes. Suppose we have two sets of measurements assumed to come from normally distributed populations and wish to conduct a test to see if they come from populations that have the same variances, i.e.,
. Further suppose we select a level of significance of α = .05, meaning we are allowing up to 5 percent chance of incorrectly deciding the variances are different when they really are the same. If we assume these two samples are
x1, x2, …, xnx, and y1, y2 , …, yny,
calculate sample variances
and
, and construct
,
we would accept Ho :
for values of F in the interval
.
For this two-sided or two-tailed test, figure 53 shows the probability we have accepted the two samples as coming from populations with the same variabilities. This probability is usually referred to as β, and the power of the test as 1 - β. Notice the horizontal axis is the quantity λ, where
, the true standard deviation ratio. So for λ = 1, where the hypothesis of equal variance should certainly be accepted, it is accepted with probability 0.95, reduced from 1.0 only by the magnitude of our selected type I error risk, α. One major limiting factor for the use of figure 53 is the restriction that nx = ny = n.
Example 2. Suppose we have nx = 6 contractor tests and ny = 6 agency tests, conduct an α = 0.05 level test, and accept that these two sets of tests represent populations with equal variances. What power did our test have to discern if the populations from which these two sets of tests came were really rather different in variabilities? Suppose the true population standard deviation of the contractor tests (σx) was twice as large as that of the agency tests (σy), giving λ = 2. If we enter figure 53 with λ = 2 and nx = ny = 6, we find that β ≈ 0.74, or the power, 1 - β, is about 0.26. This tells us that with samples of nx = 6 and ny = 6, we only have 26 percent chance of detecting a standard deviation ratio of 2 (and correspondingly a four-fold difference in variance) as being different.
Example 3. Suppose we are not at all comfortable with the power of 0.26 in Example 1, and so subsequently we increase the number of tests used. Suppose we now have nx = 20 and ny = 20. If we again consider λ = 2, we can determine from figure 53 the power of detecting these sets of tests as coming from populations with unequal variances to be over 0.8, approximately 82 percent to 83 percent. If we proceed to conduct our F-test with these two samples, and conclude the underlying variances are equal, we certainly feel much more comfortable with our conclusions.
Figure 54 gives the appropriate OC curves to use if we choose to conduct an α = 0.01 level test. Again we see for equal variances σx2 and σy2, giving λ = 1, that β = 0.99, reduced from 1.0 only by the size of α.
F—test for Variances—Unequal Sample Sizes. Up to now the discussions and OC curves presented have been limited to the case when the two sample sizes are equal. Calculation routines were developed for this project for calculation of power for this test for any combination of sample sizes nx and ny. There are obviously an infinite number of possible combinations for nx and ny. So, it is not possible to present OC curves for every possibility. However, three sets of tables are provided herein which provide a subset of power calculations using some sample sizes that are of potential interest for comparing contractor and agency samples. These power calculations are presented in table form since there are too many variables to present in a single chart, and the data can be presented in a more compact form in tables than in a long series of charts. Table 37 gives power values for all combinations of sample sizes from 3 to 10, with the ratio of the two subpopulation standard deviations being 1, 2, 3, 4, and 5. Table 38 gives power values for the same sample sizes, but with the standard deviation ratios being 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0. Table 39 gives power values for all combinations for sample sizes of 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, and 100, with the standard deviation ratio being 1, 2, or 3. An example below illustrates the use of the first of these tables to reference power for a hypothetical test.
![]() |
| Figure 53. OC Curves for the Two—Sided F—Test for Level of Significance α = 0.05 (Source: Engineering Statistics by A. H. Bowker and G. J. Lieberman.) |
![]() |
| Figure 54. OC Curves for the Two—Sided F—Test for Level of Significance α = 0.01 (Source: Engineering Statistics by A. H. Bowker and G. J. Lieberman.) |
Table 37. F—test Power Values for n = 3 to 10 and s—ratio, λ = 1 to 5
|
|
|
Table 37. F-test Power Values for n = 3 to 10 and s-ratio, λ = 1 to 5 (cont.)
|
|
|
Table 37. F-test Power Values for n = 3 to 10 and s-ratio, λ = 1 to 5 (cont.)
| λ | ny | nx | Power |
|---|---|---|---|
| 5 | 7 | 3 | 0.75893 |
| 4 | 0.84940 | ||
| 5 | 0.90024 | ||
| 6 | 0.93086 | ||
| 7 | 0.95030 | ||
| 8 | 0.96318 | ||
| 9 | 0.97201 | ||
| 10 | 0.97824 | ||
| 8 | 3 | 0.77800 | |
| 4 | 0.86909 | ||
| 5 | 0.91845 | ||
| 6 | 0.94695 | ||
| 7 | 0.96423 | ||
| 8 | 0.97513 | ||
| 9 | 0.98225 | ||
| 10 | 0.98704 | ||
| 9 | 3 | 0.79133 | |
| 4 | 0.88238 | ||
| 5 | 0.93024 | ||
| 6 | 0.95690 | ||
| 7 | 0.97244 | ||
| 8 | 0.98184 | ||
| 9 | 0.98772 | ||
| 10 | 0.99150 | ||
| 10 | 3 | 0.80115 | |
| 4 | 0.89188 | ||
| 5 | 0.93838 | ||
| 6 | 0.96351 | ||
| 7 | 0.97767 | ||
| 8 | 0.98594 | ||
| 9 | 0.99092 | ||
| 10 | 0.99400 |
Table 38. F-test Power Values for n = 3 to 10 and s-ratio, λ = 0.0 to 1.0
|
|
|
Table 38. F-test Power Values for n = 3 to 10 and s-ratio, λ = 0.0 to 1.0 (cont.)
|
|
|
Table 38. F-test Power Values for n = 3 to 10 and s-ratio, λ = 0.0 to 1.0 (cont.)
|
|
Table 39. F-test Power Values for n = 5 to 100 and s-ratio, λ = 1 to 3
|
|
|
Table 39. F-test Power Values for n = 5 to 100 and s-ratio, λ = 1 to 3 (cont.)
|
|