A Methodology for Evaluating the Classification Potential of Experimental Tests
Melody Darby
Jacobina Skinner
William E. Alley
USAF Armstrong Laboratory
Human Resources Directorate
This paper is intended for scientists interested in the study of the classification of Air Force recruits from among a population of applicants into multiple jobs. The treatment of classification in this paper assumes an underlying theory of specific abilities and differential validity in the joint predictor-criterion space as measured by Mean Predicted Performance (MPP) (Zeidner & Johnson, 1994; Alley, 1994). The quality of assignments (classification efficiency) as measured by the statistic MPP is computed as the average of the predicted performances for the job assignments of all recruits assigned (Brogden, 1959). MPP can be used as the basis for making comparisons between competing models of predicted performance, e.g., a model of predicted performance based on aptitude scores from an experimental test battery vs. a model based on aptitude scores from an operational battery.
A methodology is presented which produces unbiased estimates of MPP. Several controls for sampling error are employed, including a control for the tendency of regression analysis to capitalize on sample-specific error variance. Techniques are presented which establish a range of possible classification benefits against which to evaluate the performance of competing models. An upper bound MPP estimate is defined by a full model using all available test information. The limits of the MPP range are established by optimal assignment as the ceiling and random assignment as the baseline or floor (Alley & Ree, 1993). Implications for the interpretation of results for competing test batteries and models, and for certain psychometric problems common during development of an experimental test, are discussed.
Key features of the methodology are demonstrated using the operational selection and classification battery for the Armed Services, the Armed Services Vocational Aptitude Battery (ASVAB), and an experimental information processing test battery under development by the Air Force. Results for various design conditions are compared and discussed, as well as contrasted with two alternate classification study designs which may produce biased results.
Method
Participants
Two groups of Air Force enlisted accessions (recruits) were used to evaluate the ability of a 16-subtest experimental test battery to add to the differential prediction ability of the 10-subtest operational ASVAB. All subjects in both groups had been tested on both the experimental and the operational batteries and had complete sets of 26 standard subtest scores. The first group of 2958 recruits, the developmental sample, attended technical training in one of three courses: mechanical (Mech), general (Gen), or electronic (Elec). There were 1125 Gen, 1107 Mech, and 726 Elec subjects. This group was used to develop regression weights for generating "applicant" performance scores for making assignments to jobs. The second group of 3650 recruits, the validation sample, was treated as a pseudo "applicant" group.
Procedures
The analysis began with a random split of the developmental sample into two half samples of N = 1479 each, samples 1A and 1B (Figure 1). Next, within each of the three schools, final school grade (FSG) was regressed on all 26 experimental plus operational subtest standard scores (mean = 50, standard deviation = 10). For the regressions, FSG had been scaled to a mean of 100 and a standard deviation of 10 within each school so as to place all three schools on the same metric. These regressions provided full least squares (FLS) estimates of weights for the 26 subtests.
These FLS weights were then applied to the 26 subtest standard scores of the "applicants" to generate predicted performance scores for the three schools within each sample. Thus, there was a 3650 x 6 matrix of "applicant" predicted performance scores for the full 26-subtest model, with three columns for each of the two developmental half samples. Two restricted models were also used to compute predicted performance: the 10 operational subtests alone and the 16 experimental subtests alone. FLS weights for each of these models were generated for each school within the developmental half samples exactly as for the 26 subtests, and the "applicant" predicted performance scores for these restricted models were computed as for the full model.
The complete applicant matrix of performance scores therefore contained 3650 rows and 18 columns, one row per applicant and one column for each sample/model/school combination. Assignments to jobs (schools) were based on the performance scores for a particular model from this matrix. This made it possible to compare assignments based on FLS estimates of performance for the full model with assignments based on FLS estimates of performance for the restricted models.
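The rescaling and scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the toy grades, and the three-subtest weights are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, pstdev

def rescale(values, target_mean=100.0, target_sd=10.0):
    """Linearly rescale values to a target mean and standard deviation."""
    m, s = mean(values), pstdev(values)  # assumption: population SD
    return [target_mean + target_sd * (v - m) / s for v in values]

def predict(intercept, weights, subtest_scores):
    """Predicted performance = intercept + sum of weight * subtest score."""
    return intercept + sum(w * x for w, x in zip(weights, subtest_scores))

# Hypothetical raw grades for one school (not data from the study).
raw_fsg = [72.0, 85.0, 90.0, 78.0, 95.0]
scaled = rescale(raw_fsg)
print(round(mean(scaled), 6), round(pstdev(scaled), 6))

# Hypothetical weights for a 3-subtest model applied to one applicant.
print(predict(50.0, [0.4, 0.3, 0.3], [55.0, 48.0, 60.0]))

# One column per sample/model/school combination, as in the 3650 x 18 matrix.
columns = [(sample, model, school)
           for sample in ("1A", "1B")
           for model in ("26 subtests", "10 subtests", "16 subtests")
           for school in ("Gen", "Mech", "Elec")]
print(len(columns))
```

In the study itself the weights would come from the within-school regressions on each half sample; the sketch only shows how a fitted set of weights turns subtest scores into one column of the applicant matrix.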

Figure 1. Diagram of Cross-Validated Study Design
The value of an assignment of a group of applicants to jobs was determined by computing the MPP statistic for the assignment. MPP is simply the sum of the predicted performance scores of all applicants actually receiving an assignment, each taken for the job assigned, divided by the number of applicants assigned. MPP values provided the statistical basis for comparing assignments made from restricted-model performance estimates with assignments made from the full-model (all 26 subtests) FLS performance estimates.
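As a concrete illustration of the MPP computation, the following sketch averages the predicted scores of the assigned applicants in their assigned jobs. The function name, the data layout, and the toy scores are hypothetical and chosen only for illustration.

```python
def mean_predicted_performance(predicted, assignment):
    """Average predicted performance over applicants who received a job.

    predicted[i][j] is the predicted performance of applicant i in job j.
    assignment[i] is the job index for applicant i, or None if unassigned.
    """
    assigned = [(i, j) for i, j in enumerate(assignment) if j is not None]
    if not assigned:
        raise ValueError("no applicants were assigned")
    return sum(predicted[i][j] for i, j in assigned) / len(assigned)

# Toy predicted performance scores: 4 applicants by 3 jobs (invented values).
predicted = [
    [101.0, 98.0, 95.0],
    [97.0, 103.0, 99.0],
    [94.0, 96.0, 102.0],
    [90.0, 91.0, 89.0],
]
assignment = [0, 1, 2, None]  # the fourth applicant is not selected
print(mean_predicted_performance(predicted, assignment))  # (101 + 103 + 102) / 3 = 102.0
```

Unassigned applicants contribute to neither the numerator nor the denominator, which is why imposing a selection rate can raise MPP.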
A cross-validation design was used to control for the tendency of regression analysis to capitalize on sample-specific error variance and yield inflated estimates of MPP that shrink on cross-validation. Assignments were made from performance scores estimated by a given model, and MPP for those assignments was always computed from the full-model performance scores, or reference matrix. Moreover, assignments were sample specific (1A or 1B) and were evaluated using a reference matrix from the other sample; e.g., assignments from the sample 1A, 10 operational subtest model were evaluated using the sample 1B, 26-subtest reference matrix. Since sampling was involved in this design, an average value of MPP, Average MPP, was calculated across 30 iterations of assignments.
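The repeated split-half sampling can be sketched as follows. This is a hedged outline rather than the authors' procedure: `split_half`, `average_mpp`, and the `run_once` callback are hypothetical names, and `run_once` stands in for the whole chain of regression, scoring, assignment, and cross-sample MPP evaluation for one split.

```python
import random
from statistics import mean

def split_half(ids, rng):
    """Randomly split a collection of IDs into two halves of equal size."""
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def average_mpp(run_once, ids, iterations=30, seed=0):
    """Average the MPP returned by run_once over repeated random splits."""
    rng = random.Random(seed)
    return mean(run_once(*split_half(ids, rng)) for _ in range(iterations))

# The developmental sample of 2958 recruits splits into two halves of 1479.
half_a, half_b = split_half(range(2958), random.Random(1))
print(len(half_a), len(half_b))

# Placeholder callback: a real run_once would fit weights on each half,
# make assignments, and return the cross-evaluated MPP for that split.
placeholder_run = lambda sample_1a, sample_1b: 100.0
print(average_mpp(placeholder_run, range(2958)))
```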
Assignments were made for all "applicants" to the three jobs constrained to reflect the actual proportional allocation for the 2958 recruits from the developmental sample in the three schools (.38 Gen, .37 Mech, and .25 Elec). Exhausting the "applicant" pool permitted evaluation of MPP benefits due solely to the classification potential of the operational and experimental batteries apart from the benefits due to selection. Assignments were also accomplished for a 60 percent selection rate.
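One way to turn the observed school proportions into integer job quotas is a largest-remainder allocation, sketched below. The rounding rule is an assumption, since the text does not say how fractional quotas were resolved, and applying the same proportions under the 60 percent selection rate is likewise assumed. The function name is hypothetical.

```python
def proportional_quotas(counts, total):
    """Allocate `total` slots in proportion to `counts` (largest remainder)."""
    n = sum(counts.values())
    quotas = {k: total * c // n for k, c in counts.items()}
    remainders = {k: total * c % n for k, c in counts.items()}
    leftover = total - sum(quotas.values())
    # Hand the remaining slots to the jobs with the largest remainders.
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[k] += 1
    return quotas

# School counts from the developmental sample of 2958 recruits.
school_counts = {"Gen": 1125, "Mech": 1107, "Elec": 726}

all_assigned = proportional_quotas(school_counts, 3650)            # 0% rejection
selected = proportional_quotas(school_counts, 3650 * 60 // 100)   # 60% selection
print(all_assigned, sum(all_assigned.values()))
print(selected, sum(selected.values()))
```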
MPP was computed for the optimal assignment of "applicants" to jobs under each model. MPP from optimal assignments using the full-model FLS estimates of performance was compared to MPP from optimal assignments using the other two models to estimate performance. Random assignment was used as a baseline for comparisons. Optimal assignments were accomplished using a linear programming algorithm from the Statistical Analysis System (SAS), PROC NETFLOW.
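The optimal assignment step can be written as a standard transportation-type linear program. The notation below is introduced here for illustration and is not taken from the original analysis: $p_{ij}$ is the predicted performance of applicant $i$ in job $j$ under the model used to make assignments, $x_{ij}$ indicates whether applicant $i$ is assigned to job $j$, $q_j$ is the quota for job $j$, $N$ is the number of applicants, and $r_{ij}$ is the score in the matrix used to evaluate the resulting assignment $x^{*}$.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{N}\sum_{j=1}^{3} p_{ij}\,x_{ij} \\
\text{subject to}\quad & \sum_{j=1}^{3} x_{ij} \le 1, \qquad i = 1,\dots,N, \\
& \sum_{i=1}^{N} x_{ij} = q_j, \qquad j = 1,2,3, \\
& 0 \le x_{ij} \le 1, \\[4pt]
\mathrm{MPP} \;=\;& \frac{1}{\sum_{j} q_j}\,\sum_{i=1}^{N}\sum_{j=1}^{3} r_{ij}\,x^{*}_{ij}.
\end{aligned}
```

With all applicants assigned, $\sum_j q_j = N$ and the first constraint holds with equality; under a 60 percent selection rate, $\sum_j q_j = 0.6N$. Because the constraint matrix of a transportation problem is totally unimodular and the quotas are integers, a basic optimal solution of this linear program is already 0/1, so no explicit integer constraint is needed.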
It is of interest to examine, for these data, how successful the cross-validated design was in controlling for the tendency of regression analysis to capitalize on sample-specific error variance and produce inflated estimates of MPP. To answer this question, two additional designs were processed: a biased design and a minimally biased design. Both designs are much like the unbiased design and can be presented by reference to Figure 1. In both alternative designs, the developmental sample is split into two random half samples and regression estimates are computed as for the unbiased design. "Applicant" performance scores are generated as before and assignments made. The designs differ in how MPP is computed. For the biased design, the same matrix of performance scores used to make assignments is used to compute MPP. For the minimally biased design, the full-model matrix of performance scores is used to compute MPP for assignments from all models, but there is no crossing between samples.
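The three designs differ only in which score matrix evaluates a given assignment, which the toy sketch below makes explicit. All names and numbers are invented for illustration, and a brute-force search over a three-applicant problem stands in for the linear programming solver. The ordering of the three printed values is built into the toy data and is not evidence about the study.

```python
from itertools import permutations

def optimal_assignment(scores, quotas):
    """Brute-force the quota-constrained assignment with the highest total."""
    slots = [job for job, q in enumerate(quotas) for _ in range(q)]
    candidates = sorted(set(permutations(slots)))
    return max(candidates,
               key=lambda a: sum(scores[i][j] for i, j in enumerate(a)))

def mpp(reference, assignment):
    """Mean of the reference scores for each applicant's assigned job."""
    return sum(reference[i][j] for i, j in enumerate(assignment)) / len(assignment)

def evaluate(design, restricted, full, sample="1A", other="1B", quotas=(1, 1, 1)):
    """Assign from the restricted model, then score under the chosen design."""
    assignment = optimal_assignment(restricted[sample], quotas)
    if design == "biased":              # same matrix used to assign and score
        reference = restricted[sample]
    elif design == "minimally_biased":  # full model, same half sample
        reference = full[sample]
    elif design == "unbiased":          # full model, other half sample
        reference = full[other]
    else:
        raise ValueError(f"unknown design: {design}")
    return mpp(reference, assignment)

# Invented predicted scores: 3 applicants by 3 jobs per matrix.
restricted = {"1A": [[104, 99, 97], [98, 103, 100], [96, 98, 102]]}
full = {
    "1A": [[102, 100, 98], [99, 101, 100], [97, 99, 101]],
    "1B": [[100, 100, 99], [100, 99, 100], [98, 100, 100]],
}
for design in ("biased", "minimally_biased", "unbiased"):
    print(design, round(evaluate(design, restricted, full), 2))
```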
Results
Results for the assignment comparisons evaluating the classification effectiveness of the two restricted models are displayed in Table 1. Average MPP (MPP estimates averaged across 30 assignment iterations) is shown for random assignment and for optimal assignment using FLS estimates of performance for the three models: all 26 subtests (full model), the 10 operational subtests alone, and the 16 experimental subtests alone. In addition, results for a 60 percent selection ratio, which closely approximates the actual Air Force selection rate for the past several years, are presented. The inclusion of a selection ratio in the analysis allows for an assessment of the gains in MPP due to the added effectiveness of the batteries for selection over and above the batteries' differential validity for classification (0% rejection rate). Results for the two samples are reported separately.
Table 1. MPP Across Schools for Different Models: Unbiased Results, 30 Iterations

| Sample / Selection Rate | Model | Average MPP | Random Solution | Diff | Percent Diff |
|---|---|---|---|---|---|
| 1A / 0% | 26 Subtests | 98.89 | 97.17 | 1.72 | ---- |
| 1A / 0% | 10 Subtests | 99.06 | 97.17 | 1.89 | 110 |
| 1A / 0% | 16 Subtests | 97.58 | 97.17 | 0.41 | 24 |
| 1B / 0% | 26 Subtests | 98.92 | 97.17 | 1.75 | ---- |
| 1B / 0% | 10 Subtests | 99.11 | 97.17 | 1.94 | 111 |
| 1B / 0% | 16 Subtests | 97.62 | 97.17 | 0.45 | 26 |
| 1A / 60% | 26 Subtests | 103.01 | 97.17 | 5.84 | ---- |
| 1A / 60% | 10 Subtests | 102.99 | 97.17 | 5.82 | 100 |
| 1A / 60% | 16 Subtests | 100.77 | 97.17 | 3.60 | 62 |
| 1B / 60% | 26 Subtests | 103.02 | 97.17 | 5.85 | ---- |
| 1B / 60% | 10 Subtests | 102.98 | 97.17 | 5.81 | 100 |
| 1B / 60% | 16 Subtests | 100.87 | 97.17 | 3.70 | 63 |
As can be seen from inspection of MPP values at the 0% rejection rate, the random assignment values of 97.17 are identical for the two samples. The performance ranges for the two samples (full model MPP minus random assignment MPP) are also very close, 1.72 and 1.75 grade points. The pattern of MPP values across the three models is similar in each sample: the operational model outperforms the experimental model, 99.06 vs. 97.58 in sample 1A and 99.11 vs. 97.62 in sample 1B. The experimental model's performance expressed as a percentage of the performance range is also close for the two samples, 24% vs. 26%. The fact that the operational model slightly outperforms the full model indicates that the experimental test probably will not add much to the differential validity of the operational test.
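The Percent Diff column in Table 1 can be reproduced from the other columns. The short sketch below does so for sample 1A at the 0% rejection rate; the function name is hypothetical, and the input values are copied from Table 1.

```python
def percent_of_range(model_mpp, full_mpp, random_mpp):
    """Model gain over random assignment as a percent of the full-model gain."""
    return 100.0 * (model_mpp - random_mpp) / (full_mpp - random_mpp)

# Sample 1A, 0% rejection rate (values from Table 1).
full, operational, experimental, random_baseline = 98.89, 99.06, 97.58, 97.17

print(round(percent_of_range(operational, full, random_baseline)))   # 110
print(round(percent_of_range(experimental, full, random_baseline)))  # 24
```

A value above 100 means the restricted model's assignments scored higher on the reference matrix than the full model's own assignments did.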
Improvements in MPP can be seen from inspection of the selection results. Since the same samples were used for the selection analysis as for the pure classification analysis above, the random assignment value remains the same, MPP = 97.17. The performance range increases substantially, to 5.84 and 5.85. When a selection rate is imposed, the performance of the operational model relative to that of the full model decreases by about 10 percentage points compared with its pure classification performance. In contrast, the performance of the experimental model relative to the full model increases by about 38 percentage points over its performance due solely to classification.
Comparison with Biased Designs
Table 2 presents MPP results for the biased, minimally biased, a single iteration of the unbiased, and 30 iterations of the unbiased design. Random assignment was calculated for each design and the difference between MPP due to random assignment and MPP due to the full model computed. Thus each design has its own performance range of possible classification benefits. Results have been averaged across the two samples.
Table 2. Shrinkage in MPP for Different Designs and Models

| Design | MPP | Model: 26 Subtests | Model: 10 Subtests | Model: 16 Subtests |
|---|---|---|---|---|
| Biased | Total | 100.04 | 99.84 | 100.80 |
| Minimally Biased | Total | 100.04 | 99.48 | 98.82 |
| Unbiased, 1 Iteration | Total | 98.74 | 99.10 | 97.34 |
| Unbiased, 30 Iterations | Average | 98.90 | 99.08 | 97.60 |
As can be seen from examination of the table, the performance of all models decreases as the designs become increasingly restrictive. The decrease is most dramatic for the experimental model. An interpretation of the biased results might conclude that the experimental battery alone captures 63% of the possible benefits and adds about 20% to the differential validity of the operational battery. The results are similar for the minimally biased design. Results are quite different for the unbiased design. A single iteration of the design produces results that are difficult to interpret due to sampling variation. Thirty iterations yield a more stable estimate of MPP. As can be seen, the experimental battery captures about 20% of the possible benefits and does not appear to add to the differential validity of the operational battery.
Discussion
Comparison of results obtained for the three designs -- unbiased, minimally biased, biased -- underscores the importance of controlling for sample-specific error variance in evaluating classification effectiveness. As shown in the study, traditional double cross-validation techniques, in which regression weights are developed in split-half entrant samples and then cross-applied to split-half validation samples of applicants, are necessary but not sufficient. That feature of the design introduces a control for the tendency of regression analysis to capitalize on sample-specific error variance in the entrant samples. A second control is important for avoiding overestimation of MPP due to additional sampling error in the applicant performance payoff matrices. Estimates of MPP computed from either the same performance payoff matrix used to make assignments (biased design) or a reference matrix within the same sample (minimally biased design) may lead to overestimates of classification effectiveness and misinterpretation of the value of the set of personnel attributes of interest. In this demonstration study, the biased and minimally biased designs substantially overrepresented the value of the experimental tests.
By implementing the second control through cross-application between split-half applicant samples for purposes of final MPP computation, the unbiased design provided a more accurate assessment of classification benefits due to the experimental tests. For most classification studies, a single iteration of analyses would suffice. However, when an unusual pattern of results is obtained, as was the case here with the 10 ASVAB subtest restricted model capturing 124% of the MPP estimate of the 26-subtest (ASVAB + experimental) full model, additional iterations and a computation of the expected value of MPP are necessary. Typically, the MPP of a restricted model would be less than that of the full model. By replicating 30 times, each time on separate random split-half entrant samples, the results converged on stable MPP estimates. Furthermore, variation in the MPP estimates suggested potential psychometric problems with the experimental tests, probably unreliability in the measurement of information processing ability.
The unbiased design is recommended as a general-purpose methodology for scientists interested in evaluating the utility of alternate personnel attribute measures for classification decisions. The methodology can be applied to address questions that often arise about a set of personnel predictors, for example: the value of (1) cognitive and noncognitive (interests, personality, biodata, psychomotor) measures; (2) different composites developed from subsets of observed measures like subtest scores; and (3) full least squares weighted composites versus simplified unit weighted composites. In addition to questions about personnel attributes, research issues concerning jobs can be explored, for example: the performance costs/benefits of classification systems based on (1) a unique composite for each job; (2) a combined composite for clusters of jobs; and (3) alternate composites for different job clusters. Further, selection rates may be varied to separate performance gains due to pure classification (under the condition of zero percent rejection, where all recruits receive a job assignment) from those due to combined selection and classification (under conditions of varying selection rates, where different percentages of recruits receive a job). This feature of the methodology allows researchers to simulate the effects on classification effectiveness of changes in the size or quality of an applicant pool or of raising or lowering job entry standards.
References
Alley, W.E. (1994). Recent advances in classification theory and practice. In M. G. Rumsey, C.B. Walker, & J.H. Harris (Eds.), Personnel Selection and Classification (pp. 431-442). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Alley, W.E., & Ree, M.J. (1993). Air Force aptitude composite development using optimal allocation: Preliminary findings. Unpublished manuscript. Brooks AFB, TX: Armstrong Laboratory, Human Resources Directorate.
Brogden, H.E. (1959). Efficiency of classification as a function of the number of jobs, percent rejected, and the validity and intercorrelation of job performance estimates. Educational and Psychological Measurement, 19, 181-190.
Zeidner, J., & Johnson, C.D. (1994). Is personnel classification a concept whose time has passed? In M. G. Rumsey, C.B. Walker, & J.H. Harris (Eds.), Personnel Selection and Classification (pp. 377-410). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.