Air Force Officer Qualifying Test Retest Performance

Thomas R. Carretta
Armstrong Laboratory Human Resources Directorate
Brooks Air Force Base, TX

The Air Force Officer Qualifying Test (AFOQT) is a multiple aptitude battery used to select applicants for U. S. Air Force (USAF) officer commissioning programs and to classify commissioned officers into aircrew training programs. Its factor structure has been studied (Carretta & Ree, 1996), it has been validated for pilot and navigator training (Arth, Steuck, Sorrentino, & Burke, 1990; Carretta & Ree, 1995a; Olea & Ree, 1994), and group differences have been examined (Carretta, in press; Carretta & Ree, 1995b; Roberts & Skinner, 1995).

Current Air Force policy allows applicants to test twice on the AFOQT (one retest). The minimum retest interval is six months, but a retest may occur after several years. Additional retests can be and are granted, but require a waiver. Only the latest scores are reported to officer and aircrew selection boards and the boards are not informed whether the score is a retest.

Although the current form (or its equivalent) of the AFOQT has been in use since 1981, little research has been done to examine its retest characteristics (i.e., score changes, reliability, validity). Arth (1986) examined score changes and retest reliability for the operational composites in a sample of 2,246 USAF officer applicants. He observed that retesters’ first-test scores were lower than the scores of those who tested only once. Arth also observed score gains for all composites and retest reliabilities between .775 and .880. However, he did not examine score changes or retest reliability for the 16 AFOQT tests, nor did he examine the predictive validity of first versus retest scores for pilot trainees.

The purpose of this study was to examine retest mean score performance and retest reliability on the AFOQT composites and tests and to evaluate alternative methods for handling retest scores. Estimating the stability of test performance over time is important because it establishes an upper limit on the amount of agreement that can be expected on a retest and may provide insight about the interpretation of retest scores relative to first-test scores. Examining the predictiveness of first and retest scores may help to inform policy in the use of retest scores for pilot selection.

Method

Participants

Participants were 276,039 USAF officer applicants tested between 1981 and 1995 on AFOQT Forms O, P1, or P2. They were mostly male (80.3%) and White (78.4%). They were between 17 and 33 years of age at the time of the first test, with a mean of 22.3 years. Although most participants tested only once (n = 232,111, 84.1%), many (n = 43,928, 15.9%) tested two or more times (twice: n = 36,243, 13.1%; three times: n = 6,647, 2.4%; four or more times: n = 1,038, 0.4%). A subsample (n = 14,403) also attended the USAF Undergraduate Pilot Training (UPT) course. The predictiveness of the first, last, and averaged AFOQT scores against pilot training final outcome was evaluated for this subsample.

Measures

AFOQT. The AFOQT is a paper-and-pencil multiple-aptitude battery used for USAF officer commissioning through the Officer Training School and the Reserve Officer Training Corps. AFOQT scores are used along with other measures of aptitude and educational achievement (e.g., college grade point average, type of degree, previous flying experience) to qualify applicants who passed medical and physical requirements for aircrew training.

The AFOQT consists of 16 tests that measure the factors of general cognitive ability (g), verbal, quantitative, spatial, aircrew aptitude/knowledge, and perceptual speed (Carretta & Ree, 1996). Operationally, the 16 tests are combined to form five composites: Verbal, Quantitative, Academic Aptitude (Verbal and Quantitative combined), Pilot, and Navigator-Technical (Berger, Gupta, Berger, & Skinner, 1990). Only composite scores are considered for operational selection. Test scores are not used individually.

The verbal tests include Verbal Analogies, Reading Comprehension, and Word Knowledge. The quantitative tests are Arithmetic Reasoning, Data Interpretation, and Math Knowledge. The spatial tests include Mechanical Comprehension, Electrical Maze, Block Counting, Rotated Blocks, and Hidden Figures. The tests of aircrew aptitude/knowledge are Instrument Comprehension, Aviation Information, and General Science. Finally, the perceptual speed tests include Scale Reading and Table Reading. More detailed descriptions are available elsewhere (Berger et al., 1990; Carretta & Ree, 1996).

Undergraduate Pilot Training. UPT is a 53-week program that consists of academic courses about aircraft systems and flying theory/principles that are taught concurrently with hands-on training in the primary jet (T-37, 21 weeks) and advanced jet (T-38, 32 weeks). Graduates complete about 190 flying hours. Final training outcome is determined by academic and flying performance and is awarded at the end of UPT. Final training outcome (UPT P/F) was a dichotomous variable where graduates received a score of 1 and those who failed for flying performance deficiency received a score of 0. Those who failed for non-flying deficiencies (e.g., self-initiated withdrawal from training, airsickness, poor academic performance, disciplinary action, medical unfitness) were not included in the validation portion of this study.

Analyses

For some analyses, the sample was divided into four non-overlapping groups. Group 1 consisted of those who tested only once. Groups 2, 3, and 4 consisted of those who tested two times, three times, and four or more times respectively.

Means. AFOQT composite and test mean comparisons were made between first and later scores for those who retested (i.e., first vs. second, first vs. third, and first vs. fourth). The size of group mean differences was expressed in standard deviation units or d (Cohen, 1988). The standard deviation for d was defined as the within-group standard deviation calculated from the weighted average of the square root of the variances for the two groups being compared (e.g., first vs. second test for the twice-tested group). Thus, d = (M1 - M2) / SD and is a measure relative to the groups considered. Cohen characterizes a d of .20 as small, .50 as medium, and .80 as large. However, even "small" d values can have a large practical impact on the proportion of applicants in the group with the lower mean that would meet or exceed some minimum cut score in a selection context. In addition to d, group mean differences were tested using one-tailed t-tests with a .01 Type I error rate.
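The computation of d described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study: the scores are made up, and the weights in the weighted average of the two standard deviations are assumed to be the two sample sizes, which the text does not specify.

```python
from statistics import mean, stdev

def cohens_d(first, second):
    """d = (M1 - M2) / SD, where SD is the n-weighted average of the
    two groups' standard deviations (weights are an assumption)."""
    n1, n2 = len(first), len(second)
    sd1, sd2 = stdev(first), stdev(second)   # square roots of the variances
    sd_within = (n1 * sd1 + n2 * sd2) / (n1 + n2)
    return (mean(first) - mean(second)) / sd_within

# Hypothetical first-test and second-test scores for one retest group.
first_scores = [40, 45, 50, 55, 60]
second_scores = [45, 50, 55, 60, 65]

d = cohens_d(first_scores, second_scores)
print(round(d, 3))
```

Because d is defined as M1 - M2, a negative value in this sketch corresponds to a higher mean on the retest.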

Reliability. Retest reliability was examined for the composites and tests for first vs. second, first vs. third, and first vs. fourth administration. As with the mean comparisons, all reliability estimates were done within each retest group (i.e., within Group 2, 3, or 4).
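A retest reliability of this kind is the correlation between scores from two administrations, computed within one retest group. The sketch below assumes a Pearson product-moment correlation and uses made-up records; the function names and data layout are hypothetical and are not taken from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical records: one list of composite scores per applicant,
# in administration order.
records = [
    [42, 50], [55, 58], [61, 60], [35, 47], [70, 74],          # tested twice
    [40, 44, 52], [52, 50, 57], [66, 69, 71], [30, 41, 45],    # tested three times
]

def retest_reliability(records, n_tests, later):
    """Correlate first-test scores with scores from administration `later`,
    using only applicants who tested exactly `n_tests` times."""
    group = [r for r in records if len(r) == n_tests]
    first = [r[0] for r in group]
    other = [r[later - 1] for r in group]
    return pearson_r(first, other)

r12_twice = retest_reliability(records, n_tests=2, later=2)
r13_thrice = retest_reliability(records, n_tests=3, later=3)
```

Keeping the computation within a group means each estimate reflects only the applicants who actually took that number of tests. Matching the "four or more times" group would require changing the exact-length filter to a minimum-length one.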

Validity. Validity was examined for the composites and tests by correlating scores with UPT final outcome (UPT P/F). Several sets of validities were examined to evaluate the utility of first, last, and average scores. Currently, pilot trainees are selected by a board of senior officers. Among other information for each candidate presented to the board are the most recent AFOQT composite scores. The board sees a mixture of scores and has no knowledge whether these are first test or retest scores. Validity was estimated for first, last, and average scores. The use of last (most recent) scores is consistent with Air Force policy. However, these analyses might disclose whether first or averaged composite scores are advantageous.

Average scores were computed for those who tested two or more times. Depending on the number of administrations, the average was taken over the first and second tests; over the first, second, and third tests; or over the first, second, third, and fourth tests. These averaged scores were then validated against the UPT P/F criterion.
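The three sets of validities can be illustrated with a short sketch. It assumes that validity is a Pearson correlation between a score and the dichotomous UPT P/F variable (equivalent to a point-biserial correlation). The trainee data and variable names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation; with a 0/1 criterion this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical trainees: (composite scores in administration order, UPT P/F),
# where 1 = graduated and 0 = failed for flying performance deficiency.
trainees = [
    ([45], 0), ([62], 1), ([50, 58], 1), ([40, 55], 0),
    ([70], 1), ([48, 52, 60], 1), ([38, 47], 0), ([66, 64], 1),
]

outcome = [pf for _, pf in trainees]
first = [scores[0] for scores, _ in trainees]
last = [scores[-1] for scores, _ in trainees]
average = [sum(scores) / len(scores) for scores, _ in trainees]

validity = {
    "first": pearson_r(first, outcome),
    "last": pearson_r(last, outcome),
    "average": pearson_r(average, outcome),
}
```

For anyone who tested only once, the first, last, and average scores are the same observation, so the three validities can differ only through the retested trainees.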

Several policy questions may be addressed by comparing initial and retest data. First, the current policy of using the "most recent" test vs. the alternative of using the first test can be evaluated. For some candidates the most recent score may be their only test, while for others it may be a second or later test. Is there a difference in validity between the first and most recent tests? A second policy could be to identify subsequent testings to selection boards if subsequent testings were less valid. A third policy could be to report the mean of all tests administered.

Differences in correlations for first vs. most recent composites were examined using a two-tailed t-test (McNemar, 1969). The null hypothesis was H0: r first - r last = 0. A second set of tests of differences in correlations was done for most recent vs. average composite scores. The null hypothesis was H0: r last - r avg = 0. A .01 Type I error rate was used for all statistical tests. This is a conservative test because for many participants, first and last, first and average, and last and average were the same observation, which reduced the between-group variance. This does not violate any of the assumptions of the test statistic (t) used.
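One common form of a t-test for the difference between two dependent correlations that share a variable (here, the criterion) is usually attributed to Hotelling. The sketch below assumes that form; the text does not state which formula from McNemar (1969) was applied. All input values are hypothetical, including the correlation between the two predictors, which the paper does not report.

```python
from math import sqrt

def dependent_r_t(r_xz, r_yz, r_xy, n):
    """t for H0: r_xz - r_yz = 0, where x and y are two predictors measured
    on the same people and z is the shared criterion. Returns (t, df)."""
    # Determinant of the 3 x 3 correlation matrix of x, y, and z.
    det = 1 - r_xz**2 - r_yz**2 - r_xy**2 + 2 * r_xz * r_yz * r_xy
    t = (r_xz - r_yz) * sqrt(((n - 3) * (1 + r_xy)) / (2 * det))
    return t, n - 3

# Hypothetical values: validity of predictor x, validity of predictor y,
# correlation between x and y, and sample size.
t, df = dependent_r_t(r_xz=0.30, r_yz=0.25, r_xy=0.80, n=500)
print(round(t, 2), df)
```

The resulting t is compared with the critical value for n - 3 degrees of freedom at the chosen Type I error rate.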

Results and Discussion

Means

For the once-tested group, the composites were all slightly below the normative median. Generally, first-test composite and test scores were lower for those who chose to retest than for those who tested only once. For those who tested twice, the composites were 9 to 13 percentile points below the normative median on the first test. On the second test, their means on the Verbal, Pilot, and Navigator-Technical composites exceeded both the normative median and the means of the one-time testers. On the Quantitative and Academic Aptitude composites, they remained below the median, but exceeded the scores of the one-time testers. Similar patterns were observed for those who tested three or four times.

The first vs. later scores for those who retested showed a clear trend toward mean score improvement on retests. The largest composite increases occurred for Pilot and Navigator-Technical and the smallest for Quantitative. On the test level, the largest increase occurred for Instrument Comprehension, a measure of pilot job knowledge. Arithmetic Reasoning, Math Knowledge, and General Science were among the tests that showed the least improvement on retesting. Figure 1 shows the mean scores for the Pilot composite by retest group.

The causes for the score changes cannot be known from these data. Possible reasons include reliability of the tests, g-saturation of the tests, and acquisition of specific knowledge. When the tests were ranked on the basis of their reliability (Berger et al., 1990), no relationship was found to their order of change on retest. Similarly, when the tests were ranked on the basis of their g-loading (Carretta & Ree, 1996), no relationship was found to their order of change on retest. It was speculated that the likely reason for changes in Instrument Comprehension and Aviation Information (tests of pilot job knowledge) is that candidates who performed below their expectation probably took action to learn the specialized material. This might have included attending ground school, reading aviation books or manuals, or taking flying training instruction.

Reliability

Retest reliabilities for the composites are shown in Table 1. Compared to Arth (1986), the retest reliabilities for the composites in the current study were greater. For the two-times-tested group, reliabilities ranged from .82 to .88 for the composites and from .48 (Hidden Figures) to .82 (Word Knowledge) for the tests (not shown in Table 1). For those who tested three times, the reliabilities were lower than for two-time-testers. This trend continued, and four-times-testers had lower reliability yet. Retest reliabilities declined on subsequent retests.

Berger et al. (1990) reported a mean internal consistency for the 16 tests in Forms P1 and P2 of .82 and .81. Although some difference may be due to the method used to calculate reliability (retest vs. internal consistency), most of the difference is likely due to what Gulliksen (1950) refers to as the "effect of group heterogeneity on test reliability" (pp. 108-127). In general, the variability of composites and tests for retests was lower than the variability for one-time-testers.
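The group heterogeneity effect can be illustrated with the classical relation that follows from assuming the error variance is the same in both groups: s^2 (1 - r) = S^2 (1 - R), where S and R are the standard deviation and reliability in the more variable group and s and r are those in the less variable group. The sketch below applies that relation to made-up numbers. The function name and values are hypothetical and are not estimates from this study.

```python
def reliability_in_narrower_group(rel_wide, sd_wide, sd_narrow):
    """Reliability expected in a less variable group, assuming equal error
    variance in both groups: r = 1 - (S^2 / s^2) * (1 - R)."""
    return 1 - (sd_wide**2 / sd_narrow**2) * (1 - rel_wide)

# Hypothetical example: reliability of .82 in a group with SD = 10,
# projected to a group whose SD is only 8.
r_narrow = reliability_in_narrower_group(rel_wide=0.82, sd_wide=10.0, sd_narrow=8.0)
print(round(r_narrow, 3))
```

Under this assumption, a modest reduction in score variability is enough to lower the expected reliability noticeably.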

Table 1. Test-Retest Reliability Estimates

                       Two-Times   Three-Times      Four-Times
Score                  r12         r12     r13      r12     r13     r14
Verbal                 .885        .872    .837     .853    .838    .777
Quantitative           .842        .807    .792     .769    .785    .754
Academic Aptitude      .886        .866    .834     .840    .831    .770
Pilot                  .825        .787    .756     .746    .716    .690
Navigator-Technical    .866        .833    .810     .785    .787    .755

Notes. Due to space limitations, reliabilities for tests are not shown. Sample sizes were 36,243 for those tested twice, 6,647 for those tested three times, and 1,038 for those tested four times. The subscripts for r indicate the administrations being correlated. For example, r12 is first and second, r13 is first and third, and r14 is first and fourth.

Validity

In this sample, 88.3% of the pilot trainees successfully completed UPT. Table 2 shows the validities of the first, last, and average scores aggregated across participants. The column marked "Last" in this table represents the validity of the scores seen by the selection board. For all composites except Verbal, no statistically significant differences in validity were found for first vs. last scores. The "Last" Verbal composite was less valid than the "First." Comparison of the "Last" and "Average" scores showed the "Average" scores to be consistently more valid. Each comparison between "Last" and "Average" composite scores was statistically significant.

Table 2. Validity of First, Last, and Average Test Scores (All Participants)

Score                  First    Last     Average
Verbal                 .029*    .017     .024*
Quantitative           .130*    .122*    .131*
Academic Aptitude      .085*    .078*    .084*
Pilot                  .157*    .153*    .168*
Navigator-Technical    .163*    .158*    .169*

N = 14,403; *p < .01 (one-tailed test)
Notes. Due to space limitations, validities for tests are not shown. The column labeled "First" represents the validity of first-test scores. The column labeled "Last" represents the validity of most recent (i.e., last-test) scores. The column labeled "Average" represents the validity of the average of scores across test administrations.

Besides being more valid, averaging scores may discourage some needless retesting. Some applicants will retest hoping that chance fluctuation will cause their scores to increase. One of the attributes of averaged test scores is the effect of subsequent testing on extant averages. Averaging regresses scores toward the middle, thus reducing the effects of extreme scores that may be due to chance fluctuation. Consider the example where a candidate who scores a little below the selection minimum requests a retest. If a score of 50 is the minimum required for selection, and the examinee has scored an accurate 45, a retest might be requested hoping for sufficient chance score increase to enable selection. Averaging reduces spurious qualification, the circumstance where the minimum qualifying score is achieved by luck. For example, the candidate with a score of 45 who retests and by chance fluctuation receives a score of 50 could qualify on that single score of 50. A more valid 47.5 (i.e., [45 + 50] / 2) would have occurred had the two scores been averaged. This averaged score would have been below the minimum selection score and would have been a more accurate assessment of the candidate’s ability and, therefore, likelihood of training success. Averaging tends to reduce the impact of luck, both bad and good. This is beneficial to both the selecting agency and to the examinee.
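The arithmetic of this example can be laid out as a short sketch that compares the "last score" policy with an "average score" policy against the minimum of 50. The function and policy names are hypothetical labels chosen for illustration.

```python
def reported_score(scores, policy):
    """Score reported to the board under a given (hypothetical) policy label."""
    if policy == "last":
        return scores[-1]
    if policy == "average":
        return sum(scores) / len(scores)
    raise ValueError(f"unknown policy: {policy}")

MINIMUM = 50          # minimum qualifying score from the example in the text
scores = [45, 50]     # first test, then a retest raised by chance fluctuation

for policy in ("last", "average"):
    score = reported_score(scores, policy)
    print(policy, score, score >= MINIMUM)
# prints "last 50 True" then "average 47.5 False"
```

Under the last-score policy the candidate qualifies on the retest alone, while the averaged score of 47.5 stays below the minimum.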

Policy implications of the findings follow. Retesting does not threaten validity and therefore retesting can be allowed with no loss of predictive accuracy. In the event of retest, the average of all scores for the Pilot and Navigator-Technical composites provides a more valid index of potential training performance and would serve to discourage needless successive retesting.

References

Arth, T. O. (1986). Air Force Officer Qualifying Test (AFOQT) retesting effects (AFHRL-TP-86-8). Brooks AFB, TX: Manpower and Personnel Division, Air Force Human Resources Laboratory.

Arth, T. O., Steuck, K. W., Sorrentino, C. T., & Burke, E. F. (1990). Air Force Officer Qualifying Test (AFOQT): Predictors of undergraduate pilot training and undergraduate navigator training success (AFHRL-TP-89-52). Brooks AFB, TX: Manpower and Personnel Division, Air Force Human Resources Laboratory.

Berger, F. R., Gupta, W. B., Berger, R. M., & Skinner, J. (1990). Air Force Officer Qualifying Test (AFOQT) form P: Test manual (AFHRL-TR-89-56). Brooks AFB, TX: Manpower and Personnel Division, Air Force Human Resources Laboratory.

Carretta, T. R. (in press). Sex and ethnic group differences in U. S. Air Force pilot selection tests (AL/HR-TP-1996-26). Brooks AFB, TX: Manpower and Personnel Research Division, Armstrong Laboratory Human Resources Directorate.

Carretta, T. R., & Ree, M. J. (1995a). Air Force Officer Qualifying Test validity for predicting pilot training performance. Journal of Business and Psychology, 9, 379-388.

Carretta, T. R., & Ree, M. J. (1995b). Near identity of cognitive structure in sex and ethnic groups. Personality and Individual Differences, 19, 149-155.

Carretta, T. R., & Ree, M. J. (1996). Factor structure of the Air Force Officer Qualifying Test: Analysis and comparison. Military Psychology, 8, 29-42.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.

McNemar, Q. (1969). Psychological statistics. New York: Wiley.

Olea, M. M., & Ree, M. J. (1994). Predicting pilot and navigator criteria: Not much more than g. Journal of Applied Psychology, 79, 845-849.

Roberts, H. E., & Skinner, J. (1995, May). Equity of the Air Force Officer Qualifying Test in selection. Poster session presented at the tenth annual conference of the Society for Industrial and Organizational Psychology, Orlando, FL.
