Equivalence of the Computer-Based Aviation Selection Test Battery (ASTB)
S. Biggerstaff, D. J. Blower, and L. A. Portman
Naval Aerospace Medical Research Laboratory, Pensacola, FL.
Introduction:
Previous research on computer-based testing has addressed several issues related to the transition of a test battery from a paper-and-pencil format to a computerized format. Some of the advantages of computerized testing include reductions in testing time, the potential for immediate performance feedback, the ability to measure response latency, and the ability to collect additional information on test-taking patterns (Wise & Plake, 1989). In contrast to paper-and-pencil tests, computer-based testing allows for individualized assessment, increased capabilities in utilizing information, enhanced economic value, manipulation of measurement databases, and improved diagnostic testing (Ward, 1984).
According to the APA Guidelines on Computer-Based Tests and Interpretations, the equivalence of scores from computerized versions should be well established and documented, and comparison studies of computerized and conventional testing should be reported to establish the relative reliability of computerized assessment (Van de Vijver & Harsveld, 1994; Green, 1991). To ensure that performance on a test battery is not differentially affected by the medium of presentation, data should be collected to demonstrate equivalence. Equivalence has been defined in the literature as either qualitative (or structural) or quantitative. Qualitative equivalence refers to the notion that both forms of the test are assessing the same psychological construct, and it is typically determined by linear structural models or item factor analysis. Quantitative equivalence refers to the comparability of results between the two modes of presentation. Numerical score distributions obtained from equivalent tests must be identical or should be made identical through score transformations.
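As a minimal sketch of the kind of score transformation referred to above, linear equating rescales the scores from one medium so that their mean and standard deviation match those of the other. The function name and the sample scores below are hypothetical and are not data from this study; linear equating is only one of several possible equating methods and is not necessarily the one the cited guidelines have in mind.

```python
from statistics import mean, pstdev

def linear_equate(new_scores, ref_scores):
    """Map new_scores onto the scale of ref_scores by matching mean and SD."""
    m_new, s_new = mean(new_scores), pstdev(new_scores)
    m_ref, s_ref = mean(ref_scores), pstdev(ref_scores)
    if s_new == 0:
        raise ValueError("new_scores has zero variance; cannot equate")
    # Standardize each score on the new form, then re-express it on the reference scale.
    return [m_ref + (x - m_new) * (s_ref / s_new) for x in new_scores]

# Hypothetical number-correct scores (illustrative only, not study data).
computer = [20, 24, 28, 32, 36]
paper = [18, 23, 28, 33, 38]
equated = linear_equate(computer, paper)
```

After the transformation, the equated computer scores have the same mean and standard deviation as the paper scores, which is the weakest sense in which two score distributions can be "made identical."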
Overall, previous research has established the crucial importance of demonstrating both qualitative and quantitative equivalence between modes of administration in order to ensure test validity and reliability. To demonstrate equivalence, technical aspects of the computer administration must be addressed by the test creators. Depending upon the perceptual and cognitive dynamics of the new test item presentation, there may be decreases in test performance and lowered validity. Mead and Drasgow (1993) found that speeded tests appear to be affected by the test medium, whereas the mode of test administration has little effect on reliability and validity in power tests. Careful attention must be given to the psychometric aspects of the test design, as well as the human-machine interface, to produce a reliable and valid test. Features such as test item omission, backtracking, and item review or delaying responses, which are often unavailable in a computerized test, can affect performance. Computer graphics quality and the visual display format of test questions can also affect equivalence. Lastly, the input device utilized may affect the rate and ease of response. For example, pressing a key or using a mouse may actually be easier and quicker than marking an answer sheet (Federico, 1992). Each of these factors must be examined individually to assess its potential implications for test performance. Despite these concerns, a number of researchers have demonstrated good equivalence between computer and paper-and-pencil tests within both the military (Moreno, Wetzel, McBride, & Weiss, 1984; Kiely, Zara, & Weiss, 1986; Federico, 1992) and civilian environments (Harrell, Honaker, Hetu, & Oberwager, 1987; Vansickle, Kimmel, & Kapes, 1989; Van de Vijver & Harsveld, 1994).
The purpose of this study was to conduct an initial evaluation of the equivalence of the Windows-based Aviation Selection Test Battery (ASTB) with the existing operational paper-and-pencil version of the test. This study is part of a larger project to investigate the underlying structure of the Navy's ASTB, an in-house alternate test battery, and the Aviator Computer-based Performance Test (CBPT).
Participants:
Eighty-two (3 females, 79 males) U.S. Naval Aviation candidates awaiting flight school at the Naval Aviation Schools Command (NASC) volunteered to participate in this study. All participants had been previously selected for pilot training in the United States Navy. Two of the requirements for selection are that candidates must have received at least a bachelor's degree from an accredited university and must have taken the Navy's ASTB. Participants were informed of the purpose of the study and of their freedom to withdraw at any time without prejudice to their military career.
Materials:
Aviation Selection Test Battery (ASTB): The ASTB is a multiple-choice test that was developed and validated jointly by the Naval Aerospace and Operational Medical Institute and the Educational Testing Service to predict initial ground school and flight training performance in the Navy, Marine Corps, and Coast Guard pilot curriculum. The current form of the test is the 1992 revision of selection tests used since World War II in military aviation. The two equivalent forms of the ASTB take approximately one hour and forty-five minutes to administer and include six timed subtests. The subtests include the Math/Verbal Test (MVT), the Mechanical Comprehension Test (MCT), the Spatial Apperception Test (SAT), the Aviation and Nautical Information Test (ANI), the Biographical Inventory Test (BI), and the Aviation Interest Test (AI). The MVT contains thirty-seven questions and evaluates basic math skills and paragraph comprehension. All candidates are given thirty-five minutes to complete this section. The MCT evaluates the candidates' knowledge of basic mechanical principles and physics and consists of thirty questions, with fifteen minutes allowed for completion. The SAT measures a candidate's mental rotation ability. Candidates are shown an aerial view of coastal terrain, presumably a cockpit view. They are then shown a series of pictures with aircraft in different attitudes and must determine which attitude matches the pictured terrain. The subtest consists of thirty-five questions and must be completed in ten minutes. The ANI consists of thirty questions to be answered within a fifteen-minute time limit. It tests a candidate's specific knowledge of aviation- and nautically-related material. The BI is a standard biographical inventory in which candidates respond to questions regarding their academic background and performance, as well as their life experiences. Candidates are given twenty minutes to complete this portion of the test.
The Aviation Interest subtest (AI) has not been validated against flight performance and therefore was not included in the analysis. The predictive validities of the ASTB subtests for flight performance and success range from .23 to .40 (Multiple R), depending on the subtest and the specific criterion used for analysis (Frank & Baisden, 1994).
The computerized ASTB (C_ASTB) is a Windows-based C++ program developed over a three-year period at the Naval Aerospace Medical Research Laboratory (NAMRL). The goal of the project was to duplicate as closely as possible the paper-and-pencil version of the existing ASTB for eventual transition to the operational community. The system runs on an IBM PC or compatible computer (486/66 MHz or greater) and requires a VGA monitor, standard keyboard, and mouse. The entire test is menu-driven, and responses can be made via keyboard or mouse. For purposes of this study, all responses were made using the mouse. Each subtest begins with an instructions and examples page. When subjects choose to move on from the instructions page, the timer for that section starts. Each test item fits on a single screen, so no scrolling is necessary. The test item images were created to mirror those of the paper-and-pencil test. The extra control features for each item are located at the bottom of each screen. The subject has three options on a button bar: (1) move to the next test item, (2) move back to the previous item, or (3) mark the item for later review. At the end of each subtest, participants are informed of the time remaining for that particular subtest and the number of unanswered questions. They are prompted to review all of the test questions within that section, review only the marked items within the section, or exit the current subtest. Once participants exit, they cannot return to any previous subtest. If they make no response, the subtest is closed when the timer runs out.
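The navigation behavior described above can be sketched as a small state model. This is an illustrative assumption only, written in Python for brevity; it is not the C_ASTB source code (which is a C++ program), and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SubtestSession:
    """Hypothetical model of one timed subtest; not the actual C_ASTB code."""
    n_items: int
    time_limit_s: float
    current: int = 0                               # index of the item on screen
    answers: dict = field(default_factory=dict)    # item index -> chosen option
    marked: set = field(default_factory=set)       # items flagged for review
    elapsed_s: float = 0.0

    def answer(self, option):
        """Record a response for the item currently displayed."""
        self.answers[self.current] = option

    def next_item(self):
        """Button (1): move to the next item, stopping at the last one."""
        self.current = min(self.current + 1, self.n_items - 1)

    def previous_item(self):
        """Button (2): move back one item, stopping at the first one."""
        self.current = max(self.current - 1, 0)

    def mark_for_review(self):
        """Button (3): flag the current item for later review."""
        self.marked.add(self.current)

    def tick(self, seconds):
        """Advance the clock; returns True once the time limit is reached."""
        self.elapsed_s += seconds
        return self.elapsed_s >= self.time_limit_s

    def summary(self):
        """End-of-subtest report: time remaining and unanswered count."""
        return {
            "time_remaining_s": max(self.time_limit_s - self.elapsed_s, 0.0),
            "unanswered": self.n_items - len(self.answers),
            "marked": sorted(self.marked),
        }

# Example using the MCT parameters from the text: 30 items, 15 minutes.
session = SubtestSession(n_items=30, time_limit_s=15 * 60)
session.answer("B")          # answer item 0
session.next_item()          # go to item 1
session.mark_for_review()    # flag item 1 without answering it
session.tick(120)            # two minutes pass
report = session.summary()
```

Keeping the marked items separate from the answers mirrors the two review modes offered at the end of a subtest: reviewing every item or reviewing only the marked ones.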
ASTB - Alternate (ALT): Two forms of an alternate test battery (ALT) were developed in-house to evaluate the influence of the computer program independent of subjects' experience with the ASTB, as well as to evaluate the underlying structure of the ASTB versus another test battery. The ALT was intentionally developed to be as similar as possible to the operational ASTB, with the exception that the SAT portion of the test included a number of different item types (e.g., folding figures and figure rotation). The number of test items in each section and the time to complete the test were identical to those of the ASTB. Both paper-and-pencil and computerized versions of the test were used in this study.
Method:
A split-plot design (see Figure 1 below) was used, with three experimental treatments labeled A (TEST), B (FORM), and C (MEDIUM) and two levels of each treatment. Treatment A was the test instrument, with levels a1 (ASTB) and a2 (ALT). Treatment B was the form of each test instrument, where b1 was Form I and b2 was Form II. Treatment C was the presentation medium, where c1 was paper-and-pencil and c2 was computer. B and C were between-blocks treatments and A was a within-blocks treatment.
Participants reported at 0800 on the first day for briefing and informed consent procedures. By 0830 participants began testing. Instructions for the computer-based and paper-and-pencil tests were identical, except that instructions in the paper-and-pencil conditions were provided verbally by the experimenter. Participants were administered one of the versions of the test battery, either the ASTB or the ALT. When participants were administered the ASTB, they took the form opposite from that which they had previously received. Subjects returned the next day at 0830 for their second test session. The number of correct responses on each subtest was used as the dependent measure. In addition, subjects' subtest scores from their original ASTB administration were obtained from the Naval Aerospace and Operational Medical Institute for future analysis.
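|      | a1 | a2 | n  |
| b1c1 | S1 | S1 | 20 |
| b1c2 | S2 | S2 | 20 |
| b2c1 | S3 | S3 | 22 |
| b2c2 | S4 | S4 | 20 |
Figure 1. Split-plot design. Each group of subjects (S1 to S4) took both test instruments (a1, a2) under one combination of form (b) and medium (c).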
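The cell sizes in Figure 1 can be checked against the degrees of freedom of the F tests reported in the Results. The sketch below assumes the standard split-plot (mixed) ANOVA partition of degrees of freedom, which the paper does not spell out; the variable names are illustrative.

```python
from itertools import product

# Cell sizes from Figure 1: (form, medium) -> number of participants.
group_sizes = {
    ("Form I", "paper"): 20,      # b1c1
    ("Form I", "computer"): 20,   # b1c2
    ("Form II", "paper"): 22,     # b2c1
    ("Form II", "computer"): 20,  # b2c2
}
tests = ["ASTB", "ALT"]  # within-blocks treatment A (a1, a2)

n_subjects = sum(group_sizes.values())   # 82 participants in total
n_groups = len(group_sizes)              # 4 between-blocks groups
n_within = len(tests)                    # 2 levels of the within-blocks treatment

# Every participant contributes one score per test instrument.
n_observations = n_subjects * n_within

# Error df for between-blocks effects (FORM, MEDIUM, FORM x MEDIUM).
df_between_error = n_subjects - n_groups
# Error df for within-blocks effects (TEST and its interactions).
df_within_error = (n_within - 1) * (n_subjects - n_groups)

# All eight design cells: ((form, medium), test).
cells = list(product(group_sizes, tests))

print(n_subjects, df_between_error, df_within_error)  # 82 78 78
```

Under this assumption, both error terms have 78 degrees of freedom, which agrees with the F(1,78) statistics reported for the main effects and interactions.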
Results:
Eighty-one of the subjects had taken the ASTB prior to reporting to NASC. Average scores for each of the subtests can be seen in Table 1. A significant main effect for TEST was found for all subtests, with subjects scoring higher on the ASTB than on the ALT. Significant main effects for FORM were found on the MCT (F(1,78) = 4.47, p < .05) and the ANI (F(1,78) = 4.39, p < .05), with subjects' scores being highest on Form I of the ALT and ASTB. There were no significant main effects for MEDIUM for the four subtests analyzed. A significant MEDIUM by TEST interaction was found on the MVT (F(1,78) = 4.13, p = .046) and the SAT (F(1,78) = 4.10, p = .046), with scores being higher on the computerized than on the paper-and-pencil ASTB, and lower on the computerized ALT than on the paper-and-pencil ALT. No other interactions were significant.
| Subtest | Test | Mean  | Standard Deviation | N  |
| MVT     | ASTB | 27.33 | 5.55               | 82 |
| MCT     | ASTB | 21.88 | 4.00               | 82 |
| ANI     | ASTB | 19.55 | 3.97               | 82 |
| SAT     | ASTB | 26.65 | 5.69               | 82 |
Table 1. Average ASTB subtest scores.
Discussion:
A definitive answer on the equivalence of the computerized ASTB and the paper-and-pencil test will not be available until subject training performance data are collected and the predictive validity of the system is known. However, the current study does suggest that the chosen Windows-based format is compatible with the computerization of the Navy's current ASTB without the need for score transformations. Likely reasons for the observed equivalence include the specific design features of the system (e.g., marking and backtracking) and the procedural restriction to a single input device. In addition, the subjects were all college graduates who had had at least some exposure to computers. The 'real world' applicant population for the ASTB should similarly have a minimum of 2-3 years of college education prior to taking the test. The second year of the validation project will include testing additional subjects on the computerized test battery, establishing the predictive validity of the system, specifically addressing the issue of computer literacy and its effects on test battery performance, and addressing minority bias issues.
References:
Federico, P.A. (1992). Assessing Semantic Knowledge Using Computer-Based and Paper-Based Media. Computers in Human Behavior, 8, 169-181.
Frank, L. & Baisden, A. (1994). The 1992 Navy and Marine Corps Aviator Selection Test Battery Development. Presented at the 1994 Annual Meeting of the Military Testing Association, Williamsburg, VA.
Green, B.F. (1991). Guidelines for Computer Testing. The Computer and the Decision-Making Process. Hillsdale, NJ: Lawrence Erlbaum Associates, 245-273.
Harrell, T.H., Honaker, L.M., Hetu, M., & Oberwager, J. (1987). Computerized Versus Traditional Administration of the Multidimensional Aptitude Battery-Verbal Scale: An Examination of Reliability and Validity. Computers in Human Behavior, 3, 129-137.
Kiely, G.L., Zara, A.R. & Weiss, D.J. (1986). Equivalence of Computer and Paper-and-Pencil Armed Services Vocational Aptitude Battery Tests. Air Force Human Resources Laboratory Final Technical Paper (AFHRL-TP-86-13), Brooks Air Force Base, TX.
Mead, A.D., & Drasgow, F. (1993). Equivalence of Computerized and Paper-and-Pencil Cognitive Ability Tests: A Meta-Analysis. Psychological Bulletin, 114(3), 449-458.
Moreno, K.E., Wetzel, C.D., McBride, J.R., & Weiss, D.J. (1984). Relationship Between Corresponding Armed Services Vocational Aptitude Battery (ASVAB) and Computerized Adaptive Testing (CAT) Subtests. Applied Psychological Measurement, 8(2), 155-163.
Van de Vijver, F.J.R. & Harsveld, M. (1994). The Incomplete Equivalence of the Paper-and-Pencil and Computerized Versions of the General Aptitude Test Battery. Journal of Applied Psychology, 79(6), 852-859.
Vansickle, T.R., Kimmel, C., & Kapes, J.T. (1989). Test-Retest Equivalency of the Computer-Based and Paper-Based Versions of the Strong-Campbell Interest Inventory. Measurement and Evaluation in Counseling and Development, 22, 88-93.
Ward, W.C. (1984). Using Microcomputers to Administer Tests. Educational Measurement: Issues and Practice, Summer, 16-20.
Wise, S.L. & Plake, B.S. (1989). Research on the Effects of Administering Tests via Computers. Educational Measurement: Issues and Practice, 8(3), 5-10.