Development of a Computerized Pilot Selection Test

Mary Ann Hanson
Jerry W. Hedge
Kristi K. Logan
Kenneth T. Bruskiewicz
Walter C. Borman
Personnel Decisions Research Institutes, Inc.

Frederick M. Siem
Air Force Armstrong Laboratory Human Resources Directorate

For years pilot selection has focused primarily on the identification of individuals with superior flying skills and abilities. More recently, the aviation community has become increasingly aware that successful completion of a flight or mission requires not only flying skills, but the ability to work well in a crew situation. This paper describes the development, validation and computerization of a situational judgment test, called the Situational Test of Aircrew Response Styles (STARS), which targets the interpersonal skills necessary to function effectively in a crew situation.

Recent research, especially in the crew resource management (CRM) area, highlights the importance of interpersonal skills and certain personality traits for effective pilot performance. This research was motivated by analyses of the causes of aircraft accidents, which showed that the majority of these accidents involved crew errors (Foushee, 1984). Further analysis indicated that aspects of the crews' interpersonal interactions, such as breakdowns in coordination and communication, most frequently played a causal role in these accidents (e.g., Cooper, White, & Lauber, 1980). Helmreich (1987) hypothesized that pilot performance is determined by ability, personality, and attitudes, and suggested that since the first two variables are very difficult to change, crew resource management training should focus on changing attitudes. Thus, training can be viewed as a way to promote awareness of group dynamics, bring about attitude change, and improve interpersonal skills, but it does not change the underlying traits that have been shown to be related to crew resource management. However, selection based on personality traits and/or relevant interpersonal skills may enhance crew resource management performance, above and beyond what could be expected from crew resource management training alone.

One particularly promising approach to measuring individual differences in the interpersonal and personality areas is the situational judgment test, which presents respondents with a series of job-relevant situations and asks them to indicate which of several alternative actions would be most effective and which would be least effective in each situation (e.g., Motowidlo, Dunnette, & Carter, 1990). Because these tests do not rely on applicants' self-reports, the serious problems with response distortion that plague traditional personality measures are avoided.

To enhance the flexibility of the STARS to meet a variety of user test administration needs, a computer-administered version of the test was developed. The Air Force has been moving toward computer administration of selection measures for years, including more recently developed computer-administered test batteries such as the Basic Attributes Test (BAT), the Learning Abilities Measurement Program (LAMP), and the Advanced Personnel Test (APT). Thus, computerizing the STARS administration is likely to facilitate incorporation of this new test into existing Air Force selection batteries. A computerized test is also likely to be of interest to other potential test users, as the use of computerized selection batteries has become more widespread throughout the government and private industry.

Computerized test administration provides a variety of advantages over conventional paper-and-pencil test administration. Examinee responses are collected in an electronic format, which eliminates the need to enter the data (i.e., keypunching) and allows for immediate, on-line scoring. In addition, computer administration allows for collecting data concerning response latency (i.e., time to respond to each item), which can then be used in scoring the test or for identifying inappropriate responses. Computer administered tests, especially those containing graphics, are also likely to be more visually pleasing, thus making them more appealing to respondents, and possibly even increasing their usefulness as marketing tools.

Method

STARS Development

Experienced aircrews from C-130 transport aircraft units in the Air National Guard and Air Force Reserve served as subject matter experts during the development phase of this research project (a basic C-130 crew is composed of an aircraft commander, co-pilot, navigator, flight engineer, and loadmaster). In addition, junior Air Force officers and Air Force Academy cadets with little or no flying experience (novices) participated in the "response option generation" and "response option scaling" phases of situational judgment test development described below. In all, 398 individuals (240 experts; 158 novices) participated in four different types of STARS development workshops.

Development of this situational judgment test (referred to as the Situational Test of Aircrew Response Styles, or STARS) involved four primary steps: 1) situation generation, 2) response option generation, 3) item review, and 4) response option scaling. Across a ten-month period, development workshops were conducted at 22 Air Force sites within the continental United States. The outcome of the development work was a set of difficult situations targeted toward performance-relevant interpersonal skills, and a representative sampling of the kinds of actions pilots might take in these situations. Each situation included a set of five response options ranging from very effective to relatively ineffective. In addition, effectiveness data were collected from both "expert" and "novice" raters, and the effectiveness data generated by these groups were analyzed to identify similarities and differences between the expert and novice groups. Statistical comparisons both within and between the groups allowed us to select a final set of items to be used in the validation effort, and to develop a scoring key for the test. We generated an effectiveness "score" for each response option for 60 items (selected as our final set of items) by calculating the mean effectiveness rating across all experts. Additional details concerning STARS development and validation can be found in Hedge, Hanson, Borman, Bruskiewicz, and Logan (1996).
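
As a rough illustration of the scoring-key step, the sketch below averages expert effectiveness ratings for each response option. The data layout, the function name `build_scoring_key`, the option labels, and the 1-7 rating scale are assumptions made for this example; the paper does not specify them.

```python
def build_scoring_key(expert_ratings):
    """Return the mean expert effectiveness rating for every response option.

    expert_ratings maps item id -> option label -> list of expert ratings
    (a hypothetical layout; the original data format is not described).
    """
    key = {}
    for item_id, options in expert_ratings.items():
        key[item_id] = {
            option: sum(ratings) / len(ratings)
            for option, ratings in options.items()
        }
    return key


# Made-up ratings on an assumed 1-7 effectiveness scale.
example_ratings = {
    "item_01": {"A": [6, 7, 5], "B": [2, 3, 1], "C": [4, 4, 4]},
}
scoring_key = build_scoring_key(example_ratings)
```

Each entry in `scoring_key` then plays the role of the effectiveness "score" for one response option.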

STARS Validation

We developed a set of special for-research-only performance rating scales designed to measure aircraft commander performance on the constructs related to the crew resource management aspects of the job. The critical incident approach to rating scale development was used (see Borman, 1979). These behavior-based rating scales were designed to measure performance in seven important CRM-related aspects of pilots’ jobs: Facilitating Teamwork; Responsibility/Accountability; Motivating/Disciplining Crewmembers; Training/Coaching Crewmembers; Coordinating and Directing Crewmembers; Facilitating Information Flow; and Problem Solving/Decision Making. These performance rating scales were used to collect performance information and assess the validity of the STARS for predicting pilot performance in the CRM-related aspects of their jobs.

In all, 792 aircrew members participated in the concurrent validation data collection effort, either as a rater, a ratee, or both (in most cases, an aircraft commander served as both a rater and a ratee). STARS data were obtained from 280 aircraft commanders at 13 Air Force Reserve and Air National Guard units. Ratings of these aircraft commanders’ performance were collected from 731 aircrew members at 13 Air Force Reserve and Air National Guard units. In order to increase the likelihood of obtaining accurate and standardized performance ratings using our aircraft commander rating scales, we developed and videotaped a brief (10-minute) rater training program to: 1) emphasize the purpose of the project, namely, to evaluate the STARS as a predictor of aircraft commander crew resource management performance, 2) underscore the notion that the rating scales were developed based on extensive input from knowledgeable aircrews, and 3) stress the importance of providing accurate ratings as a cornerstone of overall validation success.

STARS Computerization

Goals in the development of the computer administration software for STARS included: 1) test administration with a graphics capability, and 2) utilization of advances in computer technology. In addition, the STARS software was developed to be sufficiently flexible to allow for easy computerization of all of the various experimental and operational versions of the STARS. Thus, rather than simply developing a computerized test, we developed a flexible software "shell" that would meet these objectives. The STARS shell was programmed using Turbo Pascal Version 5.5 and PCX Toolbox Version 4.0, and it runs under DOS 5.0 or above.

Results

STARS Validation

Recall that at the completion of STARS development 60 items had been selected for administration to the concurrent validation sample. Because the final test would need to be much shorter for operational use, the first step in our validation analyses was to use a combination of preliminary empirical analyses and rational considerations to identify a subset of the STARS items to use in our final validity analyses. A subset of 13 STARS items was chosen based on factor analytic results, item-level validities, and rational considerations. For comparison purposes, and as a way to distinguish the two STARS versions, the original 60-item STARS will be referred to as the R & D STARS, and the 13-item version will be referred to as the operational STARS. The internal consistency reliability estimate for the R & D STARS is .87, suggesting an acceptable level of reliability; the reliability of the operational STARS is lower (which is not surprising given the smaller number of items), but still respectable at .69.
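
The expected drop in reliability from shortening a test can be illustrated with the Spearman-Brown formula. This is only a sketch: the paper does not say which internal consistency estimate was used, and the formula assumes the shortened test is made up of items comparable to those dropped, which may not hold for the deliberately selected operational items. The function name is chosen for this example.

```python
def spearman_brown(reliability, length_ratio):
    """Predicted reliability when test length is multiplied by length_ratio."""
    return (length_ratio * reliability) / (1 + (length_ratio - 1) * reliability)


# Shortening a 60-item test with reliability .87 to 13 items.
predicted = spearman_brown(0.87, 13 / 60)
```

Under these assumptions the predicted reliability of a 13-item form is about .59, so the observed value of .69 is somewhat higher than test length alone would suggest.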

Six rating sources were represented in the rating data collected: the five primary crew positions of aircraft commander, co-pilot, navigator, flight engineer, and loadmaster, as well as self-ratings. Intercorrelations between mean ratings from each of these sources ranged from .09 (between self and loadmaster ratings) to .49 (between aircraft commander and navigator ratings), with a median correlation of .35.

Table 1

STARS Validity Coefficients in the Cross-Validation Sample
Uncorrected and Corrected for Criterion Unreliability

STARS          Overall   Aircraft    Co-Pilot   Navigator   Flight     Loadmaster   Self-
               Rating    Commander                          Engineer                Ratings
Uncorrected
  Operational  .19**     .13         -.10       .14         .07        .33**        -.06
  R & D        .14*      .02         -.18       .08         .04        .23**        -.11
Corrected
  Operational  .29**     .16         -.10       .17         .10        .42**        --
  R & D        .22*      .02         -.18       .09         .05        .31**        --

*p < .05; **p < .01

The STARS predictor data were randomly split into a developmental sample (45%) and a cross-validation sample (55%). The developmental sample was used to select the 13 items for the operational STARS. Table 1 shows the uncorrected correlations between the operational STARS and each of the rating sources, as well as the validity for the overall rating collapsed across all sources (except self-ratings). The results indicate that predictor-criterion correlations are statistically significant for two of the rating composites (overall ratings and loadmaster), and approach significance for both the aircraft commander and navigator composites.

The uncorrected validities ranged from a low of -.10 for co-pilot ratings to .33 for the ratings provided by loadmasters. The validities for the co-pilot and self-rating composites were both negative, and relatively different from all other source perspectives. The correlation between the operational STARS and the overall rating composite (.19) suggests a modest, but significant relationship between how aircraft commanders scored on the STARS, and ratings of those aircraft commanders’ crew resource management performance across all rating sources.

The highest validity, and one of the most intriguing findings, is associated with the "view from the rear" of the aircraft. Loadmasters’ ratings correlated .33 with aircraft commander scores on the operational STARS, more than double the size of the validities found with any of the primary cockpit crew positions. This suggests that perhaps being removed from the cockpit allows for a more objective perspective on the performance of the aircraft commander.

The uncorrected correlations presented in Table 1 are underestimates of the true relationships between the STARS and the rating source composites to the extent that the ratings are unreliable. In order to obtain a better estimate of the true validities of the STARS, we used a standard correction formula (see Ghiselli, Campbell, & Zedeck, 1981, p. 290) to adjust the validities for criterion unreliability. The corrected validities are also presented in Table 1, and range from -.15 to .49. For comparison purposes, both the uncorrected and corrected validities of the 60-item R & D STARS are also included.
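
A minimal sketch of this kind of adjustment is shown below, assuming the correction referred to is the usual correction for attenuation due to criterion unreliability (observed validity divided by the square root of the criterion reliability). The function name and the reliability value of .43 are hypothetical; the rating reliabilities are not reported in this paper.

```python
import math


def correct_for_criterion_unreliability(observed_validity, criterion_reliability):
    """Divide the observed validity by the square root of criterion reliability."""
    if not 0 < criterion_reliability <= 1:
        raise ValueError("criterion_reliability must be in the interval (0, 1]")
    return observed_validity / math.sqrt(criterion_reliability)


# Hypothetical criterion reliability of .43, used only for illustration.
corrected = correct_for_criterion_unreliability(0.19, 0.43)
```

With these illustrative inputs, an observed validity of .19 rises to roughly .29. A negative observed validity becomes more negative under the same correction, since the divisor is always between 0 and 1.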

STARS Computerization

The STARS "flexible software shell" is a program that sequences "screens" and allows multiple-choice responses and some nominal demographic input. The sequenced screens are "PCX files," which are encoded images that can be exported from most draw programs. When compressed, a 50-item test takes up about 1.2 MB, which fits on a standard 3.5" disk. The general scheme of the STARS shell is to read in a series of PCX images according to a script, which indicates the order in which images are to be read and the types of images, and gives some information on responses and scoring.

An examinee response is indicated by a color change on the screen. For each item, the computer records the response option chosen as most effective, the response option chosen as least effective, the amount of time taken to complete the item (in milliseconds), and the item-level score (defined as the effectiveness value of the response chosen as most effective minus the effectiveness value of the response chosen as least effective). The computer also computes a total score by averaging all of these item-level scores. Data are output to a file with the respondent’s identification number as the file name (and any remaining spaces filled in with "z"s).
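
The scoring rule just described can be sketched as follows. This is not the original Turbo Pascal code: the function names, the effectiveness values, and the 8-character file name width (suggested by the DOS 8.3 naming convention) are assumptions made for illustration.

```python
def item_score(option_key, most_effective, least_effective):
    """Effectiveness of the 'most' choice minus effectiveness of the 'least' choice."""
    return option_key[most_effective] - option_key[least_effective]


def total_score(item_scores):
    """Average of the item-level scores."""
    return sum(item_scores) / len(item_scores)


def output_basename(respondent_id, width=8, fill="z"):
    """Pad the respondent ID with 'z' up to an assumed file name width."""
    return respondent_id.ljust(width, fill)


# Made-up effectiveness values for two items with five options each.
key = {
    "item_01": {"A": 6.0, "B": 2.5, "C": 4.5, "D": 3.0, "E": 1.5},
    "item_02": {"A": 2.0, "B": 5.5, "C": 3.5, "D": 6.5, "E": 4.0},
}
# Each response is (option chosen as most effective, option chosen as least effective).
responses = {"item_01": ("A", "E"), "item_02": ("D", "C")}

scores = [item_score(key[item], most, least) for item, (most, least) in responses.items()]
overall = total_score(scores)
```

In this example the respondent picks the best and worst options on the first item (score 4.5) and the best option but a middling "least effective" option on the second (score 3.0), for a total score of 3.75. A respondent ID of "1234" would give the file name "1234zzzz" under the assumed width.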

Discussion

In general, the results of the concurrent validation study were somewhat disappointing, as we expected to obtain validities in the .30s or .40s given the time and effort expended on predictor and criterion development. However, previous attempts to predict pilot performance with measures of personality and interpersonal skill have obtained mixed results at best, so the fact that the operational STARS has significant, positive validity is encouraging. In addition, the sample of aircraft commanders used in the validation research was highly experienced, so restriction in range is likely to have negatively impacted the obtained validities. Validation research using a predictive, rather than concurrent, design is likely to obtain substantially larger validity coefficients.

The flexible software shell developed for the STARS enhances the flexibility of the STARS to meet a variety of user test administration needs. Computer administration also allows for immediate test scoring and eliminates the need to input test data manually. The flexible nature of this program also allows for efficient computerization of alternate forms and other similar tests. In fact, this software shell is flexible enough to allow for computerization of a wide variety of different types of tests.

References

Borman, W.C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology, 64, 410-421.

Cooper, G.E., White, M.D., & Lauber, J.K. (Eds.) (1980). Resource management on the flightdeck: Proceedings of a NASA/Industry workshop (NASA CP-2120). Moffett Field, CA: NASA-Ames Research Center.

Foushee, H.C. (1984). Dyads and triads at 35,000 feet: Factors affecting group process and aircrew performance. American Psychologist, 39, 886-893.

Ghiselli, E.E., Campbell, J.P., & Zedeck, S. (1981). Measurement theory for the behavioral sciences (rev. ed.). San Francisco: Freeman.

Hedge, J.W., Hanson, M.A., Borman, W.C., Bruskiewicz, K.T., & Logan, K.K. (1996). Predicting the crew resource management skills of Air Force pilots (Institute Report #283). Minneapolis: Personnel Decisions Research Institutes.

Helmreich, R.L. (1987). Theory underlying crew resource management training. In H. W. Orlady and H. C. Foushee (Eds.), Cockpit Resource Management Training: Proceedings of a NASA/MAC Workshop (NASA-CP-2455) (pp. 15-22). San Francisco, CA.

Motowidlo, S.J., Dunnette, M.D., & Carter, G.W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75, 640-647.
