Actual Time Spent Measurement in Support of
Critical MPT Decision Making

Charles N. Holt & Jimmy L. Mitchell, IJOA, San Antonio, TX
Winston Bennett, Jr, Air Force Research Laboratory, Mesa, AZ

ABSTRACT

This research focused on the use of actual time scales (ATS) in two applied U.S. Air Force computer-based studies. In the first study, respondents were asked to estimate how often they performed each task in a list of routine tasks and how long it took them to perform the task once, as well as to estimate the number of hours they normally work per week. Two identical automated surveys were administered a few months apart to the same population. In the second study, respondents were asked to estimate actual time spent on routine tasks using a similar ATS, but one version of the survey included a feedback mechanism that displayed the accrued total time accounted for by rated tasks while the other version did not provide feedback. In Study 1, the means obtained from the ATS were remarkably similar to the more global "weekly time worked" estimates and were similar between administrations. In Study 2, the means for both treatment groups were reasonable and stable between conditions. However, as in prior research involving ATS formats, the variance observed was extreme. Suggestions for further ATS research, including methodologies for reducing the variance associated with the scale, are offered.
BACKGROUND
One of the areas of ongoing research in job and occupational analysis has been the use of time estimation scales by respondents completing task-based surveys (Mitchell et al., 1997, 1998). Relative time scales (RTS) have gained widespread acceptance and are used by many military and civilian organizations (Gael, 1983; Gould, 1978). Normally, RTS formats ask respondents simply to determine whether they perform the task and then to rate the time spent on that task in relation to other tasks they perform. Typically, relative time scales employ non-anchored seven- or nine-point scales that provide a rank order of time spent on various tasks, but little information about the magnitude of the differences between the amounts of time spent on tasks. While RTS formats have come under some criticism over the years (Christal, 1974; Phalen, 1991; Sanchez & Levine, 1989), they are especially useful in making training decisions or when evaluating the effectiveness of training. A rank ordering of time spent on tasks is usually sufficient to answer these types of questions. Recent work by Brenner (1999) shows that the RTS format is the most statistically reliable when compared to three other time-estimation scales and receives the highest respondent approval ratings as well.
RTS formats, however, cannot provide the anchored distribution of values that would be useful when making critical manpower decisions. Manpower questions might be more easily addressed if the actual amount of time to perform tasks were estimated using an actual time scale (ATS) so that the magnitude of the differences between tasks could be calculated (Phalen, 1991, 1995). ATS data could be quite useful, for instance, when calculating dollar values attributable to the differences in time spent performing tasks. Another attractive feature of the ATS is that it produces a large amount of variance and can be used to compare time across jobs, occupations, and organizations (Phalen, 1995). The ATS format also has considerable potential for assessing the impact of new technologies in actual hours saved or lost.
ATS formats usually ask the respondent to first estimate the frequency of task performance (how many times the task is performed per day, week, month, year) and then the time (in seconds, minutes or hours) required to perform that task once (Albert, Bennett, Pemberton, Holt, & Waldroop, 1997). The respondent thus has four pieces of information to provide (the frequency and unit of frequency and the duration and unit of duration) when using an ATS format. During data analysis, the values for the various units of measure across all tasks are converted to obtain a uniform data set, usually "hours" for duration and "years" for frequency. These converted values for frequency and duration are then multiplied together to obtain a value representing the actual time the respondent estimates they spend performing that task per year.
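The conversion described above can be sketched in a few lines. This is a minimal illustration, not the utility actually used in these studies: the function name and the conversion factors (365 days, 52 weeks, and 12 months per year) are assumptions introduced here, and an operational study might instead count only duty days.

```python
# Assumed conversion factors for illustration only; the paper does not state
# which factors its own conversion step used.
HOURS_PER_DURATION_UNIT = {"seconds": 1 / 3600, "minutes": 1 / 60, "hours": 1.0}
PERIODS_PER_YEAR = {"day": 365, "week": 52, "month": 12, "year": 1}


def hours_per_year(freq_count, freq_unit, duration, duration_unit):
    """Convert one ATS response (four pieces of information) to hours per year."""
    # Frequency is expressed as occurrences per year.
    times_per_year = freq_count * PERIODS_PER_YEAR[freq_unit]
    # Duration is expressed as hours per single performance of the task.
    hours_per_occurrence = duration * HOURS_PER_DURATION_UNIT[duration_unit]
    return times_per_year * hours_per_occurrence


# Hypothetical response: the task is performed 3 times per week,
# and each performance takes 20 minutes.
example = hours_per_year(3, "week", 20, "minutes")
print(round(example, 2))
```

Keeping the unit tables separate from the arithmetic makes it easy to substitute different assumptions (for example, duty days rather than calendar days) without touching the conversion itself.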
While RTS formats tend to produce a small amount of variance due to the self-limiting nature of the seven- or nine-point scales used, ATS formats are known to produce very large standard deviations due to the unrestricted range of values available to the respondent (Holt, Hardaway, Woehr, & Bennett, 1998). The potential for error while using the ATS format is great since the respondent must choose between several possible units of measure when making both the frequency and duration estimates for each task rated. For instance, the respondent may intend to report performing a certain task five times a year, with each performance taking two hours, for a total of 10 hours per year. It is not difficult to imagine a respondent erroneously reporting that they perform the task five times a day rather than five times a year, especially when dealing with very long task lists.
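The size of such a unit-selection error is easy to work out. The sketch below uses the five-times-a-year, two-hour example from the text; the assumption that "per day" is annualized over 365 days is introduced here for illustration and is not taken from the studies.

```python
# Assumed number of reporting periods per year for each frequency unit.
PERIODS_PER_YEAR = {"day": 365, "week": 52, "month": 12, "year": 1}


def annual_hours(count, unit, hours_each):
    """Hours per year implied by a frequency count, its unit, and hours per performance."""
    return count * PERIODS_PER_YEAR[unit] * hours_each


intended = annual_hours(5, "year", 2)  # what the respondent meant to report
mistaken = annual_hours(5, "day", 2)   # same numbers, wrong frequency unit
inflation = mistaken / intended        # how many times too large the estimate is

print(intended, mistaken, inflation)
```

Under these assumptions a single wrong unit turns a 10-hour task into one that appears to consume more hours than a full work year, which suggests how a handful of such responses could dominate a standard deviation.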
It is not the purpose of this study to compare or judge the relative worth of these two scales. Both scales have their place in job and occupational studies, as discussed above. Instead, this study focused on two important aspects of the actual time scale: mean stability and variability. The stability of means generated by the ATS format was examined in three ways: by comparing ATS-generated means to a more global time estimate, by examining the similarity of ATS means over time, and by comparing means when two versions of an ATS format are used in the same study. Standard deviations, which were anticipated to be extremely large, were calculated, and possible explanations and potential remedies are discussed.
METHOD - Study 1
The first study was conducted recently at the Air Force Basic Military Training (BMT) Group at Lackland AFB, TX (Holt, Hardaway, Woehr, & Bennett, 1998). The Group was interested in examining the impact of technology on the daily work activities of Training Instructors (TIs) and their supervisors. The Group had implemented portable computer terminals (CruisePads) into BMT to automate many of the paper-and-pencil processes. Data collection, using a BMT-developed task list and an actual time scale (ATS) format, was performed prior to, and shortly after, the new technology was implemented to determine whether the CruisePad technology would save time. It was hypothesized that there would be an overall reduction in time spent on tasks. For the purposes of this study, it was also anticipated that there would be a large amount of variance due to the turmoil created by the implementation of the CruisePad technology and perhaps to systematic sources associated with the use of the ATS format.

Identical surveys were administered on two occasions: once just prior to the full implementation of the CruisePad in March 1998 (N = 125), and again three and a half months later, in early July 1998 (N = 112). During the intervening period, there were several significant changes to the software (notably the inclusion of signature recognition software) and hardware (notably the FM antenna system). Also during this period, several TIs used both the existing paper-and-pencil methodology and the new CruisePad system for a portion of the interval. Additionally, several technical problems identified during the first survey were in the process of being addressed at the time of the second survey; only a portion of those problems had been resolved by then. Both survey administrations were proctored by AFRL, IJOA and BMT staff and were completed in several group sessions held in a computer laboratory in the 737th BMT Squadron Headquarters. Two cases were eliminated from the first administration and one case was removed from the second administration due to response patterns suggesting that the respondents misunderstood the requirements of the survey. In addition, 12 data points were removed from the first survey and 9 data points were removed from the second survey after BMT SME personnel identified them as obviously out-of-range responses.


METHOD - Study 2

The second operational study involved Air Force Security Force Law Enforcement Patrolmen (Holt, Mitchell, & Zuniga, 1998). The primary objective of this operational study was to assess tasks performed by Law Enforcement (LE) Patrolmen in selected units operating under normal (8-hour) or extended (12-hour) shifts in order to identify less frequently performed tasks as candidates for removal from the LE patrolmen job (job reengineering). An automated survey using an ATS format was created to collect data concerning actual time spent on tasks, as well as responses to a variety of background questions. In addition, one research objective was to contrast two experimental conditions: one with feedback on the accruing total time accounted for by rated tasks, and one without such feedback. Security Forces staff administered the survey to selected units worldwide (71% return rate). A small number of cases were eliminated if their response patterns were outside what a qualified subject matter expert deemed possible. The removal of such "outliers" from the sample is consistent with normal occupational analysis and research practice. The final sample consisted of 271 LE Patrolmen from 16 Air Force bases. Data analysis, including testing between-group differences in mean and standard deviation, was accomplished using the Statistical Package for the Social Sciences (SPSS) employing traditional t-tests.
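The feedback condition can be pictured as a running total that is updated after each task is rated. The following is only a sketch of that idea: the function name, the 52-week annualization, and the choice to display hours per week are assumptions made here, since the survey software's actual feedback display is not described in detail.

```python
WEEKS_PER_YEAR = 52  # assumed annualization factor


def running_weekly_totals(task_hours_per_year):
    """Return the accrued hours per week after each successive task rating."""
    totals = []
    accrued = 0.0
    for hours in task_hours_per_year:
        # Each rated task adds its share of the work week to the running total.
        accrued += hours / WEEKS_PER_YEAR
        totals.append(accrued)
    return totals


# Hypothetical respondent who has rated three tasks so far (hours per year).
feedback = running_weekly_totals([520, 1040, 52])
print(feedback)
```

A respondent who sees the accrued total climb well past a plausible work week has an immediate cue to revisit earlier ratings, which is the behavior the feedback condition was meant to encourage.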
RESULTS - Study 1
Data were summarized using the Statistical Package for the Social Sciences (SPSS), version 8.0 for the PC. A special DOS utility was written to convert individual time estimate responses into a common metric: hours per task per year. SPSS was used instead of CODAP for technical reasons.

Table 1 – Comparison of Actual Time Estimates

            Mean Hours per Week      SD     N  Mean Hours per Year       SD
Survey 1                  71.38  111.74   125              3711.70  5810.53
Survey 2                  68.35   99.89   112              3554.18  5194.18
Difference                 3.03                             157.50

Table 1 shows an average time savings of just over three hours per week per TI from the first survey to the second, or about 157 hours per year per TI. The standard deviations, as expected, were very large.
 
 

Table 2 – Comparison of Estimated Hours Worked per Week

            Mean Hours per Week      SD     N  Mean Hours per Year       SD
Survey 1                  74.98   11.35   125              3899.17   590.01
Survey 2                  70.79   11.54   112              3681.32   599.84
Difference                 4.19                             217.85

Table 2 reflects a similar decrease in the number of hours incumbents reported working per week in response to a background item asking for an estimate of "hours worked per week," a more global estimate of time worked. The average TI reported working about 4 fewer hours per week at the second administration than at the first. Note that the estimates from Table 1 and from Table 2 are in the same direction and of about the same magnitude, reflecting a trend towards less time being spent on tasks with the use of the CruisePad. Of the 108 tasks performed by the majority of TIs, 56 showed a decrease in actual time spent, while 52 reflected an increase. The estimated savings may be an underestimate since a portion of the respondents were actually using both paper-and-pencil and CruisePad methodologies.
 
 

Table 3 – Comparison of ATS Data versus Global Estimate of Time Worked

              Hours per Year from            Hours per Year from
                 Actual Time Data        SD      Global Estimate       SD
Survey 1                  3711.70   5810.53              3899.17   590.01
Survey 2                  3554.18   5194.18              3681.32   599.84

Table 3 compares ATS data with the more global time estimate and shows both estimates to be similar. Apparently respondents' actual time estimates were very consistent with the more global estimate of time worked. Again, note that the standard deviations for the ATS format are very much larger (nearly ten times larger) than the standard deviations for the global estimates.
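The contrast in spread can be made concrete with two simple ratios computed from the Table 3 figures: the ratio of the ATS standard deviation to the global-estimate standard deviation, and the coefficient of variation (SD divided by mean) for each measure. The sketch below is an illustrative calculation added here, not part of the original analysis; the dictionary layout and function name are hypothetical.

```python
# Hours-per-year (mean, SD) pairs as reported in Table 3.
table3 = {
    "Survey 1": {"ats": (3711.70, 5810.53), "global": (3899.17, 590.01)},
    "Survey 2": {"ats": (3554.18, 5194.18), "global": (3681.32, 599.84)},
}


def summarize(row):
    """Return the SD ratio and the coefficient of variation for each measure."""
    ats_mean, ats_sd = row["ats"]
    global_mean, global_sd = row["global"]
    return {
        "sd_ratio": ats_sd / global_sd,         # ATS spread relative to global spread
        "cv_ats": ats_sd / ats_mean,            # SD as a fraction of the ATS mean
        "cv_global": global_sd / global_mean,   # SD as a fraction of the global mean
    }


summary = {name: summarize(row) for name, row in table3.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

With these figures the ATS standard deviation exceeds its own mean in both surveys, whereas the global estimate's standard deviation is a small fraction of its mean.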


RESULTS - Study 2

The major operational results are reported in an earlier study (Holt, Mitchell, & Zuniga, 1998). In brief, the operational study identified "non-essential" tasks that were not performed very frequently, while the experimental study showed that both the means and the standard deviations were lower when feedback was provided to survey respondents.

Table 4 – Means and Standard Deviations for Feedback versus No Feedback Conditions

               Mean hours per week     N  Standard Deviation
Feedback                    59.826   114             137.746
No Feedback                 63.638   114             151.379

Table 4 reflects means that are consistent with what might reasonably be expected; that is, the "feedback" group reported less time spent on tasks. The standard deviations were again very large but, as expected, were smaller in the feedback treatment. Apparently, the feedback mechanism had an impact on the estimates the individuals were making.
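For readers who wish to see how a between-group comparison of this kind can be computed from summary statistics alone, the sketch below applies a two-sample t statistic to the Table 4 figures. It is offered as an illustration only: Welch's unequal-variance form is an assumption made here (the analysis is described only as using traditional t-tests), and with equal group sizes it yields the same t value as the pooled form. The function name is hypothetical.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return the t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first group mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second group mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Feedback versus No Feedback, using the values reported in Table 4.
t_stat, df = welch_t(59.826, 137.746, 114, 63.638, 151.379, 114)
print(round(t_stat, 3), round(df, 1))
```

Because the standard deviations are so large relative to the difference in means, the resulting t statistic is small in absolute value, which is one practical consequence of the variance problem discussed in this paper.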

DISCUSSION

The data from Study 1 confirm that ATS estimates were consistent with a more global estimate of time spent working on tasks and reasonably reflected the anticipated decrease in time spent performing tasks. The ATS data from Study 2 also produced means that were in a reasonable range for the career field being studied and behaved as expected as the apparent result of the treatment condition. While stability of the means in both studies could well be an artifact, the fact that both studies reflected reasonable time data is worth noting.

As evidenced by the high standard deviations for both studies, there may be some problem(s) with how the ATS data collection was operationalized. Future development of ATS and its related software should address these variability and overestimation issues.

On the other hand, one major reason for collecting actual time estimates is to increase the variance in ratings in order to overcome the restriction-in-range problem with other types of rating scales (Phalen, 1995). A number of researchers have maintained that "higher variability in item responses is indicative of higher data quality" (Stanton, 1998, p. 713). Clearly the present studies were successful in developing considerably more variability than would have been possible using the RTS format employed in most occupational analysis studies. Part of the variance in ratings in both studies could well be a function of the unusual work schedules of both career fields studied.

It is apparent, however, that there is some excess variability in the ratings and perhaps some overestimate of the amount of work time for some respondents. Examination of individual responses revealed that some respondents were not using a consistent frame of reference when rating individual tasks and appeared to be estimating actual time to perform the task inappropriately. While the more extreme cases could be identified and eliminated as outliers, eliminating too large a portion of the sample this way might border on selecting the data to fit expected results.

Another possible problem is whether the tasks to be rated are well written, reasonably discrete, and time ratable, as recommended by most experienced occupational analysts (Archer & Fruchter, 1963; Christal, 1974; Driskill & Gentner, 1978). If the tasks in a job inventory are not mutually exclusive or tend to be ambiguous, the ratings will tend to be more diverse but possibly spurious, and the result will be an overestimate of the time spent on a given task or function; likewise, total hours worked would be exaggerated. Review of the task lists for both studies indicates there may have been some lack of discreteness for a few tasks, particularly where they are somewhat global statements (e.g., patrol the base).

Another factor which may have introduced extra variance in responses was the absence of some of the proctoring of responses which was part of the original laboratory study (Albert et al., 1994; Phalen, 1995). In that hard disk software, certain screening criteria were built in so that if a response was extreme (i.e., exceeded the maximum expected level), the software put up an alert flag which asked the respondent to reconsider his or her rating (ibid.). When the software was simplified for the field feasibility study (Mitchell, Weissmuller, Bennett, Agee, & Albert, 1995) so that it could be exported on low-density diskettes to Air Force locations worldwide (and run from disk without installing on the PC hard disk), the extra monitoring of responses and the prompting of raters to reconsider had to be eliminated. Clearly, such software proctoring would be worthwhile in helping to keep down overestimation and in making rater responses more realistic. Such software proctoring would be easier to implement in a Windows environment than is possible with the current DOS-based system (OASurv).
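The kind of screening described above amounts to comparing each converted response against a per-task ceiling and prompting the respondent when the ceiling is exceeded. The following is a minimal sketch under stated assumptions: the function name, the message wording, the task label, and the ceiling value are all hypothetical and are not drawn from the laboratory software or from OASurv.

```python
def check_response(task, hours_per_year, max_expected_hours):
    """Return an alert message if the estimate exceeds the task's ceiling, else None."""
    ceiling = max_expected_hours.get(task)
    if ceiling is not None and hours_per_year > ceiling:
        return (
            f"Your estimate for '{task}' is {hours_per_year:.0f} hours per year, "
            f"which exceeds the expected maximum of {ceiling:.0f}. "
            "Please reconsider your rating."
        )
    return None  # response is within range, or no ceiling is defined for the task


# Hypothetical ceiling of the kind subject matter experts might supply.
ceilings = {"Patrol the base": 2000.0}

alert = check_response("Patrol the base", 3650.0, ceilings)
no_alert = check_response("Patrol the base", 10.0, ceilings)
print(alert)
print(no_alert)
```

Returning a message rather than rejecting the response leaves the final judgment with the respondent, in keeping with the description of the alert as a request to reconsider.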

Changing the format of the ATS might well produce less systematic error variance. For instance, if respondents did not have to select the metric associated with the task being presented, either in frequency or duration, a significant source of error could be eliminated. Of course, this would require a great deal of Subject Matter Expert (SME) participation in task list preparation to identify the proper metric (seconds, minutes, hours, etc.) for each task. Another possibility would be to offer a pop-down window in which the respondent selects the appropriate metric before estimating the frequency or duration associated with the task.

Further research and development to operationalize ATS data collection would certainly be worthwhile.

References:

Albert, W.G., Bennett, W., Jr., Pemberton, K., Holt, C.N., & Waldroop, P. (1997). Evaluation of automated technology in basic military training using computer-based surveys. Presentation in the symposium, J.S. Tartell and H.W. Ruck, co-chairs, Advanced Technology Research and Applications in Occupational & Training Analysis & Organizational Assessment, at the 38th annual conference of the International Military Testing Association (IMTA), Sydney, Australia.

Albert, W.G., Phalen, W.J., Selander, D.M., Dittmar, M.J., Tucker, D.L., & Weissmuller, J.J. (1994). Large-scale laboratory test of occupational survey software and scaling procedures. Proceedings of the 36th Annual Conference of the International Military Testing Association. Rotterdam, The Netherlands: European Members of the IMTA.

Archer, W.B. & Fruchter, D.A. (1963). The construction, review, and administration of Air Force job inventories (PRL-TDR-63-21). Lackland AFB, TX: 6570th Personnel Research Laboratory.

Christal, R.E. (1974). Collecting, analyzing, and reporting information describing jobs and occupations. (AFHRL-TR-74-19, AD-774 575). Lackland AFB, TX: Occupational Research Division, Air Force Human Resources Laboratory.

Driskill, W.E., & Gentner, F.C. (1978). Four fundamental criteria for describing the tasks of an occupational specialty (in Technical Note 78-04). U.S. Air Force Occupational Measurement Center, Randolph AFB, TX.

Hasher, L. & Zacks, R.T. (1984). Automatic processing of fundamental information: The case of frequency of occurrence. American Psychologist, 39 (12), 1372-1388.

Mitchell, J.L., Tucker, D., Fast, J., Bennett, W., Jr., & Albert, W.G. (1997, October). Research and development of new occupational analysis and training evaluation technologies. Presentation in the symposium, J.S. Tartell and H.W. Ruck, co-chairs, Advanced Technology Research and Applications in Occupational & Training Analysis & Organizational Assessment, at the 38th annual conference of the International Military Testing Association (IMTA), Sydney, Australia (Proceedings in press).

Mitchell, J.L., Weissmuller, J.J., Bennett, W., Jr., Agee, R.C., & Albert, W.G. (1995). Final results of a field study of the feasibility of computer-assisted occupational surveys: Stability of task and job information. Proceedings of the 37th annual conference of the International Military Testing Association (IMTA), pp. 231-236. Toronto, Ontario, Canada: Canadian Forces Applied Research Unit.

Mitchell, J.L., Weissmuller, J.J., Tucker, D.L., Waldroop, P., & Bennett, W., Jr. (1996, November). Development and application of a computer-assisted survey authoring tool for training needs assessment. In the symposium, H. W. Ruck, Chair, Recent Research and Applications in Training Needs Assessment and Evaluation. Proceedings of the 38th Annual Conference of the International Military Testing Association, pp.486-491. San Antonio, TX: Air Force Personnel Center, Armstrong Laboratory Human Resources Directorate, & the Air Force Occupational Measurement Squadron.

Phalen, W.J. (1995). A critical evaluation of various procedures for estimating time spent. Proceedings of the 37th Annual Conference of the International Military Testing Association, pp. 418 - 423. Toronto, Ontario, Canada: Canadian Forces Applied Research Unit.

Stanton, J.M. (1998). An empirical assessment of data collection using the internet. Personnel Psychology, 51, 709-725.
 
