BlogAnhVu: A Study of the Effect of Direct Test Preparation on the TOEIC Scores of Japanese University Students

Saturday, November 8, 2008

A Study of the Effect of Direct Test Preparation on the TOEIC Scores of Japanese University Students

Source: http://tesl-ej.org/ej12/a2.html

Thomas N. Robb
Kyoto Sangyo University

Jay Ercanbrack
Kyoto Sangyo University

Abstract
In order to study the effect of direct test preparation on TOEIC gain scores, two samples of students (i.e., English majors and non-majors) at a Japanese university were divided into three treatment groups: 1) TOEIC Preparation, 2) Business English and 3) General (four-skills) English. The results indicate that usage of TOEIC preparatory materials led to a statistically significant gain on post-test scores for the non-majors' reading component only. The authors conclude that TOEIC preparatory materials are of little benefit to students enrolled in a comprehensive program of English language study, but might boost the score of the reading component of students enrolled in a university-level general English course in Japan.

Introduction

There is a clear tendency for students, not only in Japan, but around the world, to study for a test by reviewing past tests and concentrating their efforts on the types of language and test items that are known to appear on such tests. It is equally clear that if a test can be prepared for, then the test no longer can be said to measure general proficiency. Rather, it measures how well people have studied for the test.

Thus there is the inherent danger of the test becoming the tail that wags the curriculum dog. Henning expressed this concern thus: "If there is no concerted effort to subordinate testing to explicit curricular goals, there is an ever-present potential danger that tests themselves with all their inherent limitations will become the purpose of the educational encounter by default" (1990:380).

This study was designed to determine whether "teaching for the test" in a Japanese university setting does, in fact, result in higher test scores. Specifically, we set out to determine if students who use material designed for TOEIC test preparation or for "Business English" achieve higher gain scores than students who study an equal amount of time with standard language study materials.

Past Studies
The authors have found little previous research related to gain scores on the TOEIC test and only two studies, Alderson & Wall (1993) and Alderson & Hamp-Lyons (1996), concerning the TOEFL examination. These, however, were more concerned with the 'washback effect' of such test preparation on the actual content of classes and contained no objective data concerning the effects of coaching on subsequent test scores.

Fortunately, several studies have been conducted in relation to yet another standardized test, Educational Testing Service's SAT (Scholastic Aptitude Test), an exam which is administered to native-speaking high school students and is required as part of the application process for most American universities. Powers' (1993) "Coaching for the SAT: A Summary of the Summaries and an Update" is a thorough survey of such studies, particularly those which employ meta-analytical techniques to synthesize previous research. Much of what follows below has been drawn from this survey.

Preparation Programs for the SAT
Preparation programs for academic aptitude and language proficiency tests are currently abundant. Taken together, they constitute a vast industry within the private educational sector. Students are propelled toward such programs by a desire to succeed on tests where the perceived stakes are high. As noted by several researchers (e.g., Mehrens and Kaminsky 1989), the higher the stakes of a test, the greater the desire for guided test preparation and practice.

Yet, despite the great popularity of test preparation courses and programs, relatively little research has been done to document whether special preparation can have a markedly positive effect on test scores. A resolution of this issue is obviously crucial for the creators of standardized tests, as well as for the test-takers themselves. As mentioned, if preparation via coaching in test-taking techniques and strategies is found to be effective, it would indicate that test scores are not reliable indicators of academic ability or language proficiency, but rather reflect, at least to some degree, an ability to take tests. If such a situation exists, the validity of the tests is called into question.

Commercial coaching companies often report considerable gains in test scores by their clientele as proof of the effectiveness of their coaching programs. However, as Powers (1993) points out, variations in an individual's test scores from one test administration to another can be expected to occur, and for a variety of reasons. First, test score gains may be the result of a "practice effect", wherein test takers have a greater sense of comfort, familiarity, and confidence when retaking a test (referred to by Bachman, 1990, as "test-wiseness") than they possessed in their initial experience with the same exam. Such changes may also reflect growth in an individual's ability over time, rather than the direct influence of test coaching programs. Furthermore, variations in scores, whether increases or decreases, may be due simply to measurement error. Upon retesting, it is quite usual for some examinees to show large increases in scores, and for others to show large decreases. This phenomenon of regression to the mean was demonstrated in a study by Johnson et al. (1985). Examining an SAT coaching program, they noted that the gains or losses recorded by coached students varied greatly depending on their initial test scores. Students who scored lowest on their first encounter with the SAT tended to make the greatest gains upon retesting, while those who at first scored most highly were likely to make the smallest gains or largest drops in their second round scores.
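The regression-to-the-mean effect described above can be illustrated with a small simulation. The following is a minimal Python sketch, not drawn from Johnson et al. (1985): the score scale, the spread of "true" ability, and the size of the measurement error are all assumed values chosen only for illustration, and neither coaching nor real growth is modeled.

```python
import random
import statistics


def simulate_retest(n_students=2000, true_sd=80.0, error_sd=40.0, seed=0):
    """Simulate two test administrations with no coaching and no real growth.

    Each observed score is a fixed 'true' ability plus independent
    measurement error, so any change between administrations is noise.
    All numeric defaults are assumptions made for this illustration.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n_students):
        ability = rng.gauss(500.0, true_sd)
        first = ability + rng.gauss(0.0, error_sd)
        second = ability + rng.gauss(0.0, error_sd)
        records.append((first, second))
    return records


def mean_gain_by_quartile(records):
    """Return the mean gain (second - first) for the lowest and highest
    quartiles of students, ranked by their first score."""
    ordered = sorted(records, key=lambda r: r[0])
    q = len(ordered) // 4
    low = statistics.mean(second - first for first, second in ordered[:q])
    high = statistics.mean(second - first for first, second in ordered[-q:])
    return low, high


low_gain, high_gain = mean_gain_by_quartile(simulate_retest())
print(f"lowest quartile mean gain:  {low_gain:+.1f}")
print(f"highest quartile mean gain: {high_gain:+.1f}")
```

Because both administrations share the same underlying ability, the lowest initial scorers tend to show a positive mean gain and the highest initial scorers a negative one, even though nothing was taught in between.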

In contrast to the relative dearth of similar research concerning other large-scale, commercially available tests, several other studies have been conducted in the past two decades dealing with the effect of coaching on SAT scores (e.g., Becker 1990; DerSimonian and Laird 1983; Kulik, Bangert-Drowns and Kulik 1984; and Messick and Jungeblut 1981). As summarized by Powers (1993), and taking into account that simply repeating the test may lead to gains of around 15 points on the verbal section and 12 points on math (College Board 1991), these studies reveal that even the most well-known commercial coaching programs (e.g., Stanley Kaplan, Inc., and the Princeton Review) produce only modest score gains, typically 15 to 25 points each on the 200-800 point verbal and mathematical sections of the test. It has also been found that the scores of students who have undergone coaching receive a slightly higher boost on the math section than on the verbal section of the test (Becker 1990, Messick and Jungeblut 1981). In any case, when improvements in percentile ranking for coached students (usually consisting of only a few points) are considered, as suggested by Messick (1982), the gains made by coached SAT-takers appear to be meager. It seems clear then, at least in the case of the SAT, that the effects of coaching may fall far short of students' expectations.

Beyond the direct influence of coaching, differences in scores between coached and uncoached students may reflect other factors and tendencies associated with each group. Powers (1981) offers evidence to show that students who utilize formal or commercialized coaching services also make greater use of other test preparation resources than their uncoached peers. For example, they are more likely to conduct their own review of relevant subject matter, read supplementary test preparation books, and attend review sessions provided by their own schools. Such tendencies make it difficult, if not impossible, to objectively assess the impact that coaching programs alone may have on those who enroll in them.

One related question worth considering with regards to the test coaching issue is this: if coaching does have an effect (and it appears, as we have seen, to have at least some small impact on SAT scores), then what precisely is the reason for this effect? That is, what aspect of coaching is helping to raise test scores? One of the few studies to address this issue, again in the context of the SAT, was that of Johnson et al. (1985). It found that as a result of coaching, many test-takers were able to complete more items on both sections of the SAT. Since providing a correct answer on even a portion of the previously unmarked items would lead to higher overall scores, this newly developed ability was seen by the researchers as being the instrumental element in test score improvement for this set of coached students.

Though, as mentioned, few studies have analyzed the effect of coaching on language proficiency tests, one significant review (Kulik, Bangert-Drowns, and Kulik 1984) compares the SAT with a variety of other standardized aptitude tests, both academic and psychometric (e.g., GRE-Q, Stanford-Binet, WISC, etc.). It found that the effect of coaching for the other tests was much greater (approximately three times) than for the SAT. This may possibly be attributed to the preponderance of relatively simple test item formats found on the SAT. Item format is considered pivotal with regard to coaching since complex formats have been found to be more coachable than those of a simple nature (Powers 1986).

"Washback" Effects on Language Preparation Programs
While the literature on the effectiveness of preparation courses and programs for language proficiency tests is bleakly sparse, there are several studies which have concerned themselves with the washback effects of such tests on EFL/ESL classrooms (Wesdorp 1982; Hughes 1988; Khaniya 1990; Alderson and Wall 1993; and Alderson and Hamp-Lyons 1996, among others). Washback, a term popular in British applied linguistics and commonly referred to as backwash in the field of general education, may be understood as the influence that a test has on teaching and learning. The "Washback Hypothesis", as explained by Alderson and Wall, assumes that "teachers and learners do things they would not necessarily otherwise do because of the test" (1993:117).

The concept of washback presupposes a belief in the notion that tests are prominent determiners of classroom practices and events. Accordingly, the term itself is neutral in that the influence of a test may be either positive or negative in nature. That is, a poor test yields negative washback while a good test will have effects perceived as positive. As summarized by Alderson and Wall (1993), some of the negative effects tests have been suspected of producing include narrowing or distortion of the curriculum (Vernon 1956; Madaus 1988; Cooley 1991), loss of instructional time (Smith et al. 1989), reduced emphasis on skills that require complex thinking or problem-solving (Frederickson 1984; Darling-Hammond and Wise 1985) and test score pollution, meaning gains in test scores without a parallel improvement in actual ability in the construct under examination (Haladyna, Nolen, and Haas 1991).

In contrast, some researchers (e.g., Swain 1985 and Alderson 1986) emphasize the potential positive aspects of test influence and urge the creation of tests which, through constructive washback, will have enlightening effects on language curricula. Certain researchers (e.g., Morrow 1986 and Frederickson and Collins 1989) have suggested that a test's validity should be determined by the degree to which it has a positive influence on teaching. Morrow (1986) refers to this as washback validity, while Frederickson and Collins (1989) have introduced the term systemic validity to refer to a similar process.

Remarkably, while claims of washback and its effects, both positive and negative, are numerous in educational literature, Alderson and Wall (1993) point out that little empirical evidence has been provided to support the argument that tests do indeed influence teaching; that is, that washback actually exists. Assertions concerning washback in past studies have been based largely on anecdotal evidence, primarily opinions and impressions gathered from teachers and administrators.

To amend this lack of objective data, Alderson and Hamp-Lyons (1996) set out to investigate the existence and extent of washback in one educational setting. Using a combination of classroom observations and interviews with teachers and students, they targeted preparation classes for the English proficiency test TOEFL (Test of English as a Foreign Language), an examination of particular importance to non-native speakers of English interested in entering degree programs at American universities. Two teachers received extensive observation in both their TOEFL preparation and regular English classes, and an attempt was made to separate the effects of individual teacher style from TOEFL washback.

The study did not investigate the question of whether or not TOEFL preparation courses were effective in raising scores on the test and, to this date, no previous research of this kind appears to have been conducted. Only the processes of teaching and learning were observed and examined in order to determine the extent of TOEFL washback in this setting. The authors concluded that the TOEFL did indeed affect both what and how teachers taught, but that the effect differed in degree and kind from teacher to teacher. More importantly, they suggested that it is not a test alone that causes washback, but the way that test is approached by administrators (who may determine the necessity of large class sizes), materials writers (who may fail to give proper guidance to teachers on possible ways to teach with a certain set of materials), and teachers themselves (who may devote little energy to finding alternative or innovative ways to teach test preparation classes) which actually creates the phenomenon of washback for a given language proficiency test.

It is, then, against this broad backdrop of hypothesis and information concerning washback, generally, and the effects of coaching on test scores, specifically, that the present study was conducted.

Experimental Design
The study was carried out with two distinct samples of freshman students at Kyoto Sangyo University: English majors (henceforth 'Majors') in the Faculty of Foreign Languages and Non-Majors from other faculties of the university taking freshman English courses offered by the school's English Language Education and Research Center. These two samples will be treated separately since there are important differences between them that make integration of the data unwise:

1. Contact hours/week. The majors were taking seven 90-minute classes per week in English. These students were pseudo-randomly assigned by the school to one of 8 sections ("kumi" in Japanese). Students in a particular section at Kyoto Sangyo University take classes together in all but two of their seven 'practical' courses. It was not feasible to vary the content of all of the courses, so only two courses, "Extensive Reading" and "Listening/Pronunciation", were used in the experimental design. The other five courses taken by the majors were "Intensive Reading", "Grammar", "Composition", "Conversation" and "General Cultural Studies". An assumption was made that the content of the other courses would be roughly similar and would therefore not jeopardize the validity of the study. Two of the sections, 7 & 8, were English majors with a specialization in International Relations. These students had a slightly different program with a content course instead of grammar. The results with these classes both included and excluded were essentially the same, so this minor difference will henceforth be ignored.

Most, but not all, non-majors had two classes per week, one of which, "Applied English", was part of this study. The other class, "Reading", was a traditional reading class which concentrated on the careful reading and understanding of short passages of text. While it would have been better if this class, too, had been included in the study, this was not feasible. Since all students were receiving a like amount of this reading practice, however, this additional class should have made little difference in the overall outcome. The Applied English class met for a maximum of 27 class sessions during the school year, for a maximum of 40.5 hours of contact time.

2. Level of English. The initial level of the majors was considerably higher than that of the non-majors, as would be expected. One implication of this was that the same materials could not be used for both sets of students in most cases.

3. Motivation. The majors, having chosen English as their primary area of study for the next 4 years, could be assumed to be more interested in English and more highly motivated than the non-majors.

4. Homework. The English majors were much more likely to do home assignments. This was not so much a matter of intrinsic motivation as a consequence of the fact that their English courses were required. If they had failed to meet the instructor's expectations, they would have had to repeat the course. This, in turn (depending on the number of other failures), might have set back their year of graduation. For the non-majors, the course was not required. If they failed, they could take courses in a variety of other subjects to garner sufficient credits for graduation.

Hypothesis
The null hypothesis was that the gain scores of all students, regardless of method of study, would be equal.

Initial Setup
Class Configurations
The researchers had no control over the composition of the classes. The majors in groups 1 through 6 (English--Language & Culture) were assigned pseudo-randomly by the University administration. Groups 7 and 8 (English--International Relations) were assigned alphabetically.

For the non-majors, the students are grouped according to their desired second foreign language. For example, all Business Majors who desired to study French were formed into one or more classes. Since there are normally too many students for a single class, the students are further divided into multiple classes according to their total score on the university's entrance examination. For this experiment, the instructors were assigned to six of the classes which contained the highest ranking students for certain combinations of major + second foreign language.

As can be seen from Table 1, the Non-major design has a neat 3 x 2 arrangement (treatments x instructors) while there are only two instances of "major" instructors teaching two sections for the entire period. In both of these cases, the instructor has classes of the same treatment.

--------------------------------------------------------------------------------
Table 1 -- Treatment Groups
Majors (Treatment & Instructor)

Section   Reading Class   Listening Class
1         TOEIC A         TOEIC H
2         TOEIC B         TOEIC H/I
3         TOEIC C         TOEIC I
4         General D       General J
5         Business E      Business K
6         General F       General L
7         General F       General M
8         Business G      Business K

Non-majors (Treatment, Instructor & Major; Listening & Reading in one class)

Class     Treatment & Instructor   Major
101       General X                Business Majors
133       TOEIC X                  Economics Majors
141       Business X               Law Majors
218       General Y                Business Majors
292       TOEIC Y                  Engineering Majors
228       Business Y               Economics Majors

--------------------------------------------------------------------------------
Prior Information Provided to the Students
The students were informed both in the course catalog and in their first class that a basic purpose of the course (regardless of treatment) was to achieve a high score on the TOEIC examination. Students were told that 20% of their final mark for the course would be based on the improvement in their test scores between the initial TOEIC examination (in May) and the final examination in January.

Preparation for Pre-test
A shortened, demonstration version of the TOEIC (Form MT-93) was administered to all sections as a class activity one to two weeks prior to the actual pre-test. A list of 'hints' and test-taking strategies was also provided to all students in Japanese.

This preparation was considered important as a way to partially offset the "practice effect", mentioned previously, whereby students generally score higher on second and successive administrations of a test merely due to greater familiarity with the test itself.

In this study we were more interested in assessing the effect of variation in the teaching of language content than in differences arising from the acquisition of test-taking strategies. While it was inevitable that the students in the 'TOEIC' treatment would have more exposure over the course of the year to such test-taking strategies, we felt that this could be offset to some degree by familiarizing all students with the examination beforehand.

Pre- and Post-tests
The pre-test was administered on May 11, 1996, approximately one month after the start of the school year. For administrative reasons, it was impossible to schedule the test any earlier. The instructors in the TOEIC treatment, in particular, were instructed to avoid any classwork which could be termed "TOEIC test preparation" until the test was over. The 'general' and 'business' treatments had no other exposure to TOEIC-type questions during the course of the experiment, save for the one sample test and the actual pre- and post-tests.

The post-test was administered on January 18, 1997, which was the day after the conclusion of the final term. The pre-test had revealed some slightly significant differences in certain test groups (at the 0.05 level). It was decided to compensate for these differences in the final analysis by using an analysis of covariance (ANCOVA). Possible intervening variables such as age, club activities, etc. were also taken into account. The results of the pre- and post-test are presented below in "Results and Analysis."
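As a rough illustration of how an analysis of covariance adjusts for pre-test differences, the Python sketch below compares the error sum of squares of a model that ignores treatment with that of a model that includes it, using the pre-test as a single covariate with a common slope. It is a simplified sketch, not a reproduction of the Systat procedure used in the study; the function name and all of the scores are hypothetical.

```python
def ancova_one_covariate(groups):
    """One-way ANCOVA with a single covariate and a common slope.

    groups maps a treatment label to a list of (pre, post) score pairs.
    Returns (F ratio for treatment, df for treatment, df for error).
    """
    def centered_sums(pairs):
        # Sums of squares and cross-products about the means of `pairs`.
        n = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n
        mean_y = sum(y for _, y in pairs) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
        syy = sum((y - mean_y) ** 2 for _, y in pairs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        return sxx, syy, sxy

    everyone = [pair for pairs in groups.values() for pair in pairs]
    n_total, k = len(everyone), len(groups)

    # Reduced model: post-test regressed on pre-test, ignoring treatment.
    sxx_t, syy_t, sxy_t = centered_sums(everyone)
    sse_reduced = syy_t - sxy_t ** 2 / sxx_t

    # Full model: treatment plus pre-test, pooling within-group sums.
    sxx_w = syy_w = sxy_w = 0.0
    for pairs in groups.values():
        sxx, syy, sxy = centered_sums(pairs)
        sxx_w += sxx
        syy_w += syy
        sxy_w += sxy
    sse_full = syy_w - sxy_w ** 2 / sxx_w

    df_treat = k - 1
    df_error = n_total - k - 1
    ss_treat = sse_reduced - sse_full  # adjusted treatment sum of squares
    f_ratio = (ss_treat / df_treat) / (sse_full / df_error)
    return f_ratio, df_treat, df_error


# Hypothetical (pre, post) scores, invented purely for illustration.
data = {
    "Business": [(300, 310), (350, 365), (400, 405), (450, 470)],
    "General": [(310, 325), (360, 365), (410, 430), (460, 465)],
    "TOEIC": [(305, 360), (355, 400), (405, 460), (455, 500)],
}
f_ratio, df_treat, df_error = ancova_one_covariate(data)
print(f"F({df_treat}, {df_error}) = {f_ratio:.2f}")
```

The treatment effect is judged on what remains after the pre-test has explained as much of the post-test variance as it can, which is why initial differences between groups do not by themselves produce a significant result.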

Conduct Of The Courses
English Majors
Two of the once-weekly classes of the English majors were used for the experiment, their extensive reading class (I-B) and their listening/pronunciation class (I-F). For the extensive reading course, students in all groups were required to read over 1000 pages a year and to write summaries of what they read in a notebook kept for that purpose. Thus the three treatments were only different in the materials used for the in-class component.

Texts for each course were selected according to the following factors:

The material needed to be relevant to the specific treatment ('General', 'TOEIC' or 'Business' English).
The material had to be targeted at the appropriate ability level for the students in that particular course. In general, less challenging material was required for the non-major groups.
There had to be a sufficient volume of material for the number of class hours and expected hours of homework assignments.
For the listening component, we required a text that was accompanied by tapes that the students could listen to at home. (A special license was arranged with each publisher to duplicate their tapes for a modest fee.)
For most of the treatments, the cost of the required materials was greater than students would normally be willing to pay for a university course bearing only 2 credits. Students were thus required to pay a maximum of 2,500 yen per course, the rest of the expense being subsidized by grant funds from TOEIC.

General English Treatment (Control)
Reading: For classwork, the students used the SRA "Reading Laboratory 2c" exclusively. There was no 'up front' instruction from the teacher. This is the material normally used in this course for the in-class component of the I-B extensive reading course.

Listening: The students used the following texts.
Practice with English Reduced Forms (Sanshusha)
Eigo Onseigaku no Kiso (Basic English Phonetics) (Kenkyusha)

TOEIC Treatment
The main text, On Target for the TOEIC (Longman) was shared between the reading and listening sections, each of which did the sections relevant to their particular course. This text was chosen over other available texts because it contained the most 'pedagogical material' in addition to the ubiquitous practice items and vocabulary lists. Copies of the audio tape for this text were distributed to all students. There was also a Japanese language companion text with notes, translations of vocabulary, etc. In addition, the following texts were used in order to provide a sufficient volume of material for the year course:

Reading:
Improving Your Pronunciation (Meirindo, with tape)
Building Skills for TOEIC (Pifer)
TOEIC Kisokara Gambare! - Vocabulary
Listening:
TOEIC Kisokara Gambare! - Listening (with tape)
Practice with English Reduced Forms

Business Treatment
Business Objectives (Oxford) was shared between the reading and listening sections, each doing the relevant parts. Copies of the audio tape for this text were distributed to all students. Other supplementary texts used were:

Reading:
English by Newspaper (Heinle & Heinle)
Listening:
Business Venture

English by Newspaper: While this is not a business text, it was adopted for two reasons: 1) there were no 'business reading' texts available for students at the low-intermediate level, and 2) the material in "English by Newspaper" contained numerous articles in business and related fields.

Non-majors
General English Treatment (Control)

Main Text:

High Impact (Longman) + Workbook & Tapes
High Impact is a four-skills text written primarily for Japanese 'false beginners.' It was targeted at a lower level than the corresponding text used for the English majors, since it was assumed (correctly) that the non-majors' general level of English proficiency would be considerably lower.

TOEIC Treatment
Main Text:

On Target for the TOEIC (Addison-Wesley)
Japanese companion text + Tapes
TOEIC Kisokara Gambare! (Vocabulary)
Business Treatment
Main Text:

Business Basics (Oxford) + Tapes
This text was selected for similar reasons to High Impact -- it was deemed to be targeted at the correct level for non-major students.

The content of each course was thus dictated by the assigned texts. No direct control was exerted over the instructors to conform to a specific lesson plan. The instructors, however, were encouraged to coordinate their class with the other instructors teaching the same materials, either through face-to-face meetings or by keeping a log into which each could report their progress. (The teachers for the TOEIC treatment reading classes, in particular, taught on different days and thus rarely had a chance to meet each other in person.)

Quizzes
Quizzes were prepared for each of the taped listening sections of each text and for the "TOEIC Kiso Kara Gambare" vocabulary text. The teachers were to use them to check how well the students had prepared the assigned homework. (Japanese students are apt to ignore homework when there is no specific way to assess whether they have done it or not.) These quiz scores were centrally recorded, although differences in frequency and manner of administration rendered them unusable for purposes of this study. Nevertheless, they were important as a motivational tool and as an element of the final grade given to each student.

Questionnaire
A questionnaire was administered to all students at the end of the year to 1) gather statistical data on other activities which might have influenced their progress in English and 2) measure their attitudes towards different aspects of the course. In particular, responses concerning additional English classes taken and prior overseas experience proved meaningful for proper interpretation of the test data. The questionnaire and its analyses are presented in Appendix B.

--------------------------------------------------------------------------------

Table 2 -- Net Gain in Scores by Treatment

Non-Majors
                           TOTGAIN   LISTGAIN   READGAIN
Business   N OF CASES           53         53         53
           MEAN GAIN         6.415     -6.981     13.396
           STANDARD DEV     71.403     49.907     40.641
General    N OF CASES           46         46         46
           MEAN GAIN        12.609      0.326     12.283
           STANDARD DEV     77.386     48.217     51.020
TOEIC      N OF CASES           50         50         50
           MEAN GAIN        53.300      5.400     47.900
           STANDARD DEV     80.930     44.845     51.844

Majors
                           TOTGAIN   LISTGAIN   READGAIN
Business   N OF CASES           60         60         60
           MEAN GAIN        60.917     31.333     29.583
           STANDARD DEV     58.226     47.299     33.119
General    N OF CASES           83         83         83
           MEAN GAIN        52.349     28.795     23.554
           STANDARD DEV     59.614     42.667     41.647
TOEIC      N OF CASES           73         73         73
           MEAN GAIN        80.000     40.479     39.863
           STANDARD DEV     64.253     46.091     43.922

--------------------------------------------------------------------------------
Results & Analysis
Table 2 presents the resulting gain scores (post-test scores minus pre-test scores). The actual pre-test and post-test scores are presented in Appendix A.
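For readers who wish to produce this kind of summary from raw scores, the following Python sketch computes gain scores and the three statistics reported per cell in Table 2. The scores are invented for illustration, the function name is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption about how the original figures were computed.

```python
import statistics


def summarize_gains(pre_scores, post_scores):
    """Return N, mean gain and sample standard deviation of the gains,
    where each gain is a post-test score minus the matching pre-test score."""
    gains = [post - pre for pre, post in zip(pre_scores, post_scores)]
    return {
        "n_of_cases": len(gains),
        "mean_gain": statistics.mean(gains),
        "standard_dev": statistics.stdev(gains),  # assumes the n - 1 form
    }


# Hypothetical pre- and post-test totals for four students.
pre = [300, 350, 400, 450]
post = [310, 340, 430, 480]
summary = summarize_gains(pre, post)
print(summary)
```

A negative mean gain, such as the listening figure for the non-major Business group, simply means that the group's average post-test score was lower than its average pre-test score.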

Analysis Of Variance (Non-majors)
Systat version 5.0 was used to perform an analysis of variance on the data. In order to save space, only the most useful data are reported below. For each population (Majors and Non-majors), tests were performed on the scores from the January 18 administration (coded as '118'), using the baseline score as a covariate (May 11 administration, coded as '511'). Data from the questionnaire were used as potential intervening variables.

Results for Non-majors
Total Score
A slightly significant effect was found (p=0.039) for the treatment when only the pre-test was used as a covariate in the analysis, but once the students' response to the question concerning outside English study (OTHERCL) was taken into consideration, this significant difference disappeared (p=0.092).

--------------------------------------------------------------------------------

Table 3 -- Analysis of Total Scores (TOTAL118) for Non-majors

DEP VAR: TOTAL118   N: 149   MULTIPLE R: 0.547   SQUARED MULTIPLE R: 0.300

ANALYSIS OF VARIANCE
SOURCE     SUM-OF-SQUARES    DF   MEAN-SQUARE   F-RATIO       P
TREAT$          34003.100     2     17001.550     3.331   0.039
TOTAL511       309338.355     1    309338.355    60.601   0.000
ERROR          740151.649   145      5104.494

Analysis of TOTAL118 (Non-majors) with the intervening variable 'OTHERCL' (Other English classes) added to the equation.

DEP VAR: TOTAL118   N: 127   MULTIPLE R: 0.618   SQUARED MULTIPLE R: 0.382
(22 cases deleted due to missing data -- No questionnaire)

ANALYSIS OF VARIANCE
SOURCE     SUM-OF-SQUARES    DF   MEAN-SQUARE   F-RATIO       P
TREAT$          23321.456     2     11660.728     2.429   0.092
TOTAL511       245203.451     1    245203.451    51.069   0.000
OTHERCL         74208.663     1     74208.663    15.456   0.000
ERROR          585769.217   122      4801.387

--------------------------------------------------------------------------------
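The F-ratio for the treatment effect can be checked by hand from the sums of squares and degrees of freedom reported in Table 3 (first analysis, with the pre-test as the only covariate). The short Python sketch below does that arithmetic; the function name is illustrative, and the p-value is not recomputed because that would require the F distribution.

```python
def f_ratio_from_table(ss_effect, df_effect, ss_error, df_error):
    """Return (mean square for the effect, mean square for error, F ratio)
    computed from sums of squares and their degrees of freedom."""
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return ms_effect, ms_error, ms_effect / ms_error


# TREAT$ and ERROR rows of Table 3, first analysis (N = 149).
ms_treat, ms_error, f_value = f_ratio_from_table(34003.100, 2, 740151.649, 145)
print(f"MS treatment = {ms_treat:.3f}")
print(f"MS error     = {ms_error:.3f}")
print(f"F ratio      = {f_value:.3f}")
```

Each mean square is a sum of squares divided by its degrees of freedom, and the F-ratio is the treatment mean square divided by the error mean square, so the same check can be applied to any row of the analysis tables.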
Listening Sub-test Only
No significant difference appeared between the groups (p=0.787).

--------------------------------------------------------------------------------

Table 4 -- Analysis of Listening Scores (LIST118) for Non-majors

DEP VAR: LIST118   N: 149   MULTIPLE R: 0.392   SQUARED MULTIPLE R: 0.153

ANALYSIS OF VARIANCE
SOURCE     SUM-OF-SQUARES    DF   MEAN-SQUARE   F-RATIO       P
TREAT$            986.725     2       493.362     0.240   0.787
LIST511         53887.223     1     53887.223    26.257   0.000
ERROR          297586.783   145      2052.323

--------------------------------------------------------------------------------
Reading Sub-test Only
A highly significant difference (p=0.002) appeared between the groups when only the pre-test was taken into consideration. Once the intervening variables 'CLUB' (participation in a club for studying English), 'OSEAS' (overseas experience) and 'OUTSIDE' (language classes outside the university) were added to the equation, the p-value rose tenfold, although the difference remained significant at p=0.02. A post hoc Scheffé test was then performed to determine where the significant difference lay. As indicated in Table 5, the results of the TOEIC treatment group turned out to be significantly different from only the General treatment group.

--------------------------------------------------------------------------------

Table 5 -- Analysis of Reading Scores (READ118) for Non-majors

DEP VAR: READ118   N: 149   MULTIPLE R: 0.586   SQUARED MULTIPLE R: 0.343

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO        P
TREAT$           21184.068     2      10592.034      6.435    0.002
READ511         116316.437     1     116316.437     70.661    0.000
ERROR           238685.870   145       1646.109

DEP VAR: READ118   N: 127   MULTIPLE R: 0.653   SQUARED MULTIPLE R: 0.426

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO        P
TREAT$           14030.878     2       7015.439      4.060    0.020
READ511          99584.655     1      99584.655     57.630    0.000
CLUB               805.702     1        805.702      0.466    0.496
OUTSIDE            753.678     1        753.678      0.436    0.510
OSEAS             7172.884     1       7172.884      4.151    0.044
ERROR           207360.867   120       1728.007

Post Hoc Test of READ118
USING MODEL MSE OF 1604.609 WITH 120. DF.

MATRIX OF PAIRWISE MEAN DIFFERENCES:
            BUSIN     GEN'L     TOEIC
BUSIN       0.000
GEN'L      -4.198     0.000
TOEIC      21.141    25.339     0.000

SCHEFFE TEST.
MATRIX OF PAIRWISE COMPARISON PROBABILITIES:
            BUSIN     GEN'L     TOEIC
BUSIN       1.000
GEN'L       0.898     1.000
TOEIC       0.079     0.032     1.000

--------------------------------------------------------------------------------

Results For Majors
Table 6 presents the analysis of variance for the Total, Listening, and Reading scores for the Majors. None showed a significant difference at the criterion level of 0.05. We can therefore state that, for the Majors, test scores did not differ significantly by treatment.

--------------------------------------------------------------------------------

Table 6 -- Analysis of Scores for Majors

DEP VAR: TOTGAIN   N: 216   MULTIPLE R: 0.421   SQUARED MULTIPLE R: 0.177

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO        P
TREAT$           17884.770     2       8942.385      2.811    0.062
TOTAL511        519685.583     1     519685.583    163.389    0.000
ERROR           674302.491   212       3180.672

DEP VAR: LIST118   N: 216   MULTIPLE R: 0.539   SQUARED MULTIPLE R: 0.291

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO        P
TREAT$            2951.144     2       1475.572      0.879    0.417
LIST511         140482.164     1     140482.164     83.670    0.000
ERROR           355946.662   212       1678.994

DEP VAR: READ118   N: 216   MULTIPLE R: 0.581   SQUARED MULTIPLE R: 0.338

ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES    DF    MEAN-SQUARE    F-RATIO        P
TREAT$            5029.187     2       2514.593      2.071    0.129
READ511         130148.453     1     130148.453    107.167    0.000
ERROR           257463.134   212       1214.449

--------------------------------------------------------------------------------

Discussion
Our hypothesis was that the gain scores of all students, regardless of method of study, would be equal. This was confirmed in all but one instance: Non-Major students given the TOEIC treatment showed a significant gain on the Reading Section compared to those who studied using regular materials or business materials. The students of one instructor for the Non-Majors actually demonstrated a gain of 77 points overall, with 56 of them in the reading section (Table 7).

--------------------------------------------------------------------------------

Table 7 -- Mean Gains for Non-Major Sections, By Instructor

                            TOTAL    LISTENING    READING
Instructor X
  Business   MEAN GAIN      -11.4        -21.7       10.3
  General    MEAN GAIN       -8.2         -8.2        0.0
  TOEIC      MEAN GAIN       24.7        -13.3       38.0
Instructor Y
  Business   MEAN GAIN       27.9         10.8       17.1
  General    MEAN GAIN       27.2          4.6       30.9
  TOEIC      MEAN GAIN       77.6         21.3       56.3

--------------------------------------------------------------------------------

Although the gains for one of the instructors for Non-Majors are considerably greater than those of the other instructor for this group, the pattern is similar in that the Reading Section always shows a greater gain than the Listening Section, and the TOEIC treatment shows a greater gain than the other two treatments, which are similar in their total gain scores. No differences emerged on the follow-up questionnaire (Appendix B) which would account for this difference in the results.
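The two patterns described above can be checked directly against the figures printed in Table 7. The sketch below is illustrative only: the dictionary layout and the function names are our own, and the tolerance of 0.2 points is an assumed allowance for rounding in the published means.

```python
# Mean gains as printed in Table 7: (total, listening, reading)
GAINS = {
    "X": {
        "Business": (-11.4, -21.7, 10.3),
        "General": (-8.2, -8.2, 0.0),
        "TOEIC": (24.7, -13.3, 38.0),
    },
    "Y": {
        "Business": (27.9, 10.8, 17.1),
        "General": (27.2, 4.6, 30.9),
        "TOEIC": (77.6, 21.3, 56.3),
    },
}


def reading_beats_listening(gains):
    """True if the reading gain exceeds the listening gain in every section."""
    return all(
        reading > listening
        for sections in gains.values()
        for (_total, listening, reading) in sections.values()
    )


def toeic_has_largest_total(gains):
    """True if, for each instructor, the TOEIC section has the largest total gain."""
    return all(
        sections["TOEIC"][0] > max(sections["Business"][0], sections["General"][0])
        for sections in gains.values()
    )


def rows_not_summing(gains, tol=0.2):
    """List (instructor, treatment) rows where total != listening + reading within tol."""
    return [
        (instructor, treatment)
        for instructor, sections in gains.items()
        for treatment, (total, listening, reading) in sections.items()
        if abs(total - (listening + reading)) > tol
    ]


print(reading_beats_listening(GAINS))
print(toeic_has_largest_total(GAINS))
print(rows_not_summing(GAINS))
```

Both pattern checks hold for the printed figures. The last check flags one row, Instructor Y's General section, where the printed total (27.2) does not equal the sum of the listening and reading gains (4.6 + 30.9); the table alone does not indicate which cell is misprinted, and neither pattern depends on it.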

It is also clear that whatever gains there might have been with the majors were 'washed out' by the many other courses which they were taking concurrently. Some might claim that it would have been wiser to alter the content of all classes during the week so that clearer results could have been obtained. This, however, would have resulted in an artificial curriculum, one which would not normally exist at a university. Since English majors would take TOEIC preparation as only one element of their course of study, our model closely approximates a possible implementation.

One surprising result is that the Non-Major students, with the exception of Instructor Y's TOEIC section, improved very little over the course of the year and in some cases, even showed 'negative gain'. This can be taken as a testament to the poor attitude of Japanese university students towards their 'general education' subjects. The instructors reported that they could assign little homework since there was little expectation that the students would actually do it. Thus most of the students' exposure was limited to the 26 class meetings. It appears that the activities carried out in class did not, for many students, result in any real 'learning' that could be translated into improved TOEIC scores.

Even with the one section that did show great improvement, we cannot ascertain how much of this gain can be attributed to greater 'test wiseness' as opposed to greater knowledge of English. It would appear, however, that a greater knowledge of the schema of the written genre appearing on the TOEIC examination might have been a significant factor. This and other possible causes are discussed in the following section.

Improvement in Reading vs Listening
Although all groups generally showed improvement over the course of the year, one salient difference between the Non-Majors and Majors lies in where the improvement took place. With the Majors, the improvement in the Listening and Reading scores was almost equal, whereas with the Non-Majors, there was little gain in the listening component (-6.9, 0.3 and 5.4 for the three treatments) and a greater rate of improvement in the reading section (13.3, 12.2 and 47.9). There was little improvement in listening even though both of the instructors used English as the medium of instruction. We can tentatively postulate the following reasons for this:

The students did not spend much time at home listening to the tapes. This is supported by their responses to the statement "I listened to the tapes at home regularly", where the average of their self-reports was under 3.0, with '5' meaning 'agree' and '1' meaning 'disagree' (Figure 1).

Figure 1

The 'teacher talk' may have been significantly different in its essential nature from the language used on the TOEIC and therefore of little help in improving their scores on the listening test items.
The main text for the 'TOEIC' group, On Target for the TOEIC, was used in the order in which the material is presented in the book, where the listening material precedes the reading material. At the time of the post-test, then, the students had just completed the reading section, while the listening section had been finished months earlier.
The Listening section is administered before the Reading section in the actual TOEIC examination. Since the students at the beginning of the year were not familiar with the test, it could well be that they tired towards the end. This would have resulted in poorer performance on the pre-test and a measurement which under-estimated their actual ability in reading (one result of the 'practice effect').
Concerning the TOEIC treatment which demonstrated the largest gain on the reading section, one of the instructors noted a positive reaction among the students to one particular section of the text which dealt with finding details in written texts such as letters and advertisements. It appears that this kind of material was completely new to them and, indeed, it is not part of the general high school English curriculum in Japan. Thus not only did they actually learn some important skills here, but these are skills which were not covered to the same extent in the Business or General treatments' texts, although both of them did also contain some letters as part of their instructional content.
One instructor reported that he had heavily stressed the fact that improvement in the TOEIC score would be an important factor in their final grade. Actually, this policy applied to students in all treatments, and they were informed of this in the course prospectus at the beginning of the year. It seems, however, that this point may have received more emphasis in at least one of the TOEIC treatments.
Concerning the Majors, they achieved similar gains on both sections due to the more balanced curriculum, as explained in the presentation of the experimental design in the section entitled "Contact Hours/Week" above.

Conclusion
While this study seems to suggest that TOEIC materials can be effective for improving the reading component scores of non-major students at a Japanese university, our results are by no means conclusive. The non-major students, for example, had initial scores far below those of the English majors. It could be that students in this low score range can benefit more from such instruction than can those at a higher level of ability.

Further, the TOEIC course was a substitute for the standard general English course which might have placed greater emphasis on English for communicative purposes. Forcing students to study TOEIC preparatory material might, therefore, be doing them a disservice if communicative ability is the goal of the program.

Care needs to be taken when applying these findings to other teaching situations. Further studies are required to confirm whether these results apply to students of other nationalities, to differing levels of ability or motivation, or to other educational settings such as in-company training programs and language schools.

References
Alderson, J. Charles, (1986). Innovations in Language Testing, in Portal, M. (ed.), 93-105.

Alderson, J.C. & Wall, D. (1993). Does washback exist? Applied Linguistics, 14, 115-129.

Alderson, J.C. & Hamp-Lyons, L. (1996). "TOEFL preparation courses: a study of washback." Language Testing 13, 3, 280-297.

Amer, Aly Anwer (1993). "Teaching EFL students to use a test-taking strategy." Language Testing 10, 1, 71-78.

Bachman, Lyle F., (1990). Fundamental Considerations in Language Testing, Oxford University Press, Oxford.

Becker, B. J. (1990). Coaching for the Scholastic Aptitude Test: Further synthesis and appraisal. Review of Educational Research, 60, 373-417.

Cooley, W.W. (1991). Statewide student assessment. Educational Measurement: Issues and Practice 10, 3-6.

Darling-Hammond, L. and Wise, A.E., (1985). Beyond standardization: state standards, and school improvement. The Elementary School Journal 85, 315-36.

DerSimonian, R. and Laird, N. M. (1983). "Evaluating the effect of coaching on SAT scores: A meta-analysis." Harvard Educational Review, 53, 1-15.

Frederickson, J.R. (1984). The real test bias: influences of testing on teaching and learning. American Psychologist 39, 193-202.

Frederickson, J.R. and Collins, A. (1989). A systems approach to educational testing. Educational Researcher 18, 27-32.

Haladyna, T.M., Nolen, S.B. and Haas, N.S. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher 20, 2-20.

Henning, Grant (1990). "Priority Issues in the Assessment of Communicative Language Abilities", Foreign Language Annals, 23:5 October 1990, 379-384.

Hughes, A. (1988). Introducing a needs-based test of English language proficiency into an English-medium university in Turkey. In Hughes, A., ed., 134-153.

Hughes, A., ed. (1988). Testing English for university study. ELT Document 127, London: Modern English Publications.

Johnson, S. T., Asbury, C. A., Wallace, M. B., Robinson, S. & Vaughn, J. (1985). "The effectiveness of a program to increase Scholastic Aptitude Test scores of Black students in three cities." Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, April 1985.

Khaniya, T.R. (1990). Examinations as instruments for educational change: investigating the washback effect of the Nepalese English exams. Unpublished PhD dissertation, University of Edinburgh.

Kulik, J.A., Bangert-Drowns, R.L. & Kulik, C.C. (1984). Effectiveness of coaching for aptitude tests. Psychological Bulletin, 95, 179-188.

Madaus, G.F. (1988). The influence of testing on the curriculum. In Travers, L., editor, Critical issues in curriculum (87th yearbook of the Society for the Study of Education), Part 1, Chicago, IL: Chicago University Press, 83-121.

Messick, S. and Jungeblut, A. (1981). Time and method in coaching for the SAT. Psychological Bulletin, 89, 191-216.

Morrow, K. (1986). The evaluation of tests of communicative performance, in Portal, M. (ed.).

Portal, M. (ed.). Innovations in Language Testing. London: NFER/Nelson.

Powers, Donald E. (1993). Educational Measurement: Issues and Practice, Summer 1993, 24-31.

Smith, M.L., Edelsky, C., Draper, K., Rottenberg, C. and Cherland, M. (1989). The role of testing in elementary schools. Los Angeles, CA: Center for Research on Educational Standards and Student Tests, Graduate School of Education, UCLA.

Swain, M. (1985). Large-scale communicative testing, in Lee, Y.P., Fok, C.Y.Y., Lord, R. and Low, G. (eds), New Directions in Language Testing. Hong Kong: Pergamon Press.

Vernon, P.E. (1956). The Measurement of Abilities (2nd edn.) London: University of London Press.

Wall, D. and Alderson, J.C. (1993). Examining Washback: the Sri Lankan Impact Study, Language Testing 10, 41-70.

Wesdorp, H. (1982). Backwash effects of language testing in primary and secondary education. Stichting Centrum voor onderwijsonderzoek van de Universiteit van Amsterdam.

Acknowledgements
We would like to acknowledge the generous assistance of Steve Ross, who offered advice at every stage of this project, from its inception to the final report.

We would like to thank the following organizations for their assistance with this research:

The Chauncey Group International Ltd. and IIEC (Japan) for the funding that made this research possible;

Oxford University Press and Addison-Wesley/Longman for generously allowing us to duplicate the tapes of their texts locally at a reduced rate for research purposes.

About the Authors
Thomas N. Robb received his PhD in Linguistics from the University of Hawai'i. He is a professor in the Department of English, Faculty of Foreign Languages, Kyoto Sangyo University. He is a past president of JALT and former member of the TESOL Executive Board.

Jay Ercanbrack received his M.A. in ESL from the University of Hawai'i. He is an Associate Professor in the English Language Education and Research Center at Kyoto Sangyo University. He regularly serves as an examiner for the British Council, Kyoto.