Testing for Language Teachers, A. Hughes, Cambridge Handbooks for Language Teachers, 1989

Summary notes

backwash – The effect of testing on teaching and learning. Backwash may be either positive or negative. For example, a test with a large section on describing a chart/graph will probably lead teaching to over-focus on describing charts/graphs.

Unreliability has two origins: features of the test itself (unclear instructions, ambiguous questions, items that require too much guesswork), and the way it is scored.

A test which is good for one purpose may be useless for another purpose.

TYPES OF TESTS:

proficiency tests – Designed to measure language ability regardless of course content or previous training. However, what counts as ‘proficient’?

achievement tests (final achievement tests, progress tests) – Achievement tests are directly related to language courses.

diagnostic tests – Used to identify students’ strengths and weaknesses.

placement tests – Used to assign students to a class level.

…………………………………………………………………….

DIRECT TESTING VS. INDIRECT TESTING

direct testing – Testing is said to be direct when it requires the candidate to perform precisely the skill which we wish to measure. If we want to know how well a student can write an essay, we get them to write an essay. Direct testing is often limited to a rather small sample of tasks.

indirect testing – Attempts to measure the abilities which underlie the skills in which we are interested. For example, Lado (1961) proposed a test of pronunciation ability through a pen and paper test of matching pairs of words that rhyme with each other.

……………………………………………………………………

DISCRETE POINT VS. INTEGRATIVE TESTING

discrete point testing – Refers to the testing of one element at a time, item by item (almost always indirect testing).

integrative testing – Requires the candidate to combine many language elements in the completion of a task (tends to be direct testing).

………………………………………………………………………

NORM-REFERENCED VS. CRITERION-REFERENCED TESTING

See the Baxter post for more detail.

………………………………………………………………………

OBJECTIVE TESTING VS. SUBJECTIVE TESTING

The distinction here is between methods of scoring.

……………………………………………………………………….

VALIDITY

content validity – It is obvious that a grammar test must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. In order to judge whether or not a test has content validity, we need a specification of the skills or structures that it is meant to cover.

criterion-related validity – Another approach to validity is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate’s ability. This independent assessment is thus the criterion measure against which the test is validated.

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. Predictive validity is the degree to which a test can predict candidates’ future performance.
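Agreement between a test and its criterion is usually summarised as a correlation, often called a validity coefficient. The sketch below is only an illustration: the function name and the scores are made up, and it assumes a plain Pearson correlation is an acceptable measure of agreement.

```python
import math


def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores and criterion scores."""
    if len(test_scores) != len(criterion_scores):
        raise ValueError("Each candidate needs one test score and one criterion score.")
    n = len(test_scores)
    mean_test = sum(test_scores) / n
    mean_criterion = sum(criterion_scores) / n
    # Sum of cross-products and sums of squares around the means.
    cross = sum((t - mean_test) * (c - mean_criterion)
                for t, c in zip(test_scores, criterion_scores))
    ss_test = sum((t - mean_test) ** 2 for t in test_scores)
    ss_criterion = sum((c - mean_criterion) ** 2 for c in criterion_scores)
    return cross / math.sqrt(ss_test * ss_criterion)


# Hypothetical data: a short oral test versus a full interview (the criterion),
# both taken by the same six candidates at about the same time.
short_test = [12, 15, 9, 18, 14, 11]
full_interview = [55, 68, 40, 80, 62, 50]

coefficient = validity_coefficient(short_test, full_interview)
print(round(coefficient, 2))
```

For concurrent validity the criterion scores are collected at about the same time as the test; for predictive validity they would be collected later (for example, end-of-course results).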

construct validity – A test has construct validity when it can be demonstrated that it measures just the ability which it is supposed to measure.

face validity

……………………………………………………………..

RELIABILITY

Taking a test on a different day or a different time might yield different results.

the reliability coefficient (the closer to 1, the more reliable the test; objectively scored multiple choice tests typically reach a scorer reliability of 1)

the standard error of measurement and the true score
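The link between the reliability coefficient, the standard error of measurement and the true score can be sketched with the standard classical test theory formula, SEM = SD × √(1 − reliability). The formula is not quoted in these notes, and the function names and figures below are illustrative assumptions.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """SEM = SD * sqrt(1 - reliability), as in classical test theory."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("The reliability coefficient must lie between 0 and 1.")
    return score_sd * math.sqrt(1.0 - reliability)


def true_score_band(observed_score, sem, n_sems=1):
    """Range in which the candidate's true score is likely to lie."""
    margin = n_sems * sem
    return (observed_score - margin, observed_score + margin)


# Hypothetical test: scores have a standard deviation of 10 and the
# reliability coefficient is 0.75.
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.75)

# A candidate who scored 60: band of one SEM and of two SEMs either side.
print(sem)
print(true_score_band(60, sem))
print(true_score_band(60, sem, n_sems=2))
```

Assuming measurement errors are roughly normally distributed, the true score falls within one SEM of the observed score about 68% of the time and within two SEMs about 95% of the time. The lower the reliability, the wider the band, which matters most for candidates near a pass mark.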

How to make tests more reliable: If you are writing a test, see Chapter 5 for a good list of guidelines to make tests more reliable. A short list:

  • Is the task perfectly clear?
  • Is there more than one possible correct response?
  • Can candidates show the desired behavior without having the skill supposedly being tested?
  • Do candidates have enough time to perform the tasks?

RELIABILITY VS. VALIDITY

To be valid, a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test, we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test, but it is unlikely to be a valid test of writing.

If test specifications make clear what candidates have to be able to do, and with what degree of success, then students will have a clear picture of what they have to achieve.

Be clear what it is that you want to know and for what purpose.

……………………………………………………………………………..

During the 1970s, there was much excitement in the world of language testing about what was called the Unitary Competence Hypothesis. This was the suggestion that the nature of language ability was such that it was impossible to break it down into component parts. This hypothesis eventually proved false. A learner may be a good speaker but a poor writer. One way of measuring overall ability would be to measure a variety of separate abilities and then to combine scores.

………………………………………………………………………………

HOLISTIC VS. ANALYTIC SCORING

holistic scoring – Is often called ‘impressionistic scoring’.

analytic scoring – Requires a separate score for each of a number of aspects of a task. While it is doubtful that scorers can judge each of the aspects independently of the others (there is what is called a ‘halo effect’), the fact of having multiple ‘shots’ at assessing the student’s performance should lead to greater reliability.
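As a rough illustration of analytic scoring, the sketch below adds up separate aspect scores into one total, with optional weights. The aspect names, the scale and the weights are hypothetical and not taken from the book.

```python
def analytic_total(aspect_scores, weights=None):
    """Combine separate aspect scores into one total, optionally weighted."""
    if weights is None:
        # Default: every aspect counts equally.
        weights = {aspect: 1.0 for aspect in aspect_scores}
    missing = set(aspect_scores) - set(weights)
    if missing:
        raise ValueError(f"No weight given for: {sorted(missing)}")
    return sum(score * weights[aspect] for aspect, score in aspect_scores.items())


# Hypothetical script scored from 1 to 6 on five aspects.
script = {"grammar": 4, "vocabulary": 5, "mechanics": 3, "fluency": 4, "organisation": 5}

# Equal weighting of all aspects.
print(analytic_total(script))

# Hypothetical weighting in which grammar counts double.
double_grammar = {"grammar": 2.0, "vocabulary": 1.0, "mechanics": 1.0,
                  "fluency": 1.0, "organisation": 1.0}
print(analytic_total(script, double_grammar))
```

Each aspect score is one of the multiple 'shots' at the candidate's performance; keeping the scores separate until the final sum also makes it easy to report a profile of strengths and weaknesses.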

Once the test is completed, a search should be made to identify ‘benchmark’ scripts which typify key levels of ability on each writing task.

………………………………………………………………………………..

ORAL TESTS

Make an oral test as long as is feasible, and give the candidate as many ‘fresh starts’ as possible.

…………………………………………………………………………………

TESTING READING

Chapter 11 provides useful lists of the tasks that might be tested under macro-skills vs. micro-skills. (Chapter 12 provides similar lists for listening tests.)

An excess of micro-skill test items should not be allowed to obscure the fact that the micro-skills are being taught not as an end in themselves, but as a means of improving macro-skills.

Do not repeatedly select texts of a particular kind simply because they are readily available.

Choose texts of appropriate length. Scanning works best with texts at least 2,000 words long.

………………………………………………………………………………….

The best test may give unreliable and invalid results if it is not well administered.
