Language Testing and Assessment – an advanced resource book, Glenn Fulcher and Fred Davidson, Routledge Applied Linguistics, 2007


I started reading this book, but then stopped after the first section.

At over 400 pages, and some ten years more current than the other books on testing that I’ve summarized so far, this title is a nice addition to the field of language testing study. It considers how testing terms and concepts are presented in prior literature and goes into much more academic depth in discussing their meaning and significance. It also reminded me that some terms, such as construct validity and concurrent validity, haven’t been satisfactorily explained by the other books I’ve read so far, and that I need to seek more clarity on them. However, it quickly became clear that the content of this book is arguably more suited to a Master’s candidate than a Delta candidate, and because time is limited, I am putting this book away for the time being. Hopefully I will have another opportunity to come back to it in the future, as I think it freshens the debate.


FREE PDF – An A to Z of Second Language Assessment: How Language Teachers Understand Assessment Concepts, Christine Coombe (Ed.), British Council, 2018


This is a FREE resource (a 46-page PDF file) offered by the British Council that contains a glossary of testing and assessment terms.

It can be downloaded from here:

There is also this associated British Council MOOC, Language Assessment in the Classroom:

Testing Spoken Language, A Handbook of Oral Testing Techniques, Nic Underhill, Cambridge Handbooks for Teachers, 1987



Summary notes

The most important influence on the development of language testing has been the legacy of psychometrics, in particular intelligence testing. A lot of time was devoted to proving there was a single measurable attribute called general intelligence. Psychometrics wanted to be a science. The aspects of human behavior that could be predicted and measured were emphasized. The multiple-choice test offered the learner no opportunity to behave as an individual. Individualism was described as ‘variance,’ and a lot of effort was put into reducing the amount of variance a test produces.

Expectations – Every culture values education highly, but does so in different ways.


  • interviewer
  • interlocutor – A person whose job is to help the learner to speak, but who is not required to assess him/her.
  • assessor
  • marker or rater
  • authentic
  • objective
  • stimulus
  • validity – Does the test measure what it’s supposed to?
  • reliability – Does the test give consistent results?
  • evaluate – Find out if the test is working.
  • moderate – To compare the way different assessors award marks and to reduce discrepancies.


Tests can be used to ask four basic kinds of question, about:

  • proficiency
  • placement
  • diagnosis
  • achievement

Who does a learner speak to in a test?

  • Learner <> Interviewer/Assessor
  • Learner <> Interlocutor
  • Learner <> Learner
  • Learner <> Group

Chapter 3 presents a nice list of task types that could be given in oral exams.

A marking key or marking protocol saves time and uncertainty by specifying in advance, as far as possible, how markers should approach the marking of each question or task.

Performance criteria might include: Length of utterances; complexity; speed; flexibility; accuracy; appropriacy; independence; repetition; hesitation.

Mark categories could be given a weighting.
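As a rough illustration of what weighting could look like in practice, here is a minimal sketch. The category names, the 0–5 band for each category, and the weights are all invented for the example and are not taken from Underhill.

```python
def weighted_score(marks, weights):
    """Combine per-category marks into a weighted average on the same scale."""
    total_weight = sum(weights[category] for category in marks)
    weighted_sum = sum(marks[category] * weights[category] for category in marks)
    return weighted_sum / total_weight

# Hypothetical weights: accuracy and fluency count double.
weights = {"accuracy": 2, "fluency": 2, "appropriacy": 1}
# Hypothetical marks for one learner, each on a 0-5 band.
marks = {"accuracy": 3, "fluency": 4, "appropriacy": 5}

print(weighted_score(marks, weights))  # (3*2 + 4*2 + 5*1) / 5 = 3.8
```

Dividing by the total weight keeps the result on the original 0–5 band, which makes it easier to compare against band descriptors.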

Few learners are ‘typical.’ It may be helpful to look for a range, not a point on a scale.

Additive marking is where the assessor has prepared a list of features to listen out for during the test. She awards a mark for each of these features that the learner produces correctly, and adds these marks to give the score. This is also known as an incremental mark system; the learner starts with a score of zero and earns each mark, one by one.

Subtractive marking is where the assessor subtracts one mark from a total for each mistake the learner makes, down to a minimum of zero.
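The two marking systems described above can be sketched as follows. The checklist items and the starting total of 10 are assumptions made up for the example; a real checklist would come from the test's own marking key.

```python
def additive_score(features_produced, checklist):
    """Start from zero; earn one mark per checklist feature produced correctly."""
    return sum(1 for feature in checklist if feature in features_produced)

def subtractive_score(total, mistakes):
    """Start from the total; lose one mark per mistake, never going below zero."""
    return max(0, total - mistakes)

# Hypothetical checklist the assessor listens out for.
checklist = ["greeting", "past tense", "question form", "polite request"]
# Features this learner actually produced correctly.
produced = {"greeting", "question form"}

print(additive_score(produced, checklist))  # 2
print(subtractive_score(10, 3))             # 7
print(subtractive_score(10, 14))            # 0 (floor at zero)
```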

Testing for Language Teachers, A. Hughes, Cambridge Handbooks for Language Teachers, 1989



Summary notes

backwash – The effect of testing on teaching and learning. Backwash might be either positive or negative. For example, a test that has a large part involving describing a chart/graph will probably result in an over-focus of teaching on describing charts/graphs.

Unreliability has two origins: features of the test itself (unclear instructions, ambiguous questions, items that require too much guesswork), and the way it is scored.

A test which is good for one purpose may be useless for another purpose.


proficiency tests – Designed to measure language ability regardless of course content or previous training. However, what counts as ‘proficient’?

achievement tests (final achievement tests, progress tests) – Achievement tests are directly related to language courses.

diagnostic tests – Used to identify students’ strengths and weaknesses.

placement tests – Used to assign students to a class level.



direct testing – Testing is said to be direct when it requires the candidate to perform precisely the skill which we wish to measure. If we want to know how well a student can write an essay, we get them to write an essay. Direct testing is often limited to a rather small sample of tasks.

indirect testing – Attempts to measure the abilities which underlie the skills in which we are interested. For example, Lado (1961) proposed a test of pronunciation ability through a pen and paper test of matching pairs of words that rhyme with each other.



discrete point testing – Refers to the testing of one element at a time, item by item (indirect testing).

integrative testing – Requires the candidate to combine many language elements in the completion of a task (direct testing).



See the Baxter post for more detail.



The distinction here is between methods of scoring.



content validity – It is obvious that a grammar test must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. In order to judge whether or not a test has content validity, we need a specification of the skills or structures that it is meant to cover.

criterion-related validity – Another approach to validity is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate’s ability. This independent assessment is thus the criterion measure against which the test is validated.

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. Predictive validity means the degree to which a test can predict candidates’ future performance.

construct validity – A test has construct validity when it can be demonstrated that it measures just the ability which it is supposed to measure.

face validity



Taking a test on a different day or a different time might yield different results.

the reliability coefficient (the closer to ‘1’, the more reliable the test – multiple-choice tests typically come close to ‘1’)

the standard error of measurement and the true score
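To make these two ideas more concrete, here is a minimal sketch. It assumes the reliability coefficient is estimated as the Pearson correlation between two sittings of the same test by the same candidates (a test-retest estimate, which is only one of several possible methods), and it uses the standard classical formula SEM = SD × √(1 − reliability). The scores are invented.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def standard_error_of_measurement(scores, reliability):
    """SEM = standard deviation of the scores * sqrt(1 - reliability)."""
    # max() guards against tiny negative values from floating-point rounding.
    return statistics.pstdev(scores) * math.sqrt(max(0.0, 1 - reliability))

# Invented scores for six candidates on two sittings of the same test.
first_sitting = [52, 61, 70, 45, 66, 58]
second_sitting = [55, 59, 72, 48, 63, 60]

r = pearson_r(first_sitting, second_sitting)
sem = standard_error_of_measurement(first_sitting, r)
print(f"reliability ~ {r:.2f}, SEM ~ {sem:.2f} marks")
```

Under the usual assumptions, a candidate's true score lies within one SEM of the observed score roughly 68% of the time, so a smaller SEM means more confidence in decisions made near a pass mark.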

How to make tests more reliable: If you are writing a test, see Chapter 5 for a good list of guidelines to make tests more reliable. A short list:

  • Is the task perfectly clear?
  • Is there more than one possible correct response?
  • Can candidates show the desired behavior without having the skill supposedly being tested?
  • Do candidates have enough time to perform the tasks?


To be valid, a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test, we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test, but it is unlikely to be a valid test of writing.

If test specifications make clear what candidates have to be able to do, and with what degree of success, then students will have a clear picture of what they have to achieve.

Be clear what it is that you want to know and for what purpose.


During the 1970s, there was much excitement in the world of language testing about what was called the Unitary Competence Hypothesis. This was the suggestion that the nature of language ability was such that it was impossible to break it down into component parts. This hypothesis eventually proved false. A learner may be a good speaker but a poor writer. One way of measuring overall ability would be to measure a variety of separate abilities and then to combine scores.



holistic scoring – Is often called ‘impressionistic scoring’.

analytic scoring – Requires a separate score for each of a number of aspects of a task. While it is doubtful that scorers can judge each of the aspects independently of the others (there is what is called a ‘halo effect’), the fact of having multiple ‘shots’ at assessing the student’s performance should lead to greater reliability.

Once the test is completed, a search should be made to identify ‘benchmark’ scripts which typify key levels of ability on each writing task.



Make an oral test as long as is feasible, and give the candidate as many ‘fresh starts’ as possible.



Chapter 11 provides nice lists of what tasks might be considered under testing macro vs micro skills. (Chapter 12 provides similar lists for listening tests.)

An excess of micro-skill test items should not be allowed to obscure the fact that the micro-skills are being taught not as an end in themselves, but as a means of improving macro-skills.

Do not repeatedly select texts of a particular kind simply because they are readily available.

Choose texts of appropriate length. Scanning works best with texts at least 2,000 words long.


The best test may give unreliable and invalid results if it is not well administered.

Evaluating Your Students, A. Baxter, Richmond, 1997



Summary notes:

The traditional way to assess students has been through using tests. Testing has largely been aligned with scientific study – to administer a (placement) test, to make some changes (i.e., to teach), and to re-test (an end-of-course test). With this, if something is too difficult to measure, it hasn’t traditionally been tested (although newer ideas allow for forms of assessment and evaluation in the form of things such as student portfolios).

A good test is valid, reliable and practical, and has no negative backwash.

  • VALID – There are three types of validity:

content validity – Does the test test what was covered in class?

construct validity – Does the test test what it’s supposed to test and nothing else?

face validity – Does the test look like it’s testing what it is supposed to test from an initial glance?

  • RELIABLE – There are two areas of reliability:

test reliability – If you gave the same test to the same person, would the result be the same?

scorer reliability – Would two people come up with the same mark for the same test?

  • PRACTICAL – How practical is the test to administer?



direct testing – We ask the student to perform what we want to test (preferable).

indirect testing – We test things that give us an indication of a student’s performance (less preferable).

norm-referenced testing – When the results of a test compare students (a popular notion among state and exam board tests).

criteria-referenced testing – When test results tell you about what an individual student can do.

summative testing – Done at the end of a semester/year.

formative testing – Ongoing assessment that allows change to take place before the course is over.

congruent testing – This looks at the whole process before it starts so that any issues get resolved before a course is underway.

profiles/profiling and analytic mark schemes – A profile is not so much a score, but a reference to a set of descriptors of a person’s ability. A student might fall into a certain band. Some students will not have a flat profile.



A cloze test is a test in which words are deleted not according to what we want to test (as in regular gap fills), but at regular intervals. Thus, every seventh word, or near enough, will be deleted. A variation on this is the C-test, in which the first letter of each deleted word is given (elsewhere, the literature says the first half).
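The regular-interval deletion behind a cloze test can be sketched as below. This is only an illustration: the function name `make_cloze`, the blank marker, and the simple punctuation handling are my own assumptions, and a real test writer would still check each gap by hand.

```python
import string

def make_cloze(text, n=7):
    """Blank out every n-th word and return the gapped text plus the answers."""
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):
        # Keep surrounding punctuation (e.g. a full stop) outside the gap.
        core = words[i].strip(string.punctuation)
        if not core:
            continue
        answers.append(core)
        words[i] = words[i].replace(core, "_____", 1)
    return " ".join(words), answers

sample = ("one two three four five six seven eight nine ten "
          "eleven twelve thirteen fourteen.")
gapped, answers = make_cloze(sample, n=7)
print(gapped)   # one two three four five six _____ eight ... thirteen _____.
print(answers)  # ['seven', 'fourteen']
```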

When to give assistance (i.e., provide clues/hints) depends on three testing problems:

  • When we are testing the students’ ability to transform something.
  • When we want to force the student to use a desired item.
  • When we want to put the same idea in each student’s head.



It is obvious that students don’t always learn everything we teach. On the other hand, it must also be true that they learn things we don’t teach.

If a student has a question about a text, this might mean that he/she may be ready to learn it. Baxter calls this the saliency effect. However, what is suddenly salient for one individual will probably not be salient for the whole class. Also, what the individual is ready to learn will probably not fit in with the teacher’s plan. If the teacher is practising skimming, and a student asks what a particular word means, the teacher would probably tell them the word wasn’t important because they are practising skim-reading.

The Reading Lists


! Work in progress… ! The below list is separated into sub-lists by category. I’m working through one category at a time, starting with Testing & Assessment, followed by Phonology.

A hyperlink means there is a relevant post to this book on this site, as it’s a book that I have read.

R = Recommended minimum reading; an item NOT marked with ‘R’ might purely be because I haven’t read that title rather than it not being useful.


TESTING Alderson, J. (2000) Assessing Reading, CUP

TESTING Alderson, J.C., Clapham, C. and Wall, D. (1995) Language Test Construction and Evaluation, CUP

TESTING Bachman, L. and Palmer, A.S. (2010) Language Assessment in Practice, OUP

TESTING Bailey, K. (1988) Learning about Language Assessment, Newbury House

R TESTING Baxter, A. (1999) Evaluating Your Students, Richmond

TESTING Buck, G. (2001) Assessing Listening, CUP

TESTING Burgess, S. and Head, K. (2005) How to Teach for Exams, Longman

TESTING Carr, N.T. (2011) Designing and Analysing Language Tests, OUP

TESTING Carroll, B. and Hall, P. (1985) Make Your Own Language Tests, Pergamon

TESTING Cohen, A.D. (1994) Assessing Language Ability in the Classroom, Heinle

TESTING Coombe, C. et al. (2012) The Cambridge Guide to SLA Assessment, CUP

TESTING Coombe, C. (2018) An A to Z of Second Language Assessment: How Language Teachers Understand Assessment Concepts, British Council

TESTING Douglas Brown, H. and Abeywickrama, P. (2010) Language Assessment: Principles and Classroom Practices, Longman

TESTING Douglas, D. (2000) Assessing Language for Specific Purposes, CUP

TESTING Fulcher, G. (2010) Practical Language Testing, Hodder Education

TESTING Fulcher, G. and Davidson, F. (2007) Language Testing and Assessment, Routledge

TESTING Genesee, F. and Upshur, J.A. (1996) Classroom Based Evaluation, OUP

TESTING Green, A. (2014) Exploring Language Assessment and Testing, Routledge

TESTING Harris, M. and McCann, P. (1994) Assessment, Macmillan

TESTING Harrison, A. (1983) A Language Testing Handbook, Macmillan

TESTING Heaton, J.B. (1988) Writing English Language Tests, Longman

TESTING Heaton, J.B. (1990) Language Testing, MEP

R TESTING Hughes, A. (2002) Testing for Language Teachers, CUP

TESTING Jang, E.E. (2014) Testing: Focus on Assessment, OUP

TESTING Luoma, S. (2004) Assessing Speaking, CUP

TESTING Madsen, H.S. (1983) Techniques in Testing, OUP

TESTING Martyniuk, W. (2012) Aligning Tests with the CEFR, CUP

TESTING McKay, P. (2006) Assessing Young Language Learners, CUP

TESTING McNamara, T. (2000) Language Testing, OUP

TESTING Purpura, J.E. (2004) Assessing Grammar, CUP

TESTING Rea-Dickins, P. and Germaine, K. (1992) Evaluation, OUP

TESTING Underhill, N. (1987) Testing Spoken Language, CUP

TESTING Weigle, S.C. (2002) Assessing Writing, CUP

TESTING Weir, C. (1988) Communicative Language Testing, Prentice Hall

TESTING Weir, C. (1993) Understanding and Developing Language Tests, Prentice Hall

TESTING Weir, C.J. (2005) Language Testing and Evaluation, Palgrave

TESTING Weir, C.J. and Roberts, J. (1994) Evaluation in ELT, Blackwell


PHONOLOGY Bowen, T. and Marks, J. (1992) The Pronunciation Book, Longman
PHONOLOGY Bradford, B. (1998) Intonation in Context, CUP
PHONOLOGY Brazil, D. (1997) The Communicative Value of Intonation in English, CUP
PHONOLOGY Brazil, D., Coulthard, C. and Johns, T. (1980) Discourse Intonation and Language Teaching, CUP
PHONOLOGY Celce-Murcia, M., Brinton, D.M. and Goodwin J.M. () Teaching Pronunciation,
PHONOLOGY Dalton and Seidlhofer (1994) Pronunciation, OUP
PHONOLOGY Fitzpatrick, F.A. (1995) Teacher’s Guide to Practical Pronunciation, Phoenix ELT
PHONOLOGY Gilbert, J.B. () Teaching Pronunciation, Using the Prosody Pyramid, CUP
PHONOLOGY Jenkins, J. (2000) The Phonology of English as an International Language
PHONOLOGY Kelly, G. (2000) How to Teach Pronunciation, Longman
PHONOLOGY Kenworthy, J. (1987) Teaching English Pronunciation, Longman
PHONOLOGY Kreidler, The Pronunciation of English: A Course Book, Blackwell
PHONOLOGY Marks, J. (2012) The Pronunciation Book, Peaslake Delta
PHONOLOGY McCarthy, P. () The Teaching of Pronunciation, CUP
PHONOLOGY Pennington, () English Phonology for Language Teachers, Longman
PHONOLOGY Roach, P. (2000) English Phonetics and Phonology, CUP
PHONOLOGY Underhill, A. (2005) Sound Foundations, Macmillan
PHONOLOGY Wells, J.C. (2006) English Intonation, CUP

Approaches and Methods in English Language Education



Well… It might take me a few more years to get through my entire reading list in preparation for a Cambridge DELTA course, but I’ve at least made a start by going over the books dealing with the history of language teaching approaches, and below is the resulting chronological summary. If this summary is of use to anyone else preparing for the DELTA, please feel free to make use of it.

What becomes noticeable about the history of English language education is the influence that was exerted by French and German scholars around the turn of the 20th century.

Useful references: Stern (1983) Fundamental Concepts of Language Teaching; Richards & Rodgers (1986), Approaches and Methods in Language Teaching; Kelly (1969) 25 Centuries of Language Teaching

Stern: We should distinguish between the history of ideas on language teaching and the development of practice.


Middle Ages – Learning from books and focus on literary study emerged; at this time, England was trilingual: French (royal court, nobility, legal system), English (lower classes), Latin (scholars)

17th, 18th, 19th C – The ability to read and translate classical texts gave rise to the ‘grammar-translation’ method (first known in US as ‘Prussian Method’ because of German roots – e.g., Ploetz and Seidenstuecker); reading & writing paramount; sporadic anti GTs, e.g., Comenius – practice is all-important; learning grammar rules is not important

19th C – Series Method (variation of Direct Method) – Gouin: lang should be used to talk about experience rather than memorizing random word lists

1878 – First Berlitz school opened in Providence, Rhode Island [links: Direct Method – speaking & listening important]

1883 – Foundation of the Modern Language Association of America

1886 – Foundation of the International Phonetic Association; IPA felt that studying other languages should begin with the drilling of sounds (re. creation of International Phonetic Alphabet), followed by the study of everyday spoken language not formal literature

1892 – Foundation of the Modern Language Association (UK)

Early 20th C – Grammar-translation persisted; the ‘nature vs nurture’ study of children emerged and the Direct Method (or ‘Natural Method’ or ‘Reform Method’) arose in opposition to GT (inductive approach to grammar; Q&A based on a text; realia; role play; importance of pron – the first lang should be abandoned as the frame of reference) [links: Situational Language Teaching and TPR and Berlitz Method]; DM criticized for lacking a rigorous theoretical basis, and conversation practice considered impractical in schools with large classes

1904 – Jespersen, How to Teach a Foreign Language

1906 – Saussure delivers the first tertiary linguistics course in Geneva; ‘langue’ vs. ‘parole’

1917 – Palmer, The Scientific Study and Teaching of Languages; Palmer = ‘father of British applied ling’, started as a Berlitz teacher

1920s & 1930s – Piaget’s theory of cognitive development – learning by interaction and scaffolding with an adult

1920s & 1930s – Situational Language Teaching, led by Palmer and Hornby; attempted to develop a more scientific foundation for the Direct Method; vocab and reading important; grammar was classified into sentence patterns (= substitution tables) and taught inductively; SLT coexists with the Oral Approach (teaching begins with the spoken language; new language points are introduced and practiced situationally; chorus repetition; dictation; drills; pair practice; visual aids)

1923-1927 – Ogden & Richards, Basic English – an attempt to simplify/rationalize language learning problems

1929 – Coleman The Teaching of Modern Foreign Languages in the United States. (= The Coleman Report); the recommendation that the primary objective of language teaching should be reading fluency = Reading Method

1933 – Bloomfield, Language – American structuralism: linguistics should be an empirical, descriptive science [links: Audiolingual, Silent Way, TPR embody structuralist view]

1939 – University of Michigan started the first English language institute in the US

WWII – had a huge impact on modern language study with large-scale migration and deployment of military; language study should now be delivered to the masses not just scholarly elite (through e.g., ‘language labs’ and Audiolingual Method – pattern drills; dialogs; intensive study; speech at core, but mastery of grammatical structures important; reflects behaviorism and habit learning)

1950s – Neo-Firthian Theory – UK-based – language must be understood in context of culture and meaning (re. anthropology)

1951 – Centre d’Etude du Francais Elementaire (CREDIF) established to counteract the falling global status of French

1953 – West, A General Service List of English Words and Hornby, Gatenby, Wakefield, The Advanced Learner’s Dictionary of Current English – standard references for developing learning materials

1954 – Hornby, Guide to Patterns & Usage in English – another standard reference for developing learning materials

1957 – Skinner, Verbal Behavior – anti-systems; accounts for language learning through observable events; language learning is developed just as other human behavior is developed – i.e., through habits – stimulus > response > reinforcement

1957 – Chomsky, Syntactic Structures – anti-behaviorist: need to understand the system of rules ‘in’ a learner; Chomsky argued that behaviorist theory was inadequate for explaining creative language use; an innate language acquisition device (LAD)

1958 – Experiment in a British grammar school with an audiovisual language course (Ingram and Mace 1959)

1959 – Penfield & Roberts, Speech and Brain Mechanisms – explained critical period hypothesis (if language acquisition does not occur by puberty, full mastery of a language is not possible)

1960s – Focus on the learner as an individual; re. sociolinguists and the ‘speech community’

1960s – Pimsleur Method – dialog-based translation; instruction starts in learner’s L1; Pimsleur Lang Aptitude Battery (PLAB)

1963 – Anthony, defined approach (assumptions about lang), method (theory put into practice), and technique (implementation)

1965 onward – French immersion in Canadian schools started

1965 – Transformational generative grammar – led by Chomsky who (based on Humboldt) asked what linguistic knowledge must be presupposed in a native speaker to be able to produce and interpret sentences; TGG is a rule-governed system

1966 – TESOL Association (USA) founded

1968 – Bilingual Education Act (USA)

1970s – Reactions against the ‘method concept’: The Silent Way (Gattegno), Community Language Learning (Curran), Suggestopedia (Lozanov); a shift from methods to objectives, with attention on syllabus design (re. Allen, Candlin, Corder, Widdowson, Wilkins)

1970s – Communicative Language Teaching arose in UK – essentially integration of grammatical and functional teaching with learner-centered focus; language is a system for the expression of meaning

1970s – Asher’s TPR – adult language acquisition should parallel child first lang acquisition; develop comprehension skills first (aka Comprehension Approach); nonabstractions – e.g., imperatives – can be taught first by actions; stimulus-response view; draws in psychology findings around right-brain learning of motor movement

1970s – Gattegno’s Silent Way – Cuisenaire rods; learner should discover or create and solve problems rather than repeat – discovery learning similar to child development; early learning focuses on pron using Fidel charts; teacher should be silent

1970s – Curran’s Community Language Learning – learner-centered; learners ‘overhear’ others; the learner tells the knower what they want to say, and the knower tells them how to say it; focus on oral proficiency; learners become members of a community; echoes child learning

1972 – ‘Communicative competence’ was first used by Hymes (1972) (in contrast to Chomsky’s ‘linguistic competence’) to reflect the social view of language

1975 – Publication of the Threshold Level syllabuses (van Ek 1975) – forerunner to CEFR

1976 – Wilkins Notional Syllabuses – notional-functional approaches to language learning; in the functional view, language is a vehicle for the expression of functional meaning; a notional syllabus would not only include grammar, but also specify the topics, notions, and concepts the learner needs to communicate about; criticized for idea of merely replacing one item (grammar) with another (a notion/function) and is too simplistic [links: Communicative Language Teaching]

1978 – Widdowson – teaching a second language as communication rather than a system

1980s – Krashen’s ideas (e.g., Monitor theory, input hypothesis, natural order hypothesis, affective filter hypothesis)

1980s – Terrell and Krashen’s Natural Approach – communication is primary focus; unlike Direct Method/Natural Method, places less emphasis on teacher monologues and accuracy; emphasis is on input/exposure (i + 1 input hypothesis) before re-producing the language; more about acquisition rather than structure – acquisition is the ‘natural’ way, whereas ‘learning’ is a conscious act

1980s – Lozanov’s Suggestopedia (similar to ‘Superlearning’) – utilizes Baroque music, classroom décor to maximize memory capacity; learners detach themselves from the past, like infant learners; teacher reads texts with varied delivery to help aid retention; 30 days of study, 12 learners, sitting in a circle

1980s – Prabhu’s Task-based learning (task-based instruction) – notion of ‘authentic’ lang to do meaningful tasks; assessment is based on outcome rather than accuracy; learners don’t necessarily need ‘language’ problems to learn a language

1990s – Ray’s Teaching Proficiency through Reading and Storytelling – a variation of TPR