How to Teach for Exams, Sally Burgess and Katie Head, Longman, 2005


This title would not be on my recommended Delta reading list. It is useful, though, if you are new to teaching exam classes and want some ‘whole picture’ support, or if you have to work from a textbook that offers no class activities beyond completing all the tasks in the book from start to finish: How to Teach for Exams contains several skills development activities in each section that you can try with your students to make exam classes more lively.

The most important reminder is for teachers to be thoroughly familiar with the exam they are teaching – for example, by working through practice papers in real time – and to stay up to date with any changes to the exam.

Assessing Speaking, Sari Luoma, CUP, 2004



Summary notes

This is another text that I would class as an ‘if-you-have-time’ text or an MA text on discourse rather than an essential text for Delta purposes. Therefore, I’m not going to provide all of the notes I made on this text, but just cover a few points that could be useful. The book is part of a series with titles commencing with ‘Assessing …’. I have only read Assessing Speaking so far.

The book covers important research in the areas of speaking, and flags up the influence of people such as Bygate, Hasselgren, and Hymes through examples of how test design has been shaped by them.

A couple of interesting items that I learned included the following:

A long-standing concern about speaking tests conducted in pairs is that one speaker will heavily influence the other, but Luoma lists several researchers who have concluded that the influence on overall results is negligible.

Luoma also mentions Hasselgren (1998) and Towell (1996), in reporting that speakers’ use of ‘smallwords’ – that is, common set phrases that fill, bridge, and keep the conversation going – can improve a listener’s (and similarly test rater’s) perception of fluency and competency of the speaker.

The book serves as a useful overall guide to speaking test construction and its considerations. Rather than focusing on ‘teaching’ testing terms, the book shows how test concepts can be applied and critiqued in practice, and offers some examples of actual tests to illustrate these points. In terms of future direction (keeping in mind that this book is already nearly two decades old), Luoma points to rating checklists. The ‘new’ concept of sociocultural theory also has implications: if speaking is interactional, it does not lend itself well to the traditional one-on-one candidate–interlocutor/interviewer test mode.

A new term that I haven’t seen in the books read so far, but which I don’t think would be likely to appear in the Delta test, is setting cut scores (the process is also known as standard setting). This involves dividing raw scores into ranges and bands that determine pass, fail, etc.
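As an illustration (not from the book), applying cut scores is essentially a lookup from a raw score into a band. A minimal Python sketch, with hypothetical cut-score values and band names chosen purely for illustration:

```python
# Hypothetical cut scores: the lower bound of each band (values assumed for illustration).
CUT_SCORES = [(90, "distinction"), (75, "merit"), (50, "pass")]

def band_for(raw_score):
    """Map a raw score to a band; anything below the lowest cut score is a fail."""
    for cut, band in CUT_SCORES:
        if raw_score >= cut:
            return band
    return "fail"

print(band_for(82))  # merit
print(band_for(49))  # fail
```

The substance of standard setting lies in deciding where those boundaries should sit, not in applying them; the code only shows the mechanical step.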

Another item that I thought was a good reminder from a regular teaching point of view was to explain tails and topicalisation to students, as this is an area that I don’t think I have ever seen covered in class course books, but one that would be worth including. These refer to the very normal spoken feature of presenting key information about the topic at the beginning (topicalisation) or at the end (tails) of a sentence in a non-standard grammatical way. Luoma gives the example ‘Joe, his name is’, from Quirk and Greenbaum (1976).

Language Testing and Assessment – an advanced resource book, Glenn Fulcher and Fred Davidson, Routledge Applied Linguistics, 2007


I started reading this book, but then stopped after the first section.

At over 400 pages, and some ten years more current than the other books on testing that I’ve summarized so far, this title is a nice addition to the field of language testing study. It considers how testing terms and concepts are presented in prior literature and goes into much more academic depth in discussing their meaning and significance. It also reminded me that some terms haven’t been satisfactorily explained by the other books I’ve read so far – such as construct validity and concurrent validity – and that I need to seek more clarity on them. However, it quickly became clear that the content of this book is arguably better suited to a Master’s candidate than a Delta reader, and because time is limited, I am putting this book away for the time being. Hopefully I will have another opportunity to come back to it in the future, as I think it freshens the debate.

FREE PDF – An A to Z of Second Language Assessment: How Language Teachers Understand Assessment Concepts, Christine Coombe (Ed.), British Council, 2018


This is a FREE resource (a 46-page PDF file) offered by the British Council that contains a glossary of testing and assessment terms.

It can be downloaded from here:

There is also this associated British Council Language Assessment in the Classroom MOOC:

Testing Spoken Language, A Handbook of Oral Testing Techniques, Nic Underhill, Cambridge Handbooks for Teachers, 1987



Summary notes

The most important influence on the development of language testing has been the legacy of psychometrics, in particular intelligence testing. A lot of time was devoted to proving that there was a single measurable attribute called general intelligence. Psychometrics wanted to be a science, so the aspects of human behavior that could be predicted and measured were emphasized. The multiple-choice test offered the learner no opportunity to behave as an individual: individuality was described as ‘variance,’ and a lot of effort was put into reducing the amount of variance a test produces.

Expectations – Every culture values education highly, but does so in different ways.


  • interviewer – A person who both conducts the conversation with the learner and assesses him/her.
  • interlocutor – A person whose job is to help the learner to speak, but who is not required to assess him/her.
  • assessor – A person who assesses the learner’s performance.
  • marker or rater – A person who awards marks, usually according to a marking key or scale.
  • authentic
  • objective
  • stimulus
  • validity – Does the test measure what it’s supposed to?
  • reliability – Does the test give consistent results?
  • evaluate – Find out if the test is working.
  • moderate – To compare the way different assessors award marks and to reduce discrepancies.


Tests can be used to answer four basic kinds of question, around:

  • proficiency
  • placement
  • diagnosis
  • achievement

Who does a learner speak to in a test?

  • Learner <> Interviewer/Assessor
  • Learner <> Interlocutor
  • Learner <> Learner
  • Learner <> Group

Chapter 3 presents a nice list of task types that could be given in oral exams.

A marking key or marking protocol saves time and reduces uncertainty by specifying in advance, as far as possible, how markers should approach the marking of each question or task.

Performance criteria might include: Length of utterances; complexity; speed; flexibility; accuracy; appropriacy; independence; repetition; hesitation.

Mark categories could be given a weighting.
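As an illustration (not from Underhill), combining weighted mark categories into a single score is a simple weighted sum. A minimal Python sketch, where the category names and weightings are assumed values chosen purely for illustration:

```python
# Hypothetical weightings for mark categories (assumed values; they sum to 1).
WEIGHTS = {"accuracy": 0.4, "fluency": 0.3, "appropriacy": 0.3}

def weighted_score(category_marks):
    """Combine per-category marks (e.g. 0-10 each) into one score using the weightings."""
    return sum(WEIGHTS[category] * mark for category, mark in category_marks.items())

print(round(weighted_score({"accuracy": 6, "fluency": 8, "appropriacy": 7}), 2))  # 6.9
```

Adjusting the weightings is how a test designer signals which performance criteria matter most for a given task.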

Few learners are ‘typical.’ It may be helpful to look for a range, not a point on a scale.

Additive marking is where the assessor has prepared a list of features to listen out for during the test. She awards a mark for each of these features that the learner produces correctly, and adds these marks to give the score. This is also known as an incremental mark system; the learner starts with a score of zero and earns each mark, one by one.

Subtractive marking is where the assessor subtracts one mark from a total for each mistake the learner makes, down to a minimum of zero.
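The two marking systems above can be sketched in a few lines of Python. This is an illustration of the concepts rather than anything from the book; the checklist features are hypothetical examples:

```python
def additive_mark(features_heard, checklist):
    """Additive/incremental marking: start at zero and add one mark
    for each checklist feature the learner produces correctly."""
    return sum(1 for feature in checklist if feature in features_heard)

def subtractive_mark(total, mistakes):
    """Subtractive marking: subtract one mark per mistake from a total,
    down to a minimum of zero."""
    return max(total - mistakes, 0)

# Hypothetical features an assessor might listen out for.
checklist = {"past tense", "question form", "linking phrase"}
print(additive_mark({"past tense", "linking phrase"}, checklist))  # 2
print(subtractive_mark(10, 12))  # 0
```

Note the difference in what each system rewards: additive marking credits what the learner gets right, while subtractive marking penalises errors regardless of what else was achieved.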

Testing for Language Teachers, A. Hughes, Cambridge Handbooks for Language Teachers, 1989



Summary notes

backwash – The effect of testing on teaching and learning. Backwash might be either positive or negative. For example, a test that has a large part involving describing a chart/graph will probably result in an over-focus of teaching on describing charts/graphs.

Unreliability has two origins: features of the test itself (unclear instructions, ambiguous questions, items that require too much guesswork), and the way it is scored.

A test which is good for one purpose may be useless for another purpose.


proficiency tests – Designed to measure language ability regardless of course content or previous training. However, what counts as ‘proficient’?

achievement tests (final achievement tests, progress tests) – Achievement tests are directly related to language courses.

diagnostic tests – Used to identify students’ strengths and weaknesses.

placement tests – Used to assign students to a class level.



direct testing – Testing is said to be direct when it requires the candidate to perform precisely the skill which we wish to measure. If we want to know how well a student can write an essay, we get them to write an essay. Direct testing is often limited to a rather small sample of tasks.

indirect testing – Attempts to measure the abilities which underlie the skills in which we are interested. For example, Lado (1961) proposed a test of pronunciation ability through a pen and paper test of matching pairs of words that rhyme with each other.



discrete point testing – Refers to the testing of one element at a time, item by item (indirect testing).

integrative testing – Requires the candidate to combine many language elements in the completion of a task (direct testing).



See the Baxter post for more detail.



The distinction here is between methods of scoring.



content validity – It is obvious that a grammar test must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. In order to judge whether or not a test has content validity, we need a specification of the skills or structures that it is meant to cover.

criterion-related validity – Another approach to validity is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate’s ability. This independent assessment is thus the criterion measure against which the test is validated.

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. Predictive validity is the degree to which a test can predict candidates’ future performance.
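In practice, criterion-related validity is often quantified as the correlation between candidates’ scores on the test and their scores on the criterion measure. A minimal sketch using the Pearson coefficient, with hypothetical score data invented purely for illustration:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation: a common way to quantify agreement
    between test scores and criterion scores (1.0 = perfect agreement)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical candidates: new test scores vs. a trusted criterion measure.
test_scores      = [55, 62, 70, 48, 81]
criterion_scores = [58, 60, 75, 50, 79]
print(round(pearson(test_scores, criterion_scores), 2))  # 0.97
```

If the two measures are administered at about the same time, a high coefficient supports concurrent validity; if the criterion is a later outcome, it supports predictive validity.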

construct validity – A test has construct validity when it can be demonstrated that it measures just the ability which it is supposed to measure.

face validity – A test has face validity if it looks, on the surface, as if it measures what it is supposed to measure.



Taking a test on a different day or a different time might yield different results.

the reliability coefficient (should be as close to ‘1’ as possible – multiple-choice tests, being objectively scored, typically come closest to ‘1’)

the standard error of measurement and the true score

How to make tests more reliable: If you are writing a test, see Chapter 5 for a good list of guidelines to make tests more reliable. A short list:

  • Is the task perfectly clear?
  • Is there more than one possible correct response?
  • Can candidates show the desired behavior without having the skill supposedly being tested?
  • Do candidates have enough time to perform the tasks?


To be valid, a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test, we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test, but it is unlikely to be a valid test of writing.

If test specifications make clear what candidates have to be able to do, and with what degree of success, then students will have a clear picture of what they have to achieve.

Be clear what it is that you want to know and for what purpose.


During the 1970s, there was much excitement in the world of language testing about what was called the Unitary Competence Hypothesis. This was the suggestion that the nature of language ability was such that it was impossible to break it down into component parts. This hypothesis eventually proved false. A learner may be a good speaker but a poor writer. One way of measuring overall ability would be to measure a variety of separate abilities and then to combine scores.



holistic scoring – Often called ‘impressionistic scoring’: a single score is given on the basis of an overall impression of the performance.

analytic scoring – Requires a separate score for each of a number of aspects of a task. While it is doubtful that scorers can judge each of the aspects independently of the others (there is what is called a ‘halo effect’), the fact of having multiple ‘shots’ at assessing the student’s performance should lead to greater reliability.

Once the test is completed, a search should be made to identify ‘benchmark’ scripts which typify key levels of ability on each writing task.



Make an oral test as long as is feasible, and give the candidate as many ‘fresh starts’ as possible.



Chapter 11 provides nice lists of what tasks might be considered under testing macro vs micro skills. (Chapter 12 provides similar lists for listening tests.)

An excess of micro-skill test items should not be allowed to obscure the fact that the micro-skills are being taught not as an end in themselves, but as a means of improving macro-skills.

Do not repeatedly select texts of a particular kind simply because they are readily available.

Choose texts of appropriate length. Scanning works best with texts at least 2,000 words long.


The best test may give unreliable and invalid results if it is not well administered.

Evaluating Your Students, A. Baxter, Richmond, 1997



Summary notes:

The traditional way to assess students has been through using tests. Testing has largely been aligned with scientific study – to administer a (placement) test, to make some changes (i.e., to teach), and to re-test (an end-of-course test). With this, if something is too difficult to measure, it hasn’t traditionally been tested (although newer ideas allow for forms of assessment and evaluation in the form of things such as student portfolios).

A good test is: valid, reliable, practical, has no negative backwash.

  • VALID – There are three types of validity:

content validity – Does the test test what was covered in class?

construct validity – Does the test test what it’s supposed to test and nothing else?

face validity – Does the test look like it’s testing what it is supposed to test from an initial glance?

  • RELIABLE – There are two areas of reliability:

test reliability – If you gave the same test to the same person, would the result be the same?

scorer reliability – Would two people come up with the same mark for the same test?

  • PRACTICAL – How practical is the test to administer?



direct testing – We ask the student to perform what we want to test (preferable).

indirect testing – We test things that give us an indication of a student’s performance (less preferable).

norm-referenced testing – When the results of a test compare students (a popular notion among state and exam board tests).

criterion-referenced testing – When test results tell you about what an individual student can do.

summative testing – Done at the end of a semester/year.

formative testing – Ongoing assessment that allows change to take place before the course is over.

congruent testing – This looks at the whole process before it starts so that any issues get resolved before a course is underway.

profiles/profiling and analytic mark schemes – A profile is not so much a score, but a reference to a set of descriptors of a person’s ability. A student might fall into a certain band. Some students will not have a flat profile.



A cloze test is a test in which words are deleted not according to what we want to test (as in regular gap-fills), but at a regular interval: every seventh word, or near enough, is deleted. A variation on this is the C-test, in which the first letter (elsewhere, the literature says the first half) of each affected word is given.
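Because the deletion is mechanical rather than targeted, a cloze passage can be generated automatically. A minimal sketch of both variants (the deletion intervals are the conventional ones; the function names are my own):

```python
def make_cloze(text, n=7):
    """Fixed-ratio cloze: delete every nth word, replacing it with a blank."""
    words = text.split()
    return " ".join("____" if (i + 1) % n == 0 else w for i, w in enumerate(words))

def make_c_test(text, n=2):
    """C-test variant: instead of deleting whole words, remove the second half
    of every nth word (some versions leave only the first letter)."""
    words = text.split()
    out = []
    for i, w in enumerate(words):
        if (i + 1) % n == 0 and len(w) > 1:
            keep = (len(w) + 1) // 2          # keep the first half of the word
            out.append(w[:keep] + "_" * (len(w) - keep))
        else:
            out.append(w)
    return " ".join(out)

print(make_cloze("a b c d e f g h", 7))   # a b c d e f ____ h
print(make_c_test("hello world", 2))      # hello wor__
```

In a real cloze, the first sentence or two is usually left intact to give context before deletions begin; that refinement is omitted here for brevity.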

Giving assistance (i.e., providing clues/hints) is appropriate in three testing situations:

  • When we are testing the students’ ability to transform something.
  • When we want to force the student to use a desired item.
  • When we want to put the same idea in each student’s head.



It is obvious that students don’t always learn everything we teach. On the other hand, it must also be true that they learn things we don’t teach.

If a student has a question about a text, this might mean that he/she may be ready to learn it. Baxter calls this the saliency effect. However, what is suddenly salient for one individual will probably not be salient for the whole class. Also, what the individual is ready to learn will probably not fit in with the teacher’s plan. If the teacher is practising skimming, and a student asks what a particular word means, the teacher would probably tell them the word wasn’t important because they are practising skim-reading.

The Reading Lists


! Work in progress… ! The below list is separated into sub-lists by category. I’m working through one category at a time, starting with Testing & Assessment, followed by Phonology.

A hyperlink means there is a relevant post to this book on this site, as it’s a book that I have read.

R = Recommended minimum reading; an item NOT marked with ‘R’ may simply be one I haven’t read yet, rather than one that isn’t useful.


TESTING Alderson, J. (2000) Assessing Reading, CUP

TESTING Alderson, J.C., Clapham, C. and Wall, D. (1995) Language Test Construction and Evaluation, CUP

TESTING Bachman, L. and Palmer, A.S. (2010) Language Assessment in Practice, OUP

TESTING Bailey, K. (1988) Learning about Language Assessment, Newbury House

R TESTING Baxter, A. (1999) Evaluating Your Students, Richmond

TESTING Buck, G. (2001) Assessing Listening, CUP

TESTING Burgess, S. and Head, K. (2005) How to Teach for Exams, Longman

TESTING Carr, N.T. (2011) Designing and Analysing Language Tests, OUP

TESTING Carroll, B. and Hall, P. (1985) Make Your Own Language Tests, Pergamon

TESTING Cohen, A.D. (1994) Assessing Language Ability in the Classroom, Heinle

TESTING Coombe, C. et al. (2012) The Cambridge Guide to Second Language Assessment, CUP

TESTING Coombe, C. (2018) An A to Z of Second Language Assessment: How Language Teachers Understand Assessment Concepts, British Council

TESTING Brown, H.D. and Abeywickrama, P. (2010) Language Assessment: Principles and Classroom Practices, Longman

TESTING Douglas, D. (2000) Assessing Language for Specific Purposes, CUP

TESTING Fulcher, G. (2010) Practical Language Testing, Hodder Education

TESTING Fulcher, G. and Davidson, F. (2007) Language Testing and Assessment, Routledge

TESTING Genesee, F. and Upshur, J.A. (1996) Classroom Based Evaluation, OUP

TESTING Green, A. (2014) Exploring Language Assessment and Testing, Routledge

TESTING Harris, M. and McCann, P. (1994) Assessment, Macmillan

TESTING Harrison, A. (1983) A Language Testing Handbook, Macmillan

TESTING Heaton, J.B. (1988) Writing English Language Tests, Longman

TESTING Heaton, J.B. (1990) Language Testing, MEP

R TESTING Hughes, A. (2002) Testing for Language Teachers, CUP

TESTING Jang, E.E. (2014) Testing: Focus on Assessment, OUP

TESTING Luoma, S. (2004) Assessing Speaking, CUP

TESTING Madsen, H.S. (1983) Techniques in Testing, OUP

TESTING Martyniuk, W. (2012) Aligning Tests with the CEFR, CUP

TESTING McKay, P. (2006) Assessing Young Language Learners, CUP

TESTING McNamara, T. (2000) Language Testing, OUP

TESTING Purpura, J.E. (2004) Assessing Grammar, CUP

TESTING Rea-Dickins, P. and Germaine, K. (1992) Evaluation, OUP

TESTING Underhill, N. (1987) Testing Spoken Language, CUP

TESTING Weigle, S.C. (2002) Assessing Writing, CUP

TESTING Weir, C. (1988) Communicative Language Testing, Prentice Hall

TESTING Weir, C. (1993) Understanding and Developing Language Tests, Prentice Hall

TESTING Weir, C.J. (2005) Language Testing and Evaluation, Palgrave

TESTING Weir, C.J. and Roberts, J. (1994) Evaluation in ELT, Blackwell


PHONOLOGY Bowen, T. and Marks, J. (1992) The Pronunciation Book, Longman
PHONOLOGY Bradford, B. (1998) Intonation in Context, CUP
PHONOLOGY Brazil, D. (1997) The Communicative Value of Intonation in English, CUP
PHONOLOGY Brazil, D., Coulthard, C. and Johns, T. (1980) Discourse Intonation and Language Teaching, CUP
PHONOLOGY Celce-Murcia, M., Brinton, D.M. and Goodwin J.M. () Teaching Pronunciation,
PHONOLOGY Dalton and Seidlhofer (1994) Pronunciation, OUP
PHONOLOGY Fitzpatrick, F.A. (1995) Teacher’s Guide to Practical Pronunciation, Phoenix ELT
PHONOLOGY Gilbert, J.B. () Teaching Pronunciation, Using the Prosody Pyramid, CUP
PHONOLOGY Jenkins, J. (2000) The Phonology of English as an International Language
PHONOLOGY Kelly, G. (2000) How to Teach Pronunciation, Longman
PHONOLOGY Kenworthy, J. (1987) Teaching English Pronunciation, Longman
PHONOLOGY Kreidler, The Pronunciation of English: A Course Book, Blackwell
PHONOLOGY Marks, J. (2012) The Pronunciation Book, Peaslake Delta
PHONOLOGY McCarthy, P. () The Teaching of Pronunciation, CUP
PHONOLOGY Pennington, () English Phonology for Language Teachers, Longman
PHONOLOGY Roach, P. (2000) English Phonetics and Phonology, CUP
PHONOLOGY Underhill, A. (2005) Sound Foundations, Macmillan
PHONOLOGY Wells, J.C. (2006) English Intonation, CUP