Language Education and Assessment – new open-access journal

Castledown Publishing has released the first issue of its open-access journal Language Education and Assessment, which seemed timely, as I am currently in a phase of reading about testing for the Delta.

Language Education and Assessment can be accessed here: https://journals.castledown-publishers.com/index.php/lea/

Aims & Scope

Language Education and Assessment is a peer-reviewed international journal that provides full open access to its contents and aims to publish original manuscripts in the fields of second/foreign language (L2) education and language assessment. This journal purports to offer a forum for those involved in these fields to showcase their works addressing such topics as L2 teaching theories and methods, innovations in L2 teaching, culture in L2 teaching and learning, individual differences in L2 learning, validity issues in assessing language proficiency, standardized language proficiency tests, classroom-based language assessment, computerized and computer-adaptive language testing, alternative language assessment, alignment of language instruction and assessment, and other relevant areas of inquiry.

How to Teach for Exams, Sally Burgess and Katie Head, Longman, 2005


This title would not be on my recommended Delta reading list. It is useful if you are new to teaching exam classes and want some ‘whole picture’ support, and/or if you have to work from a coursebook that offers no ideas for class activities beyond completing all the tasks in the book from start to finish: How to Teach for Exams contains several skills development activities in each section that you can try with your students to make teaching exam classes more lively.

The most important reminder is for teachers to be thoroughly familiar with the exam they are teaching, for example by working through practice papers under timed conditions, and to stay up to date with any changes to the exam.

Assessing Speaking, Sari Luoma, CUP, 2004

Image source: Pixabay

Summary notes

This is another text that I would class as an ‘if-you-have-time’ text or an MA text on discourse rather than an essential text for Delta purposes. Therefore, I’m not going to provide all of the notes I made on this text, but just cover a few points that could be useful. The book is part of a series with titles commencing with ‘Assessing …’. I have only read Assessing Speaking so far.

The book covers important research in the area of speaking and flags up the influence of researchers such as Bygate, Hasselgren, and Hymes through examples of how their work has shaped test design.

A couple of interesting items that I learned included the following:

A common concern about speaking tests conducted in pairs is that one candidate will heavily influence the other’s performance, but Luoma lists several researchers who have concluded that the effect on overall results is negligible.

Luoma also mentions Hasselgren (1998) and Towell (1996) in reporting that speakers’ use of ‘smallwords’ – that is, common set phrases that fill, bridge, and keep the conversation going – can improve a listener’s (and likewise a rater’s) perception of the speaker’s fluency and competence.

The book serves as a useful overall guide to speaking test construction and its considerations. Rather than focusing on ‘teaching’ testing terms, the book shows how test concepts can be applied and critiqued in practice, and offers some examples of actual tests to illustrate these points. In terms of future directions (keeping in mind this book is already nearly two decades old), Luoma points to rating checklists. The then ‘new’ concept of sociocultural theory also frames speaking as interactional, which does not lend itself well to the traditional one-on-one candidate-interlocutor/interviewer mode of speaking test.

A new term that I haven’t seen in the books read so far, but which I don’t think would be likely to appear in the Delta exam, is setting cut scores. Also known as standard setting, this involves dividing raw scores into ranges or bands that determine pass, fail, and so on.
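As a rough sketch of the idea (the score boundaries and band labels below are invented for illustration, not taken from Luoma), applying cut scores to raw scores might look like this:

```python
# Sketch of applying cut scores: each band has a minimum raw score.
# The boundaries and labels are invented for illustration only.
CUT_SCORES = [          # (minimum raw score, band)
    (80, "Pass with merit"),
    (60, "Pass"),
    (0, "Fail"),
]

def band_for(raw_score: int) -> str:
    """Return the first band whose cut score the raw score meets or exceeds."""
    for minimum, band in CUT_SCORES:
        if raw_score >= minimum:
            return band
    return CUT_SCORES[-1][1]

print(band_for(73))  # Pass
```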

Another item that I thought was a good reminder from a regular teaching point of view was to explain tails and topicalisation to students, as this is an area that I don’t think I have ever seen covered in class coursebooks, but which would be worth including. It refers to the very normal spoken feature of presenting key information about the topic at the beginning or end of a sentence in a non-standard grammatical way. Luoma gives the example ‘Joe, his name is’, from Quirk and Greenbaum (1976).

Language Testing and Assessment – an advanced resource book, Glenn Fulcher and Fred Davidson, Routledge Applied Linguistics, 2007


I started reading this book, but then stopped after the first section.

At over 400 pages, and some ten years more current than the other books on testing that I’ve summarized so far, this title is a nice addition to the field of language testing study. It considers how testing terms and concepts are presented in prior literature and goes into much more academic depth in discussing their meaning and significance. It also reminded me that there are some terms that haven’t been satisfactorily explained by the other books I’ve read so far, such as construct validity and concurrent validity, and which I need to seek more clarity on. However, it quickly became clear that the content of this book is arguably more suited to a Master’s candidate than a Delta reader, and because time is limited, I am putting this book away for the time being. Hopefully I will have another opportunity to come back to it in the future, as I think it freshens the debate.

FREE PDF – An A to Z of Second Language Assessment: How Language Teachers Understand Assessment Concepts, Christine Coombe (Ed.), British Council, 2018


This is a FREE resource (a 46-page PDF file) offered by the British Council that contains a glossary of testing and assessment terms.

It can be downloaded from here: www.britishcouncil.org/exam/aptis/research/assessment-literacy

There is also an associated British Council ‘Language Assessment in the Classroom’ MOOC: https://www.futurelearn.com/courses/language-assessment

Testing Spoken Language, A Handbook of Oral Testing Techniques, Nic Underhill, Cambridge Handbooks for Teachers, 1987

Image source: Pixabay

Summary notes

The most important influence on the development of language testing has been the legacy of psychometrics, in particular intelligence testing. A lot of time was devoted to proving that there was a single measurable attribute called general intelligence. Psychometrics wanted to be a science, so the aspects of human behavior that could be predicted and measured were emphasized. The multiple-choice test offered the learner no opportunity to behave as an individual: individualism was described as ‘variance’, and a lot of effort was put into reducing the amount of variance a test produces.

Expectations – Every culture values education highly, but does so in different ways.

TEST TERMS:

  • interviewer
  • interlocutor – A person whose job is to help the learner to speak, but who is not required to assess him/her.
  • assessor
  • marker or rater
  • authentic
  • objective
  • stimulus
  • validity – Does the test measure what it’s supposed to?
  • reliability – Does the test give consistent results?
  • evaluate – Find out if the test is working.
  • moderate – To compare the way different assessors award marks and to reduce discrepancies.

……………………………………………………………………………………………………………………

Tests can be used to ask four basic kinds of question, concerning:

  • proficiency
  • placement
  • diagnosis
  • achievement

Who does a learner speak to in a test?

  • Learner <> Interviewer/Assessor
  • Learner <> Interlocutor
  • Learner <> Learner
  • Learner <> Group

Chapter 3 presents a nice list of task types that could be given in oral exams.

A marking key or marking protocol saves time and reduces uncertainty by specifying in advance, as far as possible, how markers should approach the marking of each question or task.

Performance criteria might include: Length of utterances; complexity; speed; flexibility; accuracy; appropriacy; independence; repetition; hesitation.

Mark categories could be given a weighting.

Few learners are ‘typical.’ It may be helpful to look for a range, not a point on a scale.

Additive marking is where the assessor has prepared a list of features to listen out for during the test. She awards a mark for each of these features that the learner produces correctly, and adds these marks to give the score. This is also known as an incremental mark system; the learner starts with a score of zero and earns each mark, one by one.

Subtractive marking is where the assessor subtracts one mark from a total for each mistake the learner makes, down to a minimum of zero.
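As a toy illustration of the two approaches (the target features, totals and error counts below are invented, not taken from Underhill):

```python
# Toy illustration of additive vs. subtractive marking.
# Features, totals and error counts are invented for illustration.

def additive_score(features_produced, target_features):
    """Start from zero and earn one mark per target feature produced correctly."""
    return sum(1 for feature in features_produced if feature in target_features)

def subtractive_score(total_marks, error_count):
    """Start from a total and lose one mark per error, down to a minimum of zero."""
    return max(total_marks - error_count, 0)

targets = {"past tense", "question forms", "linking words"}
print(additive_score({"past tense", "question forms"}, targets))  # 2
print(subtractive_score(10, 3))                                   # 7
```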

Testing for Language Teachers, A. Hughes, Cambridge Handbooks for Language Teachers, 1989

Image source: Pixabay

Summary notes

backwash – The effect of testing on teaching and learning. Backwash might be either positive or negative. For example, a test that has a large part involving describing a chart/graph will probably result in an over-focus of teaching on describing charts/graphs.

Unreliability has two origins: features of the test itself (unclear instructions, ambiguous questions, items that require too much guesswork), and the way it is scored.

A test which is good for one purpose may be useless for another purpose.

TYPES OF TESTS:

proficiency tests – Designed to measure language ability regardless of course content or previous training. However, what counts as ‘proficient’?

achievement tests (final achievement tests, progress tests) – Achievement tests are directly related to language courses.

diagnostic tests – Used to identify students’ strengths and weaknesses.

placement tests – Used to assign students to a class level.

…………………………………………………………………….

DIRECT TESTING VS. INDIRECT TESTING

direct testing – Testing is said to be direct when it requires the candidate to perform precisely the skill which we wish to measure. If we want to know how well a student can write an essay, we get them to write an essay. Direct testing is often limited to a rather small sample of tasks.

indirect testing – Attempts to measure the abilities which underlie the skills in which we are interested. For example, Lado (1961) proposed a test of pronunciation ability through a pen and paper test of matching pairs of words that rhyme with each other.

……………………………………………………………………

DISCRETE POINT VS. INTEGRATIVE TESTING

discrete point testing – Refers to the testing of one element at a time, item by item (indirect testing).

integrative testing – Requires the candidate to combine many language elements in the completion of a task (direct testing).

………………………………………………………………………

NORM-REFERENCED VS. CRITERION-REFERENCED TESTING

See the Baxter post for more detail.

………………………………………………………………………

OBJECTIVE TESTING VS. SUBJECTIVE TESTING

The distinction here is between methods of scoring.

……………………………………………………………………….

VALIDITY

content validity – It is obvious that a grammar test must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. In order to judge whether or not a test has content validity, we need a specification of the skills or structures that it is meant to cover.

criterion-related validity – Another approach to validity is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate’s ability. This independent assessment is thus the criterion measure against which the test is validated.

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. Predictive validity is the degree to which a test can predict candidates’ future performance.
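As a rough sketch of how concurrent validity is usually investigated (the scores below are invented): scores on the test are correlated with scores on the criterion measure, and a high coefficient is taken as evidence that the test agrees with that independent assessment.

```python
# Invented scores on a new test and on an independent, trusted
# criterion measure administered at about the same time.
from statistics import correlation  # Pearson correlation, Python 3.10+

test_scores      = [52, 61, 70, 45, 88, 67]
criterion_scores = [55, 60, 74, 40, 90, 65]

# A coefficient close to 1 would be taken as evidence of concurrent validity.
print(round(correlation(test_scores, criterion_scores), 2))
```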

construct validity – A test has construct validity when it can be demonstrated that it measures just the ability which it is supposed to measure.

face validity – A test has face validity if, on the face of it, it looks as if it measures what it is supposed to measure.

……………………………………………………………..

RELIABILITY

Taking a test on a different day or at a different time might yield different results.

the reliability coefficient (the closer it is to 1, the more reliable the test; objectively scored formats such as multiple choice tend to produce the highest coefficients)

the standard error of measurement and the true score
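A quick worked example of the standard relationship between the two (the figures are invented): the standard error of measurement can be estimated from the spread of scores and the reliability coefficient, and a candidate’s true score is likely to lie within about one standard error of the score they actually obtained.

```python
# Standard relationship: SEM = standard deviation * sqrt(1 - reliability).
# The figures below are invented for illustration.
import math

standard_deviation = 5.0   # spread of scores on the test
reliability = 0.9          # reliability coefficient

sem = standard_deviation * math.sqrt(1 - reliability)
print(round(sem, 2))  # ~1.58: the true score probably lies within about
                      # 1.58 marks of the obtained score (roughly 68% of the time)
```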

How to make tests more reliable: If you are writing a test, see Chapter 5 for a good list of guidelines to make tests more reliable. A short list:

  • Is the task perfectly clear?
  • Is there more than one possible correct response?
  • Can candidates show the desired behavior without having the skill supposedly being tested?
  • Do candidates have enough time to perform the tasks?

RELIABILITY VS. VALIDITY

To be valid, a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test we might require candidates to write down the translation equivalents of 500 words in their own language. This could well be a reliable test, but it is unlikely to be a valid test of writing.

If test specifications make clear what candidates have to be able to do, and with what degree of success, then students will have a clear picture of what they have to achieve.

Be clear what it is that you want to know and for what purpose.

……………………………………………………………………………..

During the 1970s, there was much excitement in the world of language testing about what was called the Unitary Competence Hypothesis. This was the suggestion that the nature of language ability was such that it was impossible to break it down into component parts. This hypothesis eventually proved false. A learner may be a good speaker but a poor writer. One way of measuring overall ability would be to measure a variety of separate abilities and then to combine scores.

………………………………………………………………………………

HOLISTIC VS. ANALYTIC SCORING

holistic scoring – Is often called ‘impressionistic scoring’.

analytic scoring – Requires a separate score for each of a number of aspects of a task. While it is doubtful that scorers can judge each of the aspects independently of the others (there is what is called a ‘halo effect’), the fact of having multiple ‘shots’ at assessing the student’s performance should lead to greater reliability.
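One way the separate category scores are often combined is by giving each category a weighting. A minimal sketch (the categories, scale and weightings below are invented, not taken from Hughes):

```python
# Minimal sketch of weighted analytic scoring; categories and weightings
# are invented for illustration. Scores are on a 0-5 scale.
scores  = {"grammar": 4, "vocabulary": 3, "organisation": 5, "task fulfilment": 4}
weights = {"grammar": 0.3, "vocabulary": 0.2, "organisation": 0.2, "task fulfilment": 0.3}

overall = sum(scores[category] * weights[category] for category in scores)
print(round(overall, 2))  # 4.0, still on the 0-5 scale because the weights sum to 1
```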

Once the test is completed, a search should be made to identify ‘benchmark’ scripts which typify key levels of ability on each writing task.

………………………………………………………………………………..

ORAL TESTS

Make an oral test as long as is feasible, and give the candidate as many ‘fresh starts’ as possible.

…………………………………………………………………………………

TESTING READING

Chapter 11 provides nice lists of what tasks might be considered under testing macro vs micro skills. (Chapter 12 provides similar lists for listening tests.)

An excess of micro-skill test items should not be allowed to obscure the fact that the micro-skills are being taught not as an end in themselves, but as a means of improving macro-skills.

Do not repeatedly select texts of a particular kind simply because they are readily available.

Choose texts of appropriate length. Scanning works best with texts at least 2,000 words long.

………………………………………………………………………………….

The best test may give unreliable and invalid results if it is not well administered.