Reliability and validity are the two most important properties of a test. They form part of the Cambridge English VRIPQ approach as described in the Principles of Good Practice booklet. It is a general principle that in any testing situation one needs to maximise validity and reliability to produce the most useful results for test users, within existing practical constraints.
Cambridge English takes the view that reliability is an integral component of validity; there can be no validity without reliability. Hence any approach to estimating reliability must reflect potential sources of evidence for the construct validity of the tests.
Reliability (normally expressed as a figure between 0 and 1) indicates the replicability of test scores: when a test is given twice or more to the same group of people, when two tests constructed in the same manner are given to the same group of people, or when the same performance is marked independently by two different examiners, the expectation is that candidates will receive nearly the same results on each occasion. If candidates' results are consistent across occasions, the test is said to be reliable; the degree of score consistency is therefore a measure of the reliability of the test.
There are various ways to estimate the reliability of an exam. Most Cambridge English exams have two main types of component: objective papers and performance papers. Objective papers are the ones that do not require human judgement for their scoring, i.e. tests of reading comprehension, listening comprehension and use of English. The scores achieved in these sub-components are simply calculated by adding up the total number of correct responses to each section. The reliability estimates for these papers are calculated using a statistic called Cronbach’s Alpha. The closer the Alpha is to 1, the more reliable the test is.
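To illustrate, Cronbach's Alpha can be computed directly from a matrix of item responses. The sketch below is a minimal implementation of the standard formula with hypothetical data; it is not Cambridge English's production code:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's Alpha for a (candidates x items) matrix of item scores.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of candidates' totals
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 candidates x 4 dichotomous items (1 = correct)
data = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
print(cronbach_alpha(data))  # 0.8
```

With these (made-up) responses the items hang together well, so Alpha is high; less consistent response patterns would push the figure towards 0.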
Writing performance is usually marked by one human rater; however, a selection of responses is also marked by a second or third rater. We use this sample of multiply marked responses to estimate reliability for Writing by calculating a statistic called Gwet's AC2, an estimate of inter-rater reliability.
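For readers who want to see how such an inter-rater statistic behaves, the sketch below implements one published formulation of Gwet's AC2 for two raters on an ordinal scale with linear weights. The band labels and ratings are hypothetical, and the operational Cambridge English computation may differ in detail:

```python
import numpy as np

def gwet_ac2(ratings1, ratings2, categories):
    """Gwet's AC2 for two raters on an ordinal scale, using linear weights."""
    q = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    # Linear agreement weights: 1 on the diagonal, decreasing with band distance
    i, j = np.indices((q, q))
    w = 1 - np.abs(i - j) / (q - 1)
    # Observed joint proportions of rating pairs
    p = np.zeros((q, q))
    for a, b in zip(ratings1, ratings2):
        p[idx[a], idx[b]] += 1
    p /= p.sum()
    pa = (w * p).sum()                           # weighted observed agreement
    pi = (p.sum(axis=1) + p.sum(axis=0)) / 2     # average marginal proportions
    pe = w.sum() / (q * (q - 1)) * (pi * (1 - pi)).sum()  # chance agreement
    return (pa - pe) / (1 - pe)

bands = [0, 1, 2, 3, 4, 5]                       # hypothetical writing bands
rater1 = [3, 4, 2, 5, 3, 4, 1, 3, 4, 2]
rater2 = [3, 4, 3, 5, 3, 4, 2, 3, 5, 2]
print(gwet_ac2(rater1, rater2, bands))
```

Perfect agreement between the two raters yields 1; the closer the figure is to 1, the more consistently the two raters mark the same performances.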
For speaking, the Feldt Reliability Test is applied. This can be used when the score of a test is the sum of scores given by two raters or judges. We use it to assess reliability for speaking, as almost all Cambridge English speaking tests use a paired format in which two Oral Examiners assess the performance of the candidates.
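One published formulation of the Feldt coefficient treats the two raters' marks as two congeneric parts of a single composite score. The sketch below uses Feldt's (1975) two-part formula under that assumption, with hypothetical rating data; the operational procedure may differ:

```python
import numpy as np

def feldt_reliability(part1, part2):
    """Feldt's (1975) reliability for a composite X = part1 + part2,
    allowing the two parts (here: two raters' marks) to differ in variance."""
    x1 = np.asarray(part1, dtype=float)
    x2 = np.asarray(part2, dtype=float)
    cov = np.cov(x1, x2, ddof=1)[0, 1]           # covariance of the two raters
    v1, v2 = x1.var(ddof=1), x2.var(ddof=1)      # each rater's score variance
    vx = (x1 + x2).var(ddof=1)                   # variance of the summed score
    return 4 * cov / (vx - (v1 - v2) ** 2 / vx)

# Hypothetical marks from two Oral Examiners for five candidates
examiner1 = [1, 2, 3, 4, 5]
examiner2 = [2, 2, 3, 5, 5]
print(feldt_reliability(examiner1, examiner2))
```

When both raters give identical marks the coefficient is exactly 1; disagreement between them pulls it down towards 0.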
What is common to all these methods is that they are expressed on a scale from 0 to 1, much like the Alpha used for objective papers.
Scores from the sub-components of a qualification are reported on the Cambridge English Scale (CES). These sub-component CES scores are used to calculate a candidate’s overall score, also reported on the CES, and it is this which determines a candidate’s grade, and CEFR level where relevant. While it is worth having a measure of the reliability of each sub-component, what matters most to candidates and test users is the reliability of the overall score. We use the standard error of measurement (SEM) from the sub-components, as well as the standard deviation of the overall CES scores to calculate the reliability of the overall score.
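The combination step can be sketched as follows, under two simplifying assumptions that are not confirmed by the text: that the overall CES score is an unweighted mean of the four component CES scores, and that measurement errors are independent across components. The standard deviation used below is a hypothetical value, since overall-score SDs are not reported here:

```python
import math

def overall_sem(component_sems):
    """SEM of the mean of k component scores, assuming independent errors."""
    k = len(component_sems)
    return math.sqrt(sum(s ** 2 for s in component_sems)) / k

def overall_reliability(sem, sd):
    """Reliability recovered from SEM and the SD of overall scores:
    SEM = SD * sqrt(1 - r)  =>  r = 1 - (SEM / SD) ** 2"""
    return 1 - (sem / sd) ** 2

# Component SEMs for A2 Key, taken from Table 1 below
sems = [5.88, 2.36, 6.14, 2.50]
sem = overall_sem(sems)
print(round(sem, 2))                        # 2.29
print(overall_reliability(sem, sd=11.5))    # sd here is a hypothetical value
```

Under the unweighted-mean assumption this reproduces the overall SEM of 2.29 reported for A2 Key in Table 1.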
SEM is not a separate approach to estimating reliability, but rather a different way of reporting it. Language testing is subject to the influence of many factors that are not relevant to the ability being measured. Such irrelevant factors contribute to what is called ‘measurement error’. The SEM is a transformation of reliability in terms of test scores. While reliability refers to a group of test takers, the SEM shows the impact of reliability on the likely score of an individual: it indicates how close a test taker’s score is likely to be to their ‘true score’, to within some stated probability. For example, where a candidate receives an overall CES score of 186 with an SEM of 2.5, there is a high probability that their true score is between 181 and 191. This is a very useful piece of information that test users can use in their decision making.
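The relationship between reliability, SEM and a candidate's likely score band fits in a few lines. The 186 ± 2.5 example above works out as follows, using the conventional two-SEM band for roughly 95% confidence:

```python
import math

def sem_from_reliability(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def true_score_band(observed, sem, z=2.0):
    """Band likely to contain the true score (z = 2 gives roughly 95%)."""
    return observed - z * sem, observed + z * sem

print(true_score_band(186, 2.5))  # (181.0, 191.0)
```

Note how the two functions are inverses of a sort: higher reliability shrinks the SEM, which in turn narrows the band around the observed score.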
Tables 1–10 below report typical reliability and SEM figures for Cambridge English exams for 2022.
Components: The reliability figures for objective and performance papers are based on raw scores. SEM figures are based on CES scores for Cambridge English Qualifications and raw scores for TKT and YLE.
We can see from the tables below that reliability is typically above 0.8 across all components. SEM is around 5 to 7 CES score points for objective components and around 2–3 for performance components, as well as for TKT and YLE.
Overall score: The overall reliability for these exams is typically above 0.90 and the SEM is around 2.3. These figures demonstrate a high degree of trustworthiness in the overall CES scores reported.
Table 1: Cambridge English: A2 Key (KET)
| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.85 | 5.88 |
| Writing | 0.94 | 2.36 |
| Listening | 0.86 | 6.14 |
| Speaking | 0.95 | 2.50 |
| Total score | 0.96 | 2.29 |
Table 2: Cambridge English: A2 Key for Schools (KET for Schools)
| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.84 | 5.88 |
| Writing | 0.91 | 2.72 |
| Listening | 0.84 | 6.17 |
| Speaking | 0.93 | 2.67 |
| Total score | 0.95 | 2.33 |
Table 3: Cambridge English: B1 Preliminary (PET)
| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.90 | 5.53 |
| Writing | 0.96 | 1.41 |
| Listening | 0.83 | 6.49 |
| Speaking | 0.96 | 2.07 |
| Total score | 0.96 | 2.22 |
Table 4: Cambridge English: B1 Preliminary for Schools (PET for Schools)
| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.89 | 5.62 |
| Writing | 0.94 | 1.74 |
| Listening | 0.83 | 6.36 |
| Speaking | 0.93 | 2.58 |
| Total score | 0.95 | 2.26 |
Table 5: Cambridge English: B2 First (FCE)
| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.83 | 6.25 |
| Writing | 0.93 | 1.89 |
| Use of English | 0.83 | 6.97 |
| Listening | 0.81 | 6.28 |
| Speaking | 0.94 | 2.02 |
| Total score | 0.95 | 2.32 |
Table 6: Cambridge English: B2 First for Schools (FCE for Schools)
| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.79 | 6.59 |
| Writing | 0.95 | 1.45 |
| Use of English | 0.80 | 6.94 |
| Listening | 0.84 | 5.96 |
| Speaking | 0.95 | 1.96 |
| Total score | 0.94 | 2.31 |
Table 7: Cambridge English: C1 Advanced (CAE)
| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.82 | 6.70 |
| Writing | 0.93 | 2.49 |
| Use of English | 0.74 | 7.80 |
| Listening | 0.75 | 6.65 |
| Speaking | 0.96 | 1.90 |
| Total score | 0.93 | 2.53 |
Table 8: Cambridge English: C2 Proficiency (CPE)
| Component | Reliability | SEM |
|---|---|---|
| Reading | 0.71 | 9.22 |
| Writing | 0.91 | 2.52 |
| Use of English | 0.65 | 9.75 |
| Listening | 0.74 | 8.23 |
| Speaking | 0.96 | 2.14 |
| Total score | 0.89 | 3.22 |
Table 9: Cambridge English: Young Learners
| Paper | Reliability | SEM |
|---|---|---|
| Pre A1 Starters Listening | 0.84 | 1.52 |
| Pre A1 Starters Reading and Writing | 0.87 | 1.60 |
| A1 Movers Listening | 0.84 | 1.87 |
| A1 Movers Reading and Writing | 0.88 | 2.11 |
| A2 Flyers Listening | 0.88 | 1.76 |
| A2 Flyers Reading and Writing | 0.93 | 2.47 |
Table 10: TKT (Teaching Knowledge Test)
| Module | Reliability | SEM |
|---|---|---|
| Module 1 | 0.89 | 3.70 |
| Module 2 | 0.93 | 3.41 |
| Module 3 | 0.91 | 3.19 |
| TKT: CLIL | 0.89 | 3.44 |
| TKT: YL | 0.93 | 3.09 |