As an organisation we are committed to producing research which helps us improve the service that we provide and which is of use to the wider community of education policymakers and practitioners. In this section you will find a variety of recent research and analysis which has been developed by researchers at WJEC.
Re-designing the role of examiner judgement in maintaining standards for UK general qualification examinations
Joanna Maziarz, Alayla Castle-Herbert, Siân Denner, Liz Phillips, Richard Harry
In order to ensure a fair assessment process and to maintain consistent standards year on year, UK awarding organisations use a combination of senior examiners’ expert judgement and statistical analysis to set grade boundaries and to 'award' high-stakes examinations. Influenced in part by changing regulatory requirements, UK awarding bodies have placed increasing emphasis on the use of statistical information to arrive at decisions relating to grade boundaries (Baird and Morrissey, 2005). This has largely been a result of extensive research showing examiner judgement to be unreliable (cf. Baird and Dhillon, 2005). Nevertheless, the involvement of experts in standard setting remains crucial in preserving public trust in the UK examination system (Jones, 2009).
At WJEC, one of the UK’s main examination boards, we are undertaking research to design alternative, more reliable methods of capturing expert judgement, and to clarify the role that such evidence should play in the process of maintaining educational standards in the UK’s high-stakes examinations. To do this, we draw on psychological perspectives on judgement and decision-making, which point to a wide range of heuristics and biases limiting experts’ ability to make valid judgements on standards (cf. Hardman 2009, Kirkebøen 2009, Kerr and Tindale 2004). We are employing a ‘design thinking’ approach (cf. Brown, 2009), using these insights to develop prototype methods for testing and review.
To date, our study has shown that only examiners who have experience of marking a particular paper are able to provide confident and reliable judgements. This suggests that general subject knowledge might not be sufficient. In addition, examinee-centred methods were shown to be more accessible for experts, as these allowed them to draw on their experience more easily. It was difficult, however, to break experts’ habit of applying their expertise in a way that deviated from how they use it in their daily practice. This underlines the need for high-quality training of examiners on standard setting methods with which they are not familiar. Furthermore, removing marks from scripts reviewed at award may reduce the identified biases, but it increases the cognitive load associated with the task – in the test phase, experts often fell back on re-marking scripts as a starting point for assessing their worthiness for a given grade. Comparative judgement of scripts may overcome this issue; however, experts also found it difficult to compare scripts holistically when candidates were stronger on different aspects of the assessment.
The next stage of research aims to draw on the aforementioned insights and prototype other standard setting approaches.
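Comparative judgement data of the kind described above – repeated pairwise decisions that one script is better than another – is commonly scaled with a Bradley–Terry model. The sketch below is purely illustrative (it is not WJEC's actual tooling, and the function name and update scheme are our own choices); it fits script "strengths" from hypothetical pairwise decisions using the standard minorisation–maximisation update.

```python
# Illustrative Bradley-Terry fit for comparative judgement data.
# Hypothetical sketch: not WJEC's production method or code.

def bradley_terry(pairs, n_scripts, iters=200):
    """pairs: list of (winner, loser) index tuples from paired
    comparisons of scripts. Returns one strength per script; a higher
    strength means judges preferred that script more often, after
    accounting for which scripts it was compared against."""
    wins = [0] * n_scripts
    met = [[0] * n_scripts for _ in range(n_scripts)]  # comparison counts
    for w, l in pairs:
        wins[w] += 1
        met[w][l] += 1
        met[l][w] += 1
    p = [1.0] * n_scripts
    for _ in range(iters):
        new = []
        for i in range(n_scripts):
            # MM update: strength_i = wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(met[i][j] / (p[i] + p[j])
                        for j in range(n_scripts) if met[i][j])
            new.append(wins[i] / denom if denom else p[i])
        s = sum(new) / n_scripts
        p = [x / s for x in new]  # normalise to fix the overall scale
    return p
```

The fitted strengths give a single holistic scale, which is one way of sidestepping the re-marking habit noted above – though, as the study found, holistic comparison itself becomes difficult when candidates excel on different aspects of the assessment.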
Changing ability and comparable outcomes – UK examinations of French, Spanish and German
Alayla Castle-Herbert, T. Alun Evans, Joanna Maziarz & Paul Morgan
In England, Northern Ireland and Wales a system of "Comparable Outcomes" is used to set predictions for how many candidates should receive different grades in their main exams at 17 (AS level) and 18 (A level). Candidates are split into deciles depending on the mean of their exam results at 16 (GCSEs), and predictions for outcomes are set so that each decile is expected to perform to the same historic standard. In French, Spanish and German the numbers entering AS and A level exams have dropped significantly, and this may be affecting comparability.
In addition to looking at the mean GCSE of candidates taking foreign language AS and A levels, we also looked at how they did in the GCSE that corresponded to the qualification they were taking. Analysis was undertaken to see how this relationship changed between 2010 and 2016.
We found that candidates in lower deciles were more likely to have received top grades (A* or A) in 2016 than they were in 2010. This calls into question the assumption that candidates in these deciles were comparable between the two periods.
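The decile mechanism described above can be sketched in a few lines. This is an illustrative simplification using hypothetical data, not the awarding bodies' actual prediction procedure (which involves more detail than shown here): candidates are binned into deciles by mean GCSE, the reference year's top-grade rate is computed per decile, and those rates are applied to the new cohort's decile profile.

```python
# Illustrative sketch of decile-based comparable-outcomes predictions.
# Hypothetical data and simplified logic; not the regulators' method.

def deciles(scores):
    """Assign each candidate a decile (1 = lowest) by mean GCSE score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(scores)
    decile = [0] * n
    for rank, i in enumerate(order):
        decile[i] = min(10, rank * 10 // n + 1)
    return decile

def predicted_top_grades(ref_deciles, ref_top, new_deciles):
    """Expected number of top grades in the new cohort, assuming each
    decile performs to the reference year's standard.
    ref_top: 1 if that reference candidate got a top grade, else 0."""
    rate = {}
    for d in range(1, 11):
        idx = [i for i, dd in enumerate(ref_deciles) if dd == d]
        rate[d] = sum(ref_top[i] for i in idx) / len(idx) if idx else 0.0
    return sum(rate.get(d, 0.0) for d in new_deciles)
```

The finding above – lower deciles earning more top grades in 2016 than in 2010 – would show up here as the per-decile rates drifting between reference years, which is exactly what the comparable-outcomes assumption rules out.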
Do translated items perform the same way? The experience of assessment in a bilingual country
Wales is a bilingual country, with candidates sitting exams through the medium of either English or Welsh. For most subjects an original paper is professionally translated so that there are two papers and candidates can pick which one to sit. Considerable effort is expended to ensure that the translation produces questions of identical difficulty in both languages. However, until very recently no work had been done to use the data produced during marking to empirically measure differential item functioning.
These slides, originally presented at AEA Europe in 2017, use differential item functioning analysis for polytomous items to compare the performance of candidates on different items who took the papers through different languages. The analysis is performed across a number of different subjects.
Some of the analysis is based on summer 2017 examinations. However, pilot work shows that variation between languages, whilst statistically significant, has tended to be well under 5% of the marks available for an item, implying that these effect sizes are not educationally significant. The results sit in the context of an evolving wider DIF strategy whereby flagged items are submitted for qualitative review.
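One simple family of DIF statistics for polytomous items conditions on candidates' total scores and compares mean item scores between the two language groups within each total-score stratum. The sketch below is a generic standardised mean-difference check of that kind, with hypothetical data; the slides' actual analysis may use a different polytomous DIF statistic.

```python
# Illustrative conditional mean-difference DIF check for a polytomous
# item. Hypothetical sketch, not the analysis used in the slides.

def dif_smd(item, total, group):
    """Standardised mean difference: the average (focal - reference)
    item score within each total-score stratum, weighted by the focal
    group's stratum sizes. group entries: 'F' (focal, e.g.
    Welsh-medium) or 'R' (reference, e.g. English-medium).
    A value near 0 suggests the item behaves similarly for candidates
    of comparable overall attainment in either language."""
    strata = {}
    for score, t, g in zip(item, total, group):
        strata.setdefault(t, {'F': [], 'R': []})[g].append(score)
    num = den = 0.0
    for cell in strata.values():
        if cell['F'] and cell['R']:  # stratum must contain both groups
            nf = len(cell['F'])
            diff = (sum(cell['F']) / nf
                    - sum(cell['R']) / len(cell['R']))
            num += nf * diff
            den += nf
    return num / den if den else 0.0
```

Dividing the statistic by the item's maximum mark gives the kind of proportion-of-available-marks figure quoted above, and items exceeding a chosen threshold would be the ones flagged for qualitative review.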