Abstract

This study examines the scoring reliability of comparative judgement under different sample sizes and standard error termination rules. For this purpose, a Monte Carlo simulation study with 9 conditions and 82 iterations was conducted, crossing sample sizes of 250, 500 and 1000 with standard error termination rules of 0.40, 0.35 and 0.30. In addition, an application for assessing writing skills was conducted with a sample of 50 students, using a standard error termination rule of 0.40 and a maximum of 40 comparisons. In the simulation study, scoring reliability was evaluated through true reliability, rank-order accuracy and scale separation reliability. In the application, we examined the correlations between the ability estimates obtained with adaptive comparative judgement and the scores obtained with a holistic rubric and with an analytic rubric, and we calculated the scale separation reliability of the adaptive comparative judgement estimates. The simulation results showed a high level of reliability in all conditions, independent of sample size. Stricter standard error termination rules led to higher reliability, but required subjecting performances to more pairwise comparisons. The application yielded a scale separation reliability of 0.89 and correlations above 0.70 with the scores obtained using both the holistic and the analytic rubric. Overall, the results suggest that adaptive comparative judgement can be used in both classroom and large-scale assessment. It is also considered advantageous because it is easy to administer, requires no change to the testing process, and places abilities on a continuous scale.
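To make the design concrete, the following minimal sketch in R (the software used in the study) simulates one cell of the design. It is an illustration under stated assumptions, not the authors' code: true abilities are drawn from a standard normal distribution, performances are paired at random each round (the study itself used an adaptive selection algorithm), the Bradley-Terry model is fitted with the MM algorithm, standard errors are approximated from the diagonal of the information matrix, and scale separation reliability is computed with one common formula. All numerical settings are illustrative.

## Minimal sketch of one simulation cell (illustrative assumptions throughout)
set.seed(1)
n          <- 250    # performances; the study also used 500 and 1000
se_rule    <- 0.40   # SE termination rule; the study also used 0.35 and 0.30
max_rounds <- 200    # safety cap (the empirical application capped comparisons at 40)

theta  <- rnorm(n)          # true abilities on the logit scale (assumed N(0, 1))
n_ij   <- matrix(0, n, n)   # number of times i and j were compared
w      <- numeric(n)        # total wins per performance
pi_hat <- rep(1, n)         # Bradley-Terry strength parameters
rounds <- 0

repeat {
  rounds <- rounds + 1
  # One round of judgements: random pairing. An adaptive algorithm would
  # instead pair performances whose current ability estimates are closest.
  idx <- sample(n)
  for (k in seq(1, n - 1, by = 2)) {
    i <- idx[k]; j <- idx[k + 1]
    n_ij[i, j] <- n_ij[i, j] + 1
    n_ij[j, i] <- n_ij[j, i] + 1
    # simulate the judge's decision under the Bradley-Terry model
    if (runif(1) < plogis(theta[i] - theta[j])) w[i] <- w[i] + 1 else w[j] <- w[j] + 1
  }
  # Bradley-Terry MLE via the MM (Zermelo) algorithm, warm-started each round
  for (step in 1:50) {
    denom  <- rowSums(n_ij / outer(pi_hat, pi_hat, "+"))
    pi_hat <- pmax(w, 0.5) / denom              # 0.5 guards all-loss performances
    pi_hat <- pi_hat / exp(mean(log(pi_hat)))   # identify the scale (mean logit = 0)
  }
  theta_hat <- log(pi_hat)
  # crude SEs from the diagonal of the information matrix
  p  <- plogis(outer(theta_hat, theta_hat, "-"))
  se <- 1 / sqrt(rowSums(n_ij * p * (1 - p)))
  if (max(se) < se_rule || rounds >= max_rounds) break
}

cor(theta, theta_hat)                            # true reliability
cor(theta, theta_hat, method = "spearman")       # rank-order accuracy
(var(theta_hat) - mean(se^2)) / var(theta_hat)   # scale separation reliability

Tightening se_rule from 0.40 to 0.30 increases the number of rounds, and hence comparisons per performance, needed before the loop exits, which is the reliability-versus-effort trade-off described above; replacing the random pairing with a rule that pairs performances with the closest current estimates yields an adaptive variant with the same termination logic.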

Keywords: Comparative judgement, Holistic assessment, Scale separation reliability, Pairwise comparison

How to cite

Gürel, S., Şahin, M., Uysal, İ., İbileme, A., & Gündüz, T. (2025). Adaptive Selection Algorithm and Standard Error Termination Rule in Comparative Judgement: An Application for Assessing Writing Skills. Education and Science, 50, 93-110. https://doi.org/10.15390/EB.2025.14123