Abstract

This study examines the scoring reliability of comparative judgement under different sample sizes and standard error termination rules. For this purpose, a Monte Carlo simulation study with 9 conditions and 82 iterations was conducted, crossing sample sizes of 250, 500 and 1000 with standard error termination rules of 0.40, 0.35 and 0.30. In addition, an application for assessing writing skills was conducted with a sample of 50 students, using a standard error termination rule of 0.40 and a maximum of 40 comparisons. In the simulation study, scoring reliability was evaluated through true reliability, rank-order accuracy and scale separation reliability. In the application, we examined the correlations between the ability estimates obtained with adaptive comparative judgement and the scores obtained with a holistic rubric and with an analytic rubric, and we calculated the scale separation reliability of the adaptive comparative judgement estimates. The simulation results showed a high level of reliability in all conditions, independent of sample size. Stricter standard error termination rules led to higher reliability, but required subjecting performances to more pairwise comparisons. The application yielded a scale separation reliability of 0.89 and correlations above 0.70 with the scores obtained using both the holistic and the analytic rubric. Overall, the results suggest that adaptive comparative judgement can be used in both classroom and large-scale assessment. It is also considered advantageous because it is easy to administer, requires no change to the testing process, and places abilities on a continuous scale.
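To make the design concrete, the following minimal sketch in R (the software used in the study) simulates one cell of the design. It is an illustration under stated assumptions, not the authors' code: true abilities are drawn from a standard normal distribution, performances are paired at random each round (the study itself used an adaptive selection algorithm), the Bradley-Terry model is fitted with the MM algorithm, standard errors are approximated from the diagonal of the information matrix, and scale separation reliability is computed with one common formula. All numerical settings are illustrative.

## Minimal sketch of one simulation cell (illustrative assumptions throughout)
set.seed(1)
n          <- 250    # performances; the study also used 500 and 1000
se_rule    <- 0.40   # SE termination rule; the study also used 0.35 and 0.30
max_rounds <- 200    # safety cap (the empirical application capped comparisons at 40)

theta  <- rnorm(n)          # true abilities on the logit scale (assumed N(0, 1))
n_ij   <- matrix(0, n, n)   # number of times i and j were compared
w      <- numeric(n)        # total wins per performance
pi_hat <- rep(1, n)         # Bradley-Terry strength parameters
rounds <- 0

repeat {
  rounds <- rounds + 1
  # One round of judgements: random pairing. An adaptive algorithm would
  # instead pair performances whose current ability estimates are closest.
  idx <- sample(n)
  for (k in seq(1, n - 1, by = 2)) {
    i <- idx[k]; j <- idx[k + 1]
    n_ij[i, j] <- n_ij[i, j] + 1
    n_ij[j, i] <- n_ij[j, i] + 1
    # simulate the judge's decision under the Bradley-Terry model
    if (runif(1) < plogis(theta[i] - theta[j])) w[i] <- w[i] + 1 else w[j] <- w[j] + 1
  }
  # Bradley-Terry MLE via the MM (Zermelo) algorithm, warm-started each round
  for (step in 1:50) {
    denom  <- rowSums(n_ij / outer(pi_hat, pi_hat, "+"))
    pi_hat <- pmax(w, 0.5) / denom              # 0.5 guards all-loss performances
    pi_hat <- pi_hat / exp(mean(log(pi_hat)))   # identify the scale (mean logit = 0)
  }
  theta_hat <- log(pi_hat)
  # crude SEs from the diagonal of the information matrix
  p  <- plogis(outer(theta_hat, theta_hat, "-"))
  se <- 1 / sqrt(rowSums(n_ij * p * (1 - p)))
  if (max(se) < se_rule || rounds >= max_rounds) break
}

cor(theta, theta_hat)                            # true reliability
cor(theta, theta_hat, method = "spearman")       # rank-order accuracy
(var(theta_hat) - mean(se^2)) / var(theta_hat)   # scale separation reliability

Tightening se_rule from 0.40 to 0.30 increases the number of rounds, and hence comparisons per performance, needed before the loop exits, which is the reliability-versus-effort trade-off described above; replacing the random pairing with a rule that pairs performances with the closest current estimates yields an adaptive variant with the same termination logic.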

Keywords: Comparative judgement, Holistic assessment, Scale separation reliability, Pairwise comparison

How to cite

Gürel, S., Şahin, M., Uysal, İ., İbileme, A., & Gündüz, T. (2025). Adaptive Selection Algorithm and Standard Error Termination Rule in Comparative Judgement: An Application for Assessing Writing Skills. Education and Science, 50, 93-110. https://doi.org/10.15390/EB.2025.14123