Adaptive Selection Algorithm and Standard Error Termination Rule in Comparative Judgement: An Application for Assessing Writing Skills

Sungur Gürel; Murat Şahin; İbrahim Uysal; Ali İbileme; Tuba Gündüz

doi:10.15390/EB.2025.14123

Abstract

The aim of this study is to examine the reliability of comparative judgement scoring under different sample size and standard error termination rule conditions. To this end, a Monte Carlo simulation with 9 conditions and 82 replications was conducted by crossing sample sizes of 250, 500, and 1000 with standard error termination rules of 0.40, 0.35, and 0.30. In addition, a real application was carried out with a sample of 50 students, using a standard error stopping rule of 0.40 and a maximum of 40 comparisons. In the simulation study, scoring reliability was assessed through true reliability, rank-order reliability, and scale separation reliability. In the real-data application, the correlations between comparative judgement scores and both holistic and analytic scores were examined, and scale separation reliability was calculated for comparative judgement scoring. The simulation results showed high scoring reliability in all conditions. Moreover, scoring reliability was found to be independent of sample size. Stricter standard error termination rules yielded higher reliability, but required each performance to undergo a larger number of paired comparisons. The real application yielded a high scale separation reliability of 0.89 and a correlation above 0.70 with rubric-based scores. Overall, the results indicate that comparative judgement scoring can be used in both classroom and large-scale applications. Furthermore, comparative judgement scoring is considered advantageous because it is easier to implement, requires no change to the testing process, and places scores on a continuous scale.
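The design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code (the reference list indicates the study itself used R). The function names, the use of population variance, and the "less than or equal to" comparison in the stopping rule are all assumptions made for illustration. Scale separation reliability is computed here with the commonly used form (observed variance of the estimates minus the mean squared standard error, divided by the observed variance), which may differ in detail from the formula used in the article.

```python
from itertools import product
from statistics import pvariance


def scale_separation_reliability(estimates, standard_errors):
    """Commonly used SSR form: (observed variance - mean squared SE) / observed variance.

    Assumption: population variance is used; the article may use a different variant.
    """
    observed_var = pvariance(estimates)
    mean_sq_error = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    return (observed_var - mean_sq_error) / observed_var


def should_stop(standard_error, n_comparisons, se_threshold, max_comparisons):
    """Hypothetical per-performance stopping check.

    A performance stops being compared once its standard error reaches the
    threshold or it has used up its maximum number of comparisons.
    """
    return standard_error <= se_threshold or n_comparisons >= max_comparisons


# The 3 x 3 crossing of sample sizes and standard error rules from the abstract.
sample_sizes = [250, 500, 1000]
se_rules = [0.40, 0.35, 0.30]
conditions = list(product(sample_sizes, se_rules))

# Toy illustration with made-up ability estimates and standard errors.
toy_estimates = [-1.0, 0.0, 1.0]
toy_standard_errors = [0.4, 0.4, 0.4]
toy_ssr = scale_separation_reliability(toy_estimates, toy_standard_errors)
```

In this sketch, a stricter rule such as 0.30 makes `should_stop` return `True` later than a rule of 0.40 would, which matches the abstract's point that higher reliability comes at the cost of more paired comparisons per performance.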

Keywords: Comparative judgement, Holistic assessment, Scale separation reliability, Paired comparison

Copyright and license

Copyright © 2025 The Author(s). This open-access article is distributed under the Creative Commons Attribution License (CC BY), which permits unrestricted use, distribution, and reproduction in any medium or format, provided the original work is properly cited.

How to cite

Gürel, S., Şahin, M., Uysal, İ., İbileme, A., & Gündüz, T. (2025). Karşılaştırma Yargılarıyla Puanlamada Uyarlamalı Eşleme ve Standart Hata Sonlandırma Kuralı: Yazma Becerisinin Ölçülmesine Yönelik bir Uygulama. Eğitim Ve Bilim, 50, 93-110. https://doi.org/10.15390/EB.2025.14123

