Abstract

Open-ended items, which have been used for centuries as a method of evaluating student achievement, have many advantages, such as measuring higher-order skills, providing rich diagnostic information about the student, and being free of success by chance. Today, however, open-ended items cannot be used in exams with large numbers of students because of the potential for errors in the scoring process and the disadvantages they carry in terms of labour, time, and cost. At this point, Artificial Intelligence (AI) has considerable potential for scoring open-ended items. The aim of this study is to examine the performance of AI in scoring students' handwritten responses to open-ended items. An achievement test consisting of 3 open-ended and 10 multiple-choice items was developed within the scope of the Measurement and Assessment in Education course at a state university. The open-ended items were scored with a structured rubric (0-1-2), while the multiple-choice items were scored dichotomously (0-1). A total of 84 participants took part in the study, and the open-ended items were scored by an expert group and an AI tool (ChatGPT-4o). Images of the students' handwritten responses were scored by the AI tool under two scenarios: in the first, the AI was asked to score without being given any scoring criteria, whereas in the second it was asked to score according to the standard scoring criteria. The findings showed low agreement and correlation coefficients between the AI scores obtained without criteria and the expert scores, but high agreement and correlation coefficients between the AI scores obtained with the standard scoring criteria and the expert scores. Similarly, item discrimination indices were quite low for the AI scoring without criteria and high for the AI scoring with the standard criteria. The reasons for the discrepancies between the expert scores and the AI scores obtained with the standard criteria were also investigated and reported. The results show that, when standardized scoring criteria are provided, AI can score handwritten open-ended items at a good level. As AI continues to develop and transform, it is thought that it may reach scoring accuracy comparable to that of expert raters in terms of consistency.

Keywords: Open-ended item, Artificial intelligence, AI, ChatGPT, Automated scoring, Handwritten responses, Constructed response item
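As a rough illustration of the kind of comparison summarized in the abstract, the sketch below computes agreement and correlation between expert and AI scores for a single open-ended item, together with a corrected item-total correlation as a discrimination index. It is a minimal sketch only: all variable names and data values are hypothetical, and quadratic-weighted Cohen's kappa and the Pearson correlation are common choices for this type of analysis rather than necessarily the exact coefficients used in the study.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical data: expert and AI scores (0-2) for one open-ended item,
    # plus each examinee's total test score for the discrimination index.
    expert = np.array([2, 1, 0, 2, 1, 2, 0, 1, 2, 1])
    ai_with_rubric = np.array([2, 1, 0, 2, 2, 2, 0, 1, 2, 1])
    total_score = np.array([14, 9, 4, 15, 10, 13, 5, 8, 16, 9])

    # Agreement between the expert and AI ratings: quadratic-weighted kappa.
    kappa = cohen_kappa_score(expert, ai_with_rubric, weights="quadratic")

    # Linear association between expert and AI scores.
    r, p = pearsonr(expert, ai_with_rubric)

    # Item discrimination: correlation of the AI-scored item with the total
    # score after removing the item itself (corrected item-total correlation).
    corrected_total = total_score - ai_with_rubric
    discrimination, _ = pearsonr(ai_with_rubric, corrected_total)

    print(f"weighted kappa = {kappa:.2f}, r = {r:.2f}, "
          f"discrimination = {discrimination:.2f}")

The same comparison would be run once for the no-criteria AI scores and once for the AI scores obtained with the standard rubric, allowing the two scenarios to be contrasted on agreement, correlation, and discrimination.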


How to cite

Yiğiter, M., & Boduroğlu, E. (2025). Examining the Performance of Artificial Intelligence in Scoring Students’ Handwritten Responses to Open-Ended Items. Education and Science, 50, 1-18. https://doi.org/10.15390/EB.2025.14119