Abstract

A testlet is a group or cluster of items linked to a common stimulus such as a text, graphic, or table. Because these items share a stimulus, responses to them are likely to be interdependent, which violates the local independence assumption of Item Response Theory (IRT) and produces local dependence among the items within a testlet. This study therefore applied IRT and Testlet Response Theory (TRT) models to assess the impact of testlet-induced local dependence on item and ability parameter estimates, classification accuracy, and Differential Item/Bundle Functioning (DIF/DBF), and compared the results obtained from the two models. Responses to three testlets that appeared in both booklets 13 and 14 of the eTIMSS 2019 mathematics subtest were analysed with the mirt package in R. The analysis revealed a moderate degree of local dependence in the testlets. In addition, the item and ability parameter estimates from the two models were very highly correlated. In terms of classification accuracy, the IRT and TRT models performed equivalently. Whether items were analysed individually or as testlet bundles, none showed evidence of gender-based DIF/DBF. The findings indicate that IRT can tolerate the effects of testlets when the degree of local dependence is low to moderate.

Keywords: Testlet, Item response theory, Item parameter estimation, Ability estimation, Classification accuracy, Differential item/bundle functioning
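The sketch below illustrates the kind of model comparison described in the abstract: fitting a unidimensional 2PL IRT model and a bifactor specification (a common, more general stand-in for the TRT model) with the mirt package, then comparing ability estimates and model fit. It is a minimal, hypothetical example, not the authors' actual script; the response matrix resp and the testlet membership vector are assumed, illustrative inputs.

    # Minimal sketch (assumed inputs, not the authors' script):
    # resp is a 0/1 response matrix whose columns are ordered so that
    # items 1-4, 5-8, and 9-12 each share a common stimulus (three testlets).
    library(mirt)

    testlet_id <- rep(1:3, each = 4)                    # hypothetical testlet membership

    mod_irt <- mirt(resp, model = 1, itemtype = "2PL")  # unidimensional 2PL (IRT)
    mod_trt <- bfactor(resp, model = testlet_id)        # general factor + testlet-specific factors;
                                                        # the constrained TRT model would add equality
                                                        # constraints on the specific-factor slopes

    # Item parameters on the usual IRT metric
    coef(mod_irt, IRTpars = TRUE, simplify = TRUE)

    # EAP ability estimates from both models and their correlation
    theta_irt <- fscores(mod_irt, method = "EAP")
    theta_trt <- fscores(mod_trt, method = "EAP")[, 1]  # general-factor score
    cor(theta_irt[, 1], theta_trt)

    # Relative fit of the two specifications
    anova(mod_irt, mod_trt)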

How to cite

Atalay Kabasakal, K., & Gören, S. (2025). Using Testlets in Education: eTIMSS 2019 as an Example. Education and Science, 50, 111-127. https://doi.org/10.15390/EB.2025.14104