Abstract
A testlet is a group or cluster of items linked to a common stimulus, such as a text, graphic, or table. Because the items share a stimulus, responses to them are likely to be interdependent, which violates the local independence assumption of Item Response Theory (IRT) and produces local dependence among the items within a testlet. This study therefore employed IRT and Testlet Response Theory (TRT) models to assess the impact of testlet-induced local dependence on item and ability parameter estimation, classification accuracy, and Differential Item/Bundle Functioning (DIF/DBF), and compared the findings obtained from the two models. Responses to three testlets that appeared in both booklets 13 and 14 of the eTIMSS 2019 mathematics subtest were analysed using the mirt package in R. The analysis revealed a moderate degree of local dependence within the testlets, and the item and ability parameter estimates derived from the two models were very highly correlated. In terms of classification accuracy, the IRT and TRT models performed equivalently. Whether items were analysed independently or as parts of testlets, none exhibited evidence of DIF/DBF based on gender. The findings indicate that IRT can tolerate testlet effects when the degree of local dependence is low to moderate.
Keywords: Testlet, Item response theory, Item parameter estimation, Ability estimation, Classification accuracy, Differential item/bundle functioning
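For readers who want to reproduce this kind of analysis, the following is a minimal sketch in R using the mirt package. It is illustrative only: the response matrix `resp`, the grouping vector `gender`, and the three-testlets-of-four-items layout are hypothetical placeholders rather than the study's actual data, and a bifactor model is used as a stand-in for the TRT model (the TRT model can be viewed as a bifactor model with the within-testlet specific loadings constrained equal).

```r
## Minimal sketch, assuming a 0/1 response matrix `resp` whose columns are
## ordered by testlet, and a group vector `gender`; all object names and the
## three-testlets-of-four-items layout are hypothetical.
library(mirt)

testlet <- rep(1:3, each = 4)  # hypothetical item-to-testlet assignment

## Standard unidimensional 2PL, ignoring the testlet structure
mod_irt <- mirt(resp, model = 1, itemtype = "2PL")

## Bifactor model as a TRT stand-in: each item loads on the general factor
## plus its own testlet-specific factor
mod_trt <- bfactor(resp, model = testlet)

## Degree of local dependence under the unidimensional model:
## Yen's Q3 residual correlations (elevated within-testlet values signal LD)
q3 <- residuals(mod_irt, type = "Q3")

## Compare item parameters and EAP ability estimates across the models
coef(mod_irt, IRTpars = TRUE, simplify = TRUE)
theta_irt <- fscores(mod_irt, method = "EAP")[, 1]
theta_trt <- fscores(mod_trt, method = "EAP")[, 1]  # general factor

## A very high correlation here mirrors the pattern reported in the abstract
cor(theta_irt, theta_trt)

## Gender DIF: constrain all item parameters equal across groups, then
## test each item by freeing its parameters (likelihood-ratio scheme)
mg <- multipleGroup(resp, 1, group = gender,
                    invariance = c("free_means", "free_var", colnames(resp)))
dif <- DIF(mg, which.par = c("a1", "d"), scheme = "drop")
```

From the same fitted objects, `anova(mod_irt, mod_trt)` compares the nested models directly, and classification accuracy and consistency indices could then be computed from the estimated parameters and ability scores, for example with the cacIRT package.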
Copyright and license
Copyright © 2025 The Author(s). This is an open access article distributed under the Creative Commons Attribution License (CC BY), which permits unrestricted use, distribution, and reproduction in any medium or format, provided the original work is properly cited.
How to cite
Atalay Kabasakal, K., & Gören, S. (2025). Using Testlets in Education: eTIMSS 2019 as an Example. Education and Science, 50, 111-127. https://doi.org/10.15390/EB.2025.14104