Psychometrics Revisited: Recapitulation of the Major Trends in TESOL
Subject Areas: Research in English Language Pedagogy
Mohammad Ali Salmani Nodoushan 1
1 - Institute for Humanities and Cultural Studies
Keywords: Measurement, Generalizability Theory, Differential Item Functioning, Reliability, Classical Test Theory
Abstract:
A test is a tool for making quantified value judgments and/or comparisons, and a good test is a bias-free gauge that makes those value judgments and quantifications with precision. This requires that the test be at least reliable. In applied linguistics in general, and in TESOL in particular, the question of test reliability has always been at the forefront of test construction activities. As high-stakes gate-keeping tests gained importance in a globalizing post-industrial world, the statistical procedures used to estimate their reliability indices grew correspondingly more complex and precise. Classical Test Theory (CTT) is no longer the creed it once was; test developers and testing agencies have adopted Generalizability Theory (G-Theory) and Item Response Theory (IRT) as their main dishes and, more recently, have spiced up their work with Differential Item Functioning (DIF) analysis. This paper provides the less-versed reader with a short and simple account of these topics. Its aim is to turn the tumid prose that describes complex mathematical and statistical topics in psychometrics and measurement into readable English, so that students new to the field can make sense of them and university professors can use the paper as a simple and informative source in their teaching.