Equity in General English Achievement Tests through Gender-Based DIF Analysis across Different Majors
Mehri Jamalzadeh 1, Ahmadreza Lotfi 2, Masoud Rostami 3
1 - Department of English Language, Isfahan (Khorasgan) Branch, Islamic Azad University, Isfahan, Iran
2 - Department of English Language, Isfahan (Khorasgan) Branch, Islamic Azad University, Isfahan, Iran
3 - Department of Languages and Literature, Yazd University, Yazd, Iran
Keywords: gender, equity, test validation, differential item functioning (DIF), general English achievement test, IAUGEAT, IRT
Abstract:
This study investigates gender equity in the context of the General English Achievement Test developed and used at Islamic Azad University (Isfahan Branch, Iran), henceforth IAUGEAT, with test takers majoring in different fields of study. A sample of 835 students sitting for the IAUGEAT was chosen purposively, and their scores were analyzed with the one-parameter IRT (Rasch) model. A focus group interview with 10 test developers and language teachers was also conducted to probe their perceptions of the impact of test takers' gender and major on test equity. The DIF analysis indicated an interaction between item type and gender, as some items exhibited DIF across the major-based subgroups: in three subgroups the flagged items favored female students, in one they favored male students, and in the remaining two they favored males and females alike. The qualitative data obtained from the focus group interview corroborated these results. Overall, the findings strongly suggest that checking gender equity through a Rasch-model DIF analysis is both essential and convergent with a qualitative evaluation of test takers' performance by test developers and instructors.
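To make the method concrete, the following is a minimal formal sketch of how gender DIF is typically quantified under the one-parameter IRT (Rasch) model named in the abstract. The notation and the flagging criteria (separate calibration by group with a difficulty contrast and an approximate t test) are standard conventions in Rasch-based DIF work, not details reported by the study itself.

% Rasch model: probability of a correct response by person p on item i,
% where \theta_p is the person's ability and b_i is the item's difficulty.
\[
  P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{e^{\theta_p - b_i}}{1 + e^{\theta_p - b_i}}
\]
% Uniform gender DIF for item i: the difficulty contrast between the focal
% group (e.g., females, F) and the reference group (males, M) after the two
% calibrations are placed on a common scale, with an approximate t statistic:
\[
  \mathrm{DIF}_i = b_i^{F} - b_i^{M}, \qquad
  t_i \approx \frac{b_i^{F} - b_i^{M}}{\sqrt{SE\!\left(b_i^{F}\right)^{2} + SE\!\left(b_i^{M}\right)^{2}}}
\]
% An item is commonly flagged as showing gender DIF when |DIF_i| exceeds a
% practical threshold (0.5 logits is a frequent convention) and t_i is
% statistically significant; repeating the contrast within each major yields
% the subgroup-by-subgroup pattern summarized in the abstract.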