Equity in General English Achievement Tests through Gender-Based DIF Analysis across Different Majors
Mehri Jamalzadeh 1, Ahmadreza Lotfi 2, Masoud Rostami 3
1 - Department of English Language, Isfahan (Khorasgan) Branch, Islamic Azad University, Isfahan, Iran
2 - Department of English Language, Isfahan (Khorasgan) Branch, Islamic Azad University, Isfahan, Iran
3 - Department of Languages and Literature, Yazd University, Yazd, Iran
Keywords: gender, equity, test validation, differential item functioning (DIF), general English achievement test, IAUGEAT, IRT
Abstract:
This study investigates gender equity in the context of the General English Achievement Test developed and used at Islamic Azad University (Isfahan Branch, Iran), henceforth IAUGEAT, with test takers majoring in different fields of study. A sample of 835 students sitting for the IAUGEAT was chosen purposively, and their scores were analyzed with the one-parameter IRT (Rasch) model. A focus group interview with 10 test developers and language teachers was also conducted to probe their perceptions of the impact of test takers' gender and major on test equity. The DIF analysis indicated an interaction between item type and gender, as some items exhibited DIF across the major-based subgroups: in three subgroups the flagged items favored female students, in one they favored male students, and in the remaining two they favored males and females alike. The qualitative data obtained from the focus group interview corroborated these results. Overall, the findings strongly suggest that checking gender equity through a Rasch-model DIF analysis is both essential and convergent with a qualitative evaluation of test takers' performance by test developers and instructors.
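To make the method concrete, the following is a minimal formal sketch of how gender DIF is typically quantified under the one-parameter IRT (Rasch) model named in the abstract. The notation and the flagging criteria (separate calibration by group with a difficulty contrast and an approximate t test) are standard conventions in Rasch-based DIF work, not details reported by the study itself.

% Rasch model: probability of a correct response by person p on item i,
% where \theta_p is the person's ability and b_i is the item's difficulty.
\[
  P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{e^{\theta_p - b_i}}{1 + e^{\theta_p - b_i}}
\]
% Uniform gender DIF for item i: the difficulty contrast between the focal
% group (e.g., females, F) and the reference group (males, M) after the two
% calibrations are placed on a common scale, with an approximate t statistic:
\[
  \mathrm{DIF}_i = b_i^{F} - b_i^{M}, \qquad
  t_i \approx \frac{b_i^{F} - b_i^{M}}{\sqrt{SE\!\left(b_i^{F}\right)^{2} + SE\!\left(b_i^{M}\right)^{2}}}
\]
% An item is commonly flagged as showing gender DIF when |DIF_i| exceeds a
% practical threshold (0.5 logits is a frequent convention) and t_i is
% statistically significant; repeating the contrast within each major yields
% the subgroup-by-subgroup pattern summarized in the abstract.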