Developing Financial Distress Prediction Models Based on Imbalanced Dataset: Random Undersampling and Clustering Based Undersampling Approaches
Subject areas: Financial Economics
Seyed Behrooz Razavi Ghomi 1, Alireza Mehrazin 2, Mohammad Reza Shoorvarzi 3, Abolghasem Masih Abadi 4
1 - Department of Accounting, Neyshabur Branch, Islamic Azad University, Neyshabur, Iran
2 - Department of Accounting, Neyshabur Branch, Islamic Azad University, Neyshabur, Iran
3 - Department of Accounting, Neyshabur Branch, Islamic Azad University, Neyshabur, Iran
4 - Department of Accounting, Sabzevar Branch, Islamic Azad University, Sabzevar, Iran
Keywords: Imbalanced datasets, Undersampling, Machine learning, Financial distress prediction models, Financial ratios
Abstract:
To date, financial distress prediction models have been built on balanced samples, which is inconsistent with the actual population of companies. When the data are balanced, sample selection bias may lead to an underestimation of the type I error and an overestimation of the type II error of the models. Although models based on imbalanced data are consistent with reality, they have a higher type I error than models based on balanced data. For beneficiaries, the cost of a type I error matters more than the cost of a type II error. In this study, random undersampling and clustering-based undersampling were used to reduce the type I error of models based on imbalanced data. The test data comprised 760 companies over the period 2007-2007, at four different degrees of imbalance. The results of testing H1 to H3 show that, in all cases, the type I error of models based on balanced data was lower, and their type II error higher, than those of models based on imbalanced data; in most cases, the geometric mean of the balanced-data models was also higher. The results of testing H4 to H6 show that, in most cases, models based on modified imbalanced data had a lower type I error, a higher type II error, and a higher geometric mean than models based on the unmodified imbalanced data. In other words, applying undersampling to imbalanced training data decreased the type I error and increased both the type II error and the geometric mean. Models based on modified imbalanced data are therefore recommended to beneficiaries.
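The two undersampling approaches and the three evaluation criteria named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that label 1 marks a distressed company and 0 a healthy one, that a type I error means a distressed company classified as healthy, and that the geometric mean is the square root of the product of the two class-wise accuracies. The function names (`random_undersample`, `cluster_undersample`, `error_rates`) are hypothetical, and the clustering step is a plain k-means that keeps the real majority row closest to each centroid, which is only one of several possible clustering-based variants.

```python
import math
import random


def random_undersample(majority, n_keep, seed=0):
    """Keep n_keep majority-class rows chosen at random, without replacement."""
    return random.Random(seed).sample(majority, n_keep)


def _dist2(a, b):
    """Squared Euclidean distance between two feature rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _assign(rows, centers):
    """Group rows by the index of their nearest center."""
    groups = [[] for _ in centers]
    for row in rows:
        j = min(range(len(centers)), key=lambda k: _dist2(row, centers[k]))
        groups[j].append(row)
    return groups


def cluster_undersample(majority, n_keep, seed=0, n_iter=20):
    """Run k-means with k = n_keep on the majority class and return, for each
    non-empty cluster, the real row nearest its centroid (assumed variant).
    May return fewer than n_keep rows if a cluster ends up empty."""
    rng = random.Random(seed)
    centers = [list(row) for row in rng.sample(majority, n_keep)]
    for _ in range(n_iter):
        for j, members in enumerate(_assign(majority, centers)):
            if members:  # an emptied cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    groups = _assign(majority, centers)
    return [min(g, key=lambda r: _dist2(r, c))
            for g, c in zip(groups, centers) if g]


def error_rates(y_true, y_pred):
    """Return (type I error, type II error, geometric mean).
    Assumes 1 = distressed, 0 = healthy, with both classes present."""
    pairs = list(zip(y_true, y_pred))
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    type1 = sum(1 for t, p in pairs if t == 1 and p == 0) / n_pos
    type2 = sum(1 for t, p in pairs if t == 0 and p == 1) / n_neg
    return type1, type2, math.sqrt((1 - type1) * (1 - type2))
```

In use, only the healthy (majority) rows of the training set would be reduced with one of the two functions and then recombined with all distressed rows before fitting a classifier; the test set would keep its original class proportions so that `error_rates` reflects the imbalanced population.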