Risk Classification of Imbalanced Data for Car Insurance Companies: Machine Learning Approaches
Subject Areas : International Journal of Mathematical Modelling & Computations
Farzan Khamesian 1, Maryam Esna-Ashari 2, Eric Dei Ofosu-Hene 3, Farbod Khanizadeh 4 *
1 - Insurance Research Center, Tehran, Iran
2 - Insurance Research Center, Tehran, Iran
3 - Department of Accounting and Finance, Faculty of Business and Law, De Montfort University, Leicester, UK
4 - Insurance Research Center, Tehran, Iran
Keywords: Classification, Machine Learning, Supervised Learning, Imbalanced Data, Claim Risk
Abstract :
This paper presents a mechanism by which insurance companies can identify the most effective features for classifying customer risk in third-party liability (TPL) car insurance. Underwriting is currently carried out on the basis of expert experience, and the industry lacks a systematic method for categorizing policyholders by risk level. We analyzed 13,388 observations from an insurance claim dataset of bodily injury reports provided by an Iranian insurance company. The main challenge is that the dataset is imbalanced. We employ logistic regression and random forest with different resampling schemes applied to the original data in order to improve model performance. The results indicate that random forest combined with hybrid resampling methods is the best classifier and that victim age, premium, car age, and insured age are the most important factors for claim prediction.
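The abstract does not state which hybrid resampling method was used, so the following is only a minimal sketch of the general idea, assuming the simplest possible variant: random oversampling of the minority (claim) class combined with random undersampling of the majority (no-claim) class. The function name `hybrid_resample`, its parameters, and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def hybrid_resample(rows, labels, minority_label, target_per_class, seed=0):
    """Balance a binary dataset by random over- and under-sampling.

    Assumption: this stands in for the paper's unspecified hybrid method.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority = [(r, y) for r, y in zip(rows, labels) if y != minority_label]
    if not minority or not majority:
        raise ValueError("both classes must be present")

    # Oversampling step: keep every minority row, then add random duplicates.
    extra = max(0, target_per_class - len(minority))
    minority_out = minority + [rng.choice(minority) for _ in range(extra)]

    # Undersampling step: keep a random subset of the majority rows.
    keep = min(target_per_class, len(majority))
    majority_out = rng.sample(majority, keep)

    # Recombine and shuffle so the classes are interleaved.
    resampled = [(r, minority_label) for r in minority_out] + majority_out
    rng.shuffle(resampled)
    new_rows = [r for r, _ in resampled]
    new_labels = [y for _, y in resampled]
    return new_rows, new_labels


# Toy data (hypothetical): 5 claims (label 1) among 100 policies.
rows = [[i] for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

new_rows, new_labels = hybrid_resample(
    rows, labels, minority_label=1, target_per_class=30
)
print(Counter(new_labels))  # each class now has 30 rows
```

In practice a resampling step of this kind is normally applied to the training split only, so that the evaluation data keep the original class ratio.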
F. Khamesian et al./ IJM2C, 12 - 03 (2022) 153-162.