Designing a hybrid model for classification of imbalanced data in the field of third party insurance
Subject Areas: Multimedia Processing, Communications Systems, Intelligent Systems
Mahnaz Manteqipour 1*, Parisa Rahimkhani 2
1 - Researcher
2 - Researcher
Keywords: Hybrid model, Data mining, Imbalanced data, Third-party insurance
Abstract:
Compulsory third-party liability insurance for motor vehicle owners makes up the major part of the Iranian insurance industry's portfolio. Understanding the behavior of this line of business therefore helps insurers provide better services to their customers. Predicting claim rates from the features recorded for each policy is one of the industry's problems that data mining techniques can address. Insurance relies on the law of large numbers: a sufficiently large number of policies is issued, only a small fraction of them result in claims, and those claims are paid out of the total premiums collected. As a consequence, insurance data are imbalanced, and this imbalance makes classification challenging. In the third-party insurance data set used in this research, each policy has 14 features and the imbalance ratio is 1 to 0.0092, which is considered severe.
Method
In this research, we address the classification of severely imbalanced data in the field of third-party insurance. To overcome the imbalance, two hybrid models with different architectures are designed, both built on five base models: Gaussian Bayes, support vector machine, logistic regression, decision tree, and nearest neighbor. The first hybrid model draws random samples from the whole data set and applies a resampling method before classification. The second selects samples from each label separately and applies a classification model to all of the selected data. The results of the two models are compared.
Results
The results show that the proposed hybrid models predict the occurrence or non-occurrence of traffic accidents better than other data mining algorithms. Common measures such as precision and recall show that the second hybrid model performs better than the first. In the ensemble phase, the number of models in simple voting is a hyperparameter that can be adjusted according to the company's strategy. In addition, using a decision tree to combine the base models into an ensemble gives better results than simple voting over the base models.
Discussion
Further research on imbalanced data classification could apply more sophisticated resampling algorithms and compare the results.
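Two ideas from the abstract, per-label sampling and simple voting with an adjustable number of votes, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`sample_per_class`, `simple_vote`, `precision_recall`) are hypothetical, the base models are replaced by hard-coded toy predictions, and reading the voting hyperparameter as a minimum number of agreeing models is an assumption.

```python
import random
from collections import defaultdict


def sample_per_class(labels, n_per_class, seed=0):
    """Return sorted row indices, drawing up to n_per_class rows from each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    chosen = []
    for label in sorted(by_label):
        rows = by_label[label]
        # A label with fewer rows than requested contributes all of its rows.
        chosen.extend(rng.sample(rows, min(n_per_class, len(rows))))
    return sorted(chosen)


def simple_vote(base_predictions, min_votes):
    """Predict a claim (1) when at least min_votes base models predict 1."""
    return [int(sum(votes) >= min_votes) for votes in zip(*base_predictions)]


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (claim) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy illustration only: labels and votes are invented, not the study's data.
toy_labels = [0] * 10 + [1] * 2
balanced_rows = sample_per_class(toy_labels, n_per_class=2)

y_true = [1, 1, 0, 0, 0, 0, 0, 0]
base_predictions = [  # one row per hypothetical base model
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]
for min_votes in range(1, 6):
    y_pred = simple_vote(base_predictions, min_votes)
    precision, recall = precision_recall(y_true, y_pred)
    print(f"min_votes={min_votes}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, raising the vote threshold from 1 to 4 trades recall for precision, which is the kind of adjustment a company could make to fit its strategy. The decision-tree combiner mentioned in the Results is not shown here.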
[1] K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022.
[2] A. Fernández, S. García, M. Galar et al., Learning from Imbalanced Data Sets, Springer, 2018.
[3] S. Ardabili, A. Mosavi and A. R. Varkonyi-Koczy, "Advances in Machine Learning Modeling Reviewing Hybrid and Ensemble Methods," Preprints, 2019.
[4] G. G. Sundarkumar and V. Ravi, "A novel hybrid undersampling method for mining unbalanced datasets in banking and insurance," Engineering Applications of Artificial Intelligence, vol. 37, pp. 368–377, 2015.
[5] S. I. V. Shamitha, S. K. Shamitha and V. Ilango, "A hybrid technique for health insurance fraud detection on highly imbalanced dataset," International Journal of Innovative Technology and Exploring Engineering (IJITEE), vol. 8, no. 11, pp. 2278-3075, 2019.
[6] S. Kotekani and I. Velchamy, "An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection," CIT. Journal of Computing and Information Technology, vol. 28, no. 4, pp. 269–285, 2020.
[7] J. Brownlee, Data Preparation for Machine Learning, Jason Brownlee, 2020.
[8] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, Beijing, Boston, Farnham, Sebastopol, Tokyo: O'Reilly Media, Inc., 2019.
[9] J. Kozak, Decision Tree and Ensemble Learning Based on Ant Colony Algorithm, Katowice, Poland: Springer, 2019.