Predicting the annual cost of medical insurance using machine learning
Subject Areas: Application of artificial intelligence and information technology
Ali Zhaleh Karimi 1, Ramin Dalir 2
1 - Master's student of artificial intelligence and robotics, Imam Hossein (AS) University, Tehran, Iran
2 - PhD Student of artificial intelligence, University of Zanjan, Zanjan, Iran
Keywords: Medical insurance, medical cost, classification, machine learning
Abstract:
Health insurance is one way to reduce the costs imposed on society. Research on damages and diseases helps stakeholders set policy in this area. Insurance rates are affected by certain medical factors, and accurate estimation of individual health care and treatment costs is important to a range of stakeholders and health agencies. By predicting medical expenses, both the insured and the insurer can anticipate future costs to some extent and make better-informed decisions. This article aims to predict whether a person's spending on treatment will be low, medium, or high, and to identify the factors that influence health insurance costs. It uses US Census Bureau data comprising 1338 samples with the features age, gender, body mass index (BMI), smoking status, number of dependents, region, and annual cost. In the proposed method, the dataset is first explored to gain an overview of it and to identify the factors that influence treatment cost. The data are then preprocessed and the costs are categorized as low, medium, or high, which puts the data in a form suitable for classification. Next, classification algorithms are trained to learn the category of each sample, and the best algorithm is selected by evaluating them. Finally, the parameters of the selected algorithm are tuned to improve its performance, and the annual cost prediction model is built. Exploration of the dataset showed that smoking, increasing age, and being overweight affect treatment costs. The classification results show that the random forest algorithm can predict low, medium, and high treatment costs with 91% accuracy.
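The step that turns a continuous annual cost into the classes low, medium, and high can be illustrated with a minimal sketch. The abstract does not say how the class boundaries were chosen. The equal-frequency (tertile) cut points below are therefore an assumption, and the function names and sample costs are hypothetical, not taken from the article's dataset.

```python
from statistics import quantiles


def tertile_cut_points(costs):
    """Return the two cut points that split costs into three equal-frequency groups.

    Assumption: tertiles are used as class boundaries; the article may use
    different thresholds.
    """
    # quantiles(..., n=3) returns the two boundaries between the three groups.
    lower, upper = quantiles(costs, n=3, method="inclusive")
    return lower, upper


def categorize(cost, lower, upper):
    """Map one annual cost to 'low', 'medium', or 'high'."""
    if cost <= lower:
        return "low"
    if cost <= upper:
        return "medium"
    return "high"


# Hypothetical annual costs, for illustration only.
sample_costs = [1200.0, 2500.0, 4100.0, 6800.0, 9300.0,
                12500.0, 19000.0, 27000.0, 39000.0]

lower, upper = tertile_cut_points(sample_costs)
labels = [categorize(cost, lower, upper) for cost in sample_costs]

for cost, label in zip(sample_costs, labels):
    print(f"{cost:>10.2f} -> {label}")
```

The resulting labels can serve as the target column for any classifier, such as the random forest reported in the abstract. In practice the cut points would be computed on the training portion of the data only, so that the test samples do not influence the class boundaries.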