Investigating water quality in wetlands using multivariate statistical analysis and machine learning models
Subject Areas: Article from a thesis
Abdosamad Davoodi 1, Reza Mohammadpour 2, Tooraj Sabzevari 3
1 - PhD student in Water and Hydraulic Structures, Islamic Azad University, Estahban Branch, Fars, Estahban
2 - Assistant Professor, Water Department, Islamic Azad University, Estahban Branch
3 - Associate Professor, Water Department, Islamic Azad University, Estahban Branch
Keywords: Hierarchical cluster analysis, Machine learning, Water quality factors, Principal component analysis, Constructed wetlands, Random forest, XGBoost
Abstract:
Introduction: Water is one of the main foundations of the sustainable development of societies, and clean water resources are a major prerequisite for environmental protection and for economic, political, social and cultural development. Rising demand for water, higher living standards and the spread of pollution from expanding agricultural, urban and industrial activities have degraded the environment and intensified the pollution of water resources, making it increasingly difficult to control.
Methods: Multivariate statistical methods and data mining have been used to investigate water quality in many studies, and cluster analysis (CA) and discriminant analysis (DA) have been used to identify pollution sources in river basins. To compare the assumptions of the analytical methods systematically, the theoretical foundations of each method were examined. The nonparametric methods, percentage removal (PR) and the sign test (ST), are applicable without assuming a specific data distribution, whereas the classical multivariate methods, principal component analysis (PCA) and factor analysis (FA), assume multivariate normality and linear relationships between variables; the suitability of the data for these methods was checked with the Kaiser-Meyer-Olkin (KMO) and Bartlett tests. The machine learning models were Random Forest and XGBoost, which can capture nonlinear relationships and are robust to collinearity, and the support vector machine (SVM), which is sensitive to feature scaling and needs a feature space in which the classes are separable. The regression methods, partial least squares (PLS) and stepwise regression, assume linear relationships and require cross-validation to prevent overfitting.
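The two nonparametric methods named above can be illustrated with a minimal sketch. It assumes paired inlet and outlet concentrations for one pollutant; the values, the variable names `inlet` and `outlet`, and the helper functions `percentage_removal` and `sign_test` are hypothetical and are not taken from the study. The sign test here is the exact two-sided version with ties dropped, which may differ from the variant the authors applied.

```python
from math import comb

# Hypothetical paired concentrations (mg/L) at the wetland inlet and outlet.
inlet = [12.0, 10.5, 14.2, 9.8, 11.1, 13.4, 10.0, 12.7]
outlet = [8.1, 7.9, 9.5, 10.2, 7.0, 8.8, 6.5, 9.1]


def percentage_removal(c_in, c_out):
    """Percentage of the inlet concentration removed by the outlet."""
    return 100.0 * (c_in - c_out) / c_in


def sign_test(before, after):
    """Exact two-sided sign test on paired samples (ties are dropped).

    Returns (n, k, p): the number of non-tied pairs, the number of pairs
    where `before` exceeds `after`, and the two-sided p-value.
    """
    diffs = [b - a for b, a in zip(before, after) if b != a]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    # Under H0 the count of positive differences follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return n, k, min(1.0, 2.0 * tail)


removals = [percentage_removal(a, b) for a, b in zip(inlet, outlet)]
n, k, p_value = sign_test(inlet, outlet)
print(f"mean removal: {sum(removals) / len(removals):.1f} %")
print(f"positive differences: {k} of {n}, two-sided p = {p_value:.4f}")
```

With these made-up numbers, 7 of the 8 paired differences are positive and the two-sided p-value is about 0.07. Neither calculation relies on a distributional assumption, which is the property the methods paragraph attributes to PR and ST.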
Findings: The results of the percentage removal (PR) and sign test methods showed that the wetland plays a key role in the entire drainage system; therefore, all water quality factors in the wetland were examined using linear discriminant analysis (LDA), PCA/FA and hierarchical agglomerative cluster analysis (HACA). Principal component analysis (PCA) also helps to rank the contribution of each factor to pollution, since it places the more important factors in the first component and the less important factors in the subsequent components. The PCA results show that the components with eigenvalues greater than one are the most important components explaining the variance.
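The eigenvalue criterion described in the findings can be sketched as follows. This is an illustration on synthetic data, not the study's dataset: the parameter names in `names`, the two hidden factors used to generate `X`, and the helper `pca_kaiser` are all assumptions made for the example. The sketch runs PCA on the correlation matrix and keeps the components whose eigenvalue exceeds one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 60
# Hypothetical water quality parameters, used only as labels here.
names = ["DO", "BOD", "COD", "TSS", "NH3N", "pH"]

# Synthetic data: two hidden factors plus noise (illustration only).
f1 = rng.normal(size=n_samples)
f2 = rng.normal(size=n_samples)
noise = rng.normal(scale=0.5, size=(n_samples, len(names)))
X = np.column_stack([-f1, f1, f1, f2, f2, 0.3 * f1]) + noise


def pca_kaiser(data):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    n_keep = int((eigvals > 1.0).sum())
    explained = eigvals / eigvals.sum()
    # Loadings: correlation between each variable and each kept component.
    loadings = eigvecs[:, :n_keep] * np.sqrt(eigvals[:n_keep])
    return eigvals, explained, loadings, n_keep


eigvals, explained, loadings, n_keep = pca_kaiser(X)
for i in range(n_keep):
    top = names[int(np.argmax(np.abs(loadings[:, i])))]
    print(f"PC{i + 1}: eigenvalue={eigvals[i]:.2f}, "
          f"variance={100 * explained[i]:.1f} %, strongest loading: {top}")
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above one means the component explains more variance than any single standardized variable, which is why such components are treated as the most important.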
