The combination of dimensionality reduction methods and machine learning algorithms in the optimization of Maroon River water quality prediction

Document Type: Research Paper

Authors

Department of Environmental Engineering, Faculty of Water and Environmental Engineering, Shahid Chamran University of Ahvaz, Ahvaz, Iran.

Abstract

 
Water resources are under pressure from climate change and human activities, which makes sustainable water management essential. Artificial intelligence, and machine learning in particular, is increasingly used to predict and manage water quality because these methods are well suited to identifying patterns in water data. This study examines the water quality of the Maroon River using a combination of factor analysis and machine learning. Data on various water quality parameters were collected from three stations over a period of ten years, and the water quality index was calculated. Different machine learning algorithms were then used to predict the water quality index. In a further step, factor analysis was performed to extract the important input features for the best-performing algorithm. The performance of the algorithms was assessed at each step using evaluation criteria. The results showed that, in the first step, the Random Forest algorithm (R² = 0.78, RMSE = 2.65) performed best in predicting the water quality index. Nitrate was the most important input parameter and acidity the least important across the three algorithms studied. When the inputs were reduced to the 3 most important parameters, the performance of the Random Forest algorithm (R² = 0.74, RMSE = 2.86) came close to that obtained with all 8 input parameters. Combining insights from factor analysis and feature importance analysis can provide a more comprehensive understanding of the complex relationships among water quality parameters and help develop more effective water management.


EXTENDED ABSTRACT

 

Introduction:

Water resources are a vital basis for human life, environmental protection and economic development, and they have accordingly attracted considerable attention. Population growth, climate change and human pressures expose these resources to many challenges and threats, especially in dry areas. These include declining water quality and quantity, the degradation of water resources, and serious problems for freshwater consumption. The investigation and sustainable management of water resources is therefore of particular importance. In this regard, artificial intelligence methods, especially machine learning, are increasingly used to predict and model water quality and to support water resources management. Because they can detect patterns and complex relationships in water quality data, these methods are considered an effective tool for improving water quality management and maintenance.

Materials and Methods:

The present study examines the water quality of the Maroon River, one of the most important rivers in Iran, which plays an important role in the development of urban and rural areas. The data include parameters such as temperature, biochemical oxygen demand, phosphate... and were collected from different stations over 10 years. In the first step, these data were used as inputs for the forecasting models. Dimension reduction methods such as factor analysis were then used to extract important features. In the next step, different machine learning algorithms, such as Linear Regression, Random Forest, Extra Trees and Light Gradient Boosting Machine, were used to predict the water quality index, and their performance was evaluated using criteria such as root mean square error and coefficient of determination.
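The two evaluation criteria named above can be written out directly. The following is a minimal sketch in plain Python, assuming the standard textbook definitions of root mean square error and coefficient of determination; the function names and the sample values are hypothetical and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical water quality index values, for illustration only.
observed_wqi = [60.0, 65.0, 70.0, 75.0, 80.0]
predicted_wqi = [62.0, 64.0, 71.0, 73.0, 80.0]

print("RMSE:", rmse(observed_wqi, predicted_wqi))
print("R2:", r_squared(observed_wqi, predicted_wqi))
```

A lower RMSE and an R² closer to 1 both indicate closer agreement between predicted and observed values, which is how the algorithms are ranked here.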

Results and Discussion:

Bartlett's test yielded a p-value close to zero, indicating a significant correlation among the variables; the data are therefore suitable for factor analysis and dimension reduction. The variance inflation factor values for the water quality parameters used in this research showed that total coliform and phosphate have little collinearity with the other independent variables. When the 8 studied parameters were used as inputs to predict the water quality index, the Random Forest and Linear Regression algorithms showed the highest and lowest agreement with the observed data, respectively. Linear regression fits a straight-line relationship between the independent variables and the dependent variable, so it performs poorly on complex problems with non-linear interactions. The results also showed that nitrate is the most important input parameter and acidity the least important for the three algorithms studied.
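The two diagnostics mentioned above can both be derived from the correlation matrix of the input parameters. The sketch below is illustrative only and uses plain Python. It assumes the usual form of Bartlett's test of sphericity, with statistic -(n - 1 - (2p + 5)/6) ln|R| on p(p - 1)/2 degrees of freedom, and the identity that the variance inflation factor of variable j equals the j-th diagonal element of the inverse correlation matrix. The helper names and the example matrix are hypothetical and are not taken from the study.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length data columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    p = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(p)
        ]
        for i in range(p)
    ]


def invert_and_det(matrix):
    """Gauss-Jordan elimination with partial pivoting: (inverse, determinant)."""
    p = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(p)]
        for i, row in enumerate(matrix)
    ]
    det = 1.0
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det  # a row swap flips the sign of the determinant
        pivot_value = aug[col][col]
        det *= pivot_value
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug], det


def bartlett_statistic(corr, n_samples):
    """Chi-square statistic and degrees of freedom for Bartlett's sphericity test."""
    p = len(corr)
    _, det = invert_and_det(corr)
    chi2 = -(n_samples - 1 - (2 * p + 5) / 6.0) * math.log(det)
    return chi2, p * (p - 1) // 2


def variance_inflation_factors(corr):
    """VIF of each variable: the diagonal of the inverse correlation matrix."""
    inverse, _ = invert_and_det(corr)
    return [inverse[j][j] for j in range(len(corr))]


# Hypothetical correlation matrix for two parameters and 50 samples.
example_corr = [[1.0, 0.6], [0.6, 1.0]]
print(bartlett_statistic(example_corr, n_samples=50))
print(variance_inflation_factors(example_corr))
```

The sketch stops at the test statistic because the standard library has no chi-square distribution; the p-value could be obtained from the upper tail of a chi-square distribution with the returned degrees of freedom, for example with `scipy.stats.chi2.sf`. A VIF near 1 indicates little collinearity with the other inputs.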

Conclusion:

By combining the insights obtained from factor analysis and feature importance analysis, researchers can better understand the complex relationships between water quality parameters and create more effective strategies for water management and pollution control.

Author Contributions

Fereshteh Sayahi: Design, Analysis and Interpretation of data, Writing - Original draft preparation, Visualization. Laleh Divband Hafshejani: Conceptualization, Methodology, Design, Revision of the manuscript and Editing. Parvaneh Tishehzan: Design, Revision of the manuscript and Editing. Hamid Abdolabadi: Analysis and Interpretation of data.

Data Availability Statement

Data are available from the corresponding author by email upon request.

Acknowledgements

We are grateful to the Research Council of Shahid Chamran University of Ahvaz for financial support (GN SCU.WE1402.47794).

Ethical considerations

The authors avoided data fabrication, falsification, plagiarism, and misconduct.
