Reconstruction of missing climatic data using a combination of Multiple Imputation by Chained Equations (MICE) and boosting machine learning models in the Urmia Lake Basin

Article type: Research article

Authors

Department of Water Science and Engineering, College of Agriculture, Isfahan University of Technology, Isfahan, Iran

Abstract

Complete and accurate climatic data play a key role in climatological analyses, hydrological studies, and water resources management. However, records from meteorological stations usually contain gaps, and improper reconstruction of these gaps can bias modeling results. In this study, four models (MICE, MICE–GBR, MICE–XGB, and MICE–LGBM) were examined and compared for reconstructing missing climatic data at six selected stations in the Urmia Lake Basin: Tabriz, Bonab, Urmia, Maragheh, Saqez, and Sarab. Model performance was evaluated with the statistical indices R², NRMSE, |PBIAS|, and KGE. The results showed that the hybrid models based on boosting algorithms performed more accurately and more stably than the baseline MICE model. Among them, MICE–XGB gave the best results, with mean R² above 0.90 and KGE above 0.92 at most stations. The lowest errors were observed for temperature variables and the highest for cloudiness-related variables. |PBIAS| was below 0.025% for all models, indicating no appreciable systematic bias. A comparison of model runtimes also showed that the boosting methods, despite their high accuracy, are computationally efficient and cost-effective. Overall, the findings indicate that hybrid models combining MICE with boosting are highly effective for reconstructing climatic data, and their use is recommended in future climatological and hydrological analyses.


Article title [English]

Reconstruction of Missing Climatic Data Using Combination of Multiple Imputation by Chained Equations (MICE) and Boosting-Based Machine Learning Approaches in the Urmia Lake Basin

Authors [English]

  • Mohammad Shayannejad
  • Mohammad Jamali
  • Saeid Eslamian
Department of Water Science and Engineering, College of Agriculture, Isfahan University of Technology, Isfahan, Iran
Abstract [English]

The availability of complete and accurate climatic data plays a crucial role in climatological analyses, hydrological studies, and water resource management. However, meteorological station records often contain missing values, which, if not properly reconstructed, can introduce significant bias into subsequent modeling and analysis. In this study, missing climatic data were reconstructed for six selected meteorological stations—Tabriz, Bonab, Urmia, Maragheh, Saqez, and Sarab—located in the Urmia Lake Basin. Four models, including MICE, MICE–GBR, MICE–XGB, and MICE–LGBM, were developed and compared. Model performance was evaluated using statistical indices such as R², NRMSE, |PBIAS|, and KGE. Results revealed that hybrid MICE models based on boosting algorithms provided more accurate and stable reconstructions than the conventional MICE model. Among the tested models, MICE–XGB achieved the best overall performance, with average R² exceeding 0.90 and KGE above 0.92 across most stations. The lowest errors were observed for temperature-related variables, while the highest occurred in cloudiness-related parameters. The |PBIAS| values for all models were below 0.025%, indicating negligible systematic bias. Furthermore, model runtime comparisons demonstrated that boosting-based methods, despite their high accuracy, remained computationally efficient and cost-effective. Overall, the findings confirm the superior capability of hybrid MICE models combined with boosting algorithms for reconstructing missing climatic data, highlighting their potential for future climatological and hydrological analyses in data-scarce environments.

Keywords [English]

  • Boosting learning
  • Climatic variables
  • MICE algorithm
  • Missing data reconstruction
  • Urmia Lake Basin

Introduction

Accurate and continuous climate data are essential for climatological analysis, hydrological modeling, and water resources management, particularly in regions facing rapid climatic fluctuations. However, meteorological records often contain missing values due to sensor malfunctions, network interruptions, and human errors. If these missing data are not properly reconstructed, subsequent analyses such as drought assessment, evapotranspiration modeling, and climate change projections may be biased and unreliable (Afrifa-Yamoah et al., 2020; Hersbach et al., 2020).

Traditional gap-filling techniques such as linear regression, spatial interpolation, and principal component analysis have been widely used, yet they often fail to capture the nonlinear and interdependent relationships among climatic variables (Matinzadeh et al., 2013). In recent years, machine learning methods, especially boosting-based algorithms such as Gradient Boosting Regression (GBR), Extreme Gradient Boosting (XGB), and Light Gradient Boosting Machine (LGBM), have shown superior performance in modeling complex, nonlinear datasets (Alejo-Sanchez et al., 2025).

The present study aims to evaluate and compare the performance of four gap-filling models (MICE, MICE–GBR, MICE–XGB, and MICE–LGBM) for reconstructing missing climate data across six meteorological stations in the Urmia Lake Basin, northwestern Iran. By integrating the Multiple Imputation by Chained Equations (MICE) approach with ensemble boosting algorithms, this research investigates whether hybrid learning frameworks can improve reconstruction accuracy and reduce systematic bias in climatic datasets.

Method

The study area includes six representative synoptic stations—Tabriz, Bonab, Urmia, Maragheh, Saqez, and Sarab—with elevations ranging from 1315 to 1682 m. Daily time series of key climatic variables, including temperature (Tmean, Tmin, Tmax), relative humidity (RH), sea-level pressure (SLP), vapor pressure (SVP), cloudiness (CLD), and evapotranspiration (ET), were analyzed. Four gap-filling approaches were developed and implemented in Python using the scikit-learn and XGBoost/LightGBM libraries:

MICE (baseline model): a chained regression-based multiple imputation method;

MICE–GBR: MICE integrated with Gradient Boosting Regression;

MICE–XGB: MICE coupled with Extreme Gradient Boosting;

MICE–LGBM: MICE combined with LightGBM.
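The four configurations listed above differ only in the regressor used inside the chained-equations loop. The following is a minimal sketch of that idea, not the authors' implementation: it assumes scikit-learn's `IterativeImputer` (a MICE-inspired imputer that returns a single imputation by default) as the chained-equations engine, uses `GradientBoostingRegressor` as the boosting learner, and runs on made-up synthetic data. The helper names `make_imputer` and `synthetic_station`, the 10% gap rate, and all hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import IterativeImputer


def make_imputer(estimator=None, max_iter=5, seed=0):
    """Chained-equations imputer (hypothetical helper).

    estimator=None falls back to scikit-learn's default BayesianRidge,
    which stands in here for a regression-based MICE baseline.
    """
    return IterativeImputer(estimator=estimator, max_iter=max_iter, random_state=seed)


def synthetic_station(n_days=300, seed=0):
    """Made-up daily record with columns Tmean, Tmin, Tmax, RH (assumed relationships)."""
    rng = np.random.default_rng(seed)
    day = np.arange(n_days)
    tmean = 12 + 10 * np.sin(2 * np.pi * day / 365) + rng.normal(0, 1.5, n_days)
    tmin = tmean - 6 + rng.normal(0, 1, n_days)
    tmax = tmean + 6 + rng.normal(0, 1, n_days)
    rh = np.clip(70 - 1.5 * tmean + rng.normal(0, 5, n_days), 5, 100)
    return np.column_stack([tmean, tmin, tmax, rh])


complete = synthetic_station()
rng = np.random.default_rng(1)
mask = rng.random(complete.shape) < 0.10  # hide roughly 10% of the values at random
gappy = complete.copy()
gappy[mask] = np.nan

# Baseline: chained equations with the default linear-type regressor.
baseline = make_imputer().fit_transform(gappy)

# Hybrid: the same chained-equations loop with a boosting regressor per variable.
gbr = GradientBoostingRegressor(n_estimators=50, random_state=0)
boosted = make_imputer(gbr).fit_transform(gappy)

for name, filled in [("MICE-like baseline", baseline), ("MICE-GBR sketch", boosted)]:
    rmse = np.sqrt(np.mean((filled[mask] - complete[mask]) ** 2))
    print(f"{name}: RMSE on hidden values = {rmse:.2f}")
```

Because `xgboost.XGBRegressor` and `lightgbm.LGBMRegressor` expose the scikit-learn estimator interface, either could be passed to `make_imputer` in place of `gbr` to sketch the MICE–XGB and MICE–LGBM variants, provided those packages are installed.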

The models were evaluated using four widely adopted statistical indices:

Coefficient of Determination (R²) for accuracy and explained variance;

Normalized Root Mean Square Error (NRMSE) for relative reconstruction error;

Absolute Percent Bias (|PBIAS|) for assessing systematic deviation;

Kling–Gupta Efficiency (KGE) for combined evaluation of bias, correlation, and variability.
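The four indices above can be computed from paired observed and reconstructed values. The sketch below uses only the Python standard library and rests on assumptions the text does not specify: R² is taken as the squared Pearson correlation, NRMSE is the RMSE normalised by the observed range, PBIAS is the summed error as a percentage of the summed observations, and KGE follows the Gupta et al. (2009) form with correlation, variability ratio, and mean ratio. The study may have used different conventions.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    """Population standard deviation."""
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson_r(obs, sim):
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / len(obs)
    return cov / (_std(obs) * _std(sim))


def r2(obs, sim):
    """Squared Pearson correlation (assumed definition of R²)."""
    return pearson_r(obs, sim) ** 2


def nrmse(obs, sim):
    """RMSE divided by the observed range (assumed normalisation)."""
    rmse = math.sqrt(_mean([(s - o) ** 2 for o, s in zip(obs, sim)]))
    return rmse / (max(obs) - min(obs))


def abs_pbias(obs, sim):
    """Absolute percent bias, in percent."""
    return abs(100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs))


def kge(obs, sim):
    """Kling-Gupta efficiency: 1 - Euclidean distance from the ideal point (1, 1, 1)."""
    r = pearson_r(obs, sim)
    alpha = _std(sim) / _std(obs)    # variability ratio
    beta = _mean(sim) / _mean(obs)   # bias (mean) ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Illustrative values only: a reconstruction that overestimates every value by 10%.
observed = [1.0, 2.0, 3.0, 4.0]
reconstructed = [1.1, 2.2, 3.3, 4.4]
print(r2(observed, reconstructed), nrmse(observed, reconstructed),
      abs_pbias(observed, reconstructed), kge(observed, reconstructed))
```

In this toy case the correlation is perfect, so R² alone would hide the 10% overestimation; |PBIAS| and KGE are the indices that expose it, which is one reason to report them together.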

In addition, the computational runtime of each model was recorded to assess efficiency and potential trade-offs between accuracy and speed.
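One simple way to record runtimes of this kind is to wrap each model call in a wall-clock timer. The helper below is a hedged sketch using the standard library's `time.perf_counter`; the name `time_call` and the choice to keep the best of several repeats are assumptions, and the text does not state how the authors measured runtime.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Run func(*args, **kwargs) `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Usage with a stand-in workload; in practice func would be a model's fit/impute call.
seconds, total = time_call(sum, range(100_000))
print(f"best of 3 runs: {seconds:.6f} s")
```

Taking the minimum over repeats reduces the influence of unrelated system load, although a mean or median over runs is an equally defensible choice when comparing models.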

Results

The comparative analysis demonstrated that the hybrid boosting-based models substantially outperformed the basic MICE model in reconstructing missing climate data. The MICE–XGB model achieved the highest accuracy across most stations and variables, with mean values of R² > 0.90 and KGE > 0.92. The MICE–LGBM model followed closely, offering comparable performance with slightly lower computational cost. The lowest reconstruction errors were obtained for temperature and pressure-related variables, which exhibit smoother temporal patterns and stronger inter-variable correlations. In contrast, cloudiness (CLD) and evapotranspiration (ET) showed higher error values due to their nonlinear and discontinuous nature. These findings align with Li et al. (2021) and Badrzadeh et al. (2022), who reported similar challenges in reconstructing cloud and radiation data.

Across all models and stations, the absolute percent bias (|PBIAS|) remained below 0.025%, confirming the absence of systematic bias and the robustness of the reconstruction framework. Moreover, spatial evaluation revealed consistent model performance across stations with varying topography and elevation, highlighting the generalizability of the hybrid MICE–boosting methods.

Regarding computational efficiency, the MICE–XGB model, despite being slightly slower than MICE and MICE–LGBM, achieved a favorable balance between accuracy and runtime. Its runtime was less than half that of standard deep learning approaches reported in similar studies, indicating the computational practicality of boosting-based models for large-scale climatic applications.

Conclusions

This study demonstrated that integrating the MICE framework with boosting algorithms (GBR, XGB, and LGBM) improves the accuracy and reliability of climate data reconstruction relative to the baseline MICE model. Among the evaluated models, MICE–XGB provided the most consistent and accurate results, particularly for temperature and pressure variables. The very low |PBIAS| (below 0.025%) and high KGE values indicate close agreement between reconstructed and observed data, supporting the suitability of these hybrid methods for climatological and hydrological modeling. From a practical perspective, the findings highlight the potential of ensemble-based machine learning approaches for addressing missing data in meteorological datasets, especially in basins such as Urmia Lake, where data discontinuity seriously limits environmental analysis. The balance between computational efficiency and predictive precision makes these hybrid models strong candidates for operational climate monitoring systems and regional reanalysis datasets. Future work should focus on extending this framework to spatiotemporal imputation of gridded datasets, integrating remote sensing and reanalysis data, and exploring further hybrid models (e.g., MICE–CatBoost or MICE–LSTM) for improved temporal pattern reconstruction.

Funding

The authors received no specific funding for this work.

Authorship contribution

Conceptualization, M.J. and M.Sh.; methodology, M.J. and M.Sh.; software, M.Sh.; validation, M.J., M.Sh. and S.E.; formal analysis, M.J.; investigation, M.J. and M.Sh.; resources, M.J. and S.E.; data curation, M.J.; writing—original draft preparation, M.J.; writing—review and editing, M.Sh. and S.E. All authors have read and agreed to the published version of the manuscript.

Declaration of Generative AI and AI-assisted technologies in the writing process

The authors declare that no generative AI or AI-assisted technologies were used in the writing, analysis, or preparation of this manuscript. The authors take full responsibility for the content of this publication.

Data availability statement

Data are available on request from the authors.

Acknowledgements

We acknowledge the Iran Meteorological Organization for providing the historical data. The authors thank the anonymous reviewers for their valuable comments and suggestions.

Ethical considerations

The authors avoided data fabrication, falsification, plagiarism, and misconduct.

Conflict of interest

The authors declare no conflict of interest.

    Afrifa-Yamoah, E., Mueller, U. A., Taylor, S. M., & Fisher, A. J. (2020). Missing data imputation of high-resolution temporal climate time series data. Meteorological Applications, 27(1), e1873. https://doi.org/10.1002/met.1873

    Alejo-Sanchez, L. E., Márquez-Grajales, A., Salas-Martínez, F., Franco-Arcega, A., López-Morales, V., Acevedo-Sandoval, O. A., González-Ramírez, C. A., & Villegas-Vega, R. (2025). Missing data imputation of climate time series: A review. MethodsX, 15, 103455. https://doi.org/10.1016/j.mex.2025.103455

    Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40-49. https://doi.org/10.1002/mpr.329

    Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA. https://doi.org/10.1145/2939672.2939785

    Costa, T., Falcão, B., Mohamed, M. A., Annuk, A., & Marinho, M. (2024). Employing machine learning for advanced gap imputation in solar power generation databases. Scientific Reports, 14(1), 23801. https://doi.org/10.1038/s41598-02-74342-2

    Farzandi, M., Sanaeinejad, H., Rezaei-Pazhan, H., & Sarmad, M. (2022). Improving estimation of missing data in historical monthly precipitation by evolutionary methods in the semi-arid area. Environment, Development and Sustainability, 24(6), 8313-8332. https://doi.org/10.1007/s10668-021-01784-4

    Fazel Najafabadi, E., & Shayannejad, M. (2025). Evaluation the efficiency of machine learning boosting methods for estimating the water quality index of the Zayandeh Rood River. Iranian Journal of Soil and Water Research, 56(5), 1355-1378. https://doi.org/10.22059/ijswr.2025.392173.-669906

    Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189-1232. http://www.jstor.org/stable/2699986

    Golkhatmi, N. S. N., & Farzandi, M. (2024). Enhancing Rainfall Data Consistency and Completeness: A Spatiotemporal Quality Control Approach and Missing Data Reconstruction Using MICE on Large Precipitation Datasets. Water Resources Management, 38(3), 815-833. https://doi.org/10.1007/s11269-023-03567-0

    Gupta, H. V., Sorooshian, S., & Yapo, P. O. (1999). Status of Automatic Calibration for Hydrologic Models: Comparison with Multilevel Expert Calibration. Journal of Hydrologic Engineering, 4(2), 135-143. https://doi.org/10.1061/(ASCE)1084-0699(1999)4:2

    Gupta, H. V., Kling, H., Yilmaz, K. K., & Martinez, G. F. (2009). Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. Journal of Hydrology, 377(1), 80-91. https://doi.org/10.1016/j.jhydrol.20

    Hasanpour Kashani, M., & Dinpashoh, Y. (2012). Evaluation of efficiency of different estimation methods for missing climatological data. Stochastic Environmental Research and Risk Assessment, 26(1), 59-71. https://doi.org/10.1007/s00477-011-053

    Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., Muñoz-Sabater, J., Nicolas, J., Peubey, C., Radu, R., Schepers, D., Simmons, A., Soci, C., Abdalla, S., Abellan, X., Balsamo, G., Bechtold, P., Biavati, G., Bidlot, J., Bonavita, M., . . . Thépaut, J.-N. (2020). The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730), 1999-2049. https://doi.org/10.1002/qj.3803

    Hosseinpour, S., Sharafati, A., & Abghari, H. (2025). Downscaling of two selected GCM data using a hybrid deep learning method of Wavelet-CNN-LSTM in Iran. Theoretical and Applied Climatology, 156(9), 459. https://doi.org/10.1007/s00704-025-05685-8

    Jääskeläinen, E., Manninen, T., Hakkarainen, J., & Tamminen, J. (2022). Filling gaps of black-sky surface albedo of the Arctic sea ice using gradient boosting and brightness temperature data. International Journal of Applied Earth Observation and Geoinformation, 107, 102701. https://doi.org/10.1016/j.jag.2022.1

    Davari, S., Eslamian, S., Jamali, M., & Safavi, H. R. (2025). Application of machine learning algorithms for groundwater level prediction in the Najafabad plain. Scientific Reports, 14(3), 743-752. https://doi.org/10.1038/s41598-025-32376-1

    Jamali Jezeh, M., Shayannejad, M., & Hejazi, S. M. (2020). Evaluation the Performance of Filters Made of BC, PET and PP Textiles in Removing Oil Contaminants from Water. Journal of Water and Soil Science, 24(4), 295-312. https://doi.org/10.47176.jwss.24.4.42931

    Jamali, M., Gohari, A., & Akhavan Saraf, G. (2024). Spatiotemporal evaluation of temperature and precipitation extremes indices over Iran under the influence of climate change. Water and Irrigation Management, 14(3), 743-752. https://doi.org/10.22059/jwim.2024.374814.1156

    Jamali, M., Eslamian, S., Shayannejad, M., & Gohari, A. (2026). Observed warming–driven aridification and climate-type transitions across Iran. Journal of Arid Environments, 19(2), -. https://doi.org/10.1016/j.jaridenv.2026.105606

    Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.

    Khosravi, G., Nafarzadegan, A. R., Nohegar, A., Fathizadeh, H., & Malekian, A. (2015). A modified distance-weighted approach for filling annual precipitation gaps: application to different climates of Iran. Theoretical and Applied Climatology, 119(1), 33-42. https://doi.org/10.1007/s00704-014-1091-5

    Knoben, W. J. M., Freer, J. E., & Woods, R. A. (2019). Technical note: Inherent benchmark or not? Comparing Nash–Sutcliffe and Kling–Gupta efficiency scores. Hydrol. Earth Syst. Sci., 23(10), 4323-4331. https://doi.org/10.5194/hess-23-4323-2019

    Legates, D. R., & McCabe Jr, G. J. (1999). Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation. Water Resources Research, 35(1), 233-241. https://doi.org/10.1029/1998WR900018

    Little, R., & Rubin, D. (1987). Multiple imputation for nonresponse in surveys. Wiley, 10, 9780470316696.

    Matinzadeh, M. M., Fattahi, R., Shayanzadeh, M., & Abdollahi, K. (2013). Estimation and Reconstruction of Annual Maximum 24-H Rainfall Data Using Combination of Genetic Algorithm and Artificial Neural Networks Models (Case Study: Chaharmahal va Bakhtiyari Province). ijwmse, 7(22), 53. http://jwmsei.ir/article-1-245-fa.html

    Moriasi, D. N., Arnold, J. G., Van Liew, M. W., Bingner, R. L., Harmel, R. D., & Veith, T. L. (2007). Model Evaluation Guidelines for Systematic Quantification of Accuracy in Watershed Simulations. Transactions of the ASABE, 50(3), 885-900. https://doi.org/10.13031/2013.23153

    Plein, M., Feigel, G., Zeeman, M., Dormann, C. F., & Christen, A. (2025). Using Gradient Boosting for gap-filling to analyze temperature and humidity patterns in an urban weather station network in Freiburg, Germany. Urban Climate, 62, 102496. https://doi.org/10.1016/j.uclim.2025.102496

    Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177. https://doi.org/10.1037/1082-989X.7.2.147

    Van Buuren, S. (2000). Multivariate imputation by chained equations: MICE V1. 0 user's manual. Leiden: TNO.

    Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 45, 1-67.

    Willmott, C. J. (1981). On the validation of models. Physical Geography, 2(2), 184-194. https://doi.org/10.1080/02723646.1981.10642213