Decision Tree Algorithms in Water Quality Classification: A Comparative Study of Random Forest, XGBoost, and C5.0
Abstract
Safe drinking water is more than a convenience; public health officials often call it a cornerstone of survival. The United Nations International Children’s Emergency Fund (UNICEF) reported that roughly two billion people still drink water that is neither clean nor tested. Pathogenic bacteria from human feces and livestock waste taint roughly 70% of available sources, creating a silent epidemic. Water quality is commonly expressed in five levels (poor, marginal, fair, good, and excellent) under the Water Quality Index (WQI) designed by the Canadian Council of Ministers of the Environment (CCME). This research measured the performance of three decision-tree-based classifiers, Random Forest, XGBoost, and C5.0, in predicting water quality. The preprocessing pipeline involved label encoding, the Synthetic Minority Over-sampling Technique (SMOTE) to balance the imbalanced classes, and an exploratory phase to examine outliers and irregularities in the dataset. According to the findings, Random Forest achieved the highest test accuracy, at 98%. XGBoost and C5.0 followed close behind at about 96%, but C5.0 was the fastest of the three, making it preferable when a time-sensitive or emergency decision is needed. In short, this research highlights the importance of combining modern preprocessing tools with machine learning algorithms in monitoring water quality.
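The two preprocessing steps named in the abstract, label encoding and SMOTE, can be illustrated with a minimal standard-library sketch. This is not the study's implementation: the feature values, the helper names `label_encode` and `smote`, and the choice of k = 2 neighbours are all assumptions made for illustration (library implementations of SMOTE typically default to five neighbours and handle many more edge cases).

```python
import math
import random

# The five CCME WQI classes named in the abstract, ordered worst to best.
WQI_CLASSES = ["poor", "marginal", "fair", "good", "excellent"]


def label_encode(labels, classes=WQI_CLASSES):
    """Map each class name to its integer position in `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in labels]


def smote(samples, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: math.dist(base, s))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: two made-up features per sample (not from the study).
features = [[7.1, 3.0], [7.3, 2.8], [6.9, 3.4], [7.0, 3.1],
            [8.4, 9.5], [8.6, 9.9], [8.5, 9.1]]
labels = ["good", "good", "good", "good", "poor", "poor", "poor"]

encoded = label_encode(labels)
poor, good = WQI_CLASSES.index("poor"), WQI_CLASSES.index("good")

# Oversample the minority class ("poor") until it matches the majority count.
minority = [x for x, y in zip(features, encoded) if y == poor]
n_needed = encoded.count(good) - encoded.count(poor)
balanced_features = features + smote(minority, n_needed)
balanced_labels = encoded + [poor] * n_needed
```

Because each synthetic point lies on a segment between two existing minority samples, oversampling of this kind adds no information from outside the minority class; it is usually applied to the training split only, so that test accuracy is measured on real samples.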
References
Dikananda, A. R., S. Jumini, N. Tarihoran, S. Christinawati, W. Trimastuti, and R. Rahim (2022). Comparison of Decision Tree Classification Methods and Gradient Boosted Trees. TEM Journal, 11(1); 316–322
Eyring, V., W. D. Collins, P. Gentine, E. A. Barnes, M. Barreiro, T. Beucler, and L. Zanna (2024). Pushing the Frontiers in Climate Modelling and Analysis with Machine Learning. Nature Climate Change, 14(9); 916–928
Filho, W. L., A. M. Azul, L. Brandli, A. L. Salvia, and T. Wall (2022). Clean Water and Sanitation. Springer International Publishing
Fitriyana, S., A. Syarif, F. Rossyking, and M. R. Faisal (2024). Application of Random Forest Method Classification for Glycosylation in Lysine Protein Sequences. Integra: Journal of Integrated Mathematics and Computer Science, 1(2); 49–54
García, S., J. Luengo, F. Herrera, et al. (2015). Data Preprocessing in Data Mining, volume 72. Springer
García, S., S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera (2016). Big Data Preprocessing: Methods and Prospects. Big Data Analytics, 1(1); 1–22
Ghazvini, A., J. Awwalu, and A. Abu Bakar (2014). Comparative Analysis of Algorithms in Supervised Classification: A Case Study of Bank Notes Dataset. International Journal of Computer Trends and Technology, 17(1); 39–43
Gikas, G. D., G. K. Sylaios, V. A. Tsihrintzis, I. K. Konstantinou, T. Albanis, and I. Boskidis (2020). Comparative Evaluation of River Chemical Status Based on WFD Methodology and CCME Water Quality Index. Science of the Total Environment, 745; 140849
He, Y.-H. (2023). Machine Learning in Pure Mathematics and Theoretical Physics. World Scientific
Holcomb, D. A. and J. R. Stewart (2020). Microbial Indicators of Fecal Pollution: Recent Progress and Challenges in Assessing Water Quality. Current Environmental Health Reports, 7; 311–324
Ilić, M., Z. Srdjević, and B. Srdjević (2022). Water Quality Prediction Based on Naive Bayes Algorithm. Water Science and Technology, 85(4); 1027–1039
Joloudari, J. H., M. Haderbadi, A. Mashmool, M. Ghasemigol, S. S. Band, and A. Mosavi (2020). Early Detection of the Advanced Persistent Threat Attack Using Performance Analysis of Deep Learning. IEEE Access, 8; 186125–186137
Komorowski, M., D. C. Marshall, J. D. Salciccioli, and Y. Crutain (2016). Exploratory Data Analysis. In MIT Critical Data, editor, Secondary Analysis of Electronic Health Records. Springer Cham, pages 185–203
Majumder, M. G., S. D. Gupta, and J. Paul (2022). Perceived Usefulness of Online Customer Reviews: A Review Mining Approach Using Machine Learning & Exploratory Data Analysis. Journal of Business Research, 150; 147–164
Markoulidakis, I., G. Kopsiaftis, I. Rallis, and I. Georgoulas (2021). Multi-Class Confusion Matrix Reduction Method and Its Application on Net Promoter Score Classification Problem. In Proceedings of the 14th Pervasive Technologies Related to Assistive Environments Conference. pages 412–419
Mutoffar, M. M., M. Naseer, and A. Fadillah (2022). Klasifikasi Kualitas Air Sumur Menggunakan Algoritma Random Forest. NARATIF: Jurnal Ilmiah Nasional Riset Aplikasi dan Teknik Informatika, 4(2); 138–146
Myint, K. L. and H. H. K. Tin (2021). Analyzing the Comparison of C4.5, CART and C5.0 Algorithms on Heart Disease Dataset Using Decision Tree Method. In ICIDSSD 2020. page 174
Nasir, N., A. Kansal, O. Alshaltone, F. Barneih, M. Sameer, A. Shanableh, and A. Al-Shamma’a (2022). Water Quality Classification Using Machine Learning Algorithms. Journal of Water Process Engineering, 48; 102920
Nurdin, M., Wamiliana, A. Junaidi, and F. R. Lumbanraja (2024). Comparison of Support Vector Regression and Random Forest Regression Performance in Vehicle Fuel Consumption Prediction. Integra: Journal of Integrated Mathematics and Computer Science, 1(2); 60–67
Pandya, R. and J. Pandya (2015). C5.0 Algorithm to Improved Decision Tree with Feature Selection and Reduced Error Pruning. International Journal of Computer Applications, 117(16); 18–21
Pansara, R. R., B. Y. Kasula, A. B. Bhatia, and P. Whig (2024). Enhancing Sustainable Development Through Machine Learning-Driven Master Data Management. In International Conference on Sustainable Development Through Machine Learning, AI and IoT. pages 332–341
Parmar, A., R. Katariya, and V. Patel (2019). A Review on Random Forest: An Ensemble Classifier. In International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018. pages 758–763
Pavlov, Y. L. (2000). Random Forests. VSP
Pradhan, P., L. Costa, D. Rybski, W. Lucht, and J. P. Kropp (2017). A Systematic Study of Sustainable Development Goal (SDG) Interactions. Earth’s Future, 5(11); 1169–1179
Pradipta, G. A., R. Wardoyo, A. Musdholifah, I. N. H. Sanjaya, and M. Ismail (2021). SMOTE for Handling Imbalanced Data Problem: A Review. In 2021 Sixth International Conference on Informatics and Computing (ICIC). pages 1–8
Primajaya, A. and B. N. Sari (2018). Random Forest Algorithm for Prediction of Precipitation. Indonesian Journal of Artificial Intelligence and Data Mining (IJAIDM), 1(1); 27–31
Prusty, S., S. Patnaik, and S. K. Dash (2022). SKCV: Stratified K-Fold Cross-Validation on ML Classifiers for Predicting Cervical Cancer. Frontiers in Nanotechnology, 4; 972421
Rolnick, D., P. L. Donti, L. H. Kaack, K. Kochanski, A. Lacoste, K. Sankaran, A. S. Ross, N. Milojevic-Dupont, N. Jaques, A. Waldman-Brown, et al. (2022). Tackling Climate Change with Machine Learning. ACM Computing Surveys (CSUR), 55(2); 1–96
Shen, C. (2018). A Transdisciplinary Review of Deep Learning Research and Its Relevance for Water Resources Scientists. Water Resources Research, 54(11); 8558–8593
Theissler, A., M. Thomas, M. Burch, and F. Gerschner (2022). ConfusionVis: Comparative Evaluation and Selection of Multi-Class Classifiers Based on Confusion Matrices. Knowledge-Based Systems, 247; 108651
UNICEF (2019). Progress on Household Drinking Water, Sanitation and Hygiene, 2000–2017. Accessed: 2025-06-13
XGBoost Developers (2023). XGBoost Release 1.5.0-Dev. Accessed: 2025-06-13
Xu, X., T. Lai, S. Jahan, F. Farid, and A. Bello (2022). A Machine Learning Predictive Model to Detect Water Quality and Pollution. Future Internet, 14(11); 324
Yogeshwari, A., J. Anubama, M. J. M. M. Jenitha, M. C. Geetha, M. D. Ramalakshmi, and B. Pavithra (2023). Water Quality Prediction Using CatBoost Classifier Algorithm. Journal of Survey in Fisheries Sciences, 10(4S); 2850–2855
Yuan, Q., H. Shen, T. Li, Z. Li, S. Li, Y. Jiang, H. Xu, W. Tan, Q. Yang, J. Wang, et al. (2020). Deep Learning in Environmental Remote Sensing: Achievements and Challenges. Remote Sensing of Environment, 241; 111716
Zhu, M., J. Wang, X. Yang, Y. Zhang, L. Zhang, H. Ren, L. Ye, et al. (2022). A Review of the Application of Machine Learning in Water Quality Evaluation. Eco-Environment & Health, 1(2); 107–116
This work is licensed under a Creative Commons Attribution 4.0 International License.