Decision Tree Algorithms in Water Quality Classification: A Comparative Study of Random Forest, XGBoost, and C5.0

Dewi Asiah Shofiana, Melan Caniadi, Ridho Sholehurrohman, Aristoteles

Abstract

Safe drinking water is more than a convenience; public health officials often call it a cornerstone of survival. The United Nations International Children’s Emergency Fund (UNICEF) reported that roughly two billion people still drink water that is neither clean nor tested. Pathogenic bacteria from human feces and livestock waste taint roughly 70% of available sources, creating a silent epidemic. Water quality is commonly expressed in five levels (poor, marginal, fair, good, and excellent) under the Water Quality Index (WQI) designed by the Canadian Council of Ministers of the Environment (CCME). This research measured the performance of three decision-tree classifiers, Random Forest, XGBoost, and C5.0, in predicting water quality. The preprocessing pipeline involved label encoding, the Synthetic Minority Over-sampling Technique (SMOTE) to balance imbalanced classes, and an exploratory phase to examine outliers and irregularities within the dataset. According to the findings, Random Forest achieved the highest test accuracy at 98%. XGBoost and C5.0 followed close behind at about 96%, but C5.0 turned out to be the fastest of the three, making it preferable when a time-sensitive or emergency decision is needed. In short, this research highlights the importance of modern preprocessing tools combined with machine learning algorithms in monitoring water quality.
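The class-balancing step mentioned in the abstract can be illustrated with a minimal sketch of the core idea behind SMOTE: a synthetic minority sample is placed at a random position on the line segment between an existing minority sample and one of its nearest minority neighbours. This is not the authors' implementation. The function name `smote_like_oversample`, the parameter values, and the sample points are all hypothetical and chosen only for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea).
    Hypothetical helper for illustration, not a library API."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class (values are illustrative only).
minority_samples = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_oversample(minority_samples, n_new=6)
balanced_minority = minority_samples + new_points
print(len(balanced_minority))
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies. In practice, oversampling of this kind is usually applied to the training split only, so that the test set still reflects the original class distribution.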

References

Chen, T. and C. Guestrin (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pages 785–794

Dikananda, A. R., S. Jumini, N. Tarihoran, S. Christinawati, W. Trimastuti, and R. Rahim (2022). Comparison of Decision Tree Classification Methods and Gradient Boosted Trees. TEM Journal, 11(1); 316–322

Eyring, V., W. D. Collins, P. Gentine, E. A. Barnes, M. Barreiro, T. Beucler, and L. Zanna (2024). Pushing the Frontiers in Climate Modelling and Analysis with Machine Learning. Nature Climate Change, 14(9); 916–928

Filho, W. L., A. M. Azul, L. Brandli, A. L. Salvia, and T. Wall (2022). Clean Water and Sanitation. Springer International Publishing

Fitriyana, S., A. Syarif, F. Rossyking, and M. R. Faisal (2024). Application of Random Forest Method Classification for Glycosylation in Lysine Protein Sequences. Integra: Journal of Integrated Mathematics and Computer Science, 1(2); 49–54

García, S., J. Luengo, F. Herrera, et al. (2015). Data Preprocessing in Data Mining, volume 72. Springer

García, S., S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera (2016). Big Data Preprocessing: Methods and Prospects. Big Data Analytics, 1(1); 1–22

Ghazvini, A., J. Awwalu, and A. Abu Bakar (2014). Comparative Analysis of Algorithms in Supervised Classification: A Case Study of Bank Notes Dataset. International Journal of Computer Trends and Technology, 17(1); 39–43

Gikas, G. D., G. K. Sylaios, V. A. Tsihrintzis, I. K. Konstantinou, T. Albanis, and I. Boskidis (2020). Comparative Evaluation of River Chemical Status Based on WFD Methodology and CCME Water Quality Index. Science of the Total Environment, 745; 140849

He, Y.-H. (2023). Machine Learning in Pure Mathematics and Theoretical Physics. World Scientific

Holcomb, D. A. and J. R. Stewart (2020). Microbial Indicators of Fecal Pollution: Recent Progress and Challenges in Assessing Water Quality. Current Environmental Health Reports, 7; 311–324

Ilić, M., Z. Srdjević, and B. Srdjević (2022). Water Quality Prediction Based on Naive Bayes Algorithm. Water Science and Technology, 85(4); 1027–1039

Joloudari, J. H., M. Haderbadi, A. Mashmool, M. Ghasemigol, S. S. Band, and A. Mosavi (2020). Early Detection of the Advanced Persistent Threat Attack Using Performance Analysis of Deep Learning. IEEE Access, 8; 186125–186137

Komorowski, M., D. C. Marshall, J. D. Salciccioli, and Y. Crutain (2016). Exploratory Data Analysis. In MIT Critical Data, editor, Secondary Analysis of Electronic Health Records. Springer Cham, pages 185–203

Majumder, M. G., S. D. Gupta, and J. Paul (2022). Perceived Usefulness of Online Customer Reviews: A Review Mining Approach Using Machine Learning & Exploratory Data Analysis. Journal of Business Research, 150; 147–164

Markoulidakis, I., G. Kopsiaftis, I. Rallis, and I. Georgoulas (2021). Multi-Class Confusion Matrix Reduction Method and Its Application on Net Promoter Score Classification Problem. In Proceedings of the 14th Pervasive Technologies Related to Assistive Environments Conference. pages 412–419

Mutoffar, M. M., M. Naseer, and A. Fadillah (2022). Klasifikasi Kualitas Air Sumur Menggunakan Algoritma Random Forest. NARATIF: Jurnal Ilmiah Nasional Riset Aplikasi dan Teknik Informatika, 4(2); 138–146

Myint, K. L. and H. H. K. Tin (2021). Analyzing the Comparison of C4.5, CART and C5.0 Algorithms on Heart Disease Dataset Using Decision Tree Method. In ICIDSSD 2020. page 174

Nasir, N., A. Kansal, O. Alshaltone, F. Barneih, M. Sameer, A. Shanableh, and A. Al-Shamma’a (2022). Water Quality Classification Using Machine Learning Algorithms. Journal of Water Process Engineering, 48; 102920

Nurdin, M., Wamiliana, A. Junaidi, and F. R. Lumbanraja (2024). Comparison of Support Vector Regression and Random Forest Regression Performance in Vehicle Fuel Consumption Prediction. Integra: Journal of Integrated Mathematics and Computer Science, 1(2); 60–67

Pandya, R. and J. Pandya (2015). C5.0 Algorithm to Improved Decision Tree with Feature Selection and Reduced Error Pruning. International Journal of Computer Applications, 117(16); 18–21

Pansara, R. R., B. Y. Kasula, A. B. Bhatia, and P. Whig (2024). Enhancing Sustainable Development Through Machine Learning-Driven Master Data Management. In International Conference on Sustainable Development Through Machine Learning, AI and IoT. pages 332–341

Parmar, A., R. Katariya, and V. Patel (2019). A Review on Random Forest: An Ensemble Classifier. In International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018. pages 758–763

Pavlov, Y. L. (2000). Random Forests. VSP

Pradhan, P., L. Costa, D. Rybski, W. Lucht, and J. P. Kropp (2017). A Systematic Study of Sustainable Development Goal (SDG) Interactions. Earth’s Future, 5(11); 1169–1179

Pradipta, G. A., R. Wardoyo, A. Musdholifah, I. N. H. Sanjaya, and M. Ismail (2021). SMOTE for Handling Imbalanced Data Problem: A Review. In 2021 Sixth International Conference on Informatics and Computing (ICIC). pages 1–8

Primajaya, A. and B. N. Sari (2018). Random Forest Algorithm for Prediction of Precipitation. Indonesian Journal of Artificial Intelligence and Data Mining (IJAIDM), 1(1); 27–31

Prusty, S., S. Patnaik, and S. K. Dash (2022). SKCV: Stratified K-Fold Cross-Validation on ML Classifiers for Predicting Cervical Cancer. Frontiers in Nanotechnology, 4; 972421

Rolnick, D., P. L. Donti, L. H. Kaack, K. Kochanski, A. Lacoste, K. Sankaran, A. S. Ross, N. Milojevic-Dupont, N. Jaques, A. Waldman-Brown, et al. (2022). Tackling Climate Change with Machine Learning. ACM Computing Surveys (CSUR), 55(2); 1–96

Shen, C. (2018). A Transdisciplinary Review of Deep Learning Research and Its Relevance for Water Resources Scientists. Water Resources Research, 54(11); 8558–8593

Theissler, A., M. Thomas, M. Burch, and F. Gerschner (2022). ConfusionVis: Comparative Evaluation and Selection of Multi-Class Classifiers Based on Confusion Matrices. Knowledge-Based Systems, 247; 108651

UNICEF (2019). Progress on Household Drinking Water, Sanitation and Hygiene, 2000–2017. Accessed: 2025-06-13

XGBoost Developers (2023). XGBoost Release 1.5.0-Dev. Accessed: 2025-06-13

Xu, X., T. Lai, S. Jahan, F. Farid, and A. Bello (2022). A Machine Learning Predictive Model to Detect Water Quality and Pollution. Future Internet, 14(11); 324

Yogeshwari, A., J. Anubama, M. J. M. M. Jenitha, M. C. Geetha, M. D. Ramalakshmi, and B. Pavithra (2023). Water Quality Prediction Using CatBoost Classifier Algorithm. Journal of Survey in Fisheries Sciences, 10(4S); 2850–2855

Yuan, Q., H. Shen, T. Li, Z. Li, S. Li, Y. Jiang, H. Xu, W. Tan, Q. Yang, J. Wang, et al. (2020). Deep Learning in Environmental Remote Sensing: Achievements and Challenges. Remote Sensing of Environment, 241; 111716

Zhu, M., J. Wang, X. Yang, Y. Zhang, L. Zhang, H. Ren, L. Ye, et al. (2022). A Review of the Application of Machine Learning in Water Quality Evaluation. Eco-Environment & Health, 1(2); 107–116

Authors

Dewi Asiah Shofiana
dewi.asiah@fmipa.unila.ac.id (Primary Contact)
Melan Caniadi
Ridho Sholehurrohman
Aristoteles
Shofiana, D. A., Caniadi, M., Sholehurrohman, R., & Aristoteles. (2025). Decision Tree Algorithms in Water Quality Classification: A Comparative Study of Random Forest, XGBoost, and C5.0. Science and Technology Indonesia, 10(4), 999–1011. https://doi.org/10.26554/sti.2025.10.4.999-1011
