Handling Missing Data Using Combination of Deletion Technique, Mean, Mode and Artificial Neural Network Imputation for Heart Disease Dataset
Abstract
The University of California Irvine (UCI) Heart Disease dataset has missing data in several attributes. Missing data can cause the loss of important information in those attributes, but the affected data cannot simply be deleted from the dataset. Missing data can be handled in several ways, including deletion, imputation by mean or mode, and imputation by prediction methods. In this study, an attribute was deleted if more than 70% of its values were missing. Attributes with 1% or less missing data were imputed with the mean or the mode. The remaining attributes, those with more than 1% missing data, were imputed with an artificial neural network. The handling techniques were evaluated through the performance of classification methods on the dataset after its missing data had been handled. The classification methods used were Artificial Neural Network, Naïve Bayes, Support Vector Machine, and K-Nearest Neighbor. Their performance without missing-data handling was compared with their performance after imputation in terms of accuracy, sensitivity, specificity, and ROC. The Mean Squared Error was also compared to see how close the predicted labels were to the original labels. The lowest Mean Squared Error was obtained by the Artificial Neural Network, which indicates that it performed better than the other methods on the dataset after missing-data handling. The accuracy, specificity, and sensitivity results for each classification method showed that imputing missing data could increase classification performance, especially for the Artificial Neural Network.
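The threshold rule described in the abstract can be illustrated with a minimal Python sketch. Only the 70% and 1% thresholds come from the abstract. The function names, the use of `None` as the missing-value marker, and the toy columns (`age`, `sex`, `chol`, `sparse_attr`) are hypothetical choices made for illustration and are not taken from the study. The neural network imputation step itself is not implemented here; the sketch only reports which attributes would be routed to it.

```python
from collections import Counter
from statistics import mean

DROP_THRESHOLD = 0.70    # abstract: delete an attribute with more than 70% missing
SIMPLE_THRESHOLD = 0.01  # abstract: mean/mode imputation at 1% or less missing


def missing_fraction(values):
    """Fraction of entries equal to None (assumed missing-value marker)."""
    return sum(v is None for v in values) / len(values)


def choose_strategy(values):
    """Map an attribute's missing fraction to one of the handling strategies."""
    frac = missing_fraction(values)
    if frac > DROP_THRESHOLD:
        return "delete"
    if frac == 0:
        return "keep"
    if frac <= SIMPLE_THRESHOLD:
        return "mean_or_mode"
    return "ann"


def impute_simple(values, categorical):
    """Fill missing entries with the mode (categorical) or the mean (numeric)."""
    observed = [v for v in values if v is not None]
    if categorical:
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = mean(observed)
    return [fill if v is None else v for v in values]


def handle_missing(table, categorical_columns):
    """Apply deletion and mean/mode imputation; list attributes left for the ANN."""
    cleaned, needs_ann = {}, []
    for name, values in table.items():
        strategy = choose_strategy(values)
        if strategy == "delete":
            continue  # attribute is dropped entirely
        if strategy == "mean_or_mode":
            values = impute_simple(values, name in categorical_columns)
        elif strategy == "ann":
            needs_ann.append(name)  # left untouched for a predictive imputer
        cleaned[name] = values
    return cleaned, needs_ann


# Toy table with 200 rows per attribute (hypothetical data, not the UCI dataset).
table = {
    "age": [50] * 199 + [None],                 # 0.5% missing, numeric
    "sex": [1] * 120 + [0] * 79 + [None],       # 0.5% missing, categorical
    "chol": [200] * 180 + [None] * 20,          # 10% missing
    "sparse_attr": [1] * 50 + [None] * 150,     # 75% missing
}
cleaned, needs_ann = handle_missing(table, categorical_columns={"sex"})
print(sorted(cleaned), needs_ann)
```

In this sketch the attributes returned in `needs_ann` still contain missing entries; a predictive model trained on the complete attributes would fill them in a later step.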
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.