LSTM-CNN Hybrid Model Performance Improvement with BioWordVec for Biomedical Report Big Data Classification

Dian Kurniasari, Warsono, Mustofa Usman, Favorisen Rosyking Lumbanraja, Wamiliana

Abstract

Rising leukemia mortality has driven rapid growth in publications on the disease. This growth has greatly enlarged the biomedical literature, making manual extraction of relevant material on leukemia increasingly difficult. Text classification is one approach to retrieving relevant, high-quality information from the biomedical literature. This research proposes an LSTM-CNN hybrid model for imbalanced data classification on a dataset of PubMed abstracts centred on leukemia. Random Undersampling and Random Oversampling are combined to address the data imbalance. The classification model’s performance is improved by using BioWordVec, a pre-trained word embedding created specifically for the biomedical domain. Model evaluation indicates that hybrid resampling combined with domain-specific pre-trained word embeddings can enhance classification performance, achieving accuracy, precision, recall, and F1-score of 99.55%, 99%, 100%, and 99%, respectively. The results suggest that this approach could serve as an alternative technique for retrieving information about leukemia.
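The hybrid resampling idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `hybrid_resample`, the `target_per_class` parameter, the class labels, and the toy data are all hypothetical, and the abstract does not state the target class size the study used. The sketch only shows the general technique, in which classes larger than a chosen target are randomly undersampled and smaller classes are randomly oversampled with replacement.

```python
import random
from collections import defaultdict


def hybrid_resample(texts, labels, target_per_class, seed=0):
    """Bring every class to target_per_class items (hypothetical helper).

    Classes with more items than the target are randomly undersampled;
    classes with fewer items are randomly oversampled with replacement.
    """
    rng = random.Random(seed)

    # Group the documents by their class label.
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target_per_class:
            # Random undersampling: keep a subset drawn without replacement.
            chosen = rng.sample(items, target_per_class)
        else:
            # Random oversampling: keep all items, then add duplicates
            # drawn with replacement until the target size is reached.
            extra = rng.choices(items, k=target_per_class - len(items))
            chosen = items + extra
        out_texts.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_texts, out_labels


# Toy, made-up data: 9 documents in one class and 3 in another.
texts = [f"abstract {i}" for i in range(12)]
labels = ["majority"] * 9 + ["minority"] * 3

bal_texts, bal_labels = hybrid_resample(texts, labels, target_per_class=6)
```

In this toy run both classes end up with six documents: the larger class loses three and the smaller class gains three duplicates. Resampling of this kind is normally applied to the training split only, so that duplicated documents do not leak into the evaluation data.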

Authors

Dian Kurniasari
Warsono
warsono.1963@fmipa.unila.ac.id (Primary Contact)
Mustofa Usman
Favorisen Rosyking Lumbanraja
Wamiliana
Kurniasari, D., Warsono, Usman, M., Lumbanraja, F. R., & Wamiliana. (2024). LSTM-CNN Hybrid Model Performance Improvement with BioWordVec for Biomedical Report Big Data Classification. Science and Technology Indonesia, 9(2), 273–283. https://doi.org/10.26554/sti.2024.9.2.273-283
