The Development of Classification Algorithm Models on Spam SMS Using Feature Selection and SMOTE

Rachma Chrysanti; Sony Hartono Wijaya; Toto Haryanto

doi:10.33096/ilkom.v16i3.2220.356-370

Rachma Chrysanti (1*); Sony Hartono Wijaya (2); Toto Haryanto (3)

(1) IPB University
(2) IPB University
(3) IPB University
(*) Corresponding Author

Abstract

Short Message Service (SMS) is a widely used communication medium. The growing use of SMS has, however, led to the emergence of SMS spam, which often disturbs cellphone users. A classification model that filters SMS spam is therefore important for minimizing the disruption and losses that spam causes. This study focuses on classifying and analyzing SMS spam on a cellular operator service of a telecommunications company using machine learning techniques. It applies two classification methods, Naïve Bayes and Support Vector Machine (SVM), in combination with Chi-square and Information Gain feature selection to remove irrelevant features. It also applies the Synthetic Minority Oversampling Technique (SMOTE) to balance the imbalanced classes. The results show that SMOTE improves classification performance. Model 7 (SVM) achieves an accuracy of 98.55% in classifying SMS as spam or not spam, and performance improves when SVM is combined with SMOTE: Model 10 (SVM + SMOTE) achieves an F1-score of 99.23%, outperforming all other models. These results indicate that SVM detects spam SMS better than Naïve Bayes, which showed lower accuracy. They also illustrate the effectiveness of combining machine learning models with data balancing to improve classification accuracy, and highlight the model that showed the most substantial improvement in performance.
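The feature selection and balancing steps named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`chi_square`, `information_gain`, `smote_sample`), the 2x2 term/class contingency layout, and the neighbor count `k` are assumptions made for illustration only, and the abstract does not state the settings the study actually used.

```python
import math
import random


def chi_square(n11, n10, n01, n00):
    """Chi-square score of one term against the spam class.

    n11: spam messages containing the term
    n10: non-spam messages containing the term
    n01: spam messages without the term
    n00: non-spam messages without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(n11, n10, n01, n00):
    """Reduction in class entropy once term presence is known."""
    n = n11 + n10 + n01 + n00
    present, absent = n11 + n10, n01 + n00
    conditional = (present / n) * entropy([n11, n10]) \
        + (absent / n) * entropy([n01, n00])
    return entropy([n11 + n01, n10 + n00]) - conditional


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic minority point in the style of SMOTE.

    A random minority point is picked, one of its k nearest minority
    neighbors is chosen, and the new point is placed at a random
    position on the line segment between the two.
    """
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (q - b) for b, q in zip(base, neighbor)]
```

Under these assumptions, terms would be ranked by either score and only the top-ranked ones kept as features, while `smote_sample` would be called repeatedly on the minority-class training vectors until the classes are balanced. Oversampling is normally applied to the training split only, so that synthetic points do not leak into evaluation.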

Keywords

Chi-square; Feature Selection; Naïve Bayes; SMOTE; Spam Detection.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
ILKOM Jurnal Ilmiah
ISSN 2548-7779
Published by Prodi Teknik Informatika FIK Universitas Muslim Indonesia
W : https://fikom.umi.ac.id/
E : jurnal.ilkom@umi.ac.id