Comparative Study of Random Forest and Ordinal Regression in Concept Map Quality Assessment: The Role of TF-IDF, BERT, and SMOTE-based Balancing


Nurul Rismayanti(1); Didik Dwi Prasetya(2*); Triyanna Widiyaningtyas(3); Tsukasa Hirashima(4)

(1) Universitas Negeri Malang
(2) Universitas Negeri Malang
(3) Universitas Negeri Malang
(4) Hiroshima University
(*) Corresponding Author

  

Abstract


Automatic assessment of concept map quality is an important challenge in education, particularly for evaluating students' conceptual understanding objectively and efficiently. This study compares the performance of two machine learning algorithms, Random Forest and Ordinal Regression, in classifying the quality of concept maps. The models were evaluated with three text feature representations: Term Frequency-Inverse Document Frequency (TF-IDF), Bidirectional Encoder Representations from Transformers (BERT), and a combination of both (TF-IDF + BERT). The study also compares model performance under two dataset conditions, the original data and data balanced with the Synthetic Minority Over-sampling Technique (SMOTE), to address the class imbalance that often occurs in educational data. The dataset consists of propositions from students' concept maps, each labeled with an ordinal quality score. Text representations were extracted using TF-IDF and BERT and then used as input to the classification models. Performance was evaluated using Accuracy, Precision, Recall, F1-score, Cohen’s Kappa, and Mean Absolute Error (MAE). The Ordinal Regression model with TF-IDF representation combined with SMOTE achieved the best performance, with an accuracy of 0.8777, an F1-score of 0.8773, and a Cohen’s Kappa of 0.7701. These results indicate that classical feature representations such as TF-IDF remain effective in limited-data scenarios, and that SMOTE improved model performance by reducing bias toward the majority class. This research contributes to the development of automatic concept map assessment systems and suggests classification strategies suited to educational datasets with ordinal and imbalanced characteristics.
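To make the ordinal-aware metrics named in the abstract concrete, the following is a minimal sketch of how Accuracy, Cohen’s Kappa, and MAE can be computed for integer-coded quality scores. It is not the authors' implementation: the function names, the three-level score scale, and the example labels are hypothetical, and the kappa shown is the unweighted variant, which the abstract does not confirm.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average distance between true and predicted ordinal scores.

    Unlike accuracy, this penalizes a prediction that is two levels off
    more heavily than one that is a single level off.
    """
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies of both raters.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Kappa is undefined here; returning 0.0 is this sketch's convention.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical quality scores on an assumed 1-3 ordinal scale.
y_true = [1, 1, 2, 2, 3, 3, 3, 3]
y_pred = [1, 2, 2, 2, 3, 3, 3, 1]

acc = accuracy(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy={acc:.4f} mae={mae:.4f} kappa={kappa:.4f}")
```

In this toy example, two of the eight predictions are wrong, but one of them misses by two levels, so MAE separates the two errors while accuracy treats them identically. That distinction is one reason MAE is a natural companion metric for ordinal labels.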

Keywords


Concept Map Classification; Random Forest; Ordinal Regression; SMOTE; TF-IDF

  
  


Digital Object Identifier

https://doi.org/10.33096/ilkom.v17i3.2906.336-345
  






Copyright (c) 2025 Nurul Rismayanti, Didik Dwi Prasetya, Triyanna Widiyaningtyas, Tsukasa Hirashima

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.