Handling Imbalanced Churn Data Using Weighting in Supervised Machine Learning Models


Sitti Nurhaliza(1*); Andi Harismahyanti(2); Morina A Fathan(3); Muhammad Edy Rizal(4); Muh. Zarkawi Yahya(5); Asfar Asfar(6)

(1) Data Science Study Program, Universitas Tadulako, Palu, Indonesia
(2) Data Science Study Program, Universitas Tadulako, Palu, Indonesia
(3) Statistics Study Program, Universitas Tadulako, Palu, Indonesia
(4) Data Science Study Program, Universitas Tadulako, Palu, Indonesia
(5) Statistics Study Program, Universitas Tadulako, Palu, Indonesia
(6) Data Science Study Program, Universitas Tadulako, Palu, Indonesia
(*) Corresponding Author

  

Abstract


Customer churn is a strategic challenge in the digital industry because it directly affects revenue and the cost of acquiring new customers. A major obstacle in building churn prediction models is class imbalance: in the data studied here, churned customers make up only 11.4% of observations against 88.6% non-churn, an imbalance ratio of nearly 8:1. This imbalance can reduce a model's sensitivity to the minority class. Although many imbalance-handling techniques have been studied, few studies systematically evaluate the effectiveness of class weighting on basic classification models in a churn setting with extreme imbalance. This study evaluates the effectiveness of class weighting in improving the performance of churn classification models on 2019 telecommunications data from JABODETABEK (the Jakarta metropolitan area). A supervised machine learning approach is used with five algorithms: logistic regression, K-Nearest Neighbors (KNN), decision tree, naive Bayes, and random forest. Evaluation uses stratified 5-fold cross-validation and metrics suited to imbalanced data, namely recall, specificity, F1-score, and AUC-ROC. The results show that class weighting significantly improves recall, particularly for the decision tree, KNN, and naive Bayes models. The Naive Bayes Balanced model performs best, with recall above 0.75, at the cost of a slight decrease in specificity. Overall, class weighting reduces bias toward the majority class and yields better-balanced metrics. These findings confirm that class balancing, although simple, remains crucial for identifying customers at risk of churn more accurately, and can serve as a practical reference for developing customer retention systems in the telecommunications sector.
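The evaluation protocol described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the use of Python and scikit-learn, the synthetic data from `make_classification`, the choice of logistic regression as the single model shown, and the helper names `balanced_weights` and `evaluate` are all assumptions made for illustration. The weighting scheme is also assumed to be the common "balanced" heuristic, where each class gets weight n_samples / (n_classes * class_count); the abstract does not state the exact formula used.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold


def balanced_weights(y):
    """Hypothetical helper: weight_c = n_samples / (n_classes * count_c)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


def evaluate(model, X, y, n_splits=5, seed=0):
    """Hypothetical helper: mean recall, specificity, F1 and AUC-ROC
    over stratified folds, with churn encoded as class 1."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for train_idx, test_idx in cv.split(X, y):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        pred = fitted.predict(X[test_idx])
        prob = fitted.predict_proba(X[test_idx])[:, 1]
        tn, fp, fn, tp = confusion_matrix(
            y[test_idx], pred, labels=[0, 1]
        ).ravel()
        rows.append([
            tp / (tp + fn),                    # recall (sensitivity)
            tn / (tn + fp),                    # specificity
            f1_score(y[test_idx], pred),       # F1 for the churn class
            roc_auc_score(y[test_idx], prob),  # AUC-ROC
        ])
    names = ["recall", "specificity", "f1", "auc_roc"]
    return dict(zip(names, np.mean(rows, axis=0).tolist()))


# Synthetic stand-in for the churn data (assumption): roughly 88.6%
# non-churn (class 0) and 11.4% churn (class 1).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.886, 0.114], class_sep=0.8, random_state=0,
)

weights = balanced_weights(y)
plain = evaluate(LogisticRegression(max_iter=1000), X, y)
weighted = evaluate(
    LogisticRegression(max_iter=1000, class_weight=weights), X, y
)

for name in plain:
    print(f"{name:12s} plain={plain[name]:.3f} weighted={weighted[name]:.3f}")
```

With an exact 88.6% / 11.4% split, this heuristic gives the churn class a weight of about 4.39 and the non-churn class about 0.56, so each misclassified churner costs roughly eight times as much during training. Comparing the two printed columns shows the kind of recall versus specificity trade-off the abstract reports, although the numbers from synthetic data say nothing about the study's actual results.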

Keywords


Churn; Imbalanced Data; Weighted; Supervised Machine Learning

  
  


Digital Object Identifier

https://doi.org/10.33096/busiti.v7i1.3118
  





Copyright (c) 2026 Buletin Sistem Informasi dan Teknologi Islam (BUSITI)

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.