Performance comparison of support vector machine (SVM) with linear kernel and polynomial kernel for multiclass sentiment analysis on Twitter

Sentiment analysis is a technique for extracting a person's perception, called sentiment, of an issue or event. This study uses sentiment analysis to classify public responses to the Covid-19 virus posted on Twitter into four classes: happy, sad, angry, and scared. The classifier is a support vector machine (SVM), and the study compares the classification performance of two kernel functions, linear and polynomial. The dataset consists of 400 tweets, 100 per sentiment class. Under k-fold cross-validation, the linear kernel reaches an accuracy of 0.28 with the unigram feature and 0.36 with the trigram feature. These figures are lower than those of the polynomial kernel, which reaches 0.34 and 0.48 for the unigram and trigram features respectively. In the confusion matrix test, the highest performance is obtained with the polynomial kernel, with an accuracy of 0.51, precision of 0.43, recall of 0.45, and f-measure of 0.51.

ILKOM Jurnal Ilmiah, Vol. 13, No. 2, August 2021, pp. 168-174, E-ISSN 2548-7779. Mukarramah et al.
Figure 2.a. Flowchart of classification model determination
Figure 2.b. Flowchart of classification process
Table 1. Tweet after Cleansing

| Tweet | Label |
|---|---|
| Mengikuti arus dunia yang menyedihkan Corona anjing | Angry |
| Orang orang sedang phobia virus Corona flu dan demam efek kehujanan kemarin aja di jauhin terus suruh periksa kedokter | Scared |

Table 2. Tweet after Case Folding

| Tweet | Label |
|---|---|
| mengikuti arus dunia yang menyedihkan corona anjing | Angry |
| orang orang sedang phobia virus corona flu dan demam efek kehujanan kemarin aja di jauhin terus suruh periksa kedokter | Scared |

Table 3. Tweet after Tokenizing

| Tweet | Label |
|---|---|
| ‘mengikuti’, ‘arus’, ‘dunia’, ‘yang’, ‘menyedihkan’, ‘corona’, ‘anjing’ | Angry |
| ‘orang’, ‘orang’, ‘sedang’, ‘phobia’, ‘virus’, ‘corona’, ‘flu’, ‘dan’, ‘demam’, ‘efek’, ‘kehujanan’, ‘kemarin’, ‘aja’, ‘di’, ‘jauhin’, ‘terus’, ‘suruh’, ‘periksa’, ‘kedokter’ | Scared |

Table 4. Tweet after Stemming

| Tweet | Label |
|---|---|
| ‘ikut’, ‘arus’, ‘dunia’, ‘yang’, ‘sedih’, ‘corona’, ‘anjing’ | Angry |
| ‘orang’, ‘orang’, ‘sedang’, ‘phobia’, ‘virus’, ‘corona’, ‘flu’, ‘dan’, ‘demam’, ‘efek’, ‘hujan’, ‘kemarin’, ‘aja’, ‘di’, ‘jauh’, ‘terus’, ‘suruh’, ‘periksa’, ‘dokter’ | Scared |

Table 5. Tweet after Stopword

| Tweet | Label |
|---|---|
| arus dunia sedih corona anjing | Angry |
| orang orang phobia virus corona flu demam efek hujan kemarin jauhin suruh periksa dokter | Scared |

Table 6. Feature Extraction Using Unigram and Trigram

| Unigram | Trigram | Label |
|---|---|---|
| arus dunia sedih corona anjing | arus dunia sedih; dunia sedih corona; sedih corona anjing | Angry |
| orang phobia virus corona flu demam efek hujan kemarin jauh suruh periksa dokter | orang phobia virus; phobia virus corona; virus corona flu; corona flu demam; flu demam efek; demam efek hujan; efek hujan kemarin; hujan kemarin [...] | Scared |

Table 7. TF-IDF Calculation

| Term | TF | IDF | TF-IDF |
|---|---|---|---|
| Korona | 0.142857 | 1.393043 | 0.199006 |
| Rasa | 0.142857 | 5.892852 | 0.841836 |
| Rebah | 0.142857 | 6.298317 | 0.899760 |
| Selasa | 0.142857 | 6.991465 | 0.998781 |
| Semenjak | 0.142857 | 4.912023 | 0.701718 |
| Senin | 0.142857 | 6.991465 | 0.998781 |

Table 8. Data Testing

| Tweet | Label |
|---|---|
| alhamdulillah, negative corona positif cantik.. | Happy |
| pensi sekolahku gagal tahun ini garagara corona | Sad |
| kenapa sih masih banyak orang yang gasuka pakai masker, gak takut kena corona yah | Angry |
| orang-orang takut ibunya kena korona jadi pada nggak berani bantuin | Scared |

Table 9. Prediction Results of Unigram Feature

| Tweet | Prediction (linear kernel) | Prediction (polynomial kernel) |
|---|---|---|
| alhamdulillah, negative corona positif cantik.. | Scared | Scared |
| pensi sekolahku gagal tahun ini garagara corona | Sad | Happy |
| kenapa sih masih banyak orang yang gasuka pakai masker, gak takut kena corona yah | Scared | Sad |
| orang-orang takut ibunya kena korona jadi pada nggak berani bantuin | Scared | Sad |

Table 10. Prediction Results of Trigram Feature

| Tweet | Prediction (linear kernel) | Prediction (polynomial kernel) |
|---|---|---|
| alhamdulillah, negative corona positif cantik.. | Scared | Happy |
| pensi sekolahku gagal tahun ini garagara corona | Scared | Sad |
| kenapa sih masih banyak orang yang gasuka pakai masker, gak takut kena corona yah | Happy | Sad |
| orang-orang takut ibunya kena korona jadi pada nggak berani bantuin | Happy | [...] |

[...] TF-IDF. The data will be used as the dataset for the sentiment classification process. The output of this process is a prediction of the sentiment expressed by Twitter users about the Covid-19 outbreak. In addition, the process produces performance values for each kernel by calculating accuracy, precision, recall, and f-measure. This research uses the Support Vector Machine (SVM) method with two kernel functions, linear and polynomial. Unigram and trigram features are extracted to predict the sentiment of tweets related to the Covid-19 outbreak across four classes: happy, sad, angry, and scared. The process also measures and compares the performance of the classification configurations used. The program flow of this research is shown in Figures 2.a and 2.b.

The dataset is obtained by crawling Twitter through the Twitter API; each tweet is then labeled with one of the four sentiment classes. After labeling, the next stage is pre-processing, which removes or normalizes content that is likely to distort the results, so that the data is easier for the system to process. Pre-processing consists of the following stages:
a) Cleansing removes URLs, mentions, usernames, RTs, hashtags, numbers, punctuation marks, and emoticons.
b) Case folding converts all letters to lowercase.
c) Tokenizing splits each sentence into individual words.
d) Stemming reduces each word to its base form.
e) Stopword removal discards words that carry no meaning and are not needed in the classification process.
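The pre-processing stages listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expressions and the tiny `STOPWORDS` set are hypothetical stand-ins (the source does not say which patterns or which Indonesian stopword list were used), and stemming is left out because it needs an Indonesian stemmer that the source does not name.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the source does not specify the list that was actually used.
STOPWORDS = {"yang", "dan", "di", "sedang", "aja", "terus"}

def cleanse(text):
    """Remove URLs, RT markers, mentions, hashtags, digits, punctuation, emoticons."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\bRT\b", " ", text)         # retweet marker
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # anything that is not a letter
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

def preprocess(text):
    """Cleansing -> case folding -> tokenizing -> stopword removal (no stemming)."""
    tokens = cleanse(text).lower().split()
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("RT @user Mengikuti arus dunia yang menyedihkan!! #Corona https://t.co/x 123")
```

A stemming step would sit between tokenizing and stopword removal if a stemmer for Indonesian is available.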
After pre-processing, unigram and trigram features are extracted. A unigram is a single word from the sentence (n=1), whereas a trigram is a sequence of three consecutive words (n=3) [14]. The next stage is TF-IDF, a statistical weighting technique that is applied in many information mining tasks [15]. It assigns each word a weight that signifies the importance of the word in the document.
After each word is weighted with TF-IDF, the data is stored as training data, which is then used to build the classification model that determines the sentiment of tweets using the Support Vector Machine (SVM) method. The SVM classification process is implemented in the Python programming language with the Scikit-Learn library.
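As a rough sketch of how the four configurations (two kernels, two n-gram sizes) could be set up with Scikit-Learn, the snippet below chains `TfidfVectorizer` and `SVC` in a pipeline. The four-tweet training set is a toy that reuses sample tweets from the tables, and all hyperparameters are Scikit-Learn defaults; the source does not report the polynomial degree, the regularization parameter, or the exact TF-IDF settings, so these are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data (one tweet per class), for illustration only.
texts = [
    "alhamdulillah negative corona positif cantik",
    "pensi sekolahku gagal tahun ini garagara corona",
    "arus dunia sedih corona anjing",
    "orang phobia virus corona flu demam efek hujan kemarin",
]
labels = ["happy", "sad", "angry", "scared"]

def build_model(kernel, n):
    """TF-IDF over word n-grams of size n, followed by an SVM with the given kernel."""
    vectorizer = TfidfVectorizer(ngram_range=(n, n))
    return make_pipeline(vectorizer, SVC(kernel=kernel))  # default hyperparameters (assumed)

# Fit one model per (kernel, n-gram size) combination.
models = {
    (kernel, n): build_model(kernel, n).fit(texts, labels)
    for kernel in ("linear", "poly")
    for n in (1, 3)
}
predictions = {key: list(model.predict(texts)) for key, model in models.items()}
```

With only one sample per class the predictions carry no meaning; the point is the structure of the comparison, not the scores.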

Results and Discussion
This section discusses the classification and testing results of the tweet classification model that has been built. After the Twitter data has been crawled and each tweet labeled according to the sentiment expressed, the next stage is pre-processing.

A. Pre-Processing
The first step is cleansing, which removes URLs, mentions, usernames, RTs, hashtags, numbers, punctuation marks, and emoticons, as shown in Table 1. The next step is case folding, which converts all letters to lowercase, as shown in Table 2. Next, tokenizing splits each sentence into words, as shown in Table 3. Stemming then reduces each word to its base form, as shown in Table 4. The final step is stopword removal, which discards words that carry no meaning, as shown in Table 5.

B. Feature Extraction
This study applies word-level n-gram extraction to the tweet data with n=1, called unigram, and n=3, called trigram. The feature extraction results are shown in Table 6.
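Word-level n-gram extraction amounts to sliding a window of n tokens over the sentence. A minimal sketch follows, using the first pre-processed tweet from Table 6 as input; the function name `ngrams` is chosen here for illustration and is not taken from the source.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["arus", "dunia", "sedih", "corona", "anjing"]
unigrams = ngrams(tokens, 1)
trigrams = ngrams(tokens, 3)
```

A tweet with fewer than three tokens yields no trigrams at all, which is one reason short tweets can end up with an empty trigram feature vector.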

C. TF-IDF
The extracted features are still qualitative data, so they must be converted into quantitative data using the TF-IDF calculation. Three documents are used to illustrate the calculation. The results of the TF-IDF calculation on the tweet data are shown in Table 7.
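The numbers in Table 7 appear to be consistent with a term frequency of 1/7 (a term occurring once in a seven-term document) and an inverse document frequency of ln(N/df) + 1 with N = 400 tweets. That formula and the document frequencies used below are inferred from the table values, not stated in the source, so the sketch should be read as one plausible reconstruction.

```python
import math

N_DOCS = 400  # dataset size stated in the paper

def tf_idf(term_count, doc_length, doc_freq, n_docs=N_DOCS):
    """Return (tf, idf, tf*idf) using idf = ln(N/df) + 1 (assumed variant)."""
    tf = term_count / doc_length
    idf = math.log(n_docs / doc_freq) + 1
    return tf, idf, tf * idf

# Hypothetical document frequencies, chosen because they reproduce Table 7:
# df=1 matches the rows for "Selasa" and "Senin", df=2 matches "Rebah".
tf_rare, idf_rare, weight_rare = tf_idf(term_count=1, doc_length=7, doc_freq=1)
tf_two, idf_two, weight_two = tf_idf(term_count=1, doc_length=7, doc_freq=2)
```

Under this reading, a term that appears in many tweets, such as "Korona", receives a much lower weight than a term that appears in only one.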

D. Classification Process
After labeling, pre-processing, feature extraction, and TF-IDF weighting, the next stage is classification using the SVM model. To check whether the model's predictions match the labels, tweets from the testing dataset are given to the model. These tweets are shown in Table 8, and the prediction results are shown in Table 9.

Table 9 shows the prediction results of the classification model using the unigram feature; some of the predictions do not match the labels. With the linear kernel, only the second tweet is predicted correctly. With the polynomial kernel, all tweets are predicted incorrectly.

Table 10 shows the prediction results of the classification model using the trigram feature; again, some of the predictions do not match the labels. With the linear kernel, no tweet is predicted correctly. With the polynomial kernel, the first and second tweets are predicted correctly, while the third and fourth are predicted incorrectly.

E. Testing
The tweet classification test measures the accuracy, precision, recall, and f-measure of the Support Vector Machine with two kernel functions, linear and polynomial, using unigram and trigram feature extraction. The data used to test the classification model consists of 400 Indonesian-language tweets on the topic of Covid-19, divided evenly into four classes: 100 tweets labeled happy, 100 labeled sad, 100 labeled angry, and 100 labeled scared. The performance of the classification model is tested with k-fold cross-validation and a confusion matrix.
The cross-validation test uses k=4. In each iteration, one fold is selected as testing data and the remaining folds become training data. Each tweet is used as testing data only once. The accuracy calculation is repeated until all iterations are complete.

Table 11 shows the results of evaluating the classification model with 4-fold cross-validation. In the first iteration, the highest accuracy is obtained with the polynomial kernel and trigram feature extraction. Similarly, the highest accuracies in the second, third, and fourth iterations were obtained with the polynomial kernel and trigram feature extraction, at 0.437, 0.450, and 0.512 respectively.
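The fold rotation described above can be sketched without any library. The helper `k_fold_indices` is a hypothetical name, and the sketch splits the data in its stored order; the source does not say whether the tweets were shuffled or stratified by class before splitting, which matters in practice when the data is sorted by label.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) so that each sample is tested exactly once."""
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra sample when k does not divide n.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

# 400 tweets and k=4, as in the paper: 300 training and 100 testing tweets per iteration.
folds = list(k_fold_indices(400, 4))
```

The accuracy of each configuration would be computed once per fold, giving four values per kernel and feature combination.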
The confusion matrix test visualizes the performance of the classification algorithm by comparing the actual values with the predicted values. The parameters measured in this test are accuracy, precision, recall, and f-measure.

2.a, 2.b, 2.c and 2.d show the output of the testing process using linear and polynomial kernel functions with unigram and trigram feature extraction. For both the linear and the polynomial kernel, the best performance was obtained with the trigram feature, with accuracy values of 0.55 and 0.50 respectively. Table 12 compares the Support Vector Machine performance with linear and polynomial kernels, obtained by averaging each performance measure for each kernel and feature used. Classification with the polynomial kernel and trigram features produces better test scores overall than classification with the linear kernel using either unigram or trigram features. The comparison of the performance results of the support vector machine using linear and polynomial kernels can also be seen in the following chart.
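To make the four measures concrete, the sketch below builds a confusion matrix from actual and predicted labels and derives accuracy, precision, recall, and f-measure from it. Per-class values are macro-averaged here, which is an assumption: the source does not say how the per-class scores were combined. The labels in the example are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

def macro_metrics(matrix):
    """Return (accuracy, precision, recall, f_measure), macro-averaged over classes."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    precisions, recalls, f_measures = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        predicted_i = sum(matrix[r][i] for r in range(n))  # column sum
        actual_i = sum(matrix[i])                          # row sum
        p = tp / predicted_i if predicted_i else 0.0
        r = tp / actual_i if actual_i else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f_measures.append(f)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n

# Made-up labels for a two-class illustration.
actual = ["happy", "happy", "sad", "sad"]
predicted = ["happy", "sad", "sad", "sad"]
matrix = confusion_matrix(actual, predicted, ["happy", "sad"])
accuracy, precision, recall, f_measure = macro_metrics(matrix)
```

The same two functions apply unchanged to the four sentiment classes; only the `classes` list grows.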

Conclusion
Based on the results and discussion, it is concluded that the best performance of the support vector machine method is obtained with the polynomial kernel function, with an accuracy of 0.512, precision of 0.437, recall of 0.45, and f-measure of 0.512. Therefore, the most appropriate kernel function and feature extraction for this study is the polynomial kernel with trigram features.