


A. Cross-Industry Standard Process for Data Mining (CRISP-DM)

This study adopted CRISP-DM to define the stages and activities carried out to accomplish its objectives. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a standard data mining process that is widely used for solving business and research problems. It has six phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment. Figure 2 below describes the study model workflow.

C.   Dataset Description

Because this research aims to classify public sentiment regarding the Covid-19 vaccine, the dataset description and the labelled sentiment classes are provided in Tables I and II, respectively.

Table I. Datasets Description

Dataset characteristics | Missing Value | Number of instances | Number of attributes

Sentiment Classes


Table III. Sample of the Raw Twitter Data Collected.

#DawnButlerBrent Those working on the COVID19 vaccine must work closely with @SickleCellUK and other related groups to ensure safety of the people.

#apbenven  Each Covid19 vaccine option has its own challenges  @It s new territory for pharmacies  especially when it comes to storingJ

ZenootUK of manufacturers surveyed by BDO predicted that their business would fully recover within a year of a vaccine for COVID

@AyannaPressley  Black Lives Matter also means #Any vaccine must have efficacy for those w high blood pressure  amp  diabetes    Priority

@JamesEKHildreth  FAQ to me regarding COVID    vaccine  what will it cost  Nothing  We  taxpayers  have already paid for them  Drug competition

@Reuters #Pfizer and BioNTech have applied to the European drugs regulator for conditional authorization of their COVID 19 vaccine. Good news.

@NBCNews After submitting their vaccine attempts for FDA emergency use authorization…Moderna and Pfizer are moving millions of doses.

#Public Citizen @Taxpayers paid for the entire development of Moderna’s COVID19 vaccine #We paid and It must be fine

D. Dataset Pre-processing

Social media data such as Twitter data are usually noisy: tweets contain emoticons, punctuation, URLs, and many other unwanted characters that can impair a model's ability to detect the vital information in a tweet. This section applies various pre-processing techniques to remove those unwanted characters, which do not contribute to sentiment classification. These techniques include URL removal, special character removal, hashtag removal, tokenisation, stopword removal, stemming, lemmatization, normalization of characters, and whitespace replacement for usernames.


1.  URL removal: Many Twitter users include URLs in their tweets to invite followers to click on a link. Removing such URLs is important for sentiment classification because they are noise in the tweet.

2.  Username removal: Almost every tweet contains a username such as “@Sanha4Real”, where the @ character indicates the person to whom the tweet refers. Such symbols are noise, and the username itself is of little use in detecting sentiment. Each username is therefore replaced with whitespace.

3.  Hashtag removal: Most tweets that contain a hashtag “#” use it to refer to a topic of discussion or to express an opinion. This character needs to be removed because it is noise in sentiment analysis.

4.  Negation handling: This step is very important in sentiment analysis because sentiment classification depends on sentiment polarity, which negative words affect. In tweets, “not” often appears in the contracted form “n’t”, for example “wouldn’t” for “would not” and “doesn’t” for “does not”. Such words need to be handled so that the negation is not lost.

5.  Normalization of characters: It is very common to come across tweets containing elongated words such as “feeeeeeeel”, “saaaaaaaad”, and “angryyyyyy”. These words need to be normalised towards their formal form because they may carry important sentiment polarity. To deal with the repeated characters, we replace any character that is repeated three or more times in a row with a single character.

6.  Removal of punctuation: Punctuation marks that appear in a tweet are not important for detecting sentiment. They are therefore removed from the dataset, as they are noise and could cause problems for the classifiers.

7.  Stopword removal: Some words in tweets, such as “is”, “a”, “this”, “that”, “and”, and “all”, do not contribute any meaningful information to sentiment detection. Such common words are removed.

8.  Stemming and lemmatization: We apply these techniques to transform words to their root form and to reduce the feature space. For example, “playing”, “plays”, and “played” are all transformed to the root “play”. Suffixes such as “ing”, “ed”, and “er” are thus stripped to leave the root word.

9.  Emoticon symbols: Symbols such as “☺” and “☹” are important in other research domains but not in text-based emotion detection. This study therefore treats them as noise, because it focuses only on the sentiment expressed in text.
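
The cleaning steps listed above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the exact pipeline used in this study: the function name `clean_tweet`, the short `STOPWORDS` set, the `NEGATIONS` mapping, and the regular expressions are all hypothetical choices, tokenisation is done by a plain whitespace split, and stemming and lemmatization are left out because they normally rely on an external library.

```python
import re
import string

# Hypothetical, deliberately short stopword list (assumption).
# "not" is kept out of it so that handled negations survive.
STOPWORDS = {"is", "a", "this", "that", "and", "all"}

# Assumed mapping for irregular contractions; regular "n't" forms
# are expanded by a regular expression below.
NEGATIONS = {"can't": "can not", "won't": "will not"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text):
    """Return a list of cleaned tokens for one raw tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URL removal
    text = re.sub(r"@\w+", " ", text)                   # username -> whitespace
    text = text.replace("#", " ")                       # remove the hashtag character
    text = text.replace("\u2019", "'")                  # curly apostrophe -> straight
    for short, full in NEGATIONS.items():               # irregular negations
        text = text.replace(short, full)
    text = re.sub(r"n't\b", " not", text)               # wouldn't -> would not
    text = re.sub(r"(.)\1{2,}", r"\1", text)            # 3+ repeats -> one character
    text = text.translate(PUNCT_TABLE)                  # punctuation removal
    tokens = text.split()                               # whitespace tokenisation
    return [tok for tok in tokens if tok not in STOPWORDS]


# Example with a made-up tweet and a placeholder URL.
print(clean_tweet("@Sanha4Real I wouldn't feeeeeeeel saaaaaaaad about this #vaccine!!! https://example.com/x"))
# ['i', 'would', 'not', 'fel', 'sad', 'about', 'vaccine']
```

Note that collapsing repeats to a single character turns “feeeeeeeel” into “fel” rather than “feel”, so a later step such as stemming or a dictionary lookup would still be needed to recover the formal word in such cases.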

E.   Feature Engineering

Count vectors and the term frequency-inverse document frequency (TF-IDF) technique were selected in this study to transform the Twitter text into features. A count vector is a common encoding scheme built on a vocabulary of words: it records the occurrences of each vocabulary word in a document. TF-IDF is a numerical statistic that shows how important a word is to a document in a corpus, where a corpus is a large, structured set of texts. TF-IDF is the product of TF and IDF, as shown in equation 1 below:


$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d) \qquad (1)$$


In equation 1, tf(t,d) is the term frequency, that is, the number of times the term t occurs in a document d, while idf(t,d) is the inverse document frequency, which can be calculated as in equation 2 below:


$$\text{idf}(t,d) = \log \frac{n_d}{1 + \text{df}(d,t)} \qquad (2)$$


In equation 2, $n_d$ is the total number of documents, while df(d,t) is the number of documents d that contain the term t. Adding the constant 1 to the denominator serves to assign a non-zero value to terms that appear in all training samples, which would otherwise receive idf = log 1 = 0. The logarithm prevents low document frequencies from being given too much weight. Figure 3.4 shows the pre-processing (data cleaning) procedure together with count vector and TF-IDF feature engineering.
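
To make equations 1 and 2 concrete, the sketch below builds count vectors for a toy corpus and then weights them with TF-IDF. It is an illustration only: the function names `count_vectors` and `tfidf_vectors` and the three toy documents are hypothetical, the natural logarithm is assumed because the text does not state the base, and raw counts are used as tf(t,d).

```python
import math
from collections import Counter


def count_vectors(docs):
    """Return the sorted vocabulary and one count vector per tokenised document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # occurrences of each token in this document
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Weight count vectors with tf-idf(t,d) = tf(t,d) * idf(t,d) (equation 1)."""
    vocab, counts = count_vectors(docs)
    n_d = len(docs)  # total number of documents
    # df(d,t): number of documents that contain term t
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Equation 2, with the natural logarithm as an assumption
    idf = [math.log(n_d / (1 + df_t)) for df_t in df]
    weights = [[tf * idf_t for tf, idf_t in zip(row, idf)] for row in counts]
    return vocab, weights


# Toy corpus of already cleaned and tokenised tweets (made up for illustration).
docs = [
    ["vaccine", "good", "good"],
    ["vaccine", "bad"],
    ["vaccine", "news"],
]
vocab, counts = count_vectors(docs)
_, weights = tfidf_vectors(docs)
print(vocab)      # ['bad', 'good', 'news', 'vaccine']
print(counts[0])  # [0, 2, 0, 1]
print(weights[0])
```

In this toy corpus, “vaccine” occurs in all three documents, so equation 2 gives it idf = log(3/4), a small non-zero (negative) value, whereas “good” occurs in one document and gets idf = log(3/2). Library implementations may use a differently smoothed idf, so their values need not match this sketch.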

