Methodology
A. Cross-Industry Standard Process for Data Mining (CRISP-DM)
This study adopted CRISP-DM to define the stages and activities carried out to accomplish its objectives. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely used standard data mining process for solving business and research problems through six phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment. Figure 2 below describes the study model workflow.
C. Dataset Description
As this research aims to classify public sentiment regarding the Covid-19 vaccine, the dataset description and the labelled sentiment classes are provided in Tables I and II, respectively.
Table I. Dataset Description
Properties | Description |
Dataset characteristics | Multivariate |
Missing Value | None |
Number of instances | 5,200 |
Number of attributes | 3 |
Table II. Labelled Sentiment Classes
S/No | Sentiment Classes |
1 | Positive |
2 | Neutral |
3 | Negative |
Table III. Sample of the Raw Twitter Data Collected.
#DawnButlerBrent Those working on the COVID19 vaccine must work closely with @SickleCellUK and other related groups to ensure safety of the people. |
#apbenven Each Covid19 vaccine option has its own challenges @It s new territory for pharmacies especially when it comes to storingJ |
ZenootUK of manufacturers surveyed by BDO predicted that their business would fully recover within a year of a vaccine for COVID |
@AyannaPressley Black Lives Matter also means #Any vaccine must have efficacy for those w high blood pressure amp diabetes Priority |
@JamesEKHildreth FAQ to me regarding COVID vaccine what will it cost Nothing We taxpayers have already paid for them Drug competition |
@Reuters #Pfizer and BioNTech have applied to the European drugs regulator for conditional authorization of their COVID 19 vaccine. Good news. |
@NBCNews After submitting their vaccine attempts for FDA emergency use authorization…Moderna and Pfizer are moving millions of doses. |
#Public Citizen @Taxpayers paid for the entire development of Moderna’s COVID19 vaccine #We paid and It must be fine |
D. Dataset Pre-processing
Social media data such as Twitter data is usually noisy: it contains emoticons, punctuation, URLs, and many other unwanted characters that can reduce a model's ability to detect the vital information in a tweet. This section applies various pre-processing techniques to remove the unwanted characters that do not contribute to sentiment classification. These techniques include URL removal, special character removal, hashtag removal, tokenisation, stopword removal, stemming, lemmatization, normalization of characters, and whitespace replacement for usernames.
1. URL removal: many Twitter users include URLs in their tweets to encourage followers to click on a link, for example https://www.who.int/. Removing such URLs is important for sentiment classification because they are noise in the tweet.
2. Username removal: almost every tweet contains a username such as “@Sanha4Real”; the @ character indicates the person to whom the tweet refers. Such symbols are noise, and the username itself is not important in detecting sentiment. In this case, the username is replaced with whitespace.
3. Hashtag removal: most tweets that contain a hashtag “#” are referring to a topic of discussion or expressing an opinion. This character needs to be removed because it is noise in sentiment analysis.
4. Negation handling: this is very important in sentiment analysis because sentiment classification depends on sentiment polarity, which negative words can change. In tweets, “not” often appears in the contracted form “n’t”, for example “would not” as “wouldn’t” and “does not” as “doesn’t”. Such words need to be handled so that the negation is not lost during pre-processing.
5. Normalization of characters: it is very common to come across tweets containing words with repeated characters, such as “feeeeeeeel”, “saaaaaaaad”, and “angryyyyyy”. These words need to be normalised into their formal form because they might carry important sentiment polarity. To deal with the repeated characters, we replace any character repeated three or more times in a row with a single character.
6. Removal of punctuation: punctuation that appears in a tweet is not important in detecting sentiment. Therefore, it is vital to remove it from the dataset, as it is noise and could cause problems for the classifiers.
7. Stopword removal: some words in tweets, such as “is”, “a”, “this”, “that”, “and”, and “all”, do not contribute any meaningful information to sentiment detection. Such common words need to be removed.
8. Stemming and lemmatization: we apply these pre-processing techniques to transform words to their root form and to reduce the feature space; for example, “playing”, “plays”, and “played” would be transformed to the root “play”. Suffixes such as “ing”, “ed”, and “er” are therefore stripped to leave the root word.
9. Emoticon symbols: symbols such as “J” and “L” are important in other research domains but not in text emotion detection. Therefore, this study treats them as noise because it focuses only on the sentiment expressed in text.
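The steps listed above can be sketched as a single cleaning function. This is a minimal illustration, not the study's actual implementation. The function names `preprocess` and `strip_suffix`, the regular expressions, the `NEGATIONS` map, and the order of the steps are all assumptions. The stopword list contains only the examples named in the text, and the toy suffix stripper stands in for whichever stemmer or lemmatizer was really used.

```python
import re
import string

# Stopwords limited to the examples given in the text (assumption: the real list is longer).
STOPWORDS = {"is", "a", "this", "that", "and", "all"}

# Hypothetical map for irregular contractions; regular "n't" forms are handled by a regex.
NEGATIONS = {"won't": "will not", "can't": "can not", "shan't": "shall not"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_suffix(token):
    """Toy suffix stripper standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Clean one raw tweet and return a list of normalised tokens."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # 1. URL removal
    text = re.sub(r"@\w+", " ", text)                    # 2. username -> whitespace
    text = text.replace("#", " ")                        # 3. hashtag symbol removal
    text = text.replace("’", "'")                        # unify apostrophes first
    for short, full in NEGATIONS.items():                # 4. negation handling
        text = text.replace(short, full)
    text = re.sub(r"n't\b", " not", text)                #    wouldn't -> would not
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)         # 5. 3+ repeated letters -> one
    text = text.translate(PUNCT_TABLE)                   # 6. punctuation removal
    tokens = text.split()                                # tokenisation on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]   # 7. stopword removal
    return [strip_suffix(t) for t in tokens]             # 8. crude stemming


print(preprocess("@Sanha4Real I wouldn't feel saaaaaaaad about this vaccine https://www.who.int/ #Covid19"))
# ['i', 'would', 'not', 'feel', 'sad', 'about', 'vaccine', 'covid19']
```

Two design points are worth noting. Negation handling runs before punctuation removal so that the apostrophe in “n’t” is still present when contractions are expanded, and “not” is deliberately kept out of the stopword list. Also, collapsing a repeated run to a single character, as the text describes, turns “feeeeeeeel” into “fel”; a variant that collapses to two characters would keep “feel” but would turn “saaaaaaaad” into “saad”.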
E. Feature Engineering
Count vectors and the term frequency-inverse document frequency (TF-IDF) technique were selected in this study to transform the Twitter text into features. A count vector, also referred to as a vocabulary of words, is a common scheme that encodes each word by its count in a document, while TF-IDF is a numerical statistic that shows how important a word is to a document in a corpus. A corpus is a large, structured set of texts. TF-IDF is the product of TF and IDF, as shown in equation 1 below:
$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d) \tag{1}$$
In equation 1, tf(t,d) represents the term frequency, that is, the number of times the term t occurs in a document d, while idf(t,d) denotes the inverse document frequency, which is calculated as in equation 2 below:
$$\text{idf}(t,d) = \log \frac{n_d}{1 + \text{df}(d,t)} \tag{2}$$
In equation 2, n_d represents the total number of documents, while df(d,t) represents the number of documents d that contain the term t. The constant 1 added to the denominator prevents division by zero for terms that appear in none of the training samples. The log is used to ensure that low document frequencies are not given too much weight. Figure 3.4 shows the procedure of pre-processing (data cleaning) and feature engineering using count vectors and TF-IDF.
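As a rough illustration of equations 1 and 2, the sketch below builds count vectors and TF-IDF weights for a toy corpus using only the Python standard library. The function names (`count_vectors`, `idf`, `tf_idf`), the natural logarithm, and the three tokenised example documents are assumptions made for this example; they are not taken from the study's data or code.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    vocab = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def idf(term, docs):
    """Equation 2: log(n_d / (1 + df)), assuming the natural logarithm."""
    n_d = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_d / (1 + df))


def tf_idf(docs):
    """Equation 1: multiply each raw term count by the term's idf."""
    vocab, counts = count_vectors(docs)
    idfs = [idf(term, docs) for term in vocab]
    weights = [[tf * w for tf, w in zip(row, idfs)] for row in counts]
    return vocab, weights


# Toy corpus of already pre-processed tweets (illustrative only).
docs = [
    ["vaccine", "safe", "vaccine"],
    ["vaccine", "cost"],
    ["good", "news"],
]

vocab, weights = tf_idf(docs)
print(vocab)  # ['cost', 'good', 'news', 'safe', 'vaccine']
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy corpus “vaccine” occurs in two of the three documents, so its idf is log(3/3) = 0 and its TF-IDF weight is 0 in every document, while rarer terms such as “safe” receive log(3/2). Under this form of equation 2, a term present in every document would get a slightly negative weight. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant by default, so their values will differ from this sketch.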