Introduction

This project aims to build supervised machine learning models to identify human sentiment on Twitter. A pipeline consisting of pre-processing, feature extraction, model training, prediction and evaluation is implemented.

Dataset

The main dataset comes from the full training dataset of \cite{rosenthal2017semeval}. Tweets in the train set and the evaluation set are labeled with polarity (negative / neutral / positive). The major task is to develop and optimize classification models based on train / evaluation data and predict the labels of the test data.

The train set and the evaluation set contain 22987 and 4926 tweets respectively. The distribution of polarity is similar across the two sets: nearly half of the tweets are neutral, around 30% are positive and 21-22% are negative.

In the csv files, the frequencies of 45 tokens are selected as the features of tweets (hereinafter referred to as Golden45).

Related Work

It is common practice to remove stop-words during pre-processing; however, \citep{saif2012semantic} reports that removing stop-words is unnecessary and leads to less discriminative features. \citep{go2009twitter} reduces the feature space by removing usernames and links. \citep{pang2002thumbs} shows that unigram features outperform bigrams on sentiment classification of movie reviews. \citep{kouloumpis2011twitter} leverages Twitter hashtags to identify the sentiment polarity of tweets. \citep{taboada2011lexicon} utilizes a dictionary of words with associated polarity and incorporates negation and intensification to identify sentiment.

Pre-Processing

URLs are links to external resources and are presumed to be uninformative. URLs can be detected and deleted by regex https?:\/\/\S+.

In Twitter, @ is used to mention a user. User names can be identified by regex @\S+ and thus removed.

Then, tweets are tokenized into sequences of unigram alphanumeric tokens.
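The three pre-processing steps above can be sketched with Python's re module. This is a minimal illustration, not the project's actual code: the function name preprocess, the lowercasing step and the sample tweet are assumptions, while the two removal patterns are the ones quoted above.

```python
import re

URL_RE = re.compile(r"https?://\S+")     # URL pattern quoted in the text
MENTION_RE = re.compile(r"@\S+")         # user-mention pattern quoted in the text
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")   # runs of alphanumeric characters

def preprocess(tweet):
    """Strip URLs and user mentions, then split into alphanumeric unigrams."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    # Lowercasing is an assumption; the report does not state a case policy.
    return TOKEN_RE.findall(text.lower())

tokens = preprocess("@alice I love this phone! http://t.co/abc123 #happy")
print(tokens)
```

Because only alphanumeric runs are kept, the hashtag marker and punctuation are dropped while the hashtag word itself survives as a token.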

Feature Extraction

Two vectorizers are utilized to extract features: Term Count and TF-IDF (Term Frequency–Inverse Document Frequency). Term Count simply counts the occurrences of tokens. TF (term frequency) normalizes Term Count by dividing it by the number of tokens in a document, and IDF (inverse document frequency) gives less weight to a token that appears in many documents.
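As a rough illustration of the two vectorizers, the sketch below computes both representations against a fixed dictionary. The IDF form log(N / df) is the textbook variant and is an assumption here; libraries often apply smoothing, and the report does not say which variant was used. The toy documents and the names term_count and tf_idf are hypothetical.

```python
import math
from collections import Counter

def term_count(tokens, vocabulary):
    """Count how often each dictionary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

def tf_idf(docs, vocabulary):
    """TF = count / document length; IDF = log(N / document frequency)."""
    n_docs = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocabulary}
    # A word that never occurs gets weight 0 (assumption, avoids division by zero).
    idf = {w: math.log(n_docs / df[w]) if df[w] else 0.0 for w in vocabulary}
    vectors = []
    for d in docs:
        counts = Counter(d)
        length = len(d) or 1
        vectors.append([counts[w] / length * idf[w] for w in vocabulary])
    return vectors

docs = [["good", "phone"], ["bad", "phone"], ["good", "good", "day"]]
vocab = ["good", "bad", "phone"]
counts = [term_count(d, vocab) for d in docs]
weights = tf_idf(docs, vocab)
```

In the second toy document, "bad" receives a larger weight than "phone" because it appears in fewer documents.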

Term Count and TF-IDF are computed against the a-priori dictionary (Golden45). Additionally, features with higher dimensions are also extracted. Words across the corpus are sorted by their term count or TF-IDF weight and only the top $m$ words will be kept in the dictionary. Here, features with $m$ = 1k, 2k, 5k, 10k, 15k, 20k and 25k dimensions are prepared for further processing.
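Building the top-$m$ dictionary can be sketched as follows, using total term count as the ranking criterion. The function name top_m_vocabulary and the toy corpus are illustrative assumptions; ranking by TF-IDF weight would follow the same pattern with a different score.

```python
from collections import Counter

def top_m_vocabulary(docs, m):
    """Keep the m words with the highest total count across the corpus."""
    totals = Counter(token for doc in docs for token in doc)
    # most_common sorts by count, highest first.
    return [word for word, _ in totals.most_common(m)]

corpus = [["good", "phone"], ["bad", "phone"], ["good", "good", "day"]]
vocabulary = top_m_vocabulary(corpus, 2)
print(vocabulary)
```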

The two vectorizers are contrasted in the analysis sections.

Classifiers

Multiple classifiers are employed in this project.

• Gaussian Naïve Bayes
• Random Forest
• Decision Tree
• K Neighbors (5 neighbors)
• Support Vector Machine
• Dummy Classifier
• Logistic Regression (LIBLINEAR)

A dummy classifier simply labels every tweet as neutral, given that nearly half of the tweets in the dataset are neutral (Figure sentiment_distribution).

Logistic Regression models the relationship between a binary dependent variable and independent variables. LR can be used as a binary classifier by choosing a cut-off value (a hyperplane in high dimensional space). LIBLINEAR extends logistic regression for multi-class problems by implementing the one-vs-the-rest strategy \cite{fan2008liblinear}.
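The one-vs-the-rest strategy can be illustrated with a small sketch: one binary logistic model per label scores the input, and the label with the highest score is predicted. The weights, biases and two-feature input below are made-up values for illustration only, not parameters learned in this project.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_rest(x, models):
    """Score x with one binary logistic model per label, return the best label."""
    scores = {}
    for label, (weights, bias) in models.items():
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        scores[label] = sigmoid(z)
    return max(scores, key=scores.get), scores

# Hypothetical models over two features: [count of "good", count of "bad"].
models = {
    "negative": ([-1.0, 2.0], -1.0),
    "neutral": ([-0.5, -0.5], 0.5),
    "positive": ([2.0, -1.0], -1.0),
}

label, scores = predict_one_vs_rest([1, 0], models)
print(label, scores)
```

With these illustrative weights, a tweet containing neither word falls back to neutral, since only the neutral model has a positive bias.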

Evaluation with 45 Dimensions

Methods

Accuracy is used as an aggregate metric to compare different classifiers. Precision and recall are used to drill down into the performance of a classifier.

The dummy classifier has an accuracy of 48.7%. However, its recall on the positive and negative classes is 0%. Therefore, this dummy classifier is strongly biased even though its overall accuracy is moderate.
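The gap between accuracy and per-class recall can be reproduced on a toy example. The labels below are invented for illustration and are not the project's data; treating precision as 0 for a class that is never predicted is also an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label (0.0 when the denominator is zero)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy labels: half neutral, the rest split between positive and negative.
y_true = ["neutral"] * 5 + ["positive"] * 3 + ["negative"] * 2
y_pred = ["neutral"] * len(y_true)   # dummy classifier: always neutral

print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, "neutral"))
print(precision_recall(y_true, y_pred, "positive"))
```

The always-neutral predictor is right half the time on this toy set, yet it never recovers a single positive or negative tweet.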

Overall Result

Classifiers are evaluated on the Golden45 features extracted from the evaluation set. Figure accuracy_45 shows the accuracy of different combinations of vectorizers and classifiers. The two vectorizers perform similarly. Logistic Regression has the highest accuracy (56%) by a slight margin, while K Neighbors is the least accurate (48%).

SVM vs K Neighbors

SVM, with 55% accuracy, is second only to Logistic Regression. Figures metrics_Count_SVC_45 and metrics_Count_KNeighbors_45 show the metrics of the SVM classifier and the K Neighbors classifier. Both classifiers have similar precision on the neutral class (around 53%). However, SVM has much higher precision on the negative class (54% vs 27%) and the positive class (75% vs 53%).

The confusion matrix (Figure confusion_Count_KNeighbors_45) clearly shows that KNN classifies a host of negative / positive tweets as neutral.

Naïve Bayes Drill-down

The Gaussian NB classifier has an accuracy of around 54%. Figure confusion_Count_GaussianNB_45 shows that 715 (69%) negative tweets and 818 (56%) positive tweets are predicted as neutral.

The Naïve Bayes decision rule is $$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$$ where $P(y)$ is the relative frequency of class $y$ in the training set. The smaller the dictionary is, the more the Naïve Bayes classifier relies on $P(y)$. If the a-priori dictionary contains only one word (i.e. single dimensional features are extracted), the Naïve Bayes classifier will approximate the dummy classifier mentioned before, which always returns neutral.
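The influence of the prior can be seen in a small numeric sketch of this decision rule. The priors are rounded from the class proportions reported earlier, and the likelihood values are invented for illustration; neither comes from the trained Gaussian NB model.

```python
import math

def naive_bayes_predict(priors, likelihoods):
    """Return argmax_y of log P(y) + sum_i log P(x_i | y)."""
    scores = {
        y: math.log(priors[y]) + sum(math.log(p) for p in likelihoods[y])
        for y in priors
    }
    return max(scores, key=scores.get)

# Approximate class proportions from the dataset description (rounded).
priors = {"negative": 0.21, "neutral": 0.49, "positive": 0.30}

# One feature that is equally likely under every class: the prior decides.
uninformative = {"negative": [0.1], "neutral": [0.1], "positive": [0.1]}

# One feature that only weakly favours positive: the prior still wins.
weak = {"negative": [0.05], "neutral": [0.10], "positive": [0.12]}

# One feature that strongly favours positive: the evidence overrides the prior.
strong = {"negative": [0.05], "neutral": [0.10], "positive": [0.40]}

print(naive_bayes_predict(priors, uninformative))
print(naive_bayes_predict(priors, weak))
print(naive_bayes_predict(priors, strong))
```

Only the third case overturns the neutral prior, which matches the observation that a small dictionary pushes many positive and negative tweets towards neutral.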

The accuracy can potentially be improved by adding more representative words to the a-priori dictionary. However, dictionary enrichment is not included in the scope of this project.

Evaluation with High Dimensions

Features with higher dimensions ($m$ = 1k, 2k, 5k, 10k, 15k, 20k and 25k) are used to train and evaluate Logistic Regression classifiers. As shown in Figure accuracy_comparison, accuracy peaks at 65.6% when features with 15k dimensions are extracted. In addition, the TF-IDF vectorizer works slightly better than Term Count.

Figure metrics_Tfidf_LogisticRegression_15k shows that precision is relatively balanced among the three labels. The recall on the negative and positive classes is also better than that of the other classifiers.

Naïve Bayes and Random Forest

In Figure accuracy_45_vs_15k, when the number of dimensions increases from 45 to 15k, the accuracy of Naïve Bayes drops dramatically. The curse of dimensionality is a potential explanation for the fall.

Logistic Regression models $P(Y|X)$ while Naïve Bayes estimates $P(Y)$ and $P(X|Y)$. When the assumptions about independence between the features hold, LR and NB converge towards identical classifiers \cite{MitchellMLChapter3}. However, tokens in tweets are naturally correlated. In this case, LR performs more accurately than NB if sufficient training examples are available \cite{ng2002discriminative}.

The performance of Random Forest improves as the number of dimensions increases. Random Forest is more resilient to dimensionality than Naïve Bayes because individual trees in the forest use a subset of the features.

Discussion

TF-IDF features are extracted from tweets against a dictionary of 15k unigram tokens. The Logistic Regression classifier achieves an accuracy of 65.6%, and precision is balanced across the three labels. These results suggest that it is feasible to use tweet text to identify people's sentiment on Twitter.

Limitation

The current design only handles unigram tokens. For example, "not happy" is not interpreted as a whole, so negation cannot be handled correctly, not to mention complex expressions like "better than nothing". In addition, emoticons and emoji are ignored during feature extraction, although a smiling or crying face conveys strong feeling.

Conclusion

A pipeline consisting of pre-processing, feature extraction, model training, prediction and evaluation is implemented. A Logistic Regression classifier is chosen as the final implementation, achieving an accuracy of 65.6%. High dimensional features outperform the 45 selected features. The reason why logistic regression copes well with high dimensional data is also discussed.

Peer Review

Briefly summarise what the author has done

The author analysed a range of classifiers, contrasting their accuracy on feature sets before and after the TF-IDF transformation, and contrasting the 45 selected features with a larger feature set constructed by the author. The final conclusion was that larger feature sets outperformed smaller feature sets.

Indicate what you think that the author has done well, and why

The depiction of the pipeline was a very nice touch to explain the methodology of the study. The method and dataset descriptions are concise and clear, and the literature review was closely linked to the ideas explored in the report. Good job. The contrast between count and TF-IDF is a nice and useful comparison. The drill-downs of methods were detailed but lacked a link back to the specific context.

Indicate what you think could have been improved, and why

Instead of scatter plots, using lines in different colours might help differentiate the series better. The author could have explored the differences in classifier performance further and linked them back to the real-world context of people's sentiment. More focus should be placed on whether machine learning can help identify people's sentiment, in which cases it fails and on which sorts of tweets it works well.

References

@inproceedings{rosenthal2017semeval,
author={Rosenthal, Sara and Farra, Noura and Nakov, Preslav},
booktitle={Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017)},
pages={502--518},
year={2017}
}

@article{fan2008liblinear,
title={LIBLINEAR: A library for large linear classification},
author={Fan, Rong-En and Chang, Kai-Wei and Hsieh, Cho-Jui and Wang, Xiang-Rui and Lin, Chih-Jen},
journal={Journal of machine learning research},
volume={9},
number={Aug},
pages={1871--1874},
year={2008}
}

@inproceedings{ng2002discriminative,
title={On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes},
author={Ng, Andrew Y and Jordan, Michael I},
booktitle={Advances in neural information processing systems},
pages={841--848},
year={2002}
}

@inbook{MitchellMLChapter3,
title={Machine Learning},
chapter={Generative and Discriminative classifiers: Naive Bayes and Logistic Regression},
author={Mitchell, Tom Michael},
year={2017}
}

@inproceedings{saif2012semantic,
author={Saif, Hassan and He, Yulan and Alani, Harith},
booktitle={International semantic web conference},
pages={508--524},
year={2012},
organization={Springer}
}

@article{go2009twitter,
title={Twitter sentiment classification using distant supervision},
author={Go, Alec and Bhayani, Richa and Huang, Lei},
journal={CS224N Project Report, Stanford},
volume={1},
number={12},
pages={2009},
year={2009}
}

@inproceedings{kouloumpis2011twitter,
author={Kouloumpis, Efthymios and Wilson, Theresa and Moore, Johanna},
booktitle={Fifth International AAAI conference on weblogs and social media},
year={2011}
}

@inproceedings{pang2002thumbs,
title={Thumbs up?: sentiment classification using machine learning techniques},
author={Pang, Bo and Lee, Lillian and Vaithyanathan, Shivakumar},
booktitle={Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10},
pages={79--86},
year={2002},
organization={Association for Computational Linguistics}
}